scientific data management
TRANSCRIPT
![Page 1: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/1.jpg)
Data management
Responsible Conduct of Research Seminar Series
UC Berkeley
April 16, 2012
![Page 2: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/2.jpg)
Who are you?
Jeffery Loo, PhD
“Flying books”
Installation by J. Ignacio Diaz de Rabago
UC Berkeley Library
![Page 3: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/3.jpg)
NSF data management plan
Requirement as of January 18, 2011
Your plans to organize, store, and share data
http://www.nsf.gov/bfa/dias/policy/dmp.jsp
![Page 4: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/4.jpg)
“My Data Management Plan – a satire”
Dr. C. Titus Brown
Assistant Professor
Michigan State University
Source
![Page 5: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/5.jpg)
Dear NSF,
I am happy to respond to your request for a 2-page Data Management Plan.
First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993, […] I have seen the need for a systematic plan to manage the data. It is nice to see NSF stepping up to the plate in such a timely manner, and I am happy to comply.
Now, as to my actual data management plan, here is how I plan to deal with research data in the future.
I will store all data on at least one, and possibly up to 50, hard drives in my lab.
The directory structure will be custom, not self-explanatory, and in no way documented or described. Students working with the data will be encouraged to make their own copies and modify them as they please, in order to ensure that no one can ever figure out what the actual real raw data is.
![Page 6: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/6.jpg)
Backups will rarely, if ever, be done.
When required to make the data available by my program manager, my collaborators, and ultimately by law, I will grudgingly do so by placing the raw data on an FTP site, named with UUIDs like 4e283d36-61c4-11df-9a26-edddf420622d. I will under no circumstances make any attempt to provide analysis source code, documentation for formats, or any metadata with the raw data. When requested (and ONLY when requested), I will provide an Excel spreadsheet linking the names to data sets with published results. This spreadsheet will likely be wrong -- but since no one will be able to analyze the data, that won't matter.
[….]
Note, we didn't use a version control system, either. […] And our repository is not publicly available - you have to beg for permission. Note, I only answer e-mail on every other Tuesday.
![Page 7: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/7.jpg)
Any design notes on the data analysis are in our private e-mail, and we will fight to the death -- up to and including ignoring FOIA requests -- to prevent you from obtaining them.
Meanwhile we will continue publishing exciting sounding (but irreproducible) analyses, and submitting grants based on them, because that's the only thing that the reviewers care about.
sincerely yours,
--titus
(representing every computational scientist in the world.)
![Page 8: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/8.jpg)
Data challenges
Distributed, uncoordinated effort
Concerns about data re-use
Data management may be ad lib
“Can’t you ever relax?”
Informal data management practices
![Page 9: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/9.jpg)
Lots to do!
Ensure long-term access
Facilitate sharing
Prepare for future re-use
![Page 10: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/10.jpg)
Data activities in the research workflow
Source: http://www2.lib.virginia.edu/brown/data/lifecycle.html
![Page 11: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/11.jpg)
Lots of different research products
Models and computational simulations
Images, photographs, audio, and video
Instrument readings
Maps
Software
Artifacts and samples
Physical collections
And more …
![Page 12: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/12.jpg)
Goal for this lunch hour
Review “first steps” in data management
• Saving data
• Describing/documenting data
• Sharing data
• Data management planning
• Data ethics
![Page 13: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/13.jpg)
Common sense versus common practice
![Page 14: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/14.jpg)
Saving data
![Page 15: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/15.jpg)
![Page 16: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/16.jpg)
Hall of fame anecdote
http://www.youtube.com/watch?v=J6HtRWyiL98
![Page 17: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/17.jpg)
Where do you store data safely?
Traditional storage not always sufficient
• Personal computers
• Departmental/university servers
Two additional types of storage
• Archives and repositories
• Cloud storage (storing files in an online site)
![Page 18: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/18.jpg)
Archives and repositories
Special types of online storage sites
Long-term storage, management, and preservation
Search, download, and analytic functionalities
![Page 19: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/19.jpg)
Institutional archives and repositories
Merritt: http://merritt.cdlib.org/
Data repository management services at UCB: http://ist.berkeley.edu/ds
![Page 20: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/20.jpg)
Public archive and repository
Long-term access, open to the public
GenBank: http://www.ncbi.nlm.nih.gov/genbank/
![Page 21: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/21.jpg)
3rd party cloud storage
Amazon S3
Google Docs
Dropbox
Beware of posting sensitive data/files
![Page 22: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/22.jpg)
Deciding on storage
Consider:
• Permanence
• Oversight
• Security
![Page 23: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/23.jpg)
Save for long-term access
Recommended file formats
• Non-proprietary
• Uncompressed and unencrypted (okay to encrypt sensitive data)
• Common usage by your research community
• Standard representation (e.g., ASCII text, Unicode)
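As one way to follow these recommendations, tabular readings can be written as plain, uncompressed CSV in UTF-8. This is a minimal sketch, not a prescribed method: the function name `save_as_plain_csv` and the column names in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def save_as_plain_csv(rows, header, path):
    """Write rows to an uncompressed, UTF-8 encoded CSV file (non-proprietary format)."""
    path = Path(path)
    # newline="" lets the csv module control line endings itself.
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)   # column names first
        writer.writerows(rows)    # then one line per record
    return path
```

For example, `save_as_plain_csv([[1, 0.5], [2, 0.7]], ["trial", "reading"], "readings.csv")` produces a file that any text editor or statistics package can open.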
![Page 24: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/24.jpg)
Back up 3 copies
1. Original master
2. Local external storage
3. Remote external storage
UC Berkeley IST backup services
3rd party services (Amazon S3, Elephant Drive, Jungle Disk, Mozy, Carbonite Free, Dropbox)
Email a copy to yourself
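The three-copy idea can be sketched in a few lines of Python. This is an illustration only: it assumes the local and remote storage both appear as ordinary folders (for example a mounted external drive and a synced cloud folder), and the names `sha256_of` and `back_up` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(master, destinations):
    """Copy the master file into each destination folder and verify each copy."""
    master = Path(master)
    expected = sha256_of(master)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(master, folder / master.name))
        # A backup that differs from the master is not a backup.
        if sha256_of(copy) != expected:
            raise IOError(f"Backup at {copy} does not match the master")
        copies.append(copy)
    return copies
```

Calling `back_up("readings.csv", ["/path/to/local-drive", "/path/to/remote-folder"])` with two placeholder folders leaves the master plus two verified copies.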
![Page 25: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/25.jpg)
Describing and documenting data
(metadata)
![Page 26: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/26.jpg)
What countries have a five-pointed star on their national flag?
![Page 27: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/27.jpg)
DOI: 10.1126/science.1207745
“outsourcing” our memory
“we don’t remember information as well, when we expect to find it on a computer later”
![Page 28: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/28.jpg)
If we outsource our memory to computers …
We need good organization structures to
• Find data from the past quickly and completely
• Understand data from the past
It helps to
• Document and describe data
• “Assign metadata”
![Page 29: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/29.jpg)
What do you document?
Descriptive metadata elements
Administrative metadata elements
Structural metadata elements
Title
Creator or contact
Date
Experimental conditions
Methodology
Version
Dictionary or codebook to explain the data variables
Tools and software needed for processing or visualizing the data
File formats
File names
![Page 30: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/30.jpg)
How to record metadata
write metadata
save as readme.txt
store in file folder with data
Option 1
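Option 1 could look like the following sketch, which writes one "field: value" line per metadata element into a readme.txt next to the data. The function name `write_readme` and the line layout are assumptions for illustration; the slides do not prescribe a readme format.

```python
from pathlib import Path


def write_readme(data_folder, metadata):
    """Save metadata as readme.txt in the same folder as the data."""
    folder = Path(data_folder)
    folder.mkdir(parents=True, exist_ok=True)
    # One "field: value" line per metadata element, in the order given.
    lines = [f"{field}: {value}" for field, value in metadata.items()]
    readme = folder / "readme.txt"
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

A call such as `write_readme("project-data", {"Title": "Effect of salt on ice cream production efficiency", "Version": "002"})` keeps the description in the file folder with the data it describes.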
![Page 31: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/31.jpg)
Metadata form/file in an archive/repository
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html
Option 2
![Page 32: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/32.jpg)
Annotate
<title>Effect of salt on ice cream production efficiency</title> <temperature>0</temperature>
XML, a popular system for annotating data
http://www.w3schools.com/xml/
Option 3
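Annotations like the one above can be written and read back with Python's standard library. This is a sketch under one assumption: a well-formed XML document needs a single root element, so the `<record>` wrapper here is hypothetical and not part of the slide's example.

```python
import xml.etree.ElementTree as ET


def annotate(title, temperature):
    """Build a small XML record; the <record> wrapper element is an assumption."""
    record = ET.Element("record")
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "temperature").text = str(temperature)
    return ET.tostring(record, encoding="unicode")


def read_annotation(xml_text):
    """Return the annotations as a dictionary of tag name to text."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}
```

Because the tags name each value, `read_annotation` can recover the title and temperature without knowing anything else about the file.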
![Page 33: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/33.jpg)
Assign descriptive names
Descriptive file names
Descriptive folder names
Consider these elements:
• project title
• experimental conditions and group
• trial numbers
• file version number indicating data modifications
• date or time stamps
• author initials
data1.csv 75-celsius-trial_control_ver002.csv
Data > 1 > raw >> part A >> 110904 > readings
Project-title > Trial 1 >> Experimental >> Control > Trial 2 > Trial 3
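A small helper can keep file names consistent across a project. The sketch below follows the pattern of the example file name above; the function name, the underscore separator for the optional date and initials, and the argument order are assumptions.

```python
def descriptive_file_name(conditions, group, version, date=None, initials=None, extension="csv"):
    """Join naming elements into one descriptive file name."""
    # Zero-padded version numbers sort correctly in a file listing.
    parts = [conditions, group, f"ver{version:03d}"]
    if date:
        parts.append(date)
    if initials:
        parts.append(initials)
    return "_".join(parts) + "." + extension
```

`descriptive_file_name("75-celsius-trial", "control", 2)` returns `75-celsius-trial_control_ver002.csv`, which says far more about its contents than `data1.csv`.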
![Page 34: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/34.jpg)
Australia
Brazil
Cape Verde
Ethiopia
United States of America
![Page 35: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/35.jpg)
Sharing data
![Page 36: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/36.jpg)
Historic data sharing
Anagrams to secure discoveries
Versus the “open science revolution” of journals today
Galileo Newton Huygens Hooke
![Page 37: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/37.jpg)
Open science
Share research data, products, and communications openly
Potential benefits
• Protects unique data that cannot be readily replicated
• Reinforces open scientific inquiry
• Encourages diversity of analysis and opinion
• Promotes new lines of research
• Makes possible the testing of new or alternative hypotheses and methods of analysis
• Supports studies on data collection methods and measurement
• Facilitates the education of new researchers
• Enables the exploration of topics not envisioned by the initial investigators
• Permits the creation of new datasets when data from multiple sources are combined
• Provides content for scientific education
![Page 38: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/38.jpg)
Data sharing examples
Crystal structure of M-PMV retroviral protease
![Page 39: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/39.jpg)
Private sector too!
Cross-sector data sharing for Alzheimer’s research
http://www.adni-info.org
(News story)
![Page 40: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/40.jpg)
Increased citation rate
![Page 41: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/41.jpg)
Funding agency policies
NIH Data Sharing Policy
NSF Data Sharing Policy
Data management plan for grant applications
![Page 42: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/42.jpg)
Journal expectations
Data sharing as a term of publication
![Page 43: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/43.jpg)
How do I share?
Personal sharing
Share-upon-request: Email me for a copy!
Self-archive: Download from my personal website!
![Page 44: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/44.jpg)
Journal publishing
![Page 46: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/46.jpg)
Public archive or repository
The Ancient Agora of Athens
Ideal characteristics
• Popular with national/global coverage
• Specific to your discipline
• Offers long-term preservation
Find an archive/repository
• Ask colleagues
• Search http://databib.lib.purdue.edu/
![Page 47: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/47.jpg)
Public versus institutional archives and repositories
Institutional archives/repositories
• May restrict to a smaller audience
• May offer greater control of your data
Public archives/repositories
• Create comprehensive dataset for a larger research problem space
• Domain-specific archives/repositories may provide better support
![Page 48: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/48.jpg)
Help others find your data
Berkeley: www.berkeley.edu/mystuff/super-data.csv
file moves to
Stanford: www.stanford.edu/mystuff/super-data.csv
old URL is kaput
![Page 49: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/49.jpg)
Try permanent identifiers
DOI: Digital object identifier
Resolve a DOI by visiting http://dx.doi.org/ followed by the DOI
File can move, but DOI remains the same
The DOI record stores location details
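Resolving a DOI as described amounts to simple string concatenation. A minimal sketch, where the helper name `doi_to_url` and its handling of a leading "doi:" prefix are assumptions:

```python
DOI_RESOLVER = "http://dx.doi.org/"  # resolver address given on the slide


def doi_to_url(doi):
    """Turn a DOI into the address that redirects to the item's current location."""
    doi = doi.strip()
    # Accept the common "doi:" prefix as well as a bare DOI.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    return DOI_RESOLVER + doi
```

With the DOI cited earlier in this talk, `doi_to_url("10.1126/science.1207745")` gives `http://dx.doi.org/10.1126/science.1207745`, and that address stays valid even if the publisher moves the file.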
![Page 50: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/50.jpg)
Generate permanent identifiers
Request your free account by emailing [email protected]
http://n2t.net/ezid
Subscription through the UCB Library
![Page 51: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/51.jpg)
Final tips for sharing
Be selective
Recognize restrictions (privacy and confidentiality)
Online services for sharing among your team
• Research Hub
• 3rd party services
![Page 52: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/52.jpg)
Data management planning
![Page 53: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/53.jpg)
What is a data management plan?
A plan for organizing, storing, and sharing data
![Page 54: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/54.jpg)
Planning associated with greater self-control for
• exercise
• medical adherence
• self-health exams
• sunscreen use
• schoolwork
• refraining from a negative behavior
Source: Townsend and Liu, 2012
Perhaps planning helps for data management
![Page 55: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/55.jpg)
Why have a plan?
Prepare for efficient and quality data collection that is safe and shareable
NSF and NIH requirements
![Page 56: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/56.jpg)
NSF requirements
Data management plan
• ≤ 2 pages
• Describes how data will be managed, disseminated, and shared
Plan undergoes peer review
![Page 57: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/57.jpg)
Writing an NSF data management plan
Specific requirements vary by NSF divisions
In general, describe:
• Types of research data and materials produced
• Standards for data format, content, and metadata
• Policies for access and sharing
• Policies for re-use, re-distribution, and derivatives
• Plans for archiving and preserving
You can explain why data will not be shared
Examples 1 and 2
![Page 58: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/58.jpg)
NIH requirements
Timely data sharing encouraged
If requesting ≥ $500k per year, a plan is required
Describe how data will be shared or why sharing is not possible
In the final progress report, describe data sharing actions taken
![Page 59: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/59.jpg)
Writing an NIH data sharing plan
A brief paragraph
Suggested topics
• Schedule for sharing
• Format of the data
• Documentation of the data
• Analytic tools provided
• Data-sharing agreements (criteria and conditions)
• Mode of data sharing
there was a
beautiful scientist
![Page 60: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/60.jpg)
NIH plan example 1
The proposed research will involve a small sample (less than 20 subjects) recruited from clinical facilities in the New York City area with Williams syndrome. This rare craniofacial disorder is associated with distinguishing facial features, as well as mental retardation. Even with the removal of all identifiers, we believe that it would be difficult if not impossible to protect the identities of subjects given the physical characteristics of subjects, the type of clinical data (including imaging) that we will be collecting, and the relatively restricted area from which we are recruiting subjects.
Therefore, we are not planning to share the data.
![Page 61: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/61.jpg)
NIH plan example 2
This application requests support to collect public-use data from a survey of more than 22,000 Americans over the age of 50 every 2 years.
Data products from this study will be made available without cost to researchers and analysts. https://ssl.isr.umich.edu/hrs/
User registration is required in order to access or download files. As part of the registration process, users must agree to the conditions of use governing access to the public release data, including restrictions against attempting to identify study participants, destruction of the data after analyses are completed, reporting responsibilities, restrictions on redistribution of the data to third parties, and proper acknowledgement of the data resource.
Registered users will receive user support, as well as information related to errors in the data, future releases, workshops, and publication lists. The information provided to users will not be used for commercial purposes, and will not be redistributed to third parties.
![Page 62: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/62.jpg)
Library guidance
Guides, templates, examples
http://www.lib.berkeley.edu/sciences/data/guide
![Page 63: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/63.jpg)
Online service for building data plans
https://dmp.cdlib.org/
Step-by-step instructions for meeting funding agency requirements
![Page 64: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/64.jpg)
Data ethics
![Page 65: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/65.jpg)
Study by Martinson et al., 2005
Source: doi:10.1038/435737a
Motivated by increasing pressure to publish papers and win grants?
3247 respondents
0.3% admitted to falsification or “cooking” research data
About 1 in 3 confessed to committing at least one of 10 serious misbehaviors
![Page 66: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/66.jpg)
Citing data
![Page 67: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/67.jpg)
Prevent distortions and manipulations
Keep raw original data
Log all changes made
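Both practices can be supported with a simple change log. The sketch below is one possible approach, not a required one: it records a timestamp, the file name, a SHA-256 fingerprint of the file, and a description for each change, so any later edit to the raw data shows up as a changed fingerprint. The function names and the tab-separated log layout are assumptions.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def fingerprint(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def log_change(log_path, data_path, description):
    """Append one line to the change log describing what was done to a data file."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"{stamp}\t{Path(data_path).name}\t{fingerprint(data_path)}\t{description}\n"
    # Append mode keeps every earlier entry intact.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(entry)
    return entry
```

Logging the untouched raw file first gives a reference fingerprint; work then proceeds on copies, with one `log_change` call per modification.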
![Page 68: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/68.jpg)
Data licensing
Restrictions on data use, for example
• No for-profit use
• No re-sharing
• Give attribution
Check for license/terms of use
![Page 69: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/69.jpg)
Stay current with data requirements
Review for changes to policies by
• Funding agencies
• University regulations
• Federal and state governments
![Page 70: Scientific data management](https://reader036.vdocuments.site/reader036/viewer/2022062704/555ac04fd8b42ab1128b4b44/html5/thumbnails/70.jpg)
Haiku summary
Data is precious
Safely store and share widely
Good for your career