data repositories: recommendation, certification and models for cost recovery
TRANSCRIPT
| 1
Anita de Waard, ORCID 0000-0002-9034-4119, VP Research Data Collaborations, Elsevier RDM Services, [email protected]
NSF Workshop, February 28 to March 1, 2017
Data Repositories: Recommendation, Certification and Models for Cost Recovery
| 2
Four Types of Repositories:
[Diagram labels: Research Question; Object of Study; Raw Data; Processed Data; Data With Paper; Curated Record; Publication; Method; Analysis; Tables/Figures; Curate; Methods; Software]
• Local Storage/Instrument Repositories (Size: PB; Nr of files: Trillions): NOAA: 20 TB/ NASA streaming > 24 PB/day; NASA Reverb: 12 PB Data; NSSD: > 230 TB of digital data; NSIDC: 1 PB data, : 1 PB total; ALMA Telescope: 40 TB/day
• Institutional/Local Repositories (Size: GB; Nr of files: Billions): Deep Blue (Umich): 80 k; MIT Dspace: 75 k; HAL (France): 60 k; D-Space Cambr: 1.5 k; of which data: hundreds
• Non-Domain Repositories (Size: MB; Nr of files: Millions): Figshare: 1.2 M; DataDryad: 3 k; Dataverse: 58 k
• Domain Repositories (Size: kB; Nr of files: 100ks): PetDB: 6 k; PDB: 100 k; NIST ASD: 170 k
| 3
Recommended vs Certified Data Repositories [1]
• Studied repositories recommended by 17 organisations:
- Compiled list of 242 recommended repositories
- Identified criteria for recommendation
- Identified overlap between recommendations (Fig 1)
• Identified 5 certification schemes:
- Compiled list of 129 certified repositories
- Identified criteria for certification
- Identified overlap between recommended & certified repositories (Fig 2)
Figure 1: Most repositories are recommended by < 3 parties
Figure 2: Most recommended repositories are not certified
[1] All data is openly available at doi:10.17632/zx2kcyvvwm.1
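The overlap counts behind Figures 1 and 2 can be illustrated with a minimal sketch. It is not the study's actual method or data: the organisation and repository names below are made-up placeholders, and the helper names `recommendation_counts` and `overlap_summary` are hypothetical.

```python
from collections import Counter

# Hypothetical placeholder input: organisation -> repositories it recommends.
# These names are illustrative only and do not come from the study's dataset.
recommendations = {
    "org_a": {"repo_1", "repo_2", "repo_3"},
    "org_b": {"repo_2", "repo_4"},
    "org_c": {"repo_2", "repo_3", "repo_5"},
}
# Hypothetical placeholder set of certified repositories.
certified = {"repo_2", "repo_5", "repo_6"}


def recommendation_counts(recs):
    """Count how many organisations recommend each repository (cf. Fig 1)."""
    counts = Counter()
    for repos in recs.values():
        counts.update(repos)
    return counts


def overlap_summary(recs, certified_repos):
    """Compare the recommended set with the certified set (cf. Fig 2)."""
    recommended = set().union(*recs.values())
    return {
        "recommended": len(recommended),
        "certified": len(certified_repos),
        "both": len(recommended & certified_repos),
        "recommended_not_certified": len(recommended - certified_repos),
    }


counts = recommendation_counts(recommendations)
# Repositories recommended by fewer than 3 parties.
few = sum(1 for n in counts.values() if n < 3)
summary = overlap_summary(recommendations, certified)
print(few, summary)
```

With real input, the two dictionaries would be filled from the openly available dataset cited in [1]; the set operations stay the same.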
| 4
Set of Shared Criteria Between Recommendation and Certification of Repositories
Columns: Umbrella Categories; Shared Meaning; Recommended Repository Criteria; Repository Certification Scheme Criteria
• Mission:
- Explicit mission statement in providing long-term responsibility, persistence, and management of data(sets)
• Community/Recognition:
- Evidence of use by downloads or citations from an identifiable and active user community
- Understand and meet the needs of the designated and defined target community
• Legal and Contractual Compliance:
- Repository operates within a legal framework/Ensures compliance with legal regulations
- When applicable, have contractual regulations governing the protection of human subjects
- Contracts and agreements maintained with relevant parties on relevant subjects
• Access/Accessibility:
- Public access to the scientific/repository designated community
- Anonymous referees (including peer-reviewers) have access to the data before public release as indicated by policies
• Technical Structure/Interface:
- The software system supports data organisation and searchability by both humans and computers. The interface is intuitive and mobile user-friendly
- The technical (infra)structure is appropriate, protective, and secure
• Retrievability:
- Data need to have enough metadata. All data receive a persistent identifier
• Preservation:
- Long-term and formal preservation/succession plan for the data, even if the repository ceases to exist
- If the data are retracted, the persistent identifier needs to be maintained
- Preservation of data information properties and metadata
Final report: Husen, Sean Edward; de Wilde, Zoë G.; de Waard, Anita; Cousijn, Helena (2017), “Recommended versus Certified Repositories: Mind the Gap”, submitted for revision, CODATA Data Science Journal, Feb 20, 2017
| 5
Two Economies of Science [3]:
Debit Economy (like a pie)
• Single pile of ‘stuff’ gets divided:
- Thing can only be for one person at one time
- “If you get more, I get less”
• Examples:
- Money
- Jobs
- Samples, equipment, space, etc.
• Behaviors:
- Hoarding, secrecy
- (Cut-throat) competition
- Winning by owning (and not sharing)
Credit Economy (like a song)
• Credit comes from visibility:
- The more you give away, the more you benefit
- “Only if I share do I really own” (“You need me to do you!” JW)
• Examples:
- Papers, citations
- Good ideas (if credited)
- Skills
• Behaviors:
- Open access, citation game
- Collaboration with top-X
- Winning by sharing (to enable priority & visibility)
<<< DATA ???
[3] Paula Stephan: “How Economics Shapes Science”, Harvard University Press, 2012: http://www.jstor.org/stable/j.ctt2jbqd1
| 6
RDA Repository Cost Recovery IG
• Interviewed 22 repositories & reported [2]
• Different income streams:
1. Structurally funded
2. Mostly data access charges
3. Mostly data deposit fees
4. Membership fees (for deposits and/or access)
5. Serial project funding
6. Supported by host institution
• Different new models under consideration:
- Sponsorships/services for the commercial sector
- Contracts for specific services offered (hosting, archiving, curation)
- Expanding the number of affiliated institutions
- Deposit fees
- More services for “national memory institutes”
• Some comments:
- Some countries structurally fund repositories (not US!)
- Some repositories embedded in scholarly practice
- Hard to come up with new models: no time, no skill sets!
• Next step: OECD/GSF WG studies more in-depth, more countries: http://www.codata.org/working-groups/oecd-gsf-sustainable-business-models
[2] Available at https://www.rd-alliance.org/final-report-income-streams-data-repositories.html
| 7
Thank you!
More on Elsevier’s RDM program and other interesting efforts:
• https://www.hivebench.com
• https://www.elsevier.com/physical-sciences/earth-and-planetary-sciences/the-2015-international-data-rescue-award-in-the-geosciences
• http://www.journals.elsevier.com/softwarex/
• https://www.elsevier.com/books-and-journals/content-innovation/data-base-linking
• https://rd-alliance.org/groups/rdawds-publishing-data-services-wg.html
• https://rd-alliance.org/bof-data-search.html
• https://datasearch.elsevier.com/
• https://data.mendeley.com/
• https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data
• https://www.force11.org/
• http://www.nationaldataservice.org/
• https://rd-alliance.org/
• https://www.elsevier.com/about/open-science/research-data
Anita de Waard, [email protected]