Software Repositories for Research -- An Environmental Scan
![Page 1: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/1.jpg)
Software Repositories for Research -- An Environmental Scan
Micah Altman
MIT Libraries
Prepared for Digital Preservation 2016
Milwaukee, November 2016
![Page 2: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/2.jpg)
Disclaimer
These opinions are my own; they are not the opinions of MIT, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!” -- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
![Page 3: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/3.jpg)
Related Publications
• Altman M, Jackman S. “Nineteen Ways of Looking at Statistical Software”. Journal of Statistical Software. 2011;42.
• Altman, Micah, and Gary King. "A proposed standard for the scholarly citation of quantitative data." D-lib 13, no. 3 (2007):
• Altman, M., Gill, J. and McDonald, M.P., 2004. Numerical issues in statistical computing for the social scientist. John Wiley & Sons.
Reprints available from: informatics.mit.edu
![Page 4: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/4.jpg)
Today’s Perspectives
* Motivations *
* Methods *
* Measures *
* Musings *
* Merit *
![Page 5: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/5.jpg)
Motivations
![Page 6: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/6.jpg)
Why Software?
![Page 7: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/7.jpg)
What is Software
Working definition: “Part of a computer system that consists of encoded information or computer instructions” (Wikipedia) that is directly executable within a system.
Corollaries
Software generally is composed of instantiations of algorithms, heuristics, and fixed information (internal data).
The behavior and output of software generally depend on the execution context: execution environment (software, hardware, network, networked resources), configuration parameters, and dynamic inputs.
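One way to make the second corollary concrete is to record the execution context alongside each run. The following is a minimal sketch, not a method from this talk: the function name `capture_execution_context`, the configuration keys, and the input file name are all hypothetical, and a real record would likely also need library versions and networked resources.

```python
import json
import platform


def capture_execution_context(config=None, inputs=None):
    """Return a JSON-serializable snapshot of what a run depends on."""
    return {
        # Execution environment: interpreter, operating system, hardware.
        "environment": {
            "python_version": platform.python_version(),
            "implementation": platform.python_implementation(),
            "os": platform.system(),
            "machine": platform.machine(),
        },
        # Configuration parameters supplied for this run.
        "configuration": dict(config or {}),
        # Dynamic inputs, recorded here by name only.
        "inputs": list(inputs or []),
    }


context = capture_execution_context(
    config={"seed": 42, "tolerance": 1e-8},  # hypothetical parameters
    inputs=["survey_2016.csv"],              # hypothetical input file name
)
print(json.dumps(context, indent=2, sort_keys=True))
```

Storing such a snapshot next to the output gives a later reader a starting point for explaining why the same code produced different results elsewhere.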
![Page 8: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/8.jpg)
Some Caution About Definitions
Software is often tightly coupled to data
Boundaries among software objects and systems are fuzzy & permeable
Usefulness of software is strongly dependent on the intent of the user, the knowledge and capabilities of the user (documentation matters), and the execution context.
"... if they [philosophers] do ask and they want a definition, they do not want the most natural definition, e.g. of 'chair' they do not want the definition 'something to sit on'. Why are they not satisfied with the normal definition of chair, or, to put the question in another way, why do they wish to ask for the definition of a physical object?"
Source: "From the Minutes of the Moral Science Club, 23.2.1939" in Wittgenstein in Cambridge (2008)
![Page 9: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/9.jpg)
Research Questions
Characterizing Research Software Repositories and Related Practices
How is software related to research formally disseminated?
Which “repositories” (points for mid/long term publishing/access of software) are recognized at the discipline level?
What are the relative prevalence and affordances of “repositories” for software as compared to other established disciplinary repositories?
What practices, requirements, or standards for software curation and preservation are recognized at the disciplinary level?
![Page 10: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/10.jpg)
Methods
![Page 11: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/11.jpg)
Literature Review
Data Curation, Publication and Citation
Software significant properties, use cases
Software repositories
Software & scientific reproducibility
Software Engineering Methodology
![Page 12: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/12.jpg)
Web Research - Practice
Review of research repositories
Sources: OpenDOAR, Re3Data, SherpaJuliet
Goals: Estimate prevalence of repositories that accept research software; identify exemplar repositories; characterize feature sets by repository category
Methods: term-based queries; descriptive statistics; stratified content case studies
Review of software directories
Sources: OpenHub, OSDir, DMOZ
Goals: Identify additional software repositories used in research
Methods: qualitative text analysis; descriptive statistics
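The term-based queries and descriptive statistics named above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the scan: the query terms, the record fields (`description`, `content_types`), and the four sample records are all invented, and no real registry data or API is used.

```python
import re

# Hypothetical query terms; the terms actually used in the scan are not given.
SOFTWARE_TERMS = ("software", "source code", "scripts")


def mentions_software(record, terms=SOFTWARE_TERMS):
    """True if any term appears as a whole word or phrase in the record text."""
    text = " ".join(
        [record.get("description", "")] + list(record.get("content_types", []))
    ).lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms
    )


def prevalence(records):
    """Share of records that mention software (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(1 for r in records if mentions_software(r)) / len(records)


# Invented records for illustration only; not drawn from any registry.
records = [
    {"description": "Institutional repository for theses and articles",
     "content_types": ["articles", "theses"]},
    {"description": "Archive for survey datasets",
     "content_types": ["datasets"]},
    {"description": "Accepts research software and source code",
     "content_types": ["software"]},
    {"description": "Digitized manuscripts and images",
     "content_types": ["images"]},
]

print(f"Share mentioning software: {prevalence(records):.2f}")  # prints 0.25
```

Matching on word boundaries keeps "manuscripts" from counting as a hit for "scripts", which is the kind of false positive that plain substring queries can introduce.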
![Page 13: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/13.jpg)
Web Research - Policies
Review of funder policies
Sources: ROARMAP; US Federal Agency Websites
Goals: Estimate prevalence of funder policies on software curation; identify exemplar policies; identify recommended repositories
Methods: qualitative text analysis; descriptive statistics
Review of journal policies
Sources: Google Scholar, WoS, DOAJ, Software Sustainability Institute Index
Goals: Estimate prevalence of journals that publish research software; estimate prevalence of software policies at journals; identify exemplar policies; identify recommended repositories
Methods: qualitative text analysis; descriptive statistics
![Page 14: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/14.jpg)
Measures
![Page 15: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/15.jpg)
Typical Prevalence of Software Repositories
![Page 16: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/16.jpg)
We Got Nothing
RE3, SherpaJuliet
![Page 17: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/17.jpg)
And the Nothing We Got Is Not That Great
![Page 18: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/18.jpg)
Some Exemplars and Promising Initiatives
• Citation and publisher policies
- FORCE 11 Software Citation Principles: www.force11.org/software-citation-principles
- ACM New Publication Policies on Software Reproducibility and Contributorship: www.acm.org/publications/policies
- PLOS: http://journals.plos.org/plosone/s/materials-and-software-sharing
• Long Term Access:
- www.softwareheritage.org
- www.softwarepreservationnetwork.org/
- guides.github.com/activities/citable-code/
- archive.org/details/softwarelibrary
• Software Journals:
- www.journals.elsevier.com/softwarex/
- www.jstatsoft.org/
- http://openresearchsoftware.metajnl.com/
![Page 19: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/19.jpg)
Musings
![Page 20: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/20.jpg)
![Page 21: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/21.jpg)
![Page 22: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/22.jpg)
![Page 23: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/23.jpg)
Use Cases and Motivating Value
• Historic / cultural
- historical scholarship
- “intrinsic value”
• Replication and reproducibility
- check claims made in research
- reduce deliberate research fraud
- check reliability (robustness) of results
- check validity (accuracy)
• Reuse
- efficiency
- increase speed of development
- standards compliance
- apply methodology to a different corpus
- increased quality and dependability
• Render other digital objects
- renders other objects meaningful
- see digital preservation use cases
• Legal
- record of licensing, ownership, copyright
- manage legal risks/accountability
- compliance with laws/funding mandates
- reduce barriers to long-term access for other historic use, replication, reuse, rendering
• Citation and attribution
- track individual academic career
- track software development/history
- track institutional outputs
- track funder outputs
![Page 24: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/24.jpg)
Repository Affordances
• creator
- Authoring/Development: Language-specific authoring tools; Build environment integration; Versioning; Documentation; Project management; Collaboration
- Attribution
- Preservation: Backups; Commitment to long-term access
- Legal: Access control; License templating
• curator
- Authoring/Development: Project management; License template; Monitoring; Collaboration
- Discovery/Access: Browsing; Searching; Persistent Identifiers; Version Ids
- Collection: Collection Policy; Peer Review; Selection; Annotation; Metadata
- Preservation: Preservation policy; Documentation; Format management
- Legal: Access control; License standardization; Legal guidance
• institution
- Authoring/Development: Author, Funder Identifiers; Metrics
- Discovery/Access: Author, Funder Identifiers; Metrics
- Collection: Author, Funder Identifiers; Metrics; Compliance; Attribution
- Preservation: Preservation Policy; Preservation replication; Auditability; Certification
- Legal: License standardization; Privacy Management
• end-user
- Discovery/Access: Browsing; Searching; Search engine integration; Persistent Identifiers; Version Ids
- Collection: Selection criteria; Annotation; Quality Measures
- Preservation: Documentation
- Legal: Open licensing; License discoverability
![Page 25: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/25.jpg)
Merit
![Page 26: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/26.jpg)
Preliminary findings: State of Software Curation
1. No comprehensive indices of software archives
2. Orders of magnitude fewer software archives than data archives.
(Corollary: Institutional repositories offer little functionality for software archiving, even when nominally supported)
3. Very small proportion of funders have policies addressing software curation
4. There is little available advice for researchers who wish to curate, cite, & preserve software
5. Substantial reproducibility failures related to software continue to be reported
“Nothing Exists” - Parmenides (ca. 500 BCE)
![Page 27: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/27.jpg)
Contrast with Data Curation -- Lack of Progress
• Compliance
– Funder: data management plans, open data
– Publishers: data access/archiving/citation
• Norms & practices
– Joint data citation principles
– Recognition of data in funder biosketches
– Increased recognition of reproducibility gaps
– Increased recognition of open data/open science
• Technical infrastructure
– Data repository directories
– Data citation indices
– ORCID researcher identifier and registry
• Recognition
– Data citation indices
– Virtual branded archives
– High-profile data publications
![Page 28: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/28.jpg)
Summing it all up…
Software curation looks a lot like data curation a decade ago…
“How much slower would scientific progress be if the near universal standards for scholarly citation of articles and books had never been developed? Suppose shortly after publication only some printed works could be reliably found by other scholars; or if researchers were only permitted to read an article if they first committed not to criticize it, or were required to coauthor with the original author any work that built on the original. How many discoveries would never have been made if the titles of books and articles in libraries changed unpredictably, with no link back to the old title; if printed works existed in different libraries under different titles; if researchers routinely redistributed modified versions of other authors' works without changing the title or author listed; or if publishing new editions of books meant that earlier editions were destroyed? …
“Unfortunately, no such universal standards exist for citing quantitative ~~data~~ software, and so all the problems listed above exist now. Practices vary from field to field, archive to archive, and often from article to article. The ~~data~~ software cited may no longer exist, may not be available publicly, or may have never been held by anyone but the investigator. ~~Data~~ Software listed as available from the author are unlikely to be available for long and will not be available after the author retires or dies. … ~~Data~~ Software are sometimes listed in the bibliography, sometimes in the text, sometimes not at all, and rarely with enough information to guarantee future access to the identical data set. Replicating published tables and figures even without having to rerun the original experiment, is often difficult or impossible”
-- Altman & King 2007
![Page 29: Software Repositories for Research-- An Environmental Scan](https://reader036.vdocuments.site/reader036/viewer/2022062412/58e8eeaa1a28ab1f248b4c73/html5/thumbnails/29.jpg)
Questions?
Web: informatics.mit.edu
Email: