Threats to the integrity of evidence in the record of scholarship
Peter Burnhill
EDINA University of Edinburgh
Focus on two unintended consequences of the Web/Internet
Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
① Digital back copy is not in the custody of libraries
Libraries boast of ‘e-collections’, but they only have ‘e-connections’
(Caroline Brazier, Chief Librarian, British Library)
Scholarly Articles increasingly link to the Web-at-large, not just back to other Articles
[Chart panels: arXiv, PMC. Dark solid lines represent URIs to the Web-at-large, 1997 to 2011]
② Links to web resources suffer ‘Reference Rot’, a combination of two factors:
Link Rot: the link stops working, e.g. HTTP 404 “Not Found”
+ Content Drift: what is at the end of the URI has changed, or gone!
To counter these threats to our scholarship, focus on two initiatives
1. Ensuring access, over the long term, to online journals
– Keepers Registry: a Jisc-funded service at EDINA
2. Remedy for reference rot: so what is cited is not lost
– Hiberlink: an Andrew W. Mellon Foundation-funded project at EDINA
What’s this got to do with the REF …?
Evidence base for REF 2014: four individual pieces of research output
for judgment in 36 Units of Assessment (UoAs): a grade point average for each institution
What of the evidence behind this? It has to be reckoned as important.
The value of Research Power for each HEI is shown as a dot plot for REF2008 & REF2014
[Dot plot, REF2014. Labelled HEIs: Cambridge; UCL, Oxford; Edinburgh; KCL, Nottingham; Imperial, Bristol, Leeds; Manchester]
Research Power: ‘a measure of quality multiplied by volume’ [the REF grade point average x FTE]
“The Scholarly Record has a fuzzy edge”
Much of what is in the Scholarly Record (& submitted as evidence to the REF) is digital … and online somewhere:
‘e-journals’; ‘book-length work’; conference proceedings; ‘Gov Docs’; ‘e-magazines’; ‘e-newsmedia’; ‘data as findings’; new ‘research objects’; websites, databases, repositories
National Science Library,
Chinese Academy of Sciences
Good News: we have some digital shelving
① Web-scale not-for-profit archiving agencies:
① National institutions (usually national libraries) …
① Consortia of university libraries & specialist centres …
Private LOCKSS Networks
… and you can now discover who is looking after what
We can derive two Key Performance Indicators (KPIs)
‘Ingest Ratio’ = titles ‘ingested & archived’ by 1+ Keeper / ‘online serials’ in ISSN Register
‘KeepSafe Ratio’ = titles ingested by 3+ Keepers / ‘online serials’ in ISSN Register
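The two ratios above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `keeper_ratios`, the input shape (a mapping from ISSN to the set of keepers holding that title) and the sample data are hypothetical, and the ISSNs shown are placeholders rather than real identifiers.

```python
from typing import Dict, Set, Tuple


def keeper_ratios(keepers_by_issn: Dict[str, Set[str]],
                  keepsafe_threshold: int = 3) -> Tuple[float, float]:
    """Return (Ingest Ratio, KeepSafe Ratio) as percentages.

    keepers_by_issn maps every online serial in scope to the set of
    keepers that have ingested it (an empty set means none known).
    """
    total = len(keepers_by_issn)
    if total == 0:
        raise ValueError("no online serials supplied")
    # Ingest Ratio: titles ingested by at least one keeper.
    ingested = sum(1 for k in keepers_by_issn.values() if len(k) >= 1)
    # KeepSafe Ratio: titles ingested by three or more keepers.
    keepsafe = sum(1 for k in keepers_by_issn.values()
                   if len(k) >= keepsafe_threshold)
    return 100.0 * ingested / total, 100.0 * keepsafe / total


# Hypothetical sample: four titles, placeholder ISSNs.
sample = {
    "0000-0001": {"CLOCKSS", "Portico", "Keeper C"},
    "0000-0002": {"Portico"},
    "0000-0003": set(),
    "0000-0004": set(),
}
ingest, keepsafe = keeper_ratios(sample)
print(f"Ingest Ratio: {ingest:.1f}%  KeepSafe Ratio: {keepsafe:.1f}%")
```

Both ratios share the same denominator (all online serials in scope), so the KeepSafe Ratio can never exceed the Ingest Ratio.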
Big Variation in Archival Status of Online Continuing Resources (assigned ISSN) by Country, July 2015
ISSN Count  Country       Ingest Ratio %  KeepSafe Ratio %  Archival status unknown
31757       USA           34.1            6.9               20911
14569       UK            44.6            20.2              8066
12118       France        4.8             1.4               11538
8655        Canada        7.7             0.2               7988
7189        Brazil        0.8             0.2               7130
7121        Germany       26.6            10.0              5228
6556        Spain         4.0             0.5               6296
5411        Netherlands   68.0            48.6              1729
5248        India         6.7             1.9               4899
5078        Australia     4.6             1.4               4846
4955        International 2.2             0.4               4847
3908        Finland       0.6             0.1               3884
3576        Italy         5.8             1.1               3368
3456        Denmark       2.1             0.4               3383
2700        New Zealand   4.0             0.1               2591
2693        Poland        8.8             0.9               2457
2251        Romania       3.6             0.0               2169
2187        Japan         6.3             3.8               2050
2153        Czech Rep.    2.7             0.1               2094
2070        Russian Fed.  6.8             5.5               1929
1991        Norway        2.0             0.2               1950
1769        Argentina     1.1             0.8               1749
1688        Switzerland   15.8            3.7               1421
1627        Hungary       4.6             0.6               1553
1224        Slovenia      1.06            0.00              1211
1149        Croatia       1.74            0.00              1129
1092        Egypt         59.89           3.85              438
1071        South Korea   6.44            3.45              1002
1053        Iran          1.52            0.47              1037
1015        Sweden        10.25           0.89              911
Others: < 1,000 each
Total: 165,949
Ingest Ratio for the top 10 countries by number of ISSNs assigned
What of the articles in journals and other serials that were submitted to REF2014?
• Articles were > 80% of all submissions
For some Panels, books and other types of output may be important, but journal articles are also important!
Is the content of those journals being kept safe for future scholarship?
Estimate the Ingest Ratio and KeepSafe Ratio for each Unit of Assessment:
1. identify the e-journals, using ISSN in the metadata for the articles submitted & the ISSN Register
2. check archival status in the Keepers Registry
3. compute the percentage of journals with at least some volumes held on the ‘digital shelves’ of an organisation with archival intent
(an idea from Steven Carlyle-Davies, EDINA)
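The three steps above might look roughly like this in Python. It is a sketch, not the procedure actually used: the function `ratios_by_uoa`, its input shapes and the sample identifiers (`J1`, `J2`, …) are all hypothetical stand-ins for REF submission metadata, the ISSN Register and a Keepers Registry lookup.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def ratios_by_uoa(submissions: Iterable[Tuple[str, str]],
                  online_issns: Set[str],
                  keeper_counts: Dict[str, int]) -> Dict[str, Tuple[float, float]]:
    """Estimate (Ingest Ratio %, KeepSafe Ratio %) per Unit of Assessment.

    submissions   -- (uoa, issn) pairs taken from article metadata
    online_issns  -- ISSNs recorded as online serials (step 1)
    keeper_counts -- issn -> number of keepers holding it (step 2);
                     a missing ISSN is treated as zero keepers
    """
    # Step 1: collect the distinct e-journals cited under each UoA.
    journals = defaultdict(set)
    for uoa, issn in submissions:
        if issn in online_issns:
            journals[uoa].add(issn)

    # Steps 2 and 3: look up archival status and compute percentages.
    result = {}
    for uoa, issns in journals.items():
        n = len(issns)
        ingested = sum(1 for i in issns if keeper_counts.get(i, 0) >= 1)
        keepsafe = sum(1 for i in issns if keeper_counts.get(i, 0) >= 3)
        result[uoa] = (round(100 * ingested / n, 1),
                       round(100 * keepsafe / n, 1))
    return result


# Hypothetical data: J9 is not an online serial, so it is ignored.
subs = [("Law", "J1"), ("Law", "J2"), ("Law", "J1"),
        ("Classics", "J3"), ("Classics", "J9")]
uoa_ratios = ratios_by_uoa(subs, {"J1", "J2", "J3"}, {"J1": 3, "J3": 1})
print(uoa_ratios)
```

Counting distinct journals rather than articles matches step 3, which asks about journals; weighting by article count would be a different (also defensible) measure.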
Many of the journals used in the REF are not known to be archived => they are at risk of loss. This varies by Panel.
Archiving of Journals in REF, by Unit of Assessment
[Chart labels: Law, Classics]
The big publishers are paying to be archived,
by CLOCKSS & Portico
[Chart labels: Elsevier, Hindawi, T&F, OUP, Wiley, Springer, Karger, etc.]
BIG publishers have acted, but incompletely
Very many ‘at risk’ e-journals from many (small & not so small) publishers
This includes the ‘applied literature’ that has societal impact
② That other unintended consequence of the Web
Reference Rot = Link Rot + Content Drift
“when links to web resources no longer point to what was intended”
Funding: Andrew W. Mellon Foundation
What of the citations that act as evidence for those articles submitted to the REF?
‘Link Rot’ is known to be scary
Content Drift may be even scarier: what is at the end of the cited URL has changed, or gone!
[Screenshots: http://dl00.org as it appeared in 2000, 2004, 2005 and 2008]
(a) Dynamic content: values on the webpage change over time
(b) Static content, but very different (often unrelated) web pages
Hiberlink analysed 1 million URI links to the Web-at-large, not links to publisher & access platforms (DOI etc.)
Methodology: answers to 2 questions
1. Do those links (URIs) still work on the ‘Live Web’?
2. Is there a ‘Memento’ of that reference in the ‘Archived Web’?
If a Memento cannot be found in a Web Archive within N days of the date of publication, but the URI is still active on the Live Web, then it is at risk of loss (& rot)
If a Memento cannot be found in a Web Archive within N days of the date of publication, and the URI is not active on the Live Web, then it is lost / rotten
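The two questions and the two rules above amount to a small decision procedure, sketched here in Python. The function name `classify_reference`, the status labels and the reading of “within N days” as N days either side of publication are assumptions made for illustration; the default of 14 days is taken from the results reported for the study.

```python
from datetime import date, timedelta
from typing import Iterable


def classify_reference(published: date,
                       memento_dates: Iterable[date],
                       is_live: bool,
                       n_days: int = 14) -> str:
    """Classify one cited URI as 'archived', 'at risk' or 'lost'.

    memento_dates -- capture dates of any Mementos found in web archives
    is_live       -- whether the URI still resolves on the Live Web
    Assumption: 'within N days' means N days before or after publication.
    """
    window = timedelta(days=n_days)
    archived = any(abs(m - published) <= window for m in memento_dates)
    if archived:
        return "archived"
    # No suitable Memento: the outcome depends on the Live Web alone.
    return "at risk" if is_live else "lost"


pub = date(2012, 6, 1)
print(classify_reference(pub, [date(2012, 6, 10)], is_live=False))  # archived
print(classify_reference(pub, [date(2013, 1, 1)], is_live=True))    # at risk
print(classify_reference(pub, [], is_live=False))                   # lost
```

A Memento captured long after publication does not count here, because the content may already have drifted from what the author cited.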
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014)
Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot.
PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
Hiberlink Results: within 14 days of publication date …
                               PMC     Elsevier
‘Not Archived’                 74.5%   75.2%
Of those ‘Not Archived’:
  still ‘Live’ on the Web      80%     67.3%
  ‘No longer Live’ on the Web  20%     32.7%
1/5th & 1/3rd of articles have Reference Rot within a fortnight of publication
Most referenced URIs at risk of loss
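As a rough worked check on the figures above, the two levels of percentages can be multiplied through to give shares of all analysed links. This assumes the ‘Live’ and ‘No longer Live’ rows are shares of the ‘Not Archived’ links, as the row label states; the variable and function names are illustrative only.

```python
# Percentages as reported on the results slide.
results = {
    "PMC":      {"not_archived": 74.5, "still_live": 80.0, "no_longer_live": 20.0},
    "Elsevier": {"not_archived": 75.2, "still_live": 67.3, "no_longer_live": 32.7},
}


def share_of_all_links(not_archived_pct: float, subset_pct: float) -> float:
    """Convert a share of the 'Not Archived' links into a share of all links."""
    return round(not_archived_pct * subset_pct / 100, 1)


summary = {}
for corpus, r in results.items():
    at_risk = share_of_all_links(r["not_archived"], r["still_live"])
    lost = share_of_all_links(r["not_archived"], r["no_longer_live"])
    summary[corpus] = (at_risk, lost)
    print(f"{corpus}: {at_risk}% at risk, {lost}% already lost")
```

On this reading, roughly 15% (PMC) and 25% (Elsevier) of all analysed links are neither archived nor live, and a further 50 to 60% survive only on the Live Web.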
Team at Harvard Law School establishing similar evidence:
“We documented a serious problem of reference rot:
• more than 70% of the URLs within the above mentioned [law] journals, and
• 50% of the URLs within U.S. Supreme Court opinions suffer reference rot
— meaning, again, that they do not produce the information originally cited.”
Jonathan Zittrain, Kendra Albert and Lawrence Lessig (2014).
Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations.
Legal Information Management 14. doi:10.1017/S1472669614000255.
=> Content of All Citations Rot over Time!!
… leading to rotten references for the reader
Rot in References means a Defective Article!
undermines the integrity of the scholarly record
Hiberlink Remedy
As with fish, ‘Quick Freeze & Store’
• Snapshot & Save: Proactive/ Transactional archiving
• Turn a simple URI into a hiberlink URI
Snapshot URI + Original URI + DateTime [Robust Link syntax] http://robustlinks.mementoweb.org/spec/
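A link carrying all three pieces of information could be generated along these lines. This is a sketch: the helper `robust_link` is hypothetical, the attribute names `data-originalurl`, `data-versionurl` and `data-versiondate` follow the Robust Links specification as best understood here and should be checked against the spec URL above, and the snapshot URI and date in the example are invented.

```python
from datetime import date
from html import escape


def robust_link(original_uri: str, snapshot_uri: str,
                snapshot_date: date, text: str) -> str:
    """Build an HTML anchor that records original URI, snapshot URI and date."""
    attrs = {
        "href": original_uri,                 # link target for the reader
        "data-originalurl": original_uri,     # what the author cited
        "data-versionurl": snapshot_uri,      # archived snapshot of it
        "data-versiondate": snapshot_date.isoformat(),  # when it was cited
    }
    rendered = " ".join(f'{name}="{escape(value)}"'
                        for name, value in attrs.items())
    return f"<a {rendered}>{escape(text)}</a>"


# Hypothetical snapshot location and date, for illustration only.
link = robust_link(
    "http://dl00.org",
    "https://archive.example.org/20050101/http://dl00.org",
    date(2005, 1, 1),
    "DL00 conference site",
)
print(link)
```

Keeping the original URI alongside the snapshot means a reader can still try the Live Web, or look for other Mementos near the recorded date, if the snapshot itself disappears.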
No time to say more, so go to
http://hiberlink.org
Looking to the next REF: Open Access & Impact
Open Access also has to mean Assured Access:
– “infrastructure … for the curation, integration, discovery, presentation and preservation of digital collections”
1. Only accept articles that are on digital shelves under policy control of research libraries?
– What is paid for as open access should be kept safe
• Check theKeepers.org but also a role for repositories
2. Only accept articles with citations to Web resources that have been archived & are accessible to the reader?
• Check Hiberlink.org
- delivering capability & stewardship, nationally & internationally
- part of University of Edinburgh’s commitment to the Sector