![Page 1: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/1.jpg)
Reference Rot and E-Theses: Threat and Remedy
Hiberlink: ETD2014, Leicester UK July 25th 2014
Funded by the Andrew W. Mellon Foundation
Peter Burnhill, EDINA, University of Edinburgh
for the Hiberlink Team at University of Edinburgh & LANL Research Library
Centre for Service Delivery & Digital Expertise
![Page 2: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/2.jpg)
Overview
1. The Hiberlink Project & Reference Rot
2. Evidence of Threat of Reference Rot for the E-Thesis
• Our methods, data source & findings
3. Devising Remedy for Reference Rot in E-Thesis
• Proposals for intervention: plug-ins & infrastructural solutions
4. Next Steps: who (else) wants to take this work forward?
![Page 3: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/3.jpg)
Reference Rot = Link Rot + Content Drift
“when links to web resources no longer point to what they once did”
Investigating Reference Rot in Web-Based Scholarly Communication
![Page 4: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/4.jpg)
Link Rot
![Page 5: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/5.jpg)
+ Content Drift: what is at the end of the URI has changed, or gone!
http://dl00.org in 2000, 2004, 2005 and 2008
(a) Dynamic content: values on the webpage change over time
(b) Static content, but very different (often unrelated) web pages
![Page 6: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/6.jpg)
An International Team at Work, funded by the Andrew W. Mellon Foundation
• Los Alamos National Laboratory:
Research Library: Martin Klein, (Rob Sanderson), Harihar Shankar, Herbert Van de Sompel
• University of Edinburgh:
Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
EDINA * : Neil Mayo, Muriel Mewissen (Project Manager), Christine Rees, Tim Stickland, Richard Wincewicz, Peter Burnhill
* Centre for Service Delivery & Digital Expertise
![Page 7: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/7.jpg)
What we are doing in Hiberlink, 2013 - 2015
1. Creating evidence on extent of ‘Reference Rot’
– Main focus has been on references (& URIs) made in Journal Articles
• Includes work on reference rot in Supreme Court judgments with Harvard Law Library & Perma.cc
– ETD2014 is an opportunity to look at Reference Rot & the e-Thesis
2. Understanding the preparation/publication workflow
– Identifying opportunity for productive intervention
3. Prototypes for pro-active archiving to enable remedy
– Embedding such ‘solutions’ in existing tools & infrastructure
4. Raising awareness & seeking collaborative actions … through events like this
![Page 8: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/8.jpg)
Evidence on the Threat of Reference Rot for E-Theses
![Page 9: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/9.jpg)
Retrieving thinking about the emerging e-Thesis in 1998
University Theses Online Group, 1994/99
Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in
‘E-Theses Developments in the UK’ 2003
![Page 10: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/10.jpg)
![Page 11: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/11.jpg)
Measuring the Extent of ‘Reference Rot’ in e-Theses
Data Source
• Looked for corpus of e-Theses for our study period of 1997 – 2012
• Interested only in Doctoral Theses/Dissertations
• NDLTD Union Catalogue
Basic Method
a) Define selection and use information in the metadata record
• Degree awarded (PhD etc); Department
• Date thesis was successfully defended
• Link to the full text of the Doctoral Thesis
b) Download selected e-Thesis from each Institution’s Repository
![Page 12: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/12.jpg)
7,500 E-Theses Downloaded from 5 US Institutions
In passing: note decline in numbers indicates ‘lag’ in ingest/availability of e-Theses
![Page 13: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/13.jpg)
Key Aspects of Methodology (Stage 1)
1. Convert those e-Theses from PDF into XML
• pdftohtml -xml
2. Locate the references & extract each and every URL
• Technical challenges: URL broken across a newline; underscore rendered as an image
• Use up to 15 regular expressions for matching; regard each match as a URI
UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
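A minimal sketch of the URL extraction step, for illustration only. The single pattern, the line-break repair heuristic and the function name `extract_urls` are assumptions standing in for the (up to 15) expressions the team actually used, which are not shown in these slides.

```python
import re

# Hypothetical stand-in for the project's expressions: one simple http(s) pattern.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
# Heuristic: a URL that ends a line on '/', '-', '_', '=', '&' or '?' continues on the next line.
BROKEN_URL = re.compile(r"(https?://\S*[/\-_=&?])\n\s*(\S)")
TRAILING_PUNCTUATION = ".,;:)]}'\""


def extract_urls(text: str) -> list[str]:
    """Return candidate URLs in order of first appearance, rejoining line-broken URLs."""
    previous = None
    while previous != text:  # repeat until no broken URL remains
        previous = text
        text = BROKEN_URL.sub(r"\1\2", text)
    urls = [match.rstrip(TRAILING_PUNCTUATION) for match in URL_PATTERN.findall(text)]
    return list(dict.fromkeys(urls))  # de-duplicate, keep order


sample = "See http://example.org/a/\nb.html, and (https://example.com/x)."
found = extract_urls(sample)
```

The join heuristic will sometimes glue a URL to the following line by mistake, which is one reason a real pipeline needs several expressions and manual checks.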
![Page 14: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/14.jpg)
Key Aspects of Methodology (Stage 2)
47,067 URIs were extracted
These were partitioned into two types:
i. 1,086 publisher sites, representing very many references to online articles: ‘the scholarly record’
• BTW, who does keep those articles in the Scholarly Record safe?
• Ask me for evidence on that!
ii. 45,981 URIs that linked ‘the Web at large’
• to Web content required for scholarship
• inc. websites, software, blogs, videos, online debate etc
• to that which lacks ‘fixity’ and changes over time
Those c.46,000 are the focus for the Hiberlink Project
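The partition described above might be sketched as follows. The host list `PUBLISHER_HOSTS` is entirely hypothetical (the slides do not say how publisher sites were identified), and matching on hostname is an assumption made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical publisher hosts; the real list is not given in the slides.
PUBLISHER_HOSTS = {"publisher.example", "journals.example"}


def is_publisher(uri: str, publisher_hosts=PUBLISHER_HOSTS) -> bool:
    """True when the URI's host is a listed publisher host or a subdomain of one."""
    host = (urlsplit(uri).hostname or "").lower()
    return any(host == p or host.endswith("." + p) for p in publisher_hosts)


def partition(uris):
    """Split URIs into (scholarly_record, web_at_large)."""
    scholarly, web_at_large = [], []
    for uri in uris:
        (scholarly if is_publisher(uri) else web_at_large).append(uri)
    return scholarly, web_at_large


scholarly, web_at_large = partition(
    ["http://www.publisher.example/doi/1", "http://blog.example.org/post"]
)
# The counts reported on the slide add up to the total extracted.
reported_total = 1_086 + 45_981
```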
![Page 15: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/15.jpg)
Increase in Linking to ‘Web-at-large’ Resources, 1997-2010: beyond the e-journal, to that which lacks ‘fixity’ and changes over time
[Chart: URIs, by Year Thesis Defended (%), 1997-2010; 50% level marked]
![Page 16: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/16.jpg)
But Wide ‘Between-Thesis’ Variation in Number of Web Links
Focus on e-Theses defended from 2003
• 10% of Theses have 25 or more URIs
• Median (average) increases from 4 to 5.5
• 75% have 2+ URIs per Thesis
[Chart: box plots of medians (averages) & quartiles, with ‘outliers’; axis: Count (Log10); labelled value: 1373]
![Page 17: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/17.jpg)
![Page 18: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/18.jpg)
Methodology (Stage 3): to discover answer to 2 questions
i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?
• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?
Memento: a prior version, what the Original Resource was like at some time in the past.
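A minimal sketch of the ‘live’ check in question (i), under stated assumptions: the function name `check_live` and the injected `fetch` callable (returning a status code and an optional `Location` value) are hypothetical, chosen so the logic can be exercised without network access. It is not the project's own code.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# fetch(uri) -> (HTTP status code, Location header value or None); supplied by the caller.
Fetch = Callable[[str], tuple[int, Optional[str]]]


def check_live(uri: str, fetch: Fetch, max_redirects: int = 50):
    """Follow redirects; return (is_live, chain) where chain lists (uri, status) per hop."""
    chain = []
    current = uri
    for _ in range(max_redirects + 1):  # first request plus up to max_redirects hops
        status, location = fetch(current)
        chain.append((current, status))
        if 300 <= status < 400 and location:
            current = urljoin(current, location)  # resolve relative redirects
            continue
        return 200 <= status < 300, chain  # a 2XX final status counts as 'live'
    return False, chain  # redirect limit exhausted


# Stubbed responses for demonstration.
_responses = {
    "http://a.example/": (301, "/new"),
    "http://a.example/new": (200, None),
    "http://gone.example/": (404, None),
    "http://loop.example/": (302, "http://loop.example/"),
}


def fake_fetch(uri):
    return _responses[uri]


live, chain = check_live("http://a.example/", fake_fetch)
```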
![Page 19: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/19.jpg)
Methodology (Stage 3): to discover answer to 2 questions
i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?
ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?
• Archival check carried out in June 2014, using installed version of Memento tool developed by LANL: http://www.mementoweb.org/guide/quick-intro/
• A ‘Datetime’ version at or near the date the Thesis was defended
• Searching across several archives (not just Internet Archive)
Approach first used in pilot work at LANL; UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
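To illustrate the ‘Datetime version at or near the date the Thesis was defended’, here is a small sketch. The `Accept-Datetime` request header is part of the Memento protocol; everything else (function names, the archive URIs, the candidate list) is hypothetical and only shows the idea of choosing the nearest archived version across several archives.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(defended: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date for an Accept-Datetime header."""
    return format_datetime(defended.astimezone(timezone.utc), usegmt=True)


def closest_memento(mementos, target):
    """Pick the (datetime, uri) pair whose datetime is nearest to the target, or None."""
    if not mementos:
        return None
    return min(mementos, key=lambda m: abs(m[0] - target))


defended = datetime(2010, 6, 15, tzinfo=timezone.utc)  # illustrative defence date
candidates = [  # hypothetical Mementos gathered from more than one archive
    (datetime(2009, 1, 1, tzinfo=timezone.utc), "http://archive-a.example/2009/page"),
    (datetime(2010, 6, 1, tzinfo=timezone.utc), "http://archive-b.example/2010/page"),
    (datetime(2013, 1, 1, tzinfo=timezone.utc), "http://archive-a.example/2013/page"),
]
header_value = accept_datetime(defended)
best = closest_memento(candidates, defended)
```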
![Page 20: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/20.jpg)
A Measure of Reference Rot: Are those references available? [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
Less than two-thirds of those links lead to live content

| | Live on Web (after up to 50 redirects) | Not Found on ‘Live Web’ | All |
|---|---|---|---|
| Count | 29,122 | 16,860 | 45,982 |
| % | 63.3 | 36.7 | 100 |

1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web subject to ‘rot’
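The percentages on this slide follow directly from the two counts; a quick check (variable names are ours):

```python
live = 29_122       # URIs found on the 'Live Web'
not_live = 16_860   # URIs not found on the 'Live Web'
total = live + not_live

pct_live = round(100 * live / total, 1)
pct_not_live = round(100 * not_live / total, 1)
print(total, pct_live, pct_not_live)
```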
![Page 21: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/21.jpg)
Confirm?: the two-thirds ‘Live’ URI ratio is the same across the ‘Big 3’, 2003-2010
=> ‘On average’ one third of the links in an e-Thesis are ‘rotten’
![Page 22: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/22.jpg)
The older the citation, the less likely it is to be still on the live Web [excluding 0s & 1s: a few theses are unaffected; a few are ruined]
We can’t stop that process of rot: Web content changes over time; Reference Rot is an inevitable function of time
[Chart axis: number of months elapsed from Date Thesis Defended until date archives checked (June 2014)]
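The age measure on the chart axis can be computed as a whole-month difference. This is a sketch: the function name is ours, and taking the first day of June 2014 as the check date is an assumption (the slides give only the month).

```python
from datetime import date

ARCHIVES_CHECKED = date(2014, 6, 1)  # June 2014; the day of month is an assumption


def months_elapsed(defended: date, checked: date = ARCHIVES_CHECKED) -> int:
    """Whole calendar months between the defence date and the archive check."""
    return (checked.year - defended.year) * 12 + (checked.month - defended.month)


example = months_elapsed(date(2010, 6, 15))
```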
![Page 23: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/23.jpg)
Searching for ‘Datetime’ Mementos of content in ‘Archived Web’ [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

| % | Live on Web | Not found on ‘Live Web’ | All |
|---|---|---|---|
| Found to be Archived | | | 47.6 |
| Not Found | | | 52.4 |
| All | | | 100 |

There seems a 50:50 chance that referenced content is in the ‘Archived Web’.
Some content is being ‘co-incidentally harvested’ by routine web archiving.
=> half of those references are at ‘risk of loss’
![Page 24: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/24.jpg)
50:50 chance that ‘DateTime Reference’ is ‘Incidentally Archived’
![Page 25: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/25.jpg)
‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis)
We can improve upon this ‘50:50 chance’ by pro-actively archiving what we cite
![Page 26: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/26.jpg)
We already have ‘Lost Content’ for References to Web [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

| % | Live on Web | Not found on ‘Live Web’ | All |
|---|---|---|---|
| Found to be Archived | 29.3 | 18.3 | 47.6 |
| Not Found | 34.0 | 18.4 | 52.4 |
| All | 63.3 | 36.7 | 100 |

• 18.4% ‘not live & not found in archive’: judged to be lost forever
• 34% ‘live’ & ‘not in archive’: is at risk of loss
NB: The 34% ‘at risk’ could be saved by pro-active archiving
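The cross-tabulation can be checked for internal consistency, and the ‘lost’ and ‘at risk’ cells read off, with a few lines (the dictionary layout and key names are ours):

```python
# Percentages of all Web-at-large URIs, keyed by (archive status, live status).
cells = {
    ("archived", "live"): 29.3,
    ("archived", "not_live"): 18.3,
    ("not_archived", "live"): 34.0,
    ("not_archived", "not_live"): 18.4,
}


def margin(position: int, value: str) -> float:
    """Sum the cells whose key has `value` at `position` (0 = archive, 1 = live)."""
    return round(sum(v for k, v in cells.items() if k[position] == value), 1)


lost_forever = cells[("not_archived", "not_live")]  # not live and not archived
at_risk = cells[("not_archived", "live")]           # still live but not yet archived
```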
![Page 27: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/27.jpg)
Devising Remedy for Reference Rot in e-Theses
![Page 28: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/28.jpg)
Having demonstrated problem exists & is severe
• The Web changes over time: reference rot occurs (36.7%)
• Incidental archiving via the routine work of web archiving initiatives delivers no more than a 50:50 chance of success
• Seek pro-active ‘transactional archiving’ solutions
– focus on what is regarded by authors as important
• Thereby to remedy the integrity of the scholarly record
We aim to embed ‘solutions’ in existing tools & infrastructure
Our General Approach
![Page 29: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/29.jpg)
a) Understand the preparation/publication workflow – identifying where there can be productive intervention
b) Devise prototypes for pro-active archiving – writing & implementing code!
c) Propose/test infrastructure for temporal referencing – supporting & using the Memento protocol
We are embedding ‘solutions’ in existing tools & infrastructure
Strategy for Making Remedy
![Page 30: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/30.jpg)
Understanding 3 workflows: Rot or Remedy?
Identify the Actors
Extended length of stages in workflows magnify reference rot & affect
① Study -> Preparation -> (Review) -> Submission: Doctoral Student (& Supervisor)
② Post-Submission -> Examination -> (Revision) -> Award: Faculty, Examiners & Supervisor
③ Post-Award -> Deposit/Ingest -> Provide/Access -> Use: University & Library
Identify the best opportunities for Intervention to make Remedy
![Page 31: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/31.jpg)
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factoring the HTML link
3. HiberActive - enables repositories to ‘stop the rot’ via actively archiving those references in e-theses
LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel
UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz
‘Work in progress’ to effect Remedy
![Page 32: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/32.jpg)
1. Hiberlink Plug-in - to help authors and middle-folk (publishers/librarians) do the right thing:
– Zotero - used by authors to manage references
https://www.zotero.org/
– Open Journal System (OJS) - used by OA publishers
https://pkp.sfu.ca/ojs/
‘Work in progress’ to effect Remedy (1)
![Page 33: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/33.jpg)
For use during preparation of thesis & before final submission, but also before deposit with Library (& maybe for repair by Library …)
Hiberlink Plug-in for Zotero
① Triggers archiving of referenced web content
② Returns Datetime URI for archived content
![Page 34: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/34.jpg)
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factor the HTML link that is returned
a) Take simple URI - to French National Library (say)
b) Augment Link with a set of Datetime & location pairs
‘Work in progress’ to effect Remedy (2)
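One way to picture steps (a) and (b) is a helper that decorates a plain HTML link with Datetime and archive-location pairs. This is a sketch only: the attribute names `data-versiondate` and `data-versionurl`, the space-separated encoding, the function name and the archive URI are all assumptions for illustration, not the project's specification.

```python
from html import escape


def augmented_link(original_uri: str, link_text: str, snapshots) -> str:
    """Build an <a> element carrying (date, archived-copy URI) pairs as data attributes.

    `snapshots` is a list of (ISO date string, archive URI) tuples.
    """
    dates = " ".join(date for date, _ in snapshots)
    urls = " ".join(url for _, url in snapshots)
    return (
        f'<a href="{escape(original_uri, quote=True)}" '
        f'data-versiondate="{escape(dates, quote=True)}" '
        f'data-versionurl="{escape(urls, quote=True)}">'
        f"{escape(link_text)}</a>"
    )


html_link = augmented_link(
    "http://www.bnf.fr/",  # a simple URI, e.g. to the French National Library
    "BnF",
    [("2014-07-25", "http://archive.example/20140725/http://www.bnf.fr/")],  # hypothetical
)
```

The original `href` is left untouched, so the link still works as before for readers and tools that ignore the extra attributes.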
![Page 35: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/35.jpg)
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factoring the HTML link
First two approaches support ‘perfect scenario’:
• All authors archive all their cited URIs
• e.g. (but not exclusively) with Hiberlink / Zotero
3. HiberActive
– Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses
– A notification hub, a component for the infrastructure
• testing workflow with ResourceSync, CORE & external archive programme
‘Work in progress’ to effect Remedy (3)
![Page 36: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/36.jpg)
Next Steps: who wants to take this work forward? To ensure references in e-Theses don’t rot
• Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by student/authors & by libraries
a) Offer to be an early adopter for these Hiberlink remedies
• The Hiberlink Plug-in for Zotero / HiberActive
Email: [email protected] Subject: Hiberlink ETD
b) Amend ‘Guidance for ETD Lifecycle Management’
![Page 37: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/37.jpg)
Thank you, Questions welcome
http://hiberlink.org #hiberlink
Email: [email protected]
![Page 38: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/38.jpg)
Aside: We would all like to assume that our libraries are ensuring that online e-journal content is being kept safe
But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves.
Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
![Page 39: Reference Rot and E-Theses: Threat and Remedy](https://reader036.vdocuments.site/reader036/viewer/2022062410/56815c67550346895dca77d4/html5/thumbnails/39.jpg)
Evidence from The Keepers Registry is worrying!
① Compare what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries etc) with all issued with ISSN
‘Ingest Ratio’ = titles being ingested by one or more Keeper / ‘online serials’ in ISSN Register = 23,268 / 136,965 [in March 2014] => 17%
‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%
* We do not know about 83% of e-serials having ISSN *
② Title Lists of 3 US research libraries (Columbia, Cornell & Duke), checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate
③ User-centric Evidence, usage logs for the UK OpenURL Router*
=> over two thirds, 68% (36,326 titles), held by none!
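The two Keepers Registry ratios in ① can be reproduced from the quoted counts (variable names are ours):

```python
online_serials = 136_965   # 'online serials' in the ISSN Register, March 2014
ingested_by_any = 23_268   # titles ingested by one or more Keeper
ingested_by_3plus = 9_652  # titles ingested by three or more Keepers

ingest_ratio = round(100 * ingested_by_any / online_serials)
keepsafe_ratio = round(100 * ingested_by_3plus / online_serials)
unknown_fate = 100 - ingest_ratio
print(ingest_ratio, keepsafe_ratio, unknown_fate)
```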