Reference Rot and E-Theses: Threat and Remedy. Peter Burnhill EDINA, University of Edinburgh & for the Hiberlink Team at University of Edinburgh & LANL Research Library. Centre for Service Delivery & Digital Expertise. Hiberlink ETD2014, Leicester UK July 25th 2014.
Reference Rot and E-Theses: Threat and Remedy
Hiberlink: ETD2014, Leicester UK, July 25th 2014
Funded by the Andrew W. Mellon Foundation
Peter Burnhill EDINA, University of Edinburgh &
for the Hiberlink Team at University of Edinburgh & LANL Research Library
Centre for Service Delivery & Digital Expertise
Overview
1. The Hiberlink Project & Reference Rot
2. Evidence of Threat of Reference Rot for the E-Thesis
• Our methods, data source & findings
3. Devising Remedy for Reference Rot in E-Thesis
• Proposals for intervention: plug-ins & infrastructural solutions
4. Next Steps: who (else) wants to take this work forward?
Reference Rot = Link Rot + Content Drift
“when links to web resources no longer point to what they once did”
Investigating Reference Rot in Web-Based Scholarly Communication
‘Link Rot’
+ Content Drift: what is at the end of the URI has changed, or gone!
http://dl00.org (2000)
http://dl00.org (2004)
http://dl00.org (2005)
http://dl00.org (2008)
(a) Dynamic content: values on the web page change over time
(b) Static content: but very different (often unrelated) web pages
An International Team at Work, funded by the Andrew W. Mellon Foundation
• Los Alamos National Laboratory, Research Library: Martin Klein, (Rob Sanderson), Harihar Shankar, Herbert Van de Sompel
• University of Edinburgh:
Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
EDINA *: Neil Mayo, Muriel Mewissen (Project Manager), Christine Rees, Tim Stickland, Richard Wincewicz, Peter Burnhill
* Centre for Service Delivery & Digital Expertise
What we are doing in Hiberlink, 2013 - 2015
1. Creating evidence on extent of ‘Reference Rot’
– Main focus has been on references (& URIs) made in Journal Articles
• Includes work on reference rot in Supreme Court judgments with Harvard Law Library & permaCC
– ETD2014 is opportunity to look at Reference Rot & the e-Thesis
2. Understanding the preparation/publication workflow
– Identifying opportunity for productive intervention
3. Prototypes for pro-active archiving to enable remedy
– Embedding such ‘solutions’ in existing tools & infrastructure
4. Raising awareness & seeking collaborative actions … through events like this
Evidence on the Threat of Reference Rot for E-Theses
Retrieving thinking about the emerging e-Thesis in 1998
University Theses Online Group, 1994/99
Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in
‘E-Theses Developments in the UK’ 2003
Measuring the Extent of ‘Reference Rot’ in e-Theses
Data Source
• Looked for corpus of e-Theses for our study period of 1997 – 2012
• Interested only in Doctoral Theses/Dissertations
• NDLTD Union Catalogue
Basic Method
a) Define selection and use information in the metadata record
• Degree awarded (PhD etc); Department
• Date thesis was successfully defended
• Link to the full text of the Doctoral Thesis
b) Download selected e-Thesis from each Institution’s Repository
7,500 E-Theses Downloaded from 5 US Institutions
In passing: note decline in numbers indicates ‘lag’ in ingest/availability of e-Theses
Key Aspects of Methodology (Stage 1)
1. Convert those e-Theses from PDF into XML
• pdftohtml -xml
2. Locate the references & extract each and every URL
• Technical challenges: URLs broken across a newline; underscores rendered as images
• Use up to 15 regular expressions for matching; regard each match as a URI
UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
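The URL extraction step can be sketched as follows. This is a minimal illustration, not the project's code: it uses a single hypothetical pattern rather than the (up to 15) regular expressions mentioned above, and the rule for rejoining URLs split across a newline is an assumption made for the example.

```python
import re

# Hypothetical pattern for illustration only; not one of the project's
# own regular expressions.
URL_RE = re.compile(r"""https?://[^\s<>"']+""")

# Assumed heuristic: a line that ends inside a URL on one of these
# characters is continued on the next line.
BREAK_RE = re.compile(r"(https?://\S*[/\-_=&?])\n\s*(\S)")

# Punctuation that usually belongs to the sentence, not to the URL.
TRAILING = ".,;:)]}'\""


def rejoin_broken_urls(text):
    """Repeatedly join URLs split across line breaks until nothing changes."""
    previous = None
    while previous != text:
        previous, text = text, BREAK_RE.sub(r"\1\2", text)
    return text


def extract_urls(text):
    """Return the distinct URLs found in text, in order of first appearance."""
    urls = []
    for match in URL_RE.finditer(rejoin_broken_urls(text)):
        url = match.group(0).rstrip(TRAILING)
        if url not in urls:
            urls.append(url)
    return urls
```

A heuristic of this kind will both miss and over-join some URLs, which is one reason a set of complementary expressions is useful in practice.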
Key Aspects of Methodology (Stage 2)
47,067 URIs were extracted
These were partitioned into two types:
i. 1,086 publisher sites, representing very many references to online articles: ‘the scholarly record’
• BTW, who does keep those articles in the Scholarly Record safe?
• Ask me for evidence on that!
ii. 45,981 URIs that linked ‘the Web at large’
• to Web content required for scholarship
• inc. websites, software, blogs, videos, online debate etc
• to that which lacks ‘fixity’ and changes over time
Those c.46,000 are the focus for the Hiberlink Project
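The partition into ‘scholarly record’ and ‘Web at large’ could be sketched as a host-based split. The host list below is a hypothetical seed chosen for the example; the list of publisher sites actually used in the study is not reproduced here, and matching on host name is an assumption about how such a split might be made.

```python
from urllib.parse import urlsplit

# Hypothetical seed list for illustration; not the study's list.
PUBLISHER_HOSTS = {"doi.org", "sciencedirect.com", "link.springer.com"}


def host_of(uri):
    """Lower-cased host name of uri, without a leading 'www.'."""
    host = (urlsplit(uri).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_publisher(uri):
    """True if the host is a listed publisher host or a subdomain of one."""
    host = host_of(uri)
    return any(host == p or host.endswith("." + p) for p in PUBLISHER_HOSTS)


def partition(uris):
    """Split uris into (scholarly_record, web_at_large)."""
    scholarly = [u for u in uris if is_publisher(u)]
    web_at_large = [u for u in uris if not is_publisher(u)]
    return scholarly, web_at_large
```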
Increase in Linking to ‘Web-at-large’ Resources, 1997-2010: beyond the e-journal, to that which lacks ‘fixity’ and changes over time
[Figure: URIs, by Year Thesis Defended (%), 1997 - 2010]
But Wide ‘Between-Thesis’ Variation in Number of Web Links
[Figure: box plots of medians (averages) & quartiles, with ‘outliers’; axis: Count (Log10); value shown: 1373]
• 10% of Theses have 25 or more URIs
• Median (average) increases from 4 to 5.5
• 75% have 2+ URIs per Thesis
Focus on e-Theses defended from 2003
Methodology (Stage 3): to discover answers to 2 questions
i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?
• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?
• Memento: a prior version, what the Original Resource was like at some time in the past
• Archival check carried out in June 2014, using installed version of Memento tool developed by LANL http://www.mementoweb.org/guide/quick-intro/
• A ‘Datetime’ version at or near the date the Thesis was defended
• Searching across several archives (not just Internet Archive)
Approach first used in pilot work at LANL; UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
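The ‘Live Web’ test in question (i) can be sketched as a loop that follows redirects, records the transaction chain and treats a 2XX status as ‘live’. This is a sketch under stated assumptions: `fetch` is a hypothetical caller-supplied function that makes one HTTP request without following redirects and returns the status code and any Location header; the project's own checker is not shown in the slides.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 50  # limit stated on the slide


def check_live(uri, fetch, max_redirects=MAX_REDIRECTS):
    """Follow redirects from uri and return (is_live, chain).

    fetch(uri) -> (status_code, location_or_None) is a hypothetical
    interface; chain is the list of (uri, status_code) pairs visited.
    """
    chain = []
    current = uri
    # One initial request plus at most max_redirects further requests.
    for _ in range(max_redirects + 1):
        status, location = fetch(current)
        chain.append((current, status))
        if 200 <= status < 300:
            return True, chain
        if 300 <= status < 400 and location:
            # Location may be relative, so resolve it against the current URI.
            current = urljoin(current, location)
            continue
        return False, chain
    return False, chain  # redirect budget exhausted
```

Passing the request function in as a parameter keeps the redirect logic separate from the HTTP client, so the same loop can be run against a stub or against real requests.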
A Measure of Reference Rot: Are those references available? [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
Less than two-thirds of those links lead to live content
         Live on Web    Not Found on ‘Live Web’    All
Count    29,122         16,860                     45,982
%        63.3           36.7                       100%
1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web subject to ‘rot’ (after up to 50 redirects)
Confirm?: 2/3rds ‘Live’ URI Ratio same across ‘Big 3’, 2003-2010
=> ‘On average’ 1/3rd of the links in an e-Thesis are ‘rotten’
The older the citation, the less likely it is to be still on the live Web [excluding 0s & 1s: a few theses are unaffected; a few are ruined]
We can’t stop that process of rot: Web content changes over time, and Reference Rot is an inevitable function of time
[Figure axis: Number of months elapsed from Date Thesis Defended until date archives checked (June 2014)]
Searching for ‘Datetime’ Mementos of content in ‘Archived Web’ [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
%                       Live on Web    Not found on ‘Live Web’    All
Found to be Archived                                              47.6
Not Found                                                         52.4
All                                                               100%
There seems to be a 50:50 chance that referenced content is in the ‘Archived Web’.
Some content is being ‘co-incidentally harvested’ by routine web archiving.
=> half of those references are at ‘risk of loss’
50:50 chance that ‘DateTime Reference’ is ‘Incidentally Archived’
‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis)
We can improve upon this ‘50:50 chance’ by pro-actively archiving what we cite
We already have ‘Lost Content’ for References to Web [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
%                       Live on Web    Not found on ‘Live Web’    All
Found to be Archived    29.3           18.3                       47.6
Not Found               34.0           18.4                       52.4
All                     63.3           36.7                       100%
18.4%: ‘not live & not found in archive’, judged to be lost forever
34%: ‘live’ & ‘not in archive’, is at risk of loss
NB: The 34% ‘at risk’ could be saved by pro-active archiving
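The cross-tabulation can be restated in code to show how the ‘lost’ and ‘at risk’ categories fall out of the two checks, and that the cell percentages agree with the marginals on the slide. The cell values are those reported above; the label ‘archived’ for the two remaining cells is wording chosen for this sketch.

```python
# Cell percentages as reported on the slide: (live on Web?, found in archive?).
CELLS = {
    (True, True): 29.3,
    (False, True): 18.3,
    (True, False): 34.0,
    (False, False): 18.4,
}


def status(live, archived):
    """Classify one reference from the results of the two checks."""
    if archived:
        return "archived"
    return "at risk" if live else "lost"


def marginal(cells, position, value):
    """Sum the cells whose key has the given value at position (0=live, 1=archived)."""
    total = sum(pct for key, pct in cells.items() if key[position] == value)
    return round(total, 1)


live_total = marginal(CELLS, 0, True)            # 'Live on Web' column
not_live_total = marginal(CELLS, 0, False)       # 'Not found on Live Web' column
archived_total = marginal(CELLS, 1, True)        # 'Found to be Archived' row
not_archived_total = marginal(CELLS, 1, False)   # 'Not Found' row
```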
Devising Remedy for Reference Rot in e-Theses
Having demonstrated problem exists & is severe
• The Web changes over time: reference rot occurs (36.7%)
• Incidental archiving through routine web archiving initiatives delivers no more than a 50:50 chance of success
• Seek pro-active ‘transactional archiving’ solutions
– focus on what is regarded by authors as important
• Thereby to remedy the integrity of the scholarly record
We aim to embed ‘solutions’ in existing tools & infrastructure
Our General Approach
a) Understand the preparation/publication workflow – identifying where there can be productive intervention
b) Devise prototypes for pro-active archiving – writing & implementing code!
c) Propose/test infrastructure for temporal referencing – supporting & using the Memento protocol
We are embedding ‘solutions’ in existing tools & infrastructure
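Temporal referencing with the Memento protocol, as in (c), comes down to asking a TimeGate for the version of a URI closest to a given datetime by sending an Accept-Datetime header. The sketch below only builds such a request. The TimeGate base URI is an assumption (the public Time Travel aggregator); an installed Memento tool would have its own endpoint.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: public aggregator TimeGate; replace with a local endpoint.
TIMEGATE_BASE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(uri, defended):
    """Return (timegate_uri, headers) for Memento datetime negotiation.

    defended must be a timezone-aware datetime, e.g. the date the
    thesis was defended.
    """
    when = defended.astimezone(timezone.utc)
    # Accept-Datetime takes an HTTP-date in GMT, as defined by the protocol.
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    return TIMEGATE_BASE + uri, headers
```

A client would send a GET or HEAD request to the returned URI with these headers and read the Memento-Datetime header of the response to see how close the archived version is to the requested date.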
Strategy for Making Remedy
Understanding 3 workflows: Rot or Remedy?
Extended length of stages in workflows magnify reference rot & affect
Identify the Actors
① Study -> Preparation -> (Review) -> Submission: Doctoral Student (& Supervisor)
② Post-Submission -> Examination -> (Revision) -> Award: Faculty, Examiners & Supervisor
③ Post-Award -> Deposit/Ingest -> Provide/Access -> Use: University & Library
Identify the best opportunities for Intervention to make Remedy
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factoring the HTML link
3. HiberActive - enables repositories to ‘stop the rot’ via actively archiving those references in e-theses
LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel
UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz
‘Work in progress’ to effect Remedy
1. Hiberlink Plug-in - to help authors and middle-folk (publishers/librarians) do the right thing:
– Zotero - used by authors to manage references
https://www.zotero.org/
– Open Journal System (OJS) - used by OA publishers
https://pkp.sfu.ca/ojs/
‘Work in progress’ to effect Remedy (1)
For use during preparation of thesis & before final submission but also
before deposit with Library (& maybe for repair by Library …)
Hiberlink Plug-in for Zotero
① Triggers archiving of referenced web content
② Returns Datetime URI for archived content
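The two steps of the plug-in can be pictured as one call that asks an archive to capture a URI and hands back a dated link. This is an illustrative sketch, not the plug-in's code: `submit` is a hypothetical caller-supplied function that sends the capture request and returns the response headers, and reading the snapshot location from Content-Location is an assumption about the archive's response.

```python
def archive_reference(uri, submit):
    """Trigger archiving of uri and return the dated link to the copy.

    submit(uri) -> dict of response headers is a hypothetical interface.
    """
    # Header names are case-insensitive, so normalise before lookup.
    headers = {name.lower(): value for name, value in submit(uri).items()}
    return {
        "original": uri,
        "archived_uri": headers.get("content-location"),  # assumed location header
        "datetime": headers.get("memento-datetime"),      # Memento response header
    }
```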
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factor the HTML link that is returned
‘Work in progress’ to effect Remedy (2)
a) Take simple URI - to French National Library (say)
b) Augment Link with a set of Datetime & location pairs
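Augmenting a link with Datetime & location pairs might look like the sketch below. The attribute names `data-versiondate` and `data-versionurl` are an assumption made for this example, and the archive URL in the usage line is a placeholder.

```python
from html import escape


def augment_link(uri, text, versions):
    """Build an <a> element that carries (datetime, location) pairs.

    versions is a list of (date_string, archived_uri) tuples.
    Attribute names are assumed for illustration.
    """
    dates = " ".join(date for date, _ in versions)
    locations = " ".join(location for _, location in versions)
    return '<a href="{}" data-versiondate="{}" data-versionurl="{}">{}</a>'.format(
        escape(uri, quote=True),
        escape(dates, quote=True),
        escape(locations, quote=True),
        escape(text),
    )


link = augment_link(
    "http://www.bnf.fr/",
    "BnF",
    [("2014-07-25", "http://archive.example/20140725/http://www.bnf.fr/")],
)
```

The original `href` is left untouched, so the link still works in any browser; the extra attributes give tools a way back to the version that was cited.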
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factoring the HTML link
First two approaches support ‘perfect scenario’:
• All authors archive all their cited URIs
• e.g. (but not exclusively) with Hiberlink / Zotero
3. HiberActive
– Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses
– A notification hub, a component for the infrastructure
• testing workflow with ResourceSync, CORE & external archive programme
‘Work in progress’ to effect Remedy (3)
Next Steps: who wants to take this work forward? To ensure references in e-Theses don’t rot
• Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by student/authors & by libraries
a) Offer to be an early adopter for these Hiberlink remedies
• The Hiberlink Plug-in for Zotero / HiberActive
Email: [email protected] Subject: Hiberlink ETD
b) Amend ‘Guidance for ETD Lifecycle Management’
Thank you, Questions welcome
http://hiberlink.org #hiberlink
Email: [email protected]
Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
Aside: We would all like to assume that our libraries are ensuring that online e-journal content is being kept safe
But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves.
Evidence from The Keepers Registry is worrying!
① Compare what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries etc) with all serials issued with an ISSN
‘Ingest Ratio’ = titles being ingested by one or more Keeper / ‘online serials’ in ISSN Register = 23,268 / 136,965 [in March 2014] => 17%
* We do not know about 83% of e-serials having ISSN
‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%
② Title Lists of 3 US research libraries (Columbia, Cornell & Duke), checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate
③ User-centric Evidence, usage logs for the UK OpenURL Router*
=> over two thirds, 68% (36,326 titles), held by none!
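The two Keepers Registry ratios in ① are simple proportions of the ISSN Register count. A short worked check, using only the figures quoted above (the variable names are chosen for this sketch):

```python
def ratio_percent(kept, total):
    """Share of titles kept, as a whole-number percentage."""
    return round(100 * kept / total)


ONLINE_SERIALS = 136_965      # 'online serials' in ISSN Register, March 2014
INGESTED_BY_ANY = 23_268      # ingested by one or more Keeper
INGESTED_BY_3_PLUS = 9_652    # ingested by 3+ Keepers

ingest_ratio = ratio_percent(INGESTED_BY_ANY, ONLINE_SERIALS)
keepsafe_ratio = ratio_percent(INGESTED_BY_3_PLUS, ONLINE_SERIALS)
unknown_share = 100 - ingest_ratio

print(ingest_ratio, keepsafe_ratio, unknown_share)  # 17 7 83
```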