Page 1: Reference Rot and E-Theses: Threat and Remedy

Reference Rot and E-Theses: Threat and Remedy

Hiberlink

ETD2014, Leicester, UK, July 25th 2014

Funded by the Andrew W. Mellon Foundation

Peter Burnhill, EDINA, University of Edinburgh, for the Hiberlink Team at University of Edinburgh & LANL Research Library

Centre for Service Delivery & Digital Expertise

Page 2: Reference Rot and E-Theses: Threat and Remedy

Overview

1. The Hiberlink Project & Reference Rot

2. Evidence of Threat of Reference Rot for the E-Thesis

• Our methods, data source & findings

3. Devising Remedy for Reference Rot in E-Thesis

• Proposals for intervention: plug-ins & infrastructural solutions

4. Next Steps: who (else) wants to take this work forward?

Page 3: Reference Rot and E-Theses: Threat and Remedy

Reference Rot = Link Rot + Content Drift

“when links to web resources no longer point to what they once did”

Investigating Reference Rot in Web-Based Scholarly Communication

Page 4: Reference Rot and E-Theses: Threat and Remedy

Link Rot


Page 5: Reference Rot and E-Theses: Threat and Remedy

+ Content Drift: what is at the end of the URI has changed, or gone!

[Figure: the page at http://dl00.org as it appeared in 2000, 2004, 2005 and 2008]

(a) Dynamic content: values on a web page change over time

(b) Static content, but very different (often unrelated) web pages

Page 6: Reference Rot and E-Theses: Threat and Remedy

An International Team at Work, funded by the Andrew W. Mellon Foundation

• Los Alamos National Laboratory:

Research Library: Martin Klein, (Rob Sanderson), Harihar Shankar, Herbert Van de Sompel

• University of Edinburgh:

Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou

EDINA *: Neil Mayo, Muriel Mewissen (Project Manager), Christine Rees, Tim Stickland, Richard Wincewicz, Peter Burnhill

* Centre for Service Delivery & Digital Expertise


Page 7: Reference Rot and E-Theses: Threat and Remedy

What we are doing in Hiberlink, 2013 - 2015

1. Creating evidence on extent of ‘Reference Rot’

– Main focus has been on references (& URIs) made in Journal Articles

• Includes work on reference rot in Supreme Court judgments with Harvard Law Library & Perma.cc

– ETD2014 is opportunity to look at Reference Rot & the e-Thesis

2. Understanding the preparation/publication workflow

– Identifying opportunity for productive intervention

3. Prototypes for pro-active archiving to enable remedy

– Embedding such ‘solutions’ in existing tools & infrastructure

4. Raising awareness & seeking collaborative actions ... through events like this

Page 8: Reference Rot and E-Theses: Threat and Remedy

Evidence on the Threat of Reference Rot for E-Theses

Page 9: Reference Rot and E-Theses: Threat and Remedy

Retrieving thinking about the emerging e-Thesis in 1998

University Theses Online Group, 1994/99

Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in

‘E-Theses Developments in the UK’ 2003


Page 11: Reference Rot and E-Theses: Threat and Remedy

Measuring the Extent of ‘Reference Rot’ in e-Theses

Data Source

• Looked for corpus of e-Theses for our study period of 1997 – 2012

• Interested only in Doctoral Theses/Dissertations

• NDLTD Union Catalogue

Basic Method

a) Define selection and use information in the metadata record

• Degree awarded (PhD etc); Department

• Date thesis was successfully defended

• Link to the full text of the Doctoral Thesis

b) Download selected e-Theses from each Institution’s Repository

Page 12: Reference Rot and E-Theses: Threat and Remedy

7,500 E-Theses Downloaded from 5 US Institutions

In passing: note decline in numbers indicates ‘lag’ in ingest/availability of e-Theses

Page 13: Reference Rot and E-Theses: Threat and Remedy

Key Aspects of Methodology (Stage 1)

1. Convert those e-Theses from PDF into XML

• pdftohtml -xml

2. Locate the references & extract each and every URL

• Technical challenges: URL broken across a newline; underscore rendered as an image

• Use up to 15 regular expressions for matching; regard each match as a URI
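The extraction in step 2 can be pictured with a minimal Python sketch. It is not the project's code: the slides do not list the 15 expressions, so the single pattern, the line-joining heuristic and the names `URL_PATTERN`, `BROKEN_URL` and `extract_urls` are all assumptions made for illustration. It does not handle the 'underscore as image' case, which needs information from the PDF conversion.

```python
import re

# Assumption: one illustrative pattern; the slides mention up to 15 but do not give them.
URL_PATTERN = re.compile(r"(?i:https?://|www\.)[^\s<>\"']+")

# Heuristic (assumption): a URL fragment ending in / - _ = & at a line end is
# re-joined with the next line when that line starts with a lowercase letter or digit.
BROKEN_URL = re.compile(r"((?i:https?://|www\.)\S*[/\-_=&])\n[ \t]*(?=[a-z0-9])")

TRAILING_PUNCTUATION = ".,;:!?)]}'\""


def extract_urls(text):
    """Return URL-like strings found in text, in order of first appearance."""
    previous = None
    while previous != text:  # repeat so URLs broken more than once are fully joined
        previous = text
        text = BROKEN_URL.sub(r"\1", text)
    found = (m.group(0).rstrip(TRAILING_PUNCTUATION) for m in URL_PATTERN.finditer(text))
    return list(dict.fromkeys(found))  # de-duplicate, keep order


sample = (
    "See http://www.example.org/path/\nto/page.html, and also "
    "(www.example.com/a_b).\nThe end http://www.example.org/x/\nThe next sentence."
)
print(extract_urls(sample))
```

Stripping trailing punctuation keeps sentence-final commas and brackets out of the match, at the cost of truncating the rare URL that legitimately ends in one of those characters.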

UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou

Page 14: Reference Rot and E-Theses: Threat and Remedy

Key Aspects of Methodology (Stage 2)

47,067 URIs were extracted. These were partitioned into two types:

i. 1,086 publisher sites, representing very many references to online articles: ‘the scholarly record’

• BTW, who does keep those articles in the Scholarly Record safe?

• Ask me for evidence on that!

ii. 45,981 URIs that linked to ‘the Web at large’

• to Web content required for scholarship

• inc. websites, software, blogs, videos, online debate etc

• to that which lacks ‘fixity’ and changes over time

Those c.46,000 are the focus for the Hiberlink Project

Page 15: Reference Rot and E-Theses: Threat and Remedy

Increase in Linking to ‘Web-at-large’ Resources, 1997-2010: beyond the e-journal, to that which lacks ‘fixity’ and changes over time

[Chart: URIs, by Year Thesis Defended (%), 1997 - 2010; label: 50%]

Page 16: Reference Rot and E-Theses: Threat and Remedy

But Wide ‘Between-Thesis’ Variation in Number of Web Links

Focus on e-Theses defended from 2003

[Figure: box plots of medians (averages) & quartiles, with ‘outliers’; axis: Count (Log10); labelled value: 1373]

• 10% of Theses have 25 or more URIs

• Median (average) increases from 4 to 5.5

• 75% have 2+ URIs per Thesis

Page 17: Reference Rot and E-Theses: Threat and Remedy

Methodology (Stage 3): to discover answer to 2 questions

i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?

• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
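The liveness rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's crawler: `check_live` and the `fetch` callable (which returns a status code and an optional Location value for one request, without following redirects itself) are hypothetical names. Passing the fetcher in keeps the rule itself testable without network access.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 50  # limit stated on the slide


def check_live(uri, fetch, max_redirects=MAX_REDIRECTS):
    """Follow redirects from uri, recording each (uri, status) hop.

    fetch(uri) is a hypothetical callable returning (status_code, location_or_None).
    Returns (is_live, chain); is_live is True only when the chain ends in a 2XX status.
    """
    chain = []
    current = uri
    redirects = 0
    while True:
        status, location = fetch(current)
        chain.append((current, status))
        if 200 <= status < 300:
            return True, chain
        if 300 <= status < 400 and location and redirects < max_redirects:
            current = urljoin(current, location)  # Location may be relative
            redirects += 1
            continue
        return False, chain  # 4XX/5XX, redirect without target, or limit exceeded


# Canned responses standing in for real HTTP requests.
responses = {
    "http://a.example/old": (301, "/new"),
    "http://a.example/new": (200, None),
    "http://a.example/gone": (404, None),
    "http://a.example/loop": (302, "http://a.example/loop"),
}
print(check_live("http://a.example/old", responses.__getitem__))
```

The recorded chain is what lets a later analysis distinguish a direct 2XX from one reached only after several redirects.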

Page 18: Reference Rot and E-Theses: Threat and Remedy

Methodology (Stage 3): to discover answer to 2 questions

i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?

• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’

ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?

Memento: a prior version, what the Original Resource was like at some time in the past.

Page 19: Reference Rot and E-Theses: Threat and Remedy

Methodology (Stage 3): to discover answer to 2 questions

i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?

ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?

• Archival check carried out in June 2014, using installed version of Memento tool developed by LANL: http://www.mementoweb.org/guide/quick-intro/

• A ‘Datetime’ version at or near the date the Thesis was defended

• Searching across several archives (not just Internet Archive)
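As a rough illustration of a 'Datetime' lookup, the Memento protocol (RFC 7089) has a client send an `Accept-Datetime` header to a TimeGate, which redirects to the archived version closest to that date. The sketch below only builds such a request. The endpoint in `TIMEGATE_PREFIX` and the function name `timegate_request` are assumptions for illustration; the study used its own installed tool, whose configuration is not given in the slides.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: a public aggregator TimeGate; the study used a locally installed tool.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(uri, defended):
    """Build the URL and headers for a TimeGate lookup near the defence date."""
    if defended.tzinfo is None:
        defended = defended.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    stamp = format_datetime(defended.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + uri, {"Accept-Datetime": stamp}


url, headers = timegate_request("http://example.org/page", datetime(2008, 5, 15))
print(url)
print(headers)
```

A client would then issue a GET with these headers and read the redirect target, plus the `Memento-Datetime` header of the final response, to see how close the archived copy is to the date the thesis was defended.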

Approach first used in pilot work at LANL; UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou

Page 20: Reference Rot and E-Theses: Threat and Remedy

A Measure of Reference Rot: Are those references available?
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

          Live on Web   Not Found on ‘Live Web’   All
Count     29,122        16,860                    45,982
%         63.3          36.7                      100

Less than two-thirds of those links lead to live content (after up to 50 redirects)

1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web are subject to ‘rot’

Page 21: Reference Rot and E-Theses: Threat and Remedy

Confirm?: 2/3rds ‘Live’ URI Ratio same across ‘Big 3’, 2003-2010

=> ‘On average’ one third of the links in an e-Thesis are ‘rotten’

Page 22: Reference Rot and E-Theses: Threat and Remedy

The older the citation, the less likely to be still on the live Web [excluding 0s&1s: a few theses are unaffected; a few are ruined]

We can’t stop that process of rot: Web content changes over time; Reference Rot is an inevitable function of time

[Chart x-axis: number of months elapsed from Date Thesis Defended until date archives checked (June 2014)]

Page 23: Reference Rot and E-Theses: Threat and Remedy

Searching for ‘Datetime’ Mementos of content in ‘Archived Web’
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

%                      Live on Web   Not found on ‘Live Web’   All
Found to be Archived                                           47.6
Not Found                                                      52.4
All                                                            100

There seems a 50:50 chance that referenced content is in the ‘Archived Web’.

Some content is being ‘co-incidentally harvested’ by routine web archiving.

=> half of those references are at ‘risk of loss’

Page 24: Reference Rot and E-Theses: Threat and Remedy

50:50 chance that ‘DateTime Reference’ is ‘Incidentally Archived’

Page 25: Reference Rot and E-Theses: Threat and Remedy

‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis)

We can improve upon this ‘50:50 chance’ by pro-actively archiving what we cite

Page 26: Reference Rot and E-Theses: Threat and Remedy

We already have ‘Lost Content’ for References to Web
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

%                      Live on Web   Not found on ‘Live Web’   All
Found to be Archived   29.3          18.3                      47.6
Not Found              34.0          18.4                      52.4
All                    63.3          36.7                      100

18.4% ‘not live & not found in archive’: judged to be lost forever

34% ‘live’ & ‘not in archive’: is at risk of loss

NB: The 34% ‘at risk’ could be saved by pro-active archiving
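The row and column totals in this table follow directly from its four cells. A small Python check makes the arithmetic explicit; the dictionary keys and the helper name `margin` are illustrative choices, while the percentages are the ones on the slide.

```python
# Cell percentages from the slide: (live status, archive status) -> % of URIs
cells = {
    ("live", "archived"): 29.3,
    ("live", "not_archived"): 34.0,
    ("not_live", "archived"): 18.3,
    ("not_live", "not_archived"): 18.4,
}


def margin(position, value):
    """Sum the cells whose key matches value at position (0 = live, 1 = archive)."""
    return round(sum(p for key, p in cells.items() if key[position] == value), 1)


print("live on Web:      ", margin(0, "live"))
print("not on live Web:  ", margin(0, "not_live"))
print("found in archive: ", margin(1, "archived"))
print("not in archive:   ", margin(1, "not_archived"))
print("lost forever:     ", cells[("not_live", "not_archived")])
print("at risk of loss:  ", cells[("live", "not_archived")])
```

Read this way, the 'lost' and 'at risk' figures are two cells of the same 'Not Found' row: together they make up the 52.4% of references with no archived copy.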

Page 27: Reference Rot and E-Theses: Threat and Remedy

Devising Remedy for Reference Rot in e-Theses

Page 28: Reference Rot and E-Theses: Threat and Remedy

Our General Approach

Having demonstrated problem exists & is severe:

• The Web changes over time: reference rot occurs (36.7%)

• Incidental archiving via routine web archiving initiatives delivers no more than 50:50 chance of success

• Seek pro-active ‘transactional archiving’ solutions

– focus on what is regarded by authors as important

• Thereby to remedy the integrity of the scholarly record

We aim to embed ‘solutions’ in existing tools & infrastructure

Page 29: Reference Rot and E-Theses: Threat and Remedy

Strategy for Making Remedy

a) Understand the preparation/publication workflow

– identifying where there can be productive intervention

b) Devise prototypes for pro-active archiving

– writing & implementing code!

c) Propose/test infrastructure for temporal referencing

– supporting & using the Memento protocol

We are embedding ‘solutions’ in existing tools & infrastructure

Page 30: Reference Rot and E-Theses: Threat and Remedy

Understanding 3 workflows: Rot or Remedy?

Identify the Actors

Extended length of stages in workflows magnify reference rot & affect

① Study -> Preparation -> (Review) -> Submission: Doctoral Student (& Supervisor)

② Post-Submission -> Examination -> (Revision) -> Award: Faculty, Examiners & Supervisor

③ Post-Award -> Deposit/Ingest -> Provide/Access -> Use: University & Library

Identify the best opportunities for Intervention to make Remedy

Page 31: Reference Rot and E-Theses: Threat and Remedy

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factoring the HTML link

3. HiberActive - enables repositories to ‘stop the rot’ via actively archiving those references in e-theses

LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel

UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz

‘Work in progress’ to effect Remedy


Page 32: Reference Rot and E-Theses: Threat and Remedy

1. Hiberlink Plug-in - to help authors and middle-folk (publishers/librarians) do the right thing:

– Zotero - used by authors to manage references

https://www.zotero.org/

– Open Journal Systems (OJS) - used by OA publishers

https://pkp.sfu.ca/ojs/

‘Work in progress’ to effect Remedy (1)

Page 33: Reference Rot and E-Theses: Threat and Remedy

Hiberlink Plug-in for Zotero

① Triggers archiving of referenced web content

② Returns Datetime URI for archived content

For use during preparation of thesis & before final submission, but also before deposit with Library (& maybe for repair by Library …)

Page 34: Reference Rot and E-Theses: Threat and Remedy

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factor the HTML link that is returned

‘Work in progress’ to effect Remedy (2)

a) Take simple URI - to French National Library (say)

b) Augment Link with a set of Datetime & location pairs
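One way to picture steps a) and b) is a helper that takes a plain link and adds the datetime and location of each archived copy as extra attributes. The slides do not show the markup, so the attribute names `data-versiondate` and `data-versionurl`, the space-separated encoding of several pairs, the function name `augment_link` and the example archive URI are all assumptions for illustration.

```python
from html import escape


def augment_link(href, text, snapshots):
    """Return an <a> element carrying (datetime, archived-copy URI) pairs.

    snapshots is a list of (datetime string, URI) pairs; the two attributes hold
    space-separated values in matching order (a hypothetical encoding).
    """
    dates = " ".join(when for when, _ in snapshots)
    locations = " ".join(where for _, where in snapshots)
    return (
        f'<a href="{escape(href, quote=True)}"'
        f' data-versiondate="{escape(dates, quote=True)}"'
        f' data-versionurl="{escape(locations, quote=True)}">'
        f"{escape(text)}</a>"
    )


link = augment_link(
    "http://www.bnf.fr/",
    "BnF",
    [("2014-06-01", "http://archive.example/20140601/http://www.bnf.fr/")],
)
print(link)
```

The original `href` is left untouched, so the link still works on the live Web; a reader or tool can fall back to the archived location when it does not.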

Page 35: Reference Rot and E-Theses: Threat and Remedy

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factoring the HTML link

First two approaches support ‘perfect scenario’:

• All authors archive all their cited URIs

• e.g. (but not exclusively) with Hiberlink / Zotero

3. HiberActive

– Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses

– A notification hub, a component for the infrastructure

• testing workflow with ResourceSync, CORE & external archive programme

‘Work in progress’ to effect Remedy (3)

Page 36: Reference Rot and E-Theses: Threat and Remedy

Next Steps: who wants to take this work forward? To ensure references in e-Theses don’t rot

• Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by student/authors & by libraries

a) Offer to be an early adopter for these Hiberlink remedies

• The Hiberlink Plug-in for Zotero / HiberActive

Email: [email protected] Subject: Hiberlink ETD

b) Amend ‘Guidance for ETD Lifecycle Management’

Page 37: Reference Rot and E-Theses: Threat and Remedy

Thank you, Questions welcome

http://hiberlink.org #hiberlink

Email: [email protected]


Page 38: Reference Rot and E-Theses: Threat and Remedy

Aside: We would all like to assume that our libraries are ensuring that online e-journal content is being kept safe.

But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves.

Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/

Page 39: Reference Rot and E-Theses: Threat and Remedy

Evidence from The Keepers Registry is worrying!

① Compare what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries etc) with all issued with ISSN

‘Ingest Ratio’ = titles being ingested by one or more Keeper / ‘online serials’ in ISSN Register = 23,268 / 136,965 [in March 2014] => 17%

* We do not know about 83% of e-serials having ISSN

* ‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%
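The two ratios are simple divisions over the same denominator, rounded to whole percentages. The Python check below reproduces them from the counts on the slide; the function and variable names are illustrative.

```python
def ratio_percent(numerator, denominator):
    """Return numerator / denominator as a whole-number percentage."""
    return round(100 * numerator / denominator)


ONLINE_SERIALS = 136_965  # 'online serials' in the ISSN Register, March 2014

ingest_ratio = ratio_percent(23_268, ONLINE_SERIALS)   # ingested by 1+ Keepers
keepsafe_ratio = ratio_percent(9_652, ONLINE_SERIALS)  # ingested by 3+ Keepers
unknown = 100 - ingest_ratio                            # not known to any Keeper

print(f"Ingest Ratio:   {ingest_ratio}%")
print(f"KeepSafe Ratio: {keepsafe_ratio}%")
print(f"Unknown fate:   {unknown}%")
```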

② Title Lists of 3 US research libraries (Columbia, Cornell & Duke), checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate

③ User-centric Evidence, usage logs for the UK OpenURL Router*

=> over two thirds, 68% (36,326 titles), held by none!