Reference Rot and E-Theses: Threat and Remedy

DESCRIPTION

Reference Rot and E-Theses: Threat and Remedy. Peter Burnhill, EDINA, University of Edinburgh, for the Hiberlink Team at University of Edinburgh & LANL Research Library. Centre for Service Delivery & Digital Expertise. Hiberlink, ETD2014, Leicester UK, July 25th 2014.

TRANSCRIPT

Reference Rot and E-Theses: Threat and Remedy

Hiberlink, ETD2014, Leicester UK, July 25th 2014

Funded by the Andrew W. Mellon Foundation

Peter Burnhill, EDINA, University of Edinburgh, for the Hiberlink Team at University of Edinburgh & LANL Research Library

Centre for Service Delivery & Digital Expertise

Overview

1. The Hiberlink Project & Reference Rot

2. Evidence of Threat of Reference Rot for the E-Thesis

• Our methods, data source & findings

3. Devising Remedy for Reference Rot in E-Thesis

• Proposals for intervention: plug-ins & infrastructural solutions

4. Next Steps: who (else) wants to take this work forward?

Reference Rot = Link Rot + Content Drift

“when links to web resources no longer point to what they once did”

Investigating Reference Rot in Web-Based Scholarly Communication

Link Rot

+ Content Drift: what is at the end of the URI has changed, or gone!

[Figure: http://dl00.org in 2000, 2004, 2005 and 2008]

(a) Dynamic content, as values on a webpage change over time

(b) Static content, but very different (often unrelated) web pages

An International Team at Work, funded by the Andrew W. Mellon Foundation

• Los Alamos National Laboratory:

Research Library: Martin Klein, (Rob Sanderson), Harihar Shankar, Herbert Van de Sompel

• University of Edinburgh:

Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou

EDINA * : Neil Mayo, Muriel Mewissen (Project Manager), Christine Rees, Tim Stickland, Richard Wincewicz, Peter Burnhill

Centre for Service Delivery & Digital Expertise


What we are doing in Hiberlink, 2013 - 2015

1. Creating evidence on extent of ‘Reference Rot’

– Main focus has been on references (& URIs) made in Journal Articles

• Includes work on reference rot in Supreme Court judgments with Harvard Law Library & Perma.cc

– ETD2014 is opportunity to look at Reference Rot & the e-Thesis

2. Understanding the preparation/publication workflow

– Identifying opportunity for productive intervention

3. Prototypes for pro-active archiving to enable remedy

– Embedding such ‘solutions’ in existing tools & infrastructure

4. Raising awareness & seeking collaborative actions … through events like this

Evidence on the Threat of Reference Rot for E-Theses

Retrieving thinking about the emerging e-Thesis in 1998

University Theses Online Group, 1994/99

Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in

‘E-Theses Developments in the UK’ 2003


Measuring the Extent of ‘Reference Rot’ in e-Theses

Data Source

• Looked for corpus of e-Theses for our study period of 1997 – 2012

• Interested only in Doctoral Theses/Dissertations

• NDLTD Union Catalogue

Basic Method

a) Define selection and use information in the metadata record

• Degree awarded (PhD etc); Department

• Date thesis was successfully defended

• Link to the full text of the Doctoral Thesis

b) Download selected e-Thesis from each Institution’s Repository

7,500 E-Theses Downloaded from 5 US Institutions

In passing: note decline in numbers indicates ‘lag’ in ingest/availability of e-Theses

Key Aspects of Methodology (Stage 1)

1. Convert those e-Theses from PDF into XML

• pdftohtml -xml

2. Locate the references & extract each and every URL

• Technical challenges: URL broken across a newline; underscore rendered as an image

• Use up to 15 regular expressions for matching; regard each match as a URI
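
The extraction step can be sketched in Python. This is only an illustration: the slides mention up to 15 regular expressions but do not give them, so the single pattern below and the rule for rejoining a URL wrapped across a line break are assumptions, not the project's actual expressions.

```python
import re

# Hypothetical pattern: an http(s) scheme followed by characters that are not
# whitespace, angle brackets, parentheses or square brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")

# Assumed repair rule: a line break directly after '/', '-', '_' or '.' inside
# a URL is treated as a wrap introduced by the PDF layout, and is removed.
WRAPPED = re.compile(r"(https?://\S*[/\-_.])\n(\S+)")

def extract_urls(text: str) -> list[str]:
    """Return the distinct URLs found in text, in order of first appearance."""
    repaired = text
    while True:  # repeat until stable, since a URL may wrap more than once
        joined = WRAPPED.sub(r"\1\2", repaired)
        if joined == repaired:
            break
        repaired = joined
    seen: dict[str, None] = {}
    for match in URL_PATTERN.finditer(repaired):
        url = match.group(0).rstrip(".,;:")  # drop trailing sentence punctuation
        seen.setdefault(url, None)
    return list(seen)
```

A rule this simple will sometimes join a URL to the first word of the following line, which is one reason a real pipeline would need several expressions and manual checks.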

UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou

Key Aspects of Methodology (Stage 2)

47,067 URIs were extracted

These were partitioned into two types:

i. 1,086 publisher sites, representing very many references to online articles: ‘the scholarly record’

• BTW, who does keep those articles in the Scholarly Record safe?

• Ask me for evidence on that!

ii. 45,981 URIs that linked to ‘the Web at large’

• to Web content required for scholarship

• inc. websites, software, blogs, videos, online debate etc

• to that which lacks ‘fixity’ and changes over time

Those c.46,000 are the focus for the Hiberlink Project
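
One way to picture the partition is a host-name lookup, sketched below in Python. The slides do not say how publisher sites were recognised, so the host list and the matching rule here are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical examples only; the slides do not list the publisher sites used.
PUBLISHER_HOSTS = {"publisher.example.org", "journals.example.com"}

def partition_uris(uris, publisher_hosts=PUBLISHER_HOSTS):
    """Split URIs into (scholarly_record, web_at_large) by host name."""
    scholarly, web_at_large = [], []
    for uri in uris:
        host = (urlsplit(uri).hostname or "").lower()
        if host.startswith("www."):  # treat 'www.' as equivalent to the bare host
            host = host[4:]
        (scholarly if host in publisher_hosts else web_at_large).append(uri)
    return scholarly, web_at_large
```

The counts on the slide are consistent with such a two-way split: 47,067 extracted URIs less 1,086 leaves 45,981.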

Increase in Linking to ‘Web-at-large’ Resources, 1997-2010: beyond the e-journal, to that which lacks ‘fixity’ and changes over time

[Chart: URIs, by Year Thesis Defended (%), 1997 - 2010. Chart label: 50%]

But Wide ‘Between-Thesis’ Variation in Number of Web Links

• 10% of Theses have 25 or more URIs

• Median (average) increases from 4 to 5.5

• 75% have 2+ URIs per Thesis

Focus on e-Theses defended from 2003

[Chart: box plots of medians (averages) & quartiles, with ‘outliers’. Axis: Count (Log10). Labelled value: 1373]

Methodology (Stage 3): to discover answer to 2 questions

i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?

• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
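
The liveness rule (follow up to 50 redirects, record the HTTP transaction chain, count any 2XX as ‘live’) can be sketched in Python as follows. The `fetch` argument is a hypothetical helper, supplied by the caller, that issues one request without following redirects; it is not part of the Hiberlink tooling described on the slides.

```python
from typing import Callable, Optional
from urllib.parse import urljoin

# fetch(uri) is an assumed helper: one HTTP request, no redirect following,
# returning (status_code, value of the Location header or None).
Fetch = Callable[[str], tuple[int, Optional[str]]]

def check_live(uri: str, fetch: Fetch, max_redirects: int = 50):
    """Follow redirects from uri; return (is_live, chain of (uri, status))."""
    chain = []
    current = uri
    for _ in range(max_redirects + 1):  # the first request plus the redirects
        status, location = fetch(current)
        chain.append((current, status))
        if 200 <= status < 300:
            return True, chain
        if 300 <= status < 400 and location:
            current = urljoin(current, location)  # Location may be relative
            continue
        return False, chain  # 4XX, 5XX, or a redirect without a target
    return False, chain  # redirect limit exhausted
```

Passing the fetcher in keeps the rule itself testable without network access; a real run would wrap an HTTP client of choice.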


ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?

Memento: a prior version, what the Original Resource was like at some time in the past.


• Archival check carried out in June 2014, using an installed version of the Memento tool developed by LANL: http://www.mementoweb.org/guide/quick-intro/

• A ‘Datetime’ version at or near the date the Thesis was defended

• Searching across several archives (not just Internet Archive)

Approach first used in pilot work at LANL; UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
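
The ‘Datetime’ lookup can be illustrated with two small Python helpers. The Memento protocol negotiates on an `Accept-Datetime` request header carrying an HTTP date; the selection function and the example data shapes below are assumptions made for illustration, not the installed LANL tool.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def accept_datetime(defended: datetime) -> str:
    """Format a timezone-aware datetime as an HTTP date for Accept-Datetime."""
    return format_datetime(defended.astimezone(timezone.utc), usegmt=True)

def nearest_memento(mementos, defended):
    """Pick the (datetime, uri) pair whose datetime is closest to defended.

    mementos is an assumed list of (aware datetime, archive URI) pairs, for
    example gathered from several web archives; returns None if it is empty.
    """
    if not mementos:
        return None
    return min(mementos, key=lambda m: abs(m[0] - defended))
```

Choosing the closest capture on either side of the defence date is one possible reading of ‘at or near’; a stricter study might accept only captures within a fixed window.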

A Measure of Reference Rot: Are those references available? [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

Less than two-thirds of those links lead to live content

         Live on Web    Not Found on ‘Live Web’    All
Count    29,122         16,860                     45,982
%        63.3           36.7                       100

1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web subject to ‘rot’

After up to 50 redirects
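
The percentages follow directly from the counts on the slide, as this short Python check shows (the function name is just for illustration):

```python
def share(count: int, total: int) -> float:
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)

live, not_live = 29_122, 16_860   # counts from the slide
total = live + not_live           # 45,982

live_pct = share(live, total)
rot_pct = share(not_live, total)
```

Here `live_pct` comes out at 63.3 and `rot_pct` at 36.7, matching the slide.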

Confirm?: the two-thirds ‘Live’ URI ratio is the same across the ‘Big 3’, 2003-2010

=> ‘On average’ one third of the links in an e-Thesis are ‘rotten’

The older the citation, the less likely to be still on the live Web [excluding 0s&1s: a few theses are unaffected; a few are ruined]

We can’t stop that process of rot: Web content changes over time; Reference Rot is an inevitable function of time

Number of months elapsed from Date Thesis Defended until date archives checked (June 2014)

Searching for ‘Datetime’ Mementos of content in ‘Archived Web’ [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

%                       Live on Web    Not found on ‘Live Web’    All
Found to be Archived                                              47.6
Not Found                                                         52.4
All                                                               100

There seems a 50:50 chance that referenced content is in the ‘Archived Web’.

Some content is being ‘co-incidentally harvested’ by routine web archiving.

=> half of those references are at ‘risk of loss’

50:50 chance that ‘DateTime Reference’ is ‘Incidentally Archived’

‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis)

We can improve upon this ‘50:50 chance’ by pro-actively archiving what we cite

We already have ‘Lost Content’ for References to Web [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

%                       Live on Web    Not found on ‘Live Web’    All
Found to be Archived    29.3           18.3                       47.6
Not Found               34.0           18.4                       52.4
All                     63.3           36.7                       100

18.4% ‘not live & not found in archive’: judged to be lost forever

34% ‘live’ & ‘not in archive’: is at risk of loss

NB: The 34% ‘at risk’ could be saved by pro-active archiving
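
The two-by-two table can be read as a classification of each reference by two yes/no answers. A Python sketch follows; the labels for the two ‘archived’ cells are my own wording, since the slides name only the ‘lost’ and ‘at risk’ cells.

```python
def classify(live: bool, archived: bool) -> str:
    """Label a reference by its live-web and archived-web status."""
    if live and archived:
        return "live and archived"            # label assumed, not on the slide
    if live:
        return "at risk of loss"              # live but not in any archive
    if archived:
        return "rotten but archived"          # label assumed, not on the slide
    return "lost"                             # not live, not archived

# Cell percentages from the slide, keyed by (live, archived).
cells = {(True, True): 29.3, (False, True): 18.3,
         (True, False): 34.0, (False, False): 18.4}

archived_total = round(cells[(True, True)] + cells[(False, True)], 1)
live_total = round(cells[(True, True)] + cells[(True, False)], 1)
by_label = {classify(l, a): pct for (l, a), pct in cells.items()}
```

The marginals recompute to 47.6 (archived) and 63.3 (live), and `by_label` gives 34.0 for "at risk of loss" and 18.4 for "lost".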

Devising Remedy for Reference Rot in e-Theses

Having demonstrated problem exists & is severe

• The Web changes over time: reference rot occurs (36.7%)

• Incidental archiving by routine web archiving initiatives delivers no more than a 50:50 chance of success

• Seek pro-active ‘transactional archiving’ solutions

– focus on what is regarded by authors as important

• Thereby to remedy the integrity of the scholarly record

We aim to embed ‘solutions’ in existing tools & infrastructure

Our General Approach

a) Understand the preparation/publication workflow – identifying where there can be productive intervention

b) Devise prototypes for pro-active archiving – writing & implementing code!

c) Propose/test infrastructure for temporal referencing – supporting & using the Memento protocol

We are embedding ‘solutions’ in existing tools & infrastructure

Strategy for Making Remedy

Understanding 3 workflows: Rot or Remedy?

Identify the Actors

Extended length of stages in workflows magnify reference rot & affect

① Study -> Preparation -> (Review) -> Submission: Doctoral Student (& Supervisor)

② Post-Submission -> Examination -> (Revision) -> Award: Faculty, Examiners & Supervisor

③ Post-Award -> Deposit/Ingest -> Provide/Access -> Use: University & Library

Identify the best opportunities for Intervention to make Remedy

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factoring the HTML link

3. HiberActive - enables repositories to ‘stop the rot’ via actively archiving those references in e-theses

LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel

UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz

‘Work in progress’ to effect Remedy


1. Hiberlink Plug-in - to help authors and middle-folk (publishers/librarians) do the right thing:

– Zotero - used by authors to manage references

https://www.zotero.org/

– Open Journal System (OJS) - used by OA publishers

https://pkp.sfu.ca/ojs/

‘Work in progress’ to effect Remedy (1)

For use during preparation of thesis & before final submission but also

before deposit with Library (& maybe for repair by Library …)

Hiberlink Plug-in for Zotero ① Triggers archiving of referenced web content

② Returns Datetime URI for archived content

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factor the HTML link that is returned

‘Work in progress’ to effect Remedy (2)

a) Take simple URI - to French National Library (say)

b) Augment Link with a set of Datetime & location pairs
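
A rough Python sketch of what an augmented link might look like. The slides do not show the markup, so the attribute names `data-versiondate` and `data-versionurl`, the space-separated encoding of several pairs, and the archive URI are all assumptions made for illustration.

```python
from html import escape

def augmented_link(href: str, text: str, versions) -> str:
    """Build an <a> element carrying (datetime, archive URI) pairs.

    versions is a list of (date string, archive URI) pairs. Dates and URIs are
    written to two attributes in matching order, separated by spaces.
    """
    dates = " ".join(date for date, _ in versions)
    urls = " ".join(url for _, url in versions)
    return (
        f'<a href="{escape(href)}"'
        f' data-versiondate="{escape(dates)}"'   # assumed attribute name
        f' data-versionurl="{escape(urls)}"'     # assumed attribute name
        f">{escape(text)}</a>"
    )

link = augmented_link(
    "http://www.bnf.fr/",
    "French National Library",
    [("2014-06-01", "http://archive.example/20140601/http://www.bnf.fr/")],
)
```

The original `href` is left untouched, so the link keeps working on the live Web while the extra attributes record when and where an archived copy was made.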

1. Hiberlink Plug-in - to enable pro-active archiving

2. Missing Link - re-factoring the HTML link

First two approaches support ‘perfect scenario’:

• All authors archive all their cited URIs

• e.g. (but not exclusively) with Hiberlink / Zotero

3. HiberActive

– Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses

– A notification hub, a component for the infrastructure

• testing workflow with ResourceSync, CORE & external archive programme

‘Work in progress’ to effect Remedy (3)

Next Steps: who wants to take this work forward?

to ensure references in e-Theses don’t rot

• Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by student/authors & by libraries

a) Offer to be an early adopter for these Hiberlink remedies

• The Hiberlink Plug-in for Zotero / HiberActive

Email: edina@ed.ac.uk Subject: Hiberlink ETD

b) Amend ‘Guidance for ETD Lifecycle Management’

Thank you, Questions welcome

http://hiberlink.org #hiberlink

Email: edina@ed.ac.uk

Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/

But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves.

Aside: We would all like to assume that our libraries are ensuring that online e-journal content is being kept safe

Evidence from The Keepers Registry is worrying!

① Compare what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries etc) with all issued with ISSN

‘Ingest Ratio’ = titles being ingested by one or more Keeper / ‘online serials’ in ISSN Register = 23,268 / 136,965 [in March 2014] => 17%

* We do not know about 83% of e-serials having ISSN

* ‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%

② Title Lists of 3 US research libraries (Columbia, Cornell & Duke), checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate

③ User-centric Evidence, usage logs for the UK OpenURL Router*

=> over two thirds, 68% (36,326 titles), held by none!
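
The Keepers Registry ratios in ① can be recomputed from the figures given, as in this short Python check (the function name is illustrative):

```python
def ratio_percent(kept: int, registered: int) -> int:
    """Share of registered titles that are kept, as a whole percentage."""
    return round(100 * kept / registered)

ONLINE_SERIALS = 136_965  # 'online serials' in the ISSN Register, March 2014

ingest_ratio = ratio_percent(23_268, ONLINE_SERIALS)   # one or more Keepers
keepsafe_ratio = ratio_percent(9_652, ONLINE_SERIALS)  # three or more Keepers
unknown = 100 - ingest_ratio
```

This gives 17 and 7 for the two ratios, and 83 for the share of e-serials whose archival status is unknown.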
