Web archiving in the UK: why, by whom, for whom?
Dr Peter Webster
Webster Research and Consulting
@pj_webster / @WebsterRandC
The web its own archive?
Open UK Web Archive 2004-13 comparison. @anjacks0n
http://britishlibrary.typepad.co.uk/webarchive/2014/10/what-is-still-on-the-web-after-10-years-of-archiving-.html
Disappearing predictably
Disappearing unpredictably
What is web archiving?
“deliberate and purposive preservation of web material” (Brügger, 2011)
• micro or macro
• element, page, website, web sphere, whole
• harvesting, screen capture, file delivery
• public, restricted, or no access
[ archive.org ]
National libraries
• 16 of 28 member states within EU
• Sweden the first (1996)
• US (Library of Congress), Canada, Australia,
New Zealand, Singapore, Japan, Chile
• some with legal deposit provision: Denmark (2005), France (2006), UK (2013)
Legal deposit web archiving: characteristics
• broad domain crawl, plus selective
• definition of the nation varies
• types of content included varies
• access restrictions
• indemnity against legal risks
Selective harvesting
• in the absence of non-print legal deposit (NPLD), based on permissions
• part of the case for obtaining NPLD law
• key resources, e.g. government, media
• events: elections, Olympics, Eurovision
• themes: political extremism, climate change
Why archive your own web?
• part of orderly management of closure
• fulfilment of legal obligation
• management of risk
• part of the corporate record
• as a service for future scholars
Government records
A lost archive?
Web archives in the UK
Archive | Temporal scope | Content scope | Access
Open UKWA | 2004-present | Selective | Online
Legal Deposit UKWA | 2013-present | Comprehensive (for UK) | Onsite
JISC UK Domain Dataset | 1996-2013 | Comprehensive (for .uk) | Index only
UK Government Web Archive | 1996-present | UK government | Online
Parliamentary Web Archive | 2009-present | UK parliament | Online
Univ. of Oxford | 2011-present | University sites | Online
Tricky areas
• IPR (including third parties)
• personal data
• the right to be forgotten
• database-driven content
• embedded streaming media
Outsourcing providers
Not-for-profit
• Archive-It (part of Internet Archive)
• Internet Memory Research
Commercial
• Hanzo Archives [UK]
• OIA (Offline Web Archive) [Germany]
• Pagefreezer [Canada/Netherlands]
Ways to use the archived web
• URL search -> single page
• Full-text search -> single page
• Visualisation -> trend -> page
Changing aesthetic
gov.ie, captured by archive.org, 15 August 2000
Full-text search
webarchive.org.uk/shine - https://github.com/ukwa/shine/
Visualising trends: ngram
Ways to use the archived web
• URL search -> single page
• Full-text search -> single page
• Visualisation -> trend -> page
• Direct access to WARC
• Derived datasets
• API access
Derived datasets from the BL
From JISC UK Web Domain Dataset (1996-2010)
• File format profile
• Geo-index
• Crawled URL Index (CDX)
• Host Link Graph
Public domain at data.webarchive.org.uk
[ http://www.webarchive.org.uk/ukwa/visualisation/ukwa.ds.2/fmt ]
[Wikimedia Commons, CC BY SA 2.0, by Brian (of Toronto)]
A media firestorm
[https://web.archive.org/web/20080211003812/http://www.newsoftheworld.co.uk/1002_sharia.shtml]
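An archived page address of this kind carries its own capture time: the 14 digits after /web/ are a YYYYMMDDhhmmss timestamp, followed by the original URL. A minimal sketch of pulling the two apart, where the function name parse_wayback_url and the regular expression are illustrative assumptions rather than part of any archive's tooling:

```python
import re
from datetime import datetime

# Assumed shape: https://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(url):
    """Return (capture datetime, original URL) for an archived page address."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognised archived page address: {url}")
    # The timestamp is year, month, day, hour, minute, second with no separators.
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)


captured, original = parse_wayback_url(
    "https://web.archive.org/web/20080211003812/"
    "http://www.newsoftheworld.co.uk/1002_sharia.shtml"
)
print(captured.isoformat())  # 2008-02-11T00:38:12
print(original)              # http://www.newsoftheworld.co.uk/1002_sharia.shtml
```

The same split works for any citation in this form, which makes it easy to sort a list of archived pages by capture date.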
UK Host Link Graph (1996-2010)
2008 | newsimg.bbc.co.uk | youtube.com | 45
2008 | archbishopofyork.org.uk | flickr.com | 1
2002 | secularism.org.uk | geocities.com | 1
Public domain at: data.webarchive.org.uk
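Rows like the three above can be read with a few lines of code. The sketch below assumes the four pipe-separated columns are year, linking host, linked-to host, and number of links; that reading, and the helper names parse_row and links_per_year, are assumptions made for illustration and should be checked against the dataset's own documentation.

```python
from collections import Counter

# The three sample rows shown on the slide.
ROWS = """\
2008 | newsimg.bbc.co.uk | youtube.com | 45
2008 | archbishopofyork.org.uk | flickr.com | 1
2002 | secularism.org.uk | geocities.com | 1
"""


def parse_row(line):
    """Split one row into (year, source host, target host, link count)."""
    year, source, target, count = (field.strip() for field in line.split("|"))
    return int(year), source, target, int(count)


def links_per_year(lines):
    """Sum the link counts for each year, skipping blank lines."""
    totals = Counter()
    for line in lines:
        if not line.strip():
            continue
        year, _source, _target, count = parse_row(line)
        totals[year] += count
    return totals


totals = links_per_year(ROWS.splitlines())
for year in sorted(totals):
    print(year, totals[year])
```

Grouping by target host instead of year would show which external sites UK hosts linked to most often, the kind of trend the link graph is meant to expose.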
Questions? Peter Webster
@pj_webster / @WebsterRandC
websterresearchconsulting.com