Web Archiving Profile - WADL 2013

Web Archiving Profile Overview Ahmed AlSum PhD Candidate Old Dominion University Web Archiving and Digital Libraries (WADL 2013) A Workshop at JCDL 2013 July 25-26, 2013 Indianapolis, Indiana, USA


TRANSCRIPT

Page 1: Web Archiving Profile - WADL 2013

Web Archiving Profile Overview

Ahmed AlSum
PhD Candidate
Old Dominion University

Web Archiving and Digital Libraries (WADL 2013), a workshop at JCDL 2013
July 25-26, 2013
Indianapolis, Indiana, USA

Page 2: Web Archiving Profile - WADL 2013

What is the problem?
• Web archives are black boxes: they are accessible only through a search text box (full-text search or URI lookup).
• We need to profile (characterize) web archives around the world by properties such as:
  o Age
  o Top-level domains
  o Languages
  o Growth rate

Page 3: Web Archiving Profile - WADL 2013

Why
• To optimize query routing for the Memento Aggregator.
• To determine which parts of the web are missing from the archives.
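As a rough illustration of the query-routing goal, an aggregator could consult a per-archive profile before forwarding a request. The sketch below assumes a hypothetical profile that maps each archive to the share of sampled URIs per TLD for which it held copies. The archive names, the numbers, and the `route` function are illustrative assumptions, not results or methods from this work.

```python
from urllib.parse import urlsplit

# Hypothetical profile: archive -> {TLD: fraction of sampled URIs found}.
# The names and numbers are made up for illustration only.
PROFILE = {
    "archive-a": {"com": 0.80, "org": 0.75, "uk": 0.40},
    "archive-b": {"uk": 0.90, "com": 0.05},
    "archive-c": {"pt": 0.85, "com": 0.02},
}

def tld_of(uri):
    """Return the last label of the URI's hostname, lowercased."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()

def route(uri, profile, top_k=2, min_score=0.1):
    """Pick up to top_k archives whose score for the URI's TLD is at least min_score."""
    tld = tld_of(uri)
    scored = [(scores.get(tld, 0.0), name) for name, scores in profile.items()]
    # Highest score first; ties are broken alphabetically by archive name.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked[:top_k] if score >= min_score]
```

With this made-up profile, `route("http://www.example.co.uk/page", PROFILE)` would query only the two archives with the strongest `.uk` scores instead of broadcasting the request to every archive.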

Page 4: Web Archiving Profile - WADL 2013

Who

Archive                         Full text   URI-lookup
Internet Archive                -           x
Library of Congress             -           x
Icelandic Web Archive           -           x
Library and Archives Canada     x           x
British Library                 x           x
UK National Library             x           x
Portuguese Web Archive          x           x
Web Archive of Catalonia        x           x
Croatian Web Archive            x           x
Archive of the Czech Web        x           x
National Taiwan University      x           x
Archive-It                      x           x

Page 5: Web Archiving Profile - WADL 2013

How
• Sample URIs from different sources
• Retrieve the TimeMap from each archive
• Analyze the TimeMaps
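The "analyze the TimeMaps" step can be sketched as follows. This is a minimal example, assuming a TimeMap has already been fetched as text in the `application/link-format` serialization described by the Memento protocol and that every parameter value is quoted. The `SAMPLE` TimeMap and the `archive.example` host are invented for illustration, and the helper names are hypothetical.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime

# One link-format entry: <uri>; param="value"; param="value" ...
# Assumes every parameter value is double-quoted.
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')

def parse_timemap(text):
    """Return (uri, datetime) pairs for every entry whose rel includes 'memento'."""
    mementos = []
    for uri, params in ENTRY.findall(text):
        attrs = dict(PARAM.findall(params))
        rels = attrs.get("rel", "").split()
        if "memento" in rels and "datetime" in attrs:
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return mementos

def mementos_per_year(mementos):
    """Count mementos by capture year."""
    return Counter(dt.year for _, dt in mementos)

# Invented TimeMap used only to exercise the parser.
SAMPLE = '''<http://example.com/>; rel="original",
<http://archive.example/timemap/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/20050101000000/http://example.com/>; rel="first memento"; datetime="Sat, 01 Jan 2005 00:00:00 GMT",
<http://archive.example/20130601000000/http://example.com/>; rel="last memento"; datetime="Sat, 01 Jun 2013 00:00:00 GMT"'''
```

Calling `mementos_per_year(parse_timemap(SAMPLE))` gives per-year counts, from which properties such as an archive's age (earliest capture) or growth over time could be derived.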

Page 6: Web Archiving Profile - WADL 2013

URI Sample Sources

Web
1. DMOZ – Random sample
2. DMOZ – TLD: 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs)
3. DMOZ – Languages: 100 URIs for each language (24 languages)

Web Archives
4. Top 1-grams from Bing
5. Top 1000 query terms from Yahoo in 9 languages

User requests
6. IA Wayback Machine log files
7. Memento aggregator log files

* We used hostnames only
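The per-TLD sample (source 2) combined with the hostnames-only rule can be sketched roughly as below. This is an illustration under stated assumptions, not the procedure actually used: the input is taken to be any iterable of URIs, the TLD is taken to be the last hostname label, and rounding up with a minimum of one host per TLD is a choice made here for the example. The function names are hypothetical.

```python
import math
import random
from collections import defaultdict
from urllib.parse import urlsplit

def hostname(uri):
    """Reduce a URI to its lowercased hostname ('' if it has none)."""
    return (urlsplit(uri).hostname or "").lower()

def sample_by_tld(uris, fraction=0.02, seed=0):
    """Draw `fraction` of the distinct hostnames in each TLD.

    Assumption for this sketch: round up, and keep at least one host per TLD.
    """
    by_tld = defaultdict(set)
    for uri in uris:
        host = hostname(uri)
        if "." in host:
            by_tld[host.rsplit(".", 1)[-1]].add(host)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for tld, hosts in sorted(by_tld.items()):
        k = max(1, math.ceil(fraction * len(hosts)))
        sample[tld] = rng.sample(sorted(hosts), k)
    return sample
```

Because URIs are reduced to hostnames before sampling, several pages from the same site count once, which keeps large sites from dominating a TLD's sample.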

Page 7: Web Archiving Profile - WADL 2013

General Coverage

Page 8: Web Archiving Profile - WADL 2013

Web Archive Growth Rate

Page 9: Web Archiving Profile - WADL 2013

TLD Sample Coverage

Page 10: Web Archiving Profile - WADL 2013

TLD per archive (TLD Sample)

Page 11: Web Archiving Profile - WADL 2013

TLD per archive (Fulltext search)

Page 12: Web Archiving Profile - WADL 2013

TLD across archives

Page 13: Web Archiving Profile - WADL 2013

Language distribution per archive

Page 14: Web Archiving Profile - WADL 2013

Query Routing Evaluation