web archiving profile - wadl 2013
Web Archiving Profile
Overview
Ahmed AlSum
PhD Candidate
Old Dominion University
Web Archiving and Digital Libraries (WADL 2013)
A Workshop at JCDL 2013
July 25-26, 2013
Indianapolis, Indiana, USA
What is the problem?
• Web archives are black boxes: they are accessible only through a search box (full-text search or URI lookup).
• We need to profile (characterize) the web archives around the world by attributes such as:
  o Age
  o Top-level domains
  o Languages
  o Growth rate
Why
• To optimize query routing for the Memento Aggregator.
• To determine which parts of the web are missing from the archives.
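As an illustration of how a profile could drive query routing, here is a minimal sketch. It is not the Memento Aggregator's actual logic: the `route` function, the profile layout (archive name mapped to per-TLD coverage fractions), the archive names, and the numbers are all hypothetical and chosen only for the example.

```python
from urllib.parse import urlsplit

# Hypothetical profile: archive name -> {TLD: fraction of sampled URIs found}.
# Names and numbers are invented for illustration, not measured results.
EXAMPLE_PROFILE = {
    "archive_a": {"com": 0.80, "org": 0.75, "is": 0.30},
    "archive_b": {"is": 0.90, "com": 0.05},
    "archive_c": {"pt": 0.85, "com": 0.10},
}

def route(uri, profile, top_k=2):
    """Return up to top_k archives most likely to hold the URI, judged by TLD."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    scored = [(coverage.get(tld, 0.0), name) for name, coverage in profile.items()]
    # Drop archives with no recorded coverage; rank by coverage, then by name.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:top_k]]

selected = route("http://example.is/page", EXAMPLE_PROFILE)
```

With a profile like this, the aggregator would query only the selected archives instead of all of them; the trade-off is that a stale or coarse profile can skip an archive that does hold the page.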
Who
Archive                          Full text   URI-lookup
Internet Archive                                 x
Library of Congress                              x
Icelandic Web Archive                            x
Library and Archives Canada          x           x
British Library                      x           x
UK National Library                  x           x
Portuguese Web Archive               x           x
Web Archive of Catalonia             x           x
Croatian Web Archive                 x           x
Archive of the Czech Web             x           x
National Taiwan University           x           x
Archive-It                           x           x
How
• Sample URIs from different sources
• Retrieve the TimeMap from each archive
• Analyze the TimeMaps
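The "analyze the TimeMaps" step can be sketched as follows, assuming the TimeMap has already been retrieved as text in the link format used by Memento (application/link-format). The function names and the sample TimeMap (with its example.com and archive.example.org URIs) are illustrative assumptions, not the tooling used in this study.

```python
import re
from email.utils import parsedate_to_datetime

# One link entry: <uri> followed by ;key="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')

def memento_datetimes(timemap_text):
    """Return the sorted capture datetimes of all mementos in a TimeMap."""
    datetimes = []
    for _uri, raw_attrs in LINK_RE.findall(timemap_text):
        attrs = dict(ATTR_RE.findall(raw_attrs))
        rels = attrs.get("rel", "").split()
        # rel can be "memento", "first memento", "last memento", ...
        if "memento" in rels and "datetime" in attrs:
            datetimes.append(parsedate_to_datetime(attrs["datetime"]))
    return sorted(datetimes)

def summarize(timemap_text):
    """Count mementos and report the earliest and latest capture."""
    dts = memento_datetimes(timemap_text)
    if not dts:
        return {"count": 0, "first": None, "last": None}
    return {"count": len(dts), "first": dts[0], "last": dts[-1]}

# Made-up TimeMap for illustration only.
SAMPLE_TIMEMAP = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example.org/web/20050101000000/http://example.com/>'
    '; rel="first memento"; datetime="Sat, 01 Jan 2005 00:00:00 GMT",\n'
    '<http://archive.example.org/web/20130110000000/http://example.com/>'
    '; rel="last memento"; datetime="Thu, 10 Jan 2013 00:00:00 GMT"\n'
)

summary = summarize(SAMPLE_TIMEMAP)
```

The memento count and the first capture date per URI are the raw ingredients for per-archive coverage, age, and growth-rate figures.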
URI Sample Sources
Web
1. DMOZ – Random sample
2. DMOZ – TLD: 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs)
3. DMOZ – Languages: 100 URIs for each language (24 languages)
Web Archives
4. Top 1-grams from Bing
5. Top 1000 query terms by Yahoo in 9 languages
User requests
6. IA Wayback Machine log files
7. Memento aggregator log files
* We used hostnames only
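A rough sketch of the TLD sample and the hostnames-only reduction is given below. It assumes a flat list of URIs as input; the helper names, the fixed random seed, the rounding, and the minimum of one hostname per TLD are choices made for this example rather than details stated in the slides.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit

def hostname(uri):
    """Reduce a URI to its lowercase hostname ('' if it has none)."""
    return (urlsplit(uri).hostname or "").lower()

def tld(host):
    """Return the last dot-separated label of a hostname ('' if no dot)."""
    return host.rsplit(".", 1)[-1] if "." in host else ""

def sample_by_tld(uris, fraction=0.02, seed=0):
    """Draw a fixed fraction of distinct hostnames from each TLD."""
    by_tld = defaultdict(set)
    for uri in uris:
        host = hostname(uri)
        label = tld(host)
        if label:
            by_tld[label].add(host)
    rng = random.Random(seed)  # seeded so the sample is repeatable
    sample = {}
    for label, hosts in by_tld.items():
        # Assumption: round to the nearest integer, never fewer than one host.
        k = max(1, round(len(hosts) * fraction))
        sample[label] = rng.sample(sorted(hosts), k)
    return sample

# Synthetic input for illustration: 100 .com hosts and one .jp host.
uris = [f"http://site{i}.example.com/page" for i in range(100)]
uris.append("http://www.example.co.jp/")
tld_sample = sample_by_tld(uris)
```

Sampling per TLD rather than over the whole list keeps small TLDs represented, which a uniform random sample dominated by .com would tend to miss.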
General Coverage
Web Archive Growth Rate
TLD Sample Coverage
TLD per archive (TLD Sample)
TLD per archive (Fulltext search)
TLD across archives
Language distribution per archive
Query Routing Evaluation