web archiving profile - wadl 2013

Web Archiving Profile

Overview

Ahmed AlSum
PhD Candidate

Old Dominion University

Web Archiving and Digital Libraries (WADL 2013) A Workshop at JCDL 2013

July 25-26, 2013
Indianapolis, Indiana, USA

What is the problem?
• Web archives are a black box: they are accessible only through a text-box search (full-text or URI-lookup).
• We need to profile/characterize the web archives around the world by properties such as:
o Age
o Top-level domains
o Languages
o Growth rate

Why
• To optimize the query routing for the Memento Aggregator.
• To determine the missing parts of the web.

Who

Web archive                    Full text   URI-lookup
Internet Archive                               x
Library of Congress                            x
Icelandic Web Archive                          x
Library and Archives Canada        x           x
British Library                    x           x
UK National Library                x           x
Portuguese Web Archive             x           x
Web Archive of Catalonia           x           x
Croatian Web Archive               x           x
Archive of the Czech Web           x           x
National Taiwan University         x           x
Archive-It                         x           x

How
• Sample URIs from different sources
• Retrieve the TimeMap from each archive
• Analyze the TimeMaps
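The "retrieve, then analyze" steps above can be sketched in Python. This is only an illustration, not the code used in the study: it assumes each TimeMap has already been downloaded as text in the Memento link format (`application/link-format`), and the names `parse_timemap`, `summarize`, and the `archive.example` URIs are hypothetical.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime

# One link-format entry: <uri>; attr="value"; attr="value"
ENTRY_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return (memento_uri, datetime) pairs found in a link-format TimeMap."""
    mementos = []
    for uri, attr_text in ENTRY_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        rels = attrs.get("rel", "").split()
        # rel may hold several values, e.g. "first memento"
        if "memento" in rels and "datetime" in attrs:
            mementos.append((uri, parsedate_to_datetime(attrs["datetime"])))
    return mementos


def summarize(mementos):
    """Count mementos and report the first/last capture and captures per year."""
    dates = sorted(dt for _, dt in mementos)
    if not dates:
        return {"count": 0, "first": None, "last": None, "per_year": {}}
    return {
        "count": len(dates),
        "first": dates[0],   # earliest capture: a proxy for archive age
        "last": dates[-1],
        "per_year": dict(Counter(dt.year for dt in dates)),  # growth over time
    }


# Made-up TimeMap for illustration only.
SAMPLE = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example/timemap/link/http://example.com/>; rel="self"; '
    'type="application/link-format",\n'
    '<http://archive.example/20050301120000/http://example.com/>; '
    'rel="first memento"; datetime="Tue, 01 Mar 2005 12:00:00 GMT",\n'
    '<http://archive.example/20050715080000/http://example.com/>; '
    'rel="memento"; datetime="Fri, 15 Jul 2005 08:00:00 GMT",\n'
    '<http://archive.example/20120102030405/http://example.com/>; '
    'rel="last memento"; datetime="Mon, 02 Jan 2012 03:04:05 GMT"\n'
)

if __name__ == "__main__":
    print(summarize(parse_timemap(SAMPLE)))
```

Aggregating such summaries over all sampled URIs for one archive would give a rough picture of its age and growth rate.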

URI Sample Sources

Web
1. DMOZ – Random sample
2. DMOZ – TLD: 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs)
3. DMOZ – Languages: 100 URIs for each language (24 languages)

Web Archives
4. Top 1-grams from Bing
5. Top 1000 query terms by Yahoo in 9 languages

User requests
6. IA Wayback Machine log files
7. Memento aggregator log files

* We used hostnames only
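As a rough illustration of the per-TLD sample (source 2) combined with the hostnames-only rule, the sketch below groups hostnames by TLD and draws a fixed fraction from each group. It is a sketch under stated assumptions, not the procedure used for these slides: the function names, the fixed seed, and the choice to keep at least one hostname per TLD are all hypothetical.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def hostname(uri):
    """Lower-cased hostname of a URI (the URI must include a scheme)."""
    return (urlsplit(uri).hostname or "").lower()


def tld_of(host):
    """Last dot-separated label of a hostname, or '' if there is none."""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def sample_by_tld(uris, fraction=0.02, seed=0):
    """Draw `fraction` of the distinct hostnames within each TLD."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    groups = defaultdict(set)
    for uri in uris:
        host = hostname(uri)
        tld = tld_of(host)
        if tld:
            groups[tld].add(host)  # hostnames only: paths are discarded
    sample = {}
    for tld, hosts in groups.items():
        # Assumption: keep at least one hostname for very small TLDs.
        k = max(1, round(len(hosts) * fraction))
        sample[tld] = rng.sample(sorted(hosts), k)
    return sample


if __name__ == "__main__":
    uris = [f"http://site{i}.com/page" for i in range(100)]
    uris += ["http://example.jp/a", "http://example.jp/b"]
    for tld, hosts in sample_by_tld(uris).items():
        print(tld, hosts)
```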

General Coverage

Web Archive Growth Rate

TLD Sample Coverage

TLD per archive (TLD Sample)

TLD per archive (Fulltext search)

TLD across archives

Language distribution per archive

Query Routing Evaluation
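One way a profile could drive query routing is sketched below: the aggregator asks only the archives whose profile shows coverage for the TLD of the requested URI. Everything here is hypothetical, including the archive names, the coverage numbers, and the `route` function; it is not the routing method evaluated in this talk.

```python
from urllib.parse import urlsplit

# Hypothetical profile: archive -> TLD -> fraction of sampled URIs covered.
PROFILE = {
    "archive_a": {"com": 0.80, "uk": 0.40, "pt": 0.10},
    "archive_b": {"uk": 0.90, "com": 0.05},
    "archive_c": {"pt": 0.85},
}


def route(uri, profile, k=2):
    """Return up to k archive names, ordered by coverage of the URI's TLD."""
    host = urlsplit(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    scored = [(cov.get(tld, 0.0), name) for name, cov in profile.items()]
    # Highest coverage first; ties broken by archive name.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked[:k] if score > 0]


if __name__ == "__main__":
    print(route("http://example.co.uk/", PROFILE))
    print(route("http://example.pt/", PROFILE, k=1))
```

A routing evaluation would then compare the mementos found with the top-k archives against those found by querying every archive.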
