Evaluating Methods to Rediscover Missing Web Pages from Web Infrastructure -- Martin Klein & Michael L. Nelson Old Dominion University

Post on 17-Jan-2016

Page 1:

Evaluating Methods to Rediscover Missing Web Pages from Web Infrastructure

Martin Klein & Michael L. Nelson, Old Dominion University

Page 2:

Looks familiar?

Page 3:

“Moved” but not lost

Reasons for “404”

Change in website structure: original webpage relocated within the same website

Server/domain name issues: original webpage captured by other websites

Page 4:

“Moved” but not lost

Page 5:

Rediscovering Missing Webpages

Search-based solutions

URL

Lexical Signature (LS)

Title

Social bookmarking tags

Link Neighbourhood Lexical Signature (LNLS)
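
The slides name the lexical signature without defining it. A lexical signature is commonly computed as the handful of terms in a page with the highest TF-IDF score, which are then submitted as a search query. The sketch below illustrates that idea only; the function names, the tokenizer, and the use of a small local corpus to estimate document frequency are assumptions for illustration, not the setup used in the talk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF score.

    `corpus` is a list of other documents, used here only to estimate
    document frequency (an assumption made for this sketch).
    """
    tf = Counter(tokenize(doc))
    # Include the document itself so every term has a document frequency >= 1.
    docs = [set(tokenize(d)) for d in corpus] + [set(tf)]
    n = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)
        scores[term] = count * math.log(n / df)
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


page = "digital library preservation of web pages digital library"
others = ["web pages and search engines", "library of congress web archive"]
signature = lexical_signature(page, others, k=3)
query = " ".join(signature)  # the signature terms become the search query
```

Terms that occur in every document of the corpus score zero and fall to the bottom of the ranking, which is why common words rarely end up in a signature.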

Page 6:

Evaluation

Corpus

500 random samples from Open Directory Project

“Pretend” to be missing

Search Engines: Google/Yahoo/MSN

Metric

Percentage of webpages rediscovered from the top-N search results (N=1, 2-10, 11-100)
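
The metric above can be sketched as a small bucketing routine: each page's rank in the result list is mapped to one of the slide's ranges, and the share of pages per range is reported as a percentage. The function names, the bucket labels, and the convention that `None` means "not found in the top 100" are assumptions for illustration.

```python
from collections import Counter

BUCKETS = ("top-1", "top-2-10", "top-11-100", "undiscovered")


def bucket(rank):
    """Map a 1-based result rank to a bucket; None or a rank above 100 counts as undiscovered."""
    if rank is None or rank > 100:
        return "undiscovered"
    if rank == 1:
        return "top-1"
    if rank <= 10:
        return "top-2-10"
    return "top-11-100"


def rediscovery_rates(ranks):
    """Return the percentage of pages that fall into each bucket."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    counts = Counter(bucket(r) for r in ranks)
    return {b: 100.0 * counts[b] / len(ranks) for b in BUCKETS}


# Hypothetical ranks for ten pages (not data from the talk).
example_ranks = [1, 1, 3, 50, None, 1, 10, 11, None, 101]
rates = rediscovery_rates(example_ranks)
```

Because every page lands in exactly one bucket, the four percentages always sum to 100.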

Page 7:

Results

LS

Majority either rediscovered in top-10 or undiscovered

Yahoo!: 67.6% top-1, 7.5% top-2-10, 22% undiscovered

Title

Similar distribution but with more webpages rediscovered

Google: 69.3% top-1, 8.1% top-2-10, 19.7% undiscovered

Unquoted title queries perform better than quoted ones

Tags and LNLS

Poor performance from both

Page 8:

Results

Combining LS and Title

Better performance than any single method

Yahoo! uniformly outperforms the rest

76.4% top-1, 7.8% top-2-10, 13.6% undiscovered
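
The slides do not say how the two methods are combined. One plausible reading, shown here purely as an assumption, is a fallback: try one query, and only if the page is not found try the next. `first_hit` and the `search` callable are hypothetical; `search` stands in for any function that returns an ordered list of result URLs for a query.

```python
def first_hit(target_url, queries, search, max_rank=100):
    """Try each (method_name, query) pair in order.

    Returns (method_name, rank) for the first method whose top `max_rank`
    results contain `target_url`, or (None, None) if no method finds it.
    `search` is a hypothetical callable: query string -> list of URLs.
    """
    for name, query in queries:
        results = search(query)[:max_rank]
        if target_url in results:
            return name, results.index(target_url) + 1
    return None, None


# A stand-in search function backed by a fixed table (illustration only).
fake_index = {
    "example page title": ["http://other.example/", "http://site.example/page"],
    "alpha beta gamma": ["http://site.example/page"],
}


def fake_search(query):
    return fake_index.get(query, [])


found = first_hit(
    "http://site.example/page",
    [("title", "example page title"), ("ls", "alpha beta gamma")],
    fake_search,
)
```

A real comparison would also need to normalize URLs before matching, which this sketch leaves out.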

Title analysis

Titles of 3 to 6 words are the most frequent and perform well

Further improvement by removing stopwords
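
Turning a page title into a search query, with optional stopword removal and optional quoting, might look like the sketch below. The stopword list is a tiny illustrative sample and `title_query` is a hypothetical helper; neither comes from the talk.

```python
# A small sample stopword list (an assumption; real lists are much longer).
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "for", "on", "at", "by", "with"}
)


def title_query(title, quoted=False, drop_stopwords=True):
    """Build a search query from a page title.

    quoted=True wraps the query in double quotes (a phrase search);
    drop_stopwords=True removes words found in STOPWORDS.
    """
    words = title.split()
    if drop_stopwords:
        words = [w for w in words if w.lower() not in STOPWORDS]
    query = " ".join(words)
    return f'"{query}"' if quoted else query


unquoted = title_query("The Journal of Digital Information")
phrase = title_query("The Journal of Digital Information", quoted=True, drop_stopwords=False)
```

Quoting after removing stopwords would search for a phrase that never appeared on the page, so the two options are best treated as alternatives rather than used together.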

Page 9:

Research Insights

Common but non-trivial problem

Simple methodology

Detailed, multi-step evaluation