Evaluating Methods to Rediscover Missing Web Pages from Web Infrastructure
Martin Klein & Michael L. Nelson, Old Dominion University
Looks familiar?
Reasons for “404”
- Change in website structure
- Server/domain name issues

“Moved” but not lost
- Original webpage relocated in the same website
- Original webpage captured by other websites
“Moved” but not lost
Rediscovering Missing Webpages
Search-based solutions
- URL
- Lexical Signature (LS)
- Title
- Social bookmarking tags
- Link Neighbourhood Lexical Signature (LNLS)
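
The slides do not say how a lexical signature is computed. A common reading is that it is the handful of terms that best distinguish a page from others, ranked by TF-IDF. The sketch below assumes that reading. The function names, the smoothed IDF formula, and the tiny in-memory corpus used for document frequencies are all illustrative assumptions, not the authors' setup.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(page_text, corpus_texts, k=5):
    """Return the k terms of page_text with the highest TF-IDF score.

    Assumption: document frequencies come from corpus_texts, a small
    stand-in for whatever index a real system would consult.
    """
    tf = Counter(tokenize(page_text))
    doc_tokens = [set(tokenize(text)) for text in corpus_texts]
    n_docs = len(doc_tokens)

    scores = {}
    for term, count in tf.items():
        df = sum(1 for tokens in doc_tokens if term in tokens)
        # Smoothed IDF (an assumption) so unseen terms do not divide by zero.
        idf = math.log((n_docs + 1) / (df + 1))
        scores[term] = count * idf

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


if __name__ == "__main__":
    page = "hurricane tracking maps hurricane season forecast the the"
    corpus = [
        "the weather forecast for the week",
        "the season of the football league",
        "the maps of the city",
    ]
    # The resulting terms would be sent to a search engine as one query.
    print(" ".join(lexical_signature(page, corpus, k=3)))
```

Terms that appear in every corpus document, such as "the", score zero and fall to the bottom of the ranking, which is the property that makes the signature usable as a search query.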
Evaluation
Corpus
- 500 random samples from Open Directory Project
- “Pretend” to be missing

Search Engines
- Google/Yahoo/MSN

Metric
- Percentage of webpages rediscovered from the top-N search results (N=1, 2-10, 11-100)
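
The metric above can be sketched as a small bucketing routine: find the rank at which the known URL appears in a result list, assign it to one of the rank ranges, and report each range as a percentage of all pages. The names `rank_of`, `bucket`, and `rediscovery_rates` are hypothetical, and exact string matching of URLs is a simplifying assumption (a real evaluation would likely need URL normalization).

```python
from collections import Counter

# Rank ranges taken from the slide, plus a catch-all for pages not found.
BUCKETS = ("top-1", "top-2-10", "top-11-100", "undiscovered")


def rank_of(url, results):
    """Return the 1-based rank of url in results, or None if it is absent."""
    for position, result in enumerate(results, start=1):
        if result == url:  # assumption: exact match, no URL normalization
            return position
    return None


def bucket(rank):
    """Map a 1-based rank (or None) to one of the BUCKETS labels."""
    if rank is None or rank > 100:
        return "undiscovered"
    if rank == 1:
        return "top-1"
    if rank <= 10:
        return "top-2-10"
    return "top-11-100"


def rediscovery_rates(ranks):
    """Return the percentage of pages that fall into each bucket."""
    if not ranks:
        raise ValueError("ranks must contain at least one entry")
    counts = Counter(bucket(rank) for rank in ranks)
    return {name: 100.0 * counts[name] / len(ranks) for name in BUCKETS}


if __name__ == "__main__":
    # Hypothetical ranks for ten pages; None means the page was not returned.
    ranks = [1, 1, 5, None, 50, 200, 1, None, 3, 1]
    for name, rate in rediscovery_rates(ranks).items():
        print(f"{name}: {rate:.1f}%")
```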
Results
LS
- Majority either rediscovered in top-10 or undiscovered
- Yahoo!: 67.6% top-1, 7.5% top-2-10, 22% undiscovered

Title
- Similar distribution but with more webpages rediscovered
- Google: 69.3% top-1, 8.1% top-2-10, 19.7% undiscovered
- Unquoted better than quoted

Tags and LNLS
- Poor performance from both
Results
Combining LS and Title
- Better performance than any single method
- Yahoo! uniformly outperforms the rest: 76.4% top-1, 7.8% top-2-10, 13.6% undiscovered

Title analysis
- Length of 3 to 6 words most frequent and well-performing
- Further improvement by removing stopwords
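
The title findings (unquoted queries doing better than quoted ones, and stopword removal helping) suggest a simple way to build a title query. The sketch below is one possible reading. The stopword list is a short illustrative sample rather than the list used in the study, and `title_query` is a hypothetical helper.

```python
import string

# Illustrative stopword sample (assumption); a real list would be longer.
STOPWORDS = frozenset(
    {"a", "an", "and", "at", "for", "in", "of", "on", "or", "the", "to"}
)


def title_query(title, remove_stopwords=True, quoted=False):
    """Build a search query string from a page title.

    Defaults follow the slide: stopwords removed and the query unquoted.
    """
    words = title.split()
    if remove_stopwords:
        kept = [
            word
            for word in words
            if word.lower().strip(string.punctuation) not in STOPWORDS
        ]
        # Fall back to the full title if every word was a stopword.
        words = kept or words
    query = " ".join(words)
    return f'"{query}"' if quoted else query


if __name__ == "__main__":
    print(title_query("The History of the Internet"))
    print(title_query("The History of the Internet",
                      remove_stopwords=False, quoted=True))
```

Keeping the quoted form as an option makes it easy to compare both variants against the same set of pages.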
Research Insights
- Common but non-trivial problem
- Simple methodology
- Detailed, multi-step evaluation