![Page 1: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/1.jpg)
Temporal Anchor Text as Proxy for User Queries
Thaer Samar, Arjen P. de Vries
![Page 2: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/2.jpg)
Web Archiving 1/2
The Web is a major source of published
information
Content on the Web evolves and changes
continuously
Many initiatives aim to archive the Web
Petabytes of archived data
![Page 3: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/3.jpg)
Web Archiving 2/2
Web archives are incomplete
Impossible to include all Web pages due to
crawling limitations e.g., [Masanès06]
Depth-first crawl, focus only on selected web sites
Breadth-first crawl, focus on the entire domain,
but not in depth
![Page 4: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/4.jpg)
Reconstruct Queries
Our study: evolution of anchor text over time
to reconstruct what was important in the past
Information that would be similar to user queries
Inspiration:
Document titles can be used as an approximation
of user queries [Jin et al.]
Anchor text exhibits characteristics similar to user
queries and document titles [Eiron & McCurley]
![Page 5: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/5.jpg)
Queries in the Past
User queries have usually not been preserved
Impossible to reconstruct which queries the
user would have used to search the archive
However, web archives contain more than the
Web page content
E.g., page source, different timestamps (archive
date, last-modified date), link structure
![Page 6: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/6.jpg)
Link Evidence and Anchor Text
Link information represents the source URL, destination URL, and the anchor text
Anchor text is a short text describing the destination page
Has been shown to improve search effectiveness in a large number of Information Retrieval studies
Example link: Source http://www.cwi.nl, Destination http://www.nwo.nl, anchor text ‘NWO’
![Page 7: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/7.jpg)
Data: Dutch Web Archive
National Library of the Netherlands (KB)
Depth-first (selective) Web archive
Since 2007
10+ TB
8,000+ websites
Our snapshot
2009-2012
![Page 8: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/8.jpg)
Link Processing
Filtering text/html pages
~70% of archived objects
URL: http://www.cwi.nl
Archive-Date: 20091201
Content-Type: text/html
<html>
<a href=http://www.nwo.nl> NWO </a>
</html>
Web Archive Record
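The filtering and extraction steps applied to a record like the one above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `record` dictionary layout, the `AnchorExtractor` class, and the `extract_links` function are hypothetical names, and the sketch uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(record):
    """Return (source, destination, anchor, YYYYMM) tuples for a text/html record."""
    if not record["content_type"].startswith("text/html"):
        return []  # filtering step: skip non-HTML objects
    parser = AnchorExtractor()
    parser.feed(record["body"])
    parser.close()
    month = record["archive_date"][:6]  # YYYYMMDD -> YYYYMM
    return [(record["url"], href, anchor, month) for href, anchor in parser.links]


# Hypothetical record layout mirroring the slide's example.
record = {
    "url": "http://www.cwi.nl",
    "archive_date": "20091201",
    "content_type": "text/html",
    "body": '<html><a href="http://www.nwo.nl"> NWO </a></html>',
}
links = extract_links(record)
```

For this record, `links` holds a single tuple with source `http://www.cwi.nl`, destination `http://www.nwo.nl`, anchor text `NWO`, and archive month `200912`.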
![Page 15: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/15.jpg)
Link Processing
Filtering: pages of type text/html (~70% of archived objects)
Extraction: source URL, destination URL, anchor text, crawl date (YYYYMM)
Cleaning: URL normalization; get the host of the source and the destination; remove spam, e.g., rolex watches
Partitioning: based on one-year and one-month granularity
Deduplication: remove duplicate links caused by crawling frequency (same source, destination, and anchor text)
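The cleaning, partitioning, and deduplication steps could look roughly like the sketch below. Several details are assumptions made for illustration only: the `SPAM_TERMS` blocklist, stripping a leading `www.` during normalization, and deduplicating at host level rather than URL level. The function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

SPAM_TERMS = {"rolex"}  # hypothetical blocklist, not taken from the slides


def host_of(url):
    """Lowercased host of a URL, without a leading 'www.' (an assumption)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_spam(anchor):
    """True if any blocklisted term occurs as a word in the anchor text."""
    return any(term in anchor.lower().split() for term in SPAM_TERMS)


def partition_links(links, granularity="month"):
    """Group cleaned links by year (YYYY) or month (YYYYMM).

    Each partition is a set, so links with the same source host,
    destination host, and anchor text are stored only once.
    """
    width = 4 if granularity == "year" else 6
    partitions = defaultdict(set)
    for source, destination, anchor, crawl_month in links:
        if is_spam(anchor):
            continue
        key = crawl_month[:width]
        partitions[key].add((host_of(source), host_of(destination), anchor))
    return dict(partitions)


links = [
    ("http://www.cwi.nl/", "http://www.nwo.nl", "NWO", "200912"),
    ("http://www.cwi.nl/", "http://www.nwo.nl", "NWO", "200912"),  # re-crawl duplicate
    ("http://www.cwi.nl/research", "http://www.nwo.nl/", "NWO", "201001"),
    ("http://example.nl", "http://spam.example", "rolex watches", "200912"),
]
by_month = partition_links(links)
by_year = partition_links(links, granularity="year")
```

Using a set per partition makes deduplication a side effect of insertion, which keeps the sketch short; a large-scale pipeline would more likely deduplicate in a distributed grouping step.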
![Page 16: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/16.jpg)
Hosts Evolution
Important hosts over time
Aggregate links by target host,
keeping unique source hosts
Multiple pages from same host linking to the same
target host are counted as one
Rank hosts based on number of source hosts
linking to them
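A minimal sketch of this ranking, assuming links are already reduced to (source host, target host) pairs for one time partition. The function name and the alphabetical tie-break are illustrative choices, not part of the slides.

```python
from collections import defaultdict


def rank_target_hosts(host_links):
    """Rank target hosts by the number of distinct source hosts linking to them."""
    sources = defaultdict(set)
    for source_host, target_host in host_links:
        # A set keeps each source host once per target host, so multiple
        # pages from the same host count as a single link.
        sources[target_host].add(source_host)
    counts = {target: len(srcs) for target, srcs in sources.items()}
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


host_links = [
    ("cwi.nl", "nwo.nl"),
    ("cwi.nl", "nwo.nl"),  # second page on the same source host
    ("kb.nl", "nwo.nl"),
    ("cwi.nl", "kb.nl"),
]
ranking = rank_target_hosts(host_links)
```

Here `nwo.nl` ranks first with two distinct source hosts, and `kb.nl` follows with one.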
![Page 17: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/17.jpg)
% of new hosts over the years
% of new hosts in 2012: hosts not seen in 2009, 2010,
or 2011
![Page 18: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/18.jpg)
Anchor Text Evolution
Measure the importance of anchor text a over
time in time-partitioned links
Aggregate by anchor text
Compute the archive-based popularity
Normalize by the maximum
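The slides do not give the popularity formula, so the sketch below is only one plausible reading: it assumes the popularity of an anchor text within a partition is the number of deduplicated links carrying it, divided by the largest such count in that partition. The function name and the tuple layout are hypothetical.

```python
from collections import Counter


def anchor_popularity(links):
    """Max-normalized link count per anchor text within one partition.

    `links` is an iterable of (source_host, target_host, anchor) tuples.
    The count-based definition is an assumption, not the authors' formula.
    """
    counts = Counter(anchor for _source, _target, anchor in links)
    if not counts:
        return {}
    top = max(counts.values())
    return {anchor: count / top for anchor, count in counts.items()}


partition = {
    ("cwi.nl", "nwo.nl", "nwo"),
    ("kb.nl", "nwo.nl", "nwo"),
    ("cwi.nl", "kb.nl", "kb"),
}
popularity = anchor_popularity(partition)
```

With this normalization the most frequent anchor text in each partition scores 1.0, which makes scores comparable across partitions of different sizes.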
![Page 19: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/19.jpg)
% of new anchor text over the years
Anchor text is new in a specific partition if it does
not appear in any previous partition
Based on one-year granularity
59% new anchor text
Based on one-month granularity
34% new anchor text
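The notion of "new" anchor text can be illustrated with a short sketch. It assumes partitions are keyed by sortable strings such as `YYYY` or `YYYYMM`, and it skips the first partition because there is nothing earlier to compare against; both are assumptions. The toy data is invented and does not reproduce the percentages reported on the slide.

```python
def percent_new(partitions):
    """Map each partition key to the % of its items unseen in earlier partitions."""
    seen = set()
    result = {}
    for index, key in enumerate(sorted(partitions)):
        items = set(partitions[key])
        if index > 0 and items:
            result[key] = 100.0 * len(items - seen) / len(items)
        seen |= items  # everything in this partition is now "old"
    return result


# Invented toy partitions at one-year granularity.
anchors_by_year = {
    "2009": {"nwo", "kb"},
    "2010": {"nwo", "twitter"},
    "2011": {"twitter", "canon", "kb", "nwo"},
}
new_share = percent_new(anchors_by_year)
```

The same function works for hosts instead of anchor text, since it only relies on set membership.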
![Page 20: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/20.jpg)
WikiStats
Aggregated page views of Wikipedia (WP) pages
From Jan 2008 to Jan 2015
We focus on
Feb 2009 to Dec 2012
Similar to the period of our snapshot of the Dutch
Web archive
Keep WP titles viewed >= 1,000 times
![Page 21: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/21.jpg)
Matching anchor text to WP titles
Pre-process WP titles like the anchor text
Lowercase
Stop-word removal
One-year and one-month granularity partitions
Collect titles by exact match with the anchors
Assume anchor popularity equals WP page
popularity
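The matching procedure might be sketched as below. The stop-word list, the underscore-to-space conversion for WP titles, the summing of views when two titles normalize to the same key, and all view counts are assumptions made for illustration.

```python
STOP_WORDS = {"de", "het", "een", "the", "a", "of"}  # illustrative list only


def normalize(text):
    """Lowercase, treat underscores as spaces, and drop stop words."""
    tokens = text.lower().replace("_", " ").split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)


def match_anchors(anchors, wp_views, min_views=1000):
    """Map each anchor text to the view count of an exactly matching WP title."""
    titles = {}
    for title, views in wp_views.items():
        if views < min_views:
            continue  # keep only titles viewed at least min_views times
        key = normalize(title)
        if key:
            # Assumption: titles that collide after normalization are summed.
            titles[key] = titles.get(key, 0) + views
    return {a: titles[normalize(a)] for a in anchors if normalize(a) in titles}


anchors = ["Amsterdam", "de Volkskrant", "filmpje"]
# Invented view counts for one partition.
wp_views = {"Amsterdam": 52000, "De_Volkskrant": 3100, "Filmpje!": 1500, "Obscure_page": 12}
matches = match_anchors(anchors, wp_views)
```

In this toy run `Amsterdam` and `de Volkskrant` find a title, while `filmpje` does not, because the title keeps its exclamation mark under exact matching.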
![Page 22: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/22.jpg)
Ranked anchor text with WP match
Different rank cut-off
% overlap decreases as the cut-off increases
~56% of the top-1k anchor texts have a match
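Overlap at different rank cut-offs can be computed along these lines. The function name and the toy ranking are hypothetical, and the numbers are not the study's results.

```python
def overlap_at_cutoffs(ranked_anchors, matched, cutoffs):
    """% of the top-k ranked anchor texts that have a WP title match, per cut-off k."""
    result = {}
    for k in cutoffs:
        top = ranked_anchors[:k]
        if top:
            result[k] = 100.0 * sum(anchor in matched for anchor in top) / len(top)
    return result


# Invented ranking, most popular first, and the set of anchors with a match.
ranked = ["amsterdam", "twitter", "klik hier", "rotterdam", "lees meer", "home"]
matched = {"amsterdam", "twitter", "rotterdam"}
overlap = overlap_at_cutoffs(ranked, matched, cutoffs=(2, 4, 6))
```

In the toy data the overlap drops from 100% at cut-off 2 to 50% at cut-off 6, mirroring the trend described on the slide.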
![Page 23: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/23.jpg)
Examples of popular anchor text (with match)
Major cities in the Netherlands
E.g., Amsterdam, Rotterdam, Groningen, and Utrecht
Social web sites
E.g., twitter, linkedin, flickr, and vimeo
Major Dutch daily newspapers
E.g., de Volkskrant, Telegraaf, and Trouw
Dutch public broadcasting
uitzending gemist
Government web service
E.g., belastingdienst
![Page 24: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/24.jpg)
Discussion
Our original goal was to identify historically
trending events from the link evolution
recorded in the archive
Unfortunately, we found only a few examples
with our current analysis
E.g., ‘‘canon’’ *
However, important anchor text provides an
overview of important Dutch entities
* corresponding to an activity initiated by the government to define
the canonical historic events in Dutch history
![Page 25: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/25.jpg)
Limitations & Future Work
Exact text matching between anchor text and
WP title
E.g., filmpje does not match WP title filmpje!
Additional pre-processing
Stemming, stopping, generalize from exact match to
match with low edit distance
Our analysis is based on a depth-first crawl of
a few thousand Dutch websites
Breadth-first crawl such as [CommonCrawl]
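The proposed generalization from exact match to low edit distance could be sketched with a standard Levenshtein distance. The threshold of 1 and the function names are assumptions; the slides name the idea only as future work.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(anchor, titles, max_distance=1):
    """Titles within max_distance edits of the anchor text (threshold assumed)."""
    return [title for title in titles if edit_distance(anchor, title) <= max_distance]


candidates = fuzzy_match("filmpje", ["filmpje!", "film", "amsterdam"])
```

With a threshold of 1, the anchor `filmpje` now matches the title `filmpje!`, which exact matching misses. A pairwise comparison like this is quadratic in the number of anchors and titles, so a real pipeline would need an index or blocking step.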
![Page 26: Temporal Anchor Text as Proxy for user Queries](https://reader031.vdocuments.site/reader031/viewer/2022022200/58a34ecb1a28ab62248b6f07/html5/thumbnails/26.jpg)
References
[Masanès06] J. Masanès. Web Archiving. Springer, 2006
[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai.
Title language model for information retrieval. In SIGIR 2002
[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of
anchor text for web search. In SIGIR 2003
[CommonCrawl] https://commoncrawl.org/
[WikiStats] http://wikistats.ins.cwi.nl/