Using hyperlink structure information for web search
Hyperlink structure information
Hyperlink analysis for the web by Monika R. Henzinger, Google Inc.
Structural web search using a graph-based discovery system by Nitish Monocha et al., University of Texas
How are hyperlinks useful?
Assumptions
a) Assumption 1. A hyperlink from page A to page B is a recommendation of page B by the author of page A.
b) Assumption 2. If page A and page B are connected by a hyperlink, then they might be on the same topic.
c) Pages pointed to by many pages are of higher quality than pages pointed to by fewer pages.
Main uses of hyperlink analysis
Crawling (collecting the pages)
Ranking (ranking the returned results)
Computing the geographic scope of a web page
Finding mirrored hosts
Computing statistics of web pages and search engines
Major search engines use hyperlink analysis but do not want to disclose the algorithms.
Crawling
Collect web pages.
Start with a set of pages, recursively visit the hyperlinks.
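The crawling step can be sketched as a breadth-first traversal. This is only an illustration: it uses an explicit queue instead of recursion, the `get_links` callback and the `toy_web` dictionary are hypothetical stand-ins for fetching and parsing real pages, and `max_pages` is an assumed safety limit.

```python
from collections import deque

def crawl(seed_pages, get_links, max_pages=1000):
    """Collect pages by starting from the seeds and following hyperlinks."""
    seeds = list(dict.fromkeys(seed_pages))  # drop duplicate seeds, keep order
    visited = []                             # pages in the order collected
    seen = set(seeds)                        # pages already queued
    queue = deque(seeds)
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

# Hypothetical toy web: page name -> pages it links to.
toy_web = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}
order = crawl(["A"], lambda page: toy_web.get(page, []))
print(order)
```

The `seen` set is what keeps the crawl from looping forever on cycles such as A to C and back to A.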
Traditional IR
Vector model or Boolean model.
Does not work well on the web because web page authors manipulate the ranking.
The power of hyperlink analysis comes from the fact that it uses the content of other pages to rank the current page.
Connectivity-Based Ranking (rank using hyperlink analysis)
Query-independent schemes, which assign a score to a page independent of a given query.
Query-dependent schemes, which assign a score to a page in the context of a given query.
Model
The web as a graph: each page is a node, each hyperlink is an edge.
Directed graph: link graph. Used for finding related pages
Undirected graph: co-citation graph. Used for categorizing related pages.
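A small sketch of the two graph views, assuming the usual definition in which two pages are co-cited when some third page links to both. The dictionary layout and the page names are illustrative assumptions, not something the slides specify.

```python
from collections import defaultdict
from itertools import combinations

def cocitation_graph(link_graph):
    """Map each unordered page pair to the number of pages that link to both."""
    weights = defaultdict(int)
    for source, targets in link_graph.items():
        # Every pair of distinct targets of one page is co-cited by that page.
        for u, v in combinations(sorted(set(targets)), 2):
            weights[frozenset((u, v))] += 1
    return dict(weights)

# Directed link graph (hypothetical): page -> pages it links to.
link_graph = {"A": ["B", "C"], "D": ["B", "C", "E"], "B": ["E"]}
cocite = cocitation_graph(link_graph)
for pair, count in cocite.items():
    print(sorted(pair), count)
```

Using `frozenset` keys makes the edges undirected: the pair (B, C) and the pair (C, B) are the same key.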
Query-independent Ranking
Simple approach: rank a page by the number of pages pointing to it. Major drawback: it does not distinguish between a page pointed to by a number of low-quality pages and a page pointed to by the same number of high-quality pages.
PageRank algorithm. Weight each hyperlink to a page proportionally to the quality of the page containing the hyperlink. The PageRank of a page A depends on the PageRank of each page B pointing to A. Used by Google.
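A minimal power-iteration sketch of the PageRank idea follows. It is not Google's production algorithm: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the toy graph are all assumptions made for illustration.

```python
def pagerank(link_graph, damping=0.85, iterations=50):
    """Iteratively pass each page's score along its out-links."""
    nodes = set(link_graph)
    for targets in link_graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = link_graph.get(node, [])
            if targets:
                # A link carries weight proportional to the linking page's rank.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: "good" and "poor" both have exactly two in-links,
# but the pages linking to "good" are themselves linked to by four pages.
toy_web = {
    "p1": ["hub1", "hub2"], "p2": ["hub1", "hub2"],
    "p3": ["hub1", "hub2"], "p4": ["hub1", "hub2"],
    "hub1": ["good"], "hub2": ["good"],
    "low1": ["poor"], "low2": ["poor"],
}
rank = pagerank(toy_web)
print(rank["good"] > rank["poor"])
```

A plain in-link count would score "good" and "poor" identically, which is the case the weighting is meant to separate.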
Query-dependent Ranking
Build query-specific graph: neighborhood graph.
a) Start set of documents matching the query
b) Augmented by the set of documents that either link to or are linked to by documents in the start set.
Perform the hyperlink analysis.
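Steps a) and b) can be sketched as below. The function name, the dictionary representation of the link graph, and the toy data are hypothetical; a real system would take the start set from a text search engine.

```python
def neighborhood_graph(link_graph, start_set):
    """Build the query-specific graph around the documents matching a query."""
    start = set(start_set)
    nodes = set(start)
    for source, targets in link_graph.items():
        for target in targets:
            if source in start:
                nodes.add(target)  # document linked to by a start-set document
            if target in start:
                nodes.add(source)  # document linking to a start-set document
    # Keep only the hyperlinks whose endpoints are both in the neighborhood.
    return {
        node: [t for t in link_graph.get(node, []) if t in nodes]
        for node in nodes
    }

# Hypothetical web and start set (pages assumed to match the query).
web = {"A": ["B"], "B": ["C"], "C": ["D"], "E": ["B"], "F": ["G"]}
sub = neighborhood_graph(web, {"B"})
print(sorted(sub))
```

Pages D, F and G are left out because they are neither in the start set nor one hyperlink away from it.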
Query-dependent Ranking (continued)
a) Indegree-based approach (the number of documents hyperlinking to a document in the start set).
b) Authorities (pages with good content on a topic) and hubs (directory-like pages with many hyperlinks to pages on the topic).
c) HITS algorithm to determine good hubs and good authorities. Each node has an authority score and a hub score.
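The mutual reinforcement between hubs and authorities can be sketched as an iterative update on the neighborhood graph. This is a simplified illustration: the iteration count, the Euclidean normalization, and the toy graph are assumptions.

```python
import math

def hits(graph, iterations=50):
    """Return (hub, auth) score dictionaries for a directed graph."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to the node.
        auth = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of authority scores of the pages the node points to.
        hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        # Normalize so the scores stay bounded across iterations.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / auth_norm for n, v in auth.items()}
        hub = {n: v / hub_norm for n, v in hub.items()}
    return hub, auth

# Hypothetical neighborhood graph: three directory-like pages, two content pages.
graph = {"H1": ["X", "Y"], "H2": ["X", "Y"], "H3": ["X"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph X ends up as the strongest authority because all three hubs point to it, and H1 and H2 outscore H3 as hubs because they point to both authorities.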
Problems of HITS
Small additions to the neighborhood graph may considerably change the hub and authority scores.
Topic drift occurs when the majority of pages in the neighborhood graph are on a topic different from the query topic.
Structural web search using a graph-based discovery system
WebSUBDUE: SUBDUE is the engine for knowledge discovery (data mining).
Supports structural search, text search, synonym search, and combinations of these searches.
Data preparation: a crawler written in Perl builds the labeled graph for the web site.
The labeled graph is fed into the SUBDUE system.
A query can be modeled as a labeled graph as well.
Search for the query subgraph in the site graph.
Comparison is made with an existing search engine: AltaVista.
Find all pages that link to a page containing the term subdue
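A structural query of this kind combines a link condition with a text condition. The sketch below is not SUBDUE's subgraph matching; it is a direct lookup over a hypothetical labeled graph, where `link_graph` holds the hyperlinks and `page_words` holds the lowercase words on each page.

```python
def pages_linking_to_term(link_graph, page_words, term):
    """Return pages that have a hyperlink to some page containing the term."""
    term = term.lower()
    # Text condition: pages whose word set contains the term.
    matching = {page for page, words in page_words.items() if term in words}
    # Structural condition: pages with at least one link into that set.
    return sorted(
        source
        for source, targets in link_graph.items()
        if any(target in matching for target in targets)
    )

# Hypothetical site graph and page contents.
link_graph = {
    "home": ["projects", "people"],
    "projects": ["subdue_page"],
    "people": [],
}
page_words = {
    "home": {"welcome"},
    "projects": {"research", "projects"},
    "subdue_page": {"subdue", "graph", "discovery"},
    "people": {"faculty"},
}
result = pages_linking_to_term(link_graph, page_words, "subdue")
print(result)
```

A keyword-only search would return the page containing the term; the structural query instead returns the pages one link before it.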
Jobs in computer science
Find hub and authority pages on “algorithm”
Conclusion
Hyperlink structure is a valuable source of information.
Hyperlink information can enhance normal web search in crawling, ranking, etc.
Hyperlink information can support structural search, which is still missing in existing search engines.