Using hyperlink structure information for web search


Upload: sheena-blankenship

Post on 16-Jan-2016


Page 1: Using Hyperlink structure information for web search

Using Hyperlink structure information for web search


Hyperlink structure information

Hyperlink analysis for the web by Monika R. Henzinger, Google Inc.

Structural web search using a graph-based discovery system by Nitish Manocha et al., University of Texas


How are hyperlinks useful?

Assumptions

a) Assumption 1. A hyperlink from page A to page B is a recommendation of page B by the author of page A.

b) Assumption 2. If page A and page B are connected by a hyperlink, then they might be on the same topic.

c) Pages pointed to by many pages are of higher quality than pages pointed to by fewer pages.


Main uses of hyperlink analysis

Crawling (collecting the pages)

Ranking (ranking the returned results)

Computing the geographic scope of a web page

Finding mirrored hosts

Computing statistics of web pages and search engines

Major search engines use hyperlink analysis but do not want to disclose the algorithms.


Crawling

Collect web pages.

Start with a set of pages, then recursively visit the hyperlinks.
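
The crawling step above can be sketched as a breadth-first traversal. This is a minimal sketch, not the crawler used by any real search engine: the function name `crawl`, the `get_links` callback, the `max_pages` limit, and the in-memory `TOY_WEB` dictionary are all assumptions made for illustration, so that the example runs without network access.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=1000):
    """Visit the seed pages, then follow hyperlinks breadth-first.

    `get_links(page)` is a hypothetical callback that returns the pages
    a given page links to; a real crawler would fetch and parse HTML here.
    """
    visited = []          # pages in the order they were collected
    seen = set()          # pages already queued, to avoid revisiting
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Toy link structure standing in for the web (assumption for illustration).
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": [], "e": ["a"]}
order = crawl(["a"], lambda page: TOY_WEB.get(page, []))
print(order)
```

Page "e" links to "a" but nothing links to "e", so a crawl seeded at "a" never collects it. This is why the choice of start set matters.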


Traditional IR

Vector model or Boolean model.

Does not work well on the web because web page authors manipulate the ranking.

The power of hyperlink analysis comes from the fact that it uses the content of other pages to rank the current page.


Connectivity-Based Ranking (ranking using hyperlink analysis)

Query-independent schemes, which assign a score to a page independent of a given query.

Query-dependent schemes, which assign a score to a page in the context of a given query.


Model

Web pages as graph, page as node, hyperlink as edge.

Directed graph: link graph. Used for finding related pages

Undirected graph: co-citation graph. Used for categorizing related pages.
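
The two graph models can be illustrated with a small sketch. It assumes the link graph is stored as a dictionary from each page to the list of pages it links to, and that two pages are connected in the co-citation graph when some third page links to both. The function name `cocitation_graph` and the edge weights (number of co-citing pages) are illustrative choices, not something the slides specify.

```python
from collections import defaultdict
from itertools import combinations


def cocitation_graph(link_graph):
    """Build an undirected co-citation graph from a directed link graph.

    Returns a dict mapping an (u, v) pair, with u < v, to the number of
    pages that link to both u and v.
    """
    weights = defaultdict(int)
    for source, targets in link_graph.items():
        # Every pair of distinct pages cited by the same source is co-cited.
        for u, v in combinations(sorted(set(targets)), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Directed link graph: page -> pages it links to (toy data).
links = {"a": ["b", "c"], "d": ["b", "c"], "e": ["c", "f"]}
cocite = cocitation_graph(links)
print(cocite)
```

Here "b" and "c" are co-cited by both "a" and "d", which suggests they are more closely related than "c" and "f", which share only one citing page.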


Query-independent Ranking

Major drawback of simply counting incoming hyperlinks: it does not distinguish between the quality of a page pointed to by a number of low-quality pages and the quality of a page pointed to by the same number of high-quality pages.

PageRank algorithm: weight each hyperlink to a page proportionally to the quality of the page containing the hyperlink. The PageRank of a page A depends on the PageRank of each page B pointing to A. Used by Google.
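
The idea of weighting each hyperlink by the quality of the linking page can be sketched as an iterative computation. The slides do not give a formula, so this sketch assumes the commonly published formulation with a damping factor; the value 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without outgoing links are all assumptions, and the function name `pagerank` is illustrative. It is not a description of Google's production system.

```python
def pagerank(link_graph, damping=0.85, iterations=50):
    """Iteratively compute PageRank scores for a directed link graph.

    `link_graph` maps each page to the list of pages it links to.
    Pages without outgoing links spread their rank over all pages.
    """
    nodes = set(link_graph)
    for targets in link_graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links.
        dangling = sum(rank[p] for p in nodes if not link_graph.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in nodes}
        for source, targets in link_graph.items():
            if not targets:
                continue
            # Each link passes on an equal share of the source's rank.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# "good" and "poor" both have indegree 2, but "good" is linked from
# pages that are themselves linked to by others (toy data).
toy_links = {
    "p1": ["hub1", "hub2"], "p2": ["hub1", "hub2"], "p3": ["hub1", "hub2"],
    "hub1": ["good"], "hub2": ["good"],
    "leaf1": ["poor"], "leaf2": ["poor"],
}
ranks = pagerank(toy_links)
print(ranks["good"] > ranks["poor"])
```

A plain in-link count would score "good" and "poor" identically. The iterative scheme separates them because the pages pointing to "good" carry more rank themselves.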


Query-dependent Ranking

Build query-specific graph: neighborhood graph.

a) Start set of documents matching the query

b) Augmented by the set of documents that either hyperlink to or are hyperlinked to by the documents in the start set.

Perform the hyperlink analysis.
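
The construction of the neighborhood graph can be sketched as follows. The sketch assumes the start set has already been produced by a text search and that the full link graph is available as a dictionary; the function name `neighborhood_graph` and the toy data are illustrative.

```python
def neighborhood_graph(link_graph, start_set):
    """Return the subgraph induced by the start set and its neighbors.

    The node set is the start set plus every page that links to, or is
    linked to by, a page in the start set.
    """
    start = set(start_set)
    nodes = set(start)
    for source, targets in link_graph.items():
        if source in start:
            nodes.update(targets)          # pages the start set links to
        if start.intersection(targets):
            nodes.add(source)              # pages linking into the start set
    # Keep only hyperlinks whose both endpoints are in the neighborhood.
    return {
        source: [t for t in link_graph.get(source, []) if t in nodes]
        for source in nodes
    }


# Toy web: "s1" and "s2" are the documents matching the query.
web = {"a": ["s1"], "s1": ["b"], "s2": [], "b": ["c"], "c": ["d"]}
sub = neighborhood_graph(web, ["s1", "s2"])
print(sorted(sub))
```

Pages "c" and "d" are two or more links away from the start set, so they are left out, and the link from "b" to "c" is dropped along with them.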


Query-dependent Ranking (continued)

a) Indegree-based approach (the number of documents hyperlinking to a document in the start set).

b) Authorities (pages with good content on a topic) and hubs (directory-like pages with many hyperlinks to pages on the topic).

c) HITS algorithm to determine good hubs and good authorities. Each node has an authority score and a hub score.
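
The hub and authority scores can be computed by mutual reinforcement: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The sketch below assumes this standard formulation of HITS with Euclidean normalization after each round; the fixed iteration count and the function name `hits` are illustrative choices.

```python
import math


def hits(link_graph, iterations=50):
    """Compute (hub, authority) scores on a neighborhood graph."""
    nodes = set(link_graph)
    for targets in link_graph.values():
        nodes.update(targets)
    hub = {page: 1.0 for page in nodes}
    auth = {page: 1.0 for page in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        auth = {page: 0.0 for page in nodes}
        for source, targets in link_graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum the authority scores of pages linked to.
        hub = {
            page: sum(auth[t] for t in link_graph.get(page, []))
            for page in nodes
        }
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, auth


# Three directory-like pages pointing at two content pages (toy data).
topic_links = {"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]}
hub_scores, auth_scores = hits(topic_links)
print(max(auth_scores, key=auth_scores.get))
```

In this toy graph "x" ends up as the strongest authority, and "h1" and "h2" score higher as hubs than "h3" because they point to more of the authoritative pages.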


Problems of HITS

Small additions to the neighborhood graph may considerably change the hub and authority scores.

Topic drift occurs when the majority of pages in the neighborhood graph are on a topic different from the query topic.


Structural web search using a graph-based discovery system

WebSUBDUE: SUBDUE is the engine for knowledge discovery (data mining).

Supports structural search, text search, synonym search, and combinations of these searches.

Data preparation: a crawler written in Perl builds the labeled graph for the web site.

The labeled graph is fed into the SUBDUE system.

A query can be modeled as a labeled graph as well.

Search for the query subgraph in the graph.

Comparison with an existing search engine: AltaVista.


Find all pages that link to a page containing the term subdue
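
A structural query of this kind combines a text condition (a page contains the term) with a link condition (another page points to it). The sketch below answers it with a direct traversal of a small labeled graph. It is not SUBDUE's subgraph discovery algorithm and does not use WebSUBDUE's query format; the function name `pages_linking_to_term`, the `page_text` dictionary, and the toy site are all hypothetical.

```python
def pages_linking_to_term(link_graph, page_text, term):
    """Return pages that hyperlink to at least one page containing `term`.

    `link_graph` maps page -> list of linked pages; `page_text` maps
    page -> its text. Matching is a case-insensitive substring test.
    """
    needle = term.lower()
    matching = {
        page for page, text in page_text.items() if needle in text.lower()
    }
    return sorted(
        source
        for source, targets in link_graph.items()
        if matching.intersection(targets)
    )


# Toy site (assumption for illustration).
site_text = {
    "home": "Welcome to the lab",
    "research": "The Subdue project",
    "people": "Staff list",
    "pubs": "Papers about SUBDUE",
}
site_links = {
    "home": ["research", "people"],
    "people": ["home"],
    "research": ["pubs"],
    "pubs": [],
}
result = pages_linking_to_term(site_links, site_text, "subdue")
print(result)
```

A keyword-only search would return "research" and "pubs", the pages that contain the term. The structural query instead returns the pages that point to them.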


Jobs in computer science


Find hub and authority pages on “algorithm”


Conclusion

Hyperlink structure information is valuable for web search.

Hyperlink information can enhance normal web search in crawling, ranking, etc.

Hyperlink information can also support structural search, which is still missing in existing search engines.