Artificial Intelligence 15-381: Web Spidering & HW1 Preparation

TRANSCRIPT

Page 1: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Artificial Intelligence 15-381: Web Spidering & HW1 Preparation

Jaime Carbonell, [email protected]

22 January 2002

Today's Agenda

Finish A*, B*, Macrooperators

Web Spidering as Search
  How to acquire the web in a box
  Graph theory essentials
  Algorithms for web spidering
  Some practical issues

Page 2: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Search Engines on the Web

Revising the Total IR Scheme

1. Acquire the collection, i.e. all the documents

[Off-line process]

2. Create an inverted index (IR lecture, later)

[Off-line process]

3. Match queries to documents (IR lecture)

[On-line process, the actual retrieval]

4. Present the results to user

[On-line process: display, summarize, ...]

Page 3: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Acquiring a Document Collection

Document Collections and Sources

Fixed, pre-existing document collection

e.g., the classical philosophy works

Pre-existing collection with periodic updates

e.g., the MEDLINE biomedical collection

Streaming data with temporal decay

e.g., the Wall-Street financial news feed

Distributed proprietary document collections

Distributed, linked, publicly-accessible documents

e.g., the Web

Page 4: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Properties of Graphs I (1)

Definitions:

Graph

a set of nodes n and a set of edges (binary links) v between the nodes.

Directed graph

a graph where every edge has a pre-specified direction.

Page 5: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Properties of Graphs I (2)

Connected graph

a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.

The web graph

the directed graph where n = {all web pages} and v = {all HTML-defined links from one web page to another}.

Page 6: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Properties of Graphs I (3)

Tree

a connected graph without any loops and with a unique path between any two nodes

Spanning tree of graph G

a tree constructed by including all n in G, and a subset of v such that G remains connected, but all loops are eliminated.

Page 7: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Properties of Graphs I (4)

Forest

a set of trees (without inter-tree links)

k-Spanning forest

Given a graph G with k connected subgraphs, the set of k trees each of which spans a different connected subgraph.

Page 8: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Graph G = <n, v>

Page 9: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Directed Graph Example

Page 10: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Tree

Page 11: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Web Graph

[Diagram: web pages as nodes, with <href …> references as the directed links between them]

HTML references are links; web pages are nodes.

Page 12: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

More Properties of Graphs

Theorem 1: For every connected graph G, there exists a spanning tree.

Proof: Depth-first search starting at any node in G builds the spanning tree.
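The proof's construction can be sketched directly. A minimal Python version, assuming the graph is given as an adjacency dict (all names are illustrative): the edge that first discovers each node during depth-first search becomes a tree edge, and the resulting edges span the connected graph.

```python
def spanning_tree(graph, start):
    """Depth-first search from `start`; the edge that first reaches
    each node is kept, and together these edges form a spanning tree
    of the connected graph (loops are never re-entered)."""
    visited = {start}
    tree_edges = []
    stack = [start]
    while stack:
        node = stack.pop()
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                tree_edges.append((node, nbr))
                stack.append(nbr)
    return tree_edges

# A connected graph with a loop: A-B, B-C, C-A, plus C-D
G = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
edges = spanning_tree(G, "A")
# A spanning tree of 4 nodes has exactly 3 edges, and every node appears
```
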

Page 13: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Properties of Graphs

Theorem 2: For every G with k disjoint connected subgraphs, there exists a k-spanning forest.

Proof: Each connected subgraph has a spanning tree (Theorem 1), and the set of k spanning trees (being disjoint) define a k-spanning forest.
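Theorem 2's construction amounts to restarting the depth-first spanning-tree procedure once per undiscovered component. A sketch under the same adjacency-dict assumption as above:

```python
def spanning_forest(graph):
    """One DFS spanning tree per connected component; the k trees
    together form a k-spanning forest of the graph."""
    visited = set()
    forest = []
    for start in graph:
        if start in visited:
            continue              # already inside an earlier tree
        visited.add(start)
        tree, stack = [], [start]
        while stack:
            node = stack.pop()
            for nbr in graph[node]:
                if nbr not in visited:
                    visited.add(nbr)
                    tree.append((node, nbr))
                    stack.append(nbr)
        forest.append(tree)
    return forest

# Two disjoint components {A, B} and {X, Y} => a 2-spanning forest
G = {"A": ["B"], "B": ["A"], "X": ["Y"], "Y": ["X"]}
forest = spanning_forest(G)
```
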

Page 14: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Properties of Web Graphs

Additional Observations

The web graph at any instant of time contains k connected subgraphs (but we do not know the value of k, nor do we know a priori the structure of the web graph).

If we knew every connected web subgraph, we could build a k-web-spanning forest, but this is a very big "IF."

Page 15: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Graph-Search Algorithms I

PROCEDURE SPIDER1(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
        URLcurr := pop(STACK)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

What is wrong with the above algorithm?
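One flaw is easy to demonstrate with a minimal Python sketch of SPIDER1, using an in-memory dict of page→links in place of real look-up(URL) calls (all names are illustrative): on a convergent DAG, the same page gets stored more than once.

```python
def spider1(pages, root):
    """Naive SPIDER1: no visited check, so shared links cause duplicate
    entries in COLLECTION (and a cycle would make it loop forever)."""
    stack = [root]
    collection = []              # list of (url, page) pairs
    while stack:
        url = stack.pop()
        page = pages[url]        # stands in for look-up(URLcurr)
        collection.append((url, page))
        for link in page["links"]:
            push = stack.append  # push(URLi, STACK)
            push(link)
    return collection

# Convergent DAG: both B and C link to D
pages = {
    "A": {"links": ["B", "C"]},
    "B": {"links": ["D"]},
    "C": {"links": ["D"]},
    "D": {"links": []},
}
urls = [u for u, _ in spider1(pages, "A")]
# D is fetched and stored twice: urls == ['A', 'C', 'D', 'B', 'D']
```
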

Page 16: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Depth-first Search

[Diagram: a tree whose seven nodes are numbered 1-7, the numbers giving the order in which depth-first search visits them]

Page 17: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Graph-Search Algorithms II (1)

SPIDER1 is Incorrect

What about loops in the web graph?

=> Algorithm will not halt

What about convergent DAG structures?

=> Pages will be replicated in the collection

=> Inefficiently large index

=> Duplicates annoy the user

Page 18: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Graph-Search Algorithms II (2)

SPIDER1 is Incomplete

Web graph has k-connected subgraphs.

SPIDER1 only reaches pages in the connected web subgraph where the ROOT page lives.

Page 19: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

A Correct Spidering Algorithm

PROCEDURE SPIDER2(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
    While STACK is not empty,
|       Do URLcurr := pop(STACK)
|       Until URLcurr is not in COLLECTION
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION

(The | bars mark the lines changed from SPIDER1.)

Page 20: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

A More Efficient Correct Algorithm

PROCEDURE SPIDER3(G)
    Let ROOT := any URL from G
    Initialize STACK <stack data structure>
    Let STACK := push(ROOT, STACK)
    Initialize COLLECTION <big file of URL-page pairs>
|   Initialize VISITED <big hash-table>
    While STACK is not empty,
|       Do URLcurr := pop(STACK)
|       Until URLcurr is not in VISITED
|       insert-hash(URLcurr, VISITED)
        PAGE := look-up(URLcurr)
        STORE(<URLcurr, PAGE>, COLLECTION)
        For every URLi in PAGE,
            push(URLi, STACK)
    Return COLLECTION
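A sketch of SPIDER3 in Python, with VISITED as a set (Python's built-in hash table) and the same illustrative page dict standing in for real look-ups:

```python
def spider3(pages, root):
    """SPIDER3: the VISITED hash table guarantees each URL is fetched
    at most once, so the spider halts even on graphs with loops."""
    stack = [root]
    visited = set()              # the VISITED hash table
    collection = []
    while stack:
        url = stack.pop()
        if url in visited:       # the Do ... Until pop loop
            continue
        visited.add(url)         # insert-hash(URLcurr, VISITED)
        page = pages[url]        # look-up(URLcurr)
        collection.append((url, page))
        for link in page["links"]:
            stack.append(link)
    return collection

# A loop: A -> B -> A.  SPIDER1 would never halt here.
pages = {"A": {"links": ["B"]}, "B": {"links": ["A"]}}
crawled = [u for u, _ in spider3(pages, "A")]
# Each page stored exactly once: crawled == ['A', 'B']
```
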

Page 21: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Graph-Search Algorithms V

A More Complete Correct Algorithm

PROCEDURE SPIDER4(G, {SEEDS})
|   Initialize COLLECTION <big file of URL-page pairs>
|   Initialize VISITED <big hash-table>
|   For every ROOT in SEEDS
|       Initialize STACK <stack data structure>
|       Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For every URLi in PAGE,
                push(URLi, STACK)
    Return COLLECTION
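SPIDER4's outer loop over seeds can be sketched the same way. With one seed per connected subgraph it reaches pages a single-root spider would miss (names and the page dict are illustrative, as before):

```python
def spider4(pages, seeds):
    """SPIDER4: one spidering pass per seed, sharing VISITED and
    COLLECTION across passes, so disconnected subgraphs are reached."""
    visited, collection = set(), []
    for root in seeds:               # For every ROOT in SEEDS
        stack = [root]
        while stack:
            url = stack.pop()
            if url in visited:
                continue
            visited.add(url)
            page = pages[url]
            collection.append((url, page))
            stack.extend(page["links"])
    return collection

# Two disconnected subgraphs: {A, B} and {X}
pages = {"A": {"links": ["B"]}, "B": {"links": []}, "X": {"links": []}}
only_a = spider4(pages, ["A"])       # misses X entirely
both = spider4(pages, ["A", "X"])    # reaches all three pages
```
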

Page 22: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Completeness Observations

Completeness is not guaranteed

In k-connected web G, we do not know k

Impossible to guarantee each connected subgraph is sampled

Better: more seeds, more diverse seeds

Page 23: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Completeness Observations

Search Engine Practice

Wish to maximize subset of web indexed.

Maintain (secret) set of diverse seeds

(grow this set opportunistically, e.g., when X complains that his/her page is not indexed).

Register new web sites on demand

New registrations are seed candidates.

Page 24: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

To Spider or not to Spider? (1) User Perceptions

Most annoying: Engine finds nothing
(too small an index, but not an issue since 1997 or so)

Somewhat annoying: Obsolete links
=> Refresh the collection by deleting dead links (OK if the index is slightly smaller)
=> Done every 1-2 weeks in the best engines

Mildly annoying: Failure to find a new site
=> Re-spider the entire web
=> Done every 2-4 weeks in the best engines

Page 25: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

To Spider or not to Spider? (2)

Cost of Spidering

Semi-parallel algorithmic decomposition

Spider can (and does) run on hundreds of servers simultaneously

Very high network connectivity (e.g. T3 line)

Servers can migrate from spidering to query processing depending on time-of-day load

Running a full web spider takes days even with hundreds of dedicated servers

Page 26: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Current Status of Web Spiders

Historical Notes

WebCrawler: first documented spider

Lycos: first large-scale spider

Top-honors for most web pages spidered: First Lycos, then Alta Vista, then Google...

Page 27: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Current Status of Web Spiders

Enhanced Spidering

In-link counts to pages can be established during spidering.

Hint: In SPIDER4, store <URL, COUNT> pair in VISITED hash table.

In-link counts are the basis for Google's PageRank method
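The hint can be sketched by keeping a URL→count dict alongside the fetched set while spidering (names are illustrative, and an in-memory page dict again stands in for real look-ups; the slide's single <URL, COUNT> table is split into a counts dict plus a fetched set here for clarity):

```python
from collections import defaultdict

def spider_with_inlinks(pages, root):
    """Like SPIDER3, but every time a link to a URL is seen during
    spidering, that URL's in-link count is incremented."""
    inlinks = defaultdict(int)   # URL -> COUNT, per the hint
    fetched = set()              # visited check, as in SPIDER3
    stack = [root]
    collection = []
    while stack:
        url = stack.pop()
        if url in fetched:
            continue
        fetched.add(url)
        page = pages[url]
        collection.append((url, page))
        for link in page["links"]:
            inlinks[link] += 1   # count the in-link as it is seen
            stack.append(link)
    return collection, dict(inlinks)

pages = {
    "A": {"links": ["B", "C"]},
    "B": {"links": ["C"]},
    "C": {"links": []},
}
coll, counts = spider_with_inlinks(pages, "A")
# C has two in-links (from A and B); B has one (from A)
```
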

Page 28: Artificial Intelligence 15-381 Web Spidering & HW1 Preparation

Current Status of Web Spiders

Unsolved Problems

Most spidering re-traverses a stable web graph
=> on-demand re-spidering when changes occur

Completeness or near-completeness is still a major issue

Cannot spider Java-triggered or local-DB-stored information