CS 430 / INFO 430 Information Retrieval, Lecture 15: Web Search 1


Page 1

CS 430 / INFO 430 Information Retrieval

Lecture 15

Web Search 1

Page 2

Course Administration

Assignment 2

Grades have been sent out

Assignment 3

Has been posted

Midterm has been graded

If you have any questions, please come to office hours

Page 3

Web Search

Goal

Provide information discovery for large amounts of open access material on the web

Challenges

• Volume of material -- several billion items, growing steadily

• Items created dynamically or in databases

• Great variety -- length, formats, quality control, purpose, etc.

• Inexperience of users -- range of needs

• Economic models to pay for the service

• Mischievous Web sites

Page 4

Strategies

Subject hierarchies

• Use of human indexing -- Yahoo! (original)

Web crawling + automatic indexing

• General -- Infoseek, Lycos, AltaVista, Google, Yahoo! (current)

Mixed models

• Human directed web crawling and automatic indexing -- iVia/NSDL

Page 5

Components of Web Search Service

Components

• Web crawler

• Indexing system

• Search system

• Advertising system

Considerations

• Economics

• Scalability

• Legal issues

Page 6

Lectures and Classes

Discussion 6: Web crawling
Lecture 15: Web crawling
Discussion 7: Ranking Web documents
Lecture 16: Graphical methods of ranking
Lecture 17: Building a large-scale search engine
Discussion 8: File systems
Lecture 18: Advertising, spam, & Web business
Lectures 23-25: User interface considerations

Page 7

Web Searching: Architecture

• Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.)

• The central index is implemented as a single system on a very large number of computers.

Examples: Google, Yahoo!

[Diagram: Web servers with Web pages → crawl → Web pages retrieved by crawler → build index → index to all Web pages → search]

Page 8

What is a Web Crawler?

Web Crawler

• A program for downloading web pages.

• Given an initial set of seed URLs, it recursively downloads every page that is linked from pages in the set.

• A focused web crawler downloads only those pages whose content satisfies some criterion.

Also known as a web spider

Page 9

Simple Web Crawler Algorithm

Basic Algorithm

Let S be the set of URLs to pages waiting to be indexed. Initially, S is a set of known seeds.

Take an element u of S and retrieve the page, p, that it references.

Parse the page p and extract the set of URLs L it has links to.

Update S = S + L - u

Repeat as many times as necessary.

[Large production crawlers may run continuously]
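As a concrete illustration, here is a minimal sketch of this loop in Python, using only the standard library. The seed URL, the page limit, and the HTMLParser-based link extraction are assumptions made for the sketch; it has none of the politeness, robots.txt, or failure handling discussed on the following slides.

# Minimal sketch of the basic crawl loop (no politeness, robots.txt handling,
# or failure recovery -- see the following slides for what is missing).
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier = list(seeds)              # S: URLs waiting to be processed
    seen = set(seeds)                   # URLs already scheduled
    while frontier and len(seen) <= max_pages:
        url = frontier.pop(0)           # take an element u of S
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                    # broken link or timeout: skip it
        parser = LinkExtractor()
        parser.feed(page)               # parse p and extract its links L
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:    # S = S + L - u
                seen.add(absolute)
                frontier.append(absolute)
        yield url, page                 # hand the page to the indexer

for url, page in crawl(["https://example.com/"]):
    print(url, len(page))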

Page 10

Not so Simple…

Performance -- How do you crawl 10,000,000,000 pages?

Politeness -- How do you avoid overloading servers?

Legal -- What if the owner of a page does not want the crawler to index it?

Failures -- Broken links, timeouts, spider traps.

Strategies -- How deep do we go? Depth first or breadth first?

Implementations -- How do we store and update S and the other data structures needed?

Page 11

What to Retrieve

No web crawler retrieves everything

• All crawlers retrieve
 – HTML (leaves and nodes in the tree)
 – ASCII clear text (only as leaves in the tree)

• Most retrieve
 – PDF
 – PostScript, …

Importance metrics (Discussion Class 6)
 – Interest driven
 – Popularity driven
 – Location driven
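As a rough sketch of the leaf/node distinction above, a crawler can inspect the Content-Type header of each response to decide whether the document should be parsed for further links (HTML), indexed without parsing (plain text, PDF, PostScript), or skipped. The accepted-type lists below are assumptions made for the illustration, not a standard.

# Sketch: classify a fetched document by its Content-Type header
# (the type lists are illustrative assumptions).
from urllib.request import urlopen

PARSE_FOR_LINKS = {"text/html"}                  # interior nodes of the crawl tree
INDEX_ONLY = {"text/plain",                      # leaves: indexed, not parsed
              "application/pdf",
              "application/postscript"}

def classify(url):
    with urlopen(url, timeout=10) as resp:
        ctype = resp.headers.get_content_type()  # e.g. "text/html"
        body = resp.read()
    if ctype in PARSE_FOR_LINKS:
        return "parse-and-index", body
    if ctype in INDEX_ONLY:
        return "index-only", body
    return "skip", None

print(classify("https://example.com/")[0])       # "parse-and-index"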

Page 12

Robots Exclusion

The Robots Exclusion Protocol

A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, at http://.../robots.txt.

The Robots META tag

A Web author can indicate if a page may or may not be indexed, or analyzed for links, through the use of a special HTML META tag

See: http://www.robotstxt.org/wc/exclusion.html

Page 13

Robots Exclusion

Example file: /robots.txt

# Disallow all robots
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

# To allow Cybermapper
User-agent: cybermapper
Disallow:
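Python's standard library includes a parser for this format (urllib.robotparser); the sketch below feeds it the example rules above, with the comments stripped, and checks two hypothetical requests.

# Sketch: checking the example rules above with urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html

User-agent: cybermapper
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())    # parse the rules directly, no network fetch

# Any other robot is kept out of /tmp/ ...
print(rp.can_fetch("SomeCrawler", "http://example.com/tmp/x.html"))   # False
# ... but cybermapper may go anywhere.
print(rp.can_fetch("cybermapper", "http://example.com/tmp/x.html"))   # True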

Page 14

Extracts from: http://www.nytimes.com/robots.txt

# robots.txt, www.nytimes.com 3/24/2005
User-agent: *
Disallow: /college
Disallow: /reuters
Disallow: /cnet
Disallow: /partners
Disallow: /archives
Disallow: /indexes
Disallow: /thestreet
Disallow: /nytimes-partners
Disallow: /financialtimes
Allow: /2004/
Allow: /2005/
Allow: /services/xml/

User-agent: Mediapartners-Google*
Disallow:

Page 15

The Robots META tag

The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links. No server administrator action is required.

Note that currently only a few robots implement this.

In this simple example:

<meta name="robots" content="noindex, nofollow">

a robot should neither index this document, nor analyze it for links.

http://www.robotstxt.org/wc/exclusion.html#meta
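A small sketch of how a crawler might honor this tag, using Python's html.parser; the helper name and return convention are assumptions made for the illustration.

# Sketch: read the robots META tag from a page and report whether the page
# may be indexed and whether its links may be followed.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots" ...> tag."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}

def may_index_and_follow(html):
    p = RobotsMetaParser()
    p.feed(html)
    return ("noindex" not in p.directives,   # may the page be indexed?
            "nofollow" not in p.directives)  # may its links be harvested?

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(may_index_and_follow(page))            # (False, False)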

Page 16

High Performance Web Crawling

The web is growing fast:

• To crawl a billion pages a month, a crawler must download about 400 pages per second.

• Internal data structures must scale beyond the limits of main memory.

Politeness:

• A web crawler must not overload the servers that it is downloading from.
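The 400 pages-per-second figure above is simple arithmetic, sketched below with round assumed numbers (one billion pages, a 30-day month).

# Back-of-the-envelope crawl rate (assumed round numbers).
pages = 1_000_000_000              # one billion pages
seconds = 30 * 24 * 60 * 60        # one month of continuous crawling
print(f"{pages / seconds:.0f} pages/second")   # ~386, i.e. roughly 400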

Page 17

Example: Mercator and Heritrix Crawlers

AltaVista was a research project and production Web search engine developed by Digital Equipment Corporation.

Mercator was a high-performance crawler for production and research, developed by Allan Heydon, Marc Najork, Raymie Stata, and colleagues at Compaq Systems Research Center (a continuation of the work of Digital's AltaVista group).

Heritrix is a high-performance, open-source crawler developed by Raymie Stata and colleagues at the Internet Archive. (Stata is now at Yahoo!)

Mercator and Heritrix are described together, but there are major implementation differences.

Page 18

[Figure: not reproduced in the transcript]

Page 19

Mercator/Heritrix: Design Goals

Broad crawling: Large, high-bandwidth crawls to sample as much of the Web as possible given the time, bandwidth, and storage resources available.

Focused crawling: Small- to medium-sized crawls (usually less than 10 million unique documents) in which the quality criterion is complete coverage of selected sites or topics.

Continuous crawling: Crawls that revisit previously fetched pages, looking for changes and new pages, and even adapt the crawl rate based on parameters and estimated change frequencies.

Experimental crawling: Experiments with crawling techniques, such as the choice of what to crawl, the order in which pages are crawled, crawling using diverse protocols, and the analysis and archiving of crawl results.

Page 20

Mercator/Heritrix

Design parameters

• Extensible. Many components are plugins that can be rewritten for different tasks.

• Distributed. A crawl can be distributed in a symmetric fashion across many machines.

• Scalable. The size of in-memory data structures is bounded.

• High performance. Performance is limited by the speed of the Internet connection (e.g., with a 160 Mbit/sec connection, the crawler downloads 50 million documents per day).

• Polite. Options for weak or strong politeness.

• Continuous. Will support continuous crawling.

Page 21

Mercator/Heritrix: Main Components

Scope: Determines what URIs are ruled into or out of a certain crawl. Includes the seed URIs used to start a crawl, plus the rules to determine which discovered URIs are also to be scheduled for download.

Frontier: Tracks which URIs are scheduled to be collected, and those that have already been collected. It is responsible for selecting the next URI to be tried, and prevents the redundant rescheduling of already-scheduled URIs.

Processor Chains: Modular Processors that perform specific, ordered actions on each URI in turn. These include fetching the URI, analyzing the returned results, and passing discovered URIs back to the Frontier.
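A toy illustration of the Scope idea above, not the actual Heritrix API: a predicate that admits only URIs on the hosts of the seed URIs. Real scopes combine many such rules.

# Toy "scope" predicate: rule URIs in or out based on the seed hosts.
from urllib.parse import urlparse

def make_scope(seed_uris):
    seed_hosts = {urlparse(u).hostname for u in seed_uris}
    def in_scope(uri):
        return urlparse(uri).hostname in seed_hosts
    return in_scope

in_scope = make_scope(["https://www.cs.cornell.edu/", "https://example.org/"])
print(in_scope("https://www.cs.cornell.edu/courses/"))   # True: on a seed host
print(in_scope("https://elsewhere.net/page.html"))       # False: out of scope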

Page 22

Mercator

Notation

URL Frontier: set S of URLs to be crawled
DNS: Domain Name Service
RIS: Rewind Input Stream

Page 23

Mercator/Heritrix: Main Components

• Crawling is carried out by multiple worker threads, e.g., 500 threads for a big crawl.

• The URL frontier stores the list of absolute URLs to download.

• The DNS resolver resolves domain names into IP addresses.

• Protocol modules download documents using the appropriate protocol (e.g., HTTP).

• Link extractor extracts URLs from pages and converts to absolute URLs.

• URL filter and duplicate URL eliminator determine which URLs to add to frontier.
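A stripped-down sketch of that worker-thread structure follows. The thread count, the crude regex link extraction, and the single shared queue are simplifications made for the sketch; the next slide explains why a real frontier is more elaborate than one FIFO queue.

# Sketch: worker threads that share a frontier queue and a duplicate-URL set.
import re
import threading
import queue
from urllib.parse import urljoin
from urllib.request import urlopen

HREF = re.compile(r'href="([^"]+)"')    # crude link extraction for the sketch
frontier = queue.Queue()                # stand-in for the URL frontier
seen = set()                            # duplicate URL eliminator
seen_lock = threading.Lock()

def worker():
    while True:
        url = frontier.get()
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            for href in HREF.findall(page):
                link = urljoin(url, href)       # convert to an absolute URL
                with seen_lock:
                    if link in seen:
                        continue
                    seen.add(link)
                frontier.put(link)
        except OSError:
            pass                                # broken link or timeout
        finally:
            frontier.task_done()

def start_crawl(seeds, num_threads=50):
    for s in seeds:
        seen.add(s)
        frontier.put(s)
    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
    frontier.join()    # returns when every scheduled URL has been processed

# start_crawl(["https://example.com/"])  # would launch an open-ended crawl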

Page 24

Mercator/Heritrix: The URL Frontier

A repository with two pluggable methods: add a URL, get a URL.

Most web crawlers use variations of breadth-first traversal, but ...

• Most URLs on a web page are relative (about 80%).

• A single FIFO queue, serving many threads, would send many simultaneous requests to a single server.

Weak politeness guarantee: Only one thread is allowed to contact a particular web server at a time.

Stronger politeness guarantee: Maintain n FIFO queues, each for a single host, which feed the queues for the crawling threads by rules based on priority and politeness factors.
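A simplified, single-threaded sketch of the stronger guarantee: one FIFO queue per host, and a host is handed out again only after a fixed delay. The one-second delay and the method names are assumptions; Mercator and Heritrix use more elaborate priority and politeness rules.

# Sketch: a polite frontier with one FIFO queue per host and a per-host delay.
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PoliteFrontier:
    def __init__(self, delay=1.0):
        self.queues = defaultdict(deque)   # host -> FIFO queue of URLs
        self.next_ok = {}                  # host -> earliest next-fetch time
        self.delay = delay                 # assumed politeness interval (sec)

    def add(self, url):
        self.queues[urlparse(url).hostname].append(url)

    def get(self):
        """Return a URL whose host may be contacted now, or None."""
        now = time.monotonic()
        for host, q in self.queues.items():
            if q and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.delay
                return q.popleft()
        return None

f = PoliteFrontier()
f.add("https://example.com/a")
f.add("https://example.com/b")
print(f.get())   # https://example.com/a
print(f.get())   # None -- example.com must not be contacted again yet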

Page 25

Building a Web Crawler: Links are not Easy to Extract and Record

• Relative/Absolute
• Many strings are equivalent and can refer to the same file
• Mirrors duplicate Web sites or parts of Web sites
• CGI parameters
• Dynamic generation of pages
• Server-side scripting
• Server-side image maps
• Links buried in scripting code

Keeping track of the URLs that have been visited is a major component of a crawler
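A sketch of the canonicalization step this implies: resolve each extracted link against its base page and normalize it so that equivalent strings map to a single form. The particular rules below (lower-case scheme and host, drop default ports and fragments) are an illustrative subset, not a complete canonicalization.

# Sketch: resolve a link against its base URL and normalize the result.
from urllib.parse import urljoin, urlsplit, urlunsplit

def canonicalize(base_url, href):
    absolute = urljoin(base_url, href)        # relative -> absolute
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()     # hostnames are case-insensitive
    if parts.port and (scheme, parts.port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{parts.port}"         # keep only non-default ports
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))  # drop fragment

print(canonicalize("http://example.com/a/b.html", "../c.html#top"))
# http://example.com/c.html
print(canonicalize("HTTP://EXAMPLE.COM:80/", "index.html"))
# http://example.com/index.html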

Page 26

Mercator/Heritrix: Duplicate URL Elimination

Duplicate URLs are not added to the URL Frontier

Requires an efficient data structure to store the URL Set (all URLs that have been seen) and to check whether a new URL has been seen before.

In memory:

Represent each URL by an 8-byte checksum. Maintain an in-memory hash table of URLs.

Requires 5 Gigabytes for 1 billion URLs.

Disk based:

Combination of disk file and in-memory cache with batch updating to minimize disk head movement.
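An in-memory sketch of this idea: store an 8-byte digest of each URL instead of the string itself. hashlib's BLAKE2 with digest_size=8 stands in here for Mercator's actual fingerprint function, which this sketch does not reproduce.

# Sketch: an in-memory URL-seen test keyed on 8-byte checksums.
import hashlib

class UrlSeenTest:
    def __init__(self):
        self.seen = set()          # set of 8-byte digests, not URL strings

    @staticmethod
    def _checksum(url):
        return hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()

    def add_if_new(self, url):
        """Return True if the URL has not been seen before, and record it."""
        c = self._checksum(url)
        if c in self.seen:
            return False
        self.seen.add(c)
        return True

u = UrlSeenTest()
print(u.add_if_new("http://example.com/"))   # True  (new URL)
print(u.add_if_new("http://example.com/"))   # False (duplicate)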

Page 27

Example: Mercator's URL-Seen Test

All URLs that have been seen are stored in the URL Set.

• Represented by a fixed-size checksum

• In memory cache with least recently used replacement

Heydon and Najork reported in 1999 a cache hit rate of 75.7%, using an in-memory cache with 2^18 entries and a table of recently added URLs.

Page 28

Mercator/Heritrix: Domain Name Lookup

Resolving domain names to IP addresses is a major bottleneck of web crawlers.

Approach:

• Separate DNS resolver and cache on each crawling computer.

• Create multi-threaded version of DNS code (BIND).

In Mercator, these changes reduced DNS lookup from 70% to 14% of each thread's elapsed time.
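A sketch in this spirit: a small thread-safe cache in front of the standard resolver, so repeated lookups of the same host are answered from memory. Mercator's actual fix was a multi-threaded interface to BIND; the cache below is only an illustration and ignores DNS time-to-live.

# Sketch: a thread-safe DNS cache in front of socket.getaddrinfo
# (ignores TTLs; a production resolver must honor them).
import socket
import threading

class CachingResolver:
    def __init__(self):
        self.cache = {}                  # hostname -> IP address
        self.lock = threading.Lock()

    def resolve(self, hostname):
        with self.lock:
            if hostname in self.cache:
                return self.cache[hostname]
        # The lookup may block, so do it outside the lock.
        info = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
        ip = info[0][4][0]               # first address returned
        with self.lock:
            self.cache[hostname] = ip
        return ip

r = CachingResolver()
print(r.resolve("example.com"))          # resolved over the network
print(r.resolve("example.com"))          # answered from the cache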

Page 29

[Figure: not reproduced in the transcript]

Page 30

Research Topics in Web Crawling

• How frequently to crawl and what strategies to use.

• Identification of anomalies and crawling traps.

• Strategies for crawling based on the content of web pages (focused and selective crawling).

• Duplicate detection.

Page 31

Further Reading

Heritrix: http://crawler.archive.org/

Allan Heydon and Marc Najork, Mercator: A Scalable, Extensible Web Crawler. World Wide Web 2(4):219-229, December 1999. http://research.microsoft.com/~najork/mercator.pdf