Web Crawlers



PRESENTED BY:
Buddaraju Akhila Devi (502)


Contents:

What is a web crawler?
History
How does a web crawler work?
Crawling applications


What is a web crawler?

Also known as a Web spider or Web robot.
A program used by search engines to download pages from the web for later processing: the downloaded pages are indexed to provide fast searches.
Crawls or visits web pages and downloads them.
Starting from one page, it determines which page(s) to visit next.
Creates a copy of visited pages for later use (indexing).


History of Web Crawlers:

1994: WebCrawler created by Brian Pinkerton.
1994: launched as a web application with only a few sites in its database.
1994: DealerNet and Starwave contributed advertising and search sponsorship.
1995: America Online acquired it and continued its development.
1995: Spidey introduced, along with design changes.
1996: functionality extended from pure search to a human-edited guide, GNN Select.
1997: Excite took over WebCrawler.
2001: InfoSpace took over, running it as a meta-search engine.


Architecture of a Web Crawler:


How does a web crawler work?

It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to visit, called the crawl frontier.

URLs from the frontier are recursively visited according to a set of policies.
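As a rough illustration of this seed-and-frontier loop, here is a minimal Python sketch (not from the original slides; the seed URL, the page limit, and the regular-expression link extraction are simplifying assumptions of my own):

    # Minimal seed/frontier crawler sketch (illustrative only).
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seeds, max_pages=10):
        frontier = deque(seeds)          # the crawl frontier: URLs still to visit
        visited = set()                  # URLs already downloaded
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue                 # skip pages that fail to download
            visited.add(url)
            for link in re.findall(r'href="([^"]+)"', html):
                absolute = urljoin(url, link)   # resolve relative links
                if absolute not in visited:
                    frontier.append(absolute)   # grow the frontier
        return visited

    print(crawl(["https://example.com/"]))

A real crawler would use a proper HTML parser and apply the selection, re-visit, politeness, and parallelization policies described below.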


Uses:

Collect information.
Textual analysis.
Update data.
Automate maintenance.
Harvest e-mail addresses.
Mirroring, visualization.
Illegal uses.

Difficulties:

Large volume.
Dynamic page generation.
Fast rate of change.


Policies Used:

Selection policy: which pages to download and store.
Re-visit policy: when to check pages for changes.
Politeness policy: how to avoid overloading sites.
Parallelization policy: how to coordinate distributed web crawlers.

Behavior of a Web Crawler:


Selection Policy:

Does not sample pages at random.
Selects relevant pages by prioritizing them.
Calculates page popularity; Abiteboul proposed OPIC (On-line Page Importance Computation) for this.
Daneshpajouh et al. proposed a method for discovering good seeds.
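To make the prioritization idea concrete, here is a simplified Python sketch of the OPIC idea (a loose reading of Abiteboul's algorithm, not a faithful implementation; the toy three-page graph is my own):

    # OPIC idea: every page holds "cash"; visiting a page records its cash
    # in a history and redistributes it equally among its outlinks. Pages
    # that accumulate the most cash over time are estimated as important.
    def opic_step(graph, cash, history):
        for page, outlinks in graph.items():
            if not outlinks:
                continue                  # real OPIC routes dangling pages to a virtual node
            history[page] += cash[page]   # record the cash seen so far
            share = cash[page] / len(outlinks)
            cash[page] = 0.0
            for target in outlinks:
                cash[target] += share     # pass cash along the links

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    cash = {p: 1.0 / len(graph) for p in graph}   # equal cash to start
    history = {p: 0.0 for p in graph}
    for _ in range(20):
        opic_step(graph, cash, history)
    print(sorted(history, key=history.get, reverse=True))  # most "important" first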


Re-visit Policy:

The cost functions used are freshness and age.

Coffman et al. restated the web-crawling objective in terms of freshness: the crawler should minimize the fraction of time pages remain outdated.

The problem is equivalent to a single-server polling system with multiple queues.
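The slides do not give the formulas, but the standard definitions of these two cost functions (following Cho and Garcia-Molina) can be written as:

    % Freshness of page p at time t: is the local copy up to date?
    F_p(t) =
      \begin{cases}
        1 & \text{if } p \text{ is up-to-date at time } t \\
        0 & \text{otherwise}
      \end{cases}

    % Age of page p at time t, where m(p) is the time p was last modified
    A_p(t) =
      \begin{cases}
        0        & \text{if } p \text{ is up-to-date at time } t \\
        t - m(p) & \text{otherwise}
      \end{cases}

Under the re-visit policy, the crawler tries to maximize average freshness (or minimize average age) across its repository.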


Politeness Policy:

Crawlers can retrieve data much faster than human searchers.
A single crawler issuing multiple requests can overload a server.
The costs of crawling include network resources, server overload, poorly written crawlers, and personal crawlers that disturb networks.
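A common way to implement politeness is to honor robots.txt and to pause between requests to the same host. Here is a minimal Python sketch (the user-agent name and example URLs are placeholders of my own; real crawlers also keep per-host request queues):

    # Polite fetch sketch: respect robots.txt rules and Crawl-delay.
    import time
    from urllib import robotparser
    from urllib.request import urlopen

    AGENT = "ExampleCrawler"                      # hypothetical user agent
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()                                     # download and parse robots.txt

    def polite_fetch(url, default_delay=1.0):
        if not rp.can_fetch(AGENT, url):          # honor Disallow rules
            return None
        time.sleep(rp.crawl_delay(AGENT) or default_delay)
        return urlopen(url, timeout=5).read()

    page = polite_fetch("https://example.com/")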


Parallelization Policy:

Runs multiple crawl processes in parallel.
Objective: maximize the download rate while minimizing the overhead of parallelization.
Avoids repeated downloads of the same page.
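As a sketch of the idea, the following Python fragment runs several worker threads that share one frontier queue and one visited set, so no page is downloaded twice (illustrative only; real distributed crawlers typically partition URLs across machines, for example by host):

    # Parallel download sketch: a shared visited set avoids duplicates.
    import threading, queue
    from urllib.request import urlopen

    frontier = queue.Queue()
    visited, lock = set(), threading.Lock()

    def worker():
        while True:
            url = frontier.get()
            with lock:                    # claim the URL atomically
                seen = url in visited
                visited.add(url)
            if not seen:
                try:
                    urlopen(url, timeout=5).read()   # download (discarded here)
                except Exception:
                    pass
            frontier.task_done()

    for u in ["https://example.com/", "https://example.org/"]:
        frontier.put(u)
    for _ in range(4):                    # four parallel workers
        threading.Thread(target=worker, daemon=True).start()
    frontier.join()                       # wait until the frontier drains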


Examples (Crawler Architectures):

RBSE: the first published web crawler.
Slurp (Yahoo!).
Bingbot (Microsoft), which replaced Msnbot.
FAST Crawler (a distributed crawler).
Googlebot, written in C++ and Python.
PolyBot, written in C++ and Python.
World Wide Web Worm (indexing with the UNIX grep command).
WebFountain, written in C++ (distributed, with a controller machine).
WebRACE, written in Java.


Open Source Crawlers:

ASPseek, in C++.
DataparkSearch, under the GNU GPL.
GNU Wget, under the GPL.
GRUB, used by Wikia Search.
Heritrix, in Java.
ICDL Crawler, in C++; and many more.


Demo Program:


NUTCH:

Nutch is an open-source web crawler.

Nutch web search application:

Maintains a DB of pages and links.
Pages have scores, assigned by analysis.
Fetches high-scoring, out-of-date pages.
Distributed search front end.
Based on Lucene.

http://lucene.apache.org/nutch/
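The fetch-scheduling rule described above (fetch high-scoring pages whose copies are out of date) can be sketched as follows. This is a conceptual illustration in Python, not Nutch's actual API; the field names and thresholds are assumptions of my own:

    # Conceptual sketch of a Nutch-style "generate" step: from the page DB,
    # select the highest-scoring pages whose copies are out of date.
    import time

    def generate_fetch_list(pagedb, top_n=1000, max_age=7 * 24 * 3600):
        """pagedb: iterable of dicts with 'url', 'score', 'last_fetched'."""
        now = time.time()
        due = [p for p in pagedb if now - p["last_fetched"] > max_age]
        due.sort(key=lambda p: p["score"], reverse=True)  # best pages first
        return [p["url"] for p in due[:top_n]]

    pages = [
        {"url": "http://a.example/", "score": 2.5, "last_fetched": 0},
        {"url": "http://b.example/", "score": 0.4, "last_fetched": 0},
    ]
    print(generate_fetch_list(pages))   # high-scoring, stale pages first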
