Copy of Presentation 1


Page 1: Copy of Presentation 1


PRESENTED BY: Buddaraju Akhila Devi (502)

Page 2: Copy of Presentation 1


Contents:

What is a web crawler?

History

How does a web crawler work?

Crawling Applications

Page 3: Copy of Presentation 1


What is a web crawler?

Also known as a Web spider or Web robot.

Mainly used by search engines to download pages from the web for later processing by a search engine that will index the downloaded pages to provide fast searches.

It crawls, or visits, web pages and downloads them (a minimal sketch of this step follows after this list).

Starting from one page, it determines which page(s) to go to next.

It creates a copy of the visited pages for later use (indexing).
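As an illustration of the download step (not part of the original slides), here is a minimal Python sketch that fetches one page and extracts its hyperlinks using only the standard library; the example URL is arbitrary.

```python
# A minimal sketch of "download a page and find its links",
# using only the Python standard library. The URL is just an example.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def fetch_links(url):
    """Download one page and return its HTML and the hyperlinks it contains."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return html, parser.links


if __name__ == "__main__":
    page, links = fetch_links("https://example.com/")
    print(f"Downloaded {len(page)} characters, found {len(links)} links")
```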

Page 4: Copy of Presentation 1


History Of Web Crawlers:

1994: created by Brian Pinkerton (WebCrawler).

1994: launched as a web application with only a few sites in its database.

1994: DealerNet and Starwave contributed support for advertising and searching.

1995: America Online began using it and continued its development.

1995: Spidey was introduced, with design changes.

1996: functionality was extended from pure search to a human-edited guide, GNN Select.

1997: WebCrawler passed to Excite.

2001: InfoSpace operated it as a meta-search engine.

Page 5: Copy of Presentation 1


Architecture Of Web Crawler:

Page 6: Copy of Presentation 1


How does a web crawler work?

It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs still to visit, called the crawl frontier.

URLs from the frontier are recursively visited according to a set of policies. (A minimal sketch of this loop follows below.)
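The seed-and-frontier loop described above can be sketched in a few lines of Python. This is only an illustration, not the design of any particular crawler; `fetch_links` is assumed to be a helper like the one sketched earlier that downloads a page and returns its outgoing links.

```python
# A minimal sketch of the crawl loop: start from seed URLs, keep a
# frontier of URLs still to visit, and record each visited page.
# `fetch_links(url)` is assumed to return (page_html, outgoing_links).
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    frontier = deque(seeds)          # URLs still to visit (the crawl frontier)
    visited = set()                  # URLs already fetched
    store = {}                       # copies of visited pages, kept for indexing

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            page, links = fetch_links(url)
        except Exception:
            continue                 # skip pages that fail to download
        visited.add(url)
        store[url] = page
        # Newly discovered hyperlinks are added to the frontier.
        frontier.extend(link for link in links if link not in visited)

    return store
```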

Page 7: Copy of Presentation 1


Uses:

Collect information.

Textual analysis.

Update data.

Automate maintenance.

Harvest e-mail addresses.

Mirroring, visualization.

Illegal uses.

Difficulties:

Large volume.

Dynamic page generation.

Fast rate of change.

Page 8: Copy of Presentation 1


Policies Used:

Selection policy: which pages to download and store.

Re-visit policy: when to check pages for changes.

Politeness policy: how to avoid overloading web sites.

Parallelization policy: how to coordinate distributed web crawlers.

Behavior Of Web Crawler:

Page 9: Copy of Presentation 1


Selection Policy:

Pages are not sampled at random.

Relevant pages are selected by prioritizing them.

Page importance (popularity) is calculated, e.g. with Abiteboul's OPIC (On-line Page Importance Computation) algorithm. (A simplified sketch follows below.)

Daneshpajouh et al. proposed an algorithm for discovering good seed pages.
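As an illustration of priority-based selection, here is a rough sketch of the OPIC idea: every page holds some "cash", and when a page is fetched its cash is distributed evenly among the pages it links to, so pages that many others point at accumulate cash and are fetched first. This is a simplification under assumed helpers (`fetch_outlinks`), not a faithful implementation of Abiteboul's algorithm.

```python
# A rough, simplified sketch of OPIC-style prioritization: each page holds
# "cash"; fetching a page distributes its cash to its outlinks, and the
# frontier always yields the unvisited page with the most accumulated cash.
# `fetch_outlinks(url)` is an assumed helper that downloads a page and
# returns the URLs it links to.
from collections import defaultdict


def opic_crawl(seeds, fetch_outlinks, max_pages=100):
    cash = defaultdict(float)
    for seed in seeds:
        cash[seed] = 1.0 / len(seeds)        # initial cash split among the seeds

    visited = set()
    while cash and len(visited) < max_pages:
        # Pick the unvisited page with the highest accumulated cash.
        url = max(cash, key=cash.get)
        amount = cash.pop(url)
        if url in visited:
            continue
        visited.add(url)

        outlinks = fetch_outlinks(url)
        if not outlinks:
            continue
        # Distribute this page's cash evenly among its outlinks.
        share = amount / len(outlinks)
        for link in outlinks:
            if link not in visited:
                cash[link] += share

    return visited
```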

Page 10: Copy of Presentation 1


Re-visit Policy:

The cost functions used are freshness and age (sketched below).

Coffman et al. restated the crawler's objective in terms of freshness: the crawler should minimize the fraction of time that pages remain outdated.

The re-visiting problem can be modeled as a multiple-queue, single-server polling system.
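The two cost functions named above are usually defined per page: freshness is 1 while the local copy still matches the live page and 0 once the page changes, and age is how long the local copy has been out of date. A small sketch of those definitions (following the common Cho and Garcia-Molina formulation) is below; the timestamps are assumed inputs, since in practice the modification time must be estimated.

```python
# A small sketch of the two re-visit cost functions for a page at time t.
# `last_crawled` is when our copy was fetched; `last_modified` is the most
# recent time the live page changed (assumed known here).


def freshness(last_crawled, last_modified):
    """1 if the stored copy is still up to date (page unchanged since the crawl)."""
    return 1 if last_modified <= last_crawled else 0


def age(t, last_crawled, last_modified):
    """How long the stored copy has been outdated at time t (0 if still fresh)."""
    if last_modified <= last_crawled:
        return 0
    return t - last_modified
```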

Page 11: Copy of Presentation 1


Politeness Policy:

Crawlers can retrieve data much faster, and in greater depth, than human searchers.

Even a single crawler issuing many requests per second places a significant load on a server.

The costs of crawling include network resources, server overload, poorly written crawlers, and personal crawlers. (A small politeness sketch follows below.)
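One common way to implement politeness (an illustration, not something from the slides) is to honour robots.txt and wait a fixed delay between requests to the same host. The sketch below uses Python's standard urllib.robotparser; the one-second delay and the user-agent string are assumed values.

```python
# A small sketch of two common politeness measures: obey robots.txt and
# wait a fixed delay between successive requests to the same host.
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleCrawler"   # hypothetical crawler name
CRAWL_DELAY = 1.0               # assumed delay (seconds) between requests to one host

_last_request = {}              # host -> time of our last request to it
_robots = {}                    # host -> parsed robots.txt for that host


def allowed(url):
    """Check robots.txt before fetching a URL."""
    host = urlparse(url).netloc
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{urlparse(url).scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            # If robots.txt cannot be fetched at all, the parser stays unread
            # and can_fetch() will conservatively return False for this host.
            pass
        _robots[host] = rp
    return _robots[host].can_fetch(USER_AGENT, url)


def wait_politely(url):
    """Sleep so that requests to one host are at least CRAWL_DELAY apart."""
    host = urlparse(url).netloc
    elapsed = time.time() - _last_request.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    _last_request[host] = time.time()
```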

Page 12: Copy of Presentation 1


Parallelization Policy:

The crawler runs multiple crawling processes in parallel.

Objective: maximize the download rate while minimizing the overhead of parallelization.

Repeated downloads of the same page must be avoided. (A small sketch follows below.)
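A very small sketch of parallel crawling (an illustration only): several worker threads download pages concurrently, while a shared, locked `visited` set ensures the same page is not fetched twice. `fetch_links` is again an assumed download helper returning a page and its outgoing links.

```python
# A minimal sketch of parallel crawling: worker threads fetch pages
# concurrently, and a shared, locked `visited` set prevents the same
# URL from being downloaded twice.
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue


def parallel_crawl(seeds, fetch_links, workers=4, max_pages=100):
    frontier = Queue()
    for seed in seeds:
        frontier.put(seed)

    visited = set()
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get(timeout=2)      # give up when the frontier stays empty
            except Empty:
                return
            with lock:
                if url in visited or len(visited) >= max_pages:
                    continue
                visited.add(url)                   # claim the URL before fetching it
            try:
                _, links = fetch_links(url)
            except Exception:
                continue
            for link in links:
                with lock:
                    if link not in visited:
                        frontier.put(link)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

    return visited
```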

Page 13: Copy of Presentation 1


Examples (Crawler Architectures):

RBSE, the first published web crawler.

Slurp (Yahoo!).

Bingbot (Microsoft), which replaced Msnbot.

FAST Crawler (a distributed crawler).

Googlebot, in C++ and Python.

PolyBot, in C++ and Python.

World Wide Web Worm (indexing via the UNIX grep command).

WebFountain, in C++ (distributed, with a controller machine).

WebRACE, in Java.

Page 14: Copy of Presentation 1


Open Source Crawlers:

ASPseek, in C++.

DataparkSearch, under the GNU GPL.

GNU Wget, under the GPL.

GRUB, used by Wikia Search.

Heritrix, in Java.

ICDL Crawler, in C++; and many more.

Page 15: Copy of Presentation 1


Demo Program:

Page 16: Copy of Presentation 1


NUTCH:

Nutch is an open-source web crawler.

Nutch Web Search Application.

It maintains a database of pages and links.

Pages have scores, assigned by analysis.

It fetches high-scoring, out-of-date pages (a rough sketch of that selection step follows at the end).

Distributed search front end.

Based on Lucene.

http://lucene.apache.org/nutch/
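To make the "fetch high-scoring, out-of-date pages" step concrete, here is a rough, hypothetical sketch of that selection logic in Python. It is not Nutch code (Nutch is written in Java and keeps this information in its crawl database); the record fields and the 30-day refresh interval are invented for illustration.

```python
# A hypothetical sketch (not Nutch code) of "fetch high-scoring,
# out-of-date pages": from a database of known pages, pick the ones whose
# copy is older than a refresh interval, highest score first.
import time

REFRESH_INTERVAL = 30 * 24 * 3600   # assumed re-fetch interval: 30 days


def select_fetch_list(page_db, now=None, limit=1000):
    """page_db: iterable of dicts like {"url": ..., "score": ..., "last_fetched": ...}."""
    now = time.time() if now is None else now
    out_of_date = [p for p in page_db
                   if now - p["last_fetched"] > REFRESH_INTERVAL]
    # Highest-scoring pages are fetched first.
    out_of_date.sort(key=lambda p: p["score"], reverse=True)
    return [p["url"] for p in out_of_date[:limit]]
```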