Copy of Presentation 1


Page 1: Copy of Presentation 1


PRESENTED BY: Buddaraju Akhila Devi (502)

Page 2: Copy of Presentation 1


Contents:

What is a web crawler?

History

How does a web crawler work?

Crawling Applications

Page 3: Copy of Presentation 1


What is a web crawler?

Also known as a Web spider or Web robot.

Mainly used by search engines to download pages from the web for later processing by a search engine that will index the downloaded pages to provide fast searches.

It crawls, or visits, web pages and downloads them (a minimal sketch of this step follows after this list).

Starting from one page, it determines which page(s) to go to next.

It creates a copy of the visited pages for later use (indexing).
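As an illustration of the download step (not part of the original slides), here is a minimal Python sketch that fetches one page and extracts its hyperlinks using only the standard library; the example URL is arbitrary.

```python
# A minimal sketch of "download a page and find its links",
# using only the Python standard library. The URL is just an example.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def fetch_links(url):
    """Download one page and return its HTML and the hyperlinks it contains."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return html, parser.links


if __name__ == "__main__":
    page, links = fetch_links("https://example.com/")
    print(f"Downloaded {len(page)} characters, found {len(links)} links")
```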

Page 4: Copy of Presentation 1


History Of Web Crawlers:

1994: created by Brian Pinkerton (WebCrawler).

1994: launched as a web application with only a few sites in its database.

1994: DealerNet and Starwave contributed support for advertising and searching.

1995: America Online began using it and continued its development.

1995: Spidey was introduced, with design changes.

1996: functionality was extended from pure search to a human-edited guide, GNN Select.

1997: WebCrawler passed to Excite.

2001: InfoSpace operated it as a meta-search engine.

Page 5: Copy of Presentation 1


Architecture Of Web Crawler:

Page 6: Copy of Presentation 1


How does a web crawler work?

It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs still to visit, called the crawl frontier.

URLs from the frontier are recursively visited according to a set of policies. (A minimal sketch of this loop follows below.)
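The seed-and-frontier loop described above can be sketched in a few lines of Python. This is only an illustration, not the design of any particular crawler; `fetch_links` is assumed to be a helper like the one sketched earlier that downloads a page and returns its outgoing links.

```python
# A minimal sketch of the crawl loop: start from seed URLs, keep a
# frontier of URLs still to visit, and record each visited page.
# `fetch_links(url)` is assumed to return (page_html, outgoing_links).
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    frontier = deque(seeds)          # URLs still to visit (the crawl frontier)
    visited = set()                  # URLs already fetched
    store = {}                       # copies of visited pages, kept for indexing

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            page, links = fetch_links(url)
        except Exception:
            continue                 # skip pages that fail to download
        visited.add(url)
        store[url] = page
        # Newly discovered hyperlinks are added to the frontier.
        frontier.extend(link for link in links if link not in visited)

    return store
```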

Page 7: Copy of Presentation 1


Uses:

Collect information.

Textual analysis.

Update data.

Automate maintenance.

Harvest e-mail addresses.

Mirroring, visualization.

Illegal uses.

Difficulties:

Large volume.

Dynamic page generation.

Fast rate of change.

Page 8: Copy of Presentation 1


Policies Used:

Selection policy: which pages to download and store.

Re-visit policy: when to check pages for changes.

Politeness policy: how to avoid overloading web sites.

Parallelization policy: how to coordinate distributed web crawlers.

Behavior Of Web Crawler:

Page 9: Copy of Presentation 1


Selection Policy:

Pages are not sampled at random.

Relevant pages are selected by prioritizing them.

Page importance (popularity) is calculated, e.g. with Abiteboul's OPIC (On-line Page Importance Computation) algorithm. (A simplified sketch follows below.)

Daneshpajouh et al. proposed an algorithm for discovering good seed pages.
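As an illustration of priority-based selection, here is a rough sketch of the OPIC idea: every page holds some "cash", and when a page is fetched its cash is distributed evenly among the pages it links to, so pages that many others point at accumulate cash and are fetched first. This is a simplification under assumed helpers (`fetch_outlinks`), not a faithful implementation of Abiteboul's algorithm.

```python
# A rough, simplified sketch of OPIC-style prioritization: each page holds
# "cash"; fetching a page distributes its cash to its outlinks, and the
# frontier always yields the unvisited page with the most accumulated cash.
# `fetch_outlinks(url)` is an assumed helper that downloads a page and
# returns the URLs it links to.
from collections import defaultdict


def opic_crawl(seeds, fetch_outlinks, max_pages=100):
    cash = defaultdict(float)
    for seed in seeds:
        cash[seed] = 1.0 / len(seeds)        # initial cash split among the seeds

    visited = set()
    while cash and len(visited) < max_pages:
        # Pick the unvisited page with the highest accumulated cash.
        url = max(cash, key=cash.get)
        amount = cash.pop(url)
        if url in visited:
            continue
        visited.add(url)

        outlinks = fetch_outlinks(url)
        if not outlinks:
            continue
        # Distribute this page's cash evenly among its outlinks.
        share = amount / len(outlinks)
        for link in outlinks:
            if link not in visited:
                cash[link] += share

    return visited
```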

Page 10: Copy of Presentation 1


Re-visit Policy:

The cost functions used are freshness and age (sketched below).

Coffman et al. restated the crawler's objective in terms of freshness: the crawler should minimize the fraction of time that pages remain outdated.

The re-visiting problem can be modeled as a multiple-queue, single-server polling system.
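The two cost functions named above are usually defined per page: freshness is 1 while the local copy still matches the live page and 0 once the page changes, and age is how long the local copy has been out of date. A small sketch of those definitions (following the common Cho and Garcia-Molina formulation) is below; the timestamps are assumed inputs, since in practice the modification time must be estimated.

```python
# A small sketch of the two re-visit cost functions for a page at time t.
# `last_crawled` is when our copy was fetched; `last_modified` is the most
# recent time the live page changed (assumed known here).


def freshness(last_crawled, last_modified):
    """1 if the stored copy is still up to date (page unchanged since the crawl)."""
    return 1 if last_modified <= last_crawled else 0


def age(t, last_crawled, last_modified):
    """How long the stored copy has been outdated at time t (0 if still fresh)."""
    if last_modified <= last_crawled:
        return 0
    return t - last_modified
```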

Page 11: Copy of Presentation 1


Politeness Policy:

Crawlers can retrieve data much faster, and in greater depth, than human searchers.

Even a single crawler issuing many requests per second places a significant load on a server.

The costs of crawling include network resources, server overload, poorly written crawlers, and personal crawlers. (A small politeness sketch follows below.)
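One common way to implement politeness (an illustration, not something from the slides) is to honour robots.txt and wait a fixed delay between requests to the same host. The sketch below uses Python's standard urllib.robotparser; the one-second delay and the user-agent string are assumed values.

```python
# A small sketch of two common politeness measures: obey robots.txt and
# wait a fixed delay between successive requests to the same host.
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleCrawler"   # hypothetical crawler name
CRAWL_DELAY = 1.0               # assumed delay (seconds) between requests to one host

_last_request = {}              # host -> time of our last request to it
_robots = {}                    # host -> parsed robots.txt for that host


def allowed(url):
    """Check robots.txt before fetching a URL."""
    host = urlparse(url).netloc
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{urlparse(url).scheme}://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            # If robots.txt cannot be fetched at all, the parser stays unread
            # and can_fetch() will conservatively return False for this host.
            pass
        _robots[host] = rp
    return _robots[host].can_fetch(USER_AGENT, url)


def wait_politely(url):
    """Sleep so that requests to one host are at least CRAWL_DELAY apart."""
    host = urlparse(url).netloc
    elapsed = time.time() - _last_request.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    _last_request[host] = time.time()
```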

Page 12: Copy of Presentation 1


Parallelization Policy:

The crawler runs multiple crawling processes in parallel.

Objective: maximize the download rate while minimizing the overhead of parallelization.

Repeated downloads of the same page must be avoided. (A small sketch follows below.)
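A very small sketch of parallel crawling (an illustration only): several worker threads download pages concurrently, while a shared, locked `visited` set ensures the same page is not fetched twice. `fetch_links` is again an assumed download helper returning a page and its outgoing links.

```python
# A minimal sketch of parallel crawling: worker threads fetch pages
# concurrently, and a shared, locked `visited` set prevents the same
# URL from being downloaded twice.
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Empty, Queue


def parallel_crawl(seeds, fetch_links, workers=4, max_pages=100):
    frontier = Queue()
    for seed in seeds:
        frontier.put(seed)

    visited = set()
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get(timeout=2)      # give up when the frontier stays empty
            except Empty:
                return
            with lock:
                if url in visited or len(visited) >= max_pages:
                    continue
                visited.add(url)                   # claim the URL before fetching it
            try:
                _, links = fetch_links(url)
            except Exception:
                continue
            for link in links:
                with lock:
                    if link not in visited:
                        frontier.put(link)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

    return visited
```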

Page 13: Copy of Presentation 1


Examples (Crawler Architectures):

RBSE, the first published web crawler.

Slurp (Yahoo!).

Bingbot (Microsoft), which replaced Msnbot.

FAST Crawler (a distributed crawler).

Googlebot, in C++ and Python.

PolyBot, in C++ and Python.

World Wide Web Worm (indexing via the UNIX grep command).

WebFountain, in C++ (distributed, with a controller machine).

WebRACE, in Java.

Page 14: Copy of Presentation 1


Open Source Crawlers:

ASPseek, in C++.

DataparkSearch, under the GNU GPL.

GNU Wget, under the GPL.

GRUB, used by Wikia Search.

Heritrix, in Java.

ICDL Crawler, in C++; and many more.

Page 15: Copy of Presentation 1


Demo Program:

Page 16: Copy of Presentation 1


NUTCH:

Nutch is an open-source web crawler.

Nutch Web Search Application.

It maintains a database of pages and links.

Pages have scores, assigned by analysis.

It fetches high-scoring, out-of-date pages (a rough sketch of that selection step follows at the end).

Distributed search front end.

Based on Lucene.

http://lucene.apache.org/nutch/
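To make the "fetch high-scoring, out-of-date pages" step concrete, here is a rough, hypothetical sketch of that selection logic in Python. It is not Nutch code (Nutch is written in Java and keeps this information in its crawl database); the record fields and the 30-day refresh interval are invented for illustration.

```python
# A hypothetical sketch (not Nutch code) of "fetch high-scoring,
# out-of-date pages": from a database of known pages, pick the ones whose
# copy is older than a refresh interval, highest score first.
import time

REFRESH_INTERVAL = 30 * 24 * 3600   # assumed re-fetch interval: 30 days


def select_fetch_list(page_db, now=None, limit=1000):
    """page_db: iterable of dicts like {"url": ..., "score": ..., "last_fetched": ...}."""
    now = time.time() if now is None else now
    out_of_date = [p for p in page_db
                   if now - p["last_fetched"] > REFRESH_INTERVAL]
    # Highest-scoring pages are fetched first.
    out_of_date.sort(key=lambda p: p["score"], reverse=True)
    return [p["url"] for p in out_of_date[:limit]]
```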