
Seminar Report on
Search Engine and Web Crawler

Mehta Ishani
130040701003


Abstract

The World Wide Web is a rapidly growing and changing information source. Because of its dynamic nature, it is becoming harder to find relevant and recent information on it.

Search engines are the primary gateways of information access on the Web. Today they have become a day-to-day necessity for most people, whether for navigating the internet or for finding anything at all, and they answer millions of queries every day. Whatever comes to mind, we simply enter a keyword or a combination of keywords to trigger the search and get relevant results in seconds, without knowing the technology behind it. I searched for "search engine" and it returned 68,900 results; in addition, the engine returned some sponsored results down the side of the page, as well as a spelling suggestion, all in 0.36 seconds. For popular queries the engine is even faster: for example, searches for World Cup or dance shows (both recent events) took less than 0.2 seconds each.

To engineer a search engine is a challenging task, and the web crawler is an indispensable part of it. A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with those URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. They are also used in many other applications that process large numbers of web pages, such as web data mining, comparison shopping engines, and so on.


Introduction to Search Engine

A search engine is a tool that allows people to find information on the World Wide Web. It is a website you can use to look up web pages, like the yellow pages for the Internet. More formally, a web search engine is a software system designed to search for information on the World Wide Web.

Assume you are reading a book and want to find references to a specific word in it. What do you do? You turn to the end of the book and look in the index! There you locate the word, find the page numbers mentioned, and flip to the corresponding pages. Search engines work in a similar way.

Figure 1: Telephone directory

Search engines are constantly building and updating their index to the World Wide Web. They do this by using "spiders" that "crawl" the web and fetch web pages. The words used in these web pages are then added to the index, along with a record of where each word came from. [1]
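To make the index idea concrete, here is a minimal inverted-index sketch in Java, written for this report as an illustration (the class name, the example URLs, and the word-splitting rule are my own assumptions, not any engine's actual code). Each word maps to the set of pages it was seen on, just like the index at the back of a book.

    import java.util.*;

    // Minimal inverted index: each word maps to the set of pages it came from.
    public class TinyIndex {
        private final Map<String, Set<String>> index = new HashMap<>();

        // Add every word of a fetched page to the index, remembering its source.
        public void addPage(String url, String text) {
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new HashSet<>()).add(url);
                }
            }
        }

        // Look up a word, like flipping to a book's index.
        public Set<String> lookup(String word) {
            return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
        }

        public static void main(String[] args) {
            TinyIndex idx = new TinyIndex();
            idx.addPage("http://example.com/a", "web crawlers fetch web pages");
            idx.addPage("http://example.com/b", "search engines index pages");
            System.out.println(idx.lookup("pages")); // prints both example URLs
        }
    }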


How stuff works?

A search engine operates in the following order:

1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated browser which follows every link on a site. The search engine then analyzes the contents of each page to determine how it should be indexed (for example, words can be extracted from the titles, page content, headings, or special fields called meta tags).

Figure 2: Working flow of a search engine

When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of the best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data.

Since 2007, the Google.com search engine has allowed users to search by date by clicking 'Show search tools' in the leftmost column of the initial search results page and then selecting the desired date range.

Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches and allow the user to refine and extend the terms of the search.
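As a rough illustration (not any particular engine's implementation), Boolean operators can be modelled as set operations over the posting lists of an inverted index; the names and sample pages below are made up:

    import java.util.*;

    // Boolean operators as set operations over posting lists (illustrative
    // sketch; real engines use optimized inverted-index intersection).
    public class BooleanQuery {
        // AND: pages containing both terms (intersection).
        static Set<String> and(Set<String> a, Set<String> b) {
            Set<String> r = new HashSet<>(a); r.retainAll(b); return r;
        }
        // OR: pages containing either term (union).
        static Set<String> or(Set<String> a, Set<String> b) {
            Set<String> r = new HashSet<>(a); r.addAll(b); return r;
        }
        // NOT: pages in the whole collection that lack the term (difference).
        static Set<String> not(Set<String> all, Set<String> a) {
            Set<String> r = new HashSet<>(all); r.removeAll(a); return r;
        }

        public static void main(String[] args) {
            Set<String> cat = Set.of("p1", "p2"), dog = Set.of("p2", "p3");
            System.out.println(and(cat, dog)); // [p2]: cat AND dog
            System.out.println(or(cat, dog));  // p1, p2, p3: cat OR dog
        }
    }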

Natural language queries, in contrast, allow the user to type a question in the same form one would ask it of a human; ask.com is an example of such a site.

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first.

Search engines that do not accept money for their search results make money by running search-related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads. [2]


Major Search Engines - A Comparison

Today there are many search engines available to web searchers. What makes one search engine different from another? The following are some important measures. [3]

The contents of a search engine's database are a crucial factor in determining whether or not we will succeed in finding the information we need, because when we search, we are not actually searching the Web directly. Rather, we are searching a cache of the Web: a database that contains information about all the Web sites visited by that search engine's spider or crawler.

Size is also an important measure: how many Web pages has the spider visited, scanned, and stored in the database? Some of the larger search engines have databases covering over three billion Web pages, while the databases of smaller search engines cover half a billion or fewer.

Another important measure is how up to date the database is. As we know, the Web is continuously changing and growing: new websites appear, old sites vanish, and existing sites modify their content. The information stored in the database becomes out of date unless the search engine's spider keeps up with these changes.

In addition to these, the ranking algorithm used by the search engine determines whether the most relevant search results appear towards the top of the results list.

Figure 3: Google logo

Google has been in the search game a long time and has the highest market share among search engines (about 81%) [3].

1) Its web-crawler-based service provides both comprehensive coverage of the Web and great relevancy.
2) Google is much better than the other engines at determining whether a link is an artificial link or a true editorial link.


3) Google gives much importance to sites which add fresh content on a regular basis. This is why Google likes blogs, especially popular ones.
4) Google prefers informational pages to commercial sites.
5) A page on a site, or on a subdomain of a site, with significant age or links can rank much better than it should, even with no external citations.
6) It has aggressive duplicate-content filters that filter out many pages with similar content.
7) Crawl depth is determined not only by link quantity but also by link quality. Excessive low-quality links may make your site less likely to be crawled deeply or even included in the index.
8) In addition, we can search for twelve different file formats, cached pages, images, news, and Usenet group postings.

Figure 4: Yahoo logo

Yahoo has been in the search game for many years [3].

1) It has the second-largest market share among search engines (about 12%).
2) When it comes to counting backlinks, Yahoo is the most accurate search engine.
3) Yahoo is better than MSN, but not nearly as good as Google, at determining whether a link is artificial or natural.
4) The crawl rate of Yahoo's spiders is at least 3 times faster than that of Google's spiders.
5) Yahoo! tends to prefer commercial pages to informational pages, compared with Google.
6) The Yahoo search engine gives more importance to "exact matching" than to "concept matching", which makes it slightly more susceptible to spamming.
7) Yahoo! gives more importance to meta keywords and description tags.


Figure 5: MSN logo

1) MSN has a share of 3% of the total search engine market [3].
2) MSN Search uses its own Web database and also has separate News, Images, and Local databases.
3) Its strengths include this large unique database, its query-building "Search Builder" and Boolean searching, cached copies of Web pages including the date cached, and automatic local search options.
4) The spider crawls only the beginning of pages (as opposed to the other two search engines, which crawl the entire content), and the number of pages found in its index or database is extremely low.
5) It is bad at determining whether a link is natural or artificial in nature.
6) Because it is weak at link analysis, it places too much weight on page content.
7) New sites that are generally untrusted in other systems can rank quickly in MSN Search, but this also makes it more susceptible to spam.
8) Another downside of this search engine is its habit of supplying results based on geo-targeting, which makes it extremely hard to determine whether the results we see are the same ones everybody else sees.

Figure 6: Ask Jeeves logo

1) The Ask search engine has the lowest share (about 1%) of the total search engine market [3].
2) Ask is a topical search site; it gives more importance to sites that are linked to topical communities.
3) Ask is more susceptible to spamming.
4) Because Ask is smaller and more specialized than the other search engines, it is wise to approach it more from a networking or marketing perspective.


Figure 7: Live Search logo

1) Launched in September 2006.
2) Live Search (formerly Windows Live Search) is the name of Microsoft's web search engine, the successor to MSN Search, designed to compete with the industry leaders Google and Yahoo.
3) It also allows the user to save searches and see them updated automatically on Live.com.

Figure 8: Bing logo

1) Launched in June 2009 by Microsoft as the successor to Live Search (and, before that, MSN Search).
2) Features like 'wiki' suggestions, 'visual search', and 'related searches' can be very useful.


Introduction to Web Crawler

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. It is also known as a "spider" or a "bot" (short for "robot").

Spider – a program that, like a browser, downloads web pages.

Crawler – a program that automatically follows the links on web pages.

Robot – an automated computer program that can visit websites, guided by search engine algorithms; it combines the tasks of crawler and spider, helping to index web pages for the search engines. [4]

Why Crawlers?

Figure 9: Result of searching the term "web crawler" in Google

Crawling means gathering pages from the internet in order to index them. It has two main objectives:

• fast gathering
• efficient gathering [5]

The internet has a wide expanse of information, and finding relevant information requires an efficient mechanism. Web crawlers provide that capability to the search engine.


Features

Features a crawler must provide

Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.

Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
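A politeness policy can be as simple as enforcing a minimum delay between two requests to the same host. The Java sketch below is a hypothetical illustration of that idea only; the class name and the one-second delay are assumptions, not taken from this report or any real crawler:

    import java.util.*;

    // Per-host politeness: enforce a minimum delay between two requests
    // to the same host (illustrative sketch; the delay value is made up).
    public class PolitenessGate {
        private final long minDelayMs;
        private final Map<String, Long> nextAllowed = new HashMap<>();

        public PolitenessGate(long minDelayMs) { this.minDelayMs = minDelayMs; }

        // Block until it is polite to fetch from this host again.
        public void acquire(String host) throws InterruptedException {
            long wait;
            synchronized (this) {
                long now = System.currentTimeMillis();
                long slot = Math.max(now, nextAllowed.getOrDefault(host, 0L));
                nextAllowed.put(host, slot + minDelayMs); // reserve the next slot
                wait = slot - now;
            }
            if (wait > 0) Thread.sleep(wait);             // sleep outside the lock
        }

        public static void main(String[] args) throws InterruptedException {
            PolitenessGate gate = new PolitenessGate(1000); // ~1 request/sec per host
            gate.acquire("example.com");
            gate.acquire("example.com");                    // sleeps about 1 second
            System.out.println("two polite fetches done");
        }
    }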

Features a crawler should provide

Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.

Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.

Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.

Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching "useful" pages first.

Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages.

Extensible: Crawlers should be designed to be extensible in many ways, to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular. [5]


Architecture of Crawler

Flow of a basic sequential crawler

Web crawlers are mainly used to index the links of all the visited pages for later processing by a search engine. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and store downloaded pages in a large database that is later indexed for efficient execution of user queries. Despite the numerous applications for web crawlers, at the core they are all fundamentally the same. The following is the process by which web crawlers work [6]:

1) Download the web page.
2) Parse through the downloaded page and retrieve all the links.
3) For each link retrieved, repeat the process.

Figure 10 shows the flow of a basic sequential crawler. The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs, which may be provided by a user or another program. Each crawling loop involves picking the next URL to crawl from the frontier, fetching the corresponding page through HTTP, parsing the retrieved page to extract its URLs and any application-specific information, and finally adding the unvisited URLs to the frontier.

Before the URLs are added to the frontier, they may be assigned a score that represents the estimated benefit of visiting the corresponding page. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead end: the crawler has no new page to fetch and hence it stops. [6]
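The sketch below is one minimal Java rendering of this loop, using only the standard library (Java 11+). The seed URL, the 50-page stopping limit, and the regex-based link extraction are illustrative assumptions; the report's own code is not reproduced here.

    import java.net.URI;
    import java.net.http.*;
    import java.util.*;
    import java.util.regex.*;

    // Minimal sequential crawler: a frontier of unvisited URLs, fetch, parse
    // out links, enqueue the unvisited ones. Illustrative sketch only: a real
    // crawler also needs politeness, robots.txt handling, and URL scoring.
    public class SequentialCrawler {
        private static final Pattern HREF =
            Pattern.compile("href=[\"'](http[^\"'#]+)[\"']");

        public static void main(String[] args) throws Exception {
            Deque<String> frontier = new ArrayDeque<>();        // unvisited URLs
            Set<String> visited = new HashSet<>();
            frontier.add("http://example.com/");                // seed URL (hypothetical)
            HttpClient client = HttpClient.newHttpClient();

            while (!frontier.isEmpty() && visited.size() < 50) { // termination condition
                String url = frontier.poll();                    // pick next URL
                if (!visited.add(url)) continue;                 // skip already-seen URLs
                try {
                    HttpResponse<String> resp = client.send(
                        HttpRequest.newBuilder(URI.create(url)).build(),
                        HttpResponse.BodyHandlers.ofString());
                    Matcher m = HREF.matcher(resp.body());       // extract hyperlinks
                    while (m.find()) {
                        String link = m.group(1);
                        if (!visited.contains(link)) frontier.add(link);
                    }
                    System.out.println("Fetched: " + url);
                } catch (Exception e) {
                    // dead link or fetch error: skip it and keep crawling
                }
            }
        }
    }

Because the frontier is used FIFO here (add at the tail, poll from the head), this loop visits pages breadth-first, which matches the strategy this report's implementation uses.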


Figure 10: Flow of a basic sequential crawler

The multi-threaded crawler model needs to deal with an empty frontier just like a sequential crawler [6].


Figure 11: A multi-threaded crawler model

High-level architecture

Here, the multi-threaded downloader fetches the web pages from the WWW, and parsers decompose the web pages into URLs, contents, titles, etc. The URLs are queued and sent to the downloader using a scheduling algorithm. The downloaded data are stored in a database [7].
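A hypothetical skeleton of this multi-threaded model is sketched below: a fixed pool of downloader threads shares one thread-safe frontier, and the fetch/parse step is stubbed out since the report's parser is not shown. The names, the pool size of four, and the two-second empty-frontier timeout are all assumptions.

    import java.util.*;
    import java.util.concurrent.*;

    // Multi-threaded crawler skeleton: N downloader threads share one
    // thread-safe frontier queue (illustrative sketch; fetching is stubbed).
    public class ThreadedCrawler {
        public static void main(String[] args) throws Exception {
            BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
            Set<String> visited = ConcurrentHashMap.newKeySet();
            frontier.add("http://example.com/");             // seed URL (hypothetical)

            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 4; i++) {
                pool.submit(() -> {
                    while (true) {
                        // The empty frontier must be handled, just as in the
                        // sequential model: here a worker simply gives up after
                        // waiting 2 seconds with nothing to crawl (a crude rule).
                        String url;
                        try {
                            url = frontier.poll(2, TimeUnit.SECONDS);
                        } catch (InterruptedException e) {
                            return;
                        }
                        if (url == null) return;             // frontier stayed empty
                        if (!visited.add(url)) continue;     // another thread took it
                        for (String link : fetchAndExtract(url)) {
                            if (!visited.contains(link)) frontier.add(link);
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }

        // Stub: download 'url' and return the hyperlinks found in it.
        static List<String> fetchAndExtract(String url) {
            System.out.println("Fetching: " + url);
            return Collections.emptyList();
        }
    }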


Figure 12: High-level architecture of a web crawler

The design of the downloader scheduler algorithm is crucial: too many downloader objects will exhaust resources and make the system slow, while too few downloaders will degrade performance. The scheduler algorithm is as follows (a sketch of its busy/free bookkeeping appears after the list) [7]:

1) The system allocates a pre-defined number of downloader objects.
2) The user inputs a new URL to start the crawler.
3) If any downloader is busy and there are new URLs to be processed, a check is made to see whether any downloader object is free. If one is, assign the new URL to it and set its status to busy; otherwise, go to step 6.
4) After a downloader object downloads the contents of a web page, set its status to free.
5) If any downloader object runs longer than an upper time limit, abort it and set its status to free.
6) If there are more than the predefined number of downloaders, or if all the downloader objects are busy, allocate new threads and distribute the downloaders among them.
7) Continue allocating new threads and free threads to the downloaders until the number of downloaders falls below the threshold value, provided the number of threads in use is kept under a limit.
8) Go to step 3.
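The sketch below shows one way the busy/free bookkeeping of steps 3 to 5 could look in Java; it is an assumption-laden illustration, not the report's code. A semaphore stands in for the pool of downloader objects, and a timed Future.get models aborting a downloader that exceeds the time limit.

    import java.util.concurrent.*;

    // Sketch of the scheduler's busy/free bookkeeping: a semaphore models
    // the pre-defined pool of downloaders (steps 1 and 3), and a timed
    // Future.get models aborting a slow downloader (step 5). The pool size
    // and time limit are made-up values.
    public class DownloaderScheduler {
        private static final int DOWNLOADERS = 3;        // pre-defined pool size
        private static final long TIME_LIMIT_SEC = 10;   // upper time limit

        private final Semaphore free = new Semaphore(DOWNLOADERS);
        private final ExecutorService threads = Executors.newCachedThreadPool();

        public void schedule(String url) throws InterruptedException {
            free.acquire();                              // wait for a free downloader
            Future<?> task = threads.submit(() -> download(url));
            threads.submit(() -> {                       // watcher enforces the limit
                try {
                    task.get(TIME_LIMIT_SEC, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    task.cancel(true);                   // abort the slow download
                } catch (Exception ignored) {
                } finally {
                    free.release();                      // mark the downloader free
                }
            });
        }

        private void download(String url) {
            System.out.println("Downloading " + url);    // stub for the real fetch
        }

        public static void main(String[] args) throws InterruptedException {
            DownloaderScheduler s = new DownloaderScheduler();
            s.schedule("http://example.com/");           // hypothetical start URL
            Thread.sleep(500);
            s.threads.shutdown();                        // let the JVM exit
        }
    }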


Crawling Strategies

There are mainly four types of crawling strategies, as described below [8]:

1) Breadth-First Crawling

Figure 13: Breadth-first crawling

This algorithm starts at the root URL and searches all the neighbouring URLs at the same level. If the goal is reached, it reports success and the search terminates. If it is not, the search proceeds down to the next level, sweeping across the neighbouring URLs at that level, and so on until the goal is reached. If all the URLs have been searched but the objective is not met, failure is reported.

2) Depth-First Crawling

Figure 14: Depth-first crawling

This algorithm starts at the root URL and traverses deeper through the child URLs. If there is more than one child, priority is given to the left-most child, and the crawler traverses deep until no more children are available. It then backtracks to the next unvisited node and continues in a similar manner.
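In a loop like the sequential crawler sketch earlier, the only change between these two strategies is the frontier discipline: a FIFO queue yields breadth-first order, while a LIFO stack yields depth-first order. The tiny demo below (with placeholder URLs "a" and "b") illustrates this:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // The frontier discipline alone decides the strategy (illustrative demo):
    // FIFO order -> breadth-first; LIFO order -> depth-first.
    public class FrontierDemo {
        public static void main(String[] args) {
            Deque<String> frontier = new ArrayDeque<>();
            frontier.addLast("a"); frontier.addLast("b");
            System.out.println(frontier.pollFirst()); // "a": FIFO, breadth-first

            frontier.clear();
            frontier.addFirst("a"); frontier.addFirst("b");
            System.out.println(frontier.pollFirst()); // "b": LIFO, depth-first
        }
    }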

3) Repetitive Crawling

Once pages have been crawled, some systems require the process to be repeated periodically so that the indexes are kept updated. This may be achieved by launching a second crawl in parallel; to overcome the overhead this creates, the "Index List" should be constantly updated.

4) Targeted Crawling

Here the main objective is to retrieve the greatest number of pages relating to a particular subject while using the minimum bandwidth. Most search engines use heuristics in the crawling process in order to target certain types of pages on a specific topic.

Crawling Policies

Two characteristics of the Web make crawling difficult:

1) its large volume, and
2) its fast rate of change.

To deal with these difficulties, a web crawler follows these policies [5]:

• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes in pages.
• A politeness policy that states how to avoid overloading web sites.
• A parallelization policy that states how to coordinate distributed web crawlers.


Implementation

I have developed a web crawler application in Java that works on the Windows operating system. It can be run from NetBeans or any Java-compatible IDE. For database connectivity it uses MySQL through a WAMP server interface. The proposed web crawler uses breadth-first crawling to search the links and is deployed on a client machine.

Once the IDE is started and the program is run, an automated browsing process is initiated. The HTML contents of the rediffmail.com homepage are given to the parser, which puts them in a suitable format as described above; the URLs found in the HTML page are listed and stored in the frontier. The URLs are then picked up from the frontier and each is assigned to a downloader, whose status (busy or free) can be tracked. After a page is downloaded it is added to the database, and the downloader is set free (i.e. released). The implementation details are given in Table 1.
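The database code itself is not reproduced in this report, but a plausible sketch of storing a downloaded page into MySQL over JDBC might look like the following; the pages(url, content) table, the crawler schema, and the WAMP-style default credentials are assumptions for illustration only.

    import java.sql.*;

    // Hypothetical sketch of persisting a crawled page into MySQL via JDBC.
    // The table 'pages(url, content)', the schema name, and the credentials
    // are made up; requires the MySQL Connector/J driver on the classpath.
    public class PageStore {
        public static void save(String url, String content) throws SQLException {
            String dsn = "jdbc:mysql://localhost:3306/crawler";
            try (Connection con = DriverManager.getConnection(dsn, "root", "");
                 PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO pages (url, content) VALUES (?, ?)")) {
                ps.setString(1, url);
                ps.setString(2, content);
                ps.executeUpdate();              // one row per downloaded page
            }
        }
    }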

Figure 15: Main program of the web crawler application


Figure 16: Output in the IDE

Table 1: Functionality of the web crawler application on the client machine.

Feature                                          Support
Search for a search string                       Yes
Help manual                                      No
Integration with other applications              Yes
Specifying case sensitivity for a search string  No
Specifying start URL                             Yes
Support for breadth-first crawling               Yes
Check for validity of URL specified              Yes


Figure 17: Webpage content in the database


Conclusion

The web crawler forms the backbone of applications that facilitate Web information retrieval. In this report I have presented the architecture and implementation details of my crawling system, which can be deployed on a client machine to browse the web concurrently and autonomously. It combines the simplicity of an asynchronous downloader with the advantage of using multiple threads. It reduces the consumption of resources, since it is not implemented on mainframe servers as other crawlers are, which also reduces server management. The proposed architecture uses the available resources efficiently to accomplish a task otherwise done by high-cost mainframe servers.

A major open issue for future work is a detailed study of how the system could become even more distributed while retaining the quality of the content of the crawled pages. Due to the dynamic nature of the Web, the average freshness or quality of the downloaded pages needs to be checked; the crawler can be enhanced to check this, to detect links written in JavaScript or VBScript, and to support file formats such as XML, RTF, PDF, Microsoft Word, and Microsoft PowerPoint.

References

[1] "Basic search handout", www.digitallearn.org
[2] "Web search engine", www.wikipedia.org
[3] Krishan Kant Lavania, Sapna Jain, Madhur Kumar Gupta, and Nicy Sharma, "Google: A Case Study (Web Searching and Crawling)", International Journal of Computer Theory and Engineering, Vol. 5, No. 2, April 2013.
[4] "Web crawler", www.wikipedia.org
[5] "Web crawling and indexes", online edition, Cambridge University Press, April 1, 2009.
[6] G. Pant, P. Srinivasan, and F. Menczer, "Crawling the Web".
[7] Rajashree Shettar and Shobha G., "Web Crawler on Client Machine", IMECS 2008, Vol. II, 19-21 March 2008, Hong Kong.
[8] Rashmi Janbandhu, Prashant Dahiwale, and M. M. Raghuwanshi, "Analysis of Web Crawling Algorithms", International Journal on Recent and Innovation Trends in Computing and Communication, ISSN 2321-8169, Vol. 2, Issue 3.