
  • 8/14/2019 18363882 Search Engine With Web Crawler


Search Engine with Web Crawler

Abstract

Project : SEARCH ENGINE WITH WEB CRAWLER
Front End : Core Java, JSP
Back End : File system & MySQL server
Web Server : Tomcat web server

This project is an attempt to implement a search engine with a web crawler, so as to demonstrate how it helps people search the Web faster. A search engine is an information retrieval system designed to help find information stored on a computer system. The most public, visible form of a search engine is a Web search engine, which searches for information on the World Wide Web. Search engines provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. A web crawler (also called a Web spider or Web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering. Search engines use spidering as a means of providing up-to-date data. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.

WebCrawler was the Internet's first search engine that performed keyword searches in both the names and the texts of pages on the World Wide Web. WebCrawler's search engine performed two basic functions. First, it compiled an ongoing index of web addresses (URLs). WebCrawler retrieved and marked a document, analyzed the content of both its title and its full text, registered the relevant links it contained, and then stored the information in its database. When the user submitted a query in the form of one or more keywords, WebCrawler compared it with the information in its index and reported back any matches. WebCrawler's second function was searching the Internet in real time for the sites that matched a given query. It was carried out using exactly the same process, following links from one page to another.
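The seed-and-frontier loop described above can be sketched in Java. This is a minimal illustration, not the project's code: the in-memory `web` map stands in for actually fetching a page over HTTP and extracting its hyperlinks.

```java
import java.util.*;

// A minimal sketch of the crawl-frontier loop: start from a seed URL,
// visit each URL once, and add newly discovered links to the frontier.
// The `web` map is a stand-in for fetching a page and parsing its links.
public class FrontierSketch {
    public static List<String> crawl(String seed, Map<String, List<String>> web) {
        List<String> visited = new ArrayList<>();
        Deque<String> frontier = new ArrayDeque<>(); // the crawl frontier
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();            // visit the next URL
            visited.add(url);
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                // only enqueue unseen links
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a.html", List.of("b.html", "c.html"),
            "b.html", List.of("c.html", "d.html"));
        System.out.println(crawl("a.html", web)); // prints [a.html, b.html, c.html, d.html]
    }
}
```

Because each URL is added to `seen` exactly once, pages that link to each other do not cause the crawler to loop forever.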

Contents

1 Introduction
1.1 The Motivation
2 System Study
2.1 Proposed System
2.2 Technologies
2.2.1 Java
2.2.2 JDBC (Java Database Connectivity)
2.2.3 Overview of the JDBC Process
2.2.4 Java Server Pages (JSP)
2.2.5 Advantages of JSP
2.2.6 JSP Architecture
3 Modules
3.1 Administrator Side
3.1.1 Page Settings
3.1.2 Log Settings
3.2 Search
3.3 Web Service
4 Working
4.1 Steps used in the implementation of the Search engine
5 System Design
5.1 Data Flow Diagram
5.2 Database Design
6 Conclusion
References

Chapter 1
Introduction

Most people find what they're looking for on the World Wide Web by using search engines like Yahoo!, AltaVista, or Google. It is the search engines that finally bring your website to the notice of prospective customers. Hence it is better to know how these search engines actually work and how they present information to the customer initiating a search. When you ask a search engine to locate information, it is actually searching through the index it has created, not actually searching through the Web. Different search engines produce different rankings because not every search engine uses the same algorithm to search through the indices. Many leading search engines use a form of software program called spiders or crawlers to find information on the Internet and store it for search results in giant databases or indexes. Some spiders record every word on a Web site for their respective indexes, while others only record certain keywords listed in title tags or meta tags.

Search engines use spiders to index websites. When you submit your website pages to a search engine by completing its required submission page, the search engine spider will index your entire site. A spider is an automated program that is run by the search engine system. Search engine indexing collects, parses, and stores the data to facilitate fast and accurate information retrieval. Spiders are unable to index pictures or read text that is contained within graphics, so relying too heavily on such elements is a consideration for online marketers.

WebCrawler was the Internet's first search engine that performed keyword searches in both the names and texts of pages on the World Wide Web. It won quick popularity and loyalty among surfers looking for information. WebCrawler was born in January 1994, during the Web's infancy. It was developed by Brian Pinkerton, a computer science student at the University of Washington, to cope with the complexity of the Web. Pinkerton's application, WebCrawler, could automatically scan the individual sites on the Web, register their content, and create an index that surfers could query with keywords to find Web sites relevant to their interests.

1.1 The Motivation

Primarily, the motivation is our interest in the area of information retrieval. Nowadays, there are many search engines like Google, Yahoo, AltaVista, etc. We are trying to develop a search engine with some of the facilities of the current search engines, such as text search and news search.

Chapter 2
System Study

2.1 Proposed System

In our proposed system, the search engine is implemented using a web crawler. In our search engine, the user can search for text queries. When a query is submitted, it is searched for in the downloaded web pages, and the ranked URLs are listed to the user. The ranking is based on the number of searched words present in each web page. The user also has the option of news fetching using the Yahoo API.

2.2 Technologies

The choice of programming language depends on the needs of the system. Since the application is a web-based system, Java and its technologies are most suitable. In the development of this application, JSP is used for the design of web pages for both the user and the administrator.

2.2.1 Java

Java was introduced by Sun Microsystems in 1995 and instantly created a new sense of the interactive possibilities of the web. Originally it was called Oak. It was mainly developed for building software for consumer electronic devices. Both of the major web browsers include a Java Virtual Machine (JVM), and almost all major operating system developers (IBM, Microsoft and others) have added a Java compiler to their product offerings.

Java is a platform-independent language. It is the first programming language that is not tied to any particular hardware or operating system: programs developed in Java can be executed anywhere, on any system. The Internet helped to propel Java into the forefront of programming, and Java, in turn, has had a profound effect on the Internet. Java is a true object-oriented language, expressly designed for use in the distributed environment of the Internet.

The object model in Java is simple and easy to extend. Java can also be used to build small application modules, or applets, for use as part of a web page; applets make it possible for web page users to interact with the page. Java can be easily incorporated into the web system, and the programs you create are portable in a network. The compiler's output is bytecode, a highly optimized set of instructions designed to be executed by the Java runtime system, so it can be run on any machine with a Java runtime. Translating a Java program into bytecode makes it easier to run the program in a wide variety of environments.

The major features of Java are:

Mainly, Java is platform-independent and portable. Java programs can be moved from one computer system to another, anywhere and anytime. Changes and upgrades in operating systems, processors and system resources will not force any changes in Java programs.

Secondly, Java is a true object-oriented language. Almost everything in Java is an object; all program code and data reside within objects and classes. Java provides many safeguards to ensure reliable code: it makes memory management much easier and has strict compile-time and runtime checking of data types.

Java is designed as a distributed language for creating applications on networks. It has the ability to share both data and programs: Java applications can open and access remote objects on the Internet as easily as they can on a local system. It is a small and simple language, designed to be easy for the professional programmer to learn and use effectively.

The Java environment includes a large number of development tools, and hundreds of classes and methods are part of the Java Standard Library (JSL), also known as the Application Programming Interface (API). The development tools that are part of Java are used as the front end for designing the GUI for the end users. Java is a general-purpose programming language which supports multi-threaded programs, meaning that we need not wait for an application to finish one task before beginning another.

2.2.2 JDBC (Java Database Connectivity)

Practically every J2EE application saves, retrieves and manipulates information stored in a database, using services provided by a J2EE component. A J2EE component supplies database access using Java data objects contained in the JDBC application programming interface (API).

Sun Microsystems, Inc. met this challenge in 1996 with the creation of the JDBC driver and the JDBC API. The JDBC "driver" developed by Sun Microsystems, Inc. wasn't a driver at all: it was a specification that described the detailed functionality of a JDBC driver. The specification required a JDBC driver to be a translator that converted low-level proprietary DBMS messages to and from the low-level messages understood by the JDBC API. This meant Java programmers could use the high-level Java data objects defined in the JDBC API to write routines that interacted with the DBMS. JDBC drivers created by DBMS manufacturers have to:

Open a connection between the DBMS and the J2EE component.
Translate low-level equivalents of SQL statements sent by the J2EE component into messages that can be processed by the DBMS.
Return data that conforms to the JDBC specification to the JDBC driver.
Return information, such as error messages, that conforms to the JDBC specification.
Provide transaction management routines that conform to the JDBC specification.
Close the connection between the DBMS and the J2EE component.

2.2.3 Overview of the JDBC Process

This process is divided into five routines:

Perform connection and authentication to a database server.
Manage transactions.
Move SQL statements to a database engine for preprocessing and execution.
Execute stored procedures.
Inspect and modify the results from SELECT statements.
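As a sketch of how these routines look against the `java.sql` API: the table name `search_log` and its columns below are illustrative assumptions, not the project's actual schema, and actually running the connecting method would need a live MySQL server and its JDBC driver on the classpath.

```java
import java.sql.*;

public class JdbcSketch {
    // Walks the routines above: connect and authenticate, manage a
    // transaction, send a parameterized statement for preprocessing, and
    // execute it. The schema (search_log) is an illustrative assumption.
    public static void logSearch(String url, String user, String pass, String keyword)
            throws SQLException {
        try (Connection con = DriverManager.getConnection(url, user, pass)) { // connect + authenticate
            con.setAutoCommit(false);                                         // manage the transaction ourselves
            try (PreparedStatement ps = con.prepareStatement(insertSql())) {  // send SQL for preprocessing
                ps.setString(1, keyword);
                ps.executeUpdate();                                           // execute
                con.commit();
            } catch (SQLException e) {
                con.rollback();                                               // undo on failure
                throw e;
            }
        }                                                                     // connection closed here
    }

    // The parameterized INSERT itself, separated out so it can be
    // inspected without a database connection.
    static String insertSql() {
        return "INSERT INTO search_log (keyword, searched_at) VALUES (?, NOW())";
    }
}
```

Using a `PreparedStatement` rather than string concatenation lets the database engine preprocess the statement once and also keeps the keyword from being interpreted as SQL.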

2.2.4 Java Server Pages (JSP)

JSP is a technology based on the Java language that enables the development of dynamic websites. JSP was developed by Sun Microsystems to allow server-side development. Based on the Java programming language, JSP offers proven portability and open standards. A JSP document can share data among users, access databases, and do all the things that require server intervention. JSP documents get compiled into Java bytecode, a binary format with fast and efficient runtime capabilities. JSP pages separate the page logic from its design and display. JSP technology is part of the Java technology family. JSP pages are not restricted to any specific platform or web server, and the JSP specification represents a broad spectrum of industry input.

A servlet is a program written in the Java programming language that runs on the server, as opposed to the browser (as applets do). JSP pages are compiled into servlets, so theoretically you could write servlets to support your web-based applications. However, JSP technology was designed to simplify the process of creating pages by separating web presentation from web content. In many applications, the response sent to the client is a combination of template data and dynamically generated data. In this situation, it is much easier to work with JSP pages than to do everything with servlets.

The JSP 2.1 specification is an important part of the Java EE 5 platform. There are a number of JSP technology implementations for different web servers. JSP technology is the result of industry collaboration and is designed to be an open, industry-standard method supporting numerous servers, browsers and tools. JSP technology speeds development with reusable components and tags, instead of relying heavily on scripting within the page itself. All JSP implementations support a Java programming language based scripting language, which provides inherent scalability and support for complex operations.

A JSP page is a page created by the web developer that includes JSP technology-specific and custom tags, in combination with other static (HTML or XML) tags. A JSP page has the extension .jsp or .jspx; this signals to the web server that the JSP engine will process elements on this page. JSP pages are typically compiled into Java platform servlet classes; as a result, JSP pages require a Java virtual machine that supports the Java platform servlet specification. Pages built using JSP technology are typically implemented using a translation phase that is performed once, the first time the page is called. The page is compiled into a Java servlet class and remains in server memory, so subsequent calls to the page have very fast response times.

The JSP specification does support the creation of XML documents. For simple XML generation, the XML tags may be included as static template portions of the JSP page. The JSP 2.0 specification describes a mapping between JSP pages and XML documents.

2.2.5 Advantages of JSP

Scripting: Different server-side languages like ASP have one common drawback: they depend on somewhat weak programming languages for processing. JSP, by contrast, uses the powerful and fully object-oriented Java language for processing.

Write once, run anywhere: JSP technology brings the "write once, run anywhere" method to interactive web pages. JSP pages can be easily moved across platforms without any changes.

2.2.6 JSP Architecture

The source code of a JSP page is essentially just HTML sprinkled here and there with special JSP tags and/or Java code enclosed in those tags. The file's extension is .jsp rather than the usual .html or .htm, and it tells the server that this document requires special handling. The special handling, accomplished with a web server extension or plug-in, involves four steps:

1. The JSP engine parses the page and creates a Java source file.
2. It then compiles the file produced in step 1 into a Java class file. The class file created in step 2 is a servlet.
3. The servlet engine loads the servlet class for execution.
4. The servlet executes and streams the results back to the requester.

Steps 1 and 2 occur only once, when you first deploy or update the JSP. The servlet engine performs step 3 only upon the first request for that servlet since the last server restart. After that, the class loader loads the class once, and it is available for the life of that JVM. Finally, some application servers provide page caching, which can further improve performance and reduce the cost of executing the request.
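A minimal, illustrative JSP page (not taken from the project) showing the mix of static HTML and Java enclosed in JSP tags that the four steps above translate and compile:

```jsp
<%-- hello.jsp: the .jsp extension tells the web server to hand this page
     to the JSP engine (steps 1 and 2 run once, on the first request) --%>
<html>
  <body>
    <%-- static template HTML mixed with a Java expression --%>
    <p>Server time: <%= new java.util.Date() %></p>
  </body>
</html>
```

Everything outside the `<%= ... %>` expression is copied into the generated servlet as static template data.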

Chapter 3
Modules

There are three modules in the search engine with web crawler:

1. Administrator Side
2. Search
3. Web Service

3.1 Administrator Side

In this module, the administrator downloads the web pages and saves them in a file. The administrator also keeps track of the details of searching and can set the page details.

This module has a login session: by typing the correct username and password in the corresponding fields, we can enter the administrator side. Usernames and passwords are stored in the database, and only authorized people can log on to the administrator side.

This module has two sub-modules.

3.1.1 Page Settings

In this sub-module, the administrator can set the font type and color of the content, and the background color, of the selected page. The administrator can choose any color and font for the content from a given list of colors and fonts; the background color can also be selected from a list of colors. The administrator writes the selected colors and font to the database, and the page changes its values according to the data read from the database. For this, the administrator first writes a particular color and font to the database; then, when a change in the page settings occurs, it is updated in the database.

3.1.2 Log Settings

The administrator can keep the details of searching: the searched word and the time and date of the search. The details are stored in the database. When we go to the corresponding page, we can see the log table containing the details of searching.

There is a logout session for the administrator side; through this, we can successfully log out from the administrator side.

3.2 Search

When a query is given by the user, the search engine checks for the corresponding index file. If it is not present, it makes an index file with that query as the filename. It then checks all the web pages for the given query and adds the addresses of the matching URLs to that index file. The number of occurrences of the given query term in each page is counted and recorded in a database. The ranking is based on the count of the query term: the URLs are listed from the database in descending order of that count.
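The count-then-rank step can be sketched in plain Java. As an assumption of this sketch, an in-memory map from URL to page text stands in for the project's file cache and database table:

```java
import java.util.*;
import java.util.stream.*;

public class RankSketch {
    // Counts non-overlapping, case-insensitive occurrences of the query
    // term in a page's text.
    static int count(String text, String term) {
        String t = text.toLowerCase(), q = term.toLowerCase();
        int n = 0, i = 0;
        while ((i = t.indexOf(q, i)) != -1) {
            n++;
            i += q.length();
        }
        return n;
    }

    // Ranks URLs by descending term count, as described above; `pages`
    // maps each URL to the text of the downloaded page.
    public static List<String> rank(Map<String, String> pages, String term) {
        return pages.entrySet().stream()
                .filter(e -> count(e.getValue(), term) > 0)  // keep only matching pages
                .sorted((a, b) -> count(b.getValue(), term) - count(a.getValue(), term))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

For example, with pages containing the term three times, once, and zero times, `rank` returns the first two URLs in that order and drops the third.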

3.3 Web Service

The web service includes facilities for getting instant news. The news search is done by means of XML parsing; the news is mainly fetched from Yahoo.
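The XML-parsing step can be sketched with the JDK's built-in DOM parser. The RSS-style sample feed used below is illustrative only, not the actual Yahoo response format:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NewsParseSketch {
    // Parses a feed and extracts the text of every <title> element,
    // in document order.
    public static List<String> titles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }
}
```

In the real module the XML string would come from an HTTP request to the news feed; here it is passed in directly so the parsing step can be shown on its own.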

Chapter 4
Working

4.1 Steps used in the implementation of the Search engine

The steps involved in the implementation of the search engine with web crawler:

1. The necessary URLs are first downloaded into the cache by the administrator.
2. When the user submits a query, an independent cache for each index term is created, after checking whether one is already present.
3. The web pages are searched for the index terms, and the URLs containing the corresponding index terms are recorded in the database.
4. The count of the given query term in each web page is also recorded in the database.
5. Finally, the ranked URLs are listed out from the database in decreasing order.
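Steps 2 and 3 can be sketched together as a per-term cache of matching URLs, built on first use. As assumptions of this sketch, an in-memory map stands in for both the file named after the query and the database, and the ranking of step 5 is reduced to a deterministic sort:

```java
import java.util.*;
import java.util.stream.*;

public class IndexCacheSketch {
    private final Map<String, String> pages;                         // URL -> downloaded text (step 1)
    private final Map<String, List<String>> cache = new HashMap<>(); // term -> matching URLs (step 2)

    public IndexCacheSketch(Map<String, String> pages) {
        this.pages = pages;
    }

    // Builds the posting list for the term only if it is not cached yet
    // (steps 2-3), then returns the matching URLs. URLs are sorted here
    // for determinism; the project ranks them by term count instead.
    public List<String> search(String term) {
        return cache.computeIfAbsent(term.toLowerCase(), key ->
                pages.entrySet().stream()
                     .filter(e -> e.getValue().toLowerCase().contains(key))
                     .map(Map.Entry::getKey)
                     .sorted()
                     .collect(Collectors.toList()));
    }
}
```

A repeated query for the same term (in any letter case) returns the cached list without rescanning the downloaded pages, which is the point of step 2.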

Chapter 5
System Design

5.1 Data Flow Diagram

A Data Flow Diagram (DFD), or bubble chart, is a graphical tool for structured analysis. It was DeMarco in 1978, and Gane and Sarson in 1979, who introduced the DFD. A DFD models how a system transforms data: data flows from external entities to a process, and from there to other processes, external entities or files, producing output data flows. Data in files may also flow to processes as inputs.

There are various symbols used in a DFD. Bubbles represent processes. Named arrows indicate data flows. External entities are represented by rectangles and are outside the system, such as vendors or customers with whom the system interacts; they either supply or consume data. Entities supplying data are known as sources, and those that consume data are called sinks. Data are stored in a data store by a process in the system. Each component in a DFD is labeled with a descriptive name, and process names are further identified with a number.

DFDs can be hierarchically organized, which helps in partitioning and analyzing large systems. As a first step, one data flow diagram can depict an entire system, giving the system overview; this is called the context diagram, or level 0 DFD. The context diagram can be further expanded. The successive expansion of DFDs from the context diagram to those giving more detail is known as leveling of the DFD. Thus a top-down approach is used, starting with an overview and then working out the details.

The main merit of the DFD is that it can provide an overview of what data a system would process, what transformations of data are done, what files are used, and where the results flow.

The data flow diagram of the Search Engine With Web Crawler has been represented as a hierarchical DFD: the context-level DFD was drawn first, then the processes were decomposed into several elementary levels and were represented in order of importance.


5.2 Database Design


Chapter 6
Conclusion

Nowadays, there are many search engines like Google, Yahoo, AltaVista, etc. We have tried to develop a search engine with some of the facilities of the current search engines, such as text search and news search. Still, there are limitations in our search engine.

    References

[1] R. Baeza-Yates: Modern Information Retrieval

    [2] http://www.searchenginewatch.com/

    [3] http://www.webcrawler.com/