
  • 8/14/2019 18363882 Search Engine With Web Crawler


Search Engine with Web Crawler

Abstract

Project : SEARCH ENGINE WITH WEB CRAWLER
Front End : Core Java, JSP
Back End : File system & MySQL server
Web Server : Tomcat web server

This project is an attempt to implement a search engine with a web crawler, so as to demonstrate how it helps people search the Web faster. A search engine is an information retrieval system designed to help find information stored on a computer system. The most public, visible form of a search engine is a Web search engine, which searches for information on the World Wide Web. Search engines provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. A web crawler (also called a Web spider or Web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering. Search engines use spidering as a means of providing up-to-date data. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.

WebCrawler was the Internet's first search engine that performed keyword searches in both the names and the texts of pages on the World Wide Web. WebCrawler's search engine performed two basic functions. First, it compiled an ongoing index of web addresses (URLs). WebCrawler retrieved and marked a document, analyzed the content of both its title and its full text, registered the relevant links it contained, and then stored the information in its database. When the user submitted a query in the form of one or more keywords, WebCrawler compared it with the information in its index and reported back any matches. WebCrawler's second function was searching the Internet in real time for the sites that matched a given query. It was carried out using exactly the same process, following links from one page to another.
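The seed-and-frontier loop described above can be sketched in Java. This is a minimal illustration, not the project's code: the in-memory `web` map stands in for actually fetching a page over HTTP and extracting its hyperlinks.

```java
import java.util.*;

// A minimal sketch of the crawl-frontier loop: start from a seed URL,
// visit each URL once, and add newly discovered links to the frontier.
// The `web` map is a stand-in for fetching a page and parsing its links.
public class FrontierSketch {
    public static List<String> crawl(String seed, Map<String, List<String>> web) {
        List<String> visited = new ArrayList<>();
        Deque<String> frontier = new ArrayDeque<>(); // the crawl frontier
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();            // visit the next URL
            visited.add(url);
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                // only enqueue unseen links
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a.html", List.of("b.html", "c.html"),
            "b.html", List.of("c.html", "d.html"));
        System.out.println(crawl("a.html", web)); // prints [a.html, b.html, c.html, d.html]
    }
}
```

Because each URL is added to `seen` exactly once, pages that link to each other do not cause the crawler to loop forever.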

Contents

1 Introduction
1.1 The Motivation
2 System Study
2.1 Proposed System
2.2 Technologies
2.2.1 Java
2.2.2 JDBC (Java Database Connectivity)
2.2.3 Overview of the JDBC Process
2.2.4 Java Server Pages (JSP)
2.2.5 Advantages of JSP
2.2.6 JSP Architecture
3 Modules
3.1 Administrator Side
3.1.1 Page Settings
3.1.2 Log Settings
3.2 Search
3.3 Web Service
4 Working
4.1 Steps used in the implementation of the Search engine
5 System Design
5.1 Data Flow Diagram
5.2 Database Design
6 Conclusion
References

Chapter 1
Introduction

Most people find what they're looking for on the World Wide Web by using search engines like Yahoo!, AltaVista, or Google. It is the search engines that finally bring your website to the notice of prospective customers. Hence it is better to know how these search engines actually work and how they present information to the customer initiating a search. When you ask a search engine to locate information, it is actually searching through the index it has created, not actually searching through the Web. Different search engines produce different rankings because not every search engine uses the same algorithm to search through the indices. Many leading search engines use a form of software program called spiders or crawlers to find information on the Internet and store it for search results in giant databases or indexes. Some spiders record every word on a Web site for their respective indexes, while others only record certain keywords listed in title tags or meta tags.

Search engines use spiders to index websites. When you submit your website pages to a search engine by completing its required submission page, the search engine spider will index your entire site. A spider is an automated program that is run by the search engine system. Search engine indexing collects, parses, and stores the data to facilitate fast and accurate information retrieval. Spiders are unable to index pictures or read text that is contained within graphics, so relying too heavily on such elements is a consideration for online marketers.

WebCrawler was the Internet's first search engine that performed keyword searches in both the names and texts of pages on the World Wide Web. It won quick popularity and loyalty among surfers looking for information. WebCrawler was born in January 1994, during the Web's infancy. It was developed by Brian Pinkerton, a computer science student at the University of Washington, to cope with the complexity of the Web. Pinkerton's application, WebCrawler, could automatically scan the individual sites on the Web, register their content, and create an index that surfers could query with keywords to find Web sites relevant to their interests.

1.1 The Motivation

Primarily, the motivation is our interest in the area of information retrieval. Nowadays, there are many search engines like Google, Yahoo, AltaVista, etc. We are trying to develop a search engine with some of the facilities of the current search engines, such as text search and news search.

Chapter 2
System Study

2.1 Proposed System

In our proposed system, the search engine is implemented using a web crawler. In our search engine, the user can search for text queries. When a query is submitted, it is searched for in the downloaded web pages, and the ranked URLs are listed to the user. The ranking is based on the number of searched words present in each web page. The user also has the option of news fetching using the Yahoo API.

2.2 Technologies

The choice of programming language depends on the needs of the system. Since the application is a web-based system, Java and its technologies are most suitable. In the development of this application, JSP is used for the design of web pages for both the user and the administrator.

2.2.1 Java

Java was introduced by Sun Microsystems in 1995 and instantly created a new sense of the interactive possibilities of the web. Originally it was called Oak. It was mainly developed for building software for consumer electronic devices. Both of the major web browsers include a Java Virtual Machine (JVM), and almost all major operating system developers (IBM, Microsoft and others) have added a Java compiler to their product offerings.

Java is a platform-independent language. It is the first programming language that is not tied to any particular hardware or operating system: programs developed in Java can be executed anywhere, on any system. The Internet helped to propel Java into the forefront of programming, and Java, in turn, has had a profound effect on the Internet. Java is a true object-oriented language, expressly designed for use in the distributed environment of the Internet.

The object model in Java is simple and easy to extend. Java can also be used to build small application modules, or applets, for use as part of a web page; applets make it possible for web page users to interact with the page. Java can be easily incorporated into the web system, and the programs you create are portable in a network. The compiler's output is bytecode, a highly optimized set of instructions designed to be executed by the Java runtime system, so it can be run on any machine with a Java runtime. Translating a Java program into bytecode makes it easier to run the program in a wide variety of environments.

The major features of Java are:

Mainly, Java is platform-independent and portable. Java programs can be moved from one computer system to another, anywhere and anytime. Changes and upgrades in operating systems, processors and system resources will not force any changes in Java programs.

Secondly, Java is a true object-oriented language. Almost everything in Java is an object; all program code and data reside within objects and classes. Java provides many safeguards to ensure reliable code: it makes memory management much easier and has strict compile-time and runtime checking of data types.

Java is designed as a distributed language for creating applications on networks. It has the ability to share both data and programs: Java applications can open and access remote objects on the Internet as easily as they can on a local system. It is a small and simple language, designed to be easy for the professional programmer to learn and use effectively.

The Java environment includes a large number of development tools, and hundreds of classes and methods are part of the Java Standard Library (JSL), also known as the Application Programming Interface (API). The development tools that are part of Java are used as the front end for designing the GUI for the end users. Java is a general-purpose programming language which supports multi-threaded programs, meaning that we need not wait for an application to finish one task before beginning another.

2.2.2 JDBC (Java Database Connectivity)

Practically every J2EE application saves, retrieves and manipulates information stored in a database, using services provided by a J2EE component. A J2EE component supplies database access using Java data objects contained in the JDBC application programming interface (API).

Sun Microsystems, Inc. met this challenge in 1996 with the creation of the JDBC driver and the JDBC API. The JDBC "driver" developed by Sun Microsystems, Inc. wasn't a driver at all: it was a specification that described the detailed functionality of a JDBC driver. The specification required a JDBC driver to be a translator that converted low-level proprietary DBMS messages to and from the low-level messages understood by the JDBC API. This meant Java programmers could use the high-level Java data objects defined in the JDBC API to write routines that interacted with the DBMS. JDBC drivers created by DBMS manufacturers have to:

Open a connection between the DBMS and the J2EE component.
Translate low-level equivalents of SQL statements sent by the J2EE component into messages that can be processed by the DBMS.
Return data that conforms to the JDBC specification to the JDBC driver.
Return information, such as error messages, that conforms to the JDBC specification.
Provide transaction management routines that conform to the JDBC specification.
Close the connection between the DBMS and the J2EE component.

2.2.3 Overview of the JDBC Process

This process is divided into five routines:

Perform connection and authentication to a database server.
Manage transactions.
Move SQL statements to a database engine for preprocessing and execution.
Execute stored procedures.
Inspect and modify the results from SELECT statements.
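As a sketch of how these routines look against the `java.sql` API: the table name `search_log` and its columns below are illustrative assumptions, not the project's actual schema, and actually running the connecting method would need a live MySQL server and its JDBC driver on the classpath.

```java
import java.sql.*;

public class JdbcSketch {
    // Walks the routines above: connect and authenticate, manage a
    // transaction, send a parameterized statement for preprocessing, and
    // execute it. The schema (search_log) is an illustrative assumption.
    public static void logSearch(String url, String user, String pass, String keyword)
            throws SQLException {
        try (Connection con = DriverManager.getConnection(url, user, pass)) { // connect + authenticate
            con.setAutoCommit(false);                                         // manage the transaction ourselves
            try (PreparedStatement ps = con.prepareStatement(insertSql())) {  // send SQL for preprocessing
                ps.setString(1, keyword);
                ps.executeUpdate();                                           // execute
                con.commit();
            } catch (SQLException e) {
                con.rollback();                                               // undo on failure
                throw e;
            }
        }                                                                     // connection closed here
    }

    // The parameterized INSERT itself, separated out so it can be
    // inspected without a database connection.
    static String insertSql() {
        return "INSERT INTO search_log (keyword, searched_at) VALUES (?, NOW())";
    }
}
```

Using a `PreparedStatement` rather than string concatenation lets the database engine preprocess the statement once and also keeps the keyword from being interpreted as SQL.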

2.2.4 Java Server Pages (JSP)

JSP is a technology based on the Java language that enables the development of dynamic websites. JSP was developed by Sun Microsystems to allow server-side development. Based on the Java programming language, JSP offers proven portability and open standards. A JSP document can share data among users, access databases, and do all the things that require server intervention. JSP documents get compiled into Java bytecode, a binary format with fast and efficient runtime capabilities. JSP pages separate the page logic from its design and display. JSP technology is part of the Java technology family. JSP pages are not restricted to any specific platform or web server, and the JSP specification represents a broad spectrum of industry input.

A servlet is a program written in the Java programming language that runs on the server, as opposed to the browser (as applets do). JSP pages are compiled into servlets, so theoretically you could write servlets to support your web-based applications. However, JSP technology was designed to simplify the process of creating pages by separating web presentation from web content. In many applications, the response sent to the client is a combination of template data and dynamically generated data. In this situation, it is much easier to work with JSP pages than to do everything with servlets.

The JSP 2.1 specification is an important part of the Java EE 5 platform. There are a number of JSP technology implementations for different web servers. JSP technology is the result of industry collaboration and is designed to be an open, industry-standard method supporting numerous servers, browsers and tools. JSP technology speeds development with reusable components and tags, instead of relying heavily on scripting within the page itself. All JSP implementations support a Java programming language based scripting language, which provides inherent scalability and support for complex operations.

A JSP page is a page created by the web developer that includes JSP technology-specific and custom tags, in combination with other static (HTML or XML) tags. A JSP page has the extension .jsp or .jspx; this signals to the web server that the JSP engine will process elements on this page. JSP pages are typically compiled into Java platform servlet classes; as a result, JSP pages require a Java virtual machine that supports the Java platform servlet specification. Pages built using JSP technology are typically implemented using a translation phase that is performed once, the first time the page is called. The page is compiled into a Java servlet class and remains in server memory, so subsequent calls to the page have very fast response times.

The JSP specification does support the creation of XML documents. For simple XML generation, the XML tags may be included as static template portions of the JSP page. The JSP 2.0 specification describes a mapping between JSP pages and XML documents.

2.2.5 Advantages of JSP

Scripting: Different server-side languages like ASP have one common drawback: they depend on somewhat weak programming languages for processing. JSP, by contrast, uses the powerful and fully object-oriented Java language for processing.

Write once, run anywhere: JSP technology brings the "write once, run anywhere" method to interactive web pages. JSP pages can be easily moved across platforms without any changes.

2.2.6 JSP Architecture

The source code of a JSP page is essentially just HTML sprinkled here and there with special JSP tags and/or Java code enclosed in those tags. The file's extension is .jsp rather than the usual .html or .htm, and it tells the server that this document requires special handling. The special handling, accomplished with a web server extension or plug-in, involves four steps:

1. The JSP engine parses the page and creates a Java source file.
2. It then compiles the file produced in step 1 into a Java class file. The class file created in step 2 is a servlet.
3. The servlet engine loads the servlet class for execution.
4. The servlet executes and streams the results back to the requester.

Steps 1 and 2 occur only once, when you first deploy or update the JSP. The servlet engine performs step 3 only upon the first request for that servlet since the last server restart. After that, the class loader loads the class once, and it is available for the life of that JVM. Finally, some application servers provide page caching, which can further improve performance and reduce the cost of executing the request.
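A minimal, illustrative JSP page (not taken from the project) showing the mix of static HTML and Java enclosed in JSP tags that the four steps above translate and compile:

```jsp
<%-- hello.jsp: the .jsp extension tells the web server to hand this page
     to the JSP engine (steps 1 and 2 run once, on the first request) --%>
<html>
  <body>
    <%-- static template HTML mixed with a Java expression --%>
    <p>Server time: <%= new java.util.Date() %></p>
  </body>
</html>
```

Everything outside the `<%= ... %>` expression is copied into the generated servlet as static template data.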

Chapter 3
Modules

There are three modules in the search engine with web crawler:

1. Administrator Side
2. Search
3. Web Service

3.1 Administrator Side

In this module, the administrator downloads the web pages and saves them in a file. The administrator also keeps track of the details of searching and can set the page details.

This module has a login session: by typing the correct username and password in the corresponding fields, we can enter the administrator side. Usernames and passwords are stored in the database, and only authorized people can log on to the administrator side.

This module has two sub-modules.

3.1.1 Page Settings

In this sub-module, the administrator can set the font type and color of the content, and the background color, of the selected page. The administrator can choose any color and font for the content from a given list of colors and fonts; the background color can also be selected from a list of colors. The administrator writes the selected colors and font to the database, and the page changes its values according to the data read from the database. For this, the administrator first writes a particular color and font to the database; then, when a change in the page settings occurs, it is updated in the database.

3.1.2 Log Settings

The administrator can keep the details of searching: the searched word and the time and date of the search. The details are stored in the database. When we go to the corresponding page, we can see the log table containing the details of searching.

There is a logout session for the administrator side; through this, we can successfully log out from the administrator side.

3.2 Search

When a query is given by the user, the search engine checks for the corresponding index file. If it is not present, it makes an index file with that query as the filename. It then checks all the web pages for the given query and adds the addresses of the matching URLs to that index file. The number of occurrences of the given query term in each page is counted and recorded in a database. The ranking is based on the count of the query term: the URLs are listed from the database in descending order of that count.
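The count-then-rank step can be sketched in plain Java. As an assumption of this sketch, an in-memory map from URL to page text stands in for the project's file cache and database table:

```java
import java.util.*;
import java.util.stream.*;

public class RankSketch {
    // Counts non-overlapping, case-insensitive occurrences of the query
    // term in a page's text.
    static int count(String text, String term) {
        String t = text.toLowerCase(), q = term.toLowerCase();
        int n = 0, i = 0;
        while ((i = t.indexOf(q, i)) != -1) {
            n++;
            i += q.length();
        }
        return n;
    }

    // Ranks URLs by descending term count, as described above; `pages`
    // maps each URL to the text of the downloaded page.
    public static List<String> rank(Map<String, String> pages, String term) {
        return pages.entrySet().stream()
                .filter(e -> count(e.getValue(), term) > 0)  // keep only matching pages
                .sorted((a, b) -> count(b.getValue(), term) - count(a.getValue(), term))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

For example, with pages containing the term three times, once, and zero times, `rank` returns the first two URLs in that order and drops the third.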

3.3 Web Service

The web service includes facilities for getting instant news. The news search is done by means of XML parsing; the news is mainly fetched from Yahoo.
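The XML-parsing step can be sketched with the JDK's built-in DOM parser. The RSS-style sample feed used below is illustrative only, not the actual Yahoo response format:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class NewsParseSketch {
    // Parses a feed and extracts the text of every <title> element,
    // in document order.
    public static List<String> titles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }
}
```

In the real module the XML string would come from an HTTP request to the news feed; here it is passed in directly so the parsing step can be shown on its own.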

Chapter 4
Working

4.1 Steps used in the implementation of the Search engine

The steps involved in the implementation of the search engine with web crawler:

1. The necessary URLs are first downloaded into the cache by the administrator.
2. When the user submits a query, an independent cache for each index term is created, after checking whether one is already present.
3. The web pages are searched for the index terms, and the URLs containing the corresponding index terms are recorded in the database.
4. The count of the given query term in each web page is also recorded in the database.
5. Finally, the ranked URLs are listed out from the database in decreasing order.
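Steps 2 and 3 can be sketched together as a per-term cache of matching URLs, built on first use. As assumptions of this sketch, an in-memory map stands in for both the file named after the query and the database, and the ranking of step 5 is reduced to a deterministic sort:

```java
import java.util.*;
import java.util.stream.*;

public class IndexCacheSketch {
    private final Map<String, String> pages;                         // URL -> downloaded text (step 1)
    private final Map<String, List<String>> cache = new HashMap<>(); // term -> matching URLs (step 2)

    public IndexCacheSketch(Map<String, String> pages) {
        this.pages = pages;
    }

    // Builds the posting list for the term only if it is not cached yet
    // (steps 2-3), then returns the matching URLs. URLs are sorted here
    // for determinism; the project ranks them by term count instead.
    public List<String> search(String term) {
        return cache.computeIfAbsent(term.toLowerCase(), key ->
                pages.entrySet().stream()
                     .filter(e -> e.getValue().toLowerCase().contains(key))
                     .map(Map.Entry::getKey)
                     .sorted()
                     .collect(Collectors.toList()));
    }
}
```

A repeated query for the same term (in any letter case) returns the cached list without rescanning the downloaded pages, which is the point of step 2.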

Chapter 5
System Design

5.1 Data Flow Diagram

A Data Flow Diagram (DFD), or bubble chart, is a graphical tool for structured analysis. It was DeMarco in 1978, and Gane and Sarson in 1979, who introduced the DFD. A DFD models how a system transforms data: data flows from external entities to a process, and from there to other processes, external entities or files, producing output data flows. Data in files may also flow to processes as inputs.

There are various symbols used in a DFD. Bubbles represent processes. Named arrows indicate data flows. External entities are represented by rectangles and are outside the system, such as vendors or customers with whom the system interacts; they either supply or consume data. Entities supplying data are known as sources, and those that consume data are called sinks. Data are stored in a data store by a process in the system. Each component in a DFD is labeled with a descriptive name, and process names are further identified with a number.

DFDs can be hierarchically organized, which helps in partitioning and analyzing large systems. As a first step, one data flow diagram can depict an entire system, giving the system overview; this is called the context diagram, or level 0 DFD. The context diagram can be further expanded. The successive expansion of DFDs from the context diagram to those giving more detail is known as leveling of the DFD. Thus a top-down approach is used, starting with an overview and then working out the details.

The main merit of the DFD is that it can provide an overview of what data a system would process, what transformations of data are done, what files are used, and where the results flow.

The data flow diagram of the Search Engine With Web Crawler has been represented as a hierarchical DFD: the context-level DFD was drawn first, then the processes were decomposed into several elementary levels and were represented in order of importance.


5.2 Database Design


Chapter 6
Conclusion

Nowadays, there are many search engines like Google, Yahoo, AltaVista, etc. We have tried to develop a search engine with some of the facilities of the current search engines, such as text search and news search. Still, there are limitations in our search engine.

    References

[1] R. Baeza-Yates: Modern Information Retrieval

    [2] http://www.searchenginewatch.com/

    [3] http://www.webcrawler.com/