Search Engine with Web Crawler

Abstract

Project: SEARCH ENGINE WITH WEB CRAWLER
Front End: Core Java, JSP
Back End: File system & MySQL server
Web Server: Tomcat web server

This project is an attempt to implement a search engine with a web crawler, to demonstrate how it helps people search the web faster. A search engine is an information retrieval system designed to help find information stored on a computer system. The most public, visible form of a search engine is a web search engine, which searches for information on the World Wide Web. Search engines provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. A web crawler (also called a web spider or web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. This process is called web crawling or spidering. Search engines use spidering as a means of providing up-to-date data. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.

WebCrawler was the Internet's first search engine that performed keyword searches in both the names and texts of pages on the World Wide Web. WebCrawler performed two basic functions. First, it compiled an ongoing index of web addresses (URLs): WebCrawler retrieved and marked a document, analyzed the content of both its title and its full text, registered the relevant links it contained, and then stored the information in its database. When the user submitted a query in the form of one or more keywords, WebCrawler compared it with the information in its index and reported back any matches. WebCrawler's second function was searching the Internet in real time for the sites that matched a given query. It was carried out using exactly the same process, following links from one page to another.
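As a minimal sketch of the seed/crawl-frontier loop described above (the seed URL, page limit, and link-extraction regex are illustrative placeholders, not the project's actual code):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Start from seed URLs, fetch each page, extract hyperlinks, and add unseen
// ones to the frontier. Error handling and politeness (robots.txt, crawl
// delays) are deliberately omitted.
public class CrawlerSketch {
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();   // the crawl frontier
        Set<String> visited = new HashSet<>();
        frontier.add("https://example.com/");          // seed URL (placeholder)

        while (!frontier.isEmpty() && visited.size() < 10) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;           // skip already-visited pages
            StringBuilder page = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) page.append(line).append('\n');
            }
            Matcher m = LINK.matcher(page);
            while (m.find()) frontier.add(m.group(1)); // grow the frontier
        }
        System.out.println("Visited: " + visited);
    }
}
```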

Contents

1 Introduction
  1.1 The Motivation
2 System Study
  2.1 Proposed System
  2.2 Technologies
    2.2.1 Java
    2.2.2 JDBC (Java Database Connectivity)
    2.2.3 Overview of the JDBC Process
    2.2.4 Java Server Pages (JSP)
    2.2.5 Advantages of JSP
    2.2.6 JSP Architecture
3 Modules
  3.1 Administrator Side
    3.1.1 Page Settings
    3.1.2 Log Settings
  3.2 Search
  3.3 Web Service
4 Working
  4.1 Steps Used in the Implementation of the Search Engine
5 System Design
  5.1 Data Flow Diagram
  5.2 Database Design
6 Conclusion
References

Chapter 1 Introduction

Most people find what they're looking for on the World Wide Web by using search engines like Yahoo!, AltaVista, or Google. It is the search engines that finally bring your website to the notice of prospective customers.

Hence it is better to know how these search engines actually work and how they present information to the customer initiating a search. When you ask a search engine to locate information, it is actually searching through the index it has created, not searching the Web itself. Different search engines produce different rankings because not every search engine uses the same algorithm to search through its indices.

Many leading search engines use a form of software program called a spider or crawler to find information on the Internet and store it for search results in giant databases or indexes. Some spiders record every word on a website for their respective indexes, while others only record certain keywords listed in title tags or meta tags. Search engines use spiders to index websites: when you submit your website pages to a search engine by completing its required submission page, the search engine spider will index your entire site. A spider is an automated program that is run by the search engine system. Search engine indexing collects, parses, and stores the data to facilitate fast and accurate information retrieval. Spiders are unable to index pictures or read text that is contained within graphics, so relying too heavily on such elements is a concern for online marketers.

WebCrawler was the Internet's first search engine that performed keyword searches in both the names and texts of pages on the World Wide Web. It won quick popularity and loyalty among surfers looking for information. WebCrawler was born in January 1994, during the Web's infancy. It was developed by Brian Pinkerton, a computer science student at the University of Washington, to cope with the complexity of the Web. Pinkerton's application, WebCrawler, could automatically scan the individual sites on the Web, register their content, and create an index that surfers could query with keywords to find websites relevant to their interests.
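The indexing described above can be made concrete with a small inverted index: a map from each word to the set of pages containing it, so a query becomes a fast lookup instead of a scan of the whole collection. This is an illustrative sketch only; the class name and sample pages are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Collect, parse, and store: each page's text is split into words, and each
// word is mapped to the set of URLs in which it occurs.
public class InvertedIndexSketch {
    private final Map<String, Set<String>> index = new HashMap<>();

    public void addPage(String url, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new HashSet<>()).add(url);
            }
        }
    }

    public Set<String> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), new HashSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addPage("https://example.com/a", "web crawler visits pages");
        idx.addPage("https://example.com/b", "search engine ranks pages");
        System.out.println(idx.lookup("pages"));  // prints both URLs
    }
}
```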

1.1 The Motivation

Primarily, the motivation comes from an interest in the area of information retrieval. Nowadays there are many search engines, such as Google, Yahoo, and AltaVista. We are trying to develop a search engine with some of the facilities of current search engines, such as text search and news search.

Chapter 2 System Study

2.1 Proposed System

In our proposed system, the search engine is implemented using a web crawler. In our search engine, the user can search with text queries. When a query is submitted, the engine searches the downloaded web pages and lists the ranked URLs to the user. The ranking is based on the number of occurrences of the searched words in each web page, as the sketch below illustrates. The user also has the option of fetching news using the Yahoo API.
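A minimal sketch of this term-count ranking, assuming the pages have already been downloaded (the page texts and class name here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rank URLs by how many times the query term occurs in each page's text.
public class RankingSketch {
    static int countOccurrences(String text, String term) {
        Matcher m = Pattern.compile(Pattern.quote(term.toLowerCase()))
                           .matcher(text.toLowerCase());
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();  // url -> page text
        pages.put("https://example.com/a", "java search engine in java");
        pages.put("https://example.com/b", "a crawler written in java");

        pages.entrySet().stream()
             .sorted((a, b) -> countOccurrences(b.getValue(), "java")
                             - countOccurrences(a.getValue(), "java"))
             .forEach(e -> System.out.println(e.getKey()));  // highest count first
    }
}
```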

2.2 Technologies

The selection of a programming language depends on what the system needs. Since the application is a web-based system, Java and its technologies are the most suitable. In the development of this application, JSP is used for the design of the web pages for both the user and the administrator.

2.2.1 Java

Java was introduced by Sun Microsystems in 1995 and instantly created a new sense of the interactive possibilities of the web. Originally it was called Oak, and it was mainly developed for writing software for consumer electronic devices. Both of the major web browsers include a Java Virtual Machine (JVM), and almost all major operating system developers (IBM, Microsoft, and others) have added a Java compiler to their product offerings.

Java is a platform-independent language. It was the first programming language not tied to any particular hardware or operating system: programs developed in Java can be executed anywhere on any system. The Internet helped to propel Java into the forefront of programming, and Java, in turn, has had a profound effect on the Internet. Java is a true object-oriented language, expressly designed for use in the distributed environment of the Internet. The object model in Java is simple and easy to extend. Java can also be used to build small application modules, or applets, for use as part of a web page. Applets make it possible for web page users to interact with the page, so Java can be easily incorporated into a web system.

The programs you create are portable across a network. The compiler's output is bytecode, a highly optimized set of instructions designed to be executed by the Java runtime system, which lets the same code run on any processor with a JVM. Translating a Java program into bytecode makes it much easier to run the program in a wide variety of environments.

The major features of Java are:

- Java is platform-independent and portable. Java programs can be moved from one computer system to another, anywhere and anytime. Changes and upgrades in operating systems, processors, and system resources will not force any changes in Java programs.
- Java is a true object-oriented language. Almost everything in Java is an object; all program code and data reside within objects and classes. It provides many safeguards to ensure reliable code, makes memory management much easier, and has strict compile-time and runtime checking of data types. The object model in Java is simple and easy to extend.
- Java is designed as a distributed language for creating applications on networks. It has the ability to share both data and programs, and Java applications can open and access remote objects on the Internet as easily as they can on a local system.
- Java is a small and simple language, designed to be easy for the professional programmer to learn and use effectively.
- The Java environment includes a large number of development tools, and hundreds of classes and methods are part of the Java Standard Library (JSL), also known as the Application Programming Interface (API). These development tools are used as the front end for designing the GUI for end users.
- Java is a general-purpose programming language which supports multi-threaded programs. This means that we need not wait for the application to finish one task before beginning another, as the sketch after this list illustrates.
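For example, a minimal sketch of that multi-threading support (the task names are illustrative only):

```java
// Two tasks run concurrently, so the program need not finish one
// before beginning the other.
public class MultiThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread indexer = new Thread(() -> System.out.println("Indexing pages..."));
        Thread searcher = new Thread(() -> System.out.println("Serving a query..."));
        indexer.start();   // both threads run in parallel with the main thread
        searcher.start();
        indexer.join();    // wait for both tasks to complete
        searcher.join();
    }
}
```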

2.2.2 JDBC (Java Database Connectivity)

Practically every J2EE application saves, retrieves, and manipulates information stored in a database, using services provided by a J2EE component. A J2EE component gets database access using the Java data objects contained in the JDBC application programming interface (API). Sun Microsystems, Inc. met this challenge in 1996 with the creation of the JDBC driver and the JDBC API. The JDBC "driver" developed by Sun Microsystems, Inc. wasn't a driver at all: it was a specification that described the detailed functionality of a JDBC driver. The specification required a JDBC driver to be a translator that converted low-level proprietary DBMS messages into low-level messages understood by the JDBC API, and back. This meant Java programmers could use the high-level Java data objects defined in the JDBC API to write routines that interacted with the DBMS.

A JDBC driver created by a DBMS manufacturer has to:

- open a connection between the DBMS and the J2EE component;
- translate low-level equivalents of SQL statements sent by the J2EE component into messages that can be processed by the DBMS;
- return data in a form that conforms to the JDBC specification;
- return information, such as error messages, in a form that conforms to the JDBC specification;
- provide transaction management routines that conform to the JDBC specification; and
- close the connection between the DBMS and the J2EE component.

2.2.3 Overview of the JDBC Process

The JDBC process is divided into five routines:

1. Perform connection and authentication to a database server.
2. Manage transactions.
3. Move SQL statements to a database engine for preprocessing and execution.
4. Execute stored procedures.
5. Inspect and modify the results from SELECT statements.

A short sketch of this process follows.
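The sketch below walks through the core of the process with plain JDBC. It assumes a local MySQL database named searchengine with a table urls(address, hits) and placeholder credentials; all of these are hypothetical, not the project's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/searchengine";
        // Routine 1: connect and authenticate to the database server.
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            // Routine 3: move an SQL statement to the engine for execution.
            PreparedStatement ps =
                con.prepareStatement("SELECT address, hits FROM urls WHERE hits > ?");
            ps.setInt(1, 0);
            // Routine 5: inspect the results of the SELECT statement.
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("address") + " " + rs.getInt("hits"));
                }
            }
        } // try-with-resources closes the connection automatically
    }
}
```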

2.2.4 Java Server Pages (JSP)

JSP is a technology based on the Java language that enables the development of dynamic websites. JSP was developed by Sun Microsystems to allow server-side development. Being based on the Java programming language, JSP offers proven portability and open standards. A JSP document can share data among users, access databases, and do all the things that require server intervention. JSP documents get compiled into Java bytecode, a binary format with fast and efficient runtime capabilities. JSP pages separate the page logic from its design and display. JSP technology is part of the Java technology family.

JSP pages are not restricted to any specific platform or web server, and the JSP specification represents a broad spectrum of industry input. A servlet is a program written in the Java programming language that runs on the server, as opposed to the browser (where applets run). JSP pages are compiled into servlets, so theoretically you could write servlets directly to support your web-based applications. However, JSP technology was designed to simplify the process of creating pages by separating web presentation from web content. In many applications, the response sent to the client is a combination of template data and dynamically generated data; in this situation, it is much easier to work with JSP pages than to do everything with servlets.

The JSP 2.1 specification is an important part of the Java EE 5 platform, and there are a number of JSP technology implementations for different web servers. JSP technology is the result of industry collaboration and is designed to be an open, industry-standard method supporting numerous servers, browsers, and tools. It speeds development with reusable components and tags, instead of relying heavily on scripting within the page itself. All JSP implementations support a Java programming language-based scripting language, which provides inherent scalability and support for complex operations.

A JSP page is a page created by the web developer that includes JSP technology-specific and custom tags, in combination with other static (HTML or XML) tags. A JSP page has the extension .jsp or .jspx; this signals to the web server that the JSP engine will process the elements on this page. JSP pages are typically compiled into Java platform servlet classes; as a result, JSP pages require a Java virtual machine that supports the Java platform servlet specification. Pages built using JSP technology are typically implemented using a translation phase that is performed once, the first time the page is called. The page is compiled into a Java servlet class and remains in server memory, so subsequent calls to the page have very fast response times. The JSP specification also supports the creation of XML documents: for simple XML generation, the XML tags may be included as static template portions of the JSP page, and the JSP 2.0 specification describes a mapping between JSP pages and XML documents.

2.2.5 Advantages of JSP

Scripting: Server-side languages like ASP have one common drawback: they depend on somewhat weak scripting languages for processing. JSP, by contrast, uses the powerful and fully object-oriented Java language for processing.

Write Once, Run Anywhere: JSP technology brings the "write once, run anywhere" approach to interactive web pages. JSP pages can be easily moved across platforms without any changes.

2.2.6 JSP Architecture

The source code of a JSP page is essentially just HTML sprinkled here and there with special JSP tags and/or Java code enclosed in those tags. The file's extension is .jsp rather than the usual .html or .htm, and it tells the server that this document requires special handling. The special handling, accomplished with a web server extension or plug-in, involves four steps:

1. The JSP engine parses the page and creates a Java source file.
2. It then compiles the file produced in step 1 into a Java class file. The class file created in this step is a servlet.
3. The servlet engine loads the servlet class for execution.
4. The servlet executes and streams the results back to the requester.

Steps 1 and 2 occur only once, when you first deploy or update the JSP. The servlet engine performs step 3 only upon the first request for that servlet since the last server restart; after that, the class loader loads the class once and it remains available for the life of that JVM. Finally, some application servers provide page caching, which can further improve performance and reduce the cost of executing the request.
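To make steps 1 and 2 concrete, here is roughly the kind of servlet the JSP engine generates from a simple page: static HTML becomes output statements, and embedded Java code runs in between. The class name and request parameter are hypothetical.

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloJspServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        out.println("<html><body>");
        // Equivalent of a JSP expression such as <%= request.getParameter("q") %>
        out.println("<p>You searched for: " + request.getParameter("q") + "</p>");
        out.println("</body></html>");
    }
}
```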

Chapter 3 Modules

There are three modules in a search engine with a web crawler:

1. Administrator Side
2. Search
3. Web Service

3.1 Administrator Side

In this module, the administrator downloads the web pages and saves them in a file. The administrator also keeps track of the details of searching and can set the page details.

This module has a login session: by typing the correct username and password in the corresponding fields, we can enter the administrator side. Usernames and passwords are stored in the database, and only authorized people can log on to the administrator side. The module has two sub-modules.

3.1.1 Page Settings

Here the administrator can set the font type and color of the content, and the background color, of the selected page. The administrator can choose any color and font for the content from a selectable list of colors and fonts; the background color can likewise be selected from a list of colors. The administrator writes the selected colors and font to the database, and the page changes its values according to the data read from the database. The administrator first writes a particular color and font to the database; then, whenever a change in the page settings occurs, it is updated in the database.

3.1.2 Log Settings

The administrator can keep the details of searching: the searched word and the time and date of the search. These details are stored in the database. When we go to the corresponding page, we can see the log table containing the details of searching. There is also a logout session for the administrator side; with this we can successfully log out from the administrator side.
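A minimal sketch of recording one log entry, following the JDBC pattern from Section 2.2.3 (the table search_log(word, searched_at) and the credentials are hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class SearchLog {
    public static void record(String word) throws Exception {
        String url = "jdbc:mysql://localhost:3306/searchengine";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO search_log (word, searched_at) VALUES (?, ?)")) {
            ps.setString(1, word);
            ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            ps.executeUpdate();  // one row per search: word, date and time
        }
    }
}
```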

3.2 Search

When a query is given by the user, the search engine checks for the corresponding index file. If it is not present, it makes an index file with that query as the filename. It then checks all the web pages for the given query and adds the addresses of the matching URLs to that index file. The number of occurrences of the query term in each page is counted and recorded into the database. The ranking is based on this count of the query term, so the URLs are listed out from the database in descending order of count. A sketch of this index-file flow follows.
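A minimal sketch of the per-query index file, assuming downloaded pages are stored as text files in a cache/ directory and index files go in an existing index/ directory (both directory names, and the file layout, are hypothetical):

```java
import java.io.File;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.util.regex.Pattern;

public class QueryIndexSketch {
    // If no index file exists for the query, scan the cached pages,
    // count the term in each, and record matching pages with their counts.
    public static void buildIfMissing(String query) throws Exception {
        File indexFile = new File("index/" + query + ".txt");
        if (indexFile.exists()) return;             // index already built

        try (PrintWriter out = new PrintWriter(indexFile)) {
            File[] cached = new File("cache").listFiles();
            if (cached == null) return;
            for (File page : cached) {
                if (!page.isFile()) continue;
                String text = new String(Files.readAllBytes(page.toPath()));
                // Occurrences of the term = pieces after splitting, minus one.
                int count = text.toLowerCase()
                                .split(Pattern.quote(query.toLowerCase()), -1)
                                .length - 1;
                if (count > 0) out.println(page.getName() + " " + count);
            }
        }
    }
}
```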

3.3 Web Service

The web service includes the facility for getting instant news. The news searching is done by means of XML parsing, and the news is mainly fetched from Yahoo.
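A minimal sketch of fetching headlines by XML parsing, using the DOM parser from the Java standard library. The feed URL is a placeholder; the exact Yahoo endpoint used by the project is not specified here.

```java
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NewsFetcher {
    public static void main(String[] args) throws Exception {
        URL feed = new URL("https://news.example.com/rss");  // placeholder feed URL
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(feed.openStream());
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = item.getElementsByTagName("title").item(0).getTextContent();
            System.out.println(title);  // each <item><title> is one headline
        }
    }
}
```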

Chapter 4 Working

4.1 Steps Used in the Implementation of the Search Engine

The steps involved in the implementation of the search engine with a web crawler are:

1. The necessary URLs are first downloaded into the cache by the administrator.
2. When the user submits a query, an independent cache for each index term is created, after checking whether one is already present.
3. The web pages are searched for the index terms, and the URLs containing the corresponding index terms are recorded into the database.
4. The count of the given query term in each web page is also recorded into the database.
5. Finally, the ranked URLs are listed out from the database in decreasing order.

Chapter 5 System Design

5.1 Data Flow Diagram

A Data Flow Diagram (DFD), or bubble chart, is a graphical tool for structured analysis. It was DeMarco in 1978, and Gane and Sarson in 1979, who introduced the DFD. A DFD models how a system transforms data: data flows from external entities into processes, which create output data flows that go to other processes, external entities, or files. Data in files may also flow back to processes as inputs.

There are various symbols used in a DFD. Bubbles represent processes. Named arrows indicate data flows. External entities are represented by rectangles; they are outside the system, such as vendors or customers with whom the system interacts, and they either supply or consume data. Entities supplying data are known as sources, and those that consume data are called sinks. Data are stored in a data store by a process in the system. Each component in a DFD is labeled with a descriptive name, and process names are further identified with a number.

DFDs can be hierarchically organized, which helps in partitioning and analyzing large systems. As a first step, one Data Flow Diagram can depict an entire system, giving a system overview; this is called the context diagram, or level 0 DFD. The context diagram can be further expanded, and the successive expansion of DFDs from the context diagram to those giving more detail is known as leveling of the DFD. Thus a top-down approach is used, starting with an overview and then working out the details. The main merit of the DFD is that it can provide an overview of what data a system processes, what transformations of data are done, what files are used, and where the results flow.

The data flow diagram of the Search Engine with Web Crawler has been represented as a hierarchical DFD: the context-level DFD was drawn first, then the processes were decomposed into several elementary levels and represented in order of importance.

5.2 Database Design

Chapter 6 Conclusion

Nowadays there are many search engines, such as Google, Yahoo, and AltaVista. We have tried to develop a search engine with some of the facilities of current search engines, such as text search and news search. Still, there are limitations in our search engine.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval
[2] http://www.searchenginewatch.com/
[3] http://www.webcrawler.com/