Chapter 19
Web Crawler
Copyright © 2005 Pearson Addison-Wesley. All rights reserved. 19-2
Chapter Objectives
• Provide a case study example from problem statement through implementation
• Demonstrate how hash tables and graphs can be used to solve a problem
Web Crawler
• A web crawler is a system that searches the web, beginning with a user-designated web page, looking for a designated target string
• A web crawler follows all of the links on each page that it encounters until there are no more pages or until it reaches a designated limit
Web Crawler
• For this case study, we will create a graphical web crawler with the following requirements:
– Enter a designated starting web page
– Enter a target string for which to search
– Limit the search to 50 pages
– Display the results when done
Web Crawler - Design
• Our web crawler system consists of three high-level components:
– The driver
– The graphical user interface
– The web crawler implementation
• Makes use of graphs and hash tables
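The three-component decomposition above can be sketched as a set of class skeletons. This is only an illustration of the structure, not the book's actual code: the class names (`WebCrawler`, `CrawlerGUI`, `Driver`) and method signatures are hypothetical, and the GUI is reduced to console output.

```java
import java.util.List;

// Crawl logic component; would use a graph and a hash set internally.
class WebCrawler {
    // Stub: a real version would implement the search algorithm.
    List<String> search(String start, String target, int limit) {
        return List.of();  // no results in this skeleton
    }
}

// User-interface component: collects inputs, displays results.
// (Console stand-in for the real graphical interface.)
class CrawlerGUI {
    private final WebCrawler crawler;
    CrawlerGUI(WebCrawler crawler) { this.crawler = crawler; }
    void run(String start, String target) {
        System.out.println(crawler.search(start, target, 50));
    }
}

// The driver simply wires the components together and starts the GUI.
public class Driver {
    public static void main(String[] args) {
        new CrawlerGUI(new WebCrawler()).run("http://example.com", "data");
    }
}
```

Keeping the crawl logic behind its own class lets the graph-and-hash-table machinery be tested independently of the interface.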
Web Crawler - Design
• The algorithm for the web crawler is as follows:
– Add the starting page to a HashSet of pages to be searched and to our graph
– Remove a page from the set of pages to be searched
– Search the page for the target string
• If the string exists, add the page to the list of results
– Search the page for links
• If the links have not already been searched, add them to the set of pages to be searched and to our graph
– Repeat the three previous steps until our limit is reached or the set is empty
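The steps above can be sketched as a small Java method. This is a minimal illustration, not the book's implementation: real HTTP fetching is replaced by a tiny hypothetical in-memory "web" (the `content` and `links` maps), and the graph bookkeeping is reduced to a visited set.

```java
import java.util.*;

public class CrawlerSketch {
    // Hypothetical in-memory "web" standing in for real page fetching.
    static Map<String, String> content = Map.of(
        "A", "hello target", "B", "nothing here", "C", "target again");
    static Map<String, List<String>> links = Map.of(
        "A", List.of("B", "C"), "B", List.of("A"), "C", List.of());

    static List<String> crawl(String start, String target, int limit) {
        Set<String> seen = new HashSet<>();           // pages already added
        Deque<String> toSearch = new ArrayDeque<>();  // pages waiting to be searched
        List<String> results = new ArrayList<>();
        seen.add(start);
        toSearch.add(start);
        int searched = 0;
        while (!toSearch.isEmpty() && searched < limit) {
            String page = toSearch.remove();          // take a page from the set
            searched++;
            if (content.getOrDefault(page, "").contains(target))
                results.add(page);                    // target found on this page
            for (String link : links.getOrDefault(page, List.of()))
                if (seen.add(link))                   // true only if not seen before
                    toSearch.add(link);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(crawl("A", "target", 50)); // pages A and C match
    }
}
```

The `seen.add(link)` idiom does the duplicate check and the insertion in one constant-time hash-set operation, which is exactly why a HashSet is the right structure for tracking visited pages.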
FIGURE 19.1 User interface design
FIGURE 19.2 UML description