A World Wide Web Search Engine Using Hyperlink Structure To Improve Web Searches

Hong Fan

Certificate of Approval:

Gerry V. Dozier, Assistant Professor, Computer Science and Software Engineering
Wenchen Hu, Cha, Assistant Professor, Computer Science and Software Engineering
David A. Umphress, Associate Professor, Computer Science and Software Engineering
John F. Pritchett, Dean, Graduate School

A World Wide Web Search Engine Using Hyperlink Structure To Improve Web Searches

Hong Fan

A Project Report

Submitted to

the Graduate Faculty of

Auburn University

in Partial Fulfillment of the

Requirements for the

Degree of

Master of Software Engineering

Auburn, Alabama

December 16, 2000

PROJECT REPORT ABSTRACT

A World Wide Web Search Engine Using Link Structure To Improve Web Searches

Hong Fan

Master of Software Engineering, December 16, 2000 (B.S., China Textile University, 1990)

71 Typed Pages

Directed by Wenchen Hu

The World Wide Web is growing rapidly. As an important new medium for communication, it provides a tremendous amount of information on a wide range of topics and creates new challenges for information retrieval. A search engine provides users with an efficient means of searching for valuable information on the Web. This project aims to improve the performance of text-based search engines by applying a ranking algorithm based on hyperlink structure. The new search engine works on top of current text-based search engines. It is composed of a spider software component, a page-ranking kernel, and a local database system. To evaluate the performance of the new Web search engine, its results were compared with the results obtained for the same queries from the search engines Alta Vista and Excite. The experiments showed that the prototype performed significantly better than the purely text-based search engines.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
    1.1 The significance of Web and Web search engines
    1.2 Disadvantages of the current search engines
    1.3 A proposed system for improving Web searches
2 LITERATURE REVIEW
    2.1 Mining the Web hyperlink structure
        2.1.1 Web structures
        2.1.2 A novel method using hyperlink information
        2.1.3 A prototype using the pagerank algorithm
    2.2 Web page ranking algorithm
3 SYSTEM STRUCTURE
    3.1 System interface
        3.1.1 Interface of the ranking system
        3.1.2 Web interface for listing rank results
    3.2 The spider
    3.3 Database
4 A WEB PAGE RANKING ALGORITHM USING HYPERLINK INFORMATION
    4.1 The algorithm
        4.1.1 Search and growth
        4.1.2 Weight and propagation
        4.1.3 The mathematical foundation of iteration
    4.2 Building the ranking system
        4.2.1 Constructing the root set
        4.2.2 Extending the root set into the base set
        4.2.3 Analyzing hyperlinks
        4.2.4 Calculate the rank score
    4.3 Examples of convergence
5 RESULTS AND DISCUSSION
    5.1 Quality comparison
        5.1.1 Comparison to Alta Vista
        5.1.2 Comparison to Excite
    5.2 Analysis of the results
        5.2.1 Removal of poor quality Web pages
        5.2.2 Removal of non-relevant Web pages
        5.2.3 Relationship between hubs and authorities
6 CONCLUSIONS AND FUTURE WORK
REFERENCES

LIST OF FIGURES

Figure 2.1 Directions of hypertext links [2]
Figure 2.2 Multiple links starting from the same page [3]
Figure 2.3 A single link of arbitrary depth [3]
Figure 2.4 A densely linked set of hubs and authorities [7]
Figure 3.1 The system structure
Figure 3.2 The execution path for retrieving database information
Figure 3.3 User interface of the ranking system
Figure 3.4 The web interface for listing rank results
Figure 4.1 Expanding the root set into a base set
Figure 4.2 The basic operations for calculating authority and hub weights
Figure 5.1 Quality comparison
Figure 5.2 Good and excellent comparison
Figure 5.3 Quality comparison
Figure 5.4 Good and excellent comparison
Figure 5.5 Web page contents of www.iteachnet.com/wwwboard/wwwboard.html
Figure 5.6 Web page contents of http://w3.one.net/~ballet/
Figure 5.7 Web page contents of http://www.dancer.com/dance-links/ballet.htm
Figure 5.8 The mutually reinforcing relationship between hubs and authorities

LIST OF TABLES

Table 4.1. (a) Authority weights of returned URLs from Excite
Table 4.1. (b) Hub weights of returned URLs from Excite
Table 4.2. (a) Authority weights of returned URLs from Alta Vista
Table 4.2. (b) Hub weights of returned URLs from Alta Vista
Table 5.1 The top 10 URLs given by Alta Vista with query string “Computer job”
Table 5.2 The top 10 URLs from the ranking system with query string “Computer job”
Table 5.3 The top 10 URLs given by Alta Vista with query string “Ballet”
Table 5.4 The top 10 URLs from the ranking system with query string “Ballet”
Table 5.5 The top 10 URLs given by Alta Vista with query string “Camera”
Table 5.6 The top 10 URLs from the ranking system with query string “Camera”
Table 5.7 The top 10 URLs given by Excite with query string “Internet”
Table 5.8 The top 10 URLs from the ranking system with query string “Internet”
Table 5.9 The top 10 URLs given by Excite with query string “Java”
Table 5.10 The top 10 URLs from the ranking system with query string “Java”
Table 5.11 The top 10 URLs searched by Alta Vista and their authority weights for the query string “Computer job”
Table 5.12 The top 10 URLs from the ranking system and their old ranks from Alta Vista for the query string “Computer job”
Table 5.13 The top 10 URLs searched by Alta Vista with query “ballet”
Table 5.14 The top 10 authorities ranked by the ranking system with query “ballet”
Table 5.15 The top 5 authorities and top 5 hubs and their ranks and weights for the query “ballet”

Chapter 1 Introduction

1.1 The Significance of Web and Web Search Engines

The World Wide Web is growing rapidly: about 1 million pages are added daily, and the amount of information on the Web has changed the way people think about and seek information. As an important medium, the Web provides a tremendous amount of information on a wide range of topics, and the number of both experienced and inexperienced users is increasing at a phenomenal rate. When searching the Web, users are usually looking for very specific information on a particular topic. Taken together, however, their searches cover a huge range of topics, and this combination of minute detail and diverse subject range creates a tremendous challenge for the development of information retrieval techniques. There must be a way to locate information relevant to a user’s particular interests within the reservoir of available Web resources.

Search engines provide users with an efficient means of searching for valuable information on the Web. Many search engines support information retrieval on the Web, such as Alta Vista, Excite, Infoseek, and Lycos. A search engine usually collects Web pages from the Internet through a spider, also known as a crawler or robot; the collected pages are then scanned and indexed based on the full text of the documents. In a typical search procedure, the user submits a query, which is simply a word or a combination of words used as keywords. The search engine examines its backend database for any indexed document that matches the query and then returns a list of related Web pages. In this way, a Web user can quickly obtain the set of all Web pages in the search engine’s database that contain the given keywords.

1.2 Disadvantages of the Current Search Engines

Traditional text-based Web search engines, which rely on keyword matching, visit Web sites, fetch pages, and analyze text information to build indices. With the explosive growth in the amount of information on the Internet, the number of documents in the indices has been increasing by many orders of magnitude. As a result, the results returned for a query may contain several thousand, or even millions, of relevant Web pages. For example, if the search engine Excite is given the keyword “internet”, over 25 million Web pages will be found. Typically, a user will be willing to look at only a few of these pages, usually the first ten results. One problem with text-based search engines is that many Web pages among the returned results are low-quality matches. It is also common practice for some advertisers to attempt to gain people’s attention by taking measures meant to mislead automated search engines, such as adding spurious keywords to trick a search service into rating a page highly for a popular subject. How to select the highest-quality Web pages for placement at the top of the returned list is the main concern of search engine design.

Another problem for those designing search engines is that most users are not experts in information retrieval. The Web user asking the question may not have enough experience to formulate the query correctly. It is not always intuitively easy to formulate queries that narrow the search to the precise area of interest. Furthermore, regular users generally do not understand how the search mechanisms work. As mentioned in [1], the document indices constructed by search engines are designed to be general and applicable to all. If a user tries to narrow his or her search by including all senses as a key search term, it often results in irrelevant information being presented. On the other hand, if a user is skilled enough to formulate an appropriate query, most search engines will retrieve pages with adequate recall (the percentage of relevant pages retrieved among all possible relevant pages), but with poor precision (the ratio of relevant pages to the total number of pages retrieved).
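The two measures defined above can be illustrated with a minimal Python sketch. The function name `precision_recall` and the page identifiers are hypothetical and serve only as an illustration; they are not part of the project.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: pages returned by the search engine
    relevant:  all pages that are actually relevant to the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    relevant_retrieved = retrieved_set & relevant_set
    # Precision: relevant pages retrieved / total pages retrieved.
    precision = len(relevant_retrieved) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: relevant pages retrieved / all relevant pages.
    recall = len(relevant_retrieved) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Four pages are returned, two of which are among the three relevant pages.
precision, recall = precision_recall(["p1", "p2", "p3", "p4"], ["p1", "p2", "p5"])
```

In this example half of the returned pages are relevant (precision 0.5), while two of the three relevant pages were found (recall of about 0.67).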

These disadvantages indicate that the performance of current search engines is far from satisfactory. How to improve the quality of Web search results is a subject that is being widely studied.

1.3 A Proposed System for Improving Web Searches

The purpose of this project is to develop a ranking system that can improve the behavior of text-based search engines by implementing the HITS algorithm presented by the IBM Almaden Research Center [6,7]. The algorithm takes advantage of the Web’s link structure. It analyzes hyperlinks to uncover two types of pages:

• authorities, which provide the best source of information on a given topic;

• hubs, which provide collections of links to authorities.

The mutually reinforcing relationship between hubs and authorities (a good authority is a page pointed to by many good hubs, while a good hub is a page that points to many good authorities) serves as the central theme in the exploration of link-based methods for Web information search.

The proposed system was implemented as a post-processor that works on top of current search engines such as Alta Vista and Excite. It was developed to address two parameters: relevance and quality. It focuses on the use of links for analyzing the collection of pages relevant to a broad search topic and for discovering the most “authoritative” pages on each topic. The algorithm computes lists of hubs and authorities for Web search topics. Beginning with a search topic, the rank model has two main steps:

• a search-and-growth phase, which constructs a collection of Web pages

with respect to a search topic by producing a set of relevant pages rich in

candidate authorities.

• a weight-and-propagation phase, which numerically estimates the weights of hubs and authorities by an iterative procedure.

The rearranged search results are returned as authorities for the search topic: the higher a page’s authority weight, the higher its position in the list.
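The weight-and-propagation idea can be sketched in a few lines of Python. This is a minimal sketch of the commonly published HITS update rule (an authority weight is the sum of the hub weights of the pages pointing to it, a hub weight is the sum of the authority weights of the pages it points to, and both are normalized after every round). It is not the project's own Java implementation, and the names `hits`, `normalize`, and `links` are assumptions made for the example.

```python
import math


def normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {page: v / norm for page, v in weights.items()}


def hits(links, iterations=20):
    """links maps each page to the list of pages it points to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of all pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the authority weights of all pages p points to.
        new_hub = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_hub[src] += new_auth[dst]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical link graph: two hub pages pointing to two candidate authorities.
auth, hub = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
ranked = sorted(auth, key=auth.get, reverse=True)  # authorities, best first
```

Here `a1` receives the highest authority weight because both hubs point to it, and `h1` is the better hub because it points to both authorities.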

Chapter 2 Literature Review

Many studies have aimed to improve the behavior of current search engines. The problem with current approaches is that they almost invariably evaluate a Web page in terms of its text information alone. They fail to take into account the Web structure, in particular the hyperlinks. The network structure of the hyperlink environment can be a rich source of information about the content of the environment. Analyzing the hyperlink structure of Web pages offers a way to improve the behavior of text-based search engines, providing an effective method for locating not only a set of relevant pages, but also the relevant pages of the highest quality. In this chapter, we present a short overview of the existing data mining methods.

2.1 Mining The Web Hyperlink Structure

One of the main advantages of the Web is its ability to redirect the information flow via hyperlinks. In order to evaluate the informative content of a Web page, the Web structure has to be carefully analyzed. Hyperlink analysis, which is capable of measuring the potential information contained in a Web page with respect to the Web space, has recently gained more and more attention.

The links to and from Web pages are an important resource that has largely gone unused in existing search engines. Web pages differ from general text in that they possess external and internal structure. The types of Web pages and the links between documents can be useful information for finding pages on a given set of topics. Making use of Web link information will allow the construction of more powerful tools for answering user queries.

2.1.1 Web Structures

The types of links a Web site may contain have been studied in detail in [2]. As Fig. 2.1 shows, hypertext links within a Web site can be upward in the file hierarchy, downward, or crosswise. Links pointing to other sites are referred to as outward links. The mix of link types can help identify the type of a Web page. For example, a page that contains many outward links is typically a topic index page, while a page that also contains many links, most of them downward, is typically an institutional homepage. In some types of sites, such as Yahoo, most of the links are downward links to subcategories or outward links. Furthermore, we can infer other information about a page from the number of links to it and from it. For example, we might guess that a page is popular if it has more links toward it than from it. Note that pages have both topics (such as software engineering) and types (such as homepage, index, or Yahoo page).
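The four link directions can be derived from a pair of URLs. The following Python sketch shows one way to do so. It assumes that every path segment of a URL corresponds to one level of the file hierarchy, which is a simplification; the function name `classify_link` is hypothetical and does not come from [2].

```python
from urllib.parse import urlparse


def classify_link(source, target):
    """Classify a link from source to target as outward, downward, upward, or crosswise."""
    src, dst = urlparse(source), urlparse(target)
    # A link to a different server is outward.
    if src.netloc.lower() != dst.netloc.lower():
        return "outward"
    src_parts = [part for part in src.path.split("/") if part]
    dst_parts = [part for part in dst.path.split("/") if part]
    # Downward: the target lies below the source in the hierarchy.
    if len(dst_parts) > len(src_parts) and dst_parts[:len(src_parts)] == src_parts:
        return "downward"
    # Upward: the target is an ancestor of the source.
    if len(src_parts) > len(dst_parts) and src_parts[:len(dst_parts)] == dst_parts:
        return "upward"
    # Anything else on the same server is crosswise.
    return "crosswise"


# The pages shown in Fig. 2.1, viewed from www.yahoo.com/Science.
page = "http://www.yahoo.com/Science"
directions = [
    classify_link(page, "http://www.yahoo.com/Science/Computer_Science"),
    classify_link(page, "http://www.yahoo.com/"),
    classify_link(page, "http://www.yahoo.com/Health/Medicine"),
    classify_link(page, "http://www.ansa.co.uk/"),
]
```

Counting how many of a page's links fall into each class then gives a rough signal of its type, for example many outward links for a topic index page.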

Figure 2.1 Directions of hypertext links. Links on the same server can be upward or downward in the file hierarchy or crosswise. Links to other servers are considered outward [2]. (Example pages shown in the figure: www.yahoo.com, www.yahoo.com/Science, www.yahoo.com/Science/Computer_Science, www.yahoo.com/Health/Medicine, and www.ansa.co.uk.)

2.1.2 A Novel Method Using Hyperlink Information

A novel method has been presented that aims to increase the precision of Web search results by extracting hyperlink information from a Web object [3]. This method treats a Web page as an object. The Web object is composed not only of its static textual information but also of its hyper information, which is the dynamic information content provided by hyperlinks. Thus, the overall information describing a Web object includes both hyper information and textual information, i.e. INFORMATION = TEXTINFO + HYPERINFO, where the value of INFORMATION determines the position of a Web page with respect to a certain query. This model analyzes the Web structure based on the multiple links in the same page, shown in Fig. 2.2, where A is the start page and $B_1, \dots, B_n$ are the pages pointed to by start page A. A single link of arbitrary depth $k$ is shown in Fig. 2.3, where A is the start page and $B_k$ is the page pointed to by $B_{k-1}$.

Figure 2.2 Multiple links starting from the same page [3].

Figure 2.3 A single link of arbitrary depth [3].

For a single link such as that shown in Fig. 2.3, the hyper information for start page A is obtained by adding the contribution of each Web object $B_k$ at depth $k$, whose textual information is diminished by a fading factor that depends on its depth. The contribution to the hyper information of page A by an object B at depth $k$ is $F^k \cdot \mathrm{TEXTINFO}(B)$, where $F$ is a suitable fading factor ($0 < F < 1$). By fixing a certain depth $k$, the overall information of a given Web object A will be

\begin{align*}
\mathrm{INFORMATION}(A) &= \mathrm{TEXTINFO}(A) + \mathrm{HYPERINFO}(A) \\
&= \mathrm{TEXTINFO}(A) + F\bigl(\mathrm{TEXTINFO}(B_1) + F\bigl(\mathrm{TEXTINFO}(B_2) + F\bigl(\mathrm{TEXTINFO}(B_3) + \cdots + F\,\mathrm{TEXTINFO}(B_k)\bigr)\bigr)\bigr) \\
&= \mathrm{TEXTINFO}(A) + F\,\mathrm{TEXTINFO}(B_1) + F^2\,\mathrm{TEXTINFO}(B_2) + \cdots + F^k\,\mathrm{TEXTINFO}(B_k).
\end{align*}
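As a worked example of the formula above, the following Python sketch evaluates INFORMATION(A) for a single chain of links. The textual scores and the fading factor of 0.5 are made-up values, and the function name `chain_information` is hypothetical.

```python
def chain_information(text_info, fading=0.5):
    """Overall information of start page A along a single chain of links.

    text_info[0] is TEXTINFO(A); text_info[k] is TEXTINFO(B_k) at depth k.
    Each score is diminished by fading ** depth.
    """
    return sum((fading ** depth) * score for depth, score in enumerate(text_info))


# TEXTINFO(A) = 1.0, TEXTINFO(B_1) = 0.8, TEXTINFO(B_2) = 0.4, F = 0.5:
# 1.0 + 0.5 * 0.8 + 0.25 * 0.4 = 1.5
information = chain_information([1.0, 0.8, 0.4], fading=0.5)
```

A smaller fading factor makes deeper pages count less, so the textual content of the page itself dominates the score.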

In general, a Web object has multiple links in the same page (see Fig. 2.2). The user cannot follow all the links at the same time but must select them sequentially. The rank model assumes that the user selects the most informative link first and the least informative link last. Then, for a given Web object A, the hyper information contributed by all links at depth k can be summed as

\[ F\,\mathrm{TEXTINFO}(B_1) + \cdots + F^n\,\mathrm{TEXTINFO}(B_n). \]

Compared to a random selection of the links, this “sequence of selections” is the sequence that maximizes the hyper information. This model can work on top of any textual information function and has been implemented on the client side as a post-processor for the main search engines.
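The effect of the selection order can be checked with a small Python sketch. Both function names and the scores are hypothetical. The sketch sorts the links by decreasing textual score before applying the fading factor, which is the ordering assumed by the model above.

```python
def ordered_sum(scores_in_selection_order, fading=0.5):
    """Hyper information when the links are followed in the given order."""
    return sum((fading ** i) * score
               for i, score in enumerate(scores_in_selection_order, start=1))


def hyper_information(link_scores, fading=0.5):
    """Hyper information when the most informative link is selected first."""
    return ordered_sum(sorted(link_scores, reverse=True), fading)


best = hyper_information([0.2, 0.8])   # 0.5 * 0.8 + 0.25 * 0.2
other = ordered_sum([0.2, 0.8])        # 0.5 * 0.2 + 0.25 * 0.8
```

Because the weights $F, F^2, \dots, F^n$ decrease, pairing the largest score with the largest weight yields the largest possible sum, so `best` is never smaller than `other`.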

The above model is based on both link structure and textual information. However, a new model that depends heavily on hyperlink structure has been developed in order to improve text-based search engines and obtain more precise search results.

2.1.3 A New Search Engine Using the PageRank Algorithm

Google, a prototype with a full-text and hyperlink database, is designed to crawl and index the Web efficiently and to return much more satisfying search results than existing systems [4]. It makes use of the link structure of the Web to calculate a quality ranking for each Web page. The rank algorithm used by Google is PageRank [5]. PageRank extends to pages on the Web the idea that the importance or quality of an academic publication can be evaluated by its citations: a Web page can similarly be evaluated by counting its back links.

In particular, the creation of a hyperlink by the author of a Web page represents an implicit endorsement of the page being pointed to; by mining the collective judgment contained in the set of such endorsements, people can gain a richer understanding of the relevance and quality of the Web’s contents. Thus, by not counting links from all pages equally, and by normalizing by the number of links on a page, PageRank gives an objective measure of the citation importance of a Web page that corresponds well with people’s subjective idea of importance. The PageRank value of a page A, PR(A), is given as follows:

\[ PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right) \]

where $T_1, \dots, T_n$ are the pages pointing to page A, the parameter $d$ is a damping factor between 0 and 1, and $C(T_i)$ is the number of links going out of page $T_i$. PR(A) can be calculated using a simple iterative algorithm and corresponds to the principal eigenvector of the normalized link matrix of the Web. The PageRank metric, PR(A), recursively defines the importance of a page A to be the weighted sum of the back-links to it.
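The simple iterative calculation mentioned above can be sketched as follows in Python. The sketch applies the formula for PR(A) directly to a small hypothetical link graph. The function name `pagerank`, the damping value 0.85, and the fixed number of iterations are assumptions for the example, and each page is assumed to list a target at most once.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # C(T): number of links going out of page T.
    out_count = {p: len(links.get(p, [])) for p in pages}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that points to this page.
            incoming = sum(pr[src] / out_count[src]
                           for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A and B both point to C, and C points back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

In this graph page B has no back links, so its value stays at $1 - d$, while C, which is pointed to by two pages, ends up with the highest value.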

2.2 Web Page Ranking Algorithm

Ranking algorithms, when applied to the large number of results returned by a search engine, can help users select those most valuable to them from the sea of Web resources. In practice, given a Web page p and a user’s query q, the ranking algorithm computes a score rank(p, q). The higher rank(p, q) is, the more valuable the Web page p is likely to be for the query q. Various methods have been applied to develop rank algorithms. The Google prototype described previously is an example of a ranking algorithm based on hyperlink analysis. Another ranking model, the Clever system [6,7], was designed to improve the performance of current search engines by using the Hyperlink Induced Topic Search (HITS) algorithm. It can work with any existing text-based search engine and rearranges the returned results by applying its ranking algorithm. It classifies the relevant pages returned for a given query into two categories: authority pages, which contain rich information, and hub pages, which collect the authority pages together.

The advantage of the Clever system is that it considers not only the in-degree but also the out-degree of a Web page. An example is depicted in Fig. 2.4. The nodes correspond to Web pages, and a directed edge indicates the presence of a link from one page to another. The hub pages in effect glue together authorities on a common topic. This is a great improvement, because it avoids the problem of unrelated pages of large in-degree obtaining high rank scores.

This chapter presented some of the work that has been done to improve the behavior of text-based search engines. The next chapter introduces the project.

Figure 2.4 A densely linked set of hubs and authorities [7]. (The figure shows hub pages linking to authority pages, together with an unrelated page of large in-degree.)

Chapter 3 System Structure

The project consists of two parts: a ranking system, consisting of a spider and a computational kernel, and a local database system for saving and retrieving the ranking results. Fig. 3.1 shows the architecture of the project. The ranking system, illustrated in the rectangular dashed block, searches URLs and computes the scores for ranking pages. It saves the URLs and their rank scores in a backend database. The details of accessing database information are shown in Fig. 3.2. The on-line database system enables a user to retrieve the information saved by the ranking system, listing the URLs according to their rank scores.

3.1 System Interface

3.1.1 Interface of the Ranking System

The interface for the ranking system was written as a Java applet (Java 1.1.7). Fig. 3.3 shows the user interface. When a user enters a query string into the keyword field of the interface, the rank system first sends it to a text-based search engine. For this project, the search engine used was Alta Vista or Excite; the user can select either one directly from the interface. The number of returned URLs is controlled by entering the desired number into the “search limit” field. All the URLs returned by the selected search engine make up the root set. The root set is expanded into the base set by adding newly found URLs referenced by any page in the root set. The details of the construction of the root set and base set are described in Section 4.2. The interface provides two windows that list the root set URLs and the base set URLs. Users can watch the growth of the root set and base set while the spider program is running in the background. A label between these two windows indicates which stage the process has reached after the rank system starts working.

Figure 3.1 The system structure. (The figure shows a query sent to Alta Vista or Excite, the returned URLs passed to the spider software, the search-and-growth and weight-and-propagation steps, and the rank scores saved into the database.)

Figure 3.2 The execution path for retrieving database information. (The figure connects the Web browser query and the Web browser report to the database through CGI/Perl and JDBC.)

Figure 3.3 User interface of the ranking system.

3.1.2 Web Interface For Listing Rank Results

Fig. 3.4 illustrates the Web interface for accessing the local database information.

The Web interface was written in HTML. For listing the rank results, it requires a user to

first choose a topic and decide on the number of URLs to be included in the returned list,

then select the search engine associated with the rank results, and finally select the rank

method. The database system can then return a list of URLs based on the selected search

engine, relevant to the chosen topic, which have been ranked according to the specified

ranking method.

Figure 3.4 The web interface for listing rank results.

3.2 The Spider

The implementation of the rank system includes the use of spider software. A spider, which may also be called a crawler or a robot, is a software program that automatically traverses the Web and downloads the network resources referred to by URLs. The working mechanism of a spider is simple. A spider starts by parsing a specified Web page, noting any hypertext links on that page that point to other Web pages. It then parses those pages for new links, recursively. Spider software does not actually move around to different computers on the Internet, as viruses or intelligent agents do, but resides on a single machine and sends HTTP requests for documents to other machines on the Internet, just as a Web browser does when the user clicks on links. All a spider really does is automate the process of following links. Following links is not itself of any great use, but the list of linked pages almost always serves some subsequent purpose.

The most common use is to build an index for a Web search engine, although spiders are also used for other purposes. Spiders may also be used to:

• Test web pages and links for valid syntax and structure.

• Monitor sites to see when their structure or contents change.

• Search for copyright infringements.

• Build a special-purpose index—for example, one that has some understanding of

the content stored in multimedia files on the Web.

In this project, the spider executed the task of building the root set and extending

the root set into the base set. It downloaded the contents of a URL and picked out all the

URLs referenced by the Web page.

The spider program was written in Java 1.1.7. It used the URL class and its method openStream() to download the contents of a specified URL. The spider identified the URL links of a specified Web page by parsing the downloaded information. There are several ways to discover the URLs in an HTML file. In my program, the spider collected all the URLs for building the root set and the base set by picking out the string following the HTML tag “<a href=”. The number of URLs in the root set was also controlled by the spider software.
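The link extraction step can be sketched in Python as follows. Note that this sketch uses the standard library's `html.parser` module rather than the string search after “<a href=” described above, and that it is not the project's Java code. The class name `LinkExtractor`, the function `extract_links`, and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs referenced by the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = ('<p><a href="/jobs.html">Jobs</a> '
          '<A HREF="http://example.org/x">X</A> '
          '<a name="top">no link</a></p>')
found = extract_links(sample, "http://example.com/index.html")
```

For the sample page, `found` contains the two absolute URLs; the anchor without an `href` attribute is ignored.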

3.3 Database

Oracle 8 was used as the backend database. The database was designed to be

simple and illustrative. One table named PAGE was created to store each page’s

information. Each record has 6 fields: keyword, URLs, search engine, old rank, authority

rank, and hub rank. The first three fields were specified to be not null.
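The table layout described above can be sketched as follows. The sketch uses Python with SQLite in place of Oracle 8 so that it is self-contained. The column names, the column types, and the helper functions `create_page_table` and `top_authorities` are assumptions; only the six fields and the three not-null constraints come from the description.

```python
import sqlite3


def create_page_table(conn):
    """Create the PAGE table with the six fields described in the text."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS page (
               keyword        TEXT NOT NULL,
               url            TEXT NOT NULL,
               search_engine  TEXT NOT NULL,
               old_rank       INTEGER,
               authority_rank INTEGER,
               hub_rank       INTEGER
           )"""
    )


def top_authorities(conn, keyword, search_engine, limit=10):
    """List the URLs for one topic and search engine, best authority rank first."""
    rows = conn.execute(
        "SELECT url FROM page "
        "WHERE keyword = ? AND search_engine = ? AND authority_rank IS NOT NULL "
        "ORDER BY authority_rank LIMIT ?",
        (keyword, search_engine, limit),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_page_table(conn)
```

A report for the Web interface would then correspond to one call such as `top_authorities(conn, "ballet", "Alta Vista", limit=10)`.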

The database access follows the execution path shown in Fig. 3.2. The user input

from the Web page is extracted and passed to a Java program as arguments by a Perl

program. The Perl program does not talk to the database directly in this project. It calls

the Java program, which can access the database and operate on it with the arguments it

receives. JDBC is JavaSoft’s database connectivity specification. It creates a

programming-level interface for communicating with databases in a uniform manner. The

Java program talks to the database using SQL statements and prints out reports for users

as HTML files through the Perl program. Finally, users can retrieve the database

information through a Web browser.

Chapter 4 A Web Page Ranking Algorithm Using Hyperlink Information

4.1 The Algorithm

There are two phases in the development of this ranking system. The first is the

search and growth phase. The second is the weight and propagation phase, in which the

results returned by the first stage are evaluated.

4.1.1 Search And Growth

To analyze the hyperlink information of available WWW pages, the ranking system first constructs a collection of Web pages about a query string $\sigma$. Since the search results may contain millions of pages, the number of Web pages in the collection must be limited to a reasonable quantity so that the system can reach a compromise between obtaining a collection of highly relevant pages and saving computational effort. To construct such a collection of pages, the ranking system makes use of the results given by a text-based search engine. The search engine returns a set of documents, determined by its own scoring function, which serves as the root set $R_\sigma$. The system then extends the root set $R_\sigma$ by adding any additional document that is pointed to by a document already in the root set. This is shown in Fig. 4.1. The new collection is called the base set and denoted by $S_\sigma$. In this way, the link structure analysis can be restricted to a subset $S_\sigma$, which has the following properties:

(1) $S_\sigma$ is relatively small.

(2) $S_\sigma$ is rich in relevant pages.

(3) $S_\sigma$ contains most (or many) of the strongest authorities.

Figure 4.1 Expanding the root set into a base set. (The figure shows the root set $R_\sigma$ inside the larger base set $S_\sigma$.)

Next, the ranking system calculates the rank score of each page based on the link structure between the node pairs in the base set $S_\sigma$, and extracts good authorities and hubs from the overall collection of pages.

4.1.2 Weight and Propagation

In this phase, the basic principle introduced in Chapter 1, which assumes that a good authority page is pointed to by many good hub pages and a good hub page points to many good authority pages, is converted into a method for finding good hubs and authorities.

When applying this method, each page p is assigned a non-negative authority weight x(p) and a non-negative hub weight y(p). The relationship between hubs and authorities is expressed via an iterative computation that maintains and updates the numerical weights for each page. When the results are evaluated, a good authority receives a high score for x and a good hub receives a high score for y.

The iteration leads to a fast growth in the actual magnitudes of x(p) and y(p). To keep their values bounded, the current weights x and y are normalized in the algorithm. In this project, all x and y values were initially set to a uniform constant, and the weights of each type were normalized as follows:

$$\sum_{p \in S_\sigma} (x(p))^2 = 1, \tag{1}$$

$$\sum_{p \in S_\sigma} (y(p))^2 = 1. \tag{2}$$

Thus, we maintain the sum of their squares at 1. Since only the relative values of the weights matter in our computation, the final results are essentially unaffected by the initialization of all weights.

An alternative way of expressing the mutually reinforcing relationship between hubs and authorities is: if a page p points to many pages with large x-values, it should receive a large y-value; and if a page p is pointed to by many pages with large y-values, it should receive a large x-value. Thus, it is reasonable to update x(p) for a page p to be the sum of y(q) over all pages q that link to p:

$$x(p) = \sum_{q \,:\, q \to p} y(q), \tag{3}$$

where the notation q → p indicates that q links to p.

Similarly, we can update the hub weight via

$$y(p) = \sum_{q \,:\, p \to q} x(q). \tag{4}$$

Fig. 4.2 shows these two operations, (3) and (4), the basic means by which hubs and authorities reinforce one another in an alternating iteration.

Each iteration consists of two steps:

(1) replace each x(p) by the sum of the y(q) values of pages pointing to p;

(2) replace each y(p) by the sum of the x(q) values of pages pointed to by p.

In the algorithm, the iteration will not stop until a fixed point is reached, i.e., both

the authority and hub weights converge to fixed values.

Figure 4.2 The basic operations for calculating authority and hub weights. [Diagram: pages q1, q2, q3 pointing to page p, labeled “x[p] := sum of y[q], for all q pointing to p”; and page p pointing to pages q1, q2, q3, labeled “y[p] := sum of x[q], for all q pointed to by p”.]

4.1.3 The Mathematical Foundation of Iteration

The mathematical foundation of the iterative method follows from the theory of eigenvectors in [8]. To explain it simply, let us define an n × n adjacency matrix A, whose (i, j)th entry is equal to 1 if page i links to page j, and is 0 otherwise. In our case, we can treat the set of all authority weights as a vector x = (x1, x2, …, xn), and in the same way define the set of all hub weights as a vector y = (y1, y2, …, yn). Then the update rule for x can be written as x ← A^T y and the update rule for y can be written as y ← Ax. Going further, we can write

$$x \leftarrow A^T y \leftarrow A^T A x = (A^T A)\,x$$

and

$$y \leftarrow A x \leftarrow A A^T y = (A A^T)\,y.$$

Thus, the vector x after multiple iterations is precisely the result of applying the power iteration technique to A^T A. Linear algebra tells us that this sequence of iterates, when normalized, converges to the principal eigenvector of A^T A. Similarly, the sequence of values for the normalized vector y converges to the principal eigenvector of A A^T. The relationship between eigenvectors and power iteration is described in detail in [9].
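A small worked example may make the matrix form concrete. The three-page graph below is an assumption chosen for illustration (pages 1 and 2 both link to page 3, and there are no other links); it is not taken from the project's data.

```latex
A = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix},
\qquad
A^T A = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 2 \end{pmatrix},
\qquad
A A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}.

% Starting from x_0 = y_0 = (1, 1, 1)^T:
x' = A^T y_0 = (0, 0, 2)^T
  \;\Rightarrow\; x = (0, 0, 1)^T \text{ after normalization},
\qquad
y' = A x' = (2, 2, 0)^T
  \;\Rightarrow\; y = \tfrac{1}{\sqrt{2}}\,(1, 1, 0)^T .
```

Here (0, 0, 1) is the principal eigenvector of A^T A and (1, 1, 0)/√2 is the principal eigenvector of A A^T, both with eigenvalue 2, so in this tiny graph the weights already sit at the fixed point after one iteration: page 3 is the only authority, and pages 1 and 2 are equally good hubs.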

4.2 Building the Ranking System

4.2.1 Constructing the Root Set

Since the ranking system works on top of other text-based search engines, the URLs in the root set are the search results returned by whichever existing search engine, such as Alta Vista or Excite, the user has chosen. For these Web search engines, once the query term is submitted via their interface, a formatted query statement is constructed and sent to their database through CGI. Each search engine usually has its own format for the URL of the returned page. Thus, it is necessary to construct a different URL format for each individual search engine in order to get a list of the Web pages associated with the query term. However, the URL follows a fixed format for each search engine, which makes automatic searches possible.

If Alta Vista is chosen, the URLs of the result pages are:

• For listing the top 10 matched pages

http://www.altavista.com/cgi-bin/query?pg=q&what=web&fmt=.&q=keyword

• For listing the next 10 matched pages

http://www.altavista.com/cgi-bin/query?pg=q&stype=stext&Translate=on&sc=on&q=keyword&stq=10

• For listing the matched pages from 20 to 30

http://www.altavista.com/cgi-bin/query?pg=q&stype=stext&Translate=on&sc=on&q=keyword&stq=20

The embedded “keyword” in these URLs can be any query string. If the query term consists of more than one word, for example “computer book”, the words are joined with a plus sign, so “keyword” becomes “computer+book”.
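The URL construction described above can be sketched in a few lines. The class `AltaVistaQuery` and the method `resultPageUrl` are hypothetical names; the URL formats are copied from the list above and reflect the search engine's interface at the time of the project, so they should not be expected to work against any current service.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds the Alta Vista result-page URLs quoted in the text.
final class AltaVistaQuery {
    private static final String BASE = "http://www.altavista.com/cgi-bin/query?";

    // offset 0 gives the first result page; 10, 20, ... give the following pages.
    static String resultPageUrl(String query, int offset) {
        // URLEncoder turns spaces into '+', e.g. "computer book" -> "computer+book".
        String keyword = URLEncoder.encode(query.trim(), StandardCharsets.UTF_8);
        if (offset == 0) {
            return BASE + "pg=q&what=web&fmt=.&q=" + keyword;
        }
        return BASE + "pg=q&stype=stext&Translate=on&sc=on&q=" + keyword
             + "&stq=" + offset;
    }

    public static void main(String[] args) {
        System.out.println(resultPageUrl("computer book", 0));
        System.out.println(resultPageUrl("computer book", 10));
    }
}
```

Using `URLEncoder` rather than a plain string replacement also escapes other characters that are not allowed in a query string, which is a design choice of this sketch rather than something the report specifies.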

This method also works with Excite. If Excite is chosen:

• The URL of the result page that displays the top 10 matched pages is

http://search.excite.com/search.gw?search=keyword

• The URL for listing the next 10 matched URLs is

http://search.excite.com/search.gw?c=web&s=keyword&showSummary=true&start=10&perPage=10&next=Next+Results

In this way, the spider software can walk through the result pages returned by a chosen search engine, extract the URLs following the string “<a href=”, and build a collection of URLs that becomes the root set. In this project, the size of the root set is limited to 200. Note that the Excite result pages also list URLs that are not relevant to the query. The spider program needs additional control to avoid following the URLs of advertisements and other inappropriate URLs that may be returned.

4.2.2 Extending the Root Set Into the Base Set

The base set is an expansion of the root set obtained by crawling the Web pages returned by a text-based search engine. The spider program visits each page in the root set to extract all the hyperlinks it points to. The mechanism for finding new URLs referenced by a page in the root set is similar to that for collecting the URLs from the result pages returned by the chosen search engine.

The logical procedure used to build the base set is as follows:

Find_S_σ(σ, E, n)

    S_σ: the base set.
    σ: a query string.
    E: a specified text-based search engine.
    n: a natural number (the size of the root set).

    Input σ as the keyword into search engine E.
    Let R_σ denote the root set.
    R_σ := the top n results (the highest-ranked pages) returned by E.
    Set S_σ := R_σ.
    Walk through each Web page within R_σ by applying a spider program.
    For each page p ∈ R_σ
        Let Γ(p) denote the set of all pages p points to.
        Add all pages in Γ(p) to S_σ.
    End
    Return S_σ.
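The procedure above translates almost line by line into code. In this sketch the search engine and the spider are abstracted away: `rankedResults` stands for the list returned by E, and `outLinks` stands for whatever function returns the set of pages a given page points to. Both parameters and the class name `BaseSetBuilder` are assumptions for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of the root-set-to-base-set expansion.
final class BaseSetBuilder {
    static Set<String> findBaseSet(List<String> rankedResults, int n,
                                   Function<String, Set<String>> outLinks) {
        // Root set: the top n results returned by the search engine.
        Set<String> root = new LinkedHashSet<>(
            rankedResults.subList(0, Math.min(n, rankedResults.size())));
        // Base set starts as a copy of the root set.
        Set<String> base = new LinkedHashSet<>(root);
        // Add every page pointed to by a page in the root set.
        for (String page : root) {
            base.addAll(outLinks.apply(page));
        }
        return base;
    }
}
```

A `LinkedHashSet` is used so that duplicates are dropped while the original ranking order of the root pages is preserved, which keeps the old rank recoverable from the position in the set.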

4.2.3 Analyzing hyperlinks

The ranking system rearranges the URLs in the root set by using the hyperlink information for all the Web pages in the base set. Hyperlinks are analyzed in two passes. The first pass occurs while the spider builds the base set. In the second pass, the spider crawls all the newly found Web pages in the base set in order to perform the hyperlink analysis used for evaluating the rank score.

A Web page can link to many other pages, which may in turn reference that Web page. When the spider crawls each Web page in the root set, it not only extracts the URLs in the page it is visiting but also registers the newly found URLs as outward links of that page. After the base set has been constructed, the spider walks through all the newly found Web pages, extracts the URLs of each visited page, and compares them with all existing URLs. If an outward link points to a URL that is in the base set, that URL is registered as an outward link of the visited page, and the visited page is registered as an inward link of that URL. In this pass, the spider only considers URLs that already exist in the base set. After walking through all the URLs in the base set, the hyperlinks among Web pages are recorded and saved for use in the next step.

4.2.4 Calculate the Rank Score

First, the type of link between each pair of nodes must be identified. Link types have been studied in [2]. Hypertext links within a Web site can be upward in the file hierarchy, downward, or crosswise. Only the links that point to other sites are referred to as outward links. In this project, every link is classified as either transverse or intrinsic:

• a link is transverse if it is between pages with different domain names;

• a link is intrinsic if it is between pages with the same domain name.

The “domain name” here means the first level in the URL string associated with a page. Thus, upward, downward, and crosswise links all fall into the same category: intrinsic links. Since intrinsic links very often exist purely to allow navigation of the infrastructure of a site, they carry much less information than outward (transverse) links, which convey information on the authority of the pages they point to. When we count the in-degree and out-degree of a Web page, intrinsic links should not be taken into account. Only the edges corresponding to transverse links are kept in S_σ; this results in a new graph G_σ.
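The transverse/intrinsic test can be sketched as a comparison of host names. This sketch assumes that “the first level in the URL string” corresponds to the host part of the URL (for example `java.sun.com`); the class `LinkType` and its methods are hypothetical names.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical classifier for transverse versus intrinsic links.
final class LinkType {
    // Returns the host part of a URL, lower-cased; empty if the URL has none.
    static String domainOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    // A link is transverse when source and target have different domain names.
    static boolean isTransverse(String fromUrl, String toUrl) {
        return !domainOf(fromUrl).equals(domainOf(toUrl));
    }

    public static void main(String[] args) {
        System.out.println(isTransverse("http://java.sun.com/", "http://www.javaworld.com/"));
        System.out.println(isTransverse("http://java.sun.com/docs/books/tutorial/",
                                        "http://java.sun.com/javaone/"));
    }
}
```

Only links for which `isTransverse` returns true would be kept as edges of G_σ. Note that `URI.create` throws an `IllegalArgumentException` for malformed URLs, so a real spider would need to guard against that.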

Other issues also required attention. In some cases, a large number of pages from a single domain all point to a single page p. This often corresponds to a mass endorsement, an advertisement, or some other type of “collusion” among the referring pages. Intuitively, these links do not confer authority and should not be contained in the new graph G_σ. To avoid the problem, we fix a parameter m, typically between 4 and 8, and allow at most m pages from a single domain to point to any given page p. This control was employed in this project and was shown to be an effective solution in most cases.
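The per-domain cap can be sketched as a filter over the inward links of one page. The report does not say which m links are kept when a domain exceeds the limit; this sketch assumes the first m encountered are kept. The class `CollusionFilter` and the method `capPerDomain` are hypothetical names.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical filter: keep at most m inward links per referring domain.
final class CollusionFilter {
    static List<String> capPerDomain(List<String> inLinks, int m) {
        Map<String, Integer> seenPerDomain = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : inLinks) {
            String host = URI.create(url).getHost();
            String domain = host == null ? "" : host;
            // merge() returns the updated count for this domain.
            int count = seenPerDomain.merge(domain, 1, Integer::sum);
            if (count <= m) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Applying this filter to the inward-link list of every page p before the weights are computed limits the influence any single domain can have on x(p).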

The steps performed so far produce a small graph G_σ that has many relevant pages and strong authorities. These authorities both belong to G_σ and are heavily referenced by pages within G_σ. It is now possible to extract, from the overall collection of pages G_σ, the authorities that are most likely to answer the query.

In order to find the hubs and authorities in the Web page collection G_σ with respect to the query string σ, the sets of weights {x(p)} and {y(p)} are represented as vectors x and y, respectively, and the following procedure is used:

Iterate(G_σ, k)

    G_σ: a collection of n linked pages
    k: a natural number (the number of iterations)

    Let z denote the vector (1, 1, 1, …, 1) ∈ R^n
    Set x_0 := z
    Set y_0 := z
    For i = 1, 2, …, k
        Apply the x(p) = Σ y(q) operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i
        Apply the y(p) = Σ x(q) operation to (x'_i, y_{i-1}), obtaining new y-weights y'_i
        Normalize x'_i, obtaining x_i
        Normalize y'_i, obtaining y_i
    End
    Return (x_k, y_k).
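A compact sketch of the Iterate procedure is given below. It assumes the graph is supplied as a boolean adjacency matrix in which `adjacency[i][j]` is true when page i links to page j; the class `Hits` and this representation are choices made for the example, not the project's actual data structures.

```java
import java.util.Arrays;

// Hypothetical implementation of Iterate(G, k) on an adjacency matrix.
final class Hits {
    // Returns {x, y}: authority weights x and hub weights y after k iterations.
    static double[][] iterate(boolean[][] adjacency, int k) {
        int n = adjacency.length;
        double[] x = new double[n];
        double[] y = new double[n];
        Arrays.fill(x, 1.0);   // x_0 := (1, ..., 1)
        Arrays.fill(y, 1.0);   // y_0 := (1, ..., 1)
        for (int step = 0; step < k; step++) {
            // x'(p) = sum of y(q) over all q that link to p.
            double[] newX = new double[n];
            for (int q = 0; q < n; q++)
                for (int p = 0; p < n; p++)
                    if (adjacency[q][p]) newX[p] += y[q];
            // y'(p) = sum of x'(q) over all q that p links to.
            double[] newY = new double[n];
            for (int p = 0; p < n; p++)
                for (int q = 0; q < n; q++)
                    if (adjacency[p][q]) newY[p] += newX[q];
            x = normalize(newX);
            y = normalize(newY);
        }
        return new double[][] { x, y };
    }

    // Scales v so that the sum of squares is 1; an all-zero vector is returned as is.
    static double[] normalize(double[] v) {
        double sumSq = 0.0;
        for (double e : v) sumSq += e * e;
        if (sumSq == 0.0) return v;
        double norm = Math.sqrt(sumSq);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }
}
```

As in the procedure, the hub update uses the freshly computed, not yet normalized authority weights x'. For a graph in which pages 0 and 1 both link to page 2, the authority weight concentrates on page 2 and the hub weight is shared equally by pages 0 and 1.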

This procedure can be used to filter out the top c authorities and top c hubs in the following simple way (c is the number of URLs in the returned list):

Filter(G_σ, k, c)

    G_σ: a collection of n linked pages
    k, c: natural numbers

    (x_k, y_k) := Iterate(G_σ, k)
    Report the pages with the c largest coordinates in x_k as authorities
    Report the pages with the c largest coordinates in y_k as hubs

Typically, the value of c is between 5 and 10. To choose the number of iterations k, we applied Iterate( ) with arbitrarily large values of k and found the convergence of Iterate( ) to be quite rapid; k = 20 is sufficient for both the authority weights and the hub weights to converge to fixed numbers.
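The selection step of Filter can be sketched as a sort of page indices by weight. The class `TopPages` and the method `topC` are hypothetical names; ties are resolved in favor of the lower index, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: indices of the c largest weights, largest first.
final class TopPages {
    static List<Integer> topC(double[] weights, int c) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < weights.length; i++) {
            indices.add(i);
        }
        // Sort page indices by descending weight (the sort is stable).
        indices.sort(Comparator.comparingDouble((Integer i) -> weights[i]).reversed());
        return new ArrayList<>(indices.subList(0, Math.min(c, indices.size())));
    }
}
```

Calling `topC` once on the authority vector and once on the hub vector yields the two reported lists.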

4.3 Examples of Convergence

Tables 4.1a, 4.1b, 4.2a, and 4.2b show the relationship between the weight values and the number of iterations. From the data, we can see that both the authority weights and the hub weights converged quickly, and 20 iterations were sufficient to obtain values very close to the final values. Tables 4.1a and 4.1b list the top 10 URLs returned by the search engine Excite for the query “Java”: Table 4.1a gives the authority weights and Table 4.1b gives the hub weights. Tables 4.2a and 4.2b show the change in the authority and hub weight values for the top 10 URLs returned by the search engine Alta Vista for the query “camera”: Table 4.2a gives the authority weights and Table 4.2b gives the hub weights.

Table 4.1 (a) Authority weights of returned URLs from Excite.

| URL | Iteration 1 | Iteration 20 | Iteration 200 |
|---|---|---|---|
| http://java.sun.com/ | 52 | 1.13 | 1.13 |
| http://www.zdnet.com/devhead/filters/java/ | 1 | 4.30E-55 | 0 |
| http://metalab.unc.edu/javafaq/ | 9 | 0.45 | 0.45 |
| http://javaboutique.internet.com/ | 30 | 3.04 | 3.04 |
| http://www.javareport.com/ | 37 | 3.35 | 3.35 |
| http://java.wiwi.uni-frankfurt.de/ | 9 | 0.072 | 0.072 |
| http://www.javalobby.org/ | 18 | 0.078 | 0.078 |
| http://caffeinebuzz.com/ | 2 | 0.098 | 0.098 |
| http://freewarejava.com/ | 5 | 0.36 | 0.36 |
| http://www.gamelan.com/ | 106 | 6.35 | 6.35 |

Table 4.1 (b) Hub weights of returned URLs from Excite.

| URL | Iteration 1 | Iteration 20 | Iteration 200 |
|---|---|---|---|
| http://java.sun.com/ | 25 | 3.13 | 3.13 |
| http://www.zdnet.com/devhead/filters/java/ | 23 | 6.52E-26 | 0 |
| http://metalab.unc.edu/javafaq/ | 53 | 2.18 | 2.18 |
| http://javaboutique.internet.com/ | 17 | 1.17E-27 | 0 |
| http://www.javareport.com/ | 18 | 2.06 | 2.06 |
| http://java.wiwi.uni-frankfurt.de/ | 27 | 4.17 | 4.17 |
| http://www.javalobby.org/ | 89 | 25.72 | 25.72 |
| http://caffeinebuzz.com/ | 13 | 7.11 | 7.11 |
| http://freewarejava.com/ | 15 | 3.08 | 3.08 |
| http://www.gamelan.com/ | 23 | 1.04 | 1.04 |

Table 4.2 (a) Authority weights of returned URLs from Alta Vista.

| URL | Iteration 1 | Iteration 20 | Iteration 200 |
|---|---|---|---|
| http://testo.camera.it/ | 1 | 8.00E-37 | 0 |
| http://www.bouldernews.com/ | 2 | 5.35E-25 | 0 |
| http://www.dcn.com/ | 2 | 5.35E-25 | 0 |
| http://www2.famille.ne.jp/~bud1_bis/camera/index_e.html | 7 | 3.21E-24 | 0 |
| http://www.fcw.com/civic/articles/com-camera.asp | 1 | 8.06E-37 | 0 |
| http://www.dcresource.com/ | 1 | 8.06E-37 | 0 |
| http://www.camera-net.com/ | 1 | 8.06E-37 | 0 |
| http://www.pie.camcom.it/camera-arbitrale/index.html | 4 | 0.26 | 0.26 |
| http://www.camera.it/index.asp | 1 | 8.06E-37 | 0 |
| http://www.edromney.com/ | 1 | 8.06E-37 | 0 |

Table 4.2 (b) Hub weights of returned URLs from Alta Vista.

| URL | Iteration 1 | Iteration 20 | Iteration 200 |
|---|---|---|---|
| http://testo.camera.it/ | 5 | 2.66E-18 | 0 |
| http://www.bouldernews.com/ | 20 | 1.45 | 1.45 |
| http://www.dcn.com/ | 1 | 4.04E-25 | 0 |
| http://www2.famille.ne.jp/~bud1_bis/camera/index_e.html | 8 | 1.47E-15 | 0 |
| http://www.fcw.com/civic/articles/com-camera.asp | 6 | 2.80E-17 | 0 |
| http://www.dcresource.com/ | 14 | 7.24E-12 | 0 |
| http://www.camera-net.com/ | 1 | 4.04E-25 | 0 |
| http://www.pie.camcom.it/camera-arbitrale/index.html | 5 | 1.11 | 1.11 |
| http://www.camera.it/index.asp | 1 | 4.04E-25 | 0 |
| http://www.edromney.com/ | 2 | 1.21E-17 | 0 |

Chapter 5 Results and Discussion

This chapter gives the experimental results of the new ranking system and compares its performance to that of the existing search engines Alta Vista and Excite. Several query strings were input to the ranking system through its interface. For each topic, the top 10 pages from Alta Vista or Excite were compared to the top 10 authorities returned by the ranking system. The comparison is divided into two sections. The first section compares the results according to a quality score for each Web page. The second section selects some Web pages as examples, analyzes their hyperlink structures, and discusses how the hyperlink information can help improve the search results.

5.1 Quality Comparison

To give a quantitative measure, a numerical value was used to evaluate the quality of each Web page. The quality score was assigned in terms of the page’s utility in providing information about the topic covered by the query. Two attributes were used to determine the quality scores in the project:

(1) Relevancy: If the content of a Web page has nothing to do with the query string, the URL’s quality score is 0. If it has some relationship with the query string, the quality score is 1, 2, or 3, representing a poor, good, or excellent relationship, respectively.

(2) Dead link: A URL that cannot be opened. If a URL is a dead link, its quality score is -1.

Hence, the quality of an individual Web page is represented as a numerical value ranging from -1 to 3. The quality scores of all the Web pages were obtained by reading the contents of each page. Clearly, the higher the quality score, the higher the quality of the Web page. Although quality judgments made by humans may not be well defined, the system will be used by humans searching the Web, so this is a reasonable method of assessing the results.

5.1.1 Comparison to Alta Vista

Tables 5.1, 5.3, and 5.5 list the quality scores of the top 10 URLs returned by Alta Vista for the keywords “computer job”, “Ballet”, and “Camera”. Tables 5.2, 5.4, and 5.6 list the quality scores of the new top 10 URLs after applying the ranking system.

In order to evaluate the authority pages returned by the ranking system quantitatively, we summed the quality scores of all the URLs in each table and computed the percentage of good/excellent Web pages in the top 10 URLs.
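The two summary measures are simple to compute. The sketch below assumes that “good/excellent” means a quality score of 2 or 3, following the scale defined in Section 5.1; the class `QualitySummary` is a hypothetical name, and the sample scores in `main` are those listed for Alta Vista with the query “Computer job” (Table 5.1).

```java
// Hypothetical helper for the two summary measures used in the comparison.
final class QualitySummary {
    // Sum of the quality scores (-1 to 3) of a result list.
    static int totalScore(int[] scores) {
        int total = 0;
        for (int s : scores) total += s;
        return total;
    }

    // Percentage of pages rated good (2) or excellent (3).
    static double percentGoodOrExcellent(int[] scores) {
        int count = 0;
        for (int s : scores) {
            if (s >= 2) count++;
        }
        return 100.0 * count / scores.length;
    }

    public static void main(String[] args) {
        int[] altaVistaComputerJob = {2, -1, 2, 0, 1, -1, 3, 2, 1, -1};
        System.out.println(totalScore(altaVistaComputerJob));             // prints 8
        System.out.println(percentGoodOrExcellent(altaVistaComputerJob)); // prints 40.0
    }
}
```

Because dead links contribute -1, the total score penalizes them, while the percentage measure simply does not count them as good or excellent.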

Figure 5.1 shows the sum of quality scores as a bar graph, with the x axis listing the query strings and the y axis representing the sum of quality scores.

Figure 5.2 shows the percentage of good/excellent pages as a bar graph, with the x axis listing the query strings and the y axis representing the percentage of good/excellent Web pages. The data listed in the tables and bar graphs show the disadvantages of the current text-based search engines and the improvement in performance after applying the ranking system.

Table 5.1 The top 10 URLs given by Alta Vista with query string “Computer job”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.jobbankusa.com/www.html | 2 | Good |
| 2 | http://www.dalmanjobs.com/ | -1 | Dead link |
| 3 | http://www.jobs.com/ | 2 | Good |
| 4 | http://www.iteachnet.com/wwwboard/wwwboard.html | 0 | Non-relevant |
| 5 | http://www.chisoft.com/ | 1 | Poor |
| 6 | http://members.home.net/oacjlp/cfall1.htm | -1 | Dead link |
| 7 | http://www.siliconalleyjobs.com/ | 3 | Excellent |
| 8 | http://www.n-s-i.net/fact.html | 2 | Good |
| 9 | http://hrea.org/lists/huridocs-tech/markup/msg00250.html | 1 | Poor |
| 10 | http://www.churchfriends.com/board/biz/messages/7.html | -1 | Dead link |

Total quality scores: 8

Table 5.2 The top 10 URLs from the ranking system with query string “Computer job”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.itcomputerjobsearch.com/ | 2 | Good |
| 2 | http://www.athomebusinessportal.com/ | 2 | Good |
| 3 | http://www.stayathomework.com/ | 2 | Good |
| 4 | http://www.daytraderstocktrader.com/ | 1 | Poor |
| 5 | http://houston.computerwork.com/ | 3 | Excellent |
| 6 | http://bayarea.computerwork.com/ | 3 | Excellent |
| 7 | http://philadelphia.computerwork.com/ | 3 | Excellent |
| 8 | http://twincities.computerwork.com/ | 3 | Excellent |
| 9 | http://www.virtualbusinesswebhost.com/ | 0 | Non-relevant |
| 10 | http://www.joltjobs.com/ | 3 | Excellent |

Total quality scores: 22

Table 5.3 The top 10 URLs given by Alta Vista with query string “Ballet”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.ballet.co.uk/ | 2 | Good |
| 2 | http://www.ballet.org.uk/ | 2 | Good |
| 3 | http://www.pbt.org/ | 2 | Good |
| 4 | http://www.royal-ballet-school.org.uk/ | 2 | Good |
| 5 | http://www.nzballet.org.nz/ | 2 | Good |
| 6 | http://www.danvilleballet.org/ | 2 | Good |
| 7 | http://www.westchesterballet.org/ | 2 | Good |
| 8 | http://www.nycballet.com/ | 2 | Good |
| 9 | http://w3.one.net/~ballet/ | 0 | Non-relevant |
| 10 | http://www.theshopatvcb.com/ | 1 | Poor |

Total quality scores: 17

Table 5.4 The top 10 URLs from the ranking system with query string “Ballet”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.kirovballet.com/ | 2 | Good |
| 2 | http://www.joffrey.com/ | 2 | Good |
| 3 | http://www.het-nationale-ballet.nl/ | 2 | Good |
| 4 | http://www.balletwest.org/ | 2 | Good |
| 5 | http://www.en-ballet.co.uk/ | 2 | Good |
| 6 | http://www.hamburgballett.de/ | 2 | Good |
| 7 | http://www.coloradoballet.org/ | 2 | Good |
| 8 | http://www.cincinnatiballet.com/ | 2 | Good |
| 9 | http://www.koninklijkballetvanvlaanderen.be/ | 2 | Good |
| 10 | http://www.balletaustin.org/ | 2 | Good |

Total quality scores: 20

Table 5.5 The top 10 URLs given by Alta Vista with query string “Camera”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.camerasphere.com/ | 2 | Good |
| 2 | http://www.mi.camcom.it/ | 0 | Non-relevant |
| 3 | http://testo.camera.it/ | 0 | Non-relevant |
| 4 | http://www.mpegcam.net/ | 1 | Poor |
| 5 | http://www.bouldernews.com/ | 2 | Good |
| 6 | http://www.dcn.com/ | 0 | Non-relevant |
| 7 | http://www2.famille.ne.jp/~bud1_bis/camera/index_e.html | 1 | Poor |
| 8 | http://www.fcw.com/civic/articles/com-camera.asp | 1 | Poor |
| 9 | http://www.ckcpower.com/camerabags.htm | 2 | Good |
| 10 | http://www.geocities.com/~ffrog/campath.html | 0 | Non-relevant |

Total quality scores: 9

Table 5.6 The top 10 URLs from the ranking system with query string “Camera”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.buffzone.com/ | 2 | Good |
| 2 | http://www.cameraarts.com/ | 2 | Good |
| 3 | http://www.dcresource.com/ | 3 | Excellent |
| 4 | http://www.acecam.com/ | 3 | Excellent |
| 5 | http://www.samsungcamera.com/ | 2 | Good |
| 6 | http://www.digital-cameras.com/ | 2 | Good |
| 7 | http://www.cameraworld.com/ | 3 | Excellent |
| 8 | http://www.glazerscamera.com/welcome.htm | 3 | Excellent |
| 9 | http://www.wolfcamera.com/ | 3 | Excellent |
| 10 | http://www.pie.camcom.it/camera-arbitrale/index.html | 1 | Poor |

Total quality scores: 24

Figure 5.1 Quality comparison. [Bar graph: x axis, query strings (Computer job, Ballet, Camera); y axis, quality scores (0 to 25); series: Alta Vista, New Rank.]

Figure 5.2 Good and excellent comparison. [Bar graph: x axis, query strings (Computer job, Ballet, Camera); y axis, percentage(%) of good & excellent URLs (0 to 1); series: Alta Vista, New Rank.]

5.1.2 Comparison to Excite

The scoring function employed by Excite does not rely purely on textual information. Excite is one of the search engines that pay the most attention to the structure of WWW pages. Hence, dead links and non-relevant pages are not included in the top 10 URLs returned by Excite, and the ranking system has only a limited effect on the results returned by the search engine.

Tables 5.7 and 5.9 list the quality scores of the top 10 URLs returned by Excite for the keywords “Internet” and “Java”. Tables 5.8 and 5.10 list the quality scores of the new top 10 URLs after applying the ranking system.

Figure 5.3 shows the sum of the quality scores as a bar graph, with the x axis listing the query strings and the y axis representing the sum of the quality scores.

Figure 5.4 shows the percentage of good/excellent pages as a bar graph, with the x axis listing the query strings and the y axis representing the percentage of good/excellent Web pages.

From the results, it is difficult to come to a definite conclusion as to which is better. The quality of the URLs given by Excite and that of the URLs ranked by the ranking system are comparable. This is consistent with the subjective impression that Excite pays the most attention to the structure of WWW pages.

Table 5.7 The top 10 URLs given by Excite with query string “Internet”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.leapday.demon.nl/Introduc.htm | 2 | Good |
| 2 | http://www.currents.net/resources/ispsearch/intquest.html | 2 | Good |
| 3 | http://www.viainter.net/ | 1 | Poor |
| 4 | http://www.linkexchange.com/ | 2 | Good |
| 5 | http://www.microsoft.com/ | 2 | Good |
| 6 | http://www.windows95.com/ | 1 | Poor |
| 7 | http://www.ipl.org/ | 2 | Good |
| 8 | http://www.ibill.com/ | 1 | Poor |
| 9 | http://www.mckinley.com/ | 2 | Good |
| 10 | http://www.thesaurus.com/ | 2 | Good |

Total quality scores: 17

Table 5.8 The top 10 URLs from the ranking system with query string “Internet”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.internetnews.com/ | 2 | Good |
| 2 | http://www.internet.com/ | 3 | Excellent |
| 3 | http://dart.fine-art.com/ | 1 | Poor |
| 4 | http://netserf.cua.edu/ | 2 | Good |
| 5 | http://www.isoc.org/ | 3 | Excellent |
| 6 | http://www.ietf.org/ | 2 | Good |
| 7 | http://argos.evansville.edu/ | 2 | Good |
| 8 | http://www.alexa.com/ | 2 | Good |
| 9 | http://www.microsoft.com/ | 2 | Good |
| 10 | http://www.demon.net/ | 3 | Excellent |

Total quality scores: 22

Table 5.9 The top 10 URLs given by Excite with query string “Java”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://java.sun.com/ | 3 | Excellent |
| 2 | http://www.zdnet.com/devhead/filters/java/ | 3 | Excellent |
| 3 | http://metalab.unc.edu/javafaq/ | 3 | Excellent |
| 4 | http://javaboutique.internet.com/ | 3 | Excellent |
| 5 | http://www.javareport.com/ | 3 | Excellent |
| 6 | http://java.wiwi.uni-frankfurt.de/ | 1 | Poor |
| 7 | http://www.javalobby.org/ | 2 | Good |
| 8 | http://caffeinebuzz.com/ | 1 | Poor |
| 9 | http://freewarejava.com/ | 3 | Excellent |
| 10 | http://www.gamelan.com/ | 2 | Good |

Total quality scores: 24

Table 5.10 The top 10 URLs from the ranking system with query string “Java”.

| Rank | URL | Quality Score | Status |
|---|---|---|---|
| 1 | http://www.javahow.to/ | 3 | Excellent |
| 2 | http://java.sun.com/docs/books/tutorial/ | 2 | Good |
| 3 | http://www.jars.com/ | 3 | Excellent |
| 4 | http://www.developer.com/directories/pages/dir.java.html | 2 | Good |
| 5 | http://www.yahoo.com/Computers_and_Internet/Programming_Languages/Java/ | 3 | Excellent |
| 6 | http://www.objectspace.com/jgl/ | 1 | Poor |
| 7 | http://www.javaworld.com/ | 3 | Excellent |
| 8 | http://java.sun.com/javaone/ | 2 | Good |
| 9 | http://www.afu.com/javafaq.html | 2 | Good |
| 10 | http://java.sun.com/faqIndex.html | 2 | Good |

Total quality scores: 23

Figure 5.3 Quality comparison. [Bar graph: x axis, query strings (Internet, Java); y axis, quality scores (0 to 25); series: Excite, New Rank.]

Figure 5.4 Good and excellent comparison. [Bar graph: x axis, query strings (Internet, Java); y axis, percentage(%) of good & excellent URLs (0 to 1); series: Excite, New Rank.]

5.2 Analysis of the Results

In this section, we study the contents of Web pages and analyze the hyperlink information provided by the ranking system. This analysis helps us to understand the advantages of using hyperlink information. Since Alta Vista is a purely text-based search engine, we chose to discuss the results of Alta Vista before and after applying the proposed system.

5.2.1 Removal of Poor Quality Web Pages

Table 5.11 lists the top 10 URLs returned by Alta Vista. Table 5.12 lists the top 10 URLs after applying the ranking system. All the returned Web pages were search results for the query string “computer job”. The authority weights listed in the two tables help explain why the rank of these URLs changed considerably after applying the ranking system. Visiting the 4th URL in Table 5.11, www.iteachnet.com/wwwboard/wwwboard.html, makes it easy to see why this URL was listed in the 4th position. Fig. 5.5 gives a view of the Web page through a Web browser. Reading this Web page carefully, one quickly discovers the reason why this URL appeared in a high position in Table 5.11: the excessive repetition of one of the keywords, “job”. Excessive repetition of one or more keywords is a simple way in which the author of a Web page can attempt to influence or trick a search engine. Fig. 5.5 shows a good example of a page that has some relationship with the query string but is a poor-quality result. Such pages do appear in the data sets, and this is as it should be, because the search engines’ responses to them are of interest. In the body of the Web page www.iteachnet.com/wwwboard/wwwboard.html, the word “job” was repeated more than 100 times. This method of “cheating” always works if the search engine is text-based only. This is because a text-based search engine suffers from an intrinsic weakness: it does not take into account the Web structure of which the Web object is a part. The problem with text-based search engines is that they look at a Web object and evaluate it as though it were just a piece of text.

Figure 5.5 Web page contents of www.iteachnet.com/wwwboard/wwwboard.html.

The power of the ranking system is that it makes use of the information provided

by hyperlinks. By carefully analyzing the Web structure, the capability of a Web page to

redirect the information flow via hyperlinks can be evaluated. The approach of

calculating the authority weights and hub weights used in the ranking system enables the

search engine to obtain more precise information about a Web page. By adding the Web

structure analysis, the rank system can rank search results according to not only the text

information included in a Web page, but also the potential ability of a user to gain further

relevant information with a browser, i.e., how much information one can obtain using this

page as a starting point from which to navigate the Web and how much information one

can explore by navigating to it from other Web pages. It is just this additional

information, gained from the hyperlinks, that enables the ranking system to overcome the

big problem of “search engine persuasion” (tuning pages so to cheat a search engine, in

order to make it award the page a higher rank).

The 4th URL in Table 5.11 gained a low authority value during the ranking

procedure, and was removed from the top 10 URLs after applying the ranking system.

The improvement is obvious. Although the keyword “job” was repeated many times in

the text body of Web page www.iteachnet.com/wwwboard/wwwboard.html, this method

of cheating failed to mislead the ranking system, and the page moved to a lower position

because of its low authority weight.

5.2.2 Removal of Non-relevant Web Pages

Table 5.13 lists the top 10 URLs searched by Alta Vista, and Table 5.14 lists the

top 10 URLs after ranking. In this case, the user query string is “ballet”. Fig. 5.6 shows a

Web page, which is 9th on the list in Table 5.13. A reading of this page shows that it clearly fails

to provide any information about ballet. It is a non-relevant URL. It was

listed in a higher position because of its title, “Index of /~ballet”. Although the content

of this Web page has nothing to do with ballet and the author might have no intention to

trick a search engine, the word “ballet” embedded in its title happened to be the keyword

and resulted in a higher position in the list. However, by analyzing hyperlink information,

the ranking system successfully detected this non-relevant page, which had neither an out

page nor an in page, and placed it in a lower position (98th) in the new list.
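
The check for a page with neither an in page nor an out page can be illustrated with a small sketch. It assumes the link structure is available as a list of (source, target) URL pairs. The function name `isolated_pages` and the sample edge are hypothetical and are not taken from the project's implementation.

```python
from collections import defaultdict

def isolated_pages(pages, links):
    """Return the pages that have neither an in page nor an out page.

    pages: iterable of URLs in the result set.
    links: iterable of (source, target) pairs, one per hyperlink.
    """
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in links:
        out_degree[source] += 1
        in_degree[target] += 1
    return [p for p in pages if in_degree[p] == 0 and out_degree[p] == 0]

# Illustrative data only: the edge below is invented for the example.
pages = [
    "http://www.ballet.co.uk/",
    "http://www.nycballet.com/",
    "http://w3.one.net/~ballet/",
]
links = [
    ("http://www.ballet.co.uk/", "http://www.nycballet.com/"),
]
print(isolated_pages(pages, links))
```

A page flagged in this way receives no weight from any hub and passes none to any authority, so a link-based ranking moves it down regardless of the text in its title.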

5.2.3 Relationship Between Hubs and Authorities

So far, the comparison between the results given by Alta Vista and the new results

ranked by the ranking system has been based on authority values. In order to test the

validity of the contention that there is a mutually reinforcing relationship, i.e. that a good

hub is a page that points to many good authorities and a good authority is a page that is

pointed to by many good hubs, we analyzed the hyperlinks between authority and hub

pages of the rank results in this section.

Table 5.15 lists the old rank and new rank of the top 5 authorities and top 5 hubs

and their weights. The query string is “ballet”. It is noticeable that all the authority

weights are very close. However, the 1st hub’s weight is dramatically higher than the

weights of all the other hubs. Fig. 5.7 shows the content of this Web page. It consists of a

Table 5.11 The top 10 URLs searched by Alta Vista and their authority weights for the

query string “Computer job”.

Rank  URL                                                       Authority
1     http://www.jobbankusa.com/www.html                        0.0065
2     http://www.dalmanjobs.com/                                0
3     http://www.jobs.com/                                      0
4     http://www.iteachnet.com/wwwboard/wwwboard.html           0.0032
5     http://www.chisoft.com/                                   0
6     http://members.home.net/oacjlp/cfall1.htm                 0
7     http://www.siliconalleyjobs.com/                          0.016
8     http://www.n-s-i.net/fact.html                            0
9     http://hrea.org/lists/huridocs-tech/markup/msg00250.html  0
10    http://www.churchfriends.com/board/biz/messages/7.html    0

Table 5.12 The top 10 URLs from the ranking system and their old ranks from Alta

Vista for the query string “Computer job”.

New rank  URL                                     Authority  Old rank
1         http://www.itcomputerjobsearch.com/     484.9      128
2         http://www.athomebusinessportal.com/    364.3      114
3         http://www.stayathomework.com/          129.4      113
4         http://www.daytraderstocktrader.com/    17.6       28
5         http://houston.computerwork.com/        5.1        36
6         http://bayarea.computerwork.com/        5.1        39
7         http://philadelphia.computerwork.com/   5.1        46
8         http://twincities.computerwork.com/     5.1        56
9         http://www.virtualbusinesswebhost.com/  2.3        18
10        http://www.joltjobs.com/                0.22       124

Figure 5.6 Web page contents of http://w3.one.net/~ballet/.

Table 5.13 The top 10 URLs searched by Alta Vista with query “ballet”.

Position  URL
1         http://www.ballet.co.uk/
2         http://www.ballet.org.uk/
3         http://www.pbt.org/
4         http://www.royal-ballet-school.org.uk/
5         http://www.nzballet.org.nz/
6         http://www.danvilleballet.org/
7         http://www.westchesterballet.org/
8         http://www.nycballet.com/
9         http://w3.one.net/~ballet/
10        http://www.theshopatvcb.com/

Table 5.14 The top 10 authorities ranked by the ranking system with query “ballet”.

Position  URL
1         http://www.kirovballet.com/
2         http://www.joffrey.com/
3         http://www.het-nationale-ballet.nl/
4         http://www.balletwest.org/
5         http://www.en-ballet.co.uk/
6         http://www.hamburgballett.de/
7         http://www.coloradoballet.org/
8         http://www.cincinnatiballet.com/
9         http://www.koninklijkballetvanvlaanderen.be/
10        http://www.balletaustin.org/

Table 5.15 The top 5 authorities and top 5 hubs and their ranks and weights for the

query “ballet”.

URL                                                 Type       New rank  Old rank  Weight
http://www.kirovballet.com/                         Authority  1         76        71.95
http://www.joffrey.com/                             Authority  2         81        70.82
http://www.het-nationale-ballet.nl/                 Authority  3         79        65.4
http://www.balletwest.org/                          Authority  4         35        62.67
http://www.en-ballet.co.uk/                         Authority  5         45        61.79
http://www.dancer.com/dance-links/ballet.htm        Hub        1         32        61.6
http://www.sccs.swarthmore.edu/~mack/ballet.html    Hub        2         108       9.2
http://www.edanz.com/ballet/                        Hub        3         60        9.16
http://www.sapphireswan.com/dance/links/ballet.htm  Hub        4         86        7.22
http://www.art4net.com/BALLET.html                  Hub        5         104       4.09

Figure 5.7 Web page contents of http://www.dancer.com/dance-links/ballet.htm.

collection of links that point to other ballet company Web sites from A to Z.

Thus, it has a large number of out links. By analyzing the in pages of authorities and the

out pages of hubs in Table 5.15 carefully, we constructed a directed graph which could

help us gain a good understanding about how these URLs could be ranked in a high

position by the ranking system. Fig. 5.8 illustrates the relationship between authorities

and hubs in Table 5.15. In Fig. 5.8, a URL is represented by a circle, while the number in

the circle is its rank. All the circles on the left side are hubs and all the circles on the right

side are authorities. Note that Fig. 5.8 shows only the edges among these URLs. Each

hub may have outward links to Web pages that are not shown in the graph,

and each authority may have inward links from Web pages that are not

shown in the graph.

By reviewing the operations used to calculate the authority weight and hub weight

shown in Fig. 4.2, we could predict that the 1st hub would gain a large weight at

the first iteration since it pointed to all the ballet company Web sites from A to Z. Then its

high value would influence the authority weight of all the Web pages it pointed to. If

these Web pages were also pointed to by other good hubs, their high authority weight

would increase the hub weight of the Web pages which pointed to them in the next

iteration, and the reinforcement was repeated during further iterations. Finally, both the

hub weights and authority weights converged to fixed values. The ranking system then

ranked all the URLs in a new sequence based on either authority or hub weight.
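
The reinforcement described in this paragraph can be sketched in a few lines. The sketch assumes the usual form of the two operations: a page's authority weight is the sum of the hub weights of the pages that point to it, and its hub weight is the sum of the authority weights of the pages it points to. The normalization step, the function name `hub_authority_weights`, and the toy graph are assumptions made for illustration; they are not the project's code, and the scaling differs from the weights reported in the tables.

```python
import math

def hub_authority_weights(links, iterations=50):
    """Iterate the mutually reinforcing hub/authority updates.

    links: list of (source, target) pairs, one per hyperlink.
    Returns (authority, hub) dictionaries keyed by page.
    """
    pages = sorted({p for edge in links for p in edge})
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub weights of the pages that point to p.
        new_authority = {p: 0.0 for p in pages}
        for source, target in links:
            new_authority[target] += hub[source]
        # Hub update: sum the new authority weights of the pages p points to.
        new_hub = {p: 0.0 for p in pages}
        for source, target in links:
            new_hub[source] += new_authority[target]
        # Normalize so the squared weights sum to 1 (an assumed scaling).
        a_norm = math.sqrt(sum(v * v for v in new_authority.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        authority = {p: v / a_norm for p, v in new_authority.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return authority, hub

# Invented toy graph: H1 links to every authority, H2 and H3 to only some.
links = [
    ("H1", "A1"), ("H1", "A2"), ("H1", "A3"),
    ("H2", "A1"), ("H2", "A2"),
    ("H3", "A1"),
]
authority, hub = hub_authority_weights(links)
ranked_authorities = sorted(authority, key=authority.get, reverse=True)
ranked_hubs = sorted(hub, key=hub.get, reverse=True)
print(ranked_authorities[:3], ranked_hubs[:3])
```

In this toy graph the hub that links to every authority ends up with the largest hub weight, and the page pointed to by the most hubs ends up with the largest authority weight, which mirrors the behavior described for the 1st hub.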

Fig. 5.8 clearly shows that all of the top 5 authorities were pointed to by the 1st hub,

http://www.dancer.com/dance-links/ballet.htm. Since the 1st hub had a weight that was

dramatically higher than any other hub, it contributed more weight than any of the other

hubs when the authority weights were calculated. This may explain

why all of the top 5 authorities have weights close to the weight of the 1st hub. Hence, we

come to the conclusion that the main reason that the top 5 authorities moved from lower

positions in the old list to higher positions in the new list is that all of them were pointed

to by the 1st hub; consequently, any URL that was not pointed to by the 1st hub would

get a low authority weight and be prevented from occupying the top positions. The final

arrangement of these top 5 authorities was determined by the weights of the other hubs

that pointed to them.

Figure 5.8 The mutually reinforcing relationship between hubs and authorities.

[Directed graph: hubs on the left, authorities on the right; each circle is labeled with its rank.]

Hubs:
1  http://www.dancer.com/dance-links/ballet.htm
2  http://www.sccs.swarthmore.edu/~mack/ballet.html
3  http://www.edanz.com/ballet/
4  http://www.sapphireswan.com/dance/links/ballet.htm
5  http://www.art4net.com/BALLET.html
6  http://www-sci.lib.uci.edu/HSG/Ref4.html

Authorities:
1  http://www.kirovballet.com/
2  http://www.joffrey.com/
3  http://www.het-nationale-ballet.nl/
4  http://www.balletwest.org/
5  http://www.en-ballet.co.uk/

Chapter 6 Conclusions and Future Work

The new ranking algorithm applied in this project is a link-based approach to

WWW searches. It works on top of the existing text-based search engines and aims to

locate high-quality information related to a search topic on the World Wide Web, based

on a structural analysis of the link topology surrounding “authoritative” pages on the

query topic.

Our experiments used Alta Vista and Excite as anchors for the implementation of

the ranking system. Diverse query strings were entered as keywords to test the

improvement in the quality of the returned result made by applying the ranking system.

The ranking system's ability to produce results of the highest possible quality from the

available WWW pages was analyzed, and the comparison showed positive results.

For Alta Vista, which is a purely text-based search engine, the improvement

made by the ranking system is dramatic. The ranking system successfully deleted all the

dead links and most non-relevant WWW pages from the list returned. The percentage of

both good and excellent quality Web pages in the new list showed a major improvement.

For Excite, which is also a text-based search engine but which pays more

attention to Web page structure, the improvement made by the ranking system is more

limited. This is because most of the results returned by Excite are of better quality than

those returned by Alta Vista. However, the percentage of both good and excellent quality Web pages

in the new list still showed some improvement.

In this project, the ranking algorithm based on the linkage structure transcended

the limitations of traditional technology by exploring the structure of “communities” of

hubs and authorities on the WWW. Note that the iterative process of computing hub

weights and authority weights ignores the text describing the topics. In some cases

merely mining the linkage structure may not be good enough, so there is still room to

enhance the performance of the ranking system by improving the focus on the topic. In

particular, in an HTML file, the text around href links to a page p is descriptive of the

contents of p. It should be possible to introduce a new text-weighted process and

incorporate this textual conferral of authority into the basic iterative process described

previously. The idea is to assign to each link a positive numerical weight &(p, q) that

increases with the amount of topic-related text in the vicinity of the href from page p to

page q. The precise mechanism that can be used for this second weighting phase will be a

challenge for the future.
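
One possible shape for such a text-weighted process is sketched below, purely as an illustration. It assumes that the weight of a link is 1 plus the number of query-term occurrences in the text near the href, and that this weight scales the link's contribution in both the authority and the hub update. The function `link_weight`, its formula, and the sample snippets are hypothetical; the report leaves the precise mechanism open.

```python
import math

def link_weight(nearby_text, query_terms):
    """Hypothetical weight: 1 plus the number of query-term occurrences near the href."""
    words = nearby_text.lower().split()
    return 1.0 + sum(words.count(term.lower()) for term in query_terms)

def weighted_hub_authority(weighted_links, iterations=50):
    """Hub/authority iteration in which each link contributes in proportion to its weight.

    weighted_links: list of (source, target, weight) triples with weight > 0.
    """
    pages = sorted({p for s, t, _ in weighted_links for p in (s, t)})
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_authority = {p: 0.0 for p in pages}
        for source, target, weight in weighted_links:
            new_authority[target] += weight * hub[source]
        new_hub = {p: 0.0 for p in pages}
        for source, target, weight in weighted_links:
            new_hub[source] += weight * new_authority[target]
        # Normalize so the squared weights sum to 1 (an assumed scaling).
        a_norm = math.sqrt(sum(v * v for v in new_authority.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        authority = {p: v / a_norm for p, v in new_authority.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return authority, hub

# Invented example: one hub links to two pages; only the first link is
# surrounded by text that mentions the query term.
query_terms = ["ballet"]
raw_links = [
    ("hub", "page_a", "classical ballet company and ballet school"),
    ("hub", "page_b", "site map and contact details"),
]
weighted_links = [(s, t, link_weight(text, query_terms)) for s, t, text in raw_links]
authority, hub = weighted_hub_authority(weighted_links)
print(authority["page_a"] > authority["page_b"])
```

With every weight set to 1 the iteration reduces to the unweighted hub and authority computation, so the text weighting acts as an extension of the basic process and does not replace it.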
