A World Wide Web Search Engine Using Hyperlink Structure To Improve Web Searches
Hong Fan
Certificate of Approval:

Gerry V. Dozier, Assistant Professor, Computer Science and Software Engineering
Wenchen Hu, Chair, Assistant Professor, Computer Science and Software Engineering
David A. Umphress, Associate Professor, Computer Science and Software Engineering
John F. Pritchett, Dean, Graduate School
A World Wide Web Search Engine Using Hyperlink Structure To Improve Web Searches
Hong Fan
A Project Report
Submitted to
the Graduate Faculty of
Auburn University
in Partial Fulfillment of the
Requirements for the
Degree of
Master of Software Engineering
Auburn, Alabama
December 16, 2000
PROJECT REPORT ABSTRACT
A World Wide Web Search Engine Using Link Structure To Improve Web Searches
Hong Fan
Master of Software Engineering, December 16, 2000 (B.S., China Textile University, 1990)
71 Typed Pages
Directed by Wenchen Hu
The World Wide Web is growing rapidly, and as an important new medium for
communication, it provides a tremendous amount of information related to a wide range
of topics, creating new challenges for information retrieval. A search engine provides
users with an efficient means of searching for valuable information on the Web. This
project aims to improve the performance of text-based search engines by applying a
ranking algorithm based on the hyperlink structure. The new search engine works
on top of current text-based search engines. It is composed of a spider software
component, a page-ranking kernel, and a local database system. To evaluate the
performance of the new Web search engine, its results were compared to the results
obtained for the same queries from the search engines Alta Vista and Excite. The
experiments showed that the prototype performed significantly better than the purely
text-based search engines.
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
 1.1 The significance of the Web and Web search engines
 1.2 Disadvantages of the current search engines
 1.3 A proposed system for improving Web searches
2 LITERATURE REVIEW
 2.1 Mining the Web hyperlink structure
  2.1.1 Web structures
  2.1.2 A novel method using hyperlink information
  2.1.3 A prototype using the PageRank algorithm
 2.2 Web page ranking algorithm
3 SYSTEM STRUCTURE
 3.1 System interface
  3.1.1 Interface of the ranking system
  3.1.2 Web interface for listing rank results
 3.2 The spider
 3.3 Database
4 A WEB PAGE RANKING ALGORITHM USING HYPERLINK INFORMATION
 4.1 The algorithm
  4.1.1 Search and growth
  4.1.2 Weight and propagation
  4.1.3 The mathematical foundation of iteration
 4.2 Building the ranking system
  4.2.1 Constructing the root set
  4.2.2 Extending the root set into the base set
  4.2.3 Analyzing hyperlinks
  4.2.4 Calculating the rank score
 4.3 Examples of convergence
5 RESULTS AND DISCUSSION
 5.1 Quality comparison
  5.1.1 Comparison to Alta Vista
  5.1.2 Comparison to Excite
 5.2 Analysis of the results
  5.2.1 Removal of poor quality Web pages
  5.2.2 Removal of non-relevant Web pages
  5.2.3 Relationship between hubs and authorities
6 CONCLUSIONS AND FUTURE WORK
REFERENCES
LIST OF FIGURES

Figure 2.1 Directions of hypertext links [2]
Figure 2.2 Multiple links starting from the same page [3]
Figure 2.3 A single link of arbitrary depth [3]
Figure 2.4 A densely linked set of hubs and authorities [7]
Figure 3.1 The system structure
Figure 3.2 The execution path for retrieving database information
Figure 3.3 User interface of the ranking system
Figure 3.4 The Web interface for listing rank results
Figure 4.1 Expanding the root set into a base set
Figure 4.2 The basic operations for calculating authority and hub weights
Figure 5.1 Quality comparison
Figure 5.2 Good and excellent comparison
Figure 5.3 Quality comparison
Figure 5.4 Good and excellent comparison
Figure 5.5 Web page contents of www.iteachnet.com/wwwboard/wwwboard.html
Figure 5.6 Web page contents of http://w3.one.net/~ballet/
Figure 5.7 Web page contents of http://www.dancer.com/dance-links/ballet.htm
Figure 5.8 The mutually reinforcing relationship between hubs and authorities
LIST OF TABLES

Table 4.1 (a) Authority weights of returned URLs from Excite
Table 4.1 (b) Hub weights of returned URLs from Excite
Table 4.2 (a) Authority weights of returned URLs from Alta Vista
Table 4.2 (b) Hub weights of returned URLs from Alta Vista
Table 5.1 The top 10 URLs given by Alta Vista with query string "Computer job"
Table 5.2 The top 10 URLs from the ranking system with query string "Computer job"
Table 5.3 The top 10 URLs given by Alta Vista with query string "Ballet"
Table 5.4 The top 10 URLs from the ranking system with query string "Ballet"
Table 5.5 The top 10 URLs given by Alta Vista with query string "Camera"
Table 5.6 The top 10 URLs from the ranking system with query string "Camera"
Table 5.7 The top 10 URLs given by Excite with query string "Internet"
Table 5.8 The top 10 URLs from the ranking system with query string "Internet"
Table 5.9 The top 10 URLs given by Excite with query string "Java"
Table 5.10 The top 10 URLs from the ranking system with query string "Java"
Table 5.11 The top 10 URLs searched by Alta Vista and their authority weights for the query string "Computer job"
Table 5.12 The top 10 URLs from the ranking system and their old ranks from Alta Vista for the query string "Computer job"
Table 5.13 The top 10 URLs searched by Alta Vista with query "ballet"
Table 5.14 The top 10 authorities ranked by the ranking system with query "ballet"
Table 5.15 The top 5 authorities and top 5 hubs and their ranks and weights for the query "ballet"
Chapter 1 Introduction
1.1 The Significance of the Web and Web Search Engines
The World Wide Web is growing rapidly; with about 1 million pages being
added daily, the amount of information on the Web has changed the way people think
about and seek information. As an important medium, the Web provides a tremendous
amount of information related to a wide range of topics, and the number of both
experienced and inexperienced users is increasing at a phenomenal rate. When
searching the Web, users are usually looking for very specific information on a
particular topic. However, their searches collectively cover a huge range of topics, and
this combination of minute detail and diverse subject matter creates a tremendous
challenge for the development of information retrieval techniques. There must be a way
to locate information relevant to a user's particular interests within the available
reservoir of Web resources.
Search engines provide users with an efficient means of searching for valuable
information on the Web. Many search engines support information retrieval on the
Web, such as AltaVista, Excite, Infoseek, and Lycos. A search engine usually collects
Web pages on the Internet through a spider, also known as a crawler or robot, and the
collected pages are scanned and indexed based on the full text of the documents. In a
typical search procedure, the user submits a query, which is simply a word or
combination of words used as keywords. The search engine examines its backend
database for any document in the index that matches the query, and then returns a list
of related Web pages. In this way, a Web user can quickly obtain the set of all Web
pages in the search engine's database containing the given keywords.
1.2 Disadvantages of the Current Search Engines
Traditional text-based Web search engines, which rely on keyword matching, visit
Web sites, fetch pages, and analyze text information to build indices. With the explosive
growth in the amount of Internet information, the number of documents in these indices
has increased by many orders of magnitude. In particular, the results returned for a
query may contain several thousand, or even several million, relevant Web pages. For
example, if the search engine Excite is given the keyword "internet", over 25 million
Web pages will be found. Typically, a user is willing to look at only a few of these pages,
usually the first ten results. One of the problems of text-based search engines is that many
Web pages among the returned results are low-quality matches. It is also common
practice for some advertisers to attempt to gain people's attention by taking measures
meant to mislead automated search engines. This can include the addition of spurious
keywords to trick a search service into rating a page highly for a popular subject.
How to select the highest quality Web pages for placement at the top of the returned list
is the main concern of search engine design.
Another problem for those designing search engines is that most users are not
experts in information retrieval. The Web user asking the question may not have enough
experience to formulate the query correctly. It is not always intuitively easy to formulate
queries that narrow the search to the precise area of interest. Furthermore, regular users
generally do not understand how the search mechanisms work. As mentioned in [1], the
document indices constructed by search engines are designed to be general and applicable
to all. If a user tries to narrow a search with a key term that carries several senses,
irrelevant information is often presented. On the other hand, if a
user is skilled enough to formulate an appropriate query, most search engines will retrieve
pages with adequate recall (the percentage of relevant pages retrieved among
all possible relevant pages), but with poor precision (the ratio of relevant pages to the
total number of pages retrieved).
These disadvantages indicate that the performance of current search engines is
far from satisfactory. How to improve the quality of Web search results is
a subject of wide and ongoing study.
1.3 A Proposed System for Improving Web Searches
The purpose of this project is to develop a ranking system that can improve the
behavior of text-based search engines by implementing the HITS algorithm presented by
the IBM Almaden Research Center [6,7]. The algorithm mines the Web's link structure,
analyzing hyperlinks to uncover two types of pages:
• authorities, which provide the best source of information on a given topic;
• hubs, which provide collections of links to authorities.
The mutually reinforcing relationship between hubs and authorities (a good
authority is a page pointed to by many good hubs, while a good hub is a page that points
to many good authorities) serves as the central theme in the exploration of link-based
methods for Web information search.
The proposed system was implemented as a post-processor that works on top
of current search engines such as Alta Vista and Excite. It was developed to address both
relevance and quality. It focuses on the use of links for analyzing the
collection of pages relevant to a broad search topic, and for discovering the most
"authoritative" pages on each topic. The algorithm computes lists of hubs and authorities
for Web search topics. Beginning with a search topic, the rank model has two main steps:
• a search-and-growth phase, which constructs a collection of Web pages
with respect to a search topic by producing a set of relevant pages rich in
candidate authorities;
• a weight-and-propagation phase, which numerically estimates the weights of
hubs and authorities by an iterative procedure.
The rearranged search results are returned as authorities for the search topic:
the higher a page's authority weight, the higher its position in the list.
Chapter 2 Literature Review
Many studies have aimed to improve the behavior of current search engines.
The problem with current approaches is that they almost invariably
evaluate a Web page in terms of its text information alone. They fail to take into account
the Web's structure, in particular its hyperlinks. The link structure of a hyperlinked
environment can be a rich source of information about the content of that environment.
Analyzing the hyperlink structure of Web pages offers a way to improve the behavior of
text-based search engines, providing an effective method that can locate not only a set of
relevant pages, but relevant pages of the highest quality. In this chapter, we
present a short overview of the existing data mining methods.
2.1 Mining The Web Hyperlink Structure
One of the main advantages of the Web is its ability to redirect the information flow
via hyperlinks. In order to evaluate the informative content of a Web page, the Web
structure has to be carefully analyzed. Hyperlink analysis, which is capable of measuring
the potential information contained in a Web page with respect to the Web space, has
gained more and more attention recently.
The links to and from Web pages are an important resource that has largely gone
unused in existing search engines. Web pages differ from general text in that they possess
external and internal structure. The types of Web pages and the links between documents
can be useful information in finding pages for a given set of topics. Making use of the
Web link information allows the construction of more powerful tools for answering user
queries.
2.1.1 Web Structures
The types of links a Web site may contain have been fully studied in [2]. As Fig.
2.1 shows, hypertext links within a Web site can be upward in the file hierarchy,
downward, or crosswise. Links pointing to other sites are referred to as outward links
and can help identify the type of a Web page. For example, a page that contains many
outward links is typically a topic index page, while a page that contains many links,
most of them downward, is typically an institution homepage. In index sites such as
Yahoo, most of the links are downward links to subcategories or outward links.
Furthermore, we can infer other information about a page from the number of links to it
and from it. For example, we might guess a page to be popular if it has more links toward
it than from it. Note that pages have both topics (such as software engineering) and types
(such as homepage, index, or Yahoo page).
2.1.2 A Novel Method Using Hyperlink Information
A novel method has been presented which aims to increase the precision of Web
search results by extracting hyperlink information from a Web object [3]. This method
treats a Web page as an object. A Web object is composed not only of its static
textual information, but also of hyper information, which is the dynamic information
Figure 2.1 Directions of hypertext links. Links on the same server can be upward
or downward in the file hierarchy or crosswise. Links to other servers are
considered outward [2].
Figure 2.2 Multiple links starting from the same page [3].
Figure 2.3 A single link of arbitrary depth [3].
content provided by hyperlinks. Thus, the overall information describing a Web object
includes both hyper information and textual information, i.e.,
INFORMATION = TEXTINFO + HYPERINFO, where the value of INFORMATION
determines the position of a Web page with respect to a certain query. This model analyzes
the Web structure based on multiple links in the same page, shown in Fig. 2.2, where
A is the start page and B1, …, Bn are the pages pointed to by start page A. The arbitrary
depth k of a single link is shown in Fig. 2.3, where A is the start page and Bk is the page
pointed to by Bk-1.
For a single link such as that shown in Fig. 2.3, the hyper information for start
page A can be obtained by calculating the contribution of the Web object B at depth k,
the value of whose textual information is diminished via a fading factor depending on its
depth. Thus, the contribution to the hyper information of page A by an object B at depth
k is F^k · TEXTINFO(B), where F is a suitable fading factor (0 < F < 1). By fixing a certain
depth, the overall information of a given Web object A will be

INFORMATION(A) = TEXTINFO(A) + HYPERINFO(A)
 = TEXTINFO(A) + F·(TEXTINFO(B1) + F·(TEXTINFO(B2) + F·(TEXTINFO(B3) + … + TEXTINFO(Bk))))
 = TEXTINFO(A) + F·TEXTINFO(B1) + F²·TEXTINFO(B2) + … + F^k·TEXTINFO(Bk).
In general, a Web object has multiple links in the same page (see Fig. 2.2). The
user cannot follow all the links at the same time, but must select them sequentially. The
rank model assumes that the user selects the most informative link first and the
least informative link last. Then, for a given Web object A, the hyper information
contributed by all n links can be summed as

F·TEXTINFO(B1) + F²·TEXTINFO(B2) + … + F^n·TEXTINFO(Bn).
Compared to a random selection of the links, this "sequence of selections" is the
best sequence, maximizing the hyper information. This model can work on top of any
textual information function and has been implemented on the client side as a post-
processor for the main search engines.
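As an illustrative sketch of the fading-factor model (not the code used in [3]), the two cases above can be computed as follows; the page scores and the fading factor F = 0.5 are invented example values:

```java
// Sketch of the hyper-information model of [3]: a page at depth k along a
// single link chain contributes F^k * TEXTINFO, and the links of a page
// are credited in descending order of informativeness.
import java.util.Arrays;

public class HyperInfo {

    // Single-link case (Fig. 2.3): TEXTINFO(A) plus the faded
    // contributions of the chain B1..Bk.
    public static double chainInfo(double textInfoA, double[] chain, double f) {
        double total = textInfoA;
        double fade = f;
        for (double textInfoB : chain) {
            total += fade * textInfoB;   // F^k * TEXTINFO(Bk)
            fade *= f;
        }
        return total;
    }

    // Multi-link case (Fig. 2.2): the model assumes the user follows the
    // most informative link first, so fade in descending score order.
    public static double multiLinkInfo(double textInfoA, double[] links, double f) {
        double[] sorted = links.clone();
        Arrays.sort(sorted);                           // ascending
        double total = textInfoA;
        double fade = f;
        for (int i = sorted.length - 1; i >= 0; i--) { // walk descending
            total += fade * sorted[i];
            fade *= f;
        }
        return total;
    }

    public static void main(String[] args) {
        // TEXTINFO(A) = 1.0, chain scores 0.8 and 0.5, fading factor F = 0.5:
        // 1.0 + 0.5*0.8 + 0.25*0.5 = 1.525
        System.out.println(chainInfo(1.0, new double[]{0.8, 0.5}, 0.5));
    }
}
```

Sorting before fading reproduces the "best sequence of selections" described above: any other ordering of the same link scores yields a smaller hyper-information total.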
The above model is based on both link structure and textual information.
A newer model that depends heavily on hyperlink structure alone
has been developed in order to improve text-based search engines and obtain more
precise search results.
2.1.3 A New Search Engine Using the PageRank Algorithm
Google, a prototype with a full-text and hyperlink database, is designed to crawl
and index the Web efficiently and return much more satisfying search results than
existing systems [4]. It makes use of the link structure of the Web to calculate a quality
ranking for each Web page. The ranking algorithm used by Google is PageRank [5]. PageRank
extends to the Web the idea that the importance and quality of an academic publication
can be evaluated by counting its citations: the importance of a Web page can similarly
be evaluated by counting its back links.
In particular, the creation of a hyperlink by the author of a Web page represents
an implicit endorsement of the page being pointed to; by mining the collective judgment
contained in the set of such endorsements, people can gain a richer understanding of the
relevance and quality of the Web's contents. Thus, by counting links from all pages
equally, and by normalizing by the number of links on a page, a citation importance for a
Web page that corresponds well with people's subjective idea of importance can be
objectively measured. The PageRank value of a page A, PR(A), is given as follows:
PR(A) = (1-d) + d ( PR(T1)/C(T1) + …+ PR(Tn)/C(Tn) )
where T1, …, Tn are the pages pointing to page A, the parameter d is a damping factor
scaled between 0 and 1, and C(Ti) is the number of links going out of page Ti. The PageRank
PR(A) can be calculated using a simple iterative algorithm, and corresponds to the
principal eigenvector of the normalized link matrix of the Web. The PageRank metric,
PR(A), recursively defines the importance of a page A to be the weighted sum of the
ranks of the pages that link to it.
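The iterative calculation can be sketched as follows; the three-page graph, the damping factor d = 0.85, and the iteration count are invented for the example and are not Google's implementation:

```java
// Illustrative PageRank iteration for the formula above:
// PR(A) = (1-d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )
public class PageRankDemo {

    // links[i][j] == true means page i links to page j.
    public static double[] pageRank(boolean[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0);           // uniform starting ranks
        int[] outDegree = new int[n];             // C(T) for each page T
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (links[i][j]) outDegree[i]++;
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int a = 0; a < n; a++) {
                double sum = 0.0;
                for (int t = 0; t < n; t++)
                    if (links[t][a]) sum += pr[t] / outDegree[t]; // PR(T)/C(T)
                next[a] = (1 - d) + d * sum;
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0: page 2 collects the most rank.
        boolean[][] links = {
            {false, true,  true},
            {false, false, true},
            {true,  false, false}
        };
        double[] pr = pageRank(links, 0.85, 50);
        for (double v : pr) System.out.println(v);
    }
}
```

After a few dozen iterations the ranks stop changing; at the fixed point each rank satisfies the PageRank equation exactly, which is what "corresponds to the principal eigenvector" means in practice.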
2.2 Web Page Ranking Algorithm
Ranking algorithms, when applied to the large number of results returned by a
search engine, help users select the results most valuable to them from the sea
of Web resources. In practice, given a Web page p and a user's query q, the ranking
algorithm computes a score rank(p, q). The higher rank(p, q) is, the more
valuable the Web page p is likely to be for the query q. Various methods have been
applied to develop ranking algorithms. The Google prototype described previously is an
example of a ranking algorithm based on hyperlink analysis. Another
ranking model, the Clever system [6,7], was designed to improve the performance of
current search engines by using the Hyperlink-Induced Topic Search (HITS) algorithm.
It can work with any existing text-based search engine, rearranging the returned results by
applying its ranking algorithm. It classifies all the relevant pages returned for a given
query into two categories: authority pages, which contain rich information, and hub
pages, which collect the authority pages together.
The advantage of the Clever system is that it considers not only the in-degree but
also the out-degree of a Web page. An example is depicted in Fig. 2.4. The nodes
correspond to Web pages, and a directed edge indicates the presence of a link
between two pages. The hub pages effectively glue together authorities on a common topic.
This is a great improvement, avoiding the problem of unrelated pages of large in-degree
obtaining high rank scores.
In this chapter, some of the work that has been done to improve the behavior of text-
based search engines was presented. In the next chapter, the project itself will be introduced.
Figure 2.4 A densely linked set of hubs and authorities[7].
Chapter 3 System Structure
The project consists of two parts: a ranking system consisting of a spider and a
computational kernel, and a local database system for saving and retrieving the ranking
results. Fig. 3.1 shows the architecture of the project. The ranking system, illustrated in the
rectangular dashed block, searches URLs and computes the score for ranking pages. It
saves the URLs as well as their rank scores into a backend database. The details of
accessing the database information are shown in Fig. 3.2. The on-line database system
enables a user to retrieve the information saved by the ranking system, listing the URLs
according to their rank scores.
3.1 System Interface
3.1.1 Interface of the Ranking System.
The interface for the ranking system was written as a Java 1.1.7 applet. Fig. 3.3
shows the user interface. When a user inputs a query string into the keyword field of the
interface, the rank system first sends it to a text-based search engine. For this
project, the search engine used was Alta Vista or Excite; the user can select either search
engine from the interface directly. The number of returned URLs is controlled by
entering the desired number into the "search limit" field. All the URLs found by the
selected search engine make up the root set. The root set is expanded into the base set by
adding newfound URLs referenced by any member of the root set. The details concerning the
construction of the root set and base set will be described in Section 4.2. The interface
Figure 3.1 The system structure.
Figure 3.2 The execution path for retrieving database information.
Figure 3.3 User interface of the ranking system.
provides two windows in which to list the root set URLs and the base set URLs. Users
can watch the growth of the root set and base set while the spider program is running in
the background. A label between these two windows indicates what stage the process
has reached after the rank system starts working.
3.1.2 Web Interface For Listing Rank Results
Fig. 3.4 illustrates the Web interface for accessing the local database information.
The Web interface was written in HTML. For listing the rank results, it requires a user to
first choose a topic and decide on the number of URLs to be included in the returned list,
then select the search engine associated with the rank results, and finally select the rank
method. The database system can then return a list of URLs based on the selected search
engine, relevant to the chosen topic, which have been ranked according to the specified
ranking method.
3.2 The Spider
The implementation of the rank system includes the use of spider software. The
spider, which may also be called a crawler or a robot, is a software program that can
automatically traverse the Web and download the network resource referred to by a URL.
The working mechanism of a spider is simple. Spiders start by parsing a specified Web
page, noting any hypertext links on that page that point to other Web pages. They then
parse those pages for new links, recursively. Spider software does not actually move
around to different computers on the Internet, as viruses or intelligent agents do, but
resides on a single machine and sends HTTP requests for documents to other machines
Figure 3.4 The web interface for listing rank results.
on the Internet, just as a Web browser does when the user clicks on links. All a spider
really does is automate the process of following links. Following links is not itself of any
great use, but the list of linked pages almost always serves some subsequent purpose.
The most common use is to build an index for a Web search engine, although
spiders serve other purposes as well. Spiders may also be used to:
• Test web pages and links for valid syntax and structure.
• Monitor sites to see when their structure or contents change.
• Search for copyright infringements.
• Build a special-purpose index—for example, one that has some understanding of
the content stored in multimedia files on the Web.
In this project, the spider executed the task of building the root set and extending
the root set into the base set. It downloaded the contents of a URL and picked out all the
URLs referenced by the Web page.
The spider program was written in Java 1.1.7. It used the URL class and its
method openStream() to download the contents of a specified URL. The spider
identified the URL links of a specified Web page by parsing the downloaded information.
Several methods can be used to discover the URLs in an HTML file. In my program, the
spider collected all the URLs for building the root set and the base set by picking out the
string following the HTML tag "<a href=". The number of URLs in the root set was also
controlled by the spider software.
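A minimal sketch of this link-extraction step might look as follows. It is an illustration rather than the project's actual parser: it handles only double-quoted href attributes, omits the download via openStream(), and the sample page is invented:

```java
// A minimal link extractor in the spirit of the report's spider: scan
// downloaded HTML for the prefix "<a href=" and collect the quoted URL
// that follows. Real pages need more robust parsing; this sketch covers
// only the double-quoted form, case-insensitively.
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {

    public static List<String> extractLinks(String html) {
        List<String> urls = new ArrayList<>();
        String tag = "<a href=\"";
        String lower = html.toLowerCase();  // match <A HREF= as well
        int pos = 0;
        while ((pos = lower.indexOf(tag, pos)) != -1) {
            int start = pos + tag.length();
            int end = html.indexOf('"', start); // closing quote of the URL
            if (end == -1) break;
            urls.add(html.substring(start, end));
            pos = end + 1;
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<a href=\"http://www.auburn.edu/\">Auburn</a>"
            + "<A HREF=\"http://www.altavista.com/\">Alta Vista</A>"
            + "</body></html>";
        System.out.println(extractLinks(page));
    }
}
```

In a full spider, each extracted URL would be fetched in turn (e.g., via `new URL(u).openStream()`) and fed back through the same extraction, which is exactly the recursion described above.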
3.3 Database
Oracle 8 was used as the backend database. The database was designed to be
simple and illustrative. One table named PAGE was created to store each page’s
information. Each record has 6 fields: keyword, URLs, search engine, old rank, authority
rank, and hub rank. The first three fields were specified to be not null.
The database access follows the execution path shown in Fig. 3.2. The user input
from the Web page is extracted and passed to a Java program as arguments by a Perl
program. The Perl program does not talk to the database directly in this project. It calls
the Java program, which can access the database and operate on it with the arguments it
receives. JDBC is JavaSoft’s database connectivity specification. It creates a
programming-level interface for communicating with databases in a uniform manner. The
Java program talks to the database using SQL statements and prints out reports for users
as HTML files through the Perl program. Finally, users can retrieve the database
information through a Web browser.
Chapter 4 A Web Page Ranking Algorithm Using Hyperlink Information
4.1 The Algorithm
There are two phases in the development of this ranking system. The first is the
search and growth phase. The second is the weight and propagation phase, in which the
results returned by the first stage are evaluated.
4.1.1 Search And Growth
To analyze the hyperlink information of available WWW pages, the ranking
system first constructs a collection of Web pages about a query string σ. Since the search
results may contain millions of pages, the number of Web pages in the collection must be
limited to a reasonable quantity so that the system can reach a compromise between
obtaining a collection of highly relevant pages and saving computational effort. To
construct such a collection of pages, the ranking system makes use of the results given
by a text-based search engine. The search engine returns a set of documents, determined
by its own scoring function, as the root set R_σ. The system then extends the root set
R_σ by adding every additional document that is pointed to by a document already in the root
set. This is shown in Fig. 4.1. The new collection is renamed the base set and
denoted by S_σ. In this way, the link structure analysis can be restricted to a subset S_σ,
which has the following properties:
(1) S_σ is relatively small.
Figure 4.1 Expanding the root set into a base set.
(2) S_σ is rich in relevant pages.
(3) S_σ contains most (or many) of the strongest authorities.
Next, the ranking system calculates the rank score of each page based on the link
structure between node pairs in the base set S_σ, and extracts good authorities and
hubs from the overall collection of pages.
4.1.2 Weight and Propagation
In this phase, the basic principle introduced in Chapter 1, which assumes that a good
authority page is pointed to by many good hub pages and a good hub page points to many
good authority pages, is converted into a method for finding good hubs and authorities.
When applying this method, each page p is assigned a non-negative authority
weight x(p) and a non-negative hub weight y(p). The relationship between hubs and
authorities is expressed via an iterative computation that maintains and updates the
numerical weights for each page. As the results are evaluated, a good authority receives a
high score for x and a good hub receives a high score for y.
The iteration would lead to fast growth in the actual magnitudes of x(p) and y(p). In
order to keep their values bounded, the instantaneous weights of x and y are normalized
in the algorithm. In this project, all x and y values were initially set to a uniform constant,
and the weights of each type were normalized as follows:

Σ_{p ∈ S_σ} (x(p))² = 1,  (1)

Σ_{p ∈ S_σ} (y(p))² = 1.  (2)

Thus, we maintain the sum of their squares at 1. Since only the relative values
matter in this computation, the final results are essentially unaffected by the
initialization of all weights.
An alternative way of expressing the mutually reinforcing relationship
between hubs and authorities is: if a page p points to many pages with large x-values, it
should receive a large y-value; and if a page p is pointed to by many pages with large y-
values, then it should receive a large x-value. Thus, it is reasonable to update x(p) for a
page p to be the sum of y(q) over all pages q that link to p:

x(p) = Σ y(q), over all q such that q → p,  (3)

where the notation q → p indicates that q links to p.
Similarly, we can update the hub weight via

y(p) = Σ x(q), over all q such that p → q.  (4)

Fig. 4.2 shows these two operations (3) and (4), the basic methods by which hubs and
authorities reinforce one another in an alternating iteration.
Each iteration consists of two steps:
(1) replace each x(p) by the sum of the y(q) values of pages pointing to p;
(2) replace each y(p) by the sum of the x(q) values of pages pointed to by p.
In the algorithm, the iteration will not stop until a fixed point is reached, i.e., both
the authority and hub weights converge to fixed values.
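The two update steps above can be sketched directly in Python. This is a minimal illustration with a hypothetical three-page graph; `in_links` and `out_links` map each page to the pages linking to it and linked from it:

```python
def hits_iteration(pages, in_links, out_links, y):
    """One iteration: update authority weights x from hub weights y,
    then update hub weights from the new authority weights."""
    x = {p: sum(y[q] for q in in_links.get(p, [])) for p in pages}
    y_new = {p: sum(x[q] for q in out_links.get(p, [])) for p in pages}
    return x, y_new

# Hypothetical graph: pages a and b both link to page c.
pages = ["a", "b", "c"]
out_links = {"a": ["c"], "b": ["c"]}
in_links = {"c": ["a", "b"]}
y0 = {p: 1.0 for p in pages}
x1, y1 = hits_iteration(pages, in_links, out_links, y0)
# Page c accumulates authority weight; a and b accumulate hub weight.
```

In a full run, the weights would also be normalized after each iteration, as described above.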
4.1.3 The Mathematical Foundation of Iteration
The mathematical foundation of the iterative method follows from the theory of
eigenvectors in [8]. To explain it simply, let us define an n × n adjacency matrix A,
whose (i, j)th entry is equal to 1 if page i links to page j, and 0 otherwise. In our case,
we can treat the set of all authority weights x as a vector x = (x1, x2, …, xn),
Figure 4.2 The basic operations for calculating authority and hub weights:
x[p] := sum of y[q], for all pages q pointing to p;
y[p] := sum of x[q], for all pages q that p points to.
and in the same way define the set of all hub weights y as a vector y = (y1, y2, …, yn). Then
the update rule for x can be written as x ← A^T y and the update rule for y can be written as
y ← A x. Going further, we can write

x ← A^T y ← A^T A x = (A^T A) x

and

y ← A x ← A A^T y = (A A^T) y.

Thus, the vector x (respectively y) after multiple iterations is precisely the result of applying the power
iteration technique to A^T A (respectively A A^T). Linear algebra tells us that this sequence of iterates, when
normalized, converges to the principal eigenvector of A^T A. Similarly, the sequence of
values for the normalized vector y converges to the principal eigenvector of A A^T. The
relationship between eigenvectors and power iteration is given in detail in [9].
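This convergence can be checked with a small power-iteration sketch in pure Python. The three-page adjacency matrix below is hypothetical; repeatedly multiplying a normalized vector by A^T A drives it toward the principal eigenvector, i.e., the authority weights:

```python
def matmul(P, Q):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def power_iteration(M, k=20):
    """Repeatedly apply M to a vector of ones, normalizing each step."""
    v = [1.0] * len(M)
    for _ in range(k):
        v = [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return v

# Hypothetical adjacency matrix: pages 0 and 1 both link to page 2.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
AT = [list(col) for col in zip(*A)]
x = power_iteration(matmul(AT, A))  # authority weights
# The weight concentrates on page 2, the page everyone links to.
```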
4.2 Building the Ranking System
4.2.1 Constructing the Root Set
Since the ranking system works on top of other text-based search engines, all
the URLs in the root set are actually the search results returned by whichever existing
search engine, such as Alta Vista or Excite, the user has chosen. For
those Web search engines, once the query term is submitted via their interface, a
formatted query statement is constructed and sent to their database through CGI-BIN.
Usually, each search engine has its own format for the URL of the returned page. Thus, it
is necessary to construct different URL formats for each individual search engine in order
to get a list of the Web pages associated with the query term. However, the URL itself
follows a fixed format for each search engine, which makes automatic searches possible.
If Alta Vista is chosen, the URLs of the returned pages are:
• For listing the top 10 matched pages
http://www.altavista.com/cgi-bin/query?pg=q&what=web&fmt=.&q=keyword
• For listing the next 10 matched pages
http://www.altavista.com/cgi-bin/query?
pg=q&stype=stext&Translate=on&sc=on&q="+keyword+"&stq="+10
• For listing the next 20-30 matched pages
http://www.altavista.com/cgi-bin/query?pg=q
&stype=stext&Translate=on&sc=on&q="+keyword+"&stq="+20
The embedded “keyword” in these URLs can be any query string. If the query
term consists of more than one word, for example “computer book”, the format of
“keyword” will be “computer+book”.
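In Python, this encoding can be obtained with the standard library; the Alta Vista URL format here is the one listed above:

```python
from urllib.parse import quote_plus

# Multi-word query terms are joined with "+" in the query URL.
keyword = quote_plus("computer book")
url = ("http://www.altavista.com/cgi-bin/query"
       "?pg=q&what=web&fmt=.&q=" + keyword)
```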
This method also works with Excite. If Excite is chosen,
• The URL of the returning page that displays the top 10 matched pages is
http://search.excite.com/search.gw?search=keyword
• The URL for listing the next 10 matched URLs is
http://search.excite.com/search.gw?c=web&s="+keyword+"&showSummary=true&start="+
10+"&perPage=10&next=Next+Results
In this way, the spider software can walk through the Web pages returned by
a chosen search engine, extract the URLs following the string “<a href=”, and build a
collection of URLs which becomes the root set. In this project, the size of the root set is
limited to 200. Notice that the page Excite returns for listing matched results also contains
other, non-relevant URLs; the spider program needs additional controls to avoid invoking
the URLs of advertisements and other inappropriate links that may be returned.
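The URL extraction can be sketched with a simplified regular expression; real pages require more robust handling of quoting styles and relative links, so this pattern is only illustrative:

```python
import re

# Simplified pattern for links of the form <a href="...">
HREF = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)

html = '<p><A HREF="http://java.sun.com/">Java</A></p>'
urls = HREF.findall(html)
# urls now holds the extracted link targets
```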
4.2.2 Extending the Root Set Into the Base Set
The base set is an expansion of the root set obtained by crawling the Web pages
returned by a text-based search engine. The spider program visits each page in the root set
to extract all the hyperlinks it contains. The mechanism for finding new URLs
referenced by any page in the root set is similar to that used to collect the URLs from the
Web pages returned by the chosen search engine.
The logical procedure used to build the base set is as follows:

Find_Sσ (σ, E, n)
    Sσ: the base set.
    σ: a query string.
    E: a specified text-based search engine.
    n: a natural number (the size of the root set).
    Input σ as the keyword into search engine E.
    Let Rσ denote the root set.
    Rσ := the top n results (the highest-ranked pages) returned by E.
    Set Sσ := Rσ.
    Walk through each Web page within Rσ by applying a spider program.
    For each page p ∈ Rσ
        Let Γ(p) denote the set of all pages p points to.
        Add all pages in Γ(p) to Sσ.
    End
    Return Sσ.
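The procedure above can be sketched in Python; `search_engine` and `extract_links` are hypothetical stand-ins for the real query interface and spider components:

```python
def find_base_set(query, search_engine, extract_links, n=200):
    """Build the root set from the top-n results, then expand it into the base set."""
    root = search_engine(query)[:n]        # the root set R
    base = set(root)
    for page in root:                      # spider each root page
        base.update(extract_links(page))   # add every page it points to
    return base

# Stub components standing in for the real search-engine query and spider.
links = {"a": ["c"], "b": ["c", "d"]}
base = find_base_set("java", lambda q: ["a", "b"], lambda p: links.get(p, []))
```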
4.2.3 Analyzing Hyperlinks
The ranking system will rearrange the URLs in the root set by using the hyperlink
information for all the Web pages in the base set. The process of analyzing hyperlinks is
done twice. The first occurs while the spider performs the task of building the base set.
Later, the spider will crawl all the newly found Web pages in the base set in order to
perform the hyperlink analysis used for evaluating the rank score.
A Web page can link to many other pages, which may in turn reference the Web
page. When the spider crawls each Web page in the root set, it not only executes the task
of extracting URLs in the Web page it is visiting but also registers the newfound URLs as
outward links for that Web page. After the base set has been constructed, the spider walks
through all the newfound Web pages, extracting the URLs of each visiting Web page
again and comparing them with all existing URLs. If an extracted URL is in the base set,
it is registered as an outward link of the visiting page, and the visiting page itself is
registered as an inward link of that URL; the spider only cares about URLs that already
exist in the base set. After walking through all
the URLs in the base set, the hyperlinks among Web pages are recorded and saved for
use in the next step.
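This registration step can be sketched as follows; the `page_links` mapping from each visited page to its extracted URLs is an assumed data layout:

```python
def register_links(base_set, page_links):
    """Record outward and inward links, keeping only targets inside the base set."""
    out_links = {p: [] for p in base_set}
    in_links = {p: [] for p in base_set}
    for p in base_set:
        for target in page_links.get(p, []):
            if target in base_set:         # URLs outside the base set are ignored
                out_links[p].append(target)
                in_links[target].append(p)
    return out_links, in_links

# Page "a" links to "b" (in the base set) and to an outside URL "x".
out_links, in_links = register_links({"a", "b"}, {"a": ["b", "x"]})
```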
4.2.4 Calculating the Rank Score
First, the type of link structure between any node pairs must be identified. The
study of the type of links has been done in [2]. Hypertext links within a Web site can be
upward in the file hierarchy, downward, or crosswise. Only the links which point to other
sites are referred to as outward links. In this project, all links are classified as either
transverse or intrinsic:
• a link is transverse if it is between pages with different domain names;
• a link is intrinsic if it is between pages with the same domain name.
The “domain name” here means the first level in the URL string associated with a
page. Thus, the upward, downward, and crosswise links all fall into the same category:
intrinsic links. Since intrinsic links very often exist purely to allow navigation of the
infrastructure of a site, they convey much less information than transverse links, which
carry information on the authority of the pages they point to. When counting the
in-degree and out-degree of a Web page, intrinsic links should not be taken into account.
Only the edges corresponding to transverse links are kept in Sσ;
this results in a new graph Gσ.
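A minimal sketch of this distinction, assuming "domain name" can be read off as the host part of the URL:

```python
from urllib.parse import urlparse

def is_transverse(src_url, dst_url):
    """A link is transverse if source and target have different domain names."""
    return urlparse(src_url).hostname != urlparse(dst_url).hostname

# Links between different hosts are transverse; same-host links are intrinsic.
```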
There are still other issues that require attention. A phenomenon can be observed
in which a large number of pages from a single domain all point to a single page p. In
many cases, this corresponds to a mass endorsement, advertisement, or some other type
of “collusion” among the referring pages. Such links do not intuitively confer
authority and should not be included in the new graph Gσ. To avoid this
problem, we can fix a parameter m, typically
between 4 and 8, and allow at most m pages from a single domain to point to any
given page p. This control was employed in this project and was shown to be an effective
solution in most cases.
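The per-domain cap can be sketched as follows; the layout of `in_links` (a mapping from each target page to its referring URLs) is an assumption of this illustration:

```python
from urllib.parse import urlparse

def cap_per_domain(in_links, m=4):
    """Keep at most m referring pages from any single domain per target page."""
    capped = {}
    for target, sources in in_links.items():
        per_domain = {}
        kept = []
        for src in sources:
            d = urlparse(src).hostname
            if per_domain.get(d, 0) < m:
                per_domain[d] = per_domain.get(d, 0) + 1
                kept.append(src)
        capped[target] = kept
    return capped

# Six referrers from one domain collapse to at most m = 4.
links = {"t": ["http://a.com/%d" % i for i in range(6)]}
capped = cap_per_domain(links, m=4)
```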
The steps performed so far produce a small graph Gσ that contains many relevant
pages and strong authorities. These authorities both belong to Gσ and are heavily
referenced by pages within Gσ. It is now possible to extract the authorities which are
most likely to answer the query from the overall collection of pages Gσ.
In order to find the hubs and authorities in the Web page collection Gσ with
respect to the query string σ, the sets of weights {x(p)} and {y(p)} are represented as
vectors x and y respectively, and the following procedure is used:
Iterate (Gσ, k)
    Gσ: a collection of n linked pages.
    k: a natural number (the iteration number).
    Let z denote the vector (1, 1, 1, …, 1) ∈ R^n.
    Set x_0 := z.
    Set y_0 := z.
    For i = 1, 2, …, k
        Apply the x(p) = Σ y(q) operation to (x_{i-1}, y_{i-1}), obtaining new x-weights x'_i.
        Apply the y(p) = Σ x(q) operation to (x'_i, y_{i-1}), obtaining new y-weights y'_i.
        Normalize x'_i, obtaining x_i.
        Normalize y'_i, obtaining y_i.
    End
    Return (x_k, y_k).
This procedure can be used to filter out the top c authorities and top c hubs in the
following simple way (c is the number of URLs in the return list):
Filter (Gσ, k, c)
    Gσ: a collection of n linked pages.
    k, c: natural numbers.
    (x_k, y_k) := Iterate (Gσ, k).
    Report the pages with the c largest coordinates in x_k as authorities.
    Report the pages with the c largest coordinates in y_k as hubs.
Typically, the value of c is between 5 and 10. Testing the number of iterations k
by applying Iterate( ) with arbitrarily large values, we found that Iterate( ) converges
quite rapidly; k = 20 is sufficient for both the authority and hub weights to
converge to fixed values.
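The Filter step reduces to selecting the c largest coordinates; a minimal sketch with hypothetical weights:

```python
def filter_top(x, y, c=10):
    """Report the pages with the c largest authority and hub weights."""
    authorities = sorted(x, key=x.get, reverse=True)[:c]
    hubs = sorted(y, key=y.get, reverse=True)[:c]
    return authorities, hubs

# Hypothetical converged weights for three pages
weights_x = {"p1": 0.2, "p2": 0.9, "p3": 0.5}
weights_y = {"p1": 0.7, "p2": 0.1, "p3": 0.4}
auth, hubs = filter_top(weights_x, weights_y, c=2)
```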
4.3 Examples of Convergence
Tables 4.1a, 4.1b, 4.2a, and 4.2b indicate the relationships between the weight
values and the iteration numbers. From the data, we can see that both the authority
weights and hub weights converged quickly, and 20 iterations were sufficient to obtain
values very close to the final values. Tables 4.1a and 4.1b list the top 10 URLs returned
by the search engine Excite; the data in Table 4.1a give the authority weights and Table 4.1b
lists the hub weights. The URLs were the search results for the query “Java”. Tables 4.2a
and 4.2b indicate the change of the authority and hub weight values for the top 10
URLs returned by the search engine Alta Vista; the data in Table 4.2a are the authority
weights and Table 4.2b lists the hub weights. The URLs were the search results for the
query “camera”.
Table 4.1. (a) Authority weights of returned URLs from Excite.

URL                                          Iteration 1  Iteration 20  Iteration 200
http://java.sun.com/                         52           1.13          1.13
http://www.zdnet.com/devhead/filters/java/   1            4.30E-55      0
http://metalab.unc.edu/javafaq/              9            0.45          0.45
http://javaboutique.internet.com/            30           3.04          3.04
http://www.javareport.com/                   37           3.35          3.35
http://java.wiwi.uni-frankfurt.de/           9            0.072         0.072
http://www.javalobby.org/                    18           0.078         0.078
http://caffeinebuzz.com/                     2            0.098         0.098
http://freewarejava.com/                     5            0.36          0.36
http://www.gamelan.com/                      106          6.35          6.35
Table 4.1. (b) Hub weights of returned URLs from Excite.
URL                                          Iteration 1  Iteration 20  Iteration 200
http://java.sun.com/                         25           3.13          3.13
http://www.zdnet.com/devhead/filters/java/   23           6.52E-26      0
http://metalab.unc.edu/javafaq/              53           2.18          2.18
http://javaboutique.internet.com/            17           1.17E-27      0
http://www.javareport.com/                   18           2.06          2.06
http://java.wiwi.uni-frankfurt.de/           27           4.17          4.17
http://www.javalobby.org/                    89           25.72         25.72
http://caffeinebuzz.com/                     13           7.11          7.11
http://freewarejava.com/                     15           3.08          3.08
http://www.gamelan.com/                      23           1.04          1.04
Table 4.2. (a) Authority weights of returned URLs from Alta Vista.
URL                                                       Iteration 1  Iteration 20  Iteration 200
http://testo.camera.it/                                   1            8.00E-37      0
http://www.bouldernews.com/                               2            5.35E-25      0
http://www.dcn.com/                                       2            5.35E-25      0
http://www2.famille.ne.jp/~bud1_bis/camera/index_e.html   7            3.21E-24      0
http://www.fcw.com/civic/articles/com-camera.asp          1            8.06E-37      0
http://www.dcresource.com/                                1            8.06E-37      0
http://www.camera-net.com/                                1            8.06E-37      0
http://www.pie.camcom.it/camera-arbitrale/index.html      4            0.26          0.26
http://www.camera.it/index.asp                            1            8.06E-37      0
http://www.edromney.com/                                  1            8.06E-37      0
Table 4.2. (b) Hub weights of returned URLs from Alta Vista.
URL                                                       Iteration 1  Iteration 20  Iteration 200
http://testo.camera.it/                                   5            2.66E-18      0
http://www.bouldernews.com/                               20           1.45          1.45
http://www.dcn.com/                                       1            4.04E-25      0
http://www2.famille.ne.jp/~bud1_bis/camera/index_e.html   8            1.47E-15      0
http://www.fcw.com/civic/articles/com-camera.asp          6            2.80E-17      0
http://www.dcresource.com/                                14           7.24E-12      0
http://www.camera-net.com/                                1            4.04E-25      0
http://www.pie.camcom.it/camera-arbitrale/index.html      5            1.11          1.11
http://www.camera.it/index.asp                            1            4.04E-25      0
http://www.edromney.com/                                  2            1.21E-17      0
Chapter 5 Results and Discussion
This chapter gives the experimental results of the new ranking system, and
compares its performance to that of the existing search engines Alta Vista and Excite.
Several query strings were input to the ranking system through its interface. For each
topic, the top 10 pages from Alta Vista/Excite were compared to the top 10 authorities
returned by the ranking system. The comparisons are divided into two sections: the
first compares the results according to a quality score for each Web page; the
second selects some Web pages as examples, analyzes their hyperlink structures, and
discusses how hyperlink information can help improve the search results.
5.1 Quality Comparison
To give a quantitative measure, a numerical value was used to evaluate the quality
of each Web page. The quality score was assigned in terms of the page's utility in
providing information about the topic covered by the query. Two attributes were used to
determine the quality scores in this project:
(1) Relevancy: If the content of a Web page has nothing to do with the query string,
the URL's quality score is 0. If it has some relationship with the query string, the
quality score is 1, 2, or 3, representing a poor, good, or excellent relationship,
respectively.
(2) Dead link: a URL that cannot be opened. If a URL is a dead link, its quality
score is -1.
Hence, the quality of an individual Web page is represented as a numerical score
from -1 to 3. The quality scores of all the Web pages were obtained by reading the
contents of each; obviously, the higher the quality score, the higher the quality of the
Web page. Although quality judgment by humans may not be well defined, since the
system will be used by humans searching the Web, this is a reasonable method of
assessing the results.
5.1.1 Comparison to Alta Vista
Tables 5.1, 5.3, and 5.5 list the quality scores of the top 10 URLs returned by Alta
Vista for the keywords “computer job”, “Ballet”, and “Camera”. Tables 5.2, 5.4, and 5.6
list the quality scores of the new top 10 URLs after applying the ranking system.
In order to evaluate the value of the authority pages returned by the ranking
system quantitatively, we summed the quality values of all the URLs in each table and
computed the percentage of good/excellent Web pages in the top 10 URLs.
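As an illustration, both summary figures can be computed directly from a score list; the scores below are those of Table 5.1 (Alta Vista, "Computer job"):

```python
def summarize(scores):
    """Sum the quality scores and compute the fraction of good/excellent pages."""
    total = sum(scores)
    good_or_excellent = sum(1 for s in scores if s >= 2) / len(scores)
    return total, good_or_excellent

# Quality scores of Alta Vista's top 10 URLs for "Computer job" (Table 5.1)
total, frac = summarize([2, -1, 2, 0, 1, -1, 3, 2, 1, -1])
# total is 8, matching the table; 4 of 10 pages score good or excellent
```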
Figure 5.1 shows the sum of quality scores in bar graphs, with the x axis listing
the query strings and the y axis representing the sum of quality scores.
Figure 5.2 shows the percentage of good/excellent pages in a bar graph, with the x
axis listing the query strings and the y axis representing the percentage of good/excellent
Web pages. The data listed in the tables and bar graphs show the disadvantages of the
current text-based search engines and the improvement in performance after applying
Table 5.1 The top 10 URLs given by Alta Vista with query string “Computer job”.
Rank  URL                                                        Quality Score  Status
1     http://www.jobbankusa.com/www.html                         2              Good
2     http://www.dalmanjobs.com/                                 -1             Dead link
3     http://www.jobs.com/                                       2              Good
4     http://www.iteachnet.com/wwwboard/wwwboard.html            0              Non-relevant
5     http://www.chisoft.com/                                    1              Poor
6     http://members.home.net/oacjlp/cfall1.htm                  -1             Dead link
7     http://www.siliconalleyjobs.com/                           3              Excellent
8     http://www.n-s-i.net/fact.html                             2              Good
9     http://hrea.org/lists/huridocs-tech/markup/msg00250.html   1              Poor
10    http://www.churchfriends.com/board/biz/messages/7.html     -1             Dead link
Total quality scores 8
Table 5.2 The top 10 URLs from the ranking system with query string “Computer job”.
Rank  URL                                       Quality Score  Status
1     http://www.itcomputerjobsearch.com/       2              Good
2     http://www.athomebusinessportal.com/      2              Good
3     http://www.stayathomework.com/            2              Good
4     http://www.daytraderstocktrader.com/      1              Poor
5     http://houston.computerwork.com/          3              Excellent
6     http://bayarea.computerwork.com/          3              Excellent
7     http://philadelphia.computerwork.com/     3              Excellent
8     http://twincities.computerwork.com/       3              Excellent
9     http://www.virtualbusinesswebhost.com/    0              Non-relevant
10    http://www.joltjobs.com/                  3              Excellent
Total quality scores 22
Table 5.3 The top 10 URLs given by Alta Vista with query string “Ballet”.
Rank  URL                                      Quality Score  Status
1     http://www.ballet.co.uk/                 2              Good
2     http://www.ballet.org.uk/                2              Good
3     http://www.pbt.org/                      2              Good
4     http://www.royal-ballet-school.org.uk/   2              Good
5     http://www.nzballet.org.nz/              2              Good
6     http://www.danvilleballet.org/           2              Good
7     http://www.westchesterballet.org/        2              Good
8     http://www.nycballet.com/                2              Good
9     http://w3.one.net/~ballet/               0              Non-relevant
10    http://www.theshopatvcb.com/             1              Poor
Total quality scores 17
Table 5.4 The top 10 URLs from the ranking system with query string “Ballet”.
Rank  URL                                            Quality Score  Status
1     http://www.kirovballet.com/                    2              Good
2     http://www.joffrey.com/                        2              Good
3     http://www.het-nationale-ballet.nl/            2              Good
4     http://www.balletwest.org/                     2              Good
5     http://www.en-ballet.co.uk/                    2              Good
6     http://www.hamburgballett.de/                  2              Good
7     http://www.coloradoballet.org/                 2              Good
8     http://www.cincinnatiballet.com/               2              Good
9     http://www.koninklijkballetvanvlaanderen.be/   2              Good
10    http://www.balletaustin.org/                   2              Good
Total quality scores 20
Table 5.5 The top 10 URLs given by Alta Vista with query string “Camera”.
Rank  URL                                                        Quality Score  Status
1     http://www.camerasphere.com/                               2              Good
2     http://www.mi.camcom.it/                                   0              Non-relevant
3     http://testo.camera.it/                                    0              Non-relevant
4     http://www.mpegcam.net/                                    1              Poor
5     http://www.bouldernews.com/                                2              Good
6     http://www.dcn.com/                                        0              Non-relevant
7     http://www2.famille.ne.jp/~bud1_bis/camera/index_e.html    1              Poor
8     http://www.fcw.com/civic/articles/com-camera.asp           1              Poor
9     http://www.ckcpower.com/camerabags.htm                     2              Good
10    http://www.geocities.com/~ffrog/campath.html               0              Non-relevant
Total quality scores 9
Table 5.6 The top 10 URLs from the ranking system with query string “Camera”.
Rank  URL                                                  Quality Score  Status
1     http://www.buffzone.com/                             2              Good
2     http://www.cameraarts.com/                           2              Good
3     http://www.dcresource.com/                           3              Excellent
4     http://www.acecam.com/                               3              Excellent
5     http://www.samsungcamera.com/                        2              Good
6     http://www.digital-cameras.com/                      2              Good
7     http://www.cameraworld.com/                          3              Excellent
8     http://www.glazerscamera.com/welcome.htm             3              Excellent
9     http://www.wolfcamera.com/                           3              Excellent
10    http://www.pie.camcom.it/camera-arbitrale/index.html 1              Poor
Total quality scores 24
Figure 5.1 Quality comparison: sums of quality scores (y axis) for the query strings
“Computer job”, “Ballet”, and “Camera” (x axis), comparing Alta Vista with the new ranking.

Figure 5.2 Good and excellent comparison: percentage of good and excellent URLs (y axis)
for the same query strings (x axis), comparing Alta Vista with the new ranking.
the ranking system.
5.1.2 Comparison to Excite
The scoring function employed by Excite does not rely on pure “textual
information”; Excite is one of the search engines that pay the most attention to the
structure of WWW pages. Hence, dead links and non-relevant pages are not included in
the top 10 URLs returned by Excite, and the ranking system has only a limited effect on
the results returned by this search engine.
Tables 5.7 and 5.9 list the quality scores of the top 10 URLs returned by Excite
for the keywords “Internet” and “Java”. Tables 5.8 and 5.10 list the quality scores of
the new top 10 URLs after applying the ranking system.
Figure 5.3 shows the sums of the quality scores in a bar graph, with the x axis
listing the query strings and the y axis representing the sum of the quality scores.
Figure 5.4 shows the percentage of good/excellent pages in a bar graph, with the x
axis listing the query strings and the y axis representing the percentage of good/excellent
Web pages.
From the results, it is difficult to come to a definite conclusion as to which system
is better. The quality of the URLs given by Excite and that of the URLs ranked by the
ranking system are competitive with each other. This is consistent with the subjective
impression that Excite pays close attention to the structure of WWW pages.
Table 5.7 The top 10 URLs given by Excite with query string “Internet”.
Rank  URL                                                       Quality Score  Status
1     http://www.leapday.demon.nl/Introduc.htm                  2              Good
2     http://www.currents.net/resources/ispsearch/intquest.html 2              Good
3     http://www.viainter.net/                                  1              Poor
4     http://www.linkexchange.com/                              2              Good
5     http://www.microsoft.com/                                 2              Good
6     http://www.windows95.com/                                 1              Poor
7     http://www.ipl.org/                                       2              Good
8     http://www.ibill.com/                                     1              Poor
9     http://www.mckinley.com/                                  2              Good
10    http://www.thesaurus.com/                                 2              Good
Total quality scores 17
Table 5.8 The top 10 URLs from the ranking system with query string “Internet”.
Rank  URL                             Quality Score  Status
1     http://www.internetnews.com/    2              Good
2     http://www.internet.com/        3              Excellent
3     http://dart.fine-art.com/       1              Poor
4     http://netserf.cua.edu/         2              Good
5     http://www.isoc.org/            3              Excellent
6     http://www.ietf.org/            2              Good
7     http://argos.evansville.edu/    2              Good
8     http://www.alexa.com/           2              Good
9     http://www.microsoft.com/       2              Good
10    http://www.demon.net/           3              Excellent
Total quality scores 22
Table 5.9 The top 10 URLs given by Excite with query string “Java”.
Rank  URL                                          Quality Score  Status
1     http://java.sun.com/                         3              Excellent
2     http://www.zdnet.com/devhead/filters/java/   3              Excellent
3     http://metalab.unc.edu/javafaq/              3              Excellent
4     http://javaboutique.internet.com/            3              Excellent
5     http://www.javareport.com/                   3              Excellent
6     http://java.wiwi.uni-frankfurt.de/           1              Poor
7     http://www.javalobby.org/                    2              Good
8     http://caffeinebuzz.com/                     1              Poor
9     http://freewarejava.com/                     3              Excellent
10    http://www.gamelan.com/                      2              Good
Total quality scores 24
Table 5.10 The top 10 URLs from the ranking system with query string “Java”.
Rank  URL                                                                        Quality Score  Status
1     http://www.javahow.to/                                                     3              Excellent
2     http://java.sun.com/docs/books/tutorial/                                   2              Good
3     http://www.jars.com/                                                       3              Excellent
4     http://www.developer.com/directories/pages/dir.java.html                   2              Good
5     http://www.yahoo.com/Computers_and_Internet/Programming_Languages/Java/    3              Excellent
6     http://www.objectspace.com/jgl/                                            1              Poor
7     http://www.javaworld.com/                                                  3              Excellent
8     http://java.sun.com/javaone/                                               2              Good
9     http://www.afu.com/javafaq.html                                            2              Good
10    http://java.sun.com/faqIndex.html                                          2              Good
Total quality scores 23
Figure 5.3 Quality comparison: sums of quality scores (y axis) for the query strings
“Internet” and “Java” (x axis), comparing Excite with the new ranking.

Figure 5.4 Good and excellent comparison: percentage of good and excellent URLs (y axis)
for the same query strings (x axis), comparing Excite with the new ranking.
5.2 Analysis of the Results
In this section, we studied the contents of Web pages and analyzed the hyperlink
information provided by the ranking system. This analysis helps to explain the advantages
of using hyperlink information. Since Alta Vista is a purely text-based search engine, we
chose to discuss the results of Alta Vista before and after applying the proposed system.
5.2.1 Removal of Poor Quality Web Pages
Table 5.11 lists the top 10 URLs searched by AltaVista. Table 5.12 lists the top 10
URLs after applying the ranking system. All the returned Web pages were search results
with respect to the query string “computer job”. The authority weights listed in the two
tables might explain why the rank of these URLs changed a lot after applying the ranking
system. By visiting the 4th URL, www.iteachnet.com/wwwboard/wwwboard.html, in
Table 5.11, it is easy to see why this URL was listed in the 4th position. Fig. 5.5 gives a
view of the Web page through a Web browser. Reading this Web page carefully, one can
quickly discover the reason why this URL appeared in such a high position in Table 5.11:
the excessive repetition of one of the keywords, “job”. Excessive repetition of one or
more keywords is a simple way for the author of a Web page to attempt to influence or
trick a search engine. Fig. 5.5 shows a good example of a page that has some relationship
with the query string but is a poor quality result. Such pages do appear in the data sets,
and this is as it should be, because the search engines' responses to them are of interest. In
the body of the Web page www.iteachnet.com/wwwboard/wwwboard.html, the word “job”
was repeated more than 100 times. This method of “cheating” always works if the search
Figure 5.5 Web page contents of www.iteachnet.com/wwwboard/wwwboard.html.
engine is text-based only. This is because a text-based search engine suffers from an
intrinsic weakness: it does not take into account the Web structure that the Web object is
part of. Text-based search engines look at a Web object and evaluate it as though it were
an isolated piece of text.
The power of the ranking system is that it makes use of the information provided
by hyperlinks. By carefully analyzing the Web structure, the capability of a Web page to
redirect the information flow via hyperlinks can be evaluated. The approach of
calculating the authority weights and hub weights used in the ranking system enables the
search engine to obtain more precise information about a Web page. By adding the Web
structure analysis, the rank system can rank search results according to not only the text
information included in a Web page, but also the potential ability of a user to gain further
relevant information with a browser, i.e., how much information one can obtain using this
page as a starting point from which to navigate the Web and how much information one
can explore by navigating to it from other Web pages. It is precisely this additional
information, gained from the hyperlinks, that enables the ranking system to overcome the
big problem of “search engine persuasion” (tuning pages to cheat a search engine into
awarding them a higher rank).
The 4th URL in Table 5.11 gained a low authority value during the ranking
procedure, and was removed from the top 10 URLs after applying the ranking system.
The improvement is obvious. Although the keyword “job” was repeated many times in
the text body of Web page www.iteachnet.com/wwwboard/wwwboard.html, this method
of cheating failed to mislead the ranking system, and the page moved to a lower position
because of its low authority weight.
5.2.2 Removal of Non-relevant Web Pages
Table 5.13 lists the top 10 URLs searched by Alta Vista, and Table 5.14 lists the
top 10 URLs after ranking. In this case, the user query string is “ballet”. Fig. 5.6 shows a
Web page, the 9th on the list in Table 5.13. Reading this page, one sees that it clearly fails
to provide any information about ballet; it is a non-relevant URL. The reason it was
listed in such a high position is its title, “Index of /~ballet”. Although the content
of this Web page has nothing to do with ballet and the author probably had no intention of
tricking a search engine, the word “ballet” embedded in its title happened to match the
keyword and resulted in a high position in the list. However, by analyzing hyperlink
information, the ranking system successfully detected this non-relevant page, which had
neither outward nor inward links, and placed it in a lower position (98th) in the new list.
5.2.3 Relationship Between Hubs and Authorities
So far, the comparison between the results given by Alta Vista and the new results
ranked by the ranking system have been based on authority values. In order to test the
validity of the contention that there is a mutually reinforcing relationship, i.e. that a good
hub is a page that points to many good authorities and a good authority is a page that is
pointed to by many good hubs, we analyzed the hyperlinks between authority and hub
pages of the rank results in this section.
Table 5.15 lists the old and new ranks of the top 5 authorities and top 5 hubs,
together with their weights. The query string is “ballet”. It is noticeable that all the
authority weights are very close, whereas the 1st hub's weight is dramatically higher than
the weights of all the other hubs. Fig. 5.7 shows the content of this Web page. It consists of a
Table 5.11 The top 10 URLs searched by Alta Vista and their authority weights for the
query string “Computer job”.
Rank  URL                                                        Authority
1     http://www.jobbankusa.com/www.html                         0.0065
2     http://www.dalmanjobs.com/                                 0
3     http://www.jobs.com/                                       0
4     http://www.iteachnet.com/wwwboard/wwwboard.html            0.0032
5     http://www.chisoft.com/                                    0
6     http://members.home.net/oacjlp/cfall1.htm                  0
7     http://www.siliconalleyjobs.com/                           0.016
8     http://www.n-s-i.net/fact.html                             0
9     http://hrea.org/lists/huridocs-tech/markup/msg00250.html   0
10    http://www.churchfriends.com/board/biz/messages/7.html     0
Table 5.12 The top 10 URLs from the ranking system and their old ranks from Alta
Vista for the query string “Computer job”.
New rank  URL                                       Authority  Old rank
1         http://www.itcomputerjobsearch.com/       484.9      128
2         http://www.athomebusinessportal.com/      364.3      114
3         http://www.stayathomework.com/            129.4      113
4         http://www.daytraderstocktrader.com/      17.6       28
5         http://houston.computerwork.com/          5.1        36
6         http://bayarea.computerwork.com/          5.1        39
7         http://philadelphia.computerwork.com/     5.1        46
8         http://twincities.computerwork.com/       5.1        56
9         http://www.virtualbusinesswebhost.com/    2.3        18
10        http://www.joltjobs.com/                  0.22       124
Figure 5.6 Web page contents of http://w3.one.net/~ballet/.
Table 5.13 The top 10 URLs searched by Alta Vista with query “ballet”.
Position  URL
1         http://www.ballet.co.uk/
2         http://www.ballet.org.uk/
3         http://www.pbt.org/
4         http://www.royal-ballet-school.org.uk/
5         http://www.nzballet.org.nz/
6         http://www.danvilleballet.org/
7         http://www.westchesterballet.org/
8         http://www.nycballet.com/
9         http://w3.one.net/~ballet/
10        http://www.theshopatvcb.com/
Table 5.14 The top 10 authorities ranked by the ranking system with query “ballet”.
Position  URL
1         http://www.kirovballet.com/
2         http://www.joffrey.com/
3         http://www.het-nationale-ballet.nl/
4         http://www.balletwest.org/
5         http://www.en-ballet.co.uk/
6         http://www.hamburgballett.de/
7         http://www.coloradoballet.org/
8         http://www.cincinnatiballet.com/
9         http://www.koninklijkballetvanvlaanderen.be/
10        http://www.balletaustin.org/
Table 5.15 The top 5 authorities and top 5 hubs and their ranks and weights for the
query “ballet”.
URL                                                  Type       New rank  Old rank  Weight
http://www.kirovballet.com/                          Authority  1         76        71.95
http://www.joffrey.com/                              Authority  2         81        70.82
http://www.het-nationale-ballet.nl/                  Authority  3         79        65.4
http://www.balletwest.org/                           Authority  4         35        62.67
http://www.en-ballet.co.uk/                          Authority  5         45        61.79
http://www.dancer.com/dance-links/ballet.htm         Hub        1         32        61.6
http://www.sccs.swarthmore.edu/~mack/ballet.html     Hub        2         108       9.2
http://www.edanz.com/ballet/                         Hub        3         60        9.16
http://www.sapphireswan.com/dance/links/ballet.htm   Hub        4         86        7.22
http://www.art4net.com/BALLET.html                   Hub        5         104       4.09
Figure 5.7 Web page contents of http://www.dancer.com/dance-links/ballet.htm.
collection of all the links which point to other Ballet Company Web sites from A to Z. Thus, it has a large number of out links. By carefully analyzing the in pages of the authorities and the out pages of the hubs in Table 5.15, we constructed a directed graph that helps explain how these URLs came to be ranked in high positions by the ranking system. Fig. 5.8 illustrates the relationship between the authorities and hubs in Table 5.15. In Fig. 5.8, each URL is represented by a circle, and the number in the circle is its rank. All the circles on the left side are hubs, and all the circles on the right side are authorities. Note that Fig. 5.8 shows only the edges among these URLs. Each hub may have outward links to Web pages that are not shown in the graph, and each authority may have inward links from Web pages that are not shown in the graph.
By reviewing the operations used to calculate the authority weights and hub weights shown in Fig. 4.2, we could predict that the 1st hub would gain a large weight at the first iteration, since it pointed to all the ballet company Web sites from A to Z. Its high value would then increase the authority weights of all the Web pages it pointed to. If those Web pages were also pointed to by other good hubs, their high authority weights would in turn increase the hub weights of the pages pointing to them in the next iteration, and this reinforcement repeated over further iterations. Finally, both the hub weights and the authority weights converged to fixed values. The ranking system then re-ranked all the URLs in a new sequence based on either authority or hub weight.
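This mutually reinforcing iteration can be illustrated with a short sketch. It is a simplified version of the computation shown in Fig. 4.2, not the project's actual implementation; the graph structure and page names used below are hypothetical.

```python
# A minimal sketch of the iterative hub/authority computation.
# graph[p] is the (hypothetical) set of pages that page p links to.

def rank(graph, iterations=50):
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of the pages pointing in.
        auth = {q: sum(hub[p] for p in pages if q in graph.get(p, ()))
                for q in pages}
        # Hub update: sum the authority weights of the pages pointed to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize so the weights converge to fixed values.
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {q: v / a_norm for q, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```

On a toy graph where one hub links to every authority, that hub and the authority with the most in-links from good hubs dominate after a few iterations, mirroring the behavior observed for the 1st hub above.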
Fig. 5.8 clearly shows that all the top 5 authorities were pointed to by the 1st hub, http://www.dancer.com/dance-links/ballet.htm. Since the 1st hub had a weight dramatically higher than that of any other hub, it contributed more weight than any other hub when calculating the weights of the authorities. This may explain why all of the top 5 authorities have weights close to the weight of the 1st hub. Hence, we conclude that the main reason the top 5 authorities moved from lower positions in the old list to higher positions in the new list is that all of them were pointed to by the 1st hub; consequently, any URL that was not pointed to by the 1st hub received a low authority weight and was prevented from occupying a top position. The final ordering of these top 5 authorities was determined by the weights of the other hubs pointing to them.
Figure 5.8 The mutually reinforcing relationship between hubs and authorities.
[Figure: a directed graph with hubs (circles numbered by hub rank) on the left and authorities (circles numbered by authority rank) on the right. Hubs: 1 http://www.dancer.com/dance-links/ballet.htm, 2 http://www.sccs.swarthmore.edu/~mack/ballet.html, 3 http://www.edanz.com/ballet/, 4 http://www.sapphireswan.com/dance/links/ballet.htm, 5 http://www.art4net.com/BALLET.html, 6 http://www-sci.lib.uci.edu/HSG/Ref4.html. Authorities: 1 http://www.kirovballet.com/, 2 http://www.joffrey.com/, 3 http://www.het-nationale-ballet.nl/, 4 http://www.balletwest.org/, 5 http://www.en-ballet.co.uk/.]
Chapter 6 Conclusions and Future Work
The new ranking algorithm applied in this project is a link-based approach to
WWW searches. It works on top of the existing text-based search engines and aims to
locate high-quality information related to a search topic on the World Wide Web, based
on a structural analysis of the link topology surrounding “authoritative” pages on the
query topic.
Our experiments used Alta Vista and Excite as the underlying text-based search engines for the implementation of the ranking system. Diverse query strings were entered as keywords to test the improvement in the quality of the returned results made by applying the ranking system. The system's ability to produce results of the highest possible quality, given the available WWW pages, was analyzed, and the comparison showed positive results.
For Alta Vista, which is a purely text-based search engine, the improvement
made by the ranking system is dramatic. The ranking system successfully deleted all the
dead links and most non-relevant WWW pages from the list returned. The percentage of
both good and excellent quality Web pages in the new list showed a major improvement.
For Excite, which is also a text-based search engine but one that pays more attention to Web page structure, the improvement made by the ranking system is more limited. This is because most of the results returned by Excite are of better quality than those of Alta Vista. However, the percentage of both good and excellent quality Web pages in the new list still showed some improvement.
In this project, the ranking algorithm based on the linkage structure transcended the limitations of traditional technology by exploiting the structure of "communities" of hubs and authorities on the WWW. Note that the iterative process of computing hub weights and authority weights ignores the text describing the topics. In some cases merely mining the linkage structure may not be good enough, so there is still room to enhance the performance of the ranking system by improving its focus on the topic. In particular, in an HTML file, the text around an href link to a page p is descriptive of the contents of p. It should be possible to introduce a new text-weighted process and incorporate this textual conferral of authority into the basic iterative process described previously. The idea is to assign to each link a positive numerical weight w(p, q) that increases with the amount of topic-related text in the vicinity of the href from page p to page q. The precise mechanism for this second weighting phase remains a challenge for future work.
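As a hedged sketch of how such a weight could enter the iteration, the authority update might become a weighted sum over incoming links. The function name and the per-link weight table below are illustrative assumptions, not part of the project.

```python
# Sketch of a text-weighted authority update step (hypothetical).
# w[(p, q)] is an assumed positive weight for the link p -> q,
# intended to grow with the amount of topic-related text near the href;
# links with no recorded weight default to 1.0 (the unweighted case).

def weighted_authority_update(graph, hub, w):
    """One step of auth(q) = sum over links p -> q of w(p, q) * hub(p)."""
    auth = {}
    for p, targets in graph.items():
        for q in targets:
            auth[q] = auth.get(q, 0.0) + w.get((p, q), 1.0) * hub[p]
    return auth
```

With all weights equal to 1.0 this reduces exactly to the unweighted authority update used in the basic iterative process, so the text-weighted variant generalizes rather than replaces it.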
References
1. M.R. Wulfekuhler, and W.F. Punch, “Finding salient features for personal web
page categories,” Computer Networks and ISDN Systems, vol.29, pp.1147-1156,
1997.
2. E. Spertus, “ParaSite: mining structural information on the Web,” Computer
Networks and ISDN Systems, vol. 29, pp.1205-1215, 1997.
3. M. Marchiori, “The quest for correct information on the Web: hyper search
engines,” Computer Networks and ISDN Systems, vol.29, pp. 1225-1235, 1997.
4. S. Brin, and L. Page, "The anatomy of a large-scale hypertextual Web search engine," Computer Networks and ISDN Systems, vol. 30, pp. 107-117, 1998.
5. L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking:
bringing order to the Web,” Manuscript in Progress,
http://google.stanford.edu/~backrub/pageranksub.ps.
6. S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J.
Kleinberg, “Automatic resource compilation by analyzing hyperlink structure and
associated text,” Computer Networks and ISDN Systems, vol. 30, pp. 65-74,
1998.
7. J.M. Kleinberg, "Authoritative sources in a hyperlinked environment," IBM Research Report RJ 10076, May 1997.
8. C.W. Groetsch, and J.T. King, “Matrix methods and applications: an introduction
to linear algebra,” Prentice-Hall, Inc. 1988.
9. G. Golub, and C.F. Van Loan, "Matrix computations," Johns Hopkins University Press, Baltimore, 1989.