
The Journal of Systems and Software 69 (2004) 195–206

www.elsevier.com/locate/jss

Brute force web search for wireless devices using mobile agents

Konstantinos G. Zerfiridis, Helen D. Karatza *

Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

Received 24 November 2001; received in revised form 19 March 2002; accepted 5 May 2002

Abstract

Web based search engines have been with us for a long time now. They have proved to be an irreplaceable tool for researchers and Internet users all over the world. The exponential growth of the Internet has disclosed great challenges to these engines, as it is hard to maintain an accurate database of numerous web pages over time. This problem becomes wearisome, as it is often necessary to browse through several results before locating a web page that matches the given query. As today's mobile devices connect to the Internet through high-cost, low-bandwidth wireless networks, this tactic can become very expensive. Motivated by these issues, we designed and implemented SearchSweep, a mobile agent based client-server system that uses existing search engine systems on the web to locate and download web pages. A refinement system on the server makes this solution ideal for mobile users, or users with limited bandwidth. The structure of the SearchSweep platform is presented and the use of mobile agents on wireless devices is proposed as a way of attacking their limitations.

© 2003 Elsevier Inc. All rights reserved.

Keywords: Mobile agents; Wireless networks; Document retrieval

1. Introduction

Ever since they were created, search engines have been used to locate URLs on the web according to given keywords. They have been a useful tool to researchers all over the world as the exponential growth of the Internet became overwhelming. But along with the evolution of the net, certain problems arise. Namely, a good percentage of the URLs returned from the search engines either no longer exist or no longer carry the required content. Searching the Internet can become a costly and time consuming task because most wireless devices connect to the Internet through a high-cost, low-bandwidth connection.

Typically, web based search engines (Kamei et al., 1997) use a web spider, which constantly collects new or updated web pages, and a database in which information about each web page (URL, words, links, date, etc.) is archived. Most of these search engines use a combination of keywords and boolean operators to locate the appropriate web pages in their database. This carries some drawbacks. For example, there is no way to specify the importance of one keyword over another using boolean logic. Furthermore, because of the heterogeneity of the syntax used in the web search engines, the same query may produce different results in different search engines. Therefore, the user is forced to be familiar with each one.

When looking for certain content on the Internet, it is often necessary to use a search engine. Today there is a variety of search engines on the net, each one utilizing different ways of indexing, retrieving and searching for content. However, the rapid evolution of the Internet renders a great percentage of a search engine's database invalid, as outdated web pages are often removed or changed. Additionally, the overwhelming expansion of the net forces them to index more content in less time. Therefore, increasing the size of the database often supersedes the need of keeping it updated. This creates a lot of problems, as most of the time it is necessary to go through many of the web pages that a search engine suggests, just to find out that most of them no longer carry the required content.

One solution is to refine the results by downloading the web pages and verifying the search query over each one. But to our knowledge there is no search engine that does that, because such a task would greatly diminish the bandwidth, the processing power, and therefore the quality of service of the search engines. There are currently many applications available (for example Copernic, BullsEye) that do this task. The downside is that these applications must run on the user's computer, causing in many cases resource depletion. A simple query search using these applications could cause enough traffic to clog a dial-up connection for several minutes. Thus, using such tools with wireless devices could become very expensive.

We propose the use of mobile agents for document retrieval as a middleware between the web search engines and the wireless users. We implemented SearchSweep, a client-server platform which utilizes mobile agents' inherent ability of state and code migration. An advantage of this approach is that the user does not need to be connected while the search is under way, and the total amount of communication is only a fraction of the alternative. To our knowledge nothing similar has ever been implemented.

The structure of this paper is as follows. Section 2 examines some of the techniques used by search engines, their advantages and disadvantages, and the limitations of wireless networks. In Section 3, basic concepts of mobile agents are briefly reviewed and the structure of SearchSweep is discussed. Section 4 shows the benefits of using such a platform in comparison to existing approaches by means of quantitative experiments. A qualitative report of the results is given in Section 5. Section 6 briefly conceptualizes on the advantages of using mobile agents on a wireless network and on a cluster of servers, and presents ongoing and future work. Finally, Section 7 summarizes the paper.

2. The problem

2.1. Search engines

To find out why search engines have such erratic behavior, the way they work should be examined. There are currently numerous engines, each designed for a specific purpose and each having its advantages and disadvantages. They utilize several technologies in order to locate the required content. As different engines use different strategies, their results often deviate widely. As a result, it is often necessary to use more than one in order to locate a satisfactory page.

An approach towards unifying several search engines into one are the meta-search engines (Chignell et al., 1999). These engines retrieve the results from several web based search engines when a single query is given. But this method has the disadvantage that it may return the same results over and over again from each one of the search engines. Even in the case where a refinement is done by the meta-search engine in order to prevent duplicate hits, the results may not contain the required content.

Many search engines organize the web pages in self-organizing maps (SOMs) (Ritter and Kohonen, 1989; Kohonen, 1998) in order to save space and therefore speed up the search process. SOMs are neural networks that are able to organize information according to relevance. This has the advantage that search engines can find web pages relevant to the subject of the given query, but on the downside, many of the results may not contain the required keywords.

Most search engines remove the articles, the prepositions, or even some clauses from the retrieved web pages before they are processed into the database, as their participation in a query is possibly insignificant. This has the disadvantage that an "exact phrase" search will not always match the given query.

A great number of search engines use suffix-stripping algorithms on the words of the retrieved web pages before processing them further. This is also known as stemming (Frakes and Baeza-Yates, 1992; Porter, 1980), and by using it, a smaller keyword dictionary is produced. This results in faster searching. An additional benefit of such algorithms is that the search engine may return more pages relevant to the query. But in the case that only pages containing the exact keyword are needed, this technique might produce unwanted results.
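To make the trade-off concrete, here is a deliberately naive suffix-stripping sketch in Java. Porter's algorithm (1980) applies many more staged rules and conditions; this toy version only illustrates the dictionary-shrinking effect, and the exact-keyword mismatch, described above:

public class NaiveStemmer {
    // Porter's algorithm uses staged rules with measure conditions; this toy
    // version only strips the first matching suffix from a fixed list.
    private static final String[] SUFFIXES = { "ing", "ed", "es", "ly", "s" };

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            // Strip the suffix, but keep a minimal stem length of three letters.
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        // "searching", "searched" and "searches" collapse to one dictionary
        // entry, "search"; a page containing only "searching" would then be
        // returned for the exact keyword "search" as well.
        System.out.println(stem("searching") + " " + stem("searched") + " " + stem("searches"));
    }
}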

Search engines are constantly retrieving content from new URLs because their aim is to include as much content as possible. Most of these engines implement algorithms for repeated retrievals of the same URL in order to determine how often it is being updated. This way, search engines can maintain a relatively current state of web pages that are updated every hour, every day, every month, or remain the same over time. As those algorithms cannot predict with accuracy when a page is going to be updated, there is always a possibility that the search engine may return addresses that no longer carry the required content. Furthermore, these algorithms cannot efficiently account for the removal of web pages that are not updated at all. That is because when the algorithm determines that a web page does not change over time, it gives a low priority to rechecking for updates at this address. Therefore, even if this page is removed from the server, the search engine will include it in the results. The "non-existing URL" problem can also occur if part of the network that connects the target server with the rest of the Internet is temporarily down.

Several web based search engines have emerged that can receive a query from a user in the form of a question, and derive URLs that possibly answer the user's question. Such search engines use sophisticated A.I. algorithms and other search engines, inheriting as a result their disadvantages.

Additionally, because of the quality-for-speed trade-offs often made, the results of web search engines can become highly unstable. Search engines have proved to have radical behavior with regard to their results. Depending on the number of matches, they can act quite differently.

When a query comprised of a single keyword was sent to seven of the most well known search engines, a great number of hits was produced. Each engine produced thousands of results. As shown in Fig. 1, by examining the first hundred results of each query (according to each engine's sorting), it is derived that the valid pages could amount in some cases to just 5% of the total pages examined. In the examination process, each URL was downloaded and labeled: (a) valid if it contained the requested keyword, (b) duplicate if it had the exact same content as another URL that was labeled valid, (c) invalid if the keyword did not appear in the URL's content, and (d) unreachable if the retriever was unable to download it.
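For illustration, the labelling rule can be stated compactly in code. This is our reconstruction of the procedure just described, not the original survey script, and PageFetcher is a hypothetical downloader interface:

import java.util.HashMap;
import java.util.Map;

public class ResultLabeler {
    enum Label { VALID, DUPLICATE, INVALID, UNREACHABLE }

    // Maps the content of pages already labelled VALID to their URL,
    // so later pages with identical content can be marked DUPLICATE.
    private final Map<String, String> seenValidPages = new HashMap<>();

    Label classify(String url, String keyword, PageFetcher fetcher) {
        String content = fetcher.download(url);          // null if the download failed
        if (content == null) return Label.UNREACHABLE;
        if (!content.contains(keyword)) return Label.INVALID;
        if (seenValidPages.containsKey(content)) return Label.DUPLICATE;
        seenValidPages.put(content, url);
        return Label.VALID;
    }

    // Hypothetical downloader interface; any HTTP client would do.
    interface PageFetcher {
        String download(String url);
    }
}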

Fig. 1. Amount and quality of results for queries that produce several results. (Bar chart: share of valid, invalid, duplicate and unreachable pages for Engines 1-7.)

Fig. 2. Amount and quality of results for queries with limited results. (Bar chart: valid, duplicate and unreachable hits for Engines 1-7.)

This was not the case when more complex queries were used (Fig. 2). For this experiment, queries of several irrelevant keywords were submitted in order to produce limited results. The same set of queries was used for all the engines. In this case, invalid pages were practically non-existent, but duplicate and unreachable pages accounted in some cases for half the hits.

Additionally, Selberg and Etzioni (2000) show that the results of the search engines may change significantly over time. It becomes therefore obvious that the complex algorithms used lead search engines to erratic behavior.

2.2. Wireless networks

During recent years, wireless communications have enjoyed thriving success. Today wireless telephones are widespread, helping people communicate more easily than ever. The ascending popularity of portable devices moved the attention from phones to communications in general, in order to have instant access to information. Modern wireless networks are designed to be highly flexible in order to be easily deployed and accessed from all over the world.

The success of PDAs and portable devices created the need for wireless access to the Internet. The popularity of GSM networks for wireless voice communication, used by today's cellular phones, made them the obvious solution. But with a top speed of 9600 bps, it is impractical and very expensive to use such networks for Internet browsing. However, this speed is adequate for sending and reading e-mail and viewing simple web pages. An implementation with these characteristics is the WAP browser found on modern cellular phones. Currently, a limited number of GSM network providers support speeds which in theory can reach up to 43.2 Kbps using HSCSD.

The successor of GSM is GPRS (Lin et al., 2001), the mobile phone system that has been commercially available since 2002. It utilizes various radio channel coding schemes to achieve raw radio link bit rates of 171 Kbps.

Another emerging wireless communication system is Bluetooth. It connects devices at speeds up to 1 Mbps at a maximum distance of 10 m. Its low cost and limited power consumption make it an ideal communication medium for wireless devices of any kind. Although its really short range makes it impossible for outdoor use, it can be used to provide access to the Internet with the help of other devices. Such is the case with the new cellular phones that support high-speed data and Bluetooth technology for Internet access from notebook computers. Similar is the case for HomeRF, at speeds of 1.6 Mbps (expected to reach 10 Mbps) and an effective range of 50 m. Bluetooth and HomeRF are frequency-hopping technologies that are ideal for streaming audio and data over home networks. They cannot really substitute for LANs, so they often coexist with other technologies such as 802.11b. Wireless networks based on the industry standard IEEE 802.11b support raw radio link bit rates of up to 11 Mbps, while IEEE 802.11a can provide 54 Mbps. These networks are designed to cover a relatively confined area. It should be noted that the actual throughput of such networks may vary widely depending on any number of variables. Transfer rates of 5 Mbps for 802.11b and 25 Mbps for 802.11a are quite common for distances of 100 m between the sender and the receiver.

Although there are currently many wireless local area network (LAN) and metropolitan area network (MAN) solutions providing a range of bandwidths, the most widespread use of digital wireless networks is expected to come from the GSM and GPRS networks. They provide wide area coverage and require relatively low power consumption. Their bandwidth is sufficient for a number of applications, so they are the obvious choice for the mobile user. Their major disadvantages are high cost of use and low transfer rates. Using such devices to locate content on the net can prove to be very expensive.

3. The solution

Software agents are programs that act on behalf of people. They are able to perform specified tasks that are assigned to them, and they can accomplish this with or without the supervision of the user, according to the requirements of the given job.

Mobile agents have an additional property (Chess et al., 1995a): the ability to transport themselves to different systems after they have started executing, carrying with them their program code, current state of execution and any data that they have obtained. This gives them the unique capacity of living on a distributed network rather than on a distant stationary system, and of taking advantage of the services that each host has to offer locally. Furthermore, mobile agents allow proprietary code to be used on the hosts, allowing complete customization of the retrieved results. The hosts should implement a specified environment that can authenticate the origin and credentials of the arriving mobile agents, provide for them the necessary execution machine and limit their access to system resources (Chess et al., 1995b).

The Aglets Workbench (IBM, 1997; Lange and Oshima, 1998) is a framework developed by IBM's Japan research group for programming and deploying mobile agents. "Aglets" are Java objects that can move from one system on the Internet to another autonomously. Although they can carry an itinerary, aglets can change it dynamically as they roam the Internet. They can transport themselves, spawn new aglets, interact with other aglets in the same or a distant context, or even clone themselves. Implemented in Java, they have inherited the ability to exist on heterogeneous networks. This makes them ideal for flexible client-server solutions, because in most cases clients are thinner than the servers. Such is the case with wireless devices.

Furthermore, the Aglet’s platform is implemented to

use OMG’s MASIF (Mobile Agent System Interoper-

ability Facility) interface, allowing them to interoperate

with different agent systems. Additionally, Aglets can

use IBM’s JKQML (Java Knowledge Query and Ma-nipulation Language) for developing intelligent mobile

agents.

Although the Aglet platform is not the most optimized for performance (Silva et al., 2000), it recently became an open source project when IBM released the source code. This makes it the most likely platform to be adopted by a great number of mobile devices. Furthermore, its evolution into a faster platform with even more features has already begun. The SearchSweep platform described in this paper is a versatile and expandable project; therefore it has a lot to benefit from the potential of this developing technology. SearchSweep makes use of the policy based security system of the Aglets platform and benefits from its widespread use on heterogeneous systems.

Fig. 3. Searcher's graphic interface.

The Aglets platform is written entirely in Java; therefore it can run on any system that incorporates the Java 2 Standard Edition. Currently there are many personal digital assistants that include the full version of J2SE. Additionally, a Java 2 Micro Edition version of Aglets is under development, as there are many mobile phones with embedded J2ME.

3.1. SearchSweep structure

SearchSweep is designed using mobile agent technology. This gives it a straightforward structure, which is easy to use and extend. From the developer's point of view, the major advantage that the mobile agent platform provides is the modularity of each subsystem. The messaging capabilities of agents, and their inherent ability to dispatch their code to a different system, make such solutions ideal for systems that need to be highly scalable.

Mobile agents as middleware can be used in a variety of ways. The SearchSweep platform uses the client-server approach. The client is responsible for instantiating a Searcher mobile agent. This agent has the inherent ability to utilize certain services if they are provided by the current context. Initially, the agent presumably exists on a mobile device with limited resources. In our implementation, the Searcher agent implements a simple graphical interface (Fig. 3). Through that, the user is able to configure the agent properly and to send it to a server.

As a client cannot always know what services are provided by the server, the Searcher agent can create a BotProbe agent, which is immediately dispatched to the server, where it queries for available services. Afterwards it sends a report back to the agent that created it and dies. This is depicted in Fig. 4. After the desired services are selected and the appropriate keywords are supplied, the Searcher agent is submitted to the server.
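The following is a minimal sketch of such a probe, using the standard Aglets lifecycle (onCreation, run, dispatch). The class name, message kinds and the context property are illustrative assumptions, not taken from the SearchSweep sources; the real system queries the Search Engine Registry agent for the service list:

import java.net.URL;
import com.ibm.aglet.Aglet;
import com.ibm.aglet.AgletProxy;
import com.ibm.aglet.Message;

public class BotProbe extends Aglet {
    private AgletProxy creator;         // proxy of the Searcher that created the probe
    private URL server;
    private boolean dispatched = false; // survives migration as agent state

    public void onCreation(Object init) {
        Object[] args = (Object[]) init;
        creator = (AgletProxy) args[0];
        server  = (URL) args[1];
    }

    public void run() {
        try {
            if (!dispatched) {
                dispatched = true;
                dispatch(server);       // weak migration: run() restarts at the server
            } else {
                // Now at the server: collect the advertised services (here read
                // from an assumed context property), report back and die.
                Object services = getAgletContext().getProperty("services");
                creator.sendMessage(new Message("serviceReport", services));
                dispose();
            }
        } catch (Exception e) {
            dispose();
        }
    }
}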

Fig. 4. The Searcher aglet creates and dispatches the BotProbe aglet to the server (1), where it locates the Search Engine Registry, retrieves the necessary information (2) and sends a message with that information back to the Searcher aglet (3).

The server should have appropriate resources and infrastructure to support multiple Searcher agents. In this case, a Manager agent is responsible for starting and maintaining a number of Retriever agents. Retrievers are the only agents that have access to the Internet, and their job is to download any given page from the web. The Manager, which is configurable through its own interface (Fig. 5), is able to create more Retrievers if there is high demand, or destroy some of them if they are no longer needed, in order to release system resources. Fig. 6 shows that the server's context contains five free Retrievers. The Manager agent also maintains a cache of the retrieved web pages on the local file system for future use. This proved to speed up the process of retrieving web pages when similar requests were made.
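A file-backed cache of this kind can be very small. The sketch below is our own minimal version, not the Manager's actual code; naming files by the URL's hash is an assumption, and hash collisions are ignored for brevity:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageCache {
    private final Path dir;

    public PageCache(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    // One file per URL, named by the URL's hash (collisions ignored here).
    private Path fileFor(String url) {
        return dir.resolve(Integer.toHexString(url.hashCode()) + ".html");
    }

    // Returns the cached page, or null on a cache miss.
    public String lookup(String url) throws IOException {
        Path f = fileFor(url);
        return Files.exists(f) ? Files.readString(f) : null;
    }

    public void store(String url, String content) throws IOException {
        Files.writeString(fileFor(url), content);
    }
}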

On the server side, a Search Engine Registry agent exists which is in charge of subscribing new SearchBot agents to the system. SearchBots are agents which have the necessary knowledge to access a web based search engine and parse its results. This approach gives the system the flexibility to accept new services on the fly. If a SearchBot is sent to the server that contains the Search Engine Registry agent, it will be verified and subscribed so that future Searcher agents can use it.

Fig. 5. The Manager can be monitored and configured via its own graphical interface.

When the Searcher agent reaches the server, it communicates with the SearchBots to negotiate a proper query for each of the search engines. Afterwards, it requests from the Manager to retrieve the results from the appropriate web sites. When the results arrive, the Searcher parses through them and eliminates the duplicate URLs. It sends the Manager a request to retrieve the results, and when all of them are available, it validates them by scanning through each retrieved web page to verify the existence (or absence) of the keywords (Fig. 7). The Searcher repeats this process as many times as necessary to meet the criteria given by the user.

Fig. 6. Aglet's context at the server.

Fig. 7. Verification process of the mobile Searcher at the server.

It then starts the refinement process. Initially the agent strips the unnecessary HTML tags and creates a signature file from the text of each downloaded page. These signature files are compared so that duplicate or almost identical pages can be located. In that case, only one is kept, and the rest of the URLs are marked as alternative sources of that page.
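The paper does not specify how the signature files are computed; one plausible sketch hashes the tag-stripped, whitespace-normalized text, so that pages differing only in markup or spacing compare as equal (detecting "almost identical" pages would need a fuzzier signature, such as shingling):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageSignature {
    public static String signature(String html) throws NoSuchAlgorithmException {
        String text = html.replaceAll("<[^>]*>", " ")   // strip HTML tags
                          .replaceAll("\\s+", " ")      // collapse whitespace
                          .trim()
                          .toLowerCase();
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(text.getBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));       // hex-encode the digest
        }
        return hex.toString();
    }
}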

the retrieved pages according to a marking scheme. The

marking scheme used by the Searcher takes into account

two parameters: how many times each keyword is found

in a page, and how many times this page is referenced by

the other pages that where downloaded by this Searcher.

These two parameters have equal weight. Thus, the

most referenced page with the least keyword appea-rance, scores the same as the page which has not

been referenced at all but has the most keyword ap-

pearance.
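Expressed in code, the scheme could look like the sketch below; the normalization by the per-result-set maxima is our assumption, as the paper only states that the two parameters weigh equally:

public class PageMarker {
    // keywordHits: occurrences of the query keywords in this page.
    // referencedBy: how many of the other downloaded pages link to it.
    public static double mark(int keywordHits, int referencedBy,
                              int maxKeywordHits, int maxReferencedBy) {
        double k = (maxKeywordHits == 0) ? 0 : (double) keywordHits / maxKeywordHits;
        double r = (maxReferencedBy == 0) ? 0 : (double) referencedBy / maxReferencedBy;
        // Equal weight: the most-referenced page without keyword hits and the
        // keyword-richest unreferenced page both score 0.5.
        return (k + r) / 2.0;
    }
}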

When the Searcher retrieves the requested number of relevant pages, it is able to automatically return to the device from which it was dispatched. This process is depicted in Fig. 8. Alternatively, it may stay dormant on the server until the user manually retracts it. This might prove useful for users who cannot stay on-line for the duration of this process.


Fig. 8. The Searcher aglet dispatches from the wireless device to the server (1), the server retrieves the requested results from the search engines (2), the Searcher requests from the server to download the appropriate web pages (3), they are evaluated, and the Searcher traverses to the wireless device with the results (4).

Fig. 9. SearchSweep Micro Browser.

Although the Searcher agent is carrying back the web addresses that meet the given criteria, it can also carry properly refined versions of the top rated web pages for viewing on mobile devices (Fig. 9). In that case, the use of compression decreases the amount of transferred data even further.
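The paper does not name the codec used; as an illustration, the standard java.util.zip classes are enough to pack a refined page before the agent carries it home:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class PagePacker {
    public static byte[] compress(String page) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(page.getBytes("UTF-8"));  // HTML text compresses well
        }
        return buf.toByteArray();              // typically a fraction of the original size
    }
}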

The SearchSweep platform uses the web based search engines, and therefore it is subject to their shortcomings. Nevertheless, as the Searcher agent always downloads and verifies each page before processing it any further, it avoids most of the common problems that search engines have. In essence, they are used only as a first step in a search, providing the system with possible hits. Pages that do not contain the given keywords but are included in a search engine's results, either because of the algorithm used (stemming, SOMs, article removal, speed-up tradeoffs) or because they are outdated, are removed by SearchSweep. Searchers can combine the power of several engines by querying more than one search engine at a time. The end user does not have to be familiar with the syntax used by each one, as the agent provides a unified GUI. Any duplicate pages are eliminated in the refinement process. By utilizing multiple specialized search engines, SearchSweep can target specific areas of interest (for example auctions) and minimize their often observed erratic behavior.

Because the SearchSweep platform is built on mobile agent technology, it is highly scalable. Fig. 10 shows that new services can be added to the server context in the form of agents. This is done easily as long as the programmer follows certain guidelines. The agent has to subscribe its services to the server's context, and supply information about the services that it can provide when asked. The structure of such an agent is shown below:

public class CompressionAgent extends Aglet {
    public void onCreation(Object init) {
        subscribeMessage("ServiceProviderAgent");
    }

    public boolean handleMessage(Message msg) {
        if (msg.sameKind("getSPProxy")) {
            msg.sendReply(getProxy());
            return true;
        }
        if (msg.sameKind("moreInfo")) {
            Object[] o;
            ...
            o[0] = "Compression";
            msg.sendReply(o);
            return true;
        }
        if (msg.sameKind("...")) {
            ...
        }
        ...
        return false;
    }
}

Fig. 10. Diagram of the server's context: the Manager, with its cache and Retriever registry, maintains Retrievers #1 to #n; the Search Engine Registry maintains SearchBots #1 to #n; additional service agents (Compression, RTF2TXT) and visiting Searchers share the same context.

4. Experimental results

The average amount of traffic that a refined search could produce can exceed 5 MB in a matter of seconds, depending on how many search engines are used. Although this kind of search returns results with the exact content that was requested, low-bandwidth users could not afford to use it. The SearchSweep platform uses a high-bandwidth server to retrieve the content, refine it and return to the user only the highest-ranking results.

In this section, experimental results of the SearchSweep platform are presented. The agent used requested up to 10 results from each of five search engines. After the refining process, it selected the five highest-ranking results, compressed them and returned to the user. The agent did not retrieve the images of the processed web pages, as this can be done directly by the user. The keywords used were chosen at random from a properly processed text file. The two main metrics measured were the traffic that was produced and the service time. The server used was a 500 MHz PC with 256 MB RAM and the client was a 266 MHz PC. The server had a T1 connection to the Internet, and the communication between the server and the client was restricted to 9600 bps. The measurements of the three experiments described below are summarized in the following table.

                              Idle server              Idle server              Saturated server
                              (no cache & no proxy)    (using cache & proxy)    (using cache & proxy)
Dispatch to server            5.46 s                   5.54 s                   6.01 s
Dispatch to server traffic    6120 bytes               6111 bytes               6134 bytes
Average delay in queue        0 s                      0 s                      2663.56 s
Mean queue length             0                        0                        199
Mean serve time               12.87 s                  12.23 s                  13.42 s
Server network traffic        740,311 bytes            727,470 bytes            723,157 bytes
Dispatch to client            40.23 s                  36.69 s                  41.21 s
Dispatch to client traffic    47,912 bytes             43,483 bytes             49,120 bytes
Total client traffic          54,032 bytes             49,594 bytes             55,254 bytes
Service time                  58.56 s                  54.47 s                  2724.20 s

4.1. Experiment #1

In this experiment the SearchSweep platform was tested without making use of the local cache or the proxy server. Fifty tests were conducted using randomly generated queries. In order to test the raw behavior of the platform, it was made certain that the server was available at the beginning of each test.

The results show that an average of 723 KB of data was retrieved in each test, while the total traffic between the client and the server was only 53 KB, almost 14 times less than the amount of traffic that the query produced at the server. The total service time was less than one minute. However, downloading 723 KB over a 9.6 Kbps connection would keep the user on-line for more than 10 min.
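For reference, the arithmetic behind this comparison: 723 KB is about 723 x 1024 x 8 = 5,922,816 bits, and 5,922,816 bits / 9600 bps is about 617 s, i.e. roughly 10.3 min, against a total service time of under one minute through the server.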

This experiment shows the definite advantage of the SearchSweep platform for wireless devices with limited bandwidth. Although our assumption that the server is always free is not realistic, the results show that the platform works and has practical value on high-priced wireless connections.

4.2. Experiment #2

For this experiment SearchSweep’s cache and the

LAN’s proxy server were used. Again, all the tests

conducted on the server when it was free. This experi-

ment provided us with information on the usefulness of

the local cache and speed-up or slowdown of the serverdue to the use of the proxy.

After 50 tests, the cache file reached 40 MB. A point that should be noted here is that although there were 1739 new entries to the cache file, there was not even one hit. Therefore, we came to the conclusion that the use of a local cache is not necessary and that it even slowed down the server in some cases. The use of the LAN's proxy did not seem to slow down the performance of the platform. The response time of the platform in this set of tests was slightly faster, but this can be attributed to the fact that the total amount of data retrieved was less than the amount in the first experiment.

4.3. Experiment #3

For the final experiment it was made sure that the server was busy all the time. It was found that the current Aglet platform responded adequately when up to 200 agents were waiting idle while the rest of the system worked to serve the agents that came in first. However, there was a significant slowdown after an undefined threshold. The system became unstable, and a reliable server with more than 400 agents could not be maintained.

To test the system when it had a significant number of agents waiting to be served, the queue was limited to 200 entries and populated with agents. Then we started sending a new agent whenever an agent returned. We started taking measurements when the last agent of the initial 200 returned. This was done in order to take an accurate measurement of the average delay in queue.

The results show that on a busy server, the SearchSweep platform can be up to 5 times slower than the alternative. Downloading 706 KB over a 9600 bps connection can take 10 min, and in this experiment the average wait in queue was 44.39 min. Nevertheless, mobile agents are designed for disconnected operation. Therefore a user can submit an agent, disconnect, and reconnect later to retrieve it, reducing the total on-line time to the level of the first experiment. Additionally, modern wireless networks like GPRS charge for the content transferred and not for the on-line time. The total traffic in this experiment between the client and the server was 15 times smaller than the network traffic produced by the server.

5. Qualitative report

Although the SearchSweep platform was designed to relieve mobile users of the painstaking and expensive task of searching through several web pages to find the required content, certain intelligence was built into the system. As described previously, the dispatched agent can evaluate each page locally at the server and carry back only a selected number of results. The mobile agent may use any number of algorithms on the set of pages retrieved by the server as an evaluation of their quality. In order to rate the quality of the results returned by the Searcher agent, we conducted a number of experiments.

We used four queries that produced different amounts of results on the web search engines. This was done so that we could get a clear picture of how the platform behaves under different circumstances. The first query retrieved pages about finding accommodation in Athens, and produced the most hits. The second query aimed at finding web pages which contained drivers for an obsolete video card, and the third was about finding reports on the adaptation to the new euro currency. The fourth query, which produced the fewest hits, retrieved pages about a very specific custom made cell phone accessory.

Additionally, we varied the depth of the search by increasing each time the number of engines used and the number of hits that the system requested from each one. Although the Searcher agent returned a set of URLs, we only evaluated the first five pages that the agent compressed and brought back. If a page was irrelevant to the subject of the given query, it was given no marks at all. The top score for each web page was 20, and therefore in theory a query could score up to 100. The results were judged according to the quantity, quality and relevance of the information that they contained.

           Hits per engine (engines used)           Average hits
           10 (3)       30 (5)       100 (7)
Query 1    22           57           75              43,000
Query 2    31           46           63              5000
Query 3    25           64           60              1000
Query 4    54           85           85              200

The test results show that as the depth of the search increases, the agent returns results that more closely match our criteria. The only exception was the third query, where the last test scored poorly in comparison to the second test. We traced this exception and found that the "most-referenced" scheme used by the agent was responsible. Although this scheme had no effect on the third test of the last query either, it produced favorable results in all tests of the first two queries.

Overall, the agent returned at least one page containing sufficient information about the requested query in each case. It should be noted that the third set of tests was time consuming. In order to prevent flooding the server with such queries, it is recommended not to request more than thirty hits per engine, as the results of the second set of tests were satisfactory in all cases.

public class SearcherExample extends Searcher {
    public void setResults(Vector results) {
        // Called before preparing the Vector for dispatching
    }
    public Vector getResults() {
        ...
    }
    public void setSearchParameters(String[] shouldIncludePhrase,
            String[] shouldNotIncludePhrase, ..., Integer pages) {
        ...
    }
    public boolean Verify(String s) {
        // Verifies that s fulfills the search parameters
    }
    public void dispatchToManager() {
        // Called before the agent is dispatched to the manager
    }
    public void dispatchToHome() {
        // Called before the agent is dispatched back to home
    }
    public void showGUI() {
        // Creates a transient interface
    }
    public void arrivedHome() {
        // Called when the agent arrives to the device that dispatched it
    }
}

Fig. 11. Example of creating a new mobile SearchSweep searcher, by inheriting from Searcher.

6. Future work

Because the SearchSweep platform was implemented using mobile agents, it is highly scalable. The current version runs the server on one machine. But agents have a number of properties that make them ideal for balancing the load of an application across several servers. Thus, when for example the Manager agent reaches a certain threshold of requests waiting in the queue, it can clone itself and send its clone to an available server to continue in parallel; a sketch of this pattern is given below. A version of SearchSweep implemented on a heterogeneous cluster of eight interconnected servers is under way. It uses Epoch Load Sharing (Karatza and Hilzer, 2001) with a small epoch size, as the agents' execution times are not known in advance. This is expected to alleviate to a certain degree the scalability issues raised in Section 4. We are also working towards creating a version that will be able to handle disconnected operation, increased queue length and smaller service time.
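A minimal sketch of this cloning pattern in the Aglets API follows; the threshold value, the field names and the queue accessor are illustrative assumptions, not taken from the SearchSweep sources:

import java.net.URL;
import com.ibm.aglet.Aglet;

public class BalancingManager extends Aglet {
    private static final int QUEUE_THRESHOLD = 200; // cf. Experiment #3
    private boolean justCloned = false;             // copied into the clone's state
    private URL backupServer;                       // picked by the load-sharing policy

    public void run() {
        if (justCloned) {
            // We are the clone: move to the backup server and serve from there.
            try {
                justCloned = false;
                dispatch(backupServer);
            } catch (Exception e) {
                dispose();
            }
            return;
        }
        serveRequests();
    }

    private void serveRequests() {
        // Main service loop elided; when the queue of waiting Searchers grows
        // past the threshold, spawn a clone to share the load.
        if (queueLength() > QUEUE_THRESHOLD && backupServer != null) {
            try {
                justCloned = true;   // the clone sees this flag set...
                clone();             // ...and its run() dispatches it away
            } catch (CloneNotSupportedException e) {
                // cloning refused; keep serving locally
            } finally {
                justCloned = false;  // the original continues serving here
            }
        }
    }

    private int queueLength() { return 0; } // placeholder for the real queue
}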

Systems such as SearchSweep are ideal for Internet Service Providers that give access to users with wireless devices. An extension to the platform which logs resource usage for each user's agent could easily be implemented. This way a billing system can be installed and the platform could be used commercially. Vendors could also design proprietary Searcher agents, as shown in Fig. 11, and implement custom functions such as charging techniques.


public class ExampleBot extends SearchBot {
    public String getSearchURL(String[] shouldIncludePhrase,
            String[] shouldNotIncludePhrase, Integer pages) {
        // Creates a proper query URL, understood by
        // this search engine
    }

    public Object[] parseResults(String wpage) {
        Boolean morePages = Boolean.FALSE; // set while parsing
        Vector URLs = new Vector();
        // Given the html source of a web page, it
        // determines which URLs are the results, and if
        // there are more results available
        Object[] r = { URLs, morePages };
        return r;
    }
}

Fig. 12. Example of creating a Search Engine Broker, by inheriting from SearchBot.

The SearchSweep platform can be used in the emerging mobile e-commerce (or m-commerce) market to provide highly accurate content to wireless devices. We believe that it can greatly assist the development of more mobile agent solutions for mobile users. SearchSweep is designed for mobile devices, as they suffer from limited network bandwidth and processing power, but in order to be used in m-commerce some security issues must be resolved.

As shown in Fig. 12, new SearchBots can easily be implemented and incorporated into the SearchSweep platform at runtime. A future addition to the platform would be to create communities of SearchBots according to their relevance. For example, SearchBots that represent auction search engines could be grouped together to provide unified services. Moreover, the Searcher agent is a Java class which can easily be overridden in order to provide more sophisticated search patterns. For example, another version of the Searcher could be implemented which would rate each result according to where certain words appear in a page.

7. Conclusions

There are many professions that would benefit from wireless networks, as LANs are restricting because of their physical infrastructure. Many of the inherent disadvantages of such networks can be treated with the use of mobile agents. In a nutshell, our idea was to use mobile agents in a wireless network between many clients and a server in order to take advantage of the ability of the mobile agents to operate asynchronously and independently of the process that created them on the client computer. We showed that mobile agent technology can be very beneficial to wireless devices, because they can dispatch to a server any complex task that needs resources which are either limited or unavailable on the device. With these in mind we designed and implemented SearchSweep, a search refining platform which utilizes the high bandwidth provided by a server. The inherent advantages of using mobile agents make the SearchSweep platform highly scalable.

References

Chignell, M., Gwizdka, J., Bodner, R., 1999. Discriminating meta-search: a framework for evaluation. Information Processing & Management 35 (3), 337-362.

Chess, D., Grosof, B., Harrison, C., Levine, D., Parris, C., Tsudik, G., 1995a. Itinerant agents for mobile computing. Journal of IEEE Personal Communications 2 (5).

Chess, D., Harrison, C., Kershenbaum, A., 1995b. Mobile Agents: Are They a Good Idea? IBM Research Division, research report.

Frakes, W.B., Baeza-Yates, R., 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc.

IBM Tokyo Research Labs, 1997. The Aglets Workbench: Programming Mobile Agents in Java. http://www.trl.ibm.co.jp/aglets/.

Kamei, S., Kawano, H., Hasegawa, T., 1997. Effectiveness of cooperative resource collecting robots for web search engines. In: Proceedings of the IEEE Pacific RIM Conference on Communications, Computers, and Signal Processing, pp. 410-413.

Karatza, H., Hilzer, R., 2001. Epoch load sharing in a network of workstations. In: Proceedings of the 34th Annual Simulation Symposium. IEEE Computer Society, Seattle, WA.

Kohonen, T., 1998. Self-organization of very large document collections: state of the art. In: Proceedings of the 8th International Conference on Artificial Neural Networks. Springer, pp. 65-74.

Lange, D., Oshima, M., 1998. Programming and Deploying Java Mobile Agents with Aglets. Addison-Wesley.

Lin, B., Rao, H., Chlamtac, I., 2001. General Packet Radio Service (GPRS): architecture, interfaces, and deployment. Wireless Communications and Mobile Computing 1 (1), 77-92.

Porter, M., 1980. An algorithm for suffix stripping. Program 14 (3), 130-137.

Ritter, H., Kohonen, T., 1989. Self-organizing semantic maps. Biological Cybernetics 61, 241-254.

Selberg, E., Etzioni, O., 2000. On the instability of Web search engines. In: RIAO'2000 Content-Based Multimedia Information Access. Collège de France, Paris, France.

Silva, L., Soares, G., Martins, P., Batista, V., Santos, L., 2000. Comparing the performance of mobile agent systems: a study of benchmarking. Journal of Computer Communications 23 (8).

Konstantinos G. Zerfiridis received his Diploma degree in Mathematics in June 1998 at the Aristotle University of Thessaloniki. In 1999 he received his M.Sc. degree in computer science from the University of Edinburgh. He is currently a researcher and working towards a Ph.D. at the Aristotle University of Thessaloniki. His research interests are mobile computing, mobile agents and distributed systems. His email and web address are <[email protected]> and <agent.csd.auth.gr/~zerf>.

Helen D. Karatza is an Associate Professor at the Department of Informatics at the Aristotle University of Thessaloniki, Greece. Her research interests mainly include performance evaluation of parallel and distributed systems, multiprocessor scheduling and simulation. Her email and web address are <[email protected]> and <agent.csd.auth.gr/~karatza>.