
Live WWW and its Agents: the CAS-LWWW Project

Yiming Ye John K. Tsotsos

Department of Computer Science

University of Toronto

Toronto Ontario

Canada M5S 1A4

Karen Bennet

IBM Centre for Advanced Studies

Stn. 2G, Dept. 894

1150 Eglinton Avenue East

North York, Ontario

Canada M3C 1H7

Abstract

This paper proposes the idea of vision agents over the Internet, outlines the performance models of Live WWW with agents, and describes an object search agent and its communications with other agents. The goal is to reduce the traffic on the Internet for Live WWW by integrating WWW technology, agent theory, and computer vision technology.

1 Introduction

The most significant recent developments enabling the creation of global information resources and world-wide computer-based interpersonal communications are the incredible growth of the Internet and the success of the World Wide Web (WWW). The Web implements a global hypermedia that ultimately could incorporate much of the world's knowledge that exists in tangible form. The Net implements a global communications system that ultimately could facilitate dialogue and interaction among many of the world's people.

The sheer joy of browsing the world and finding the right nugget of information in the form of hypermedia has attracted researchers from many fields to the problems of how to solve the core technical issues posed by the growth of global networks, how to improve the current WWW and Internet technology, and how to predict or influence what the Internet or intranet will look like.

WWW technology has advanced so rapidly that even graphics, audio, and video can be integrated into the Web. Most previous research has concentrated on handling pre-recorded data on the WWW. Recently, efforts have been made to make the WWW live. One spawn of these efforts is the MBONE (Multicast Backbone) [12] technology: an interconnected set of routers and subnets that provides IP multicast delivery in the Internet. The MBONE is a virtual network. It is layered on top of portions of the physical Internet to support routing of IP multicast packets, since that function has not yet been integrated into many production routers. The network is composed of islands that can directly support IP multicast, such as multicast LANs like Ethernet, linked by virtual point-to-point links called "tunnels". The tunnel endpoints are typically workstation-class machines with operating system support for IP multicast, running the "mrouted" multicast routing daemon. In the IP multicast tunnels, IP multicast packets are encapsulated for transmission, so that they look like normal unicast datagrams to intervening routers and subnets. A multicast router that wants to send a multicast packet across a tunnel prepends another IP header, sets the destination address in the new header to the unicast address of the multicast router at the other end of the tunnel, and sets the IP protocol field in the new header to indicate that the next protocol is IP. The multicast router at the other end of the tunnel receives the packet, strips off the encapsulating IP header, and forwards the packet as appropriate. Driven by the availability of popular new multicast applications, particularly those providing real-time audio and video teleconferencing capabilities to Internet hosts, the MBONE has grown exponentially.

Another effort to make the WWW live is represented by the WWW browser Vosaic, or Video Mosaic, which extends the architecture of the WWW to encompass the dynamic, real-time information space of video and audio [2]. In Vosaic, video and audio transfers occur in real time; there is no file retrieval latency. The video and audio result in compelling Web pages. Real-time video and audio data can be effectively served over the present-day Internet with the proper transmission protocol. A real-time protocol, VDP, is used to handle real-time video over the WWW. VDP reduces inter-frame jitter and dynamically adapts to the client CPU load and network congestion. Vosaic dynamically changes transfer protocols, adapting to the request stream and the meta-information in requested documents.

Bellcore [17] also proposes real-time data services for the Web, RAVE. RAVE supports live and stored video and audio as well as less typical real-time services such as information services (e.g., news or stock market feeds). The system is extensible so that new real-time services can easily be added. It supports unicast and multicast, as well as both real-time data sources and sinks. RAVE integrates well into the Web, so that many multimedia applications may be written using simple HTML extensions. When acting as a live-data source, the RAVE server can connect to hardware input devices (e.g., a video capture device, a sound card, or a serial interface connected to an external text feed). The real-time data is packaged in a uniform format and sent to clients over the network.

Right now, various Internet sites offer the possibility to view an image captured by a computer, with refresh rates varying from one frame every few seconds to one frame every four hours. In net lingo these images came to be known as "live", in contrast with the rest of the usually static sites. The MCS-WebCam might be one of the most popular sites [16]. The WebCam is an easy-to-install real-time video server that allows transmission of live video and audio to a number of remote clients. Rather than go down the beaten path of plug-ins, the WebCam server uses established standards such as server-push and JPEG, and it works without any change to the configuration of the end-user's machine.
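The tunnel encapsulation described above can be made concrete with a short sketch. The following is a hypothetical illustration (not taken from the mrouted source) of what an MBONE tunnel endpoint does: it prepends an outer IPv4 header whose protocol field is 4 (IP-in-IP) and whose destination is the unicast address of the router at the far end of the tunnel, so intervening routers treat the datagram as ordinary unicast traffic:

```python
import struct

def encapsulate_ip_in_ip(inner_packet: bytes, src_ip: str, dst_ip: str,
                         ttl: int = 64) -> bytes:
    """Prepend an outer IPv4 header (protocol 4 = IP-in-IP) to a packet.

    A sketch of MBONE-style tunnel encapsulation: the outer destination
    is the unicast address of the multicast router at the other end of
    the tunnel.
    """
    def ip_to_bytes(ip):
        return bytes(int(part) for part in ip.split("."))

    version_ihl = (4 << 4) | 5          # IPv4, 5 x 32-bit words = 20-byte header
    total_length = 20 + len(inner_packet)
    header_wo_checksum = struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, total_length,
        0, 0,                            # identification, flags/fragment offset
        ttl, 4,                          # protocol number 4 = encapsulated IP
        0,                               # checksum placeholder
        ip_to_bytes(src_ip), ip_to_bytes(dst_ip),
    )
    # Standard one's-complement checksum over the header's 16-bit words.
    total = sum(struct.unpack("!10H", header_wo_checksum))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    checksum = ~total & 0xFFFF
    header = (header_wo_checksum[:10] + struct.pack("!H", checksum)
              + header_wo_checksum[12:])
    return header + inner_packet
```

The router at the far end simply strips the first 20 bytes and forwards the recovered multicast packet.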

The live WWW opens many exciting application areas for the WWW. Several problems, however, make the real-time transmission of video data from the server site to the client site unappealing, such as the intense use of bandwidth, the poor quality of the video image, and the high price. When the modem at the client site is too slow, real-time video transmission simply does not work.

The goal of the CAS-LWWW (Centre for Advanced Studies-Live World Wide Web) project is to solve the above problems and reduce Internet traffic by integrating computer vision technology, agent theory, and WWW technology. Instead of frequently transmitting the image data grabbed from the camera to the client and letting the client analyze the video image, we simply build agents at the server site and ask the agents to help execute the tasks required by the clients. The image grabbed from the camera is first pre-processed by the agents, and only those images that are of interest to the client are sent through the Internet. Thus the bandwidth used is greatly reduced. This strategy can be applied to many areas, such as security surveillance. It is even more interesting when the camera needs to be controlled in order to perform a certain task, such as searching for an object at the server's site. The client does not need to interactively control the camera and fetch the video data over the Internet again and again. The search agent will control the camera and perform the image processing operations at the server's site, and only the most promising images that might contain the target are transmitted to the client over the Internet.

2 The High Bandwidth Requirement for Live WWW

It can be predicted that as more and more people begin to use the live WWW, the WWW will burst at the seams due to the lack of bandwidth from servers to clients. To make video over the Internet appear "live", high bandwidth is required. A large percentage of web clients, however, run over low-speed 28.8 or 14.4 kbps modems, and users are increasingly considering wireless services such as cellular modems at 4800-9600 baud. A recent study by a popular server of shareware, Jumbo, revealed that about 1 in 5 users were connecting with graphics turned off, to eliminate the annoying latency of loading web pages [6], let alone pages that include live video. To make live video over the Internet popular, a study of how to reduce the bandwidth used by the live WWW is required.

A natural step in reducing the bandwidth used is to reduce the amount of data that has to be transmitted over the Internet. Thus image compression, that is, removing the data redundancy, code redundancy, inter-pixel redundancy, and psycho-visual redundancy in an image before it is transmitted, is a necessary step in the live WWW. Various compression standards, most of which are based on JPEG [11] and MPEG [7], have been used for the live WWW and other multi-media WWW. JPEG offers four modes of operation: lossless encoding, sequential encoding, progressive encoding, and hierarchical encoding. The basic JPEG algorithm first represents the original image in the frequency domain and then achieves data compression by concentrating most of the signal power in the lower spatial frequencies and reducing to zero the high frequencies with small coefficients during the quantization step. Progressive JPEG encodes the image in multiple coarse-to-detailed passes; thus when displayed, a blurry image first appears and is refined as more image data arrives. Hierarchical JPEG encodes the image at multiple resolutions, so that lower-resolution versions may be accessed without first having to decompress the image at its full resolution. MPEG is based on two basic techniques. One is block-based motion compensation for the reduction of temporal redundancies, which is achieved by motion prediction and motion interpolation, with the assumption that the blocks of the current picture can be modeled as a translation of blocks of some previous picture or a combination of references to past and future pictures. The other is Discrete Cosine Transform based compression for the reduction of spatial redundancies.

The sizes of images or video sequences compressed by JPEG and MPEG can be greatly reduced. JPEG achieves a 15:1 average compression ratio, and MPEG achieves compression ratios up to 200:1 on average by storing only the difference between successive frames. In spite of the high compression ratios achieved by JPEG and MPEG, the amount of data that needs to be transmitted over the Internet is still large for live video. Thus, in order to make the site appear "live", many live WWW sites have to reduce the transmission rate. For example, Telepresence Systems, Inc. [9] is working on the ProRata system, which provides groups of geographically separated people with a sense of shared presence. It takes video snapshots of participating individuals every 5 minutes, merging the pictures into a composite and then sending the composite image back to each person in the group via the Internet.

None of the above approaches for reducing the bandwidth used over the Internet involves intelligent agents. There is no consideration of the browser's requirements at browsing time. All the images are processed in a fixed mode, with no consideration of the perception ability of the browser or the usefulness of the image to be transmitted. We propose the idea of reducing the data by using vision agents to intelligently select and transmit only the images that are interesting to the browser, and to compress the original image according to the perception abilities of the browser.


3 Vision Agents Over the Internet

The concept of an agent has become important in user interfaces, artificial intelligence, and mainstream computer science. Research on and discussion about agents have mushroomed in the past few years. Although it is not easy to provide a universally accepted definition, we can say that an agent is a hardware- or software-based computer system that has some of the following properties:

• Awareness: agents have knowledge about themselves and the world;

• Autonomy: agents operate without the direct intervention of humans or others, and have some kind of control over their actions and internal state;

• Social ability: agents can interact with other agents (and possibly humans) via certain communication methods;

• Reactivity and knowledge adaptation: agents can perceive their environment (which can be the physical world, a user via a graphical user interface, a collection of other agents, the Internet, or perhaps all of them combined) and respond in a timely fashion to changes that occur in it;

• Pro-activity: agents do not simply respond to their environment; they can exhibit goal-directed behavior by initiating actions;

• Trust: agents usually do what they are supposed to do according to their perception of the task and the world situation.

An Internet agent refers to an agent that acts on the browser's behalf [5] [4] [19]. The most popular agents on the Web are indexing agents such as Lycos, the WebCrawler, and InfoSeek. Indexing agents carry out a massive, autonomous search of the WWW and store an index of words in document titles and document texts. The user can then query the agent by asking for documents containing certain key words. The indexing agents operate by suggesting locations on the Web to the user based on a relatively weak model of what the user wants and what information is available at the suggested location. The Internet Softbot, however, represents a more ambitious attempt to both determine what the user wants and understand the contents of information services in order to provide better responses to information requests.

Live WWW agents are a new kind of agent that aims at reducing the bandwidth used by Web pages that contain live video. They act as intelligent assistants between the camera and the server, or between other agents, to perform different kinds of image processing, vision, or robotics tasks required by the remote browser. The general system architecture for Live WWW with agents (LWWW-A) is shown in Figure 1. The LWWW-A is basically a traditional WWW architecture plus a set of agents that connect the camera and the server.

[Figure 1 diagram: Browsers connected through the WWW to Servers, with the LWWW Agents behind the servers.]

Figure 1: The LWWW system architecture. It is basically a traditional WWW architecture plus a set of agents connecting the camera and the server to perform tasks required by the browser.


4 Performance Modeling

4.1 LWWW With Passive Camera

LWWW with a passive camera means that the zoom, pan, and tilt of the camera at the server's site are fixed. So, when the vision agent is not used, the camera grabs images at a certain rate, and the grabbed images are transmitted through the Internet. In many situations, vision agents can be used to reduce the bandwidth used. For example, suppose a security person is responsible for the security of a warehouse. In the warehouse, there are several cameras fixed in various places such that any part of the warehouse can be examined by at least one of the cameras. The images from the cameras are transmitted through the Internet to a LWWW site that is used by the security person to check the state of the warehouse. Since the number of images displayed at the Web site is large (equal to the number of cameras within the warehouse), the bandwidth used by this LWWW site is also large. The security person has to check every image on the computer screen, so he might easily get tired. Vision agents can be used to reduce the bandwidth used and to make the task of the security person easier. Instead of immediately sending the image grabbed by a camera to the browser, a change detection agent is employed to analyze the image first; only those images that have enough changes are sent through the Internet and displayed at the client site. The number of images transmitted is decreased, and thus the bandwidth used is also decreased. The security person's task is also easier, because he has fewer images to check. Another example is to reduce the size of the image at the server's site by using vision agents that can act according to the browser's requirements and perception ability. For example, a browser visiting a LWWW page is first asked several questions about his understanding of image processing technology. If the browser is an image processing expert and has a deep understanding of line detection technology, then the images grabbed by the camera can first be processed by a line detector, and only the resulting images, which contain only lines or curves, are transmitted through the Internet. Because the compressed line images take very little space, the bandwidth used is greatly reduced. In this case, we can even increase the frame rate so that the browser sees a "true" live video without feeling the interval between the frames.
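The change detection agent described above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function names and the 5% change threshold are our own assumptions, not part of the CAS-LWWW system): the agent compares each grabbed frame with the last transmitted one and forwards it only if enough pixels have changed.

```python
def changed_fraction(frame_a, frame_b, pixel_threshold=20):
    """Fraction of pixels whose grey-level difference exceeds pixel_threshold.

    Frames are flat lists of 0-255 grey values of equal length.
    """
    changed = sum(
        1 for a, b in zip(frame_a, frame_b) if abs(a - b) > pixel_threshold
    )
    return changed / len(frame_a)

def change_detection_agent(frames, change_threshold=0.05):
    """Yield only the frames worth transmitting (assumed 5% change threshold)."""
    last_sent = None
    for frame in frames:
        if last_sent is None or changed_fraction(frame, last_sent) > change_threshold:
            last_sent = frame
            yield frame

# A static scene followed by a sudden change: only two frames are transmitted.
static = [10] * 100
moved = [10] * 50 + [200] * 50
sent = list(change_detection_agent([static, static, static, moved]))
```

With three identical frames followed by one changed frame, only the first frame and the changed one cross the network; the two repeats are filtered out at the server site.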

4.1.1 The Waiting Time for a Single Image

For Live WWW, we define the waiting time for a single image frame as the time between the submission of the viewer's request and the display update resulting from the request. When agents are not used, the waiting time Twait for a single image frame is given by

Twait = Tsetup + Tgrab + Tprop + Ttrans + TcomDecompre + Tdisplay    (1)

where Tsetup is the network connection setup time (the overhead in setting up the connection between the source and the destination computer); Tgrab is the image grabbing time for the camera; Tprop is the propagation delay of the connection (the difference between the time the last bit is transmitted at the head node of the link and the time the last bit is received at the tail node); Ttrans is the network transmission time for the compressed image (the time between transmission of the first bit of the first packet and the last bit of the last packet of the communication packets that form the image object); TcomDecompre is the compression and decompression time (the time needed to compress the image before it is transmitted at the server site plus the time needed to decompress the image before it is displayed at the browser's site); and Tdisplay is the time needed to display the decompressed image at the browser's site.

When agents are used, the waiting time is given by

TwaitA = TsetupA + TgrabA + TpropA + TtransA + Tagent + TcomDecompreA + TdisplayA    (2)


where TsetupA is the network connection setup time; TgrabA is the image grabbing time; TpropA is the propagation delay of the connection; Tagent is the agent operation time (the time for the agents to process the grabbed image); TtransA is the network transmission time for the image after it has been processed by the agents and compressed; TcomDecompreA is the compression and decompression time; and TdisplayA is the time needed to display the decompressed image at the browser's site.

For the above two equations, we can assume that Tsetup = TsetupA, Tgrab = TgrabA, TcomDecompre = TcomDecompreA, and Tprop = TpropA. From the point of view of the waiting time for a single image, the network setup time and transmission time can occupy a significant portion of the total waiting time. The services provided by the Internet architecture are best-effort services: they perform their functions as well as possible, using as much of a shared resource as is available, but do not guarantee their performance or resource availability. When the number of users of the Internet, especially LWWW users, increases, several simultaneous high-bandwidth sessions might easily saturate network links and routers. As a result, the available bandwidth decreases and the transmission time increases. To maintain an acceptable waiting time and send a useful stream, it is very important to reduce the data to be transmitted over the Internet. The Internet agents are used for this purpose.

From Equation 1 and Equation 2, we get

Twait − TwaitA = Ttrans + Tdisplay − Tagent − TtransA − TdisplayA    (3)

Given a fixed network bandwidth, the transmission time Ttrans grows proportionally with the size of the file being transmitted. It can be obtained from the size of the image file to be transmitted, Simage (which is proportional to the dimensions of the 2D image, its resolution (including color quantization), and its quality), and the average network bandwidth B:

Ttrans = Simage / B

The value of Tdisplay can also be approximated by the size of the image to be displayed:

Tdisplay = c · Sdisimage

where c is a constant.

Similarly, TtransA = SimageA / B and TdisplayA = c · SdisimageA, where SimageA is the size of the image file to be transmitted after the transformation by the vision agents and SdisimageA is the size of the image file to be displayed at the browser's side. Since SimageA (e.g., when the image contains only lines) and SdisimageA can be much smaller than Simage and Sdisimage respectively, the waiting time when using agents, TwaitA, can be much smaller than the waiting time without using agents, Twait, even though the agents have to spend some time processing the image.
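The waiting-time model of Equations 1-3 can be sketched in a few lines of code. The timing values and file sizes below are invented purely for illustration; the point is that the agent pays Tagent once but shrinks the transmission and display terms:

```python
def waiting_time(t_setup, t_grab, t_prop, t_trans, t_comdecompre,
                 t_display, t_agent=0.0):
    """Waiting time for a single frame: Eq. 1 (t_agent = 0) or Eq. 2."""
    return t_setup + t_grab + t_prop + t_trans + t_comdecompre + t_display + t_agent

BANDWIDTH = 3_500  # bytes/s, an assumed 28.8 kbps modem

def t_trans(size_bytes):
    """Transmission time grows with file size: Ttrans = Simage / B."""
    return size_bytes / BANDWIDTH

# Without agents: a full 40 KB JPEG frame (illustrative numbers).
t_no_agent = waiting_time(0.5, 0.1, 0.05, t_trans(40_000), 0.3, 0.2)

# With agents: a 4 KB line image, at the cost of 0.4 s of agent processing.
t_with_agent = waiting_time(0.5, 0.1, 0.05, t_trans(4_000), 0.3, 0.02,
                            t_agent=0.4)
```

With these assumed numbers the agent version wins because the transmission term dominates the total, which is exactly the trade-off expressed by Equation 3.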

4.1.2 Transmitted Data for a Video Sequence

We study how the vision agents can reduce the Internet traffic in this section.

Suppose the video sequence contains N image frames. When agents are not used, the amount of data that is transmitted over the Internet is:

Mdata = N · Simage    (4)

When agents are used, suppose they filter out N* images that are not interesting; then the amount of data that is transmitted is:

MdataA = (N − N*) · SimageA    (5)

Thus, the amount of data saved by using agents is

Mdata − MdataA = N · Simage − (N − N*) · SimageA    (6)

Since SimageA can be much smaller than Simage (e.g., in the line image case) and N* can be fairly large (e.g., in the security person case), the bandwidth used when the agents are employed can be much smaller than the bandwidth used without them.
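Equations 4-6 amount to simple bookkeeping. In the sketch below, the frame counts and sizes are assumed numbers chosen only to illustrate the two effects (filtering out frames, and shrinking the survivors):

```python
def transmitted_data(n_frames, frame_size, n_filtered=0):
    """Bytes sent over the Internet: Eq. 4 (n_filtered = 0) or Eq. 5."""
    return (n_frames - n_filtered) * frame_size

# Assumed numbers: 1000 frames of 40 KB each without agents; with agents,
# 900 uninteresting frames are filtered out and survivors shrink to 4 KB.
m_data = transmitted_data(1000, 40_000)                    # Eq. 4
m_data_a = transmitted_data(1000, 4_000, n_filtered=900)   # Eq. 5
saving = m_data - m_data_a                                 # Eq. 6
```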


4.2 LWWW with Active Camera

The spontaneous growth of the WWW over the past several years has resulted in a plethora of remote-controlled mechanical devices that can be accessed via the WWW. LWWW with an active camera means that the zoom, tilt, pan, and even the position of the camera at the server's site can be interactively controlled by the remote browser. The use of an on-line controlled camera can let the browser experience the state at a particular moment in time as if he were in the actual remote space where the camera is situated.

Right now, more than a hundred interesting mechanical devices are connected to the WWW [18]. For example, Paul Cooper et al. developed an interactive telerobotic system, InterCam [1]. Ken Goldberg et al. [8] developed a three-axis telerobotic system where users were able to explore a remote world with buried objects and alter it by blowing bursts of compressed air into its sand-filled world. Eric Paulos and John Canny [18] developed a WWW browser, Mechanical Gaze, which allows multiple remote WWW users to actively control up to six degrees of freedom of a robot arm with an attached camera to explore a real remote environment. The initial environment is a collection of physical museum exhibits which WWW users can view at various positions, orientations, and levels of resolution. Cooperstock et al. [3] proposed the idea of the World-Wide Media Space (WMS) for video-conferencing, which supports navigation with an active floor plan and exploits sensors to provide additional information about remote activity.

LWWW with an active camera requires high bandwidth to operate, because the browser needs to check the images constantly while interactively controlling the camera. Vision agents can be used to greatly reduce the bandwidth used.

4.2.1 Waiting Time

For LWWW with an active camera, we define the waiting time as the time between the submission of the viewer's request and the display of an image at the client's site that satisfies the viewer's request.

When agents are not used, the browser controls the camera remotely and checks the images transmitted over the Internet until a satisfactory image arrives. Suppose the satisfactory image is the N-th image that is transmitted over the Internet; then the waiting time Twait is given by

Twait = Tsetup + N (Tgrab + Tprop + Ttrans + TcomDecompre + Tdisplay)    (7)

When agents are used, instead of directly transmitting each image that is grabbed by the camera, the agents analyze the image and decide whether it is the one the browser requires. If the agents believe that the image is not a wanted one, they will control the camera intelligently and grab the next image to analyze. If the agents believe that the image is a wanted one, they will send the image through the Internet to the browser. When the browser checks the transmitted image and is satisfied, the task is finished. If the browser is not satisfied with the result, he/she will tell the agents to try again. Suppose the camera has grabbed a total of N* images by the time the browser is satisfied, and suppose that among them N′ images are transmitted through the Internet; then the waiting time is:

TwaitA = Tsetup + N* Tgrab + N* Tagent + N′ Tprop + N′ TtransA + N′ TcomDecompre + N′ Tdisplay    (8)

Generally speaking, N is smaller than N* (because the agents are not as smart as the browser) but much bigger than N′ (because many unwanted images are filtered out by the agents). So, when the value of Tagent is not significantly large, the waiting time when using agents can be much smaller than the waiting time without them.
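A quick numerical sketch of Equations 7 and 8, with all timings and counts invented for illustration (N images transmitted without agents; N* grabs but only N′ transmissions with agents):

```python
def twait_no_agent(n, t_setup, t_grab, t_prop, t_trans, t_comdec, t_disp):
    """Eq. 7: every one of the n grabbed images crosses the Internet."""
    return t_setup + n * (t_grab + t_prop + t_trans + t_comdec + t_disp)

def twait_agent(n_star, n_prime, t_setup, t_grab, t_agent,
                t_prop, t_trans_a, t_comdec, t_disp):
    """Eq. 8: n_star grabs are analyzed locally; only n_prime are transmitted."""
    return (t_setup + n_star * (t_grab + t_agent)
            + n_prime * (t_prop + t_trans_a + t_comdec + t_disp))

# Assumed: the browser needs 30 transmitted images on its own (N = 30);
# the agents grab 50 images (N* = 50) but transmit only 3 (N' = 3).
t_plain = twait_no_agent(30, 0.5, 0.1, 0.05, 11.4, 0.3, 0.2)
t_smart = twait_agent(50, 3, 0.5, 0.1, 0.4, 0.05, 11.4, 0.3, 0.2)
```

The per-grab agent cost is paid 50 times, yet the result is still far smaller than paying the slow transmission path 30 times.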


4.2.2 Transmitted Data Reduced

We study in this section the Internet traffic reduction achieved by using agents that intelligently control the camera and analyze the images. It is clear that

Mdata = N · Simage    (9)

MdataA = N′ · SimageA    (10)

The amount of transmitted data saved by using agents is given by

Mdata − MdataA = N · Simage − N′ · SimageA    (11)

The above data reduction can be very significant when the agents are "smart" enough, that is, when N′ is kept very small compared to N.

5 An Example: the LWWW Object Search Agent

In the following section, we formulate the LWWW Object Search Agent (LWWW-OSA), outline its working mechanism, and present experimental results to illustrate how the agent performs its task.

The LWWW-OSA is a useful feature of LWWW equipped with an active camera. For example, suppose a researcher at CAS wants to play baseball after work. The researcher is supposed to bring the baseball to the game, but cannot find it in the office. The researcher wants to know whether the baseball is in someone else's office. The researcher can open the home page of the possible person and pass the request to the server site of LWWW. At this point, the WWW agent can be activated to automatically control the camera to search for the baseball.

5.1 The LWWW Object Search Agent Formulation

The LWWW Object Search Agent (LWWW-OSA) is a software package that assists the server in searching for an object in the server site's space environment. It has the properties of autonomy, social ability, reactivity, pro-activity, trust, and knowledge about itself and the environment.

5.1.1 Awareness

The LWWW-OSA has the following knowledge about the sensor and the environment.

The sensor unit is a camera with zoom, pan, and tilt capabilities. The state of the sensor unit is uniquely determined by 7 parameters (xc, yc, zc, p, t, w, h), where (xc, yc, zc) is the position of the unit center, which cannot be changed, (p, t) is the direction of the camera viewing axis (p is the amount of pan, 0 ≤ p < 2π, and t is the amount of tilt, 0 ≤ t < π), and w, h are the width and height of the solid viewing angle of the camera.

The LWWW-OSA has a set of recognition algorithms that can be used to analyze the image grabbed by the camera control agent to detect the target.

The LWWW-OSA knows the geometric configuration of the search region Ω. It tessellates the region into a series of elements ci, with Ω = c1 ∪ c2 ∪ … ∪ cn and ci ∩ cj = ∅ for i ≠ j. The element co denotes the region outside Ω.

According to the information the browser supplies about the possible position of the target, the LWWW-OSA forms a target probability distribution function p. The value p(ci) gives the probability that the center of the target is within cube ci. The LWWW-OSA uses p(co) to represent the probability that the target is not in the environment.

5.1.2 Reactivity and Knowledge Adaptation

The LWWW-OSA perceives the physical world by executing actions and adjusting its knowledge.

An operation f = f(p, t, w, h, a) is an action of the searcher within the region Ω, where a is the recognition algorithm used to detect the target. An operation f entails two tasks: (1) the LWWW-OSA instructs the LWWW-CCA (Live World Wide Web Camera Control Agent) to take a perspective projection image according to the camera configuration of f and to pass the image to the LWWW-OSA; (2) the LWWW-OSA analyzes the resulting image using the recognition algorithm a.

The cost to(f) gives the total time the LWWW-CCA needs to manipulate the hardware to the status specified by f, take a picture, and pass the image data to the LWWW-OSA, plus the time to run the recognition algorithm and update the environment if the target is not detected.

Each operation is associated with a detection function. The detection function is a function b such that b(ci, f) gives the conditional probability of detecting the target, given that the center of the target is located within ci and that the operation is f. It is obvious that the probability of detecting the target by applying action f is given by

P(f) = Σ_{i=1}^n p(ci) b(ci, f)

vironment, it also updates its knowledge ac-cording to the sensing result. It uses Bayes'formula to update the probability distributionwhenever an action fails (the target is not de-tected). Let �i be the event that the center ofthe target is in cube ci; let �o be the event thatthe center of the target is outside the searchregion; and let � be the event that after ap-plying a recognition action, the recognizer suc-cessfully detects the target. Then P (:� j �i) =1� b(ci; f ) and P (�i j :�) = p(ci; tf+), wheretf+ is the time after f is applied. Because theevents �1; : : : ; �n; �o are mutually complemen-tary and exclusive, the following updating ruleholds

p(ci; tf+) p(ci; tf )(1� b(ci; f))

p(co; tf) +Pn

j=1p(cj; tf)(1� b(cj; f))

where i = 1; : : : ; n; o.
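The updating rule can be exercised directly. Below is a minimal sketch; the cell layout and detection values are assumed, and b(co, f) is taken as 0 since the target cannot be detected outside the region:

```python
def update_distribution(p, b):
    """Bayes update of the target distribution after a failed detection.

    p: dict mapping cell id -> prior probability (keys include "o" for
       the outside region); values sum to 1.
    b: dict mapping cell id -> detection probability b(ci, f) for the
       applied operation f (cells the camera cannot see, and "o", get 0).
    Returns the posterior distribution p(., tf+).
    """
    norm = sum(p[c] * (1.0 - b.get(c, 0.0)) for c in p)
    return {c: p[c] * (1.0 - b.get(c, 0.0)) / norm for c in p}

# Assumed prior over three cells plus the outside region.
prior = {"c1": 0.3, "c2": 0.3, "c3": 0.3, "o": 0.1}
# The applied operation looked mostly at c1.
detect = {"c1": 0.9, "c2": 0.2, "c3": 0.0}
posterior = update_distribution(prior, detect)
```

After the failed action, probability mass flows away from c1 (where the agent looked hard and saw nothing) toward the cells it could not see, which is exactly the behavior the updating rule encodes.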

5.1.3 Trust and Pro-activeness

According to the search task assigned by the browser, the LWWW-OSA can explicitly represent the goal of the task, analyze its difficulties, and propose a practical solution.

Let O be the set of all the possible operations that can be applied. The effort allocation F = {f1, …, fk} gives the ordered set of operations applied in the search, where fi ∈ O. The probability of detecting the target by this allocation is as follows:

P[F] = P(f1) + [1 − P(f1)]P(f2) + … + { Π_{i=1}^{k−1} [1 − P(fi)] } P(fk)

The total cost of applying this allocation is

T[F] = Σ_{i=1}^k to(fi)
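These two quantities are easy to compute incrementally, accumulating the running product of miss probabilities. A small sketch with made-up detection probabilities and costs:

```python
def allocation_stats(operations):
    """Return (P[F], T[F]) for an ordered effort allocation.

    operations: list of (detection_probability, cost) pairs, one per
    operation fi, following the formulas for P[F] and T[F] above.
    """
    p_total = 0.0       # P[F]: probability some operation succeeds
    miss_so_far = 1.0   # running product of (1 - P(fi))
    t_total = 0.0       # T[F]: summed operation costs
    for p_f, cost in operations:
        p_total += miss_so_far * p_f
        miss_so_far *= 1.0 - p_f
        t_total += cost
    return p_total, t_total

# Three assumed operations, each given as (P(f), to(f)).
p, t = allocation_stats([(0.5, 2.0), (0.5, 1.0), (0.2, 3.0)])
```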

If K is the total time allowed for the search, then the goal of object search can be defined as finding an allocation F ⊆ O which satisfies T(F) ≤ K and maximizes P[F].

Because the above task is NP-Complete (for details of our proof, refer to [28] and [24]) and the number of available candidate actions is usually large, it is necessary to simplify the original problem. Instead of looking for an algorithm that always generates an optimal solution, the LWWW-OSA simply uses heuristics that generate a feasible solution for the original problem. Here, the greedy strategy is used, which suggests that one can devise an algorithm that works in stages, considering one input at a time. At each stage, based on some optimization measure, the next candidate is selected and included in the partial solution developed so far. The objective function for the selection of the next action is

E(f) = P(f) / to(f)

It is interesting to note that the above greedy strategy can generate the optimal answer in many situations, and that it can generate a sequence of actions such that the expected time used to detect the target is minimized (for details of our proof, refer to [28][24]).
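One greedy stage can be sketched as follows: among the candidate actions, pick the one maximizing E(f) = P(f)/to(f), where P(f) is computed from the current target distribution. The cell probabilities and the two candidate actions below are assumed values for illustration:

```python
def detection_probability(p, b_f):
    """P(f) = sum over cells of p(ci) * b(ci, f) for one action f."""
    return sum(p[c] * b_f.get(c, 0.0) for c in p)

def select_next_action(p, candidates):
    """Greedy choice: the candidate maximizing E(f) = P(f) / to(f).

    candidates: dict mapping an action name to a pair
    (detection dict b(., f), cost to(f)).
    """
    return max(
        candidates,
        key=lambda name: detection_probability(p, candidates[name][0])
        / candidates[name][1],
    )

# Assumed scenario: a wide, cheap view versus a narrow, costly zoom.
p = {"c1": 0.6, "c2": 0.3, "o": 0.1}
candidates = {
    "wide": ({"c1": 0.3, "c2": 0.3}, 1.0),   # P = 0.27, E = 0.27
    "zoom": ({"c1": 0.9}, 2.5),              # P = 0.54, E = 0.216
}
best = select_next_action(p, candidates)
```

Note that the zoom detects the target with higher probability, but the wide view wins on E(f) because it is so much cheaper: the greedy measure trades detection probability against time.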

5.1.4 Autonomy

The LWWW-OSA selects the camera's state parameters (p, t, w, h) without the intervention of the browser or the server. It determines the next action according to its perception of the world and passes its decision to LWWW-CCA.

For a given recognition algorithm, there are many possible camera viewing angle sizes to select from. However, the whole search region can be examined with a high probability of detection using only a small number of them. For a specified angle, the probability of successfully recognizing the target is high only when the target is within a certain range of distance


from the camera. This range is called the effective range for the given angle size. The LWWW-OSA only considers angles whose effective ranges together cover the entire depth D of the search region and do not overlap. If the largest viewing angle for the camera is w_0 × h_0 and its effective range is [N_0, F_0], then the necessary angle sizes <w_i, h_i> and the corresponding effective ranges [N_i, F_i] are as follows:

w_i = 2\arctan\left[(N_0/F_0)^i \tan(w_0/2)\right]

h_i = 2\arctan\left[(N_0/F_0)^i \tan(h_0/2)\right]

N_i = F_0 (F_0/N_0)^{i-1}

F_i = F_0 (F_0/N_0)^i

where 1 ≤ i ≤ \lfloor \ln(D/F_0) / \ln(F_0/N_0) + 1 \rfloor.
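Enumerating this finite set of candidate angle sizes and their effective ranges is mechanical. The sketch below is illustrative only; the function name, and the floor-plus-one form of the bound on i used here, are assumptions made for the example.

```python
import math

def necessary_angles(w0, h0, N0, F0, D):
    """Enumerate angle sizes <w_i, h_i> and effective ranges [N_i, F_i]
    whose ranges tile [F0, D] with no overlap; the largest angle w0 x h0
    itself covers [N0, F0]. Angles in radians; requires 0 < N0 < F0 <= D.
    """
    i_max = math.floor(math.log(D / F0) / math.log(F0 / N0) + 1)
    angles = []
    for i in range(1, i_max + 1):
        shrink = (N0 / F0) ** i                      # angle shrink factor
        w_i = 2 * math.atan(shrink * math.tan(w0 / 2))
        h_i = 2 * math.atan(shrink * math.tan(h0 / 2))
        N_i = F0 * (F0 / N0) ** (i - 1)              # each range starts where
        F_i = F0 * (F0 / N0) ** i                    # the previous one ends
        angles.append((w_i, h_i, N_i, F_i))
    return angles
```

With N_0 = 1, F_0 = 2, and D = 10, for instance, three successively narrower angles are produced, with effective ranges [2, 4], [4, 8], and [8, 16], covering the full depth without overlap.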

For each angle size derived, an infinite number of camera viewing directions can be considered. We have designed an algorithm that generates only the directions that cover the whole viewing sphere without overlap. The LWWW-OSA uses as candidate actions only the viewing angle sizes and the corresponding directions obtained by the above method. An infinity of possible sensing actions is thus reduced to a finite set of actions that must be tried. The LWWW-OSA uses E(f) to select the best viewing angle size and direction from the candidate actions. For each recognition algorithm, LWWW-OSA can find a best action. The next action to be executed is then selected from these best actions.

After the next action is selected, LWWW-OSA passes the result to LWWW-CCA. If the selected action does not find the target, LWWW-OSA updates its world knowledge according to Section 5.1.2 and selects another action to execute.

5.2 The Behaviors of LWWW-OSA

The behavior of LWWW-OSA can be summarized as follows:

1. The LWWW-OSA receives an object search request from the server. The request contains the target for which to search, the time allowed, and the possible locations of the target supplied by the remote browser.

2. The LWWW-OSA initializes the target probability distribution according to the information provided by the browser, and collects the available object recognition algorithms needed to detect the target.

3. For each available recognition algorithm, the LWWW-OSA selects a best action:

(a) The LWWW-OSA selects the possible camera angle sizes <w, h> needed to examine the search region.

(b) The LWWW-OSA selects the best direction <p_k, t_k> for each camera angle size <w_k, h_k>.

(c) The LWWW-OSA compares the best directions for each camera angle size to find the best action for the given recognition algorithm.

4. The LWWW-OSA compares the best actions for each available recognition algorithm to find the parameters <w, h, p, t, a> for the next action.

5. The LWWW-OSA sends the best <w, h, p, t, a> to LWWW-CCA.

6. The LWWW-CCA manipulates the hardware, grabs an image, and sends the image data to LWWW-OSA.

7. The LWWW-OSA analyzes the image using recognition algorithm a. If the target is detected, it sends the FOUND signal to the server and performs EXIT.

8. If the allocated time is used up, LWWW-OSA sends the FAILURE signal to the server and performs EXIT.

9. The LWWW-OSA updates the probability distribution and goes back to Step 3.
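The control loop implied by these steps can be sketched as below. All callables here are hypothetical stand-ins for the agent's subsystems, not interfaces defined in the paper.

```python
import time

def lwww_osa_loop(time_allowed, recognizers, select_best_action, execute,
                  update_distribution):
    """Sketch of the LWWW-OSA behavior loop (steps 3-9 above).

    Assumed stand-in interfaces:
      select_best_action(algo) -> (score, action): the best action for one
          recognition algorithm (steps 3a-3c);
      execute(action) -> bool: asks LWWW-CCA to move the camera, grab an
          image, and run the recognizer; True means the target was found;
      update_distribution(): re-weights the target probability
          distribution after a failed action (step 9).
    """
    deadline = time.monotonic() + time_allowed
    while time.monotonic() < deadline:               # step 8: time budget
        # Steps 3-4: best action per recognizer, then the overall best.
        scored = [select_best_action(algo) for algo in recognizers]
        _, action = max(scored, key=lambda s: s[0])
        if execute(action):                          # steps 5-7
            return "FOUND"
        update_distribution()                        # step 9, then retry
    return "FAILURE"
```

The loop terminates either by detecting the target (FOUND) or by exhausting the allocated time (FAILURE), matching steps 7 and 8.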


5.3 Experiments

Experiments were performed to test the LWWW-OSA. The Laser Eye sensor shown in Figure 2(b) is used in the experiment. It is a sensing unit with pan and tilt capabilities, and consists of a camera with controlled focal length (zoom), a laser range-finder, and two mirrors. The mirrors ensure collinearity of the effective optical axes of the camera lens and the range finder. The laser emits from the center of the camera (see [10] for details) to measure the distance from the center of the camera to objects in the environment in a specified direction. The direction of the camera's viewing axis and the size of the camera's viewing angle can be adjusted according to the visual task. The search task assigned by the browser is to search for a baseball in the environment shown in Figure 2(a). The browser also specifies to LWWW-OSA that the baseball is probably on tables. After LWWW-OSA initializes the probability distribution of the search environment, the search begins. Figure 2(c) shows the first action selected by LWWW-OSA. The camera viewing angle size is 41° × 39°. Although the baseball is in the image, the given recognition algorithm does not detect the target because the target is outside the effective range of the viewing angle size 41° × 39°. Figure 2(e) shows the third action selected by LWWW-OSA. The camera viewing angle size is 20.7° × 19.7°. This action detects the target. Figures 2(f) and 2(g) show the image analysis results, where the baseball is detected (refer to [27][24] for a more detailed discussion of the experiments).

6 Conclusion

This paper proposes the concept of vision agents over the Live World Wide Web. It outlines the general architecture of the LWWW agent technology and gives its performance models.

The Web now contains over 100 million documents, and is said to be doubling in size every 52 days. The incredible interest in the Net and the Web, and in the more general concept of information highways, is evident in the media today and is shared by people of various backgrounds. We believe that offering vision agents over the Internet can make the Live WWW more attractive and open up more applications of WWW technology. Vision agents over the Internet will become one of the features of future Internet technology.

The results presented in this paper can be taken as a starting point for further study of LWWW technology. They open new avenues of research, such as how to extend the current communication protocol so that it is suitable for LWWW technology, how to integrate real-time channels to make LWWW work effectively, and how to maintain the consistency, integrity, and security of LWWW.

Acknowledgments

This work² was supported by the IBM Centre for Advanced Studies and the ARK (Autonomous Robot for a Known environment) Project, which receives its funding from PRECARN Associates Inc., Industry Canada, the National Research Council of Canada, Technology Ontario, Ontario Hydro Technologies, and Atomic Energy of Canada Limited. The authors are grateful to Karel Vredenburg and Don Hameluck for stimulating discussions and to the CASCON96 reviewers for valuable comments.

About the Author

²The CAS-LWWW team now consists of Karen Bennet and Abhay Shah from the IBM side, and John K. Tsotsos and Yiming Ye from the University of Toronto side.

Yiming Ye is a Ph.D. candidate in the Department of Computer Science, University of Toronto, and a Research Fellow student at the IBM Canada Centre for Advanced Studies. He has published papers in the areas of Internet agents and the WWW, computer vision and image processing, robotics, computational geometry, machine translation, computational complexity, and knowledge-based systems, in journals and conference proceedings such as: Proceedings of CASCON96, Proceedings of the Second International Conference on Multiagent Systems (ICMAS96), Proceedings of the International Symposium on Intelligent Robotic Systems (SIRS96), Proceedings of the 1995 IEEE International Symposium for Computer Vision (ISCV95), Proceedings of the 4th International Symposium on Artificial Intelligence and Mathematics (AIM95), Proceedings of the Second Asian Symposium on Computer Mathematics (ASCM96), Proceedings of the IJCAI Workshop for Handicapped Children, Proceedings of the 1990 ACM Eighteenth Annual Computer Science Conference, and Science in China. His current research interests are multimedia, image processing, computer vision, robotics, Internet agents, and multiagent systems. He can be reached at [email protected].

John K. Tsotsos is a professor in the Department of Computer Science, University of Toronto, and a Fellow of the Canadian Institute for Advanced Research. He can be reached at [email protected].

Karen Bennet is a Principal Investigator and Operational Manager for the Centre for Advanced Studies, IBM Canada. She can be reached at [email protected].

References

[1] David Abrams. Toronto Web Society. http://www.cs.utoronto.ca/~abrams/tws.html, Toronto, 1996.

[2] Z. Chen, S. E. Tan, R. H. Campbell, and Y. Li. Real-time video and audio in the World Wide Web. In Proceedings of WWW95, 1995.

[3] Jeremy Cooperstock, Kelvin Ho, Kimiya Yamaashi, and Bill Buxton. Opening the doors of communication: the world-wide media space. Submitted to UIST'96: Ninth Annual ACM Symposium on User Interface Software and Technology, 1996.

[4] L. L. Daigle and P. Deutsch. Agents for Internet information clients. In CIKM'95 Intelligent Information Agents Workshop, Baltimore, MD, December 1995.

[5] Oren Etzioni and Daniel S. Weld. Intelligent agents on the Internet: fact, fiction, and forecast. IEEE Expert, August 1995.

[6] Armando Fox and Eric A. Brewer. Reducing WWW latency and bandwidth requirements by real-time distillation. In Fifth International World Wide Web Conference, Paris, France, May 1996.

[7] Didier Le Gall. MPEG: a video compression standard for multimedia applications. Communications of the ACM, 34(4):47-58, 1991.

[8] K. Goldberg, M. Mascha, S. Gentner, N. Rothenberg, C. Sutter, and Jeff Wiegley. Robot teleoperation via WWW. In IEEE International Conference on Robotics and Automation, Toronto, May 1995.

[9] Telepresence Systems Inc. Prorata: group awareness and collaboration software. In A Deme at: The Internet Beyond the Year 2000, Toronto, 1996.

[10] P. Jasiobedzki, M. Jenkin, E. Milios, B. Down, and J. Tsotsos. Laser Eye: a new 3D sensor for active vision. In Intelligent Robotics and Computer Vision: Sensor Fusion VI, Proceedings of SPIE, vol. 2059, pages 316-321, Boston, September 1993.

[11] Tom Lane. JPEG image compression: frequently asked questions. Independent JPEG Group, http://www.smartpages.com/faqs/jpeg-faq/faq.html, November 1994.

[12] Case T. Larsen. Introduction to videoconferencing and the MBONE. http://www.lbl.gov/ctl/vconf-faq.html, 1996.

[13] Y. Lashkari, M. Metral, and P. Maes. Collaborative interface agents. In Proceedings of the National Conference on Artificial Intelligence, USA, 1994.

[14] Chris Lilley. Not just decoration: quality graphics for the web. In Fourth International World Wide Web Conference, Boston, 1995.

[15] P. Maes, T. Darrell, B. Blumberg, and S. Pentland. Interacting with animated autonomous agents. In Working Notes, AAAI Spring Symposium on 'Believable Agents', USA, 1994.

[16] Jacques A. Mattheij. Web camera info. http://www.mattheij.nl/webcam, 1996.

[17] P. England, R. Allen, and R. Underwood. RAVE: real-time services for the web. In Fifth International World Wide Web Conference, Paris, France, May 1996.

[18] Eric Paulos and John Canny. A world wide web telerobotic remote environment browser. In Fourth International World Wide Web Conference, Massachusetts, USA, 1995.

[19] D. Riecken. Intelligent agents: introduction to special issue. Communications of the ACM, 37(7):107-116, July 1994.

[20] J. K. Tsotsos. Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13:423-469, 1990.

[21] Jacco van Ossenbruggen and Anton Eliëns. Bringing music to the web. In Fourth International World Wide Web Conference, Boston, 1995.

[22] Xiaohuan Corina Wang. An adaptive rendering and display model for networked applications, 1996.

[23] Mike Wooldridge and Nick Jennings. Intelligent agents: theory and practice. USA, 1994.

[24] Yiming Ye. Sensor Planning in 3D Object Search. PhD thesis, Department of Computer Science, University of Toronto, Toronto, Ontario, Canada, 1996.

[25] Yiming Ye and John K. Tsotsos. Sensor planning for object search. Technical Report RBCV-TR-94-47, Computer Science Department, University of Toronto, 1994.

[26] Yiming Ye and John K. Tsotsos. The detection function in object search. In Proceedings of the Fourth International Conference for Young Computer Scientists, pages 868-873, Beijing, 1995.

[27] Yiming Ye and John K. Tsotsos. Where to look next in 3D object search. In 1995 IEEE International Symposium for Computer Vision, Florida, USA, November 19-21, 1995.

[28] Yiming Ye and John K. Tsotsos. Sensor planning in 3D object search: its formulation and complexity. In The 4th International Symposium on Artificial Intelligence and Mathematics, Florida, USA, January 3-5, 1996.

[29] Frank Yellin. Low level security in Java. In Fourth International World Wide Web Conference, Boston, 1995.
