
To appear in Internet Imaging IV, IS&T/SPIE’s 15th Annual Symposium of Electronic Imaging Science and Technology, 2003

Wireless cellular control of time-shared robotic web cameras

David H. Abrams and Peter N. Prokopowicz, Ph.D.

Collaborative Software Group, divine, inc. Chicago, IL 60602

[email protected]; [email protected]

ABSTRACT

We present a novel user interface and distributed imaging system for controlling robotic web cameras via a wireless cellular phone. A user scrolls an image canvas to select a new live picture. The cellular phone application (a Java MIDlet) sends a URL request, which encodes the new pan/tilt/optical-zoom of a live picture, to a web-camera server. The user then downloads a new live picture centered on the new viewpoint. The web-camera server mediates requests from users by time-sharing control of the physical robotic hardware. By processing a queue of user requests at different pan/tilt/zoom locations, the server can capture a single photograph for each user. While one user downloads a new live image, the robotic camera moves to capture images and service other independent user requests. The end-to-end system enables each user to independently steer the robotic camera, viewing live snapshot pictures from a cellular phone.

Keywords: cell phone, wireless, web camera, robotic, telepresence

1. INTRODUCTION

Empowering users to see live images of a remote location can solve many business and consumer problems. Web cameras are becoming an effective tool for remotely monitoring a manufacturing plant, construction site, oil refinery, traffic intersection, operations center, retail store, tradeshow, or daycare center. In all of these cases, a user can make decisions based on a live picture, which shows the current status of the site as it is changing. Real-time visual information can be essential for an operations manager to make decisions, save expensive travel costs, and avoid miscommunications that lead to project delays or failures to meet quality standards.

1.1 Making Decisions from Real-time Visual Data

In many real-world business applications, it is imperative that decisions are based on accurate, real-time, and descriptive information about a changing environment. For example, a general contractor or construction manager could use a web camera to remotely diagnose potential problems before continuing to the next phase of the project. But the manager can only make such a decision if he/she has enough visual information to inspect the area of interest close up. Typically, a low-resolution wide-angle view of the entire site does not provide enough resolution to be useful for remote diagnosis, inspection, or quality control. In addition, workers are often geographically distributed and do not regularly have access to a personal computer. A fixed-view web camera that is only available from a desktop computer is not enough to enable users to make time-critical decisions to solve business problems. They need a mobile wireless device that provides high-resolution visual data for inspecting and making key decisions.

1.2 Empowering Mobile Users to See Remote Locations

We present a novel user experience and distributed imaging system that enables mobile users to remotely inspect a location, diagnose problems, and make decisions. A wireless cellular phone application sends requests to a web camera server that controls a robotic camera. The user can interactively pan/tilt/zoom the camera, seeing live pictures. This phone application is an extension of the commercial TrueLook telepresence system. TrueLook (www.truelook.com) is the distributed imaging system that we have developed to enable real-time telepresence for remote web viewers, including mobile cell phone users. TrueLook has been implemented using existing PC, camera, phone, network, and wireless infrastructure and industry-standard software protocols. The design decisions that we describe in this paper are based on six years of actual deployments of telepresence systems in manufacturing plants, construction sites, retail stores, and major sporting events (e.g. the NBA Finals, the Summer and Winter Olympics, and the MLB World Series).

2. USER EXPERIENCE

The user can see live images of the remote site by using a cellular phone that runs the client application, which communicates with the web camera server. The user interactively steers the pan/tilt/zoom robotic camera, taking live snapshot photographs of the remote location.

2.1 Mobile Webcam Client

This client application resides on the user's phone and sends requests to the imaging system for live pictures. Our implementation uses a Java MIDlet [1] to control the robotic web camera. The user loads the MIDlet onto his/her phone and specifies a default server from which to acquire the index of cameras.

2.2 Selecting a Camera and Preset Shot

The user selects from a text menu of camera categories, and drills down the hierarchy of menus until a camera is selected by name. An example set of menu selections may be:

Construction | Illinois | Evanston | BERL Construction Cam

Figure 1. (top) The MIDlet displays preset shots from a dynamic database of cameras available for a particular application. (bottom) The TrueLook Java MIDlet displaying a picture of a construction site on the Motorola i90c phone, using the Nextel cellular data network to communicate with a telepresence server at a public Internet address. The user presses the arrow keys to interactively scroll a 110x90 pixel display over a 320x240 image canvas and steer a remote robotic camera.

The server dynamically generates the camera list from a database, filtering out any cameras that are not currently serving live images (e.g. power or networking is off) or that have restricted access privileges. The client downloads the camera index before displaying the menu. Upon selecting the web camera, the user chooses a preset shot from a text menu of predefined starting pan/tilt/zoom positions for the camera. Once a preset has been selected, a request is immediately encoded and sent to the web camera server to take a new live picture at that pan/tilt/zoom.

2.3 Interactive Viewport

Once the image is downloaded, the user can press the up/down/left/right arrows on the keypad to scroll the image. Since the image canvas is larger than the viewport, the user interactively explores the image by scrolling the viewport to different parts of the image. To take a new live image centered on the current location of the viewport, the user simply presses a number on the keypad that represents the optical zoom level of the next live picture. A request is sent to the web camera server to take a new live image at the requested pan/tilt/zoom. The robotic camera physically moves to the pan/tilt/zoom coordinates, and a photograph is captured. The photograph downloads to the user's phone, and the user sees a viewport centered within the canvas of the new live image.

Clearly there is a usability tradeoff between image resolution and download time. In fact, we have tested our application with pictures that match the size of the cell phone screen, essentially removing the scrollable canvas. Our tests have shown that when a user simply wants an overview picture of the scene (e.g. for identifying a location), the smaller picture can download in approximately 3 seconds, as opposed to 8 seconds for a larger 320x240 image. But the interactive canvas has proven essential for object inspection and decision-making.

2.4 Inspecting with Steerable Optical Zoom

Once the user finds an area of interest that needs to be inspected, he/she can zoom in for a close-up photograph. The user scrolls the viewport to the area of interest and presses a number on the keypad that represents the optical zoom magnification of the next live picture. The robotic camera's lens moves to the requested optical magnification; then a new live image is captured, encoded, and downloaded to the client. The user can see extremely high-resolution close-up shots of the remote site.
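To make this interaction loop concrete, the sketch below shows how a MIDP 1.0 canvas might map the arrow and number keys onto scroll offsets and capture requests. The class, the URL format, and the header names are our own illustrative assumptions, not the TrueLook implementation; error handling is minimal and a production client would fetch on a separate thread. Note the integer-only arithmetic, since the MIDlet platform provides no floating point.

    import java.io.*;
    import javax.microedition.io.*;
    import javax.microedition.lcdui.*;

    public class ViewportCanvas extends Canvas {
        private static final int VIEW_W = 110, VIEW_H = 90;       // phone display area
        private static final int CANVAS_W = 320, CANVAS_H = 240;  // downloaded image
        private Image picture;                 // current live picture
        private int x, y;                      // top-left of viewport within the canvas
        // State echoed back from the last response (see Section 3).
        private String pan = "0", tilt = "0", zoom = "1";

        protected void paint(Graphics g) {
            if (picture != null) {
                g.setClip(0, 0, VIEW_W, VIEW_H);
                g.drawImage(picture, -x, -y, Graphics.TOP | Graphics.LEFT);
            }
        }

        protected void keyPressed(int keyCode) {
            int action = getGameAction(keyCode);
            // Arrow keys scroll the viewport; clamp it to the canvas bounds.
            if (action == LEFT)  x = Math.max(0, x - 10);
            if (action == RIGHT) x = Math.min(CANVAS_W - VIEW_W, x + 10);
            if (action == UP)    y = Math.max(0, y - 10);
            if (action == DOWN)  y = Math.min(CANVAS_H - VIEW_H, y + 10);
            // Number keys request a new live picture at that zoom level,
            // centered on the middle of the current viewport.
            if (keyCode >= KEY_NUM1 && keyCode <= KEY_NUM9) {
                takePicture(x + VIEW_W / 2, y + VIEW_H / 2, keyCode - KEY_NUM0);
            }
            repaint();
        }

        // Integer-only client: the raw pixel offset and the state of the
        // current image are sent; the server does the spherical math.
        private void takePicture(int cx, int cy, int newZoom) {
            String url = "http://camserver.example.com/take?x=" + cx + "&y=" + cy
                       + "&pan=" + pan + "&tilt=" + tilt + "&zoom=" + zoom
                       + "&newzoom=" + newZoom;
            try {
                HttpConnection c = (HttpConnection) Connector.open(url);
                // Hypothetical header names carrying the new camera state.
                pan = c.getHeaderField("X-Pan");
                tilt = c.getHeaderField("X-Tilt");
                zoom = c.getHeaderField("X-Zoom");
                int len = (int) c.getLength();       // assumes Content-Length is set
                DataInputStream in = c.openDataInputStream();
                byte[] jpeg = new byte[len];
                in.readFully(jpeg);
                in.close();
                c.close();
                picture = Image.createImage(jpeg, 0, len);
                x = (CANVAS_W - VIEW_W) / 2;         // re-center the viewport
                y = (CANVAS_H - VIEW_H) / 2;
            } catch (IOException e) { /* show an error alert to the user */ }
        }
    }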

3. DISTRIBUTED SYSTEM ARCHITECTURE AND OPERATION

The TrueLook system architecture (Fig. 3) consists of four types of entities: various robotic cameras with their controller units, directory servers, telepresence servers, and telepresence clients.

Figure 2. (L) Skyline scene at the widest field of view (actual resolution is about twice that shown). (R) Detail of the tower near the center of the picture at left, taken with 25x optical zoom. Actual pictures taken with the TrueLook system and a Sony SNC-RZ30 camera.

The directory server constantly monitors all telepresence servers to determine which cameras are currently accessible (1, 2). When the user activates the application, it connects to a pre-configured directory server (3), which delivers real-time hierarchical menus of active cameras to the Java MIDlet (4). The user navigates the menus to select the desired camera and view, which the MIDlet transmits to the telepresence server (5). The telepresence server converts the request into pan, tilt, zoom, and image-size parameters and transmits the request to the networked robotic camera (6). The robotic camera acquires and returns the requested image (7). The image is saved in an image archive for possible re-use in image caching or collaborative applications (8). Finally, the image is delivered to the client (9). The user scrolls the image to view it and to direct the remote camera for the next picture. With the image scrolled to adjust the viewpoint, the user directs the camera to take a new picture, which continues the interactive picture-taking process (5).

Figure 3. Distributed network architecture for time-shared interactive control of remote robotic cameras (telepresence). Client devices include (right, top to bottom) PDAs, Java-enabled cellular phones, and standard PCs. Analog video robotic cameras are connected to the Internet by an encoder PC (left, top). Network-enabled cameras connect directly to the Internet via HTTP.

When the user requests a new live picture, the (x, y) pixel offset of the center of the viewport on the image canvas is encoded in a URL request, along with the pan/tilt degrees and optical zoom level of the current image. Since the Java MIDlet specification [1] does not currently support floating-point operations, our client simply captures the integer (x, y) location and sends it to the server, which performs the transformation computation. We project the (x, y) target pixel offset from a plane tangent to the viewing sphere at (0, 0) into spherical coordinates to determine the new pan/tilt position. The resulting target pan/tilt degrees and optical zoom magnification are sent to the encoding unit, which converts the requested pan/tilt/zoom into the device-specific reference frame and control metrics.
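A minimal server-side sketch of this tangent-plane-to-sphere computation follows. The method name, the axis conventions, and the sign of the image y-coordinate are our assumptions (the paper does not give the actual code); the focal length f is in pixels and would be derived from the camera's horizontal field of view at the current zoom, f = (W / 2) / tan(hfov / 2) for an image W pixels wide.

    /** Map a pixel offset (dx, dy) from the image center onto new
     *  pan/tilt angles, given the current pan/tilt (degrees) and the
     *  focal length f in pixels. */
    static double[] targetPanTilt(double panDeg, double tiltDeg,
                                  double dx, double dy, double f) {
        double p = Math.toRadians(panDeg), t = Math.toRadians(tiltDeg);
        // Ray through the target pixel in the camera frame:
        // x right, y up, z along the optical axis; image y grows downward.
        double x = dx, y = -dy, z = f;
        // Apply the current tilt (rotation about the camera's horizontal axis),
        double y1 = y * Math.cos(t) + z * Math.sin(t);
        double z1 = -y * Math.sin(t) + z * Math.cos(t);
        // then the current pan (rotation about the vertical axis),
        double x2 = x * Math.cos(p) + z1 * Math.sin(p);
        double z2 = -x * Math.sin(p) + z1 * Math.cos(p);
        // and convert the resulting direction back to spherical coordinates.
        double newPan = Math.toDegrees(Math.atan2(x2, z2));
        double newTilt = Math.toDegrees(
                Math.atan2(y1, Math.sqrt(x2 * x2 + z2 * z2)));
        return new double[] { newPan, newTilt };
    }

As a sanity check, for dx = dy = 0 the method returns the current pan and tilt unchanged, and for small offsets it reduces to pan + dx/f and tilt - dy/f in radians.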

When the resulting image capture is encoded into a JPEG, the HTTP response to the MIDlet includes the state information <pan, tilt, zoom, time, url> in the HTTP header fields. Thus, a single HTTP response includes the actual JPEG image data and its associated state. This metadata is necessary for calibrating the user's next request, ensuring that each user independently steers the camera.

3.1 Client

TrueLook supports a wide range of clients, requiring only that they are TCP/IP and HTTP enabled. No single client software architecture could provide optimal performance and usability across this range. In this paper we emphasize in particular our client built on the Java MIDlet software platform, as supported by the Motorola iDen handset [2] and Nextel IP service. We have also deployed a small-area, pure-HTML client for use in wireless Palm OS browsers, a Java applet engineered for the Pocket PC form factor, as well as Java and pure-HTML clients for the whole range and history of PC and workstation browsers (although well over 90% of our client use comes from Microsoft Internet Explorer version 4 or later).

3.2 Directory Server

In the TrueLook system, end-user clients do not communicate directly with networked cameras. The client communicates with a directory server that presents a hierarchical, searchable directory of telepresence applications. The directory server returns the address of the telepresence server that hosts the requested application. The directory server is aware of which cameras are currently on-line, so the client is never directed to an unavailable application.

3.3 Telepresence server

The telepresence server hosts up to several hundred applications, each of which can have many live cameras. For example, an application might comprise several cameras located on a construction site to monitor progress and quality, or several dozen cameras for monitoring the operations of a factory floor. The server maintains a dynamic database of the cameras associated with an application, and configures the client in real time with a menu of preset viewpoints, thumbnail images of those viewpoints, and panoramic viewfinders, depending on the client's ability to display these navigational aids.

3.4 Networked robotic camera

The telepresence server acts as a caching proxy between the clients and the robotic cameras. Many users can simultaneously request images from a camera. The server re-orders the requests to minimize camera movement and delivers them to the networked camera, which rapidly acquires the images and returns them. Under heavy loads, the server will use its cache to return an earlier image directly to the client. In practice, between 10 and 20 users can share a camera without resorting to the cache. In the TrueLook architecture, IP-enabled cameras [3,4,5] are self-contained entities that communicate over the Internet with the telepresence servers. Non-IP-enabled USB [6] and analog [7,8] video cameras must be connected directly to an encoding unit, an off-the-shelf PC capable of capturing video.
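The caching-proxy idea might look like the following sketch; the freshness window, the angular tolerance, and the class names are illustrative assumptions, not TrueLook's actual policy.

    import java.util.*;

    class ImageCache {
        static final long MAX_AGE_MS = 5000;      // how stale a picture may be
        static final double MAX_ANGLE_DEG = 1.0;  // pan/tilt matching tolerance

        static class Shot {
            double pan, tilt, zoom;
            long takenAt;
            byte[] jpeg;
        }

        private final List<Shot> shots = new ArrayList<Shot>();

        /** Return a recent, closely matching picture, or null to force
         *  a fresh capture by the robotic camera. */
        synchronized byte[] lookup(double pan, double tilt, double zoom) {
            long now = System.currentTimeMillis();
            for (Shot s : shots) {
                if (now - s.takenAt <= MAX_AGE_MS
                        && Math.abs(s.pan - pan) <= MAX_ANGLE_DEG
                        && Math.abs(s.tilt - tilt) <= MAX_ANGLE_DEG
                        && s.zoom == zoom) {
                    return s.jpeg;
                }
            }
            return null;
        }

        synchronized void store(Shot s) { shots.add(s); /* plus eviction */ }
    }

Under light load, lookup() always misses and every user gets a freshly captured picture; under heavy load, near-duplicate views requested within the last few seconds are served from the archive instead of moving the camera again.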

4. DESIGN CRITERIA AND TRADEOFFS

The system is intended to support diverse robotic cameras, servers, and clients, and to work effectively across a very broad range of bandwidth and latency environments. It also must scale to large numbers of cameras and clients in use at the same time. The application itself requires high image resolution; timely image acquisition, delivery, and display; and precise camera control.

4.1. JPEG/M-JPEG vs. streaming image transmission

A key architectural decision was to standardize on JPEG image encoding for capture, transmission, storage, and display. Several considerations led us to this approach. First of all, we aim to support a wide range of network bandwidth and latency environments, at both the client and image-acquisition ends of the system. This is especially critical for the cellular client environment as of 2002, where effective data rates are typically in the low kilobit range [9]. On the other hand, in LAN environments the entire system often operates with 100 Mbps available end to end on Ethernet. The system should be scalable in that users on the LAN as well as on the wireless phone can both use the application effectively at the same time.
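To make this spread concrete, the time to deliver one compressed frame is simply t = 8S / r for an image of S bytes over a link of rate r. Taking S of roughly 25 Kbytes (the middle of the per-frame sizes reported below) and an effective cellular rate of 20 kbit/s, within the "low kilobit range" cited above:

\[
t_{\text{cell}} \approx \frac{200{,}000\ \text{bits}}{20{,}000\ \text{bit/s}} = 10\ \text{s}, \qquad
t_{\text{LAN}} \approx \frac{200{,}000\ \text{bits}}{10^{8}\ \text{bit/s}} = 2\ \text{ms}.
\]

The same picture costs roughly four orders of magnitude more time on the wireless link, which is why, in our architecture, the client rather than the server chooses the image size and request rate.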

In our architecture, the client drives image capture by requesting images as large as it can usefully display (including image scrolling, as described above) and as fast as they can be delivered. The client pipelines the image request, response decoding, and image display operations so that they can occur in parallel. Likewise, the server pipelines several requests to the encoding system. We have measured a Pentium III LAN-based client at 20 frames per second with an image size of 640 x 480 pixels. The compressed JPEG data is typically around 20-30 Kbytes per frame. Thirty frames per second (the NTSC video standard frame rate, at 640 x 480 pixels) will be achievable with standard PC and LAN technology in the very near future. Our wireless client requests images of 320 x 240 pixels at a rate of about 8 seconds per image. This is not fast enough to show useful motion, but the images are large enough to capture useful detail, which is a key requirement.

We decided not to use more efficient interframe compression encoding techniques, such as MPEG-4 [10]. All current streaming interframe compression video technologies introduce a latency of at least several seconds. This is because they require a fairly large image buffer in order to compress the image stream and to ensure that it can be transmitted at a fixed average bandwidth. This latency makes it very difficult to control a robotic camera using visual feedback. We have built prototype telepresence systems using commercial off-the-shelf encoders and servers from Microsoft and Real Networks, and we were unable to reduce the latency enough to allow usable real-time control over a wide-area network. Interframe compression techniques also have the drawback that each individual frame has relatively poor spatial resolution, especially if there is motion in the scene. Individual still frames, on the other hand, can easily freeze motion and deliver high resolution uniformly across the scene. Finally, wireless handsets are not yet commercially available with MPEG-4 decoding. Overall, JPEG and Motion JPEG image capture and display have proven more effective than streaming video for real-world telepresence applications.

4.2. Time-shared vs. queued control

TrueLook telepresence uses a novel time-sharing technique to give potentially many simultaneous users apparently exclusive control of a particular camera. When a client requests a view of the remote scene, the system moves the camera and encodes the picture on demand. While the image is being transmitted and viewed, many other requests from other clients can be handled. Thus each client can request a unique series of pictures independently of all the other clients. Another approach commonly taken is to provide exclusive control of the camera to one client for a short period, while other users wait in a queue for control [3,11]. All clients see the same stream of images, whether in control or waiting. This approach does not meet the requirement that end users be able to get the exact view they need in a timely manner, because queue times can become very long with even a few users sharing the system.
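One simple way to realize this time-sharing policy, together with the request re-ordering of Section 3.4, is a greedy nearest-neighbor sweep over the pending views. The sketch below is our illustration of the idea, not the actual TrueLook scheduler.

    import java.util.*;

    class CameraScheduler {
        static class Request {
            double pan, tilt;
            int zoom; /* plus the client callback, image size, etc. */
            Request(double pan, double tilt, int zoom) {
                this.pan = pan; this.tilt = tilt; this.zoom = zoom;
            }
        }

        /** Serve each queued request exactly once, always moving to the
         *  closest pending view next, so every user gets a private picture
         *  while total camera travel stays small. */
        static List<Request> order(List<Request> pending,
                                   double pan0, double tilt0) {
            List<Request> remaining = new ArrayList<Request>(pending);
            List<Request> ordered = new ArrayList<Request>();
            double p = pan0, t = tilt0;
            while (!remaining.isEmpty()) {
                Request best = remaining.get(0);
                for (Request r : remaining)
                    if (travel(p, t, r) < travel(p, t, best)) best = r;
                remaining.remove(best);
                ordered.add(best);
                p = best.pan;
                t = best.tilt;
            }
            return ordered;
        }

        // Rough travel cost: the larger of the pan and tilt moves, since
        // most PTZ heads drive both axes simultaneously.
        static double travel(double p, double t, Request r) {
            return Math.max(Math.abs(r.pan - p), Math.abs(r.tilt - t));
        }
    }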
4.3. High-resolution pan-tilt-zoom vs. wide-angle acquisition

The telepresence research community has long recognized the central role of remotely controlling a viewpoint over the widest possible range, but it has split into two approaches: mechanical pan-tilt cameras [12,13,14] vs. solid-state wide-angle lens/mirror systems [15,16,17]. Solid-state systems have two major advantages: there are no moving parts, which reduces cost and improves reliability; and, most importantly, the viewpoint can be shifted at any speed and controlled independently and simultaneously by any number of viewers. The last feature is approached in two ways: either the entire wide-range field of view is compressed, delivered, and then selectively displayed by each client, or each client directs the encoding system to encode and deliver only the field of view currently requested by that client. The first approach is highly scalable in that the server compresses a single stream and delivers the same stream to each viewer. Multi-tier serving architectures can be deployed effectively. The client can shift the viewpoint instantly because the entire field of view is always available. The big downside is the cost in bandwidth required to achieve these benefits: the entire field of view must be streamed while only a portion is viewed.

This downside becomes more severe as the system captures the scene at higher resolution. Unfortunately, very high resolution is necessary for real-world inspection and monitoring applications when each image must contain an entire hemisphere of view. What capture resolution would be adequate for a wide-angle lens system? A hemispherical image captured at 640 x 480 resolution will have a spatial resolution of 1.8 pixels/degree or worse, depending on the projection used, whereas a typical pan/tilt/zoom video camera system has a field of view of roughly 45 degrees horizontally, giving a resolution of 14.2 pixels/degree. Coupled with an optical zoom typically from 10 to 25 power [3,7], resolution can be as high as 350 pixels per degree, roughly 200 times sharper than a 640 x 480 hemispherical image. It is not possible to use optical zooming with a wide-angle lens system, because zooming would by definition reduce the field of view accordingly. To reach 350 pixels per degree across a viewing hemisphere would require a sensor with 360 x 90 x 350² ≈ 4 billion pixels. Our usability studies indicate that the end-to-end resolution of an NTSC analog video camera, with 640 x 480 image capture, 12x optical zoom, and moderate JPEG compression/decompression, is very nearly the same as what the 20/20 naked eye can resolve at the same distance. In other words, it is very much like being there. The hemispherical image described above, by contrast, has resolving power similar to uncorrected myopia of 20/4000; 20/200 is considered legally blind [18]. Hemispherical images are adequate for teleconferencing applications where all participants are very near the camera, but not effective if it is important to read or inspect anything.
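For reference, the resolution figures above follow from simple arithmetic, assuming a 360° x 90° viewing hemisphere, a 45° horizontal field of view for the PTZ camera, and 25x optical zoom:

\[
\frac{640\ \text{px}}{360^{\circ}} \approx 1.8\ \tfrac{\text{px}}{\text{deg}}, \qquad
\frac{640\ \text{px}}{45^{\circ}} \approx 14.2\ \tfrac{\text{px}}{\text{deg}}, \qquad
14.2 \times 25 \approx 350\ \tfrac{\text{px}}{\text{deg}},
\]
\[
(360 \times 350) \times (90 \times 350) = 360 \times 90 \times 350^{2} \approx 4 \times 10^{9}\ \text{pixels}.
\]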

5. CONCLUSION

We have presented a novel mobile phone user interface for controlling robotic cameras within a distributed imaging system called TrueLook. Mobile users can interactively pan/tilt/zoom a remote camera via a wireless cellular phone application. Our Java MIDlet sends requests to a telepresence server, which controls a networked robotic camera. The MIDlet provides an interactive viewport that lets users scroll an image canvas, repositioning the viewpoint and zooming in to take another live picture centered on the target view. The system we presented enables mobile users to visually inspect objects at a remote location and make time-critical decisions from live pictures of the scene. The future of ubiquitous telepresence is being driven by the decreasing cost and complexity of capturing digital images, the rapidly spreading footprint of the Internet, and the increasing functionality and mobility of Internet-ready client hardware. It is now practical to monitor and inspect almost any location or process, using remote-controlled, interactive cameras, from anywhere in the world.

6. ACKNOWLEDGEMENTS The authors would like to thank Paul Cooper, Michael Halleen, Rick Otto, Andy Minor, and Hisham Petry for their contributions to the design and operation of the TrueLook network.

7. REFERENCES

1. "Mobile Information Device Profile (JSR-37)", JCP Specification, Java 2 Platform, Micro Edition, Version 1.0a, Sun Microsystems, Inc., 2000.
2. "iDen Technical Overview", Software Release 9.1, Motorola, Inc., 1999.
3. "Sony SNC-RZ30 CGI Command Manual", Version 1.01, Sony Corporation, September 9, 2002.
4. "JVC V.Networks API Specification", Version 2.0, Victor Company of Japan, Limited, Optical Communication Division, April 1, 2001.
5. "What is a Network Camera?", Axis White Paper, Axis Communications, 2002.
6. Logitech "QuickCam Pro 4000" User's Manual.
7. "Sony EVI-D30/D31 Programmer's Manual", Sony Corporation.
8. "VC-C4 Communication Camera Programmer's Manual", Canon Corp.
9. J. E. Bartley, III, "Handheld's Cellular Data FAQ", http://celdata.cjb.net, 2002.
10. ISO/IEC 14496-1,2:1999, "Information technology - Coding of audio-visual objects - Parts 1 & 2", December 1999.
11. "Canon VB101 User's Manual", Canon, Inc., 2002.
12. E. Paulos and J. Canny, "Delivering real reality to the World Wide Web via telerobotics", in Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1694-1699, Minneapolis, Minnesota, April 22-28, 1996.
13. K. Goldberg, M. Mascha, S. Gentner, N. Rothenberg, C. Sutter, and J. Wiegley, "Desktop tele-operation via the World Wide Web", in Proceedings of the IEEE International Conference on Robotics and Automation, 1995.
14. K. Goldberg, S. Gentner, C. Sutter, and J. Wiegley, "The Mercury Project: A Feasibility Study for Internet Robots", IEEE Robotics and Automation Magazine, Special Issue on Internet Robotics, December 1999.
15. R. Collins, A. Lipton, and T. Kanade, "A System for Video Surveillance and Monitoring", in Proc. American Nuclear Society (ANS) Eighth International Topical Meeting on Robotics and Remote Systems, 1999.
16. D. Kimber, J. Foote, and S. Lertsithichai, "FlyAbout: Spatially Indexed Panoramic Video", in Proc. ACM Multimedia 2001.
17. H.-C. Huang and Y.-P. Hung, "Panoramic stereo imaging system with automatic disparity warping and seaming", Graphical Models and Image Processing, 60(3), pp. 196-208, May 1998.
18. International Organization for the Blind, http://www.io4b.org/terms.htm.