


A Support System for Visually Impaired Persons Using Acoustic Interface – Recognition of 3-D Spatial Information –

Yoshihiro KAWAI and Fumiaki TOMITA

National Institute of Advanced Industrial Science and Technology (AIST), AIST Tsukuba Central 2, 1-1-1, Umezono, Tsukuba, Ibaraki 305–8568, Japan.

[email protected] [email protected]

Abstract

Without visual information, visually impaired people incur many inconveniences in their daily and social lives. It is important to prepare an infrastructure that enables visually impaired users to easily understand the circumstances of their surroundings. However, many problems cannot be solved by infrastructure maintenance and development alone. Therefore, we are developing an active support system, an intelligent support device, that provides access to the 3-D spatial information surrounding the user via an acoustic interface. The proposed system is expected to be useful in situations where the infrastructure is incomplete and the situation changes in real time. This paper describes a prototype system, the processing of 3-D visual information, and some basic experiments on an acoustic interface using 3-D virtual sound.

1 Introduction

Much of the information that humans acquire from the outside world is obtained through sight. Without this facility, visually impaired people incur many inconveniences in their daily and social lives. Much research has been done worldwide on support systems for the visually impaired [2, 5, 7]. However, there are still many problems in representing the real-time information that is changing around the user.

Canes and seeing-eye dogs are widely used as walking support devices in active operator support systems. However, the range that a user can sense with a cane is limited, and the use of seeing-eye dogs still entails problems of availability and practicability. Support devices using electronic technologies have been developed, but considerable training is needed to use them. It is important to prepare an infrastructure that enables visually impaired users to easily understand the circumstances in their surroundings. Surface bumps, braille panels, and audio traffic signals are in use. However, economic realities limit their installation and availability. Many problems cannot be solved by infrastructure maintenance and development alone. Therefore, we aimed to develop an active support system, an intelligent support device, that provides access to the 3-D spatial information surrounding the user via an acoustic interface.

Among other reported visual-aid systems using sound are one that uses ultrasonic waves, Kaspa (Sonic Vision); one that displays images by differences in frequency pitch and volume; one that utilizes a speaker array; and one that utilizes stereophonic effects [6]. However, the target of these systems is mainly a two-dimensional space. Three-dimensional sound can provide more real-world information because it includes an intuitive feeling of depth, as well as a feeling of front and rear. There is a GUI access system that uses 3-D sound, but as it shows only the relative position between windows, 3-D sound is not being fully utilized.

We are developing a support system that displays 3-D visual information using 3-D virtual sound. Our system is unique in that 3-D environment information is acquired for a task set by the user and is represented by 3-D virtual sound. Images captured by small stereo cameras are analyzed in the context of a given task to obtain a 3-D structure, and object recognition is performed. The results are then conveyed to the user via 3-D virtual sound. This system is expected to be useful in situations where the infrastructure is incomplete and the situation surrounding the user is changing in real time. In addition, the system can be used without much learning because it provides information via virtual sound superposed on actual environment sounds. This method does not replace or impede the user's existing external auditory sense. We assume it could be used, for example, to assist while playing sports or walking, as shown in Figure 1 and Figure 4 (a). While the importance of walking assistance needs no further explanation, assistance while playing sports is also important. The sports that visually impaired persons can do are limited to those with two-dimensional movements, such as jogging, roller skating, and floor volleyball. Many visually


Figure 1: Application example (assistance in playing sports).

Figure 2: Overview and composition of the system. (a) Overview; (b) stereo camera system; (c) headset; (d) composition (computer, sampler, patcher, sound space processor, mixer, MIDI, stereo audio, MD player, headset).

impaired persons have voiced requests to be able to play sports that involve 3-D movements together with sighted people. We therefore believe that our proposed system is one answer that satisfies these needs.

This paper presents an outline of our prototype system, the 3-D visual information processing methods used to acquire the 3-D information surrounding the user, and some basic experiments on an acoustic interface using 3-D virtual sound.

2 System

We have built the prototype system shown in Figure 2 (a). Figure 2 (b) is a stereo camera system with three small cameras mounted on a helmet, and (c) is a headset with a microphone and headphones. Figure 2 (d) shows the configuration of this system.

The images captured by the stereo cameras are sent to a computer and analyzed. After the 3-D data is reconstructed, the user's targets are recognized and tracked. Both the status of the acquired objects and the sound assigned to each object from a sampler are input to the sound space processor. A sound image for each target is mapped in a 3-D virtual sound space and transmitted to the user. Although not yet implemented, we may also add a microphone for the user to set a task by voice, processed by a voice recognition engine.
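As a rough sketch, the per-frame flow just described (images in, recognized 3-D targets out, sounds placed in virtual space) might look like the following. All names here are hypothetical placeholders for illustration; the real system uses the VVV vision library and the RSS-10 sound space processor, whose APIs differ.

```python
from dataclasses import dataclass

@dataclass
class TrackedObject:
    name: str
    position: tuple  # (x, y, z) in metres, user-centred coordinates

def reconstruct_3d(stereo_images):
    # Stand-in for the stereo analysis step: pretend one object was recovered.
    return [TrackedObject("ball", (1.0, 0.5, 3.0))]

def process_frame(stereo_images, sound_bank, placements):
    """One iteration: images -> 3-D data -> recognized targets -> 3-D sound."""
    for obj in reconstruct_3d(stereo_images):
        sound = sound_bank.get(obj.name, "alert")   # sound assigned per object
        placements.append((sound, obj.position))    # sent to the sound processor

placements = []
process_frame(stereo_images=None, sound_bank={"ball": "ping"}, placements=placements)
print(placements)  # [('ping', (1.0, 0.5, 3.0))]
```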

2.1 Stereo camera system

It is desirable for the visual information input unit to be small and lightweight because these devices are mounted on the user's head. We mounted an aluminum frame with three small cameras on a helmet (shown in Figure 2 (b)). Each camera is a 1/4-inch color CCD camera. The total weight of the helmet is about 650 g. We set the focus of the lenses at more than 2 m, a point beyond the reach of a visually impaired person's cane. Three cameras are used to reduce the ambiguity of the correspondence problem along horizontal lines during stereo image analysis.
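For intuition about this working range, depth in a rectified stereo pair follows the standard relation z = fB/d. The focal length and baseline below are illustrative values, not the calibration of the helmet rig described here.

```python
# Depth from stereo disparity, z = f * B / d, for a rectified camera pair.
# The parameters are illustrative, not the actual calibration of the system.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# With a 700-pixel focal length and a 12 cm baseline, a 28-pixel disparity
# corresponds to an object 3 m away.
print(depth_from_disparity(700, 0.12, 28))  # 3.0
```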

The advantages of using cameras are that they are suited for object recognition and tracking, for measurement of distant objects (e.g., discrimination of the red/green color of a traffic signal from far away), and for reading character information (e.g., characters on a sign). Although it is still difficult to analyze images to obtain 3-D visual information, recent pattern recognition techniques have potential uses in our application. For example, we have been developing the Versatile Volumetric Vision (VVV) system, a general-purpose system which can be used for many purposes in many fields [9]. The details are described in section 3.1.

2.2 Acoustic system using three-dimensional virtual sound

With recent developments in virtual reality technologies, the technical progress of acoustics in virtual space has been remarkable. We can use 3-D virtual sound readily, since some 3-D sound equipment is already on the commercial market. We assembled our acoustic system using the sound space processor RSS-10 made by


Figure 3: Results of thrown ball tracking experiment. (a) Captured images; (b) results of segment-based and correlation-based methods; (c) color segmentation result; (d) results of ball tracking experiment.

Roland Corporation (the left side in Figure 2 (a)). This device calculates an arbitrary 3-D virtual sound space using Head-Related Transfer Functions (HRTFs).
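The position parameters sent to such a processor are naturally expressed as a direction and distance relative to the head. A minimal conversion from head-centered Cartesian coordinates is shown below; this is a generic spherical-coordinate transform for illustration, not the RSS-10's actual input format.

```python
import math

# Convert a target position (x right, y up, z forward, in metres, head-centred)
# into azimuth/elevation/distance for a virtual sound source. Generic transform;
# the RSS-10's actual interface is not documented here.

def to_direction(x, y, z):
    distance = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(x, z))          # 0 deg = straight ahead
    elevation = math.degrees(math.asin(y / distance))  # 0 deg = ear level
    return azimuth, elevation, distance

# A ball 3 m ahead and 3 m to the right lies 45 degrees off-axis.
az, el, d = to_direction(3.0, 0.0, 3.0)
print(round(az), round(el), round(d, 2))  # 45 0 4.24
```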

As an output device, we selected bone conduction headphones, which do not entirely cover the user's ears and therefore do not impede the user's hearing and understanding of environment sounds. It is very important not to impede a user's auditory sense.

3 Three-dimensional visual information and auditory information

3.1 Three-dimensional visual information

We briefly explained that our method of obtaining 3-D information uses three captured images. A feature of our proposed method is that 3-D shape recognition and tracking are possible even though only range and color information is obtained. The details of these algorithms have been reported in [3, 8, 9]. In this paper, we show the results of a 3-D visual information processing experiment in which the user plays catch, one application example shown in Figure 1.

The technology for tracking a moving object is necessary to help visually impaired persons play sports. Figure 3 (a) shows a stereo image captured with the camera system in Figure 2. In this experiment, we tested tracking of a sponge ball colored red, blue, and yellow. First, a pitcher held the ball in both hands, and then threw it toward the system as shown in Figure 3 (a). A total of 123 frames was captured over 8.5 sec. (14.5 fps). The overall image size was 160×120 pixels, but at first some images were captured at 640×480 pixels (4.5 fps) to detect the position of the target ball. The pitcher had to show the ball clearly so that it could be recognized easily (to calculate the initial parameters). After it was recognized, its positions were tracked using the algorithm. The reconstructed result for one scene (frame number 099) using the segment-based method is shown in the left part of Figure 3 (b). Distance measurement using the correlation-based method was done at the same time to detect obstacles (the right part of Figure 3 (b)). To track the fast-moving thrown ball, a tracking process using only the color segmentation method was performed in this experiment. The ball included the primary colors (red, blue, and yellow), which are easily segmented in the RGB color space, as shown in Figure 3 (c). In each camera's image, the regions of each color were segmented, and only one region, at the boundary of the color regions, was selected. A circle was fitted to this region and the 2-D coordinates of its center were calculated. Then, the 3-D position of the moving ball was computed from the three 2-D positions. This process was iterated to trace the ball. The time required to search for the ball's new position was reduced by predicting it from the transformation parameters already obtained. The result of tracking the ball is shown in Figure 3 (d). Although the position resolution was low because small images were used, the tracked path was fairly accurate. These position parameters were sent to the sound space processor with an assigned sound to calculate the virtual sound based on HRTFs.
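The color segmentation and triangulation steps can be sketched as follows. The thresholds, image content, and camera parameters are illustrative stand-ins, the circle fit is replaced by a simple centroid, and only two views are used instead of three.

```python
import numpy as np

# Sketch of colour-based tracking: threshold a primary colour in RGB, take
# the centroid of the segmented region as the ball centre, then triangulate
# depth from the horizontal disparity between two rectified views.

def red_centroid(img):
    """Centroid (x, y) of strongly red pixels in an RGB image array."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mask = (r > 200) & (g < 80) & (b < 80)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return xs.mean(), ys.mean()

# Synthetic 120x160 test images: a red patch at x~100 (left) and x~80 (right).
left = np.zeros((120, 160, 3), dtype=np.uint8)
right = np.zeros((120, 160, 3), dtype=np.uint8)
left[50:60, 95:105] = (255, 0, 0)
right[50:60, 75:85] = (255, 0, 0)

(xl, _), (xr, _) = red_centroid(left), red_centroid(right)
disparity = xl - xr                 # horizontal disparity in pixels
depth = 700 * 0.12 / disparity      # z = f * B / d, illustrative f and B
print(disparity, round(depth, 2))   # 20.0 4.2
```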


Figure 4: Experiment setup. (a) Walking support; (b) virtual sound source locations; (c) environment sound recorded at a traffic intersection.

Table 1: Results of tests on auditory localization.

(a) Case A: without environment sound
  Subject   T1      T2      Time
  A3        11.5%   48.1%   2.8 sec.
  B3        19.2%   53.8%   3.7 sec.
  C3        11.5%   38.5%   9.4 sec.
  Avg.      14.1%   46.8%   5.0 sec.

(b) Case B: with environment sound
  Subject   T1      T2      Time
  A3        13.5%   44.2%   2.9 sec.
  B3         5.8%   34.6%   4.6 sec.
  C3         9.6%   36.5%   9.5 sec.
  Avg.       9.6%   38.4%   5.7 sec.

(c) Analysis of mis-recognized answers of subject A3 at T2
  Subject A3   Error rate   Front&Rear   Up&Down   Other
  Case A       51.9%        14.8%        59.3%     33.3%
  Case B       55.8%        41.4%        51.7%     20.7%

Small fixed-focus cameras mounted on a frame are used as the input device. They do not form an active camera system, so object information closer than 2 m is not reconstructed (see Figure 3 (d)), because a close object is not captured by every camera. However, this camera setting was decided based on the reach of a cane, as described in section 2.1. Also, this experiment was carried out off-line, so computation was not in real time. We are now developing a real-time on-line system.

3.2 Three-dimensional auditory information

One method that uses the auditory sense for sensory substitution informs the user of the environment situation with a synthetic voice; however, playing sounds is a superior means for immediate understanding.

Not all visual information is converted to auditory information. The only sounds output are those representing the targets needed for the task. However, for obstacles or in dangerous situations, such as the presence of stairs, walls, or cars, an alarm or voice alert warns the user.

Regarding the output sound, if one already exists in the real world, the same sound is used. Otherwise, a sound the user can easily recognize is assigned. It is important that the created, superimposed sound space match the real environment as closely as possible. The user can change these sound settings and parameters at will, choosing favorite sound sources from the sampler in Figure 2 (d).

4 Basic tests on auditory localization

We conducted some basic tests on auditory localization of 3-D sound in a virtual space to develop an acoustic interface. The results of some experiments have been described in [4]. In this paper, we explain sound localization tests that investigate the influence of environment sounds. These tests were performed on the assumption of an application of the walking support system shown in Figure 4 (a).

The subjects were two males and one female. The sound locations for all directions in the virtual sound space are shown in Figure 4 (b). Twenty-six sound sources were arrayed on a sphere with a radius of 3.0 m: one at each of the north and south poles, and 8 each, at 45-degree intervals, on the horizontal plane and on the planes at ±45 degrees elevation. The height of the sphere center from the floor was 3.2 m. The sound was presented at random 52 times. Each sound source was assigned a unique number. After having been briefed on the correspondence between numbers and directions in advance, the subjects responded with the number of the direction they perceived. For the sound source in these tests, a sound similar to that heard at audio traffic lights was used. The environment sounds included the sounds of automobiles, recorded at the pedestrian crossing shown in Figure 4 (c). We performed two sound position recognition tests, one without the environment sounds (Case A), and one with the environment sounds (Case B). Table 1 (a) and (b) shows the rates of correct answers and recognition times for Case A and Case B.
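The 26-source layout described above (two poles plus three rings of eight) can be generated directly; the coordinate convention (z up, sphere centered at the origin) is an assumption for illustration.

```python
import math

# Layout of the 26 virtual sources: the two poles plus 8 points at 45-degree
# intervals on each of three circles (elevations -45, 0, +45 degrees) of a
# 3.0 m sphere. Coordinate convention (z up) is assumed for illustration.

def source_positions(radius=3.0):
    positions = [(0.0, 0.0, radius), (0.0, 0.0, -radius)]  # north and south poles
    for elev in (-45.0, 0.0, 45.0):
        z = radius * math.sin(math.radians(elev))
        r = radius * math.cos(math.radians(elev))
        for k in range(8):                                  # 45-degree azimuth steps
            az = math.radians(45.0 * k)
            positions.append((r * math.cos(az), r * math.sin(az), z))
    return positions

pts = source_positions()
print(len(pts))  # 26
```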

Tolerance n (Tn) is the rate of correct answers, defined as follows:

T1: rate when the answer exactly matched the actual direction.


T2: rate when the answer was accepted as correct if its distance was within one position from side to side or up and down.
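A scoring sketch for these tolerance measures: T1 counts exact matches, and T2 also accepts answers one position away. The adjacency table below is a made-up illustration, not the actual neighborhood structure of the 26-source array.

```python
# Scoring sketch for the tolerance measures T1 and T2. The neighbour table
# is a hypothetical illustration of "within one position" adjacency.

NEIGHBOURS = {
    "front": {"front-left", "front-right", "front-up", "front-down"},
}

def t1_t2(trials):
    """trials: list of (actual, answered) position labels; returns (T1, T2)."""
    exact = sum(1 for actual, answer in trials if actual == answer)
    near = sum(1 for actual, answer in trials
               if actual == answer or answer in NEIGHBOURS.get(actual, set()))
    n = len(trials)
    return exact / n, near / n

trials = [("front", "front"), ("front", "front-left"),
          ("front", "rear"), ("front", "front-up")]
print(t1_t2(trials))  # (0.25, 0.75)
```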

The average of completely correct answers at T1 was very low (only 14.1% in Case A and 9.6% in Case B). This result shows that environment sounds degrade sound image localization. There are two main reasons. One is the narrow range of the bone conduction headphones: the effective frequency band is only 30 to 3000 Hz at more than 80 dB. The other is the sound source itself. We used a simple sound similar to that of an audio traffic signal, instead of pink noise, which resembles environment sounds and is easy to recognize. Its frequency band is narrower than that of pink noise, so sound images were more difficult to localize.

Table 1 (c) shows an analysis of subject A3's errors at T2. He mis-recognized to almost the same degree in both Case A and Case B. However, the errors between the front and rear directions in Case B were larger than in Case A. We can assume that the influence of the environment sounds is large for sound position recognition between the front and rear directions. It is known that frequency is related to the recognition of front and rear directions [1], so differences in frequency should be emphasized to improve recognition between the front and rear directions.

5 Conclusions

We are developing a recognition support system using 3-D virtual sound to enable visually impaired persons to obtain reliable information about their environment. We described the design of our prototype system, 3-D visual and auditory information processing methods, and some basic experimental results of our acoustic tests. We showed that the system is usable if tasks are limited, though image processing time remains a problem. Problems in sound image localization were clarified through the experiments; for example, we found many mis-recognitions between the front and rear directions.

In the future, we will first complete this project as an on-line system and develop computer vision algorithms to analyze 3-D visual information. Taking these experimental results into account, we will develop an acoustic interface that improves the rates of sound image localization, for example by altering the sound source frequencies for locations and directions that are likely to be mis-recognized. We will also develop a user interface that allows task setting by voice, and investigate the influence of virtual sounds superimposed on environment sounds.

References

[1] J. Blauert, "Spatial Hearing (Revised Edition)", The MIT Press, 1996.

[2] T. Ifukube, T. Sasaki, C. Peng, "A blind mobility aid modeled after echolocation of bats", IEEE Trans. BME-38, 5, pp. 461–465, 1991.

[3] Y. Kawai, T. Ueshiba, Y. Ishiyama, Y. Sumi, F. Tomita, "Stereo Correspondence Using Segment Connectivity", Proc. of ICPR'98, I, pp. 648–651, 1998.

[4] Y. Kawai, M. Kobayashi, H. Minagawa, M. Miyakawa, F. Tomita, "A Support System for Visually Impaired Persons Using Three-Dimensional Virtual Sound", Proc. of ICCHP2000, pp. 327–334, 2000.

[5] J. M. Loomis, R. G. Golledge, R. L. Klatzky, J. M. Speigle, J. Tietz, "Personal guidance system for the visually impaired", Proc. of ASSETS'94, pp. 85–91, 1994.

[6] J. M. Loomis, C. Hebert, J. G. Cicinelli, "Active localization of virtual sounds", J. Acoust. Soc. Am., 88, 4, pp. 1757–1764, 1990.

[7] P. B. L. Meijer, "An Experimental System for Auditory Image Representations", IEEE Trans. Biomed. Eng., 39, 2, pp. 112–121, 1992.

[8] Y. Sumi, Y. Kawai, T. Yoshimi, F. Tomita, "Recognition of 3D free-form objects using segment-based stereo vision", Proc. of ICCV'98, pp. 668–674, 1998.

[9] F. Tomita, T. Yoshimi, T. Ueshiba, Y. Kawai, Y. Sumi, "R&D of Versatile 3D Vision System VVV", Proc. of SMC'98, TP17-2, pp. 4510–4516, 1998.