
[VDI/VDE Soc. Meas. & Autom. Control International Conference on Multi-Sensor Fusion and Integration for Intelligent Systems - Baden-Baden, Germany (20-22 Aug. 2001)]

Towards a Learning Model for Feature Integration in Attention Control *

Luiz M. G. Gonçalves
Universidade Estadual de Campinas
Av. Albert Einstein, 1251, 13083-970 Campinas, SP - Brazil
lmarcos@ic.unicamp.br

Abstract

We present current efforts towards an approach for the integration of features extracted from multi-modal sensors, with which to guide the attentional behavior of robotic agents. The model can be applied in many situations and different tasks, including top-down or bottom-up aspects of attention control. Basically, a pre-attention mechanism enhances attentional features that are relevant to the current task according to a weight function that can be learned. Then, an attention shift mechanism can select one among the various activated stimuli, in order for a robot to foveate on it. Also, in this approach, we consider moving the robot's resources so as to improve the (visual) sensory information.

1 Introduction

We present current efforts toward developing a behaviorally active mechanism for the integration of multi-modal sensory features, to be employed in real-time attention control. The proposed mechanisms support complex, behaviorally cooperative, active sensory systems as well as different types of tasks, including bottom-up and top-down aspects of attention. Our goal is to develop an active system able to select regions of interest in its underlying space and to physically foveate (verge) a robotic camera on the selected regions. The system may also need to keep attention on the same region as necessary, for example to categorize an object or while moving an arm to reach and grasp an object, and to eventually shift its focus of attention to another region as a task is finished. Signal processing techniques provide data reduction and feature abstraction from the sensory input data. Based on the current functional state, a learning strategy is currently under development to decide on-line which are the most relevant features of this "perceptual buffer". As a result, the robot can use this strategy to make control decisions based on the information contained in the perceptual state, selecting the right actions in response to environmental stimuli, according to the current task being executed.

* This work is supported by CNPq and FAPESP, Brazil.

Attentional control has some facets that must be considered: bottom-up (stimulus guided) or top-down (task guided, as for example the search for an object, given a model of it); overt (focus of attention is in the fovea) or covert (attentional focus is out of the fovea). In this work we try not to make distinctions between the various facets, by adding the task specification to the learning model. The resulting system architecture can be applied to tasks involving both sides of these dichotomies. In other words, in practice, our agents can employ bottom-up and overt attention in monitoring tasks, top-down and overt attention in searching and reaching for an object, and other moving tasks. We remark that we do not want a system designed to perform specific tasks. We want a behaviorally active attentional system that can perform several different tasks in different environments or situations, automatically responding, in real-time, to environment changes. For this reason, we believe that on-line weight tuning for the integration of features extracted from multi-modal sensory information, according to the task being executed, is the main issue that must be addressed in a system with these requirements.

2 Related works

Studies suggesting the use of multiple features for selective attention in biological mechanisms began with Treisman [1]. More recently, in [2], she provided a better description of this model for low-level perception, with the existence of two distinct phases in visual information processing: a parallel (simultaneous) low-level feature extraction followed by a sequential processing of selected regions. Tsotsos [3] also depicted an interesting approach to visual attention based on selective tuning. Besides these works, just recently other works towards explaining the neuro-physiology of attention and towards producing reasonable mathematical models for the definition of low-level features have been produced. Van de Laar [4], Itti [5], Vandapel [6], Luo [7], and Milanese [8] use a transfer function to gather information from the feature maps to construct a salience map that governs attention. Rybak et al. [9] treat perception


and cognition as behavioral processes. Westelius [10], using a simulation platform for a robot, proposes an interesting and relatively complete approach. The algorithm turns the visited regions invisible to the attentional mechanism in an internal attentional map. In a purely descriptive work, Kosslyn [11] suggests that features extracted from visual images are combined with mental image completion for recognition purposes.

Most of the above works consider using only stationary, monocular image frames [4, 3, 8, 7] or post-processed sequences of images [5, 6], not including temporal aspects like motion or functional and behavioral (real-time) aspects; nor do these approaches provide real-time feedback to environmental stimuli, that is, they do not explicitly deal on-line with the real-time constraints experienced in robotics problems. Furthermore, the approach used for feature integration and for deciding where to put the next attention window is not made clear in some of them. In our work, we use a strategy for weight tuning similar to Breazeal et al. [12], although we do not use exactly the same strategy for the fusion problem. In her approach, the robot is guided by social constraints, while our approach considers using a learning strategy for selecting the right weight tuning according to the current task. So, in relation to the above works, our behaviorally active model provides a better and working approach for the integration of attentional features. In the current implementation, our feature set includes visual features such as static spatial properties (like intensity and texture), temporal features (like motion), stereo disparity features, haptics-based features, and other task-dependent, empirically determined features. We generate salience maps (one for each eye/camera) by using a task-dependent weighted transfer function based on the values of each region in the attentional feature maps and on some constraints, referring to an attentional map. After this integration, a region of interest can simply be chosen as the most salient region in the salience maps.

3 Reducing/abstracting data

A key issue in real-time sensory systems is the definition of a strategy for data reduction and abstraction, in order to allow the host computer to execute high-level processes (the decisional level) in real-time. In this work, we use real-time data provided by real and simulated platforms in order to validate the fusion process for the attentional mechanism. Our simulated robotic agents have devices for simulating sensory information from a simple world representation, and include the basic physics specifications for their motion behavior, as in a real robot. To reduce visual sensory information, we use a dry (in the sense of data reduction and abstraction) structure constituted of multiple features (MF) extracted from a multi-resolution (MR) fovea representation of the perceived scene. This technique was applied to a real stereo-head robot [13] and also adapted to the simulators to reduce their sensory information. The haptics buffer, similar to the visual one, contains arm, leg, and hand sensory (tactile and proprioceptive) information. Both structures represent the mapping of topological/spatial indexes from multiple sensors to multiple attentional features. For the visual buffer, we adopted a fovea representation containing 4 image levels, each of very small size (15 x 16 pixels). In practice, in the stereo-head robot, the coarsest level contains the whole image, sampled from 512 x 460 pixels down to a coarse representation of 15 x 16 pixels. The finest level contains only the most central part of the image (the 15 x 16 central pixels, without any sampling). The result of this process can be seen in Figure 1, which shows the four MR intensity images produced by the above process. In Figure 2, the images are re-scaled and superposed to compose one single image, giving a better idea of our fovea representation. It has coarse resolution in the periphery and fine resolution in the image center.

Figure 1: Multi-resolution fovea representation

Figure 2: MR images re-scaled and superposed.
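To make the representation concrete, here is a minimal Python sketch of how such a 4-level fovea could be built (the function name, the use of SciPy's zoom, and the exact windowing are our assumptions; the paper's finest level is an unsampled center crop, which the sketch only approximates):

    from scipy.ndimage import zoom

    def build_fovea(image, levels=4, level_shape=(15, 16)):
        """Multi-resolution fovea: each level halves the field of view,
        and every level is resampled to the same small level_shape, so
        the periphery is coarse and the center is (nearly) unsampled."""
        h, w = image.shape
        fovea = []
        for k in range(levels):
            fh, fw = h // (2 ** k), w // (2 ** k)   # shrinking window
            y0, x0 = (h - fh) // 2, (w - fw) // 2   # centered crop
            window = image[y0:y0 + fh, x0:x0 + fw]
            fy = level_shape[0] / window.shape[0]
            fx = level_shape[1] / window.shape[1]
            fovea.append(zoom(window, (fy, fx), order=1))
        return fovea  # coarse periphery first, fine center last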

In order to enhance the desired attentional features, we convolve each resolution level of the above fovea representation with 6 Gaussian (derivative) filters (the first three Gaussian derivatives in each of two orthogonal directions). Let d = x, y define the two directions (X and Y) of the Gaussian filter kernels and let k = 0, 1, 2 index the derivatives. The equation that defines this convolution, performed on an image I acquired at instant t, is:

    G_d^{(k)} = g_d^{(k)} * I_t        (1)
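As a minimal illustration of Equation (1), the sketch below computes the six directional Gaussian-derivative maps for one resolution level with SciPy; the function name, the sigma value, and the reading of k as the derivative order are our assumptions, not the paper's code.

    from scipy.ndimage import gaussian_filter

    def gaussian_derivative_maps(level, sigma=1.0):
        """Six maps G_d^{(k)}: directions d in {x, y}, orders k in {0, 1, 2}.

        gaussian_filter's `order` argument selects the derivative order
        along each axis, given as (rows, cols) = (y, x). For k = 0 both
        directions reduce to plain Gaussian smoothing.
        """
        maps = {}
        for k in range(3):
            maps[('x', k)] = gaussian_filter(level, sigma, order=(0, k))
            maps[('y', k)] = gaussian_filter(level, sigma, order=(k, 0))
        return maps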

In the same way, two MF motion images are computed for each camera by applying the following equation to a "difference" fovea representation, also computed in the previous phase:

    M_d = g_d^{(1)} * [I_t - I_{t-1}],  d = x, y        (2)
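Equation (2) can be sketched the same way, reusing gaussian_filter imported above; the frame pair and sigma are again illustrative assumptions:

    def motion_maps(frame_t, frame_prev, sigma=1.0):
        # Equation (2): first Gaussian derivative (along x, then y)
        # of the difference between consecutive fovea images.
        diff = frame_t - frame_prev
        m_x = gaussian_filter(diff, sigma, order=(0, 1))
        m_y = gaussian_filter(diff, sigma, order=(1, 0))
        return m_x, m_y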

In relation to the original image sizes, the amount of pixels is decreased by a factor of 32 in the stereo head and by almost 8 in simulation, besides using multi-features. A resulting MRMF representation (left eye) for a scene containing a sphere, obtained directly from the image processing device (from Datacube Corp.) inside which it is generated, is shown in Figure 3.

Figure 3: Multi-Resolution Multi-Feature vector. The first six column images are the Gaussian partial derivatives of gray-level intensity images and the last two are the derivatives of frame differences representing motion.

4 Attentional behavior control

The desired attentional behavior for our system is to focus attention on the currently most salient region of its environment, selected according to a task-dependent policy. This involves computing salience maps (one for each resource), taking the winning region, and generating saccadic eye movements (possibly involving pan and/or tilt besides vergence) to foveate on that region.

Each salience map has a multi-resolution structure. For its generation, several attentional feature maps are computed from the above MRMF representation: stereo disparity D_a, Gaussian magnitudes I_a^{(0)}, I_a^{(1)}, I_a^{(2)}, motion magnitude M_a, proximity P_a, mapping T_a, interest E_a, and arm attentional features H_a (the subscript a denotes the "attentional" phase). The map M_a is calculated as the squared magnitude of the motion MF images (last two columns of Figure 3):

    M_a = (M_x)^2 + (M_y)^2        (5)

Each map I_a^{(k)}, made for Gaussian kernel response k = 0, 1, 2, is also the squared magnitude of the Gaussian MF images (first three pairs of columns in Figure 3), calculated as:

    I_a^{(k)} = (G_x^{(k)})^2 + (G_y^{(k)})^2        (6)

The values in map T_a tell whether a region has been previously visited. In the map P_a, the value of each position can be proportional to its distance to the fovea (this can be used for overt attention, for example). The value of each position in map E_a represents the current level of "interest" in a given region. To calculate stereo disparity D_a, we use a simple cascade correlation approach over the second-order Gaussian intensity I^{(2)} computed above. In this work, we have used only one haptics attentional feature map (H_a), which is computed as a binary map. The value 1 represents regions of high haptics interest (for example, if an arm or a leg bumps into an object, which can be detected by tactile sensors, the value in the corresponding region can be set to 1). Note that if the task is to read a book, the corresponding weight can be set to zero, so that the arm or leg does not interfere in the attentional process.
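In code, Equations (5) and (6) are just squared sums of the directional responses computed in the earlier sketches (g_maps, m_x, m_y follow the names we introduced above):

    def magnitude_maps(g_maps, m_x, m_y):
        # Equation (6): I_a^{(k)} = (G_x^{(k)})^2 + (G_y^{(k)})^2.
        I_a = {k: g_maps[('x', k)] ** 2 + g_maps[('y', k)] ** 2
               for k in range(3)}
        # Equation (5): M_a = (M_x)^2 + (M_y)^2.
        M_a = m_x ** 2 + m_y ** 2
        return I_a, M_a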

Once all of the above attentional feature maps are computed, salience maps can be generated. Starting at the coarsest resolution level, an activation value (S) for each position in the salience maps can be calculated as a simple weighted summation of all attentional features:

    S = w_D D_a + w_M M_a + w_{I^{(0)}} I_a^{(0)} + w_{I^{(1)}} I_a^{(1)} + w_{I^{(2)}} I_a^{(2)} + w_E E_a + w_P P_a + w_T T_a + w_H H_a        (7)
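A minimal NumPy sketch of Equation (7), and of taking the winning region, follows; the dictionary layout and function names are our own scaffolding, assuming all feature maps share one shape at a given resolution level.

    import numpy as np

    def salience_map(features, weights):
        """Equation (7): weighted sum of attentional feature maps.

        features: dict name -> 2-D map; weights: dict name -> scalar.
        """
        names = list(features)
        s = np.zeros_like(features[names[0]], dtype=float)
        for name in names:
            s += weights[name] * features[name]
        return s

    def winning_region(s):
        # Position (row, col) of the most salient region.
        return np.unravel_index(np.argmax(s), s.shape)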

In the first version of our algorithm [13], the stereo-head robot performed mainly bottom-up and covert tasks, so the weights (w) of this function were determined empirically. We concluded that the same weighting strategy can be used for other, more general tasks, involving other aspects of attention such as top-down or overt. So, we conjecture that the weight set can be learned. In the current work, we consider the use of a reinforcement learning approach [14], more specifically Q-learning, to derive a strategy for weight selection. We consider the set of tasks that the robot can perform (the functional state) as part of the state space.

5 Learning selective attention

An attentional mechanism must operate at every step of the control loop, no matter what task the robot is performing at a given time instant. The main functionality of a pre-attentive mechanism is to get activation from external and internal sensors and to put it into attentional feature maps. Each feature map translates the perceived activation of all its inner regions (pixels) for use in getting the attention window. The main problem is exactly how to determine weights for each position, in order to integrate all attentional feature maps into one single activation map which determines the attentional focus (see Figure 4).

Figure 4: Feature integration

In this work, we change the weights of each attentional feature map according to the task. We have identified at least two levels of abstraction. In a high-level processing, a global strategy sets global weights for each map, according to the task. We are trying to use a learning approach with Q-learning [14] to derive these high-level weights. In a low-level processing, weights for each position in each attentional map can be set according to the selected global strategy (task). We intend to use a backpropagation classifier to learn the weights of each position, according to models given a priori to the system. For example, in a top-down search, given a set of models, the backpropagation learns the weights associated to each position that can be used to enhance the most probable instances of each model. A matrix correlation approach can be used to identify these weights.
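The paper proposes Q-learning for the high-level weights without detailing the formulation; as a rough sketch under our own assumptions (tasks as states, a discrete set of candidate weight profiles as actions, an externally supplied reward), a tabular version could look like this:

    import random

    TASKS = ["monitor", "search", "reach"]   # functional states (assumed)
    PROFILES = [0, 1, 2]                     # indexes of candidate weight sets

    Q = {(t, p): 0.0 for t in TASKS for p in PROFILES}

    def choose_profile(task, eps=0.1):
        """Epsilon-greedy choice of a weight profile for the current task."""
        if random.random() < eps:
            return random.choice(PROFILES)
        return max(PROFILES, key=lambda p: Q[(task, p)])

    def q_update(task, profile, reward, next_task, alpha=0.1, gamma=0.9):
        """One tabular Q-learning backup."""
        best_next = max(Q[(next_task, p)] for p in PROFILES)
        Q[(task, profile)] += alpha * (reward + gamma * best_next
                                       - Q[(task, profile)])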

    Task      w_M   w_D   w_I(0)   w_I(1)   w_I(2)   w_E   w_P   w_T   w_H
    Monit.     1     1      1        1        1      var    1    var    0
    Reach      1     1      1        1        1       3     3     0     3
    Search     1     1      1        1        1       0     0     3     0

Figure 5: Task-dependent weights for each feature map.
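Read programmatically, the profiles of Figure 5 could be stored as plain dictionaries and passed to the salience computation sketched earlier; the numeric values mirror our reconstruction of the table, with 1.0 as a placeholder for the "var" entries that the system tunes on-line.

    # Task-dependent weight profiles (cf. Figure 5); "var" entries
    # are tuned on-line, 1.0 is only a placeholder default.
    WEIGHTS = {
        "monitor": {"D": 1, "M": 1, "I0": 1, "I1": 1, "I2": 1,
                    "E": 1.0, "P": 1, "T": 1.0, "H": 0},
        "reach":   {"D": 1, "M": 1, "I0": 1, "I1": 1, "I2": 1,
                    "E": 3, "P": 3, "T": 0, "H": 3},
        "search":  {"D": 1, "M": 1, "I0": 1, "I1": 1, "I2": 1,
                    "E": 0, "P": 0, "T": 3, "H": 0},
    }

    # Usage with the earlier sketch:
    #   s = salience_map(features, WEIGHTS["reach"])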

6 Experiments and results

We performed several experiments, using different tasks and weight tunings. In the case of a mainly bottom-up task, the attention mechanism can have a simple strategy to tell which are the relevant features to be enhanced. To exemplify this case, we defined a monitoring behavior, in which the system keeps an attentional map of the environment compatible with the current perception. As we have a previously constructed attentional map, a model for each region in this map can be used to define the weights. In fact, in a previous training phase, we tune low-level individual weights for each position in each feature map, for our robot to accomplish the desired behavior. The system uses the high-level weights shown in Figure 5.

Figure 6: Attentional map construction.

As a result of this strategy, the system keeps visiting all regions of the environment, eventually returning to previously visited regions and updating the map for changes that may occur. We note that a region is never visited twice in a row, but, depending on the weight function, a region may be revisited before all other regions are visited. Figure 7 shows an evaluation of one of the monitoring tasks, in which Roger visits the regions of its environment. The lines show, from bottom to top, the number of salient regions detected (lower), the number of patterns positively identified (middle), and the number of regions effectively inserted in Roger's attentional maps (top). We remark that a region is "turned off" by resetting the mapping (T_a) and interest (E_a) attentional features after each attentional visit, so they are variable (var) in Figure 5. Figure 8 shows some pictures selected from a sequence in which the simulated robot visits all regions in its environment while constructing its attentional maps. The associated high-level weights are shown in Figure 5. A similar practical result can be visualized in Figure 6, where the stereo-head robot changes its focus of attention for the construction of an attentional map of the environment (a typically bottom-up task).
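One plausible reading of this reset step in code (the neighborhood size is our assumption; T_a and E_a are NumPy arrays as in the earlier sketches):

    def turn_off_region(T_a, E_a, pos, radius=1):
        """Reset mapping (T_a) and interest (E_a) around a visited
        position, removing its salience until it becomes interesting
        again later."""
        r0, c0 = max(pos[0] - radius, 0), max(pos[1] - radius, 0)
        r1, c1 = pos[0] + radius + 1, pos[1] + radius + 1
        T_a[r0:r1, c0:c1] = 0
        E_a[r0:r1, c0:c1] = 0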

If the task is mainly top-down, the system will have a model of the region to be focused on. To exemplify, we devised a search task. Given a model, the system has to find a similar occurrence in the visual field. In this case, the low-level weights can be retrieved for all levels if they have previously been associated to each position of each attentional feature map in a learning phase. Then, after the weights are retrieved, the most probable positions in the attentional maps can be visited sequentially, allowing the system to confirm or else discard a representation of the searched model. In this case, we used a self-organizing map to classify the possible instances. The associated high-level weights for each map can be seen in Figure 5, and the internal weights are retrieved on-line, according to the object model. In practice, the system computes the likelihood that an object model appears in each position of the image at a coarse level, according to weights retrieved from the associated learned model.
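The paper does not spell out this likelihood computation; one simple realization of a "matrix correlation" over a coarse level is normalized cross-correlation of the model template at every position, as sketched below (function name and normalization are our assumptions).

    import numpy as np

    def likelihood_map(coarse_image, template):
        """Normalized cross-correlation of a model template at every
        position of a coarse image; higher values mark more probable
        instances of the model."""
        th, tw = template.shape
        t = (template - template.mean()) / (template.std() + 1e-9)
        h, w = coarse_image.shape
        out = np.zeros((h - th + 1, w - tw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = coarse_image[i:i + th, j:j + tw]
                p = (patch - patch.mean()) / (patch.std() + 1e-9)
                out[i, j] = (p * t).mean()
        return out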

Figure 7: Overall evaluation

Figure 8: Roger constructing attentional maps.

In the sequence illustrated in Figure 9, the system uses this top-down search in its perceptual buffer to locate the object (a chair). Then, our simulator moves its right arm to reach and grasp the chair. We call attention to the retuning of the weights according to this new task (reaching), shown in Figure 5.

Figure 9: Roger reaching an object.

We finally show an example in which the virtual human avoids an obstacle detected by using bottom-up attention (it has a simplified visual system which simulates visual sensory information in its head). When the agent walks over a virtual scenario with an obstacle, a collision is expected to occur (Figure 10). Using this new perceptual information, the robotic system of this agent detects the obstacle position, and its internal motion engine can calculate the necessary warp factor that must be applied to a group of joints so as to avoid the collision with the obstacle (Figure 11). The associated weights are shown in Figure 5. In this case, internal low-level weights are not set, since the system does not have an a priori model of the stimulus. So, all positions in each attentional map have the same weight, but we can see that disparity will be strongly activated. Also, in this task we use the coarsest resolution of the MRMF for detecting the obstacles.

7 Conclusion and future work

We have built useful mechanisms involving feature integration for attentional control, currently used by a multi-modal sensory system in the execution of different tasks. Although only visual and haptic information is used in this work, similar strategies can be applied to a more general system involving other kinds of sensory information, to provide a more discriminative feature set. We believe that the ability to change attentional focus is the basis not only for the tasks described, but also for other, more complex tasks involved in robot cognition. The algorithm presented can accomplish real-time performance due to the data reduction and abstraction performed. We believe that the main result obtained was the definition of a methodology that can be applied to different types of tasks involving attention control, without the need for strong adaptation, just by changing the weights and the set of operating controllers on the robot platforms.

Figure 10: Predicted motion

Figure 11: Agent avoiding the obstacle

An immediate extension of this work is to increase the feature space and/or the set of tasks that the agents can perform. We can also define other tasks based on the current feature space. Then, it would be possible to derive various policies, each one appropriate for a given task. Furthermore, we believe that tasks involving other behavioral aspects can also be accomplished by using the same architecture, but deriving other policies. In this way, an agent can learn and perform tasks without strong interaction with an operator, augmenting its autonomy. We are currently trying to improve the mechanism by developing a learning strategy to allow the robot to operate automatically, given a mission composed of tasks with several different aspects, that is, involving top-down, bottom-up, covert or overt attention. In this new strategy, given a set of tasks, the system can be rewarded for the detection of regions important to each one. Thus, another possibility for future work is to derive learning policies for general tasks considering these aspects. This approach can also be improved by setting variable weights according to the current phase of the task being executed. For example, in a shift of attention, the weights can change while the robot is moving. At each time step, the integration of attentional features can be computed for evaluation of the attentional process. Finally, another possibility for future work is to introduce, by software, a moving fovea representation (currently, our fovea is defined at the image center). The result would be a more versatile function for directing attention, perhaps closer to a biological model.

References

[1] A. Treisman, "Selective attention in man," British Medical Bulletin, 1964.

[2] A. Treisman, "Features and objects in visual processing," Scientific American, vol. 255, no. 5, 1986.

[3] J. K. Tsotsos, S. Culhane, W. Wai, Y. Lai, N. Davis, and F. Nuflo, "Modeling visual attention via selective tuning," Artificial Intelligence, vol. 78, no. 1-2, pp. 507-545, October 1995.

[4] P. van de Laar, T. Heskes, and S. Gielen, "Task-dependent learning of attention," Neural Networks, vol. 10, no. 6, pp. 981-992, August 1997.

[5] L. Itti, J. Braun, D. K. Lee, and C. Koch, "A model of early visual processing," in Proc. of NIPS'98, Cambridge, MA, 1998, pp. 173-179, MIT Press.

[6] N. Vandapel, P. Hebert, and R. Chatila, "Active and attentive vision for automatic natural landmark selection," Rapport LAAS-CNRS, October 1999.

[7] J. Luo and A. Singhal, "On measuring low-level saliency in photographic images," in Proc. of the IEEE CVPR, June 13-15, 2000, vol. 1, pp. 84-89, IEEE Computer Society Press.

[8] R. Milanese, S. Gil, and T. Pun, "Attentive mechanisms for dynamic and static scene analysis," Optical Engineering, vol. 34, no. 8, 1995.

[9] I. A. Rybak, V. I. Gusakova, A. V. Golovan, L. N. Podladchikova, and N. A. Shevtsova, "A model of attention-guided visual perception and recognition," Vision Research, vol. 38, no. 2, pp. 387-400, 1998.

[10] C. J. Westelius, Focus of Attention and Gaze Control for Robot Vision, Ph.D. thesis, Linköping University, S-581 83 Linköping, Sweden, 1995. Dissertation No. 379, ISBN 91-7871-530-X.

[11] S. M. Kosslyn, Image and Brain: The Resolution of the Imagery Debate, MIT Press, Cambridge, MA, 1994.

[12] C. Breazeal and B. Scassellati, "A context-dependent attention system for a social robot," in Proc. of the IJCAI, Stockholm, Sweden, July 31 - August 6, 1999, pp. 1146-1151.

[13] L. M. Garcia, R. Grupen, A. Oliveira, D. Wheeler, and A. Fagg, "Tracing patterns and attention: Humanoid robot cognition," IEEE Intelligent Systems, vol. 15, no. 4, pp. 70-77, July/August 2000.

[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 1998.
