Object Recognition for Home Robots with Cortex like Features


Source: ceit.aut.ac.ir/~shiry/publications/rios09_Cortex.pdf

Abstract—One of the major challenges for home robots is to recognize objects under different transformations and viewpoints. In this paper, we propose a hierarchical neural network for object recognition inspired by two properties of neurons in the visual cortex: the invariance of their responses to stimulus transformations and their coding of efficient features in images. To achieve the first goal, we used the trace learning rule to develop neurons whose responses are invariant to stimulus transformations. For the second goal, we used a model of hierarchical redundancy reduction to extract the most efficient features from images. This hierarchical redundancy reduction uses signals from surrounding neurons in each layer to provide the next layer with the most informative features. A set of experiments performed on the coil100 dataset and on custom images reveals the high performance of the proposed model for object recognition.

I. INTRODUCTION

Some applications require robots to deal with different objects from different viewpoints. For example, the robot must locate and recognize objects requested by people, such as bringing a dish from the kitchen, locating a water jar, or closing the door. Such applications require the robot to recognize objects under different viewpoints and backgrounds. Robust object recognition is therefore a principal necessity for home robots. In general, no complete solution is available to the problem of object recognition. Several methods have been proposed that can recognize objects with high accuracy. However, these methods are limited to a narrow range of conditions: the number of objects must be kept small, the number of training examples must be large, lighting and viewpoint should not change, etc. In contrast, the ability of humans at object recognition is supreme compared to the most successful methods in machine vision. Humans can recognize objects across different viewpoints, distances, lighting conditions and transformations (affine, 3D).

The structure and function of the primate visual cortex have inspired many models and algorithms in the field of machine vision. Starting with the paper of Hubel and Wiesel in 1962 [1], there has been a large growth in studies of neuronal selectivity in the visual cortex. These studies have revealed a wealth of information about the processing of images in the cortex. It is now widely believed that the object recognition pathway has a hierarchical structure along which neurons extract features of increasing complexity [2]. Neurons in the primary areas of this hierarchy detect contrast changes at small spatial scales, while neurons in higher areas detect complex shapes such as hands, faces or objects [1], [3]. Several authors have proposed that neurons in the visual cortex extract features which are statistically optimal for object recognition [4]-[5].

In this paper, we propose a hierarchical model for object recognition in robots based on the structure and neuronal properties of the primate visual cortex. The two main properties of the proposed model are the invariance of its neuronal responses to object transformations and their selectivity to statistically efficient features. One important property of neurons in the visual cortex is that their responses are invariant to object transformations [6]-[7]. The trace learning rule has been proposed to provide such invariant selectivity for neurons. We use this learning rule in a hierarchical architecture to provide selectivity that is invariant to 3D object rotations. Another neuronal property in the visual cortex is the coding of locally optimal features for object recognition [4]-[5], [8]. We train the horizontal connections between the neurons within each layer to code the efficient features of that layer. The results of our experiments demonstrate the ability of this model to perform robust object recognition under a variety of transformations.

II. LITERATURE REVIEW

Early models of object recognition used correlation-based template matching to detect objects in images [9]. However, these models are very sensitive to object transformations and can be used only in controlled environments. To overcome this shortcoming, component-based models have been proposed that extract features from objects and use them for recognition [10]-[12]. For these models, a trade-off between selectivity and invariance is unavoidable: the generality of components may help in recognizing objects under transformations, but it reduces accuracy when different objects have to be distinguished. For example, histogram-based models are very robust to object transformations but lack specificity [13]. At the opposite extreme, components such as view-specific grayscale patches of objects cannot be used to recognize transformed versions of the same object [10], [12].

Hierarchical models inspired by the visual cortex have been used for transformation-invariant object recognition. One of the earliest hierarchical models is the Neocognitron, which was designed to perform position-invariant object recognition [14]. Recent hierarchical models have been used to perform face and car recognition with high accuracy [10], [12]. These models are based on the idea of Perrett and Oram, who proposed that transformation invariance can be achieved by pooling over units that are tuned to different views of the same feature [15]. Convolutional neural networks are a subclass of hierarchical methods that have shown high performance in face and generic object recognition [16]-[17].

The models mentioned above lack biological plausibility in various underlying mechanisms; for example, weight sharing has not been observed among neurons in the brain.



There are other models that use more biologically plausible operations to perform object recognition and predict the properties of neurons in the visual cortex. A series of VisNet models has been proposed that perform object recognition under translation, transformation and lighting variations [18]-[20]. LISSOM is a set of hierarchically connected areas that utilize long-range inhibitory and excitatory connections to produce selectivity and recognition similar to those of neurons in the inferotemporal cortex [21]. Serre and Poggio introduced a hierarchical model based on the HMax model that could perform object recognition in cluttered environments with high accuracy [22]. The main characteristic of their model is the alternation of layers of simple and complex cells, which provide recognition and invariance respectively.

In this paper, we propose a model for object recognition with emphasis on the issues of invariant object recognition and efficient feature coding. Compared to LISSOM and VisNet, our model uses efficient coding in each layer to extract the most informative features from images, which results in higher recognition rates. In the next section, we describe the properties of the proposed model.

III. THE PROPOSED MODEL

Our model is composed of three layers corresponding to areas V1, V2 and V4 of the visual cortex. These areas are organized in a hierarchical manner, as shown in Fig. 1. Units in each layer receive input from neurons in the previous layer via bottom-up connections and from neurons in the same layer through horizontal connections. Units in the first layer receive bottom-up input from a small region of the input image. Bottom-up connections provide excitatory inputs to the units and determine their selectivity. Incoming signals through the horizontal connections are inhibitory and facilitate the extraction of optimal features from images. Neurons in the first layer are modeled by Gabor filters with different orientations:

F(x, y) = \exp\left( -\frac{x_0^2 + \gamma^2 y_0^2}{2\sigma^2} \right) \cos\left( \frac{2\pi x_0}{\lambda} \right)    (1)

x_0 = x\cos\theta + y\sin\theta, \qquad y_0 = -x\sin\theta + y\cos\theta

where x and y are the 2D coordinates of the filter in the image, γ is the aspect ratio, θ is the orientation, σ is the effective width and λ is the wavelength of the filter. We used a set of filters with 6 different orientations and only one spatial frequency (Fig. 2).
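The Gabor filter bank of (1) can be sketched in a few lines of NumPy. Only the six orientations (0° to 150° in 30° steps) and the 7×7 size (the V1 bottom-up RF of Table I) follow the text; the σ, γ and λ values below are illustrative assumptions, not parameters reported in the paper.

```python
import numpy as np

def gabor_filter(size, theta, sigma=3.0, gamma=0.5, lam=6.0):
    """Gabor filter of Eq. (1): Gaussian envelope times a cosine carrier.
    sigma, gamma and lam are illustrative values, not taken from the paper."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x0 = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
    y0 = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x0**2 + gamma**2 * y0**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x0 / lam)
    return envelope * carrier

# Six orientations from 0 to 150 degrees in 30-degree steps, one spatial frequency
thetas = np.deg2rad(np.arange(0, 180, 30))
bank = [gabor_filter(7, t) for t in thetas]
```

Each filter peaks at its center (envelope and carrier are both 1 there), and the bank responds most strongly to edges aligned with each filter's orientation.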

For neurons in areas V2 and V4, bottom-up activities are calculated as a weighted sum of the responses of neurons in their receptive fields in the previous layer:

y_i^{bup} = \sum_{j \in RF_i} w_{ij}^{bup} x_j    (2)

where y_i^{bup} is the bottom-up activity of neuron i, RF_i is the receptive field of neuron i in the previous layer, and x_j is the response of a neuron in RF_i. Table I displays the parameters of the proposed model.
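The weighted sum of (2) can be sketched as a sliding-window operation. The stride-1, valid-border geometry below is a simplifying assumption for illustration (the paper does not state how layer borders are handled), and a single shared weight array is used only to keep the example short.

```python
import numpy as np

def bottom_up_activity(prev_layer, weights, rf_size):
    """Eq. (2): each unit's bottom-up activity is the weighted sum of the
    previous-layer responses inside its receptive field.
    Stride-1, valid borders and shared weights are simplifying assumptions."""
    h, w = prev_layer.shape
    out_h, out_w = h - rf_size + 1, w - rf_size + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = prev_layer[i:i + rf_size, j:j + rf_size]
            y[i, j] = np.sum(weights * patch)    # sum over j in RF_i of w_ij * x_j
    return y
```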

A. Learning Invariant Selectivity with the Trace Learning Rule

The size of the receptive fields of neurons along the ventral visual pathway gradually increases, and their preferred stimuli become more complex [2]. According to different electrophysiological experiments, neurons in V2 and V4 exhibit invariant selectivity to different transformations of

Fig. 1. Hierarchical organization of the proposed model. Neurons in the first layer are Gabor filters with orientations ranging from 0° to 150° with 30° steps. Neurons in each layer receive bottom-up input from the previous layer and lateral input from neurons in the same layer.

Fig. 2. Set of Gabor filters with 6 different orientations used to model V1 neurons.

TABLE I
MODEL PARAMETERS

Layer   Dimensions   Bottom-up RF   Horizontal connections
V1      128×128      7×7            21×21
V2      104×104      21×21          21×21
V4      80×80        41×41          21×21


their preferred stimuli [6]-[7]. It has been suggested that fully invariant selectivity can be developed in IT neurons from the partially invariant responses of neurons in areas V2 and V4 [23]. A neural mechanism has been proposed to achieve response invariance to transformations of preferred stimuli [24]. This mechanism is based on the observation that activated neurons retain their activity level for hundreds of milliseconds. Fig. 3 describes this mechanism. The basic idea is that the persistent activity level of a neuron helps strengthen the connections between this neuron and the neurons that represent the transformed versions of its preferred stimuli (see the legend of Fig. 3).

Of special interest in this paper is the trace rule proposed by Földiák to learn invariant selectivity of neurons to spatial and temporal transformations of objects [25]. This rule extends the simple Hebbian learning rule to include a trace of the previous responses of a neuron when updating its weights:

\Delta w_{ij}^t = \lambda \bar{y}_j^t \left( x_i^t - w_{ij}^{t-1} \right)    (3)

In this equation, \bar{y}_j^t is the trace of neuron j at iteration t, w_{ij} is the weight of the connection from input i to neuron j, x_i^t is input i at iteration t, and \lambda is the learning rate. The term -\lambda \bar{y}_j^t w_{ij}^{t-1} is included in (3) to avoid unlimited growth of the connection weights. The trace value of a neuron is calculated using (4):

\bar{y}_j^t = (1 - \eta)\, y_j^t + \eta\, \bar{y}_j^{t-1}    (4)

where \eta is the trace constant, selected in the interval [0, 1], and y_j^t is the output of neuron j at iteration t.

The trace learning rule is applied to develop the selectivity of neurons to their bottom-up input. The activity that each neuron receives from its bottom-up connections is calculated as a weighted sum of the activities of the neurons in its receptive field in the previous layer. We used a subset of images from the coil100 dataset [33] to train the connectivity between the layers of the model. The coil100 dataset contains images of objects from different viewpoints and is therefore appropriate for training the invariant selectivity of the neurons in the model.
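A single update of the trace rule in (3)-(4) can be sketched as follows. The learning rate λ and trace constant η below are illustrative values; the paper does not report the ones it used.

```python
import numpy as np

def trace_rule_step(w, x, y_trace_prev, y_now, lam=0.05, eta=0.8):
    """One update of Foldiak's trace rule, Eqs. (3)-(4).
    lam (learning rate) and eta (trace constant) are illustrative values."""
    y_trace = (1.0 - eta) * y_now + eta * y_trace_prev     # Eq. (4): exponential trace
    w_new = w + lam * y_trace * (x - w)                    # Eq. (3): the -w term bounds the weights
    return w_new, y_trace
```

Because the trace decays slowly (η close to 1), inputs arriving shortly after a neuron fires — i.e., transformed views of the same stimulus — still strengthen its weights, which is what produces invariant selectivity.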

B. Redundancy Reduction

For an object recognition system to achieve high recognition rates, the efficiency of the features extracted from images is very important. A model that only tries to achieve invariant responses cannot attain high recognition rates over different objects. One must use features with high information content that can discriminate between objects. Redundancy reduction becomes an important issue when dealing with different objects from different categories. In the proposed model, we used a redundancy reduction mechanism that is based on a statistical model of natural images. This mechanism is applied in the different layers to provide a set of globally optimal features for object recognition.

Early efforts to reduce redundancy in natural images revealed that linear filters such as Gabor or wavelet filters are optimized feature detectors, as their responses are uncorrelated [26]-[27]. However, dependencies in natural images are nonlinear, and independence cannot be achieved with a linear transformation. Studies on the statistics of natural images discovered a nonlinear variance dependency between the responses of linear filters [28]-[29]. Wainwright and Simoncelli proposed the Gaussian scale mixtures model to provide an explanation for this variance dependency [30]. Schwartz and Simoncelli used divisive normalization to produce independent responses over a set of patches selected from natural images [31]. This model was extended to a hierarchical architecture and simulated the properties of neurons in the higher-order visual area V2 [32]. We adapt this model in the proposed hierarchy to extract the most efficient features from natural images.

Neurons in each layer of the proposed model are connected to a set of neighboring neurons through horizontal connections. The weights of these connections are learned so as to predict the variance of the responses of each neuron:

\mathrm{var}\left( L_x \mid L_y,\; y \in C_x \right) = \sum_{y \in C_x} w_{yx} L_y^2 + \sigma_x^2    (5)

where L_x and L_y are the responses of neurons x and y respectively, w_{yx} is the weight of the connection between them, C_x is the neighboring region around x, and \sigma_x^2 is the part of the variance of neuron x that is independent of the other neurons. The response of the neuron is then divided by its variance to remove the variance dependency. This mechanism is repeated in the different layers of the model to produce neuronal responses that are globally independent:

R_x = \frac{L_x^2}{\sum_{y \in C_x} w_{yx} L_y^2 + \sigma_x^2}    (6)
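The divisive normalization of (5)-(6) can be sketched for a small vector of responses. Treating every other unit as the neighborhood C_x is a simplification for illustration; in the model, C_x is the 21×21 horizontal-connection region of Table I.

```python
import numpy as np

def divisive_normalization(L, weights, sigma2):
    """Eqs. (5)-(6): the squared response of each unit is divided by the
    variance predicted from its neighbours' squared responses.
    Here C_x is simply every other unit, a simplification for illustration."""
    n = len(L)
    R = np.empty(n)
    for x in range(n):
        neighbours = [y for y in range(n) if y != x]
        predicted_var = sigma2[x] + sum(weights[y, x] * L[y]**2 for y in neighbours)
        R[x] = L[x]**2 / predicted_var    # Eq. (6)
    return R
```

A unit surrounded by strongly active neighbours is suppressed, so only responses that carry information not already present in the neighbourhood survive — the redundancy-reduction effect described in the text.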

Fig. 3. The idea of continuous transformation. (a) The neuron shown in blue is activated by a set of inputs from the green neurons, and the weights of the connections between the blue and green neurons are established. (b) As the stimulus transforms, the set of active neurons in the lower layer changes. Because the blue neuron preserves its activity level for a short time, it can establish connections with the newly activated green neurons that represent the transformed version of the stimulus.


The result of (6) is the independent response R_x of filter x. The weights of the horizontal connections are learned in a training procedure after the bottom-up connections to the layer have been established. The goal is to predict the variance of the responses of each neuron from the responses of its neighboring neurons. An unbiased estimate of the variance of a neuron's responses is:

\mu_x = \frac{1}{N} \sum_{t=1}^{N} L_x^t, \qquad \sigma_x^2 = \frac{1}{N-1} \sum_{t=1}^{N} \left( L_x^t - \mu_x \right)^2    (7)

An iterative form of the above equations is:

\mu_x^t = \mu_x^{t-1} \left( 1 - \frac{1}{t} \right) + \frac{L_x^t}{t}, \qquad \left( \sigma_x^t \right)^2 = \left( \sigma_x^{t-1} \right)^2 \left( 1 - \frac{1}{t-1} \right) + \frac{\left( L_x^t - \mu_x^t \right)^2}{t-1}    (8)
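The iterative estimates of (8) can be sketched directly. Note that the mean recursion reproduces the batch mean exactly, while the variance recursion is the paper's online approximation and does not in general equal the batch unbiased estimate of (7).

```python
def running_mean_var(samples):
    """Iterative mean/variance estimates in the form of Eq. (8).
    The mean is exact; the variance recursion is an online approximation."""
    mu, var = 0.0, 0.0
    for t, L in enumerate(samples, start=1):
        mu = mu * (1 - 1 / t) + L / t                              # running mean
        if t > 1:
            var = var * (1 - 1 / (t - 1)) + (L - mu)**2 / (t - 1)  # running variance
    return mu, var
```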

We used gradient descent to minimize the mean squared error between the variance estimates obtained from (5) and (8). The update rules for the horizontal connections are as follows:

E = \frac{1}{N} \sum_{i=1}^{N} \left( \mathrm{var}(L_x)_i - \widehat{\mathrm{var}}(L_x)_i \right)^2

\frac{\partial E}{\partial w_{yx}} = -\frac{2}{N} \sum_{i=1}^{N} \left( \mathrm{var}(L_x)_i - \widehat{\mathrm{var}}(L_x)_i \right) \left( L_y^i \right)^2, \qquad w_{yx}^t = w_{yx}^{t-1} - \eta\, \frac{\partial E}{\partial w_{yx}}    (9)

\frac{\partial E}{\partial \sigma_x^2} = -\frac{2}{N} \sum_{i=1}^{N} \left( \mathrm{var}(L_x)_i - \widehat{\mathrm{var}}(L_x)_i \right), \qquad \left( \sigma_x^2 \right)^t = \left( \sigma_x^2 \right)^{t-1} - \eta\, \frac{\partial E}{\partial \sigma_x^2}    (10)

where \mathrm{var}(L_x)_i is the iterative variance estimate of (8) for sample i, \widehat{\mathrm{var}}(L_x)_i is the variance predicted by (5), and \eta is the learning rate.
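A single gradient step on w_{yx} and σ_x² in the spirit of (9)-(10) can be sketched as below. The learning rate η, the scalar σ_x², and the use of a plain batch variance as the target (instead of the iterative estimate of (8)) are simplifying assumptions for illustration.

```python
import numpy as np

def update_horizontal_weights(w, sigma2, L_samples, x, eta=0.01):
    """One gradient-descent step on the MSE between the variance predicted
    by Eq. (5) and an empirical variance target, following Eqs. (9)-(10).
    L_samples: responses of shape (N, n_units); x: index of the target unit.
    eta and the batch-variance target are illustrative assumptions."""
    N, n = L_samples.shape
    target_var = np.var(L_samples[:, x], ddof=1)   # empirical variance of unit x
    grad_w = np.zeros(n)
    grad_s = 0.0
    for i in range(N):
        predicted = sigma2 + sum(w[y] * L_samples[i, y]**2 for y in range(n) if y != x)
        err = predicted - target_var
        for y in range(n):
            if y != x:
                grad_w[y] += (2.0 / N) * err * L_samples[i, y]**2   # dE/dw_yx
        grad_s += (2.0 / N) * err                                    # dE/dsigma_x^2
    return w - eta * grad_w, sigma2 - eta * grad_s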

IV. EXPERIMENTAL RESULTS

We performed a set of experiments to evaluate the performance of the proposed model for object recognition in general, under transformations, and with different backgrounds. We used the coil100 dataset, which contains images of different objects under different viewpoints, to measure the generalization of the proposed model. The results are shown in Table II for objects that were not used in training the network. It can be seen that the proposed model is superior to the other models reported in [21].

The images in the coil100 dataset do not challenge the ability of the proposed model to recognize objects under different lighting conditions or background textures. We therefore generated a set of images of 3 objects (Fig. 4) at different distances, viewpoints and backgrounds to examine the model. Samples for one object are shown in Fig. 5. For each object, we used two thirds of the images for training and one third for testing. In the first experiment, the training images were selected from all backgrounds. The results are shown in the second column of Table III. In the second experiment, we used objects with two backgrounds for training, and the third background was used only for testing. The results are shown in the third column of Table III. It can be seen that the performance of the model is on average above 95%.

V. CONCLUSION

In this paper, we proposed a hierarchical model for object recognition based on well-known properties of neurons in the primate visual cortex. Two key characteristics of the proposed model are the invariance of its neuronal responses with respect to object transformations and the extraction of efficient features from images, both of which enhance object recognition accuracy. We used the trace rule to develop neurons with responses that are invariant to object transformations. To increase recognition accuracy, we used a model of redundancy reduction previously used to develop neurons with selectivities similar to those of neurons in the visual cortex. We examined the performance of the model on a set of images of objects from different viewpoints and backgrounds. On average, the recognition rate was higher than 95% on coil100 and on the custom images.

Previous studies on redundancy reduction tried to develop neuronal selectivities similar to those of the primary visual cortex. In a recent study, redundancy reduction was extended to simulate the selectivities of extrastriate neurons. In this paper, we proposed a hierarchical architecture that performs object recognition with features similar to the selectivities of neurons in V2 and V4. Neurons in these areas exhibit some degree of response invariance to stimulus transformations. Previous studies reported high recognition rates with features inspired by the visual cortex. In this paper, we provided a more biologically plausible mechanism for developing neurons with properties similar to those of visual cortex neurons.

TABLE II
RECOGNITION RATE OF THE PROPOSED MODEL ON THE COIL100 DATASET COMPARED TO THE LISSOM2 AND SNOW METHODS

# of views   SNoW (one against all)   LISSOM2   Our model
4            0.760                    0.798     0.8887
8            0.819                    0.816     0.9173
18           0.913                    0.845     0.9337

Fig. 4. Three objects used to generate a set of images with different view-points, backgrounds and distances.


REFERENCES

[1] D. H. Hubel, T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat's visual cortex,” J. Physiology, vol. 160, pp. 106–154, 1962.

[2] D. C. Van Essen, J. H. Maunsell, “Hierarchical organization and functional streams in the visual cortex,” Trends. Neuroscience, vol. 6, pp. 370–375, 1983.

[3] K. Tanaka, “Inferotemporal cortex and object vision,” Ann. Rev. Neuroscience, vol. 19, pp. 109–139, 1994.

[4] J. Atick, “Could information theory provide an ecological theory of sensory processing?” Network: Computation in Neural Systems, vol. 3, pp. 213–251, 1992.

[5] H. B. Barlow, “Possible principles underlying the transformation of sensory messages,” Sensory Communication. ed. WA Rosenblith, pp. 217–234, 1961.

[6] R. Desimone, S. J. Schein, “Visual properties of neurons in area V4 of the macaque: sensitivity to stimulus form,” Journal of Neurophysiology, vol. 57, Issue 3, pp. 835-868, 1987.

[7] E. Kobatake, K. Tanaka, “Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex,” J. Neurophysiology, vol. 71, pp. 856–867, 1994.

[8] S. Laughlin, “A simple coding procedure enhances a neuron's information capacity,” Z. Naturforsch, vol.36, pp. 910–912, 1981.

[9] R. Vaillant, C. Monrocq, Y. Le Cun, “An original approach for the localization of objects in images,” In International Conference on Artificial Neural Networks, pp. 26–30, 1993.

[10] B. Heisele, T. Serre, M. Pontil, T.Vetter, T. Poggio, “Categorization by Learning and Combining Object Parts,” Advances in Neural Information Processing Systems, vol. 14, 2002.

[11] S. Ullman, M. Vidal-Naquet, E. Sali, “Visual Features of Intermediate Complexity and Their Use in Classification,” Nature Neuroscience, vol. 5, no. 7, pp. 682-687, 2002.

[12] T. K. Leung, M. C. Burl, P. Perona. “Finding faces in cluttered scenes using random labeled graph matching,” In Proc. International Conference on Computer Vision, pp. 637–644, Cambridge, MA, 1995.

[13] D.G. Lowe, “Object Recognition from Local Scale-Invariant Features,” Proc. Int’l Conf. Computer Vision, pp. 1150-1157, 1999.

[14] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.

[15] D. Perrett, M. Oram, “Neurophysiology of shape processing,” Imaging Vis. Comput., vol. 11, pp. 317–333, 1993.

[16] Y. LeCun, F.J. Huang, L. Bottou, “Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.

[17] S. Chopra, R. Hadsell, Y. LeCun, “Learning a Similarity Metric Discriminatively, with Application to Face Verification,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.

[18] E. T. Rolls, T. Milward, “A model of invariant object recognition in the visual system: learning rules, activation functions, lateral inhibition, and information-based performance measures,” Neural Computation, vol. 12, pp. 2547–2572, 2000.

[19] E. T. Rolls, S. M. Stringer, “Invariant visual object recognition: A model, with lighting invariance,” Journal of Physiology – Paris, vol. 100, pp. 43–62, 2006.

[20] G. Perry, E. T. Rolls, S. M. Stringer, “Spatial vs temporal continuity in view invariant visual object recognition learning,” Vision Research, vol. 46, pp. 3994–4006, 2006.

[21] A. Plebe, R. G. Domenella, “Object recognition by artificial cortical maps,” Neural Networks, vol. 20, pp. 763–780, 2007.

[22] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio, “Robust Object Recognition with Cortex-Like Mechanisms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, 2007.

[23] M. Riesenhuber, T. Poggio, “Hierarchical models of object recognition in cortex,” Nature Neuroscience, vol. 2, pp. 1019–1025, 1999.

[24] S. M. Stringer, G. Perry, E. T. Rolls, J. H. Proske, “Learning invariant object recognition in the visual system with continuous transformations,” Biological Cybernetics, vol. 94, pp. 128–142, 2006.

[25] P. Földiák, “Learning Invariance from Transformation Sequences,” Neural Computation, vol. 3, no. 2, pp. 194–200, 1991.

[26] B. A. Olshausen, D. J. Field, “Emergence of simple cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, pp. 607–609, 1996.

[27] A. J. Bell, T. J. Sejnowski, “The ‘independent components’ of natural scenes are edge filters,” Vision Research, vol. 37, no. 23, pp. 3327–3338, 1997.

[28] C. Zetzsche, B. Wegmann, E. Barth, “Nonlinear Aspects of Primary Vision: Entropy Reduction Beyond Decorrelation,” In: Morreale, J. (Ed.), Proceedings of the SID, Society for Information Display, Playa del Ray, CA. pp. 933–936, 1993.

[29] E. P. Simoncelli, “Statistical Models for Images: Compression, Restoration and Synthesis,” In 31st Asilomar Conf on Signals, Systems and Computers, 1997.

[30] M. J. Wainwright, E. P. Simoncelli, “Scale Mixtures of Gaussians and the Statistics of Natural Images,” Advances in Neural Information Processing Systems, vol. 12, pp. 855–861, 2000.

[31] O. Schwartz, E. P. Simoncelli, “Natural signal statistics and sensory gain control,” Nature Neuroscience, vol. 4, pp. 819–825, 2001.

[32] M. Malmir, S. Shiry, “A Model of Angle Selectivity in Area V2 with Local Divisive Normalization”, IEEE Symposium Series on Computational Intelligence, In Press, Nashville, Tennessee, USA, 2009.

[33] COIL-100 dataset, http://vision.ai.uiuc.edu/mhyang/object-recognition.html

Fig. 5. Samples of images of object 1 from different view-points, distances and backgrounds.

TABLE III
RECOGNITION RATES OF THE MODEL FOR THE THREE OBJECTS OF FIG. 4

Object                   Train and test with different backgrounds   Train with two backgrounds and test with the third
Object 1 (doll)          0.957                                       0.956
Object 2 (shampoo)       0.916                                       0.975
Object 3 (coffee mate)   0.975                                       0.963