
Feature extraction inspired by visual cortex mechanisms

Xing Du, Weiguo Gong, Weihong Li

Key Lab of Optoelectronic Technology and System of Ministry of Education, Chongqing University, Chongqing, China 400044*

ABSTRACT

Motivated by the mechanisms of mammalian primary visual cortex (V1), we propose a hierarchical model of feature extraction for object recognition. The proposed model consists of two layers, which emulate the functions of V1 simple cells and complex cells respectively. Filters learned from training images are applied at every position of the input image to get an edge feature representation. A maximum pooling operation is then applied to increase the shift invariance of the feature. Experiments on face recognition and crop-wise object recognition show that our model is competitive with a state-of-the-art biologically inspired method.

Keywords: Visual cortex, Object recognition, Feature extraction

1. INTRODUCTION

Humans and primates can effortlessly recognize objects from among tens of thousands of possibilities within a fraction of a second, in spite of tremendous variation in the appearance of each one. However, recognizing objects in a natural scene remains a tough challenge for computer vision. Researchers have long pursued methods that emulate visual cortex mechanisms to overcome these difficulties. Most research has focused on the use of Gabor functions to imitate the simple cell receptive field [1, 2, 3]. Recently, Serre et al. proposed a feature extraction model [4] based on a quantitative theory of the ventral stream of visual cortex [5]. This four-layer hierarchical model extracts a set of scale- and shift-invariant features that perform excellently in a variety of recognition tasks. Inspired by this model, we propose a feature extraction method similar to the first two layers of that model.

Serre’s multi-layer model, each stage of which increases the scale and position tolerance of the feature, is good at dealing with an object’s large scale and shift variance in a complex scene. However, the processing in each layer discards some information about the object while increasing the invariance of the feature. Among the vast variety of recognition tasks, there is a class of tasks in which the object is cropped and has limited scale and shift variance (e.g., face recognition). In this situation, a simplified model that simulates only the first stages of visual processing in cortex, rather than the whole ventral stream, may perform better, since fewer processing stages cause less information loss. Therefore, our feature extraction model simulates only the function of cells in V1 to cope with crop-wise object recognition.

Studies in neurobiology have shown that plasticity and learning exist in visual cortex [6, 7, 8]. Several studies have proposed the theory that the visual cortex contains a probabilistic model of visual environments, and that the activities of neurons represent images in terms of this model [9, 10]. However, most biologically inspired computer vision methods [4, 11] simply take Gabor filters to model the receptive fields (RFs) of V1 simple cells. Although Gabor filters can characterize the spatial localization, orientation, and frequency bandpass properties [12, 13] of the simple cells, they have some drawbacks. Gabor filters describe the simple cells only qualitatively, and in practical applications it is difficult to set the parameters of the filters. Most of the time, the parameters are set according to biological experiments [4, 14]. This leads to the problem that the filters are fixed and cannot adapt to the different image sets encountered in different applications. Theories that learn the RFs of simple cells from natural images have been proposed [15, 16, 17]. However, these methods cannot be used in object recognition systems directly, for several reasons. First, as the RF of a simple cell corresponds to a small part rather than the whole of an image, the filter representing the RF is learned from image patches, and merely reflects some local structure shared by the training images.

Further author information: (Send correspondence to Weiguo Gong)

Weiguo Gong: E-mail: [email protected], Telephone: +86 23 65112779

Second International Conference on Digital Image Processing, edited by Kamaruzaman Jusoff, Yi Xie, Proc. of SPIE Vol. 7546, 75460M · © 2010 SPIE · CCC code: 0277-786X/10/$18 · doi: 10.1117/12.852798



Second, the learning algorithms produce a large number of filters, and it is impossible to employ so many filters in a practical vision system due to limited computational capacity. Some studies have extended these algorithms to process the whole image [18, 19], but the extended algorithms can only handle images of small size, because, like their original counterparts, they are computationally expensive. Moreover, from the perspective of biology, these methods should not be interpreted as simulations of the functions of V1 simple cells, but of higher-level cells, which have wider and more complex RFs.

In our model, we first train a group of filters representing the RFs of V1 simple cells by the sparse coding method of [10]. Then a small subset of the trained filters is selected to extract the feature of an input image. The experiments show that the learned filters perform better than Gabor filters with fixed parameters.

2. THE FEATURE EXTRACTION MODEL

The presented model, which focuses on the information processing mechanisms in V1, is a partial implementation of the “standard model” [4], which follows a theory of the feed-forward path of object recognition in cortex [5]. As shown in Figure 1, our model consists of two layers of computational units. The first layer, called the S layer, applies a tuning function to the input image to obtain a feature with selectivity. The second layer, called the C layer, pools its input through a maximum operation so as to increase the invariance of the feature.

Figure 1. Structure of the feature extraction model

In the S layer, a gray image is analyzed by a group of filters that emulate the properties of classical simple cells found in the primary visual cortex [12, 13]. Simple cells are often described by multi-scale, multi-orientation Gabor filters, and with properly set parameters, Gabor filters model the RFs of simple cells well [4, 14]. However, the parameters are set according to findings of biological experiments, and once set they remain fixed across all kinds of recognition tasks. To avoid using filters with fixed parameters, we use a series of filters learned from training images collected from the images to be processed in the specific task. We adopt the sparse coding strategy [10], which has been proved a good model for learning the RFs of V1 simple cells, to learn a group of filters. Then a subset of the filters that best captures the structure of the images is selected. Details of how the filters are learned and selected are discussed in the next section. Supposing the filters have already been obtained, the response of an image patch P(x, y) to a particular filter Fi(x, y) is given by:

$$S(P, F_i) = \frac{\sum_{x,y} P(x,y)\, F_i(x,y)}{\sqrt{\sum_{x,y} P(x,y)^{2}}} \qquad (1)$$

When P runs over the entire image, the image is normalized and filtered by Fi. This layer consists of linear units which model the linearity of simple cells.
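As an illustration, the S layer can be sketched in a few lines of NumPy. This is our own minimal rendering of Eq. (1), not the authors' code; the helper name s_layer and the array layout (filters stacked as (n_filters, 8, 8)) are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def s_layer(image, filters):
    """S-layer response, Eq. (1): normalized dot product between every
    patch of the input image and each learned filter.
    image: 2-D gray image; filters: array of shape (n_filters, h, w)."""
    n, h, w = filters.shape
    patches = sliding_window_view(image, (h, w))      # (H-h+1, W-w+1, h, w)
    flat = patches.reshape(*patches.shape[:2], -1)    # flatten each patch
    norms = np.sqrt((flat ** 2).sum(axis=-1)) + 1e-8  # patch L2 norms
    fbank = filters.reshape(n, -1)                    # (n_filters, h*w)
    # one response plane per filter: (n_filters, H-h+1, W-w+1)
    return np.einsum('xyk,nk->nxy', flat, fbank) / norms[None]
```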

The next layer, the C layer, acts as a nonlinear system with properties similar to those of cortical complex cells. The C layer pools over units from the previous layer:


$$C(k, F_i) = \max_{P \in D_k} \left| S(P, F_i) \right| \qquad (2)$$

The maximum operation over a local region Dk, whose size is identical to that of the filter in the S layer, takes the strongest response of the previous layer in this area to obtain a feature with increased position tolerance. The maximum operation scans the S feature plane in steps of half the filter size. This operation subsamples the feature and, at the same time, reflects the fact that a complex cell has an RF twice as large as that of a simple cell.
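A matching sketch of the C layer follows, under the same assumed array layout as above; the absolute value and the half-filter-size stride follow the description in this section.

```python
import numpy as np

def c_layer(s_planes, filt_size=8):
    """C-layer response, Eq. (2): maximum of the absolute S response
    over regions D_k of the same size as the S filter, scanning in
    steps of half the filter size (so the feature is subsampled)."""
    step = filt_size // 2
    a = np.abs(s_planes)                  # absolute operation (see Sec. 2)
    _, H, W = a.shape
    rows = []
    for i in range(0, H - filt_size + 1, step):
        cols = []
        for j in range(0, W - filt_size + 1, step):
            cols.append(a[:, i:i + filt_size, j:j + filt_size].max(axis=(1, 2)))
        rows.append(cols)
    return np.array(rows).transpose(2, 0, 1)   # (n_filters, rows, cols)
```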

Compared with the “standard model” [4], the proposed model has the following traits:

Firstly, the model has only two layers that correspond to the first two layers of the “standard model”. As we consider the situation where objects have limited position and size variance, the simplified model is more appropriate.

Secondly, the filters in the S layer are of a single size. One reason for this simplification is the limited scale variance of the target. Another, more important, reason is that our filters are learned from training images rather than being Gabor filters with preset parameters. Because the learned filters adapt to the structures of the images, multi-scale filters are unnecessary. This simplification reduces the computational complexity effectively. As will be discussed in detail in the next section, we use 4 filters, each of size 8×8 pixels, in the S layer. In the “standard model”, by contrast, the corresponding layer has 16 scales of filters, with sizes ranging from 7×7 to 37×37 pixels, and each scale consists of 4 Gabor filters with different orientations. According to (1), the operation in the S layer is a convolution of the input image with each filter. Ignoring filter size, the computational cost of the “standard model” is therefore 16 times that of ours; taking filter size into account, it is even larger.

Thirdly, as there is only one scale of filters in the S layer, there is no pooling over adjacent filter scales in the C layer in our model.

Fourthly, the absolute-value operation is moved from the S layer to the C layer. Though this modification makes no difference to the final output of the C layer, it brings the function of each layer closer to that of simple and complex cells in the primary visual cortex.
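Combining the two layers gives the complete feature extractor. The following hedged usage sketch reuses the s_layer and c_layer helpers defined above; the file name learned_filters.npy and the random image are placeholders, not artifacts of the paper.

```python
import numpy as np

# Hypothetical end-to-end run: 4 learned 8x8 filters (as in Sec. 3)
# applied to a 64x64 gray image (as in the ORL experiment, Sec. 4.1).
filters = np.load('learned_filters.npy')      # placeholder file; shape (4, 8, 8)
image = np.random.rand(64, 64)                # stand-in for a cropped face
features = c_layer(s_layer(image, filters))   # S layer, then C layer
feature_vector = features.ravel()             # fed to the classifier
```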

3. LEARNING AND SELECTING FILTERS

One theory about the visual cortex is that it contains a probabilistic model of images, and the activities of neurons represent images in terms of this model. Olshausen et al. proposed a sparse coding model [10, 15] that describes a strategy that may be employed by V1 simple cells. Filters with properties similar to the RFs of V1 simple cells are obtained by applying the sparse coding strategy to natural images. Here we take this sparse coding method to learn the filters of the S layer in our feature extraction model.

The sparse coding method uses a linear model to describe the observed image. An image I(x, y) is represented by a linear combination of a set of basis functions φi(x, y):

$$I(x,y) = \sum_i a_i \phi_i(x,y) \qquad (3)$$

The idea of sparse coding is that natural images are supposed to have “sparse structure” — that is, a given image can be described in terms of a small part of a large basis function set, and the basis functions are adaptive so as to best account for the image structure. The basis functions can be obtained by solving an optimization problem:

$$\min_{a_i,\,\phi_i} E = \sum_{x,y} \Big[ I(x,y) - \sum_i a_i \phi_i(x,y) \Big]^2 + \lambda \sum_i \log\left(1 + a_i^2\right) \qquad (4)$$

The first term of the objective function measures the reconstruction error, and the second term assesses the sparseness of the code. The positive parameter λ determines the importance of sparseness relative to reconstruction error. The optimization problem can be solved through a gradient descent technique [15].
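For concreteness, a minimal NumPy sketch of such an alternating gradient descent on Eq. (4) follows. It is our own illustrative implementation, not the authors' code; the batch size, learning rates, and iteration counts are arbitrary assumptions.

```python
import numpy as np

def learn_basis(patches, n_basis=64, lam=0.1, lr_a=0.01, lr_phi=0.5,
                n_iter=500, a_steps=50, batch=100, seed=0):
    """Illustrative gradient-descent solver for Eq. (4).
    patches: (n_patches, patch_dim) array of flattened 8x8 patches.
    Alternates: infer coefficients a for a fixed basis phi, then take
    one gradient step on phi. Hyperparameters are arbitrary choices."""
    rng = np.random.default_rng(seed)
    phi = rng.standard_normal((n_basis, patches.shape[1]))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    for _ in range(n_iter):
        x = patches[rng.choice(len(patches), batch)]
        a = x @ phi.T                                # initial coefficients
        for _ in range(a_steps):                     # inner loop: minimize over a
            resid = x - a @ phi                      # reconstruction error
            # dE/da: error term plus derivative of lam*log(1 + a^2)
            a -= lr_a * (-2 * resid @ phi.T + lam * 2 * a / (1 + a ** 2))
        resid = x - a @ phi
        phi += lr_phi * (a.T @ resid) / batch        # descend dE/dphi
        phi /= np.linalg.norm(phi, axis=1, keepdims=True)  # keep norms fixed
    return phi
```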

By applying the sparse coding method to image patches whose size is consistent with that of the RFs of simple cells (the filters we use in the experiments are 8×8 pixels, so the patches are 8×8 pixels), a set of basis functions can be learned. Although only a small number of basis functions is needed to represent any single image, different basis functions are activated when reconstructing different images. So there is a large set of basis functions, which can represent any image perfectly.


We do not intend to rebuild images, however, but to extract features for classification. It is unnecessary and infeasible to take all the basis functions as the filters of the S layer, so a small representative subset must be picked out.

The basis functions are localized in the spatial frequency domain, so taking them as the filters of the S layer means that the S layer works as a bank of bandpass filters. For object recognition, only information at some particular frequencies is needed to describe an image. We divide the frequency domain into 5 regions. The center frequency of each basis function falls in one of the 5 regions, so the functions are separated into 5 groups (denoted I~V, as shown in Figure 2). In each group except Group V, the basis function whose center frequency response has the largest value is selected. We thus finally get 4 basis functions to serve as S layer filters. The frequency domain is divided in this way to ensure that the selected filters are orientation specific: filters in Group I are all vertically oriented; filters in Group II have orientations between 90° and 180° (not including 90° and 180°); filters in Group III are horizontally oriented; and filters in Group IV have orientations between 0° and 90° (not including 0° and 90°). No filter is taken from Group V, because in object recognition the discriminative information is often edge information, which lies in the high frequencies, while the low-frequency components contain little information helpful for classification.

Figure 2. Illustration of the division of the frequency domain. It is divided into 5 regions: I — the fx axis (not including the origin), II — the first quadrant, III — the fy axis (not including the origin), IV — the second quadrant, V — the origin.
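The selection step can be sketched as follows. This is an illustrative NumPy implementation of the rule above, assuming the learned basis functions are stored as flattened 8×8 arrays; the region assignment via the FFT peak is our reading of Figure 2, not code from the paper.

```python
import numpy as np

def select_filters(basis, size=8):
    """Pick one basis function per frequency region I-IV of Figure 2.
    Each function is assigned to a region by the location of its peak
    spectral magnitude; the strongest function per region is kept, and
    region V (the origin, i.e. the low-frequency/DC component) is
    discarded. The spectrum of a real filter is point-symmetric, so
    peaks are first folded into the fx >= 0 half-plane."""
    best = {}
    for b in basis:                                   # b: flattened 8x8 function
        spec = np.abs(np.fft.fftshift(np.fft.fft2(b.reshape(size, size))))
        cy, cx = np.unravel_index(spec.argmax(), spec.shape)
        fy, fx = cy - size // 2, cx - size // 2       # center frequency
        if fx < 0 or (fx == 0 and fy < 0):
            fy, fx = -fy, -fx                         # fold symmetric peak
        if fx == 0 and fy == 0:
            continue                                  # region V: skipped
        region = ('I' if fy == 0 else 'III' if fx == 0
                  else 'II' if fy > 0 else 'IV')      # regions of Fig. 2
        if spec.max() > best.get(region, (0.0, None))[0]:
            best[region] = (spec.max(), b)
    return [b for _, b in best.values()]
```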

4. EXPERIMENTS

We test our model on several databases to evaluate the performance of the extracted feature in different object recognition tasks.

4.1 Experiments on face recognition

Image samples used in face recognition are usually cropped and resized face regions with little shift and scale variance, so our model is suitable for this application. Moreover, our research group has worked on face recognition for years [20, 21], so we apply this biologically inspired method to face recognition. We use the ORL face database [22] for our experiment. The main challenge of this database is the large within-class variance due to different poses and expressions. The database contains 40 classes, each with 10 samples (40 subjects, 10 images per subject).

The face images are cropped and resized to 64×64 pixels. The database is randomly divided into 2 subsets — the first contains 4 images per class, the second 6 images per class. The first subset is used to train the S layer filters. The second subset is used to test classification performance: it is randomly divided into two parts, 3 images per class for classifier training and 3 images per class for testing. A nearest-neighbor classifier is used for classification. We train the S layer filters 5 times, and for each group of trained filters, 10 classification tests are run. The average recognition rate (RC) over the 10 runs and the average feature extraction time cost per image (FETCPI) are shown in Table 1 (all of our experiments are conducted on an Intel P4 2.0 GHz PC with 512 MB RAM). We compare our method with the C1-SMFs method of [4], which has been verified as a good biologically inspired approach. Moreover, we replace the learned filters with Gabor filters of matching size to verify that filters adaptive to the images perform better. The parameters of the Gabor filters are identical to those of the first-scale Gabor filters of the model in [4]. The results of these two experiments are also listed in Table 1.
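A nearest-neighbor classifier of the kind used here can be sketched in a few lines (our own minimal version, assuming feature vectors are stacked row-wise into NumPy arrays):

```python
import numpy as np

def nn_classify(train_feats, train_labels, test_feats):
    """Minimal nearest-neighbor classifier: each test vector receives
    the label of its closest training vector (Euclidean distance)."""
    d = np.linalg.norm(test_feats[:, None] - train_feats[None], axis=-1)
    return np.asarray(train_labels)[d.argmin(axis=1)]
```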

From Table 1, we can see that our method performs best: its recognition accuracy is slightly higher than that of the C1-SMFs method, and its feature computation is more than 20 times faster than the C1-SMFs approach.


It can be seen that both increasing the number of scales of Gabor filters and replacing the Gabor filters with adaptive filters increase the recognition rate, and using adaptive filters is the more effective of the two. The recognition rates from the five different groups of trained filters show that, though the filters are learned from randomly selected images, the performance of our method is stable.

Table 1. Recognition rate (RC, %) and feature extraction time cost per image (FETCPI, s) on the ORL database

Filter training run     1              2              3              4              5
                        RC     FETCPI  RC     FETCPI  RC     FETCPI  RC     FETCPI  RC     FETCPI
Our Method              90.83  0.069   88.08  0.066   90.75  0.062   91.33  0.064   89.08  0.065
C1-SMFs                 88.67  1.58    86.50  1.57    88.33  1.63    89.00  1.62    86.08  1.62
Gabor                   85.75  0.065   81.92  0.066   82.25  0.065   84.42  0.065   84.08  0.067

4.2 Experiments on street scene object recognition

We further test our model on another task, street scene object recognition. This experiment uses the MIT StreetScenes database [23]. The database has three subsets; in each, the positive samples are images containing one type of object and the negative samples are images not containing that type. The three object types are cars, bicycles, and pedestrians. Each sample has a size of 120×120 pixels.

Three object recognition experiments are conducted on the car, bicycle, and pedestrian subsets respectively. For each subset, 10 positive samples are randomly selected to train the filters, 2/3 of the samples (both positive and negative) are used to train a classifier, and the remaining 1/3 is used for testing. We use a GentleBoost classifier [24] for this task. We run the experiment 10 times on each subset using randomized splits. The average equal error rate (EER) and average FETCPI are reported in Table 2. The C1-SMFs method is again taken as a benchmark, and its results are also listed in Table 2.

Table 2. Equal error rate (EER, %) and feature extraction time cost per image (FETCPI, s) on the StreetScenes database

Subset          Car             Bicycle         Pedestrian
                EER    FETCPI   EER    FETCPI   EER    FETCPI
Our Method      4.60   0.20     7.37   0.20     6.41   0.19
C1-SMFs         4.40   2.68     6.82   2.58     5.74   2.71
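For reference, the equal error rate used in Table 2 can be computed from raw classifier scores as in this sketch (our own helper, assuming real-valued scores and binary 0/1 labels):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-positive rate equals
    the false-negative rate. scores: real-valued classifier outputs;
    labels: binary 0/1 ground truth (both classes must be present)."""
    order = np.argsort(-np.asarray(scores))           # high score first
    y = np.asarray(labels)[order]
    tp = np.cumsum(y)                                 # true positives per cut
    fp = np.cumsum(1 - y)                             # false positives per cut
    fnr = 1 - tp / y.sum()
    fpr = fp / (1 - y).sum()
    i = np.argmin(np.abs(fnr - fpr))                  # where the curves cross
    return (fnr[i] + fpr[i]) / 2
```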

We can see that the C1-SMFs method works slightly better than our method on all three subsets. Compared with the face database, the samples in this database have a different property: the object occupies only a small part of each image, and its size and position exhibit more variance. The multi-scale filters of the C1-SMFs method are more capable of capturing these variances. Though using single-scale filters increases the EER a little, it significantly decreases the computational burden.

5. CONCLUSIONS

In this paper, we have described a simplified biologically inspired feature extraction model based on the early stages of information processing in visual cortex. A learning mechanism for the simple cells is introduced into the model, which improves its performance. Tests on face recognition and street scene object recognition show that the model performs well. Compared with most biologically motivated methods, the major advantage of ours is that it can run in real time. Its main limitation is that it cannot be used in a large scene where objects exhibit large size and position variance. This may be overcome by extending the model in a hierarchical fashion, though at the cost of increased processing time.

ACKNOWLEDGMENTS

This work is supported by the National High-Tech Research and Development Plan of China under Grant No. 2007AA01Z423, the Foundation Research Project of the ‘Eleventh Five-Year Plan’ of China under Grant No. C10020060355, and the National Science Foundation Project of CQ CSTC under Grant No. CSTC2008BB2199.


REFERENCES

[1] Daugman, J. G., “Two-dimensional Spectral Analysis of Cortical Receptive Field Profiles,” Vision Research 20(10), 847-856 (1980).
[2] Tan, T. N., “Texture Edge Detection by Modeling Visual Cortical Channels,” Pattern Recognition 28(9), 1283-1298 (1995).
[3] Smeraldi, F., Carmona, O., Bigun, J., “Saccadic Search with Gabor Features Applied to Eye Detection and Real-time Head Tracking,” Image and Vision Computing 18(4), 323-329 (2000).
[4] Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T., “Robust Object Recognition with Cortex-like Mechanisms,” IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3), 411-426 (2007).
[5] Riesenhuber, M., Poggio, T., “Hierarchical Models of Object Recognition in Cortex,” Nature Neuroscience 2(11), 1019-1025 (1999).
[6] Vrensen, G., Cardozo, J. N., “Changes in Size and Shape of Synaptic Connections after Visual Training: An Ultrastructural Approach of Synaptic Plasticity,” Brain Research 218, 79-97 (1981).
[7] Fagiolini, M., Pizzorusso, T., Berardi, N., Domenici, L., Maffei, L., “Functional Postnatal Development of the Rat Primary Visual Cortex and the Role of Visual Experience: Dark Rearing and Monocular Deprivation,” Vision Research 34(6), 709-720 (1994).
[8] Briones, T. L., Klintsova, A. Y., Greenough, W. T., “Stability of Synaptic Plasticity in the Adult Rat Visual Cortex Induced by Complex Environment Exposure,” Brain Research 1018(1), 130-135 (2004).
[9] Li, Z., Atick, J. J., “Towards a Theory of Striate Cortex,” Neural Computation 6(1), 127-146 (1994).
[10] Olshausen, B. A., Field, D. J., “Emergence of Simple-cell Receptive Field Properties by Learning a Sparse Code for Natural Images,” Nature 381(6583), 607-609 (1996).
[11] Wersing, H., Korner, E., “Learning Optimized Features for Hierarchical Models of Invariant Object Recognition,” Neural Computation 15(7), 1559-1588 (2003).
[12] Hubel, D. H., Wiesel, T. N., “Receptive Fields and Functional Architecture of Monkey Striate Cortex,” J. Physiol. 195(1), 215-244 (1968).
[13] Parker, A. J., Hawken, M. J., “Two-dimensional Spatial Structure of Receptive Fields in Monkey Striate Cortex,” J. Opt. Soc. Am. A 5(4), 598-605 (1988).
[14] Mutch, J., Lowe, D. G., “Object Class Recognition and Localization Using Sparse Features with Limited Receptive Fields,” International Journal of Computer Vision 80(1), 45-57 (2008).
[15] Olshausen, B. A., Field, D. J., “Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?,” Vision Research 37(23), 3311-3325 (1997).
[16] Harpur, G. F., Prager, R. W., “Development of Low Entropy Coding in a Recurrent Network,” Network: Computation in Neural Systems 7(2), 277-284 (1996).
[17] Hoyer, P. O., “Modeling Receptive Fields with Non-negative Sparse Coding,” Neurocomputing 52-54, 547-552 (2003).
[18] Hasler, S., Wersing, H., Korner, E., “Combining Reconstruction and Discrimination with Class-specific Sparse Coding,” Neural Computation 19(7), 1897-1918 (2007).
[19] Sun, J., Zhuo, Q., Ma, C., Wang, W., “Sparse Image Coding with Clustering Property and its Application to Face Recognition,” Pattern Recognition 34(9), 1883-1884 (2001).
[20] Liang, Y., Li, C., Gong, W., Pan, Y., “Uncorrelated Linear Discriminant Analysis Based on Weighted Pairwise Fisher Criterion,” Pattern Recognition 40(12), 3606-3615 (2007).
[21] Yang, L., Gong, W., Gu, X., Li, W., Liu, Y., “Bagging Null Space Locality Preserving Discriminant Classifiers for Face Recognition,” Pattern Recognition 42(9), 1853-1858 (2009).
[22] Samaria, F., Harter, A., “Parameterisation of a Stochastic Model for Human Face Identification,” Proc. 2nd IEEE Workshop on Applications of Computer Vision, 138-142 (1994).
[23] Bileschi, S., “CBCL StreetScenes Challenge Framework,” http://cbcl.mit.edu/software-datasets/streetscenes (2007).
[24] Friedman, J., Hastie, T., Tibshirani, R., “Additive Logistic Regression: A Statistical View of Boosting,” The Annals of Statistics 28(2), 337-407 (2000).
