
Page 1: Neural object recognition by hierarchical learning and extraction of essential shapes (kolesnik.leute.server.de/papers/pdf/bvai2007.pdf)

Neural object recognition by hierarchical learning and extraction of essential shapes

Daniel Oberhoff and Marina Kolesnik

Fraunhofer Institut FIT, Schloss Birlinghoven, 53754 Sankt Augustin

Abstract. We present a hierarchical system for object recognition that models neural mechanisms of visual processing identified in the mammalian ventral stream. The system is composed of neural units organized in a hierarchy of layers with increasing complexity. A key feature of the system is that the neural units learn their preferred patterns from visual input alone. Through this "soft wiring" of neural units the system becomes tuned for target object classes through pure visual experience and with no prior labeling. Object labels are only introduced to train a classifier on the system's output. The system's tuning takes place in a feed-forward path. We also present a neural mechanism for back projection of the learned image patterns down the hierarchical layers. This feedback mechanism could serve as a starting point for integration of what- and where-information processed by the ventral and dorsal streams. We test the neural system with natural images from publicly available datasets of natural scenes and handwritten digits.

1 Introduction

Even after many years of active research in computer vision, approaches to object recognition only seldom yield the desired performance. This is despite the ease with which humans and many animals perform these tasks. In the hope of reaching at least part of this performance, more attention is being paid to algorithms that, in more or less detail, model visual cortical organization as identified in humans and other mammals by means of psychophysics and neurophysiology. There, several processing streams have been identified, of which the ventral and dorsal are the most pronounced [1]. Each stream has a hierarchical multi-layer structure in which the complexity of the neurons' selectivity increases gradually from bottom to top layers. The ventral stream mainly performs recognition and classification tasks [2], while the dorsal stream is specialized for the processing of motion and place as well as depth. The ventral stream, on the other hand, is largely ignorant of motion and place information as well as the exact arrangement of object features. The mechanism for this has been clearly identified in area V1 by Hubel and Wiesel [3], who found separate populations of simple and complex cells: simple cells were found to act as detectors of oriented intensity variations with high specificity to the position, orientation, and phase of these stimuli, while complex cells exhibit the same selectivity but are tolerant to a limited amount of shift of these stimuli. Hubel and Wiesel were also the first to


identify the columnar organization of the visual cortex: columns are assemblies of (simple and complex) cells that have receptive fields in mostly the same retinotopic area and cover the whole spectrum of different features.

This neural structure has been reflected in several schemes for learning and recognition of image patterns. Probably the first such network, called "Neocognitron", was suggested by Kunihiko Fukushima in 1980 [4]. Neocognitron consists of a series of S- and C-layers (mimicking simple and complex cell types, respectively) with shared weights for a set of local receptive fields and inhibitory and excitatory sub-populations of units with interactions resembling neural mechanisms. Neocognitron learns through a combination of winner-take-all and reinforcement learning, autonomously forms classes for presented characters, and correctly classifies some slightly distorted and noisy versions of these characters. In 1989 Yann LeCun et al. [5] introduced a similar but much more powerful successor to this network that generated local feature descriptors through back-propagation. A later version of this network, called "LeNet", has been shown to act as an efficient framework for nonlinear dimensionality reduction of image sets [6]. "LeNet" is similar in architecture to Neocognitron, but does not learn autonomously and requires labels to initiate the back-propagation. The latter is not biologically justified.

In 2003 Riesenhuber and Poggio [7] suggested a computational model for object recognition in the visual cortex with a similar layout called "hmax", initially focusing on the correspondence between model components and cortical areas. "hmax" employs Gaussian radial basis functions to model the selectivity of simple cells, and a nonlinear max-function, pooling input from a local population of simple cells, to model the functionality of complex cells.
Learning in hmax is constrained to the tuning of simple cells to random snapshots of local input activity while presenting objects of interest. It was, however, successfully applied to the modeling of V4 and IT neuron responses, and also as an input stage to a classifier for object and face recognition, yielding very good performance. Another approach, which focuses closely on the details of neural adaptation and learning and does not use weight sharing, is found in "VisNet", presented by Deco and Rolls in 2002 [8]. The most interesting ingredient of their model is the fact that it can learn the shift invariance of the feature detectors autonomously through a temporal learning rule called the trace rule. The system we present here, while structurally similar to hmax, incorporates an unsupervised learning strategy and feedback projections.

2 The object recognition system

The architecture of the system comprises a hierarchy of several processing layers representing the visual areas in the ventral stream (see Fig. 1). Each layer, analogously to most cortical visual areas, has a columnar organization, such that for every spatial location there are several units with distinct receptivity. The output of such a layer is arranged along 3 dimensions: horizontal/vertical displacement and feature index. The layers are arranged in pairs called (following


established notation) S- and C-layers. The S-layers perform feature detection on their input, while the C-layers reduce the resolution of the S-layer output by pooling over local spatial regions using a max-nonlinearity [7], keeping the same columnar arrangement. This processing step performed by the C-layers makes their response position invariant to a limited degree. The area over which C-layer units pool is chosen to match the receptive field size of the preceding S-layer. The receptive fields of the S-layers only encompass a few spatial locations of their input (S1: 5x5, S2-S3: 3x3), but due to the hierarchical arrangement, the receptive fields of units in higher layers of the hierarchy (if projected all the way back to the first layer) become very large. This accounts for one of the important properties of the ventral stream: growing receptive field sizes and reduction of the information about the spatial origin of visual percepts. The described hierarchy performs recognition of shapes in a multi-scale fashion:

Fig. 1. Architecture of the shape extraction hierarchy, with layer pairs S/C 1 to S/C 3. As an example, it is shown how information reaches a single S3-unit, and how the large effective receptive field of this unit is constructed hierarchically.

units on lower layers of the hierarchy are receptive for local shape features such as lines, curves, branches or corners, while units on higher levels are receptive to complex combinations of such low-level features. The first feature extracting layer, S0, employs a V1 simple cell model based on Gabor filters and surround suppression, followed by a maximum pooling layer, C0, corresponding to the complex cells in V1. Eight sets of Gabor pairs were used, with spatial periods of 4 and 8 pixels and at 4 orientations. The following S-layers (S1-S3) perform autonomous learning as described in the next section.
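A rough sketch of this S0/C0 stage is given below. The kernel size and the Gaussian envelope width are our assumptions (the paper specifies only the two spatial periods, the four orientations, and max-pooling); the same `c_layer` pooling applies to every C-layer in the hierarchy.

```python
import numpy as np

def gabor_pair(period, theta, size=9):
    """Even/odd (cosine/sine) Gabor kernels with spatial period `period`
    and orientation `theta`; the envelope width (period/2) and kernel
    size are assumptions, not taken from the paper."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x ** 2 + y ** 2) / (2 * (period / 2.0) ** 2))
    phase = 2 * np.pi * xr / period
    return env * np.cos(phase), env * np.sin(phase)

# the paper's S0 bank: spatial periods 4 and 8 pixels at 4 orientations
bank = [k for p in (4, 8)
        for t in np.linspace(0, np.pi, 4, endpoint=False)
        for k in gabor_pair(p, t)]

def c_layer(s_out, pool):
    """C-layer max-pooling: reduce the resolution of an S-layer output
    (height, width, features) over non-overlapping pool x pool spatial
    regions, keeping the columnar (feature) arrangement."""
    h, w, f = s_out.shape
    h2, w2 = h // pool, w // pool
    crop = s_out[:h2 * pool, :w2 * pool, :]
    return crop.reshape(h2, pool, w2, pool, f).max(axis=(1, 3))
```

Since the pooling regions here are non-overlapping, each C-layer response is invariant to shifts of a feature within its pooling region, which is the limited position tolerance described above.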

2.1 Self-tuning feature extraction layers

Layers S1-S3 of the system "tune themselves" to shape features elicited by visual input. Through an iterative learning process (Fig. 2) one column of selective units is generated and replicated across the whole spatial domain of the perceived image. The selectivity of the units in the column is modeled using Gaussian


radial basis functions operating on the input within the unit's receptive field:

ai(x, y) = |Ir(x, y)| · e^( −(Îr(x, y) − pi)² / (2σi) )    (1)
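A minimal sketch of the unit response in Eq. (1), with the receptive-field contents flattened to a vector, pi the unit's preferred pattern, and σi its tuning width (the function and variable names are ours):

```python
import numpy as np

def unit_response(I_r, p_i, sigma_i):
    """Eq. (1): Gaussian-RBF response of one unit. The patch is
    Euclidean-normalized before comparison with the preferred
    pattern, and the response is weighted by the raw patch norm."""
    I_r = np.asarray(I_r, float)
    mag = np.linalg.norm(I_r)
    if mag == 0:
        return 0.0  # empty regions elicit no response
    I_hat = I_r / mag
    return mag * np.exp(-np.sum((I_hat - p_i) ** 2) / (2 * sigma_i))
```

A unit responds maximally, with strength |Ir|, when the normalized input coincides with its preferred pattern, and the response falls off with the squared distance at a rate set by σi.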

Here i is the index of the unit in the column, (x, y) is the retinotopic position of the column, pi is the unit's preferred pattern, and Ir(x, y) is the input within a radius r around (x, y). The tuning width, σi, defines the sharpness of the unit's selectivity. Î denotes Euclidean normalization. To generate a representative set of units for each S-layer, images are presented

Fig. 2. Overview of one iteration of the learning process: determine the winning unit; update the winning unit's error; if the error threshold is exceeded, create a new unit tuned to the input, otherwise adapt the winning unit to the input; then update the dynamic thresholds and eliminate inactive units. These steps are performed for each input position, for each presented image, and for each layer.

to the system and the best responding unit is selected in the neighborhood Br(x, y) around each input location (x, y):

i′(x, y) = argmax_i ( max_{(x′,y′) ∈ Br(x,y)} ai(x′, y′) )    (2)

The effect of this selection is that the receptive fields of selected units do not overlap. To initialize the system, an arbitrary position with strong activity is selected in the input and a unit is generated that prefers exactly this pattern. The receptive field size r is kept constant for each layer. Since only a single column is trained, all the image positions have to be visited in series; to avoid bias toward any one region, the visiting order is randomized. Before the actual update is performed, the difference between the experienced input and the winning unit's preferred pattern is measured and used to update an error variable for the unit:

ei ← e^(−1/τ) ( ei + δii′ |Ir(x, y)| · |Îr(x, y) − pi| )    (3)

where τ is a time constant determining the layer's speed of adaptation and δii′ is Kronecker's delta. This update rule has the effect that ei takes on high values when the unit wins repeatedly while experiencing input that is very different from its preferred pattern. If the error exceeds a threshold emax, a new unit is generated, the error is distributed equally between the winning and the new unit, and both their tuning widths (σ) are set to half the winning unit's tuning width. Only when the error did not


exceed emax, the winning unit is updated according to:

pi′ ← pi′ + α |Ir(x, y)| ( Îr(x, y) − pi′ )    (4)
σi′ ← σi′ + α |Ir(x, y)| ( |Îr(x, y) − pi′| − σi′ )    (5)

where α is a constant learning rate. Note that the updates in (3)-(5) are weighted by the actual amount of input, |Ir(x, y)|, within the unit's receptive field. Thus, only regions containing information will actually have an impact on the system's tuning. Because it is hard to estimate an effective bound on the desired error, emax is updated based on the current size of the network n:

emax ← emax ( 1 + (n + n+ − N0) / N0 )    (6)

where N0 is the desired number of units, and n+ is an exponentially weighted moving average of the growth rate. In addition to the tuning and growing process, units are removed from the system if they win the competition too infrequently. The threshold for removal is updated similarly to the one for growing:

f0 ← f0 ( 1 − (n + n+ − N0) / N0 )    (7)

The above steps have the effect that within a few time constants the column will reach the desired size and be appropriately tuned to the input patterns encountered during that time. Thus τ should be set a few times smaller than the training set size.
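Taken together, one learning iteration (Fig. 2) can be sketched as below. This is a simplified single-column reading of Eqs. (1)-(6): the neighborhood search of Eq. (2), the growth-rate average n+, and the unit-removal rule of Eq. (7) are omitted, and all names are ours, not the authors' implementation.

```python
import numpy as np

def normalize(I):
    """Euclidean normalization; also return the original magnitude."""
    n = np.linalg.norm(I)
    return (I / n if n > 0 else I), n

def learn_step(units, I, alpha, tau, e_max, N0):
    """One learning iteration for one input patch I. `units` is a list
    of dicts with keys p (preferred pattern), sigma (tuning width) and
    e (error). Returns the updated growth threshold e_max."""
    I_hat, mag = normalize(np.asarray(I, float))
    # Eqs. (1)-(2): the winner is the unit with the strongest GRBF response
    acts = [mag * np.exp(-np.sum((I_hat - u['p']) ** 2) / (2 * u['sigma']))
            for u in units]
    w = int(np.argmax(acts))
    # Eq. (3): decay every unit's error, charging the winner its mismatch
    for i, u in enumerate(units):
        hit = mag * np.linalg.norm(I_hat - u['p']) if i == w else 0.0
        u['e'] = np.exp(-1.0 / tau) * (u['e'] + hit)
    win = units[w]
    if win['e'] > e_max:
        # grow: a new unit tuned to the input; the error is split between
        # winner and new unit, and both tuning widths are halved
        win['e'] /= 2.0
        win['sigma'] /= 2.0
        units.append({'p': I_hat.copy(), 'sigma': win['sigma'], 'e': win['e']})
    else:
        # Eqs. (4)-(5): adapt the winner toward the (normalized) input
        d = np.linalg.norm(I_hat - win['p'])
        win['p'] = win['p'] + alpha * mag * (I_hat - win['p'])
        win['sigma'] += alpha * mag * (d - win['sigma'])
    # Eq. (6): relax the growth threshold toward the desired size N0
    # (the growth-rate average n+ of the paper is omitted in this sketch)
    return e_max * (1 + (len(units) - N0) / N0)
```

Running this over many randomized input positions lets the column grow toward N0 units whose preferred patterns track the frequently occurring inputs, which is the behavior described above.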

2.2 Feedback projections

Anatomical studies have shown that virtually all connections between successive pairs of visual areas in the ventral stream are reciprocal [9]. The feedback projections are thought to serve top-down processing for object association and visual attention. Our approach to feedback projections is based on the Selective Tuning model by John K. Tsotsos et al. [10], but with a less strict selection scheme and some adaptation to suit our architecture:

– Activation from the output of the S-layers and the perceptron is "cleaned up" in that (at every retinotopic position) only the unit with the maximum response projects back. The back-projection through the perceptron is furthermore rectified, since negative activation is undesirable.

– Only the afferent that contributed to the output of the C-layers is propagatedback. This introduces the spatial competition emphasized in [10].

– The actual selection is performed by multiplying the back-propagating signalwith the forward traveling signal before each max-pooling layer.

Through these mechanisms we can recover the low-level input responsible for a certain classification event (Fig. 3).
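Assuming non-overlapping pooling regions as in the forward sketch, the second and third mechanisms (only the contributing afferent propagates back, gated multiplicatively by the forward signal) can be sketched as follows; this is our reading of the selection scheme, not the authors' code:

```python
import numpy as np

def feedback_through_pool(fb, s_out, pool):
    """Route each back-projected value fb[i, j, k] (at C-layer
    resolution) to the single S-layer position that won the forward
    max-pooling, multiplied by the forward signal there; every other
    afferent receives zero feedback."""
    out = np.zeros_like(s_out)
    for i in range(fb.shape[0]):
        for j in range(fb.shape[1]):
            for k in range(fb.shape[2]):
                patch = s_out[i * pool:(i + 1) * pool,
                              j * pool:(j + 1) * pool, k]
                di, dj = np.unravel_index(np.argmax(patch), patch.shape)
                out[i * pool + di, j * pool + dj, k] = fb[i, j, k] * patch[di, dj]
    return out
```

Applied layer by layer from the classifier down to S0, this winner-routing is what localizes the stimulus responsible for a classification event.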


Fig. 3. Feedback begins at a selected output unit, feeding back, among its afferents, to the unit which yielded the strongest input in the forward sweep. The images show the output from a perceptron trained to recognize cars from C3-output (left), and the feedback result to the lowest layer, S0 (right).

3 Applications

The learning system has been applied to various datasets ranging from natural sceneries to images of handwritten digits. In all trials three pairs of S- and C-layers for learning and maximum pooling, respectively, were used. Learning layers were trained sequentially from bottom to top. For the learning of each layer the training set was presented twice: once to establish a codebook and another time to refine it.

3.1 Natural Images

Natural images of street scenes were selected from a publicly available database1

together with image annotations. The annotations were refined to make sure all instances of pedestrians and cars were labeled. 600 crops were extracted with 10 pixels of padding on each side. 200 crops per class, featuring cars, pedestrians, and other randomly selected objects, were split in half between the training and test sets (Fig. 4, left). A linear classifier was employed for object classification. This classifier extracted the mean and covariance of the data for each class, and assigned unseen data to the class for which the corresponding normal distribution yielded the highest posterior probability. Since the output of the last learning layer, C3, usually contained more than one spatial location (i.e. > 1x1 pixels), the location with the maximum sum of activities over a column was selected as input to the classifier. The adaptation time constant, τ, was fixed to 100.
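The classifier described above amounts to per-class Gaussian density estimation with maximum-posterior assignment. A minimal sketch, assuming equal class priors and adding a small regularizer for invertibility (both our additions); note that with one covariance per class the decision boundary is quadratic rather than strictly linear:

```python
import numpy as np

class GaussianClassifier:
    """Fit a mean and covariance per class; assign unseen data to the
    class whose normal density yields the highest log-posterior."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.stats_ = []
        for c in self.classes_:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            # small ridge term (our addition) keeps the covariance invertible
            cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
            self.stats_.append((mu, np.linalg.inv(cov),
                                np.linalg.slogdet(cov)[1]))
        return self

    def predict(self, X):
        # log-density up to a constant: -(Mahalanobis^2 + log|cov|) / 2
        scores = np.stack([
            -0.5 * (np.einsum('nd,de,ne->n', X - mu, prec, X - mu) + logdet)
            for mu, prec, logdet in self.stats_])
        return self.classes_[np.argmax(scores, axis=0)]
```

With equal priors the class with the highest density is also the one with the highest posterior, matching the description in the text.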

1 See http://labelme.csai.mit.edu/


layer    C0       C1       C2       C3
cars     63.3 %   68.7 %   74.7 %   90.3 %
people   46.4 %   57.6 %   76.5 %   84.7 %
other    48.9 %   50.4 %   61.3 %   72.2 %

Table 1. Best recognition rates at each layer of the hierarchy

Fig. 4. Left: examples of crops obtained from the LabelMe database. Right: dependence of the total recognition error (percent classified incorrectly) on the S1 capacity N0 (top axis, dashed line) and on the number of training samples (bottom axis, solid line).

To measure the increase in shape information as it travels up the hierarchy, we also trained and tested classifiers with the outputs from the lower C-layers, while adjusting the pooling in these layers to make the output resolution equal to that of C3. Table 1 lists the obtained recognition rates, showing that the performance increases with each layer. A jump in performance occurs between layers C1 and C2 for pedestrians and between C2 and C3 for cars. This jump can be attributed to the fact that the effective receptive field size becomes large enough to cover whole instances of the respective class at their typical size. Best performances are 90% for the car class and 84% for the pedestrian class. Clearly, pedestrians are harder to classify because of their variable shapes. The obtained rates seem competitive for this kind of task, though we have not tried any established algorithms for comparison, but are planning to do so in the near future.

To test the influence of the layer capacity N0 on the overall recognition performance, three different capacities of 7, 15, and 30 for the lowest layer were tried. The capacities were doubled for each higher layer (e.g. S1:30, S2:60, S3:120). For the largest capacity the size of the training set was varied between 10 and 300 (the whole set) while adjusting the number of presentations of the set to keep the total number of presented images constant. The whole training set was used for the training of the classifier. The results (Fig. 4, right) show that the system can generalize sufficiently well and shows no degradation in performance even for a training set of only 33 images per class. Reducing N0 causes strong degradation in performance, indicating a minimal capacity required for the task. Yet additional tests have shown that a further increase in N0 tends to decrease


Fig. 5. Handwritten digits from the MNIST dataset.

performance. It thus remains to find a way for automatic tuning of the system's capacity.

3.2 Handwritten Digits

A sample of handwritten digits from the MNIST dataset2 is shown in Fig. 5. The dataset contains 60,000 training and 10,000 test examples, binarized, centered, and scaled to a common size. The speed of adaptation, τ, was set to 5000, and the capacities of the learning layers were (from lowest to highest in the hierarchy) 40, 80, and 160, respectively. No Gabor filtering stage was used in this trial because the images had low spatial resolution. Also, due to their near-binary nature, these images could be directly processed by layer S1. The increasing complexity of the learned features is shown in Fig. 6. A two-layer perceptron with 100 hidden units was employed for classification because the linear classifier failed to produce reasonable results (we also tried Support Vector Machines with Gaussian kernels and Gentle AdaBoost, both yielding similar results). The necessity of a non-linear classifier indicates that the C3-layer outputs do not form single clusters for each of the ten digits. This is not so surprising since the digits appear in many and sometimes subtle variations. Running our system on this dataset yields a recognition rate of 94.2% on the

test set with the perceptron. This is comparable with the results exhibited by state-of-the-art algorithms2. The system also exhibits some tolerance to rotation and scale changes (Fig. 7), even though only undistorted images were used for training. This tolerance is partly introduced by the C-layers discarding some information about the exact spatial origin of each feature. Rotation tolerance is also due to the facts that only a small number of units were part of the S1-column, so that a small rotation of a feature does not change the winning unit, and that some rotation variance was already present in the training set. The tolerance does not, however, hold for much larger rotations, which would have to be explicitly learned (see also [11]).

2 See http://yann.lecun.com/exdb/mnist/

Page 9: Neural object recognition by hierarchical learning and ...kolesnik.leute.server.de/papers/pdf/bvai2007.pdf · for object and face recognition yielding very good performance. Another

Fig. 6. Examples of receptive fields learned by units in S1-S3, as back-projected to the retinal level. The projection is only approximate due to the position-discarding nature of the C-layers. Nevertheless one can nicely observe how the complexity increases up to full digits.

Fig. 7. Recognition rate as a function of change in size (left) and rotation (right).

4 Conclusion

We have described a hierarchical learning system for shape-based object recognition, inspired by neurophysiological evidence on ventral stream processing in the mammalian brain. The system exhibits a robust capability to develop selectivity to frequently occurring input patterns, with the only constraining parameters being its capacity and the time constant of the adaptation. In particular, no class information is required in the learning stage, in contrast to most current approaches to feature learning. This learned selectivity to characteristic image patterns generates a unique set of features at the highest layer of the hierarchy. When these are passed to a final classifier, a respectable recognition performance, comparable to state-of-the-art algorithms, is achieved for very different images ranging from natural scenes to artificial objects. This capability to adapt and perform consistently comes almost for free through natural tuning and with little change of a few parameters.

We only know of one similar system that combines this kind of architecture with an unsupervised learning rule for object recognition [12]. There, an energy minimization scheme has been used to generate a set of preferred patterns based on the reconstruction error and an additional term enforcing sparsity of the response of the feature selective layers, while we used a more biological competitive Hebbian learning rule. All other systems either do not incorporate any unsupervised learning aside from random picking of input data as codebook entries, or require a much more powerful classifier to perform similarly, or both; or they use supervised learning for the whole system [4, 5, 7, 11, 13]. We have also presented how the attention model of J. K. Tsotsos et al. [10] can be adapted to our system to recover the exact location of the stimuli responsible for a recognition event in large scenes. Future work will investigate how these responsible stimuli could facilitate the learning and, ultimately, the recognition.

References

1. Ungerleider, L.G., Mishkin, M.: Two cortical visual systems. In: Analysis of Visual Behavior. MIT Press (1992) 549–586

2. Hung, C.P., Kreiman, G., Poggio, T., DiCarlo, J.J.: Fast readout of object identity from macaque inferior temporal cortex. Science 310 (2005) 863–866

3. Hubel, D.H., Wiesel, T.N.: Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195 (1967)

4. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cyb. 36(4) (1980) 193–202

5. LeCun, Y., Jackel, L.D., Boser, B., Denker, J.S., Graf, H.P., Guyon, I., Henderson, D., Howard, R.E., Hubbard, W.: Handwritten digit recognition: Applications of neural net chips and automatic learning. IEEE Comm. (1989) 41–46, invited paper.

6. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: Proc. CVPR, IEEE Press (2006)

7. Serre, T., Kouh, M., Cadieu, C., Knoblich, U., Kreiman, G., Poggio, T.: A theory of object recognition: Computations and circuits in the feedforward path of the ventral stream in primate visual cortex. CBCL Memo 259 (2005)

8. Rolls, E.T., Deco, G.: The Computational Neuroscience of Vision. Oxford University Press (2002)

9. Felleman, D.J., Van Essen, D.C.: Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1(1) (1991) 1–47

10. Tsotsos, J.K., Culhane, S.M., Wai, W.Y.K., Lai, Y., Davis, N., Nuflo, F.: Modeling visual attention via selective tuning. Artificial Intelligence 78 (1994) 507–545

11. Deco, G., Rolls, E.T.: A neurodynamical cortical model of visual attention and invariant object recognition. Vis. Res. 44 (2004)

12. Mutch, J., Lowe, D.G.: Multiclass object recognition with sparse, localized features. Proc. CVPR (2006) 11–18

13. Munder, S., Gavrila, D.M.: An experimental study on pedestrian classification. IEEE Trans. Pattern Analysis and Machine Intelligence 28 (2006) 1863–1868