
Hilary Term 2005

Continuous transformation learning: A novel learning mechanism for invariant object recognition in feed-forward hierarchical neural networks

Candidate Number: 39312

Date: 10/3/06

Word Count: 8670

CT Learning Candidate: 39312


Abstract
Introduction
  Trace learning
  Continuous Transformation (CT) learning
Methods
  The Model
  Stimuli
  Training and test procedure
Experiment 1: Single-cell demonstration of CT learning
Experiment 2: Comparison CT learning and Trace learning
Experiment 3: CT learning and trace learning with interleaved stimuli
Experiment 4: CT learning with randomised views of objects
Experiment 5: Generalisation in higher layers
Discussion
Conclusion
References



Abstract

In this paper, we present a novel unsupervised learning mechanism, which allows networks to

recognise objects invariantly. Continuous Transformation (CT) learning relies on continual

synaptic modification (for example, by means of an associative Hebb rule) of the feedforward

inter-layer connection weights in a neural network during training on transforming (e.g.,

rotating, translating, etc.) visual stimuli. If there is large overlap between firing patterns of

successive views, with CT learning similar views of an object get continually associated with,

and mapped onto the same postsynaptic neurons. In this way it relies on the spatial continuity

of objects in the environment.

Here, we present five experiments that demonstrate the characteristics of the CT effect and

contrast its performance with that of another approach, trace learning, that relies on temporal

continuity of input stimuli.

In numerical simulations we find that the CT effect operates well when successive views of a

transforming stimulus are similar enough to drive the same post-synaptic neurons. This is

shown to occur largely independent of the specific order of presentation of the views.

Furthermore, if there is some invariance built into low-level feature detectors, the CT effect

drives the development of neurons in layer 4 that are tuned to specific objects independent of

rotation, even if the higher layers have only been trained on a few canonical views of the

object.



Introduction

One fundamental property of the primate visual system is its capability of identifying objects

invariantly of the particular viewing context. A chair can effortlessly be recognised as a chair

regardless of the specific viewing angle, its position relative to the retina, the lighting

conditions, and the background.

A general assumption adopted by many researchers is that any recognition mechanism has to

be based on a process of matching a view of an object to templates in memory (Poggio and

Edelman 1990; Tarr and Bülthoff 1995). However, objects in the natural environment occur

under conditions that are far from ideal for matching a template. If, for example, a stimulus is

shifted by only a small amount with respect to the template, a comparison using a dot product

will return a very low correlation between stimulus and template. Evidence on how the brain

solves this problem comes from single-cell physiology in monkeys. It has been shown that at

the level of inferior temporal cortex there are neurons that respond to specific objects

invariantly of the exact position on the retina, the size of the object, and the particular view

that is visible of the object (Desimone 1991; Tanaka et al. 1991; Rolls 1992; Rolls 2000).
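The sensitivity of template matching to small shifts can be illustrated with a toy calculation. The sketch below is not taken from the simulations reported here; the 20-pixel one-dimensional "retina" and the 3-pixel bar are assumptions chosen purely for illustration.

```python
import math

def bar(position, width=3, size=20):
    """1-D binary image with a bar of `width` pixels starting at `position`."""
    return [1.0 if position <= i < position + width else 0.0 for i in range(size)]

def normalised_dot(a, b):
    """Dot product of two vectors after scaling each to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Compare a stored template with the same bar shifted by 0 to 4 pixels.
template = bar(5)
similarities = {shift: normalised_dot(template, bar(5 + shift)) for shift in range(5)}
```

With these assumed sizes the similarity drops from 1 to 2/3 after a shift of a single pixel and reaches 0 once the bar has moved by its own width.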

A central issue in computational neuroscience is therefore how such invariant cells can

develop over successive processing stages in the ventral stream.

One approach to investigate the underlying mechanisms that might be involved in the

development of invariant neurons is to build biologically inspired artificial neural networks

and, by implementing different learning algorithms and network architectures, to study the

dynamics of self-organisation in such networks.

One specific type of network architecture that has been popular among modellers of the visual

system is based on the existence of feature hierarchies in the ventral stream (Fukushima

1980). Implementations of this approach typically involve routing visual information from the

input layer of a neural network in a feed-forward way through a hierarchy of layers (e.g.,

Poggio and Edelman 1990; Wallis and Rolls 1997). Competition between neurons within

a layer helps to keep firing rates low, and convergence of connectivity between layers binds this

competition to neurons with spatially similar receptive fields and leads to a constant increase

of receptive field sizes from layer to layer.

Such a setup is supported by neurophysiological findings. Whereas in V1 individual cells

respond to an area of roughly 1.3° visual angle, in IT RFs tend to span the whole visual field

(approximately 50° visual angle) (see Figure 3) (Boussaoud et al. 1991).

Within the framework of feature hierarchies, ‘simple’ feature detectors in V1 are tuned, for

example, to a bar of specific orientation (Hubel and Wiesel 1962). If several of these ‘simple’

cells project onto a cell in the next higher layer of the network, and this conjunction neuron’s

activity is calculated by means of a non-linear function, so that it only fires when several of

its afferent connections are active, one can build up a more and more complex representation

of objects as one ascends through the layers. Such a neuron could represent, for example, a

bar intersection. Several complex cells can then project to a neuron in an even higher layer,

and so forth. Due to the increasing receptive field sizes one convenient by-product of such a



feature hierarchy is that the higher up a cell is in the processing stream, the more invariant to

location its firing will be.

A major problem with feature hierarchies as a model of invariant object recognition is related

to capacity. In order to be able to represent objects in a way that mirrors the complexity of

their appearance in the environment, a high number of conjunction neurons have to be

present. The development of feature hierarchies in neural networks has therefore been

influenced by the question of how best to streamline the process of parsing natural scenes into

individual features. In this respect, nature is lending a helping hand in that visual input is not

chaotic but exhibits certain statistical regularities (Field 1994). Given that the evolution of

primate vision has taken place with these regularities remaining constant, it seems reasonable

to assume that the visual system will have adapted to these conditions.

Trace learning

In the context of feature hierarchies, one influential learning mechanism that draws on

statistical regularities in natural scenes was first proposed by Földiák (1991) and then

elaborated by Rolls (1992). Trace learning utilises the fact that objects under normal viewing

conditions show temporal continuity. It does so by assuming that an object that was

encountered at time step τ is likely to be the same object at time step τ + 1, regardless of any

transformations of the object that might have occurred between time steps. This is

implemented by modifying an associative learning rule to bias neurons to respond to patterns

of activity in the input layer that occur in temporal proximity (Equation 1). The trace that is

used to strengthen connections between coactive pre- and post-synaptic neurons is a function

of the activity at the current and previous time step as outlined in Equation 2. Here the trace

value η has to be optimised dependent on the length of the presentation sequence. Note that if

one sets the trace value η to 0 the trace rule becomes a standard Hebb rule. Connection

weights between cells in adjacent layers are then updated according to the relative activity

patterns.

Equation 1

\[ \Delta w_j = \alpha \, \bar{y}^{\tau} x_j \]

Equation 2

\[ \bar{y}^{\tau} = (1 - \eta) \, y^{\tau} + \eta \, \bar{y}^{\tau - 1} \]

The variables are defined as:

$x_j$: jth input to the neuron.

$\bar{y}^{\tau}$: Trace value of the postsynaptic neuron at time $\tau$.

$w_j$: Synaptic weight between the jth presynaptic neuron and the neuron.

$y^{\tau}$: Activity of the postsynaptic neuron at time $\tau$.

$\alpha$: Learning rate [0, 1].

$\eta$: Trace value [0, 1]. Optimal value dependent on presentation sequence length.

An illustration of trace learning is shown in Figure 1. In this example a bar stimulus moves through the receptive fields (RF) of two input neurons. At time step 1 the bar enters the RF of the upper input neuron. The weight of the connections between the pre- and post-synaptic neurons is initially random but there is convergence between layers. By virtue of graded competition between neurons in the post-synaptic layer, the pre-synaptic neuron will drive a small number of neurons in the post-synaptic layer, and hence the weight of the connection between the co-active neurons increases. The small circle to the right of the post-synaptic cell indicates the activation of the post-synaptic neuron at the previous time step, i.e. the trace.

Figure 1. Illustration of trace learning. A bar is moved over the receptive fields of two input cells at time steps t(1) and t(2). At t(1) the upper pre-synaptic neuron drives one particular cell in the post-synaptic layer and its synaptic strength increases. At t(2) the bar has moved into the receptive field of the lower input neuron, which is mapped onto the same post-synaptic neuron that was driven by the upper input neuron at t(1) due to a trace of previous activity (small circle next to post-synaptic neuron). Over many epochs this trace will lead to cells getting associated that have been excited in temporal proximity.

At time step 2, the bar stimulates the pre-synaptic neuron on the bottom. Again due to random

weights and competition this cell in turn will activate a small number of neurons in the output



layer, and one of them might be the post-synaptic neuron from the previous time step. This

will lead to a strengthening of connections between the coactive cells but influenced by the

trace. Hence, pre-synaptic cell 2 will get associated with the same post-synaptic cell as pre-

synaptic cell 1, in addition to any other cells it might drive during that time step (In the figure

this is indicated by the strengthened connection to the lower post-synaptic cell). Over many

epochs the mapping of neurons will tend to settle down influenced by the trace

‘remembering’ previous activity patterns and hence learn different transforms of objects if

they are presented in temporal continuity.
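The two-step bar example can be traced numerically. The following is a minimal sketch of Equations 1 and 2 for a single post-synaptic neuron, not the VisNet implementation: the function name `trace_update` and the learning rate of 0.1 are assumptions made for illustration, weight normalisation is omitted, and the asymmetric variant of the rule is not covered.

```python
def trace_update(weights, x, y, y_trace_prev, alpha=0.1, eta=0.8):
    """One step of the trace rule for a single post-synaptic neuron.

    weights      -- list of synaptic weights w_j
    x            -- list of pre-synaptic inputs x_j at the current time step
    y            -- post-synaptic activity at the current time step
    y_trace_prev -- trace value carried over from the previous time step
    """
    # Equation 2: blend current activity with the previous trace.
    y_trace = (1.0 - eta) * y + eta * y_trace_prev
    # Equation 1: strengthen each synapse in proportion to trace and input.
    new_weights = [w + alpha * y_trace * xj for w, xj in zip(weights, x)]
    return new_weights, y_trace

# Time step 1: the upper input is active and drives the neuron (y = 1).
weights, trace = trace_update([0.0, 0.0], x=[1.0, 0.0], y=1.0, y_trace_prev=0.0)
# Time step 2: only the lower input is active and the neuron itself is silent (y = 0).
weights, trace = trace_update(weights, x=[0.0, 1.0], y=0.0, y_trace_prev=trace)

# Same sequence with eta = 0, which reduces the rule to a standard Hebb rule.
hebb, hebb_trace = trace_update([0.0, 0.0], [1.0, 0.0], 1.0, 0.0, eta=0.0)
hebb, hebb_trace = trace_update(hebb, [0.0, 1.0], 0.0, hebb_trace, eta=0.0)
```

In this toy sequence the lower input gains weight at time step 2 only because of the residual trace; with the trace value set to 0 its weight stays at zero.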

Wallis and Rolls (1997) implemented this learning rule in a 4 layer competitive network with

convergence of connections between layers from topographically corresponding regions

(VisNet) and showed that such a network can indeed develop invariant representations of

objects and faces. Furthermore, it has been shown that by substituting the original trace rule in

Equation 1 with the asymmetric version shown in Equation 3 the performance of trace

learning can be significantly increased (Rolls and Milward 2000).

Equation 3

\[ \Delta w_j = \alpha \, \bar{y}^{\tau - 1} x_j^{\tau} \]

While trace learning provides an attractive account for unsupervised learning of invariant

object recognition in hierarchical competitive neural networks, three major problems have

been identified:

(1) In order for trace learning to achieve invariant object recognition the training and test

stimuli have to be presented for a large number of training epochs.

(2) Trace learning is limited by the number of transforms it can learn. If too many transforms

are trained, trace learning breaks down.

(3) There is nothing about the features of an object itself that will make the trace learning

mechanism associate two transforms of an object together. For trace learning, all that counts

is that both transforms are presented in temporal proximity.

Continuous Transformation (CT) learning

In this paper we present a novel learning mechanism that relies on the spatial continuity found

in natural visual scenes. Objects typically transform in a continuous fashion: an object

shifting along a trajectory will start at the beginning and then travel through all intermediate

positions in a sequential order until reaching the target position. The same is true for other

types of transformation, like rotation or changes in size. By means of continuous updating of

the feed-forward connection weights between network layers, with CT learning, similar views

of an object get continually associated with, and mapped onto the same postsynaptic neurons

(hence, the name continuous transform learning). Over several layers of a feature hierarchy

this can lead to the development of neurons in the output layer that fire specifically in

response to a certain object, regardless of its state of transformation.

The main concept of CT learning is illustrated in Figure 2. Here we show a simplified 2-layer network with an input layer (lower row of circles) and an output layer (upper row of circles) of 5 neurons each. If a moving bar is presented to the network the corresponding input units get activated (solid circles) and by means of lateral inhibition in the output layer this activity will converge onto a small number of neurons. At stimulus position 1 the three input neurons on the left get activated, and therefore their connections to the active post-synaptic cell will get strengthened (thin lines).

Figure 2. Illustration of the CT effect. A bar moves continuously across the retina and excites corresponding input neurons. At stimulus position 1, input neurons 1-3 on the left fire and drive cell 3 in the output layer. An associative learning rule strengthens the afferent connections of neuron 3 from the 3 pre-synaptic neurons. If the bar moves to position 2, the cells that were already active at position 1 drive the same post-synaptic cell as at position 1. In addition, the newly activated cell is also mapped onto the same output neuron.

The bar then continuously moves across the retina and at stimulus position 2 has shifted by

the equivalent of one neuron to the right. The key point is that in this position, 2 out of the 3

neurons that were active at position 1 are still firing. These 2 cells had their efferent

connections to output cell 3 strengthened when the bar was in position 1 and will therefore

drive the same cell when the bar is in position 2. The weights get updated again, and the



connections between the active post-synaptic cells and the pre-synaptic cells get strengthened

(the thickness of the lines indicates the relative weight of connection between cells). It is

obvious how a repetition of the above process will lead to more and more cells being mapped

onto one output cell, which therefore will exhibit invariant properties even without a trace

term.

In general, the CT effect relies on the continuous transformation of the training stimuli,

expressed by the similarity of input vectors between transforms. In order to develop

invariance, a full set of overlapping transformations will have to be shown for each object.

Different associative learning rules could be used to achieve the CT effect, the simplest

example being a standard Hebb rule (Equation 4),

Equation 4

\[ \Delta w_{ij} = \alpha \, y_i y_j \]

where Δw_ij denotes the change in synaptic weight after activity patterns have been calculated, y_j is the firing rate of presynaptic neuron j, and y_i is the firing rate of post-synaptic neuron i. The learning rate is the constant α, and the weight vector w for each neuron is normalised at the end of every training step, which is a condition of competitive learning (Hertz et al. 1991).
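The moving-bar illustration can be reproduced in a few lines. This is a toy sketch under strong simplifying assumptions, not the VisNet model: competition is winner-take-all rather than graded, all output neurons start with identical normalised weights rather than random ones, and the sizes (12 inputs, 4 outputs, a 3-input bar) and the learning rate of 1.0 are arbitrary. It records only which output neuron wins during a single training sweep.

```python
import math

def bar_view(position, width=3, n_inputs=12):
    """Binary input vector for a bar covering `width` adjacent input neurons."""
    return [1.0 if position <= j < position + width else 0.0 for j in range(n_inputs)]

def train_sweep(positions, n_inputs=12, n_outputs=4, alpha=1.0):
    """Present bar views in order; return the index of the winning output per view."""
    # Assumption: identical, unit-length initial weight vectors for all outputs.
    w0 = 1.0 / math.sqrt(n_inputs)
    weights = [[w0] * n_inputs for _ in range(n_outputs)]
    winners = []
    for position in positions:
        x = bar_view(position, n_inputs=n_inputs)
        activations = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
        # Assumption: winner-take-all competition; ties go to the lowest index.
        winner = max(range(n_outputs), key=lambda i: activations[i])
        winners.append(winner)
        # Hebb rule (Equation 4) with firing rate 1 for the winner and 0 otherwise.
        updated = [w + alpha * 1.0 * xj for w, xj in zip(weights[winner], x)]
        # Normalise the winner's weight vector back to unit length.
        norm = math.sqrt(sum(w * w for w in updated))
        weights[winner] = [w / norm for w in updated]
    return winners

continuous = train_sweep(range(10))        # bar shifts by one input per view
discontinuous = train_sweep([0, 3, 6, 9])  # no overlap between successive views
```

Under these assumptions every view of the continuous sweep is mapped onto the same output neuron, whereas the non-overlapping views each recruit a different output neuron.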

In this paper we present five experiments that demonstrate the characteristics of the CT effect

using a simple version of the Hebb learning rule to train a neural network to distinguish two

3D stimuli from different views. Note that rotation invariance is a problem isomorphic to

translation and size invariance.

All simulations were run in VisNet2 (Rolls and Milward 2000), a four-layer feed-forward

hierarchical network with competition within layers that models the ventral stream of the

primate visual system (see Methods for details). This is essentially a scaled up version of the

model used to illustrate the CT effect in Figure 2.

Methods

The Model

All experiments presented here were conducted using the newest version of VisNet

(VisNet2), which is a model of the primate ventral stream consisting of four competitive

network layers that are set up hierarchically with initially random feed-forward connections

between the layers. However, CT learning is by no means constrained to this particular model

and could be implemented in many different architectures that exhibit the fundamental

properties of convergence and competition.

Each layer corresponds to one particular area of the primate visual system in terms of its

receptive field sizes (Figure 3), as identified by a number of different researchers (e.g., Rolls

1992; Van Essen et al. 1992).

Layer 1 corresponds to V2, layer 2 to V4, layer 3 to posterior inferior temporal cortex, and

layer 4 to anterior inferior temporal cortex.

Connection probabilities are taken from a Gaussian distribution and set so that each cell in layers 2-4 will receive 67% of its feed-forward connections from a topographically corresponding region in the preceding layer. Note that not every neuron in one layer is connected to every neuron in the layer above. The exact connectivity patterns were determined randomly, but then kept constant for all simulations in order to provide a means of comparing network performance with different learning rules. Network dimensions are provided in Table 1.

Figure 3. VisNet architecture and receptive field sizes in the primate visual system. A four-layer hierarchical network with connectivity converging onto topographically corresponding regions in higher layers. Receptive field sizes are set to mirror those found in the macaque monkey (adapted from Rolls and Deco 2002).

           Dimensions        No. of connections    Radius
Layer 4    32 x 32           100                   12
Layer 3    32 x 32           100                   6
Layer 2    32 x 32           100                   6
Layer 1    32 x 32           272                   6
Retina     128 x 128 x 32    -                     -

Table 1. Network dimensions.

Before presentation, images of the stimuli were run through filters that mirror the tuning

profiles of cells found in V1 (e.g., Hubel and Wiesel 1962). The input filters were computed



by weighting the difference of two Gaussians by a third orthogonal Gaussian according to

Equation 5:

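Equation 5

\[ \Gamma_{xy}(\rho, \theta, f) = \rho \left[ e^{-\left( \frac{x \cos\theta + y \sin\theta}{\sqrt{2}/f} \right)^{2}} - \frac{1}{1.6} \, e^{-\left( \frac{x \cos\theta + y \sin\theta}{1.6\sqrt{2}/f} \right)^{2}} \right] e^{-\left( \frac{x \sin\theta - y \cos\theta}{3\sqrt{2}/f} \right)^{2}} \]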

Here, f is the spatial frequency of the filter, θ is the filter orientation, and ρ is the sign of the

filter. Filters exist for spatial frequencies of 0.0625 to 0.5 cycles/pixel, orientations of 0° to

135° in 45° steps, and sign ± 1. The number of filters for each spatial frequency is given in

Table 2.

Frequency (cycles/pixel)    0.5    0.25    0.125    0.0625
No. of connections          201    50      13       8

Table 2. Layer 1 connectivity.
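The input filter of Equation 5 can be evaluated pointwise as sketched below. The sketch assumes that the Gaussian widths in Equation 5 are √2/f, 1.6√2/f and 3√2/f; the function name `visnet_filter` and the choice to measure x and y in pixels from the filter centre are illustrative assumptions.

```python
import math

def visnet_filter(x, y, rho, theta_deg, f):
    """Value of the Equation 5 filter at pixel offset (x, y) from its centre.

    rho       -- sign of the filter (+1 or -1)
    theta_deg -- filter orientation in degrees
    f         -- spatial frequency in cycles/pixel
    """
    theta = math.radians(theta_deg)
    u = x * math.cos(theta) + y * math.sin(theta)  # first rotated coordinate
    v = x * math.sin(theta) - y * math.cos(theta)  # orthogonal coordinate
    s = math.sqrt(2.0) / f                         # assumed width scale
    centre = math.exp(-(u / s) ** 2)
    surround = math.exp(-(u / (1.6 * s)) ** 2) / 1.6
    envelope = math.exp(-(v / (3.0 * s)) ** 2)
    # Difference of two Gaussians, weighted by a third orthogonal Gaussian.
    return rho * (centre - surround) * envelope

# All combinations of the parameter values listed in the text.
frequencies = [0.5, 0.25, 0.125, 0.0625]
orientations = [0, 45, 90, 135]
signs = [1, -1]
filter_bank = [(rho, theta, f) for f in frequencies for theta in orientations for rho in signs]
```

Enumerating the listed frequencies, orientations and signs gives 32 combinations, which is consistent with the depth of 32 given for the retina in Table 1.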

The activation h_i of each neuron i was calculated by summing the inputs y_j from all afferent neurons j weighted by the synaptic weights w_ij:

Equation 6

\[ h_i = \sum_{j} w_{ij} \, y_j \]

Within each layer there was graded competition between neurons, which was implemented in two stages. Note that this allows more than one post-synaptic neuron to fire, because the mechanism is not winner-take-all.

First, to achieve lateral inhibition, the activation of all neurons within a layer was convolved with a spatial filter I (Equation 7). Here, δ controls the contrast, σ specifies the width, and a and b index the distance from the centre of the filter (see Table 3 for typical values).

           Radius, σ    Contrast, δ
Layer 1    1.38         1.5
Layer 2    2.7          1.5
Layer 3    4.0          1.6
Layer 4    1.6          1.4

Table 3. Lateral inhibition parameters.


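Equation 7

\[ I_{a,b} = \begin{cases} -\delta \, e^{-\frac{a^{2} + b^{2}}{\sigma^{2}}} & \text{if } a \neq 0 \text{ or } b \neq 0, \\ 1 - \sum_{a \neq 0,\, b \neq 0} I_{a,b} & \text{if } a = 0 \text{ and } b = 0. \end{cases} \]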
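The lateral inhibition filter can be built explicitly as follows. This is a sketch only: the truncation of the filter at a finite `radius` and the dictionary representation are assumptions made for illustration, and the sum in the centre term is taken over all off-centre entries. The example values of δ and σ are those listed for layer 1 in Table 3.

```python
import math

def inhibition_filter(radius, delta, sigma):
    """Build the Equation 7 filter as a dict mapping offsets (a, b) to weights."""
    kernel = {}
    for a in range(-radius, radius + 1):
        for b in range(-radius, radius + 1):
            if a != 0 or b != 0:
                # Off-centre entries are inhibitory and decay with distance.
                kernel[(a, b)] = -delta * math.exp(-(a * a + b * b) / sigma ** 2)
    # Centre entry: one minus the sum of all off-centre entries.
    kernel[(0, 0)] = 1.0 - sum(kernel.values())
    return kernel

# Assumed truncation radius of 3; delta and sigma as listed for layer 1.
kernel = inhibition_filter(radius=3, delta=1.5, sigma=1.38)
```

By construction the filter weights sum to 1, so the off-centre inhibition is balanced by an excitatory centre weight greater than 1.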

Secondly, a sigmoid activation function (Equation 8) was used to enhance contrast within

layers, where r is the firing rate after lateral inhibition (see above), y is the firing rate after

contrast enhancement, α is the sigmoid threshold, and β is the slope.

Equation 8

\[ y = f^{\text{sigmoid}}(r) = \frac{1}{1 + e^{-2\beta (r - \alpha)}} \]

The parameter α was adjusted to control the sparseness of the firing rates in each layer. For

example, to set the sparseness to 5% one would have to set α to the value of the 95th

percentile point of the activations within a layer. Table 4 shows parameters for the sigmoid

activation function.

A comprehensive review of the VisNet architecture can be found in Rolls and Deco (2002).

           Percentile    Slope, β
Layer 1    99.2          190
Layer 2    98            40
Layer 3    88            75
Layer 4    91            26

Table 4. Sigmoid function parameters.
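The relation between the percentile setting and sparseness can be checked on a small example. The sketch below is illustrative rather than the VisNet code: the `percentile` helper uses linear interpolation, which is an assumption about how the percentile point is computed, and the 100 evenly spaced activations are hypothetical. The slope of 26 is the layer 4 value from Table 4, and the 95th percentile follows the example in the text.

```python
import math

def percentile(values, pct):
    """Percentile of a list using linear interpolation (pct in [0, 100])."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low = math.floor(rank)
    high = math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def sigmoid(r, alpha, beta):
    """Equation 8: firing rate after contrast enhancement."""
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (r - alpha)))

# Hypothetical activations after lateral inhibition: 0.00, 0.01, ..., 0.99.
rates = [i / 100.0 for i in range(100)]
alpha = percentile(rates, 95)  # threshold at the 95th percentile point
firing = [sigmoid(r, alpha, beta=26) for r in rates]
active = sum(1 for y in firing if y > 0.5)  # neurons above half-maximal rate
```

For these hypothetical activations, 5 of the 100 neurons end up above the half-maximal firing rate, i.e. a sparseness of 5%.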

Stimuli

The stimuli used to train the networks were images of continuously rotating 3D objects. They

were created using the OpenGL API, which gives a maximum of control over all stimulus

parameters. In this way it was possible to fine-tune the amount by which each stimulus was

rotated between views of the objects. OpenGL builds a 3D representation of the objects and is

able to transform these with respect to the viewer. The transformed object is then projected

onto a 2-dimensional image of the current view. Lighting was mainly ambient with a diffuse

light source added for image realism.

The two objects used in the current experiments were a cube and tetrahedron matched for

size, which were rotated in front of a blank background around their vertical midline axis over

a range of 180° (after which the transforms repeat themselves, due to the inherent symmetry of the

objects). An example of a rotation with a step size of 18° is given in Figure 4.




Figure 4. Example of 3D stimuli rotated over 180° with a step size of 18°. Stimuli were

created using OpenGL computer graphics and could be set to turn around the vertical axis

with any angle of rotation between views.



Training and test procedure

At the start of each simulation the synaptic weights of the connections between layers were

randomised. Different simulations could therefore differ in their initial weights.

Networks were trained layer by layer, using the following procedure: (1) the stimulus picture

was filtered by means of Equation 5, then (2) the activity of individual neurons was

calculated. Lastly, (3) the synaptic weights were changed by means of the particular learning

rule used.

One training epoch consisted of presentation of all transforms of both objects and training

was repeated for 50 epochs in layer 1, 100 epochs in layers 2 and 3, and 75 epochs in layer 4.

The synaptic weight update was calculated using either the enhanced asymmetric trace rule

(Equation 3 and Equation 2) for trace learning, or the simple associative Hebbian learning

rule given in Equation 4 for demonstrations of the CT effect. The learning rate α was set to

25.0, 6.7, 5.0, and 4.0 for layers 1-4 respectively. For the simulations using trace learning the

trace value η was set to 0.8. These parameters were shown to be effective in previous VisNet

simulations (Rolls and Milward 2000; Stringer and Rolls 2000; Stringer and Rolls 2002).

After training, the network’s performance was tested by once more presenting all

transformations of both objects and monitoring the response patterns in the output layer.

Three different measures of performance were available. On an individual cell level it is

possible to calculate the invariance developed by each neuron by looking at response profiles

for all views of the full 180° rotation. In this way it is possible to quantify the amount of

information coded by each cell, the maximum being 1 bit as calculated with Equation 9.

Equation 9

\[ \text{Maximum cell information} = \log_2(\text{number of stimuli}) \]

A neuron will therefore carry maximum information if it invariantly fires for presentations of

one object and never for the other object, regardless of the view shown.
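The 1-bit ceiling can be verified with a generic mutual-information calculation. The sketch below is a simplified stand-in and not the information measure used in these simulations: it treats the cell's response as binary (fires or does not fire) and the two objects as equiprobable, both of which are assumptions made for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information in bits from a dict {(stimulus, response): probability}."""
    p_s, p_r = {}, {}
    for (s, r), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p  # marginal over stimuli
        p_r[r] = p_r.get(r, 0.0) + p  # marginal over responses
    return sum(p * math.log2(p / (p_s[s] * p_r[r]))
               for (s, r), p in joint.items() if p > 0)

# Ideal cell: fires for every view of the cube and for no view of the tetrahedron.
ideal = {("cube", 1): 0.5, ("tetrahedron", 0): 0.5}
# Uninformative cell: fires for half of the views of each object.
uninformative = {("cube", 1): 0.25, ("cube", 0): 0.25,
                 ("tetrahedron", 1): 0.25, ("tetrahedron", 0): 0.25}

ideal_bits = mutual_information(ideal)
uninformative_bits = mutual_information(uninformative)
max_bits = math.log2(2)  # Equation 9 with two stimuli
```

The ideal cell reaches the maximum of log2(2) = 1 bit, while a cell that fires for half of the views of each object carries no information about object identity.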

In order to investigate to what extent a network has developed invariant cell response

properties in the output layer one can then order all output neurons by their informational

value in a single cell information plot.

If a number of cells exhibit maximum invariance this can be interpreted as evidence that the

network has mastered invariant object recognition. However, in order to make sure that the

network has developed invariance for both objects (cube and tetrahedron) it is necessary to

analyse the information contained across a population of output cells. The multiple cell

information plot shows how much information is available over the 5 neurons carrying the

most information about each object, and ideally should asymptote to 1, if the network is to

have developed invariance. The algorithm that was used for calculating information in a

population code is outlined in detail in Rolls, Treves, and Tovee (1997).

Experiment 1: Single-cell demonstration of CT learning

The aim of the first experiment was to provide a first demonstration of the CT effect in the VisNet architecture. The network was trained on images showing the cube and tetrahedron rotating through 180° with step sizes of 1° between individual views. For the CT effect to occur it is crucial that there is high similarity of input vectors of subsequent transforms, that is, the angle by which an object is rotated between views has to be sufficiently small. Synaptic weights were updated using a standard Hebb rule (Equation 4).

Figure 5. Demonstration of the CT effect in typical cells of the output layer. (A) shows a typical cell in layer 4 responding to the cube invariantly of the view shown. While no cells were found that responded to the tetrahedron in a completely invariant fashion (B), many cells tended to respond invariantly to the same intervals of rotation (approximately 65°-140°, and approximately 0°-65° and 160°-180°). These intervals were found to correspond to a novel surface of the tetrahedron coming into view (C) at the same point in time as tuning switches between neurons.

In each training epoch, all views of an object were shown in sequence, followed by all

transforms of the other object.

Single cell response profiles of typical neurons in layer 4 after training are shown in Figure 5.

The neuron with coordinates (10,10) has developed perfect rotation invariance for the cube

(Figure 5A). It has learned to respond whenever the cube is shown and never responds to the

tetrahedron. Similarly, neurons (22,13) and (14,17) fire specifically in presence of the

tetrahedron (Figure 5B), but for different intervals of rotation. Neuron (22,13) covers views

rotated by 65° to 140°. In turn, cell (14,17) has learned to respond to the complementary

intervals from 0° to 65° and 160° to 180°.

Even though we were unable to find a single cell with maximum information for the

tetrahedron, many cells with response profiles similar to the ones shown here were present,

which might indicate that cells get tuned to distinct subregions of the tetrahedron. This is

illustrated in Figure 5C, where a view of the tetrahedron rotated by 65° is shown,

corresponding to the angle at which neuron (14,17) ceases to fire and neuron (22,13) begins

firing. Incidentally, at this point of the rotation the surface highlighted by the arrow has just

come into view (see also Figure 4 for images of the objects at different rotations), that is, a

catastrophic change in the image occurs. This surface remains in view until just after the

viewing angle reaches 115°.

Experiment 2: Comparison CT learning and Trace learning

In Experiment 1, it was demonstrated that the CT effect is present at the single cell level when

objects were rotated during training with 1° difference between successive views. In order to

further investigate the effect on a network level, in experiment 2, CT learning (using a Hebb

rule, Equation 4) was contrasted with trace learning (Equation 3) by looking at the single cell

and multiple cell information plots comparing different step sizes between successive

transforms.

We discussed earlier how the CT effect relies on a significant overlap between activity

patterns of neighbouring views in a continuous rotation. The exact amount of overlap

necessary will depend on the specific simulation conditions (i.e. model architecture, learning

rate, stimuli). If views are presented in order, over adjacent time steps the same neuron in

layer 2 will be driven by input neurons that are active for both time steps. As the difference

between successive transforms increases CT learning will cease to work as input neurons will

start to excite new 2nd layer neurons rather than keeping the same 2nd layer neurons active and

mapping invariant representations onto higher layer cells.

Trace learning, on the other hand, does not require spatial continuity between transforms and

will therefore work fine with large step sizes between views. However, performance is

predicted to deteriorate with trace learning, if the network is overloaded with too many

transforms (Wallis and Rolls 1997).

Figure 6. Comparison of CT learning with trace learning at different angles of rotation between transforms. Results from single cell information measures (left column) and multiple cell information measures (right column) are presented for CT learning with a standard Hebb rule (solid line), trace learning (dashed line), and an untrained control network with random weights (dotted line).

The performance of three networks after training was compared (Figure 6): one using a standard Hebb rule, one using the trace rule, and one control network that was not trained at all. The same stimulus images of cube and tetrahedron rotating through 180° were used as in Experiment 1, but at step sizes of 1°, 2°, 9°, and 36° between views.
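The number of transforms per object implied by each step size follows from simple division. The sketch below assumes that the 180° range is divided evenly and that the view at 180° is not counted separately from the view at 0°; the function name is illustrative.

```python
def transforms_per_object(step_deg, range_deg=180):
    """Number of views when rotating through `range_deg` in steps of `step_deg`."""
    return range_deg // step_deg

# Step sizes used in Experiment 2, in degrees between successive views.
views = {step: transforms_per_object(step) for step in (1, 2, 9, 36)}
```

Under this assumption the four step sizes correspond to 180, 90, 20, and 5 transforms per object respectively.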

The finding that the CT effect is present for the cube and tetrahedron, if the step size between

views is 1°, is replicated in this experiment: the single cell plot on the left reveals good

performance with the Hebb learning rule at 1°. A fair number of cells are carrying maximum

information of 1 bit. Interestingly, trace learning achieves good performance at this step size

as well.

An explanation for this is that the CT effect will drive trace learning when there is enough

overlap in input activity between successive transforms, even under conditions in which trace

learning would usually be compromised. The mechanism, in which the CT effect takes over

trace learning, is based on post-synaptic neurons being kept active over successive transforms

and hence strengthening connection weights of input cells onto one invariant 2nd layer neuron.

Note that this is completely independent of the trace term $\bar{y}^{\tau}$, as long as views are presented

separately for both objects. The multiple cell plots on the right mirror the single cell results.
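
The mechanism described above, in which a post-synaptic neuron is kept active across overlapping views and therefore collects all of them, can be sketched with a toy competitive network. This is a minimal illustration under strong assumptions and not the model used in the experiments: views are sliding windows of binary inputs, competition is a hard winner-take-all, only the winning cell applies the Hebb update, weights are clipped at 1, and the untrained cells start with a weak topographic bias so that unrelated inputs excite different cells. All names and parameter values are hypothetical.

```python
def train_and_count_winners(step_deg, window=10, n_out=6, block=32,
                            eps=0.01, alpha=1.0, span_deg=180):
    """Train on views presented in order for one epoch, then return how
    many different output cells win across the trained views."""
    n_in = n_out * block
    # Toy assumption: output cell k starts with weight eps on the inputs
    # of its own block and 0 elsewhere (weak topographic bias).
    w = [[eps if j // block == k else 0.0 for j in range(n_in)]
         for k in range(n_out)]
    views = [range(a, a + window) for a in range(0, span_deg, step_deg)]

    def winner(view):
        acts = [sum(w[k][j] for j in view) for k in range(n_out)]
        return acts.index(max(acts))

    for view in views:
        k = winner(view)
        for j in view:  # Hebb update for the winning cell only
            w[k][j] = min(1.0, w[k][j] + alpha)
    return len({winner(view) for view in views})


print(train_and_count_winners(1))   # small steps: views overlap
print(train_and_count_winners(36))  # large steps: no overlap
```

With 1° steps a single cell wins for every view, because the inputs it has just learned are still active on the next view. With 36° steps successive views share no inputs in this toy and each of the five views recruits a different cell. The step size at which the toy breaks down is set by the arbitrary window width and says nothing about the threshold observed in the simulations reported here.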

When the step size between transforms is increased to 2°, one is confronted with a drastically

different picture. The CT effect has broken down, and so has trace learning. Even a small

increment in step size has reduced the overlap of firing to the extent that there are no

completely invariant neurons developing as a result of the CT effect. Trace learning without

help of the CT effect breaks down under the load of 90 transforms per object. This confirms

the explanation for trace learning showing good performance at 1° but being reliant on the CT

effect. The multiple cell plots reveal a better population performance for trace than for Hebb,

but without the development of invariant cells one cannot claim that the network has mastered

the task of rotation invariant object recognition.

As the step sizes between transforms increase from 9° to 36° the Hebb rule’s performance

remains poor. The CT effect does not work with large differences between successive

transforms. The trace rule, on the other hand, becomes more successful as the total number of

transforms for each object decreases down to 5 for 36° step sizes. These results again are

substantiated by the performance shown in the multiple cell plots on the right.
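
The transform counts quoted for each condition follow directly from the 180° range and the step size. The short calculation below makes this explicit, assuming that views are taken at 0°, one step, two steps, and so on, with the end point of the range excluded. The function name is hypothetical.

```python
def transforms_per_object(step_deg, span_deg=180):
    """Number of views needed to cover span_deg at the given step size."""
    return span_deg // step_deg


counts = {step: transforms_per_object(step) for step in (1, 2, 9, 36)}
print(counts)  # {1: 180, 2: 90, 9: 20, 36: 5}
```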

The overall picture that emerges from these graphs is that the CT effect can drive

performance at small step sizes for Hebb rule as well as trace rule learning. With larger

rotations between views the CT effect breaks down as fewer and fewer neurons are commonly

activated between successive transforms and hence will not map onto only one post-synaptic

neuron by keeping it active over time. The trace learning rule, on the other hand, improves

performance with large step sizes and few transforms per object due to genuine trace learning

dynamics.

Experiment 3: CT learning and trace learning with interleaved stimuli

Trace learning relies on temporal continuity between transforms of the same object in the

visual environment. However, one could imagine a realistic scenario in which alternating

views of two stimuli are projected onto the retina. When scanning a visual scene, saccadic eye

movements do not necessarily scan objects individually but might jump from prominent

features in one object to prominent features of a second object. Under such conditions trace

learning will not achieve invariant object recognition, because the features of the

two objects become associated with each other.

Figure 7. Comparison between CT learning and trace learning with interleaved stimuli.

The network was trained on interleaved views of the cube and tetrahedron over a rotation of

180° (e.g., cube1, tetrahedron1, cube2, tetrahedron2, …). Between views objects were rotated

by 1°.

CT learning, on the other hand, relies on spatial regularities in the visual environment and will

be able to dissociate two objects by virtue of differences in local features. The network can

therefore develop invariant cells to two different stimuli with a Hebbian associative learning

rule even when these stimuli are presented in an interleaved fashion. Trace learning fails

under these conditions.

In Experiment 3 the cube and the tetrahedron were again rotated over 180° with 1° rotations

between individual views, but in contrast to the previous experiments the objects were not

shown individually but interleaved (the first view of the cube was followed by the first view

of the tetrahedron, followed by the second view of the cube, again followed by the second

view of the tetrahedron, and so on).
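
The presentation order can be written down explicitly. The sketch below builds the interleaved sequence described above and checks that no two temporally adjacent views belong to the same object, which is the property that defeats a purely temporal association. The object labels and the function name are placeholders for illustration.

```python
from itertools import chain


def interleaved_sequence(objects=("cube", "tetrahedron"), n_views=180):
    """Return (object, view index) pairs in the order
    cube1, tetrahedron1, cube2, tetrahedron2, ..."""
    per_view = ([(obj, v) for obj in objects] for v in range(1, n_views + 1))
    return list(chain.from_iterable(per_view))


seq = interleaved_sequence()
print(seq[:4])

# Count temporally adjacent pairs that show the same object.
same_object_neighbours = sum(a[0] == b[0] for a, b in zip(seq, seq[1:]))
print(same_object_neighbours)
```

Every view is preceded and followed by a view of the other object, so the count of same-object neighbours is zero, while spatially adjacent views of one object are still only one presentation apart.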

Numerical results are shown in Figure 7. CT learning by means of the Hebb rule has enabled

a number of 4th layer neurons to reach a maximal performance of 1 bit, which is expressed in

good single and multiple cell information measures. The difference in features between the

two stimuli has been sufficient for the CT effect to dissociate the two objects on a view-by-

view basis, even if the views are alternating between the two objects. Furthermore, the

activity pattern of pre-synaptic neurons that are driven by the presentation of successive

transforms of either object are similar enough to reactivate the same post-synaptic neurons on

successive views of one object, even if temporal continuity has been disturbed by the


presentation of a view of the other object. In contrast, trace learning does not work without

temporal continuity – cube and tetrahedron get associated due to the temporal proximity of

views of alternating objects in this presentation paradigm. The performance of the trace

learning rule is not significantly different from the random control condition in the single cell

information plots. In addition, the multiple cell information measure does not reach a value of

1 bit.

Experiment 4: CT learning with randomised views of objects

Having shown that CT learning is superior to trace learning when temporal continuity breaks

down (Experiment 3), Experiments 4 and 5 were meant to focus on exploring the limits of CT

learning and how some of its weaknesses might be overcome.

An agent in real world conditions cannot expect to be shown a full continuous rotation of an

object at a particular point in time. One might encounter an object rotating through a specific

angle on one occasion and only later encounter the same object rotating through a different

angle.

So far, the CT effect was observed to occur when objects were presented in a continuous

motion over 180°. In Experiment 3 it was shown that the effect does not necessarily rely on

views of the two objects being presented in a temporally continuous fashion. In continuation

of these earlier results, in Experiment 4, we show that in principle CT learning does not even

necessarily rely on transforms being presented in order as long as the motion is spatially

continuous (i.e., the views shown cover the whole rotation). Views of the rotating cube and

tetrahedron were split up into blocks of 30° and the network was then trained on these blocks

presented in a random order.

Results are shown in Figure 8. In the present simulations the learning rate was tuned from its

original value to 0.1 in all layers, in order to improve performance. As can be seen in the

single cell and multiple cell information plots, with these parameters, CT learning can

successfully be demonstrated using a standard Hebb rule. To explain this, it is important to

note the number of training epochs (50, 100, 100, and 75 for layers 1-4 respectively). Since

blocks are chosen at random in this presentation paradigm, the high number of epochs will

invariably lead to the presentation of adjacent blocks in order. That is, while on most

presentation trials blocks will be presented that are not continuous (e.g., the blocks containing

the rotation intervals of 0°-30° and 60°-90°) a single presentation of two blocks in order (0°-

30° and 30°-60°) will lead to a remapping of previously learned representations of two

spatially adjacent blocks onto the same neurons in the 2nd layer. Therefore, as long as all

blocks are shown in order at least once, invariance can develop by means of CT learning.

These results demonstrate that, at least in principle, CT learning can still occur with a

randomised block ordering of stimulus views during training.
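
The argument that a high number of epochs makes ordered presentations of adjacent blocks very likely can be checked with a small simulation. The sketch below assumes that the six blocks of one object are reshuffled once per epoch, which is an assumption about the procedure and not a detail stated in the text. It counts how many epochs pass before every pair of neighbouring blocks has been shown back to back in the correct order at least once. Function names and the random seed are hypothetical.

```python
import random


def make_blocks(span_deg=180, block_deg=30, step_deg=1):
    """Split the rotation into consecutive blocks of view angles (degrees)."""
    return [list(range(start, start + block_deg, step_deg))
            for start in range(0, span_deg, block_deg)]


def epochs_until_all_adjacent_seen(n_blocks=6, max_epochs=10_000, seed=0):
    """Count epochs of randomly ordered blocks until every neighbouring
    pair (i, i + 1) has appeared back to back, in order, at least once."""
    rng = random.Random(seed)
    needed = {(i, i + 1) for i in range(n_blocks - 1)}
    order = list(range(n_blocks))
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(order)
        needed -= set(zip(order, order[1:]))
        if not needed:
            return epoch
    return None


blocks = make_blocks()
print(len(blocks), blocks[0][0], blocks[-1][-1])
print(epochs_until_all_adjacent_seen())
```

In a random ordering of six blocks, a given neighbouring pair appears in order with probability 1/6, so the required number of epochs is typically small compared with the 50 to 100 epochs used in training. The exact count printed depends on the seed.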


Figure 8. Demonstration of the CT effect with training the network on randomised

views of the stimuli. In this experiment the full set of transformations was divided into 6

blocks of 30° for each object. Then the network was trained on these blocks presented in

random order. The transformations within each block were shown in the usual continuous

fashion.

Experiment 5: Generalisation in higher layers

In the previous experiments, the CT effect has been demonstrated when the full view space of

180° was sampled by many closely spaced views during training. In contrast, the primate

visual system can learn to recognise objects invariantly from only a few canonical views. A

technique that has been used in earlier demonstrations of invariant object recognition using

the trace rule has been to train the network on a full set of stimuli in layers 1 and 2 (Stringer

and Rolls 2002). This was done in order to build a limited amount of invariance to low-level

stimulus features into the early layers of the network. It has been shown that the higher layers

can then generalise to respond invariantly to novel stimulus transforms after training on only

a small set of canonical views.

In Experiment 5, we investigated if the same mechanism can be applied for CT learning using

a standard Hebb rule. The first 2 layers were trained on a full set of 180 transforms with 1°

rotations between transforms. Next, in a separate training session, layers 3 and 4 were only

trained on 5 views with 36° rotation between transforms. The network was then tested on a

full set of stimuli of 180 views spaced by 1°.

Numerical results are shown in Figure 9. Even though layers 3 and 4 have not been exposed

to 97% of the views they were tested on, the network achieves good invariance in both single

cell and multiple cell information plots as compared to the untrained network condition.
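
The figure of 97% follows from the number of training and test views, assuming that the 5 canonical views are themselves part of the 180 test views. The function name below is hypothetical.

```python
def unseen_fraction(trained_views, tested_views):
    """Fraction of test views that were never shown during training."""
    return 1 - trained_views / tested_views


# 5 canonical views for layers 3 and 4, tested on 180 views spaced by 1 degree.
print(f"{unseen_fraction(5, 180):.1%}")  # 97.2%
```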

This result can be attributed to the fact that training in the first 2 layers on a full set of stimuli

has enabled the network to create neurons in layer 2 that generalise within part of the rotation

space of an object, leaving layer 3 and 4 neurons to associate together the different subregions

of the rotation space.


Figure 9. Demonstration of generalisation in the higher layers after building invariance

in the first two layers.

Neurons in the lower layers, which develop invariant responses to simple features during

early visual experience, can help higher layers of the network to generalise to novel stimulus

views. In this simulation the first two layers of the network were trained on a full set of 180

transformations for each object. The higher layers could then be trained on sparser views of 5

transformations with 36° difference in rotation and still develop invariant neurons.

In this way it is possible that early visual experience that has built limited feature invariance

into the early layers helps CT learning to free itself from the constraint of having to be trained

on full sets of closely spaced transforms.

Discussion

In this paper we presented a novel unsupervised learning mechanism, which allows networks

to develop transformation invariance. The key condition for the CT effect to work is that there

has to be significant overlap between firing patterns of successive views of an object. These

views do not need to be shown in order (Experiment 4) and can be interleaved with views

from other objects, provided the input activity vectors of views of the other object are not too

similar to any of the views of the original object (Experiment 3). If conditions allow,

subsequent views of an object are then mapped onto the same higher layer neuron and over

successive processing stages neurons in layer 4 can develop that respond specifically to one

object, regardless of the view shown.

In Experiment 5, it was shown that by building feature invariance into the early layers of the

network, the higher layers can develop invariance to all views of an object, even if they are

only trained on a few canonical views of the objects. This early feature invariance is enough

to consistently drive the same invariant cells in the upper layers even if these have not been

trained on all stimulus views. This last experiment shows that once feature invariance has

been built up, invariant object recognition with CT learning does not rely on the continuous


transformation of objects under which the CT effect had first been demonstrated (in

Experiments 1 and 2).

CT learning exhibits a number of advantages over trace learning.

(1) CT learning does not require the existence of a memory trace but can be demonstrated

using a standard Hebb rule (Equation 4). At the same time the CT effect does not rely on the

standard Hebb rule but could also be implemented with other associative learning rules (for

example a trace rule, as shown in Experiment 2). This property makes CT learning very

flexible with regard to novel developments in the neuroscience of vision. These points add to

its biological plausibility.

(2) CT learning performance does not deteriorate with an increasing number of transforms. In

fact, it was shown in some simulations (not presented in this paper) that it is possible to

further increase performance with even smaller step sizes than 1° and even more transforms

shown during training. The trace rule, on the other hand, has been found to be limited in terms

of its capacity to train networks on large numbers of transforms (Wallis and Rolls 1997).

(3) At least in theory, it can be seen how a network applying the CT effect could develop

invariance from just one training epoch – a condition that is of crucial importance if one

wants to generalise findings from artificial neural networks to biological systems.

(4) CT learning takes into account the actual spatial setup of features of objects in order to

recognise them. In contrast, trace learning will associate any two views with the

same object as long as they occur in temporal proximity.

In addition to demonstrating the CT effect and some of its properties, we contrasted

performance of Hebb trained networks with those of networks trained with the trace rule. An

interesting finding was that with small step sizes between views the CT effect was driving

performance of trace learning under conditions that would typically not allow trace learning

to successfully develop invariance (Experiment 2). When step sizes between views increase

the CT effect gets compromised, as there is not enough overlap between successive views.

Trace learning, on the other hand, will show good performance when the number of

transforms shown decreases.

Models of object recognition will at some point have to live up to a comparison with the

biological visual system. Evidence from Experiments 1 and 2 suggests that CT learning and

trace learning could effectively be combined in the training of networks to develop

transformation invariance, especially since the CT effect does not need to be implemented

into the trace learning rule explicitly. In Experiment 1, the failure to find neurons in layer 4

that have developed complete invariance to the tetrahedron suggests that CT learning operates

well across continuous, or quantitative, changes in the view properties, but not across large

discontinuities in the form of catastrophic changes that occur occasionally in rotating objects

(see Koenderink 1990). This is by definition the domain of the trace learning rule, because it

does not require any similarity to associate views of an object together, provided temporal

continuity is preserved. It can be expected that discontinuity in response profiles is a

fundamental property of CT learning with objects more complex than the highly self-similar

cube.

One caveat of proposing a CT-aided trace learning mechanism to be responsible for the

development of invariance in the primate visual system is that the trace rule fails to solve the


problem of interleaved stimuli. Especially in the context of human observers actively

scanning visual scenes by virtue of saccadic eye-movements this becomes a problem.

Currently, a general theory of the underlying mechanisms that govern the selection of saccade

targets in natural scenes remains elusive. However, the most likely account is one in which high-

and low-level salient features are combined in a saliency map (Itti and Koch 2001;

Henderson 2003). Low-level salient features are in turn not necessarily occurring in only one

object and therefore eye movements will tend to fall on alternating objects, especially in the

early parts of scanning a novel scene. This does not pose a problem for the CT effect, but

trace learning will be compromised by such a breakdown in temporal continuity between

views of objects.

Experiment 5 opened up another point of interest with regard to natural viewing conditions.

In order for CT learning to work under natural viewing conditions, feature detectors of limited

invariance have to develop in the first 2 layers of the network in order to allow the last two

layers to be trained on only a few canonical views. In the present experiments this was

achieved by presenting a full set of rotation views of the two objects to the first two layers.

But how could such invariance be set up in a biological system? A first step to solving this

problem would be to demonstrate that training the network on complex scenes including

many different shapes and objects can set up feature invariance. CT learning can provide a

valid account of invariant object recognition in the real brain if the statistics of natural scenes

allow the first layers to set up a similar low level invariance as was done in Experiment 5. The

specific dynamics involved in the process of setting up these feature detectors in response to

training with complex scenes are hard to predict and have therefore got to be tested in

quantitative simulations. One tentative possibility of how this could work stems from the

finding that infants have problems with pursuit eye-movements (von Hofsten and Rosander

1997) – when the eyes are following a moving target they often tend to ‘slip’ off the target in

young infants. Incidentally, this might aid the process of setting up these early invariant

feature detectors because this provides a natural source of continuous object motion with

respect to the retina.

In the experiments presented above we have only used two simple objects to train the

network. This was done in order to provide an initial account of CT learning that is as simple

and straightforward as possible. However, a low estimate of the recognition capacity of the

primate visual system is in the order of magnitude of 100,000 different objects (Biederman

1987). Future research will therefore have to face up to the problem of, on the one hand,

increasing the complexity of training stimuli and, on the other hand, increasing the total

number of objects the network has to learn invariantly. One approach to solve this problem

could be to investigate how scaling up the number of nodes in each layer of the network

would impact on its capacity. In the present experiments, VisNet was run with 1024 cells in

each layer, which is several orders of magnitude smaller than the number of neurons one finds

in the primate visual system (Rolls and Deco 2002).

Furthermore, if one combines a larger network with more complex objects (objects with a

more complex basic structure or texture) this might increase the capacity significantly because

the network would have to cope with more complex input vectors but would also be given

more information on which to base the distinction of objects.


Conclusion

Given that early visual areas are set up in topological maps, and that associative learning (i.e.,

long-term potentiation) is a common mechanism in the visual system (Artola and Singer

1993; Singer 1995; Fregnac 1996), a process with the characteristics of the CT effect is likely

to play an important role in the self-organisation of the ventral stream to develop invariant

representations of objects. This happens under the condition that input stimuli transform in

small steps.

Under these architectural constraints the CT effect is not merely an additional mechanism but

an intrinsic and ubiquitous property of the visual system.


References

Artola A, Singer W (1993) Long-term depression of excitatory synaptic transmission and its

relationship to long-term potentiation. Trends Neurosci 16: 480-487

Biederman I (1987) Recognition by components: A theory of human image understanding.

Psychological Review 94: 115-147

Boussaoud D, Desimone R, Ungerleider LG (1991) Visual topography of area TEO in the

macaque. J Comp Neurol 306: 554-575

Desimone R (1991) Face-selective cells in the temporal cortex of monkeys. Journal of

Cognitive Neuroscience 3: 1-8

Field DJ (1994) What is the goal of sensory coding? Neural Computation 6: 559-601

Fregnac Y (1996) Dynamics of functional connectivity in visual cortical networks: an

overview. J Physiol Paris 90: 113-139

Fukushima K (1980) Neocognitron: a self organizing neural network model for a mechanism

of pattern recognition unaffected by shift in position. Biol Cybern 36: 193-202

Földiák P (1991) Learning invariance from transformation sequences. Neural Computation 3:

194-200

Henderson JM (2003) Human gaze control during real-world scene perception. Trends in

Cognitive Sciences 7: 498-504

Hertz J, Krogh A, Palmer RG (1991) Introduction to the Theory of Neural Computation.

Addison-Wesley, Redwood City, CA

Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional

architecture in the cat's visual cortex. J Physiol 160: 106-154

Itti L, Koch C (2001) Computational modelling of visual attention. Nature Reviews

Neuroscience 2: 194-203

Koenderink JJ (1990) Solid Shape. MIT Press, Cambridge, MA

Poggio T, Edelman S (1990) A network that learns to recognize three-dimensional objects.

Nature 343: 263-266

Rolls ET (1992) Neurophysiological mechanisms underlying face processing within and

beyond the temporal cortical visual areas. Philos Trans R Soc Lond B Biol Sci 335:

11-20; discussion 20-21

Rolls ET (2000) Functions of the primate temporal lobe cortical visual areas in invariant

visual object and face recognition. Neuron 27: 205-218

Rolls ET, Deco G (2002) Computational neuroscience of vision. OUP, New York


Rolls ET, Milward T (2000) A model of invariant object recognition in the visual system:

learning rules, activation functions, lateral inhibition and information-based

performance measures. Neural Computation 12: 2547-2572

Rolls ET, Treves A, Tovee MJ (1997) The representational capacity of the distributed

encoding of information provided by populations of neurons in primate temporal

visual cortex. Exp Brain Res 114: 149-162

Singer W (1995) Development and plasticity of cortical processing architectures. Science

270: 758-764

Stringer SM, Rolls ET (2000) Position invariant recognition in the visual system with

cluttered environments. Neural Networks 13: 305-315

Stringer SM, Rolls ET (2002) Invariant object recognition in the visual system with novel

views of 3D objects. Neural Computation 14: 2585-2596

Tanaka K, Saito H, Fukada Y, Moriya M (1991) Coding visual images of objects in the

inferotemporal cortex of the macaque monkey. J Neurophysiol 66: 170-189

Tarr MJ, Bülthoff HH (1995) Is human object recognition better described by geon structural

descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993).

J Exp Psychol Hum Percept Perform 21: 1494-1505

Van Essen DC, Anderson CH, Felleman DJ (1992) Information processing in the primate

visual system: an integrated systems perspective. Science 255: 419-423

von Hofsten C, Rosander K (1997) Development of smooth pursuit tracking in young infants.

Vision Res 37: 1799-1810

Wallis G, Rolls ET (1997) Invariant face and object recognition in the visual system. Prog

Neurobiol 51: 167-194
