Hilary Term 2005
Continuous transformation learning: A novel learning mechanism for invariant object recognition in feed-forward hierarchical neural networks
Candidate Number: 39312
Date: 10/3/06
Word Count: 8670
CT Learning Candidate: 39312
Abstract
Introduction
Trace learning
Continuous Transformation (CT) learning
Methods
The Model
Stimuli
Training and test procedure
Experiment 1: Single-cell demonstration of CT learning
Experiment 2: Comparison CT learning and Trace learning
Experiment 3: CT learning and trace learning with interleaved stimuli
Experiment 4: CT learning with randomised views of objects
Experiment 5: Generalisation in higher layers
Discussion
Conclusion
References
Abstract
In this paper, we present a novel unsupervised learning mechanism, which allows networks to
recognise objects invariantly. Continuous Transformation (CT) learning relies on continual
synaptic modification (for example, by means of an associative Hebb rule) of the feedforward
inter-layer connection weights in a neural network during training on transforming (e.g.,
rotating, translating, etc.) visual stimuli. If the firing patterns of successive views overlap
substantially, CT learning continually associates similar views of an object with, and maps
them onto, the same postsynaptic neurons. In this way it relies on the spatial continuity
of objects in the environment.
Here, we present five experiments that demonstrate the characteristics of the CT effect and
contrast its performance with that of another approach, trace learning, that relies on temporal
continuity of input stimuli.
In numerical simulations we find that the CT effect operates well when successive views of a
transforming stimulus are similar enough to drive the same post-synaptic neurons. This is
shown to occur largely independent of the specific order of presentation of the views.
Furthermore, if there is some invariance built into low-level feature detectors, the CT effect
drives the development of neurons in layer 4 that are tuned to specific objects independent of
rotation, even if the higher layers have only been trained on a few canonical views of the
object.
Introduction
One fundamental property of the primate visual system is its capability of identifying objects
invariantly of the particular viewing context. A chair can effortlessly be recognised as a chair
regardless of the specific viewing angle, its position relative to the retina, the lighting
conditions, and the background.
A general assumption adopted by many researchers is that any recognition mechanism has to
be based on a process of matching a view of an object to templates in memory (Poggio and
Edelman 1990; Tarr and Bülthoff 1995). However, objects in the natural environment occur
under conditions that are far from ideal for matching a template. If, for example, a stimulus is
shifted by only a small amount with respect to the template, a comparison using a dot product
will return a very low correlation between stimulus and template. Evidence on how the brain
solves this problem comes from single-cell physiology in monkeys. It has been shown that at
the level of inferior temporal cortex there are neurons that respond to specific objects
invariantly of the exact position on the retina, the size of the object, and the particular view
that is visible of the object (Desimone 1991; Tanaka et al. 1991; Rolls 1992; Rolls 2000).
A central issue in computational neuroscience is therefore how such invariant cells can
develop over successive processing stages in the ventral stream.
One approach to investigating the underlying mechanisms that might be involved in the
development of invariant neurons is to build biologically inspired artificial neural networks
and, by implementing different learning algorithms and network architectures, to study the
dynamics of self-organisation in such networks.
One specific type of network architecture that has been popular among modellers of the visual
system is based on the existence of feature hierarchies in the ventral stream (Fukushima
1980). Implementations of this approach typically involve routing visual information from the
input layer of a neural network in a feed-forward way through a hierarchy of layers (e.g.,
Poggio and Edelman 1990; Wallis and Rolls 1997). Competition between neurons within a
layer helps to keep firing rates low, and convergence of connectivity between layers confines this
competition to neurons with spatially similar receptive fields and leads to a steady increase
in receptive field size from layer to layer.
Such a setup is supported by neurophysiological findings. Whereas in V1 individual cells
respond to an area of roughly 1.3° of visual angle, in IT receptive fields tend to span the whole
visual field (≈50° of visual angle) (see Figure 3) (Boussaoud et al. 1991).
Within the framework of feature hierarchies, ‘simple’ feature detectors in V1 are tuned, for
example, to a bar of specific orientation (Hubel and Wiesel 1962). If several of these ‘simple’
cells project onto a cell in the next higher layer of the network, and this conjunction neuron’s
activity is calculated by means of a non-linear function, so that it only fires when several of
its afferent connections are active, one can build up a more and more complex representation
of objects as one ascends through the layers. Such a neuron could represent, for example, a
bar intersection. Several complex cells can then project to a neuron in an even higher layer,
and so forth. Due to the increasing receptive field sizes one convenient by-product of such a
feature hierarchy is that the higher up a cell is in the processing stream, the more invariant to
location its firing will be.
A major problem with feature hierarchies as a model of invariant object recognition is related
to capacity. In order to be able to represent objects in a way that mirrors the complexity of
their appearance in the environment, a high number of conjunction neurons have to be
present. The development of feature hierarchies in neural networks has therefore been
influenced by the question of how best to streamline the process of parsing natural scenes into
individual features. In this respect, nature is lending a helping hand in that visual input is not
chaotic but exhibits certain statistical regularities (Field 1994). Given that the evolution of
primate vision has taken place with these regularities remaining constant, it seems reasonable
to assume that the visual system will have adapted to these conditions.
Trace learning
In the context of feature hierarchies, one influential learning mechanism that draws on
statistical regularities in natural scenes was first proposed by Földiák (1991) and then
elaborated by Rolls (1992). Trace learning utilises the fact that objects under normal viewing
conditions show temporal continuity. It does so by assuming that an object that was
encountered at time step $\tau$ is likely to be the same object at time step $\tau + 1$, regardless of any
transformations of the object that might have occurred between time steps. This is
implemented by modifying an associative learning rule to bias neurons to respond to patterns
of activity in the input layer that occur in temporal proximity (Equation 1). The trace that is
used to strengthen connections between coactive pre- and post-synaptic neurons is a function
of the activity at the current and previous time step as outlined in Equation 2. Here the trace
value $\eta$ has to be optimised depending on the length of the presentation sequence. Note that if
one sets the trace value $\eta$ to 0, the trace rule becomes a standard Hebb rule. Connection
weights between cells in adjacent layers are then updated according to the relative activity
patterns.
Equation 1
\[ \delta w_j = \alpha \bar{y}^{\tau} x_j \]
Equation 2
\[ \bar{y}^{\tau} = (1 - \eta) y^{\tau} + \eta \bar{y}^{\tau - 1} \]
The variables are defined as:
$x_j$: $j$th input to the neuron.
$\bar{y}^{\tau}$: Trace value of the postsynaptic neuron at time $\tau$.
$w_j$: Synaptic weight between the $j$th presynaptic neuron and the neuron.
$y^{\tau}$: Activity of the postsynaptic neuron at time $\tau$.
$\alpha$: Learning rate [0, 1].
$\eta$: Trace value [0, 1]. Optimal value dependent on presentation sequence length.
An illustration of trace learning is shown in Figure 1. In this example a bar stimulus moves
through the receptive fields (RF) of two input neurons. At time step 1 the bar enters the RF of
the upper input neuron. The weight of the connections between the pre- and post-synaptic
neurons is initially random but there is convergence between layers. By virtue of graded
competition between neurons in the post-synaptic layer, the pre-synaptic neuron will drive a
small number of neurons in the post-synaptic layer, and hence the weight of the connection
between the co-active neurons increases. The small circle to the right of the post-synaptic cell
indicates the activation of the post-synaptic neuron at the previous time step, i.e. the trace.
Figure 1. Illustration of trace learning. A bar moves over the receptive fields of
two input cells at time steps t(1) and t(2). At t(1) the upper pre-synaptic neuron drives one
particular cell in the post-synaptic layer and its synaptic strength increases. At t(2) the bar has
moved into the receptive field of the lower input neuron, which is mapped onto the same
post-synaptic neuron that was driven by the upper input neuron at t(1) due to a trace of previous
activity (small circle next to post-synaptic neuron). Over many epochs this trace will lead to
cells becoming associated that have been excited in temporal proximity.
At time step 2, the bar stimulates the pre-synaptic neuron on the bottom. Again due to random
weights and competition this cell in turn will activate a small number of neurons in the output
layer, and one of them might be the post-synaptic neuron from the previous time step. This
will lead to a strengthening of connections between the coactive cells but influenced by the
trace. Hence, pre-synaptic cell 2 will get associated with the same post-synaptic cell as pre-
synaptic cell 1, in addition to any other cells it might drive during that time step (In the figure
this is indicated by the strengthened connection to the lower post-synaptic cell). Over many
epochs the mapping of neurons will tend to settle down influenced by the trace
‘remembering’ previous activity patterns and hence learn different transforms of objects if
they are presented in temporal continuity.
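The two time steps of the bar example can be traced numerically. The following is a minimal sketch of Equations 1 and 2 for a single post-synaptic neuron, not the VisNet implementation; the function name `trace_update`, the parameter names `alpha` (learning rate) and `eta` (trace value), and the firing rates chosen are illustrative assumptions.

```python
def trace_update(w, x, y, y_trace_prev, alpha=0.1, eta=0.8):
    """Apply Equations 1 and 2 once for a single post-synaptic neuron."""
    # Equation 2: mix the current firing rate with the previous trace.
    y_trace = (1.0 - eta) * y + eta * y_trace_prev
    # Equation 1: each weight changes in proportion to trace times input.
    w_new = [w_j + alpha * y_trace * x_j for w_j, x_j in zip(w, x)]
    return w_new, y_trace


# Time step 1: the bar drives the upper input and the neuron fires (y = 1).
w, trace = trace_update([0.0, 0.0], x=[1.0, 0.0], y=1.0, y_trace_prev=0.0)
# Time step 2: the bar drives the lower input. Even if the neuron's own
# response to this input were zero (assumed here), the trace is still positive.
w, trace = trace_update(w, x=[0.0, 1.0], y=0.0, y_trace_prev=trace)
print(w, trace)
```

In this sketch the weight from the lower input grows at time step 2 only because of the remembered activity. Setting `eta=0` removes that carry-over and leaves a plain Hebb update.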
Wallis and Rolls (1997) implemented this learning rule in a 4 layer competitive network with
convergence of connections between layers from topographically corresponding regions
(VisNet) and showed that such a network can indeed develop invariant representations of
objects and faces. Furthermore, it has been shown that by substituting the original trace rule in
Equation 1 with the asymmetric version shown in Equation 3 the performance of trace
learning can be significantly increased (Rolls and Milward 2000).
Equation 3
\[ \delta w_j = \alpha \bar{y}^{\tau - 1} x_j^{\tau} \]
While trace learning provides an attractive account for unsupervised learning of invariant
object recognition in hierarchical competitive neural networks, three major problems have
been identified:
(1) In order for trace learning to achieve invariant object recognition the training and test
stimuli have to be presented for a large number of training epochs.
(2) Trace learning is limited by the number of transforms it can learn. If too many transforms
are trained, trace learning breaks down.
(3) There is nothing about the features of an object itself that will make the trace learning
mechanism associate two transforms of an object together. For trace learning, all that counts
is that both transforms are presented in temporal proximity.
Continuous Transformation (CT) learning
In this paper we present a novel learning mechanism that relies on the spatial continuity found
in natural visual scenes. Objects typically transform in a continuous fashion: an object
shifting along a trajectory will start at the beginning and then travel through all intermediate
positions in a sequential order until reaching the target position. The same is true for other
types of transformation, like rotation or changes in size. By means of continuous updating of
the feed-forward connection weights between network layers, with CT learning, similar views
of an object get continually associated with, and mapped onto the same postsynaptic neurons
(hence, the name continuous transform learning). Over several layers of a feature hierarchy
this can lead to the development of neurons in the output layer that fire specifically in
response to a certain object, regardless of its state of transformation.
The main concept of CT learning is illustrated in Figure 2. Here we show a simplified 2-layer
network with an input (lower row of circles) and an output (upper row of circles) layer of 5
neurons each. If a moving bar is presented to the network the corresponding input units get
activated (solid circles) and by means of lateral inhibition in the output layer this activity will
converge onto a small number of neurons. At stimulus position 1 the three input neurons on
the left get activated, and therefore their connections to the active post-synaptic cell will get
strengthened (thin lines).
Figure 2. Illustration of the CT effect. A bar moves continuously across the retina and excites
corresponding input neurons. At stimulus position 1, input neurons 1-3 on the left fire and
drive cell 3 in the output layer. An associative learning rule strengthens the afferent
connections of neuron 3 from the 3 pre-synaptic neurons. If the bar moves to position 2, the
cells that were already active at position 1 drive the same post-synaptic cell as at position 1.
In addition, the newly activated cell is also mapped onto the same output neuron.
The bar then continuously moves across the retina and at stimulus position 2 has shifted by
the equivalent of one neuron to the right. The key point is that, in this position, 2 out of the 3
neurons that were active at position 1 are still firing. These 2 cells had their efferent
connections to output cell 3 strengthened when the bar was in position 1 and will therefore
drive the same cell when the bar is in position 2. The weights get updated again, and the
connections between the active post-synaptic cells and the pre-synaptic cells get strengthened
(the thickness of the lines indicates the relative weight of connection between cells). It is
obvious how a repetition of the above process will lead to more and more cells being mapped
onto one output cell, which therefore will exhibit invariant properties even without a trace
term.
In general, the CT effect relies on the continuous transformation of the training stimuli,
expressed by the similarity of input vectors between transforms. In order to develop
invariance, a full set of overlapping transformations will have to be shown for each object.
Different associative learning rules could be used to achieve the CT effect, the simplest
example being a standard Hebb rule (Equation 4),
Equation 4
\[ \delta w_{ij} = \alpha y_i y_j \]
where $\delta w_{ij}$ denotes the change in synaptic weight after activity patterns have been calculated,
and $y_j$ and $y_i$ are the firing rates of presynaptic neuron $j$ and post-synaptic neuron $i$,
respectively. The learning rate is the constant $\alpha$, and the weight vector $w$ of each neuron is
normalised at the end of every training step, which is a condition of competitive learning
(Hertz et al. 1991).
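The role of overlap between successive views can be illustrated with a toy version of the moving-bar example. This is a hedged sketch under several assumptions that are not taken from the simulations reported here: 12 inputs and 5 output neurons, a bar 3 inputs wide, hand-picked nearly uniform initial weights instead of random ones, winner-take-all competition instead of graded competition, and a learning rate of 1. All names (`sweep`, `initial_weights`, `normalise`) are hypothetical.

```python
import math

N_IN, N_OUT, BAR = 12, 5, 3


def normalise(v):
    """Scale a weight vector to unit length."""
    norm = math.sqrt(sum(w * w for w in v))
    return [w / norm for w in v]


def initial_weights():
    """Nearly uniform, hand-picked (not random) weights, one row per output."""
    return [normalise([1.0 + 0.01 * ((i + j) % 5) for j in range(N_IN)])
            for i in range(N_OUT)]


def sweep(step, alpha=1.0):
    """Move the bar across the inputs; return the winning output per position."""
    weights = initial_weights()
    winners = []
    for start in range(0, N_IN - BAR + 1, step):
        x = [1.0 if start <= j < start + BAR else 0.0 for j in range(N_IN)]
        # Activation is the weighted sum of inputs; the largest one wins.
        h = [sum(w * xj for w, xj in zip(row, x)) for row in weights]
        win = h.index(max(h))
        winners.append(win)
        # Hebb rule (Equation 4) with firing rate 1 for the winner only,
        # followed by normalisation of its weight vector.
        weights[win] = normalise(
            [w + alpha * xj for w, xj in zip(weights[win], x)])
    return winners


print(sweep(step=1))  # successive bar positions share 2 of 3 inputs
print(sweep(step=3))  # successive bar positions share no inputs
```

With `step=1` the neuron that wins at the first position should keep winning as the bar moves, because two of its three freshly strengthened inputs are still active. With `step=3` there is no overlap, so the first winner loses its advantage and other neurons take over.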
In this paper we present five experiments that demonstrate the characteristics of the CT effect
using a simple version of the Hebb learning rule to train a neural network to distinguish two
3D stimuli from different views. Note that rotation invariance is a problem isomorphic
to translation and size invariance.
All simulations were run in VisNet2 (Rolls and Milward 2000), a four-layer feed forward
hierarchical network with competition within layers that models the ventral stream of the
primate visual system (see Methods for details). This is essentially a scaled up version of the
model used to illustrate the CT effect in Figure 2.
Methods
The Model
All experiments presented here were conducted using the newest version of VisNet
(VisNet2), which is a model of the primate ventral stream consisting of four competitive
network layers that are set up hierarchically with initially random feed-forward connections
between the layers. However, CT learning is by no means constrained to this particular model
and could be implemented in many different architectures that exhibit the fundamental
properties of convergence and competition.
Each layer corresponds to one particular area of the primate visual system in terms of its
receptive field sizes (Figure 3), as identified by a number of different researchers (e.g., Rolls
1992; Van Essen et al. 1992).
Layer 1 corresponds to V2, layer 2 to V4, layer 3 to posterior inferior temporal cortex, and
layer 4 to anterior inferior temporal cortex.
Connection probabilities are taken from a Gaussian distribution and set so that each cell in
layers 2-4 will receive 67% of its feed-forward connections from a topographically
corresponding region in the preceding layer. Note that not every neuron in one layer is
connected to every neuron in the layer above. The exact connectivity patterns were
determined randomly, but then kept constant for all simulations in order to provide a means of
comparing network performance with different learning rules. Network dimensions are
provided in Table 1.
Figure 3. VisNet architecture and receptive field sizes in the primate visual system. A
four-layer hierarchical network with connectivity converging onto topographically corresponding
regions in higher layers. Receptive field sizes are set to mirror those found in the macaque
monkey (adapted from Rolls and Deco 2002).
           Dimensions        No. of connections   Radius
Layer 4    32 x 32           100                  12
Layer 3    32 x 32           100                  6
Layer 2    32 x 32           100                  6
Layer 1    32 x 32           272                  6
Retina     128 x 128 x 32    -                    -
Table 1. Network dimensions.
Before presentation, images of the stimuli were run through filters that mirror the tuning
profiles of cells found in V1 (e.g., Hubel and Wiesel 1962). The input filters were computed
by weighting the difference of two Gaussians by a third orthogonal Gaussian according to
Equation 5:
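Equation 5
\[ \Gamma_{xy}(\rho, \theta, f) = \rho \left[ e^{-\left( \frac{x \cos\theta + y \sin\theta}{\sqrt{2}/f} \right)^2} - \frac{1}{1.6} e^{-\left( \frac{x \cos\theta + y \sin\theta}{1.6\sqrt{2}/f} \right)^2} \right] e^{-\left( \frac{x \sin\theta - y \cos\theta}{3\sqrt{2}/f} \right)^2} \]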
Here, $f$ is the spatial frequency of the filter, $\theta$ is the filter orientation, and $\rho$ is the sign of the
filter. Filters exist for spatial frequencies of 0.0625 to 0.5 cycles/pixel, orientations of 0° to
135° in 45° steps, and sign ±1. The number of filters for each spatial frequency is given in
Table 2.
Frequency (cycles/pixel)   0.5    0.25   0.125   0.0625
No. of connections         201    50     13      8
Table 2. Layer 1 connectivity.
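As a rough illustration of Equation 5, the filter can be evaluated point by point. This is a sketch under assumptions: the function name `visnet_filter`, the variable names and the degree-based orientation argument are hypothetical, and the sampling grid and any normalisation of the filters are not specified here.

```python
import math

FREQUENCIES = [0.5, 0.25, 0.125, 0.0625]   # cycles/pixel
ORIENTATIONS_DEG = [0, 45, 90, 135]
SIGNS = [1, -1]


def visnet_filter(x, y, rho, theta_deg, f):
    """Evaluate Equation 5 at pixel offset (x, y) from the filter centre."""
    theta = math.radians(theta_deg)
    along = x * math.cos(theta) + y * math.sin(theta)
    across = x * math.sin(theta) - y * math.cos(theta)
    s = math.sqrt(2.0) / f
    # Difference of two Gaussians along one axis ...
    centre = math.exp(-(along / s) ** 2)
    surround = math.exp(-(along / (1.6 * s)) ** 2) / 1.6
    # ... weighted by a third, broader Gaussian along the orthogonal axis.
    envelope = math.exp(-(across / (3.0 * s)) ** 2)
    return rho * (centre - surround) * envelope


# Every combination of spatial frequency, orientation and sign.
FILTER_BANK = [(f, theta, rho) for f in FREQUENCIES
               for theta in ORIENTATIONS_DEG for rho in SIGNS]
print(len(FILTER_BANK))  # 32
```

The 4 frequencies, 4 orientations and 2 signs give 32 combinations, which agrees with the depth of 32 listed for the retina in Table 1.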
The activation hi of each neuron i was calculated by summing the inputs yj from all afferent
neurons j weighted by the synaptic weights wij:
Equation 6
\[ h_i = \sum_j w_{ij} y_j \]
Within each layer there was graded competition between neurons, which was implemented in
two stages. Note that this mechanism is not winner-take-all, so more than one post-synaptic
neuron can fire.
First, to achieve lateral inhibition, the activation of all neurons within a layer was convolved
with a spatial filter $I$ (Equation 7). Here, $\delta$ controls the contrast, $\sigma$ specifies the width, and $a$ and $b$
index the distance from the centre of the filter (see Table 3 for typical values).
           Radius, $\sigma$   Contrast, $\delta$
Layer 1    1.38               1.5
Layer 2    2.7                1.5
Layer 3    4.0                1.6
Layer 4    1.6                1.4
Table 3. Lateral inhibition parameters.
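Equation 7
\[ I_{a,b} = \begin{cases} -\delta e^{-\frac{a^2 + b^2}{\sigma^2}} & \text{if } a \neq 0 \text{ or } b \neq 0, \\ 1 - \sum_{a \neq 0,\, b \neq 0} I_{a,b} & \text{if } a = 0 \text{ and } b = 0. \end{cases} \]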
Secondly, a sigmoid activation function (Equation 8) was used to enhance contrast within
layers, where $r$ is the firing rate after lateral inhibition (see above), $y$ is the firing rate after
contrast enhancement, $\alpha$ is the sigmoid threshold, and $\beta$ is the slope.
Equation 8
\[ y = f_{\text{sigmoid}}(r) = \frac{1}{1 + e^{-2\beta(r - \alpha)}} \]
The parameter $\alpha$ was adjusted to control the sparseness of the firing rates in each layer. For
example, to set the sparseness to 5% one would have to set $\alpha$ to the value of the 95th
percentile point of the activations within a layer. Table 4 shows parameters for the sigmoid
activation function.
A comprehensive review of the VisNet architecture can be found in Rolls and Deco (2002).
           Percentile   Slope, $\beta$
Layer 1    99.2         190
Layer 2    98           40
Layer 3    88           75
Layer 4    91           26
Table 4. Sigmoid function parameters.
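The two competition stages can be sketched as follows. This is an illustrative sketch, not the VisNet code: the function names, the finite `half_width` of the kernel and the nearest-rank percentile are assumptions made for the example.

```python
import math


def inhibition_kernel(delta, sigma, half_width):
    """Build the filter of Equation 7 on a (2 * half_width + 1)^2 grid."""
    offsets = range(-half_width, half_width + 1)
    kernel = {(a, b): -delta * math.exp(-(a * a + b * b) / sigma ** 2)
              for a in offsets for b in offsets if (a, b) != (0, 0)}
    # The centre is chosen so that the whole kernel sums to 1.
    kernel[(0, 0)] = 1.0 - sum(kernel.values())
    return kernel


def sigmoid(r, alpha, beta):
    """Equation 8, written to avoid overflow for steep slopes."""
    z = 2.0 * beta * (r - alpha)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def percentile_threshold(activations, percentile):
    """Choose alpha as the given percentile (nearest rank) of activations."""
    ordered = sorted(activations)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


kernel = inhibition_kernel(delta=1.5, sigma=1.38, half_width=3)  # layer 1 values
alpha = percentile_threshold([0.1 * k for k in range(100)], 95)
print(kernel[(0, 0)], alpha, sigmoid(alpha, alpha, beta=190.0))
```

Setting the threshold at the 95th percentile means that only about 5% of the activations lie above it, which is the sparseness example given in the text.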
Stimuli
The stimuli used to train the networks were images of continuously rotating 3D objects. They
were created using the OpenGL API, which gives a maximum of control over all stimulus
parameters. In this way it was possible to fine-tune the amount by which each stimulus was
rotated between views of the objects. OpenGL builds a 3D representation of the objects and is
able to transform these with respect to the viewer. The transformed object is then projected
onto a 2-dimensional image of the current view. Lighting was mainly ambient with a diffuse
light source added for image realism.
The two objects used in the current experiments were a cube and tetrahedron matched for
size, which were rotated in front of a blank background around their vertical midline axis over
a range of 180° (after which the transforms repeat themselves, due to the inherent symmetry of the
objects). An example of a rotation with a step size of 18° is given in Figure 4.
Figure 4. Example of 3D stimuli rotated over 180° with a step size of 18°. Stimuli were
created using OpenGL computer graphics and could be set to turn around the vertical axis
with any angle of rotation between views.
Training and test procedure
At the start of each simulation the synaptic weights of the connections between layers were
randomised. Different simulations could therefore differ in their initial weights.
Networks were trained layer by layer, using the following procedure: (1) the stimulus picture
was filtered by means of Equation 5, then (2) the activity of individual neurons was
calculated. Lastly, (3) the synaptic weights were changed by means of the particular learning
rule used.
One training epoch consisted of presentation of all transforms of both objects and training
was repeated for 50 epochs in layer 1, 100 epochs in layers 2 and 3, and 75 epochs in layer 4.
The synaptic weight update was calculated using either the enhanced asymmetric trace rule
(Equation 3 and Equation 2) for trace learning, or the simple associative Hebbian learning
rule given in Equation 4 for demonstrations of the CT effect. The learning rate $\alpha$ was set to
25.0, 6.7, 5.0, and 4.0 for layers 1-4 respectively. For the simulations using trace learning the
trace value $\eta$ was set to 0.8. These parameters were shown to be effective in previous VisNet
simulations (Rolls and Milward 2000; Stringer and Rolls 2000; Stringer and Rolls 2002).
After training, the network’s performance was tested by once more presenting all
transformations of both objects and monitoring the response patterns in the output layer.
Three different measures of performance were available. On an individual cell level it is
possible to calculate the invariance developed by each neuron by looking at response profiles
for all views of the full 180° rotation. In this way it is possible to quantify the amount of
information coded by each cell, the maximum being 1 bit as calculated with Equation 9.
Equation 9
\[ \text{Maximum cell information} = \log_2(\text{number of stimuli}) \]
A neuron will therefore carry maximum information if it invariantly fires for presentations of
one object and never for the other object, regardless of the view shown.
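The 1 bit ceiling for two stimuli can be checked with a small sketch. The single-cell measure below is an assumption made for illustration (a stimulus-specific information for a cell with binary responses and equiprobable stimuli); it is not necessarily the exact measure used in the simulations, and the function names are hypothetical.

```python
import math


def max_cell_information(n_stimuli):
    """Equation 9: upper bound in bits for n equiprobable stimuli."""
    return math.log2(n_stimuli)


def stimulus_information(responses, stimulus):
    """Bits a binary-response cell carries about one stimulus.

    `responses` maps each stimulus name to a list of 0/1 responses,
    one per view. Stimuli are assumed to be equiprobable.
    """
    n = len(responses)
    # Overall probability of each response value across all stimuli.
    p_r = {r: sum(v.count(r) / len(v) for v in responses.values()) / n
           for r in (0, 1)}
    views = responses[stimulus]
    info = 0.0
    for r in (0, 1):
        p_r_given_s = views.count(r) / len(views)
        if p_r_given_s > 0:
            info += p_r_given_s * math.log2(p_r_given_s / p_r[r])
    return info


# A cell that fires for every view of the cube and never for the tetrahedron.
perfect = {"cube": [1] * 180, "tetrahedron": [0] * 180}
print(max_cell_information(2), stimulus_information(perfect, "cube"))
```

Under these assumptions a perfectly invariant cell reaches the 1 bit maximum, while a cell that responds to only part of the rotation range of one object carries less than 1 bit.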
In order to investigate to what extent a network has developed invariant cell response
properties in the output layer one can then order all output neurons by their informational
value in a single cell information plot.
If a number of cells exhibit maximum invariance this can be interpreted as evidence that the
network has mastered invariant object recognition. However, in order to make sure that the
network has developed invariance for both objects (cube and tetrahedron) it is necessary to
analyse the information contained across a population of output cells. The multiple cell
information plot shows how much information is available over the 5 neurons carrying the
most information about each object, and ideally should asymptote to 1, if the network is to
have developed invariance. The algorithm that was used for calculating information in a
population code is outlined in detail in Rolls, Treves, and Tovee (1997).
Experiment 1: Single-cell demonstration of CT learning
The aim of the first experiment was to provide a first demonstration of the CT effect in the
VisNet architecture. The network was trained on images showing the cube and tetrahedron
rotating through 180° with step sizes of 1° between individual views. For the CT effect to
occur it is crucial that there is high similarity between the input vectors of successive transforms, that
is, the angle by which an object is rotated between views has to be sufficiently small.
Synaptic weights were updated using a standard Hebb rule (Equation 4).
Figure 5. Demonstration of the CT effect in typical cells of the output layer. (A) shows a
typical cell in layer 4 responding to the cube invariantly of the view shown. While no cells
were found that responded to the tetrahedron in a completely invariant fashion (B), many
cells tended to respond invariantly to the same intervals of rotation (≈65°-140°, and ≈0°-65°
and 160°-180°). These intervals were found to correspond to a novel surface of the
tetrahedron coming into view (C) at the same point in time as tuning switches
between neurons.
In each training epoch, all views of an object were shown in sequence, followed by all
transforms of the other object.
Single cell response profiles of typical neurons in layer 4 after training are shown in Figure 5.
The neuron with coordinates (10,10) has developed perfect rotation invariance for the cube
(Figure 5A). It has learned to respond whenever the cube is shown and never responds to the
tetrahedron. Similarly, neurons (22,13) and (14,17) fire specifically in presence of the
tetrahedron (Figure 5B), but for different intervals of rotation. Neuron (22,13) covers views
rotated by 65° to 140°. In turn, cell (14,17) has learned to respond to the complementary
intervals from 0° to 65° and 160° to 180°.
Even though we were unable to find a single cell with maximum information for the
tetrahedron, many cells with similar response profiles like the ones shown here were present,
which might indicate that cells get tuned to distinct subregions of the tetrahedron. This is
illustrated in Figure 5C, where a view of the tetrahedron rotated by 65° is shown,
corresponding to the angle at which neuron (14,17) ceases to fire and neuron (22,13) begins
firing. Incidentally, at this point of the rotation the surface highlighted by the arrow has just
come into view (see also Figure 4 for images of the objects at different rotations), that is, a
catastrophic change in the image occurs. This surface remains in view until just after the
viewing angle reaches 115°.
Experiment 2: Comparison CT learning and Trace learning
In Experiment 1, it was demonstrated that the CT effect is present at the single cell level when
objects were rotated during training with 1° difference between successive views. In order to
further investigate the effect on a network level, in experiment 2, CT learning (using a Hebb
rule, Equation 4) was contrasted with trace learning (Equation 3) by looking at the single cell
and multiple cell information plots comparing different step sizes between successive
transforms.
We discussed earlier how the CT effect relies on a significant overlap between activity
patterns of neighbouring views in a continuous rotation. The exact amount of overlap
necessary will depend on the specific simulation conditions (i.e. model architecture, learning
rate, stimuli). If views are presented in order, over adjacent time steps the same neuron in
layer 2 will be driven by input neurons that are active for both time steps. As the difference
between successive transforms increases CT learning will cease to work as input neurons will
start to excite new 2nd layer neurons rather than keeping the same 2nd layer neurons active and
mapping invariant representations onto higher layer cells.
Trace learning, on the other hand, does not require spatial continuity between transforms and
will therefore work fine with large step sizes between views. However, performance is
predicted to deteriorate with trace learning, if the network is overloaded with too many
transforms (Wallis and Rolls 1997).
The performance of three networks after training was compared (Figure 6): one using a standard
Hebb rule, one using the trace rule, and one control network that was not trained at all. The
same stimulus images of cube and tetrahedron rotating through 180° were used as in
Experiment 1, but at step sizes of 1°, 2°, 9°, and 36° between views.
Figure 6. Comparison of CT learning with trace learning at different angles of rotation
between transforms. Results from single cell information measures (left column) and multiple cell
information measures (right column) are presented for CT learning with a standard Hebb rule
(solid line), trace learning (dashed line), and an untrained control network with random
weights (dotted line).
The finding that the CT effect is present for the cube and tetrahedron when the step size
between views is 1° is replicated in this experiment: the single cell plot on the left reveals
good performance with the Hebb learning rule at 1°. A fair number of cells carry the maximum
information of 1 bit. Interestingly, trace learning achieves good performance at this step size
as well.
An explanation for this is that the CT effect will drive trace learning when there is enough
overlap in input activity between successive transforms, even under conditions in which trace
learning would usually be compromised. The mechanism by which the CT effect takes over
trace learning is based on post-synaptic neurons being kept active over successive transforms,
which strengthens the connection weights of input cells onto one invariant 2nd layer neuron.
Note that this is completely independent of the trace term $\bar{y}^{\tau}$, as long as views are
presented separately for both objects. The multiple cell plots on the right mirror the single
cell results.
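The following toy sketch illustrates this argument. It is not the network used in the experiments: it assumes a single hypothetical neuron with four binary inputs, a fixed firing threshold `theta`, and no weight normalisation or competition. The trace is written in one commonly used form, trace = (1 - eta) * y + eta * previous trace, which may differ in detail from the rule used in the simulations. All names and parameter values are illustrative.

```python
# Three successive views with overlapping inputs, and two views with none.
OVERLAPPING_VIEWS = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
DISJOINT_VIEWS = [[1, 1, 0, 0], [0, 0, 1, 1]]

def activation(weights, x):
    return sum(w * xj for w, xj in zip(weights, x))

def train(views, rule, alpha=0.1, eta=0.8, theta=0.01):
    """Present the views once, in order, to one hypothetical neuron.

    rule="hebb":  dw_j = alpha * y * x_j
    rule="trace": dw_j = alpha * trace * x_j,
                  trace = (1 - eta) * y + eta * previous trace
    alpha, eta and theta are illustrative values, not those of the model.
    """
    weights = [0.5, 0.5, 0.0, 0.0]  # neuron initially tuned to the first view
    trace = 0.0
    for x in views:
        y = 1.0 if activation(weights, x) > theta else 0.0
        if rule == "hebb":
            post = y
        elif rule == "trace":
            trace = (1.0 - eta) * y + eta * trace
            post = trace
        else:
            raise ValueError(f"unknown rule: {rule}")
        weights = [w + alpha * post * xj for w, xj in zip(weights, x)]
    return weights

def responds_to_all(weights, views, theta=0.01):
    """True if the neuron fires above threshold for every view."""
    return all(activation(weights, x) > theta for x in views)

results = {}
for name, views in (("overlapping", OVERLAPPING_VIEWS), ("disjoint", DISJOINT_VIEWS)):
    for rule in ("hebb", "trace"):
        results[(name, rule)] = responds_to_all(train(views, rule), views)
        print(name, rule, results[(name, rule)])
```

In this toy setting the Hebb rule alone links the overlapping views, because the overlap keeps the neuron active from one view to the next; this is the sense in which the effect does not depend on the trace term. Only the trace rule links the two disjoint views.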
When the step size between transforms is increased to 2°, one is confronted with a drastically
different picture. The CT effect has broken down, and so has trace learning. Even a small
increment in step size has reduced the overlap of firing to the extent that there are no
completely invariant neurons developing as a result of the CT effect. Trace learning without
the help of the CT effect breaks down under the load of 90 transforms per object. This supports
the explanation that trace learning showed good performance at 1° only because it relied on the
CT effect. The multiple cell plots reveal a better population performance for trace than for
Hebb, but without the development of invariant cells one cannot claim that the network has
mastered the task of rotation invariant object recognition.
As the step sizes between transforms increase from 9° to 36° the Hebb rule’s performance
remains poor. The CT effect does not work with large differences between successive
transforms. The trace rule, on the other hand, becomes more successful as the total number of
transforms for each object decreases to 5 for 36° step sizes. These results are again
substantiated by the performance shown in the multiple cell plots on the right.
The overall picture that emerges from these graphs is that the CT effect can drive
performance at small step sizes for Hebb rule as well as trace rule learning. With larger
rotations between views the CT effect breaks down, as fewer and fewer input neurons are
commonly activated between successive transforms and the transforms therefore no longer map
onto a single post-synaptic neuron by keeping it active over time. The trace learning rule, on
the other hand, improves performance with large step sizes and few transforms per object due
to genuine trace learning dynamics.
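The load on the trace rule at each step size follows directly from the 180° rotation range. A minimal sketch of this arithmetic, assuming views start at 0° and the end point of the range is excluded:

```python
ROTATION_RANGE_DEG = 180
STEP_SIZES_DEG = (1, 2, 9, 36)

def views_for_step(step_deg, total_deg=ROTATION_RANGE_DEG):
    """Rotation angles of the views shown when sampling every step_deg degrees.

    Assumption: sampling starts at 0 degrees and excludes total_deg itself.
    """
    return list(range(0, total_deg, step_deg))

transforms_per_object = {step: len(views_for_step(step)) for step in STEP_SIZES_DEG}
print(transforms_per_object)  # {1: 180, 2: 90, 9: 20, 36: 5}
```

The values for 2° and 36° match the 90 and 5 transforms per object quoted above.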
Experiment 3: CT learning and trace learning with interleaved stimuli
Trace learning relies on temporal continuity between transforms of the same object in the
visual environment. However, one could imagine a realistic scenario in which alternating
views of two stimuli are projected onto the retina. When scanning a visual scene, saccadic
eye-movements do not necessarily scan objects individually but might jump from prominent
features in one object to prominent features of a second object. Under such conditions trace
learning will not achieve invariant object recognition, because the features of the two
objects become associated with each other.
Figure 7. Comparison between CT learning and trace learning with interleaved stimuli.
The network was trained on interleaved views of the cube and tetrahedron over a rotation of
180° (e.g., cube1, tetrahedron1, cube2, tetrahedron2, …). Between views objects were rotated
by 1°.
CT learning, on the other hand, relies on spatial regularities in the visual environment and will
be able to dissociate two objects by virtue of differences in local features. The network can
therefore develop invariant cells to two different stimuli with a Hebbian associative learning
rule even when these stimuli are presented in an interleaved fashion. Trace learning fails
under these conditions.
In Experiment 3 the cube and the tetrahedron were again rotated over 180° with 1° rotations
between individual views, but in contrast to the previous experiments the objects were not
shown individually but interleaved (the first view of the cube was followed by the first view
of the tetrahedron, followed by the second view of the cube, again followed by the second
view of the tetrahedron, and so on).
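The presentation order described above can be written down as a short sketch. The view labels and helper names are hypothetical and serve only to make the ordering explicit.

```python
def view_labels(name, n_views):
    """Labels for successive views of one object, e.g. cube1, cube2, ..."""
    return [f"{name}{i}" for i in range(1, n_views + 1)]

def interleave(seq_a, seq_b):
    """Alternate the items of two equally long sequences: a1, b1, a2, b2, ..."""
    order = []
    for a, b in zip(seq_a, seq_b):
        order.extend([a, b])
    return order

# 180 views per object at 1 degree spacing over a 180 degree rotation.
cube_views = view_labels("cube", 180)
tetrahedron_views = view_labels("tetrahedron", 180)
sequence = interleave(cube_views, tetrahedron_views)
print(sequence[:4])  # ['cube1', 'tetrahedron1', 'cube2', 'tetrahedron2']
```

Note that in this sequence no two consecutive time steps show the same object, so temporal continuity within an object is never available.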
Numerical results are shown in Figure 7. CT learning by means of the Hebb rule has enabled
a number of 4th layer neurons to reach a maximal performance of 1 bit, which is expressed in
good single and multiple cell information measures. The difference in features between the
two stimuli has been sufficient for the CT effect to dissociate the two objects on a
view-by-view basis, even though the views alternate between the two objects. Furthermore, the
activity patterns of pre-synaptic neurons driven by successive transforms of either object are
similar enough to reactivate the same post-synaptic neurons on successive views of one object,
even if temporal continuity has been disturbed by the presentation of a view of the other
object. In contrast, trace learning does not work without temporal continuity: cube and
tetrahedron become associated due to the temporal proximity of views of alternating objects in
this presentation paradigm. The performance of the trace learning rule is not significantly
different from the random control condition in the single cell information plots. In addition,
the multiple cell information measure does not reach a value of 1 bit.
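Why the trace rule associates the two objects under interleaved presentation can be seen in a small sketch. It assumes a hypothetical neuron that fires only for cube views and a trace of the commonly used form trace = (1 - eta) * y + eta * previous trace; the value of `eta` is illustrative and not taken from the simulations.

```python
def trace_values(firing, eta=0.8):
    """Running trace of a neuron's firing over successive time steps."""
    trace, values = 0.0, []
    for y in firing:
        trace = (1.0 - eta) * y + eta * trace
        values.append(trace)
    return values

# Interleaved presentation: cube, tetrahedron, cube, tetrahedron, ...
labels = ["cube", "tetrahedron"] * 10
# Hypothetical neuron that is active for cube views only.
firing = [1.0 if label == "cube" else 0.0 for label in labels]
traces = trace_values(firing)

# Post-synaptic term available for learning while the tetrahedron is shown.
hebb_drive_tetrahedron = sum(y for label, y in zip(labels, firing)
                             if label == "tetrahedron")
trace_drive_tetrahedron = sum(t for label, t in zip(labels, traces)
                              if label == "tetrahedron")
print(hebb_drive_tetrahedron, round(trace_drive_tetrahedron, 3))
```

Under the Hebb rule the post-synaptic term is zero whenever the tetrahedron is shown, so its inputs are not strengthened onto the cube neuron. Under the trace rule the term at each tetrahedron step is `eta` times its value at the preceding cube step, so tetrahedron inputs are strengthened onto the same neuron.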
Experiment 4: CT learning with randomised views of objects
Having shown that CT learning is superior to trace learning when temporal continuity breaks
down (Experiment 3), Experiments 4 and 5 were meant to focus on exploring the limits of CT
learning and how some of its weaknesses might be overcome.
An agent in real world conditions cannot expect to be shown a full continuous rotation of an
object at a particular point in time. One might encounter an object rotating through a specific
angle on one occasion and only later encounter the same object rotating through a different
angle.
So far, the CT effect was observed to occur when objects were presented in a continuous
motion over 180°. In Experiment 3 it was shown that the effect does not necessarily rely on
views of the two objects being presented in a temporally continuous fashion. In continuation
of these earlier results, in Experiment 4, we show that in principle CT learning does not even
necessarily rely on transforms being presented in order as long as the motion is spatially
continuous (i.e., the views shown cover the whole rotation). Views of the rotating cube and
tetrahedron were split up into blocks of 30° and the network was then trained on these blocks
presented in a random order.
Results are shown in Figure 8. In the present simulations the learning rate was tuned from its
original value to 0.1 in all layers, in order to improve performance. As can be seen in the
single cell and multiple cell information plots, with these parameters, CT learning can
successfully be demonstrated using a standard Hebb rule. To explain this, it is important to
note the number of training epochs (50, 100, 100, and 75 for layers 1-4 respectively). Since
blocks are chosen at random in this presentation paradigm, the high number of epochs will
invariably lead to the presentation of adjacent blocks in order. That is, while on most
presentation trials the blocks presented will not be continuous (e.g., the blocks containing
the rotation intervals of 0°-30° and 60°-90°), a single presentation of two blocks in order
(0°-30° and 30°-60°) will lead to a remapping of previously learned representations of two
spatially adjacent blocks onto the same neurons in the 2nd layer. Therefore, as long as all
blocks are shown in order at least once, invariance can develop by means of CT learning.
These results demonstrate that, at least in principle, CT learning can still occur with a
randomised block ordering of stimulus views during training.
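How likely it is that adjacent blocks appear in order at least once can be estimated with a short sketch. The exact sampling scheme is not specified above, so the code assumes that each epoch presents the 6 blocks in an independent random permutation, and it counts only forward order (block i immediately followed by block i+1). Function names are hypothetical.

```python
import itertools
import random

N_BLOCKS = 6  # 180 degrees split into blocks of 30 degrees

def ordered_pairs_in(order):
    """Adjacent block pairs (i, i+1) presented consecutively and in order."""
    return {(a, b) for a, b in zip(order, order[1:]) if b == a + 1}

def prob_pair_in_one_epoch(n_blocks=N_BLOCKS):
    """Exact probability that blocks 0 and 1 appear in order in one epoch.

    Assumption: one epoch is a uniformly random permutation of the blocks.
    """
    perms = list(itertools.permutations(range(n_blocks)))
    hits = sum(1 for p in perms if (0, 1) in ordered_pairs_in(p))
    return hits / len(perms)

def prob_pair_seen(n_epochs, n_blocks=N_BLOCKS):
    """Probability that a given adjacent pair is seen in order at least once."""
    p = prob_pair_in_one_epoch(n_blocks)
    return 1.0 - (1.0 - p) ** n_epochs

def simulate(n_epochs, seed=0, n_blocks=N_BLOCKS):
    """Adjacent pairs seen in order over n_epochs random permutations."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_epochs):
        order = list(range(n_blocks))
        rng.shuffle(order)
        seen |= ordered_pairs_in(order)
    return seen

for epochs in (50, 75, 100):  # epoch counts used for the four layers
    print(epochs, f"{prob_pair_seen(epochs):.6f}", sorted(simulate(epochs)))
```

Under these assumptions a given adjacent pair appears in order in one epoch with probability 1/6, so with 50 or more epochs it is very likely to have been seen at least once.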
Figure 8. Demonstration of the CT effect with training the network on randomised
views of the stimuli. In this experiment the full set of transformations was divided into 6
blocks of 30° for each object. Then the network was trained on these blocks presented in
random order. The transformations within each block were shown in the usual continuous
fashion.
Experiment 5: Generalisation in higher layers
In the previous experiments, the CT effect has been demonstrated when the full view space of
180° was sampled by many closely spaced views during training. In contrast, the primate
visual system can learn to recognise objects invariantly from only a few canonical views. A
technique that has been used in earlier demonstrations of invariant object recognition using
the trace rule has been to train the network on a full set of stimuli in layers 1 and 2 (Stringer
and Rolls 2002). This was done in order to build a limited amount of invariance to low-level
stimulus features into the early layers of the network. It has been shown that the higher layers
can then generalise to respond invariantly to novel stimulus transforms after training on only
a small set of canonical views.
In Experiment 5, we investigated whether the same mechanism can be applied to CT learning
using a standard Hebb rule. The first 2 layers were trained on a full set of 180 transforms
with 1° rotations between transforms. Next, in a separate training session, layers 3 and 4
were trained on only 5 views with 36° rotation between transforms. The network was then tested
on a full set of stimuli of 180 views spaced by 1°.
Numerical results are shown in Figure 9. Even though layers 3 and 4 have not been exposed
to 97% of the views they were tested on, the network achieves good invariance in both single
cell and multiple cell information plots compared to the untrained network condition.
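The 97% figure follows from the view counts. A minimal sketch of the arithmetic, assuming the 5 training views for layers 3 and 4 start at 0° and are spaced by 36°:

```python
# Views used to test the network: 180 views spaced by 1 degree.
test_views = list(range(0, 180, 1))
# Views used to train layers 3 and 4 (assumed to start at 0 degrees).
trained_views = list(range(0, 180, 36))

unseen_views = [v for v in test_views if v not in trained_views]
fraction_unseen = len(unseen_views) / len(test_views)
print(f"{len(trained_views)} trained views, "
      f"{fraction_unseen:.1%} of test views unseen")  # 5 trained views, 97.2% ...
```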
This result can be attributed to the fact that training in the first 2 layers on a full set of stimuli
has enabled the network to create neurons in layer 2 that generalise within part of the rotation
space of an object, leaving layer 3 and 4 neurons to associate together the different subregions
of the rotation space.
Figure 9. Demonstration of generalisation in the higher layers after building invariance
in the first two layers.
Neurons in the lower layers, which develop invariant responses to simple features during
early visual experience, can help higher layers of the network to generalise to novel stimulus
views. In this simulation the first two layers of the network were trained on a full set of 180
transformations for each object. The higher layers could then be trained on sparser views of 5
transformations with 36° difference in rotation and still develop invariant neurons.
In this way it is possible that early visual experience that has built limited feature invariance
into the early layers helps CT learning to free itself from the constraint of having to be trained
on full sets of closely spaced transforms.
Discussion
In this paper we presented a novel unsupervised learning mechanism, which allows networks
to develop transformation invariance. The key condition for the CT effect to work is that there
has to be significant overlap between firing patterns of successive views of an object. These
views do not need to be shown in order (Experiment 4) and can be interleaved with views
from other objects, provided the input activity vectors of views of the other object are not too
similar to any of the views of the original object (Experiment 3). If conditions allow,
successive views of an object are then mapped onto the same higher layer neuron, and over
successive processing stages neurons can develop in layer 4 that respond specifically to one
object, regardless of the view shown.
In Experiment 5, it was shown that by building feature invariance into the early layers of the
network, the higher layers can develop invariance to all views of an object, even if they are
only trained on a few canonical views of the objects. This early feature invariance is enough
to consistently drive the same invariant cells in the upper layers even if these have not been
trained on all stimulus views. This last experiment shows that once feature invariance has
been built up, invariant object recognition with CT learning does not rely on the continuous
transformation of objects under which the CT effect had first been demonstrated (in
Experiments 1 and 2).
CT learning exhibits a number of advantages over trace learning.
(1) CT learning does not require the existence of a memory trace but can be demonstrated
using a standard Hebb rule (Equation 4). At the same time the CT effect does not rely on the
standard Hebb rule but could also be implemented with other associative learning rules (for
example a trace rule, as shown in Experiment 2). This property makes CT learning very
flexible with regard to novel developments in the neuroscience of vision. These points add to
its biological plausibility.
(2) CT learning performance does not deteriorate with an increasing number of transforms. In
fact, it was shown in some simulations (not presented in this paper) that it is possible to
further increase performance with even smaller step sizes than 1° and even more transforms
shown during training. The trace rule, on the other hand, has been found to be limited in terms
of its capacity to train networks on large numbers of transforms (Wallis and Rolls 1997).
(3) At least in theory, it can be seen how a network applying the CT effect could develop
invariance from just one training epoch – a condition that is of crucial importance if one
wants to generalise findings from artificial neural networks to biological systems.
(4) CT learning takes into account the actual spatial arrangement of the features of objects in
order to recognise them. In contrast, trace learning will learn to associate any two views as
belonging to the same object as long as they occur in temporal proximity.
In addition to demonstrating the CT effect and some of its properties, we contrasted
performance of Hebb trained networks with those of networks trained with the trace rule. An
interesting finding was that with small step sizes between views the CT effect was driving
performance of trace learning under conditions that would typically not allow trace learning
to successfully develop invariance (Experiment 2). When step sizes between views increase,
the CT effect is compromised, as there is not enough overlap between successive views.
Trace learning, on the other hand, shows good performance when the number of
transforms shown decreases.
Models of object recognition will at some point have to live up to a comparison with the
biological visual system. Evidence from Experiments 1 and 2 suggests that CT learning and
trace learning could effectively be combined in the training of networks to develop
transformation invariance, especially since the CT effect does not need to be implemented
into the trace learning rule explicitly. In Experiment 1, the failure to find neurons in layer 4
that have developed complete invariance to the tetrahedron suggests that CT learning operates
well across continuous, or quantitative, changes in the view properties, but not across large
discontinuities in the form of catastrophic changes that occur occasionally in rotating objects
(see Koenderink 1990). This is by definition the domain of the trace learning rule, because it
does not require any similarity to associate views of an object together, provided temporal
continuity is preserved. It can be expected that discontinuity in response profiles is a
fundamental property of CT learning with objects more complex than the highly self-similar
cube.
One caveat of proposing a CT aided trace learning mechanism to be responsible for the
development of invariance in the primate visual system is that the trace rule fails to solve the
problem of interleaved stimuli. This becomes a problem especially in the context of human
observers actively scanning visual scenes by means of saccadic eye-movements.
Currently, a general theory of the mechanisms that govern the selection of saccade
targets in natural scenes remains elusive. However, the most likely account is one in which
high- and low-level salient features are combined in a saliency map (Itti and Koch 2001;
Henderson 2003). Low-level salient features do not necessarily occur in only one
object and therefore eye movements will tend to fall on alternating objects, especially in the
early parts of scanning a novel scene. This does not pose a problem for the CT effect, but
trace learning will be compromised by such a breakdown in temporal continuity between
views of objects.
Experiment 5 opened up another point of interest with regard to natural viewing conditions.
For CT learning to work under natural viewing conditions, feature detectors of limited
invariance have to develop in the first 2 layers of the network, so that the last two
layers can be trained on only a few canonical views. In the present experiments this was
achieved by presenting a full set of rotation views of the two objects to the first two layers.
But how could such invariance be set up in a biological system? A first step to solving this
problem would be to demonstrate that training the network on complex scenes including
many different shapes and objects can set up feature invariance. CT learning can provide a
valid account of invariant object recognition in the real brain if the statistics of natural scenes
allow the first layers to set up a similar low level invariance as was done in Experiment 5. The
specific dynamics involved in the process of setting up these feature detectors in response to
training with complex scenes are hard to predict and must therefore be tested in
quantitative simulations. One tentative possibility of how this could work stems from the
finding that infants have problems with pursuit eye-movements (von Hofsten and Rosander
1997): when the eyes of young infants follow a moving target, they often tend to ‘slip’ off
the target. Incidentally, this might aid the process of setting up these early invariant
feature detectors, because it provides a natural source of continuous object motion with
respect to the retina.
In the experiments presented above we have only used two simple objects to train the
network. This was done in order to provide an initial account of CT learning that is as simple
and straightforward as possible. However, a low estimate of the recognition capacity of the
primate visual system is in the order of magnitude of 100,000 different objects (Biederman
1987). Future research will therefore have to face up to the problem of, on the one hand,
increasing the complexity of training stimuli and, on the other hand, increasing the total
number of objects the network has to learn invariantly. One approach to solve this problem
could be to investigate how scaling up the number of nodes in each layer of the network
would impact on its capacity. In the present experiments, VisNet was run with 1024 cells in
each layer, which is several orders of magnitude smaller than the number of neurons one finds
in the primate visual system (Rolls and Deco 2002).
Furthermore, if one combines a larger network with more complex objects (objects with a
more complex basic structure or texture) this might increase the capacity significantly because
the network would have to cope with more complex input vectors but would also be given
more information on which to base the distinction of objects.
Conclusion
Given that early visual areas are set up in topological maps, and that associative learning (i.e.,
long-term potentiation) is a common mechanism in the visual system (Artola and Singer
1993; Singer 1995; Fregnac 1996), a process with the characteristics of the CT effect is likely
to play an important role in the self-organisation of the ventral stream to develop invariant
representations of objects. This happens under the condition that input stimuli transform in
small steps.
Under these architectural constraints the CT effect is not merely an additional mechanism but
an intrinsic and ubiquitous property of the visual system.
References
Artola A, Singer W (1993) Long-term depression of excitatory synaptic transmission and its
relationship to long-term potentiation. Trends Neurosci 16: 480-487
Biederman I (1987) Recognition by components: A theory of human image understanding.
Psychological Review 94: 115-147
Boussaoud D, Desimone R, Ungerleider LG (1991) Visual topography of area TEO in the
macaque. J Comp Neurol 306: 554-575
Desimone R (1991) Face-selective cells in the temporal cortex of monkeys. Journal of
Cognitive Neuroscience 3: 1-8
Field DJ (1994) What is the goal of sensory coding? Neural Computation 6: 559-601
Fregnac Y (1996) Dynamics of functional connectivity in visual cortical networks: an
overview. J Physiol Paris 90: 113-139
Fukushima K (1980) Neocognitron: a self organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biol Cybern 36: 193-202
Földiák P (1991) Learning invariance from transformation sequences. Neural Computation 3:
194-200
Henderson JM (2003) Human gaze control during real-world scene perception. Trends in
Cognitive Sciences 7: 498-504
Hertz J, Krogh A, Palmer RG (1991) Introduction to the Theory of Neural Computation.
Addison-Wesley, Redwood City, CA
Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional
architecture in the cat's visual cortex. J Physiol 160: 106-154
Itti L, Koch C (2001) Computational modelling of visual attention. Nature Reviews
Neuroscience 2: 194-203
Koenderink JJ (1990) Solid Shape. MIT Press, Cambridge, MA
Poggio T, Edelman S (1990) A network that learns to recognize three-dimensional objects.
Nature 343: 263-266
Rolls ET (1992) Neurophysiological mechanisms underlying face processing within and
beyond the temporal cortical visual areas. Philos Trans R Soc Lond B Biol Sci 335:
11-20; discussion 20-11
Rolls ET (2000) Functions of the primate temporal lobe cortical visual areas in invariant
visual object and face recognition. Neuron 27: 205-218
Rolls ET, Deco G (2002) Computational neuroscience of vision. OUP, New York
Rolls ET, Milward T (2000) A model of invariant object recognition in the visual system:
learning rules, activation functions, lateral inhibition and information-based
performance measures. Neural Computation 12: 2547-2572
Rolls ET, Treves A, Tovee MJ (1997) The representational capacity of the distributed
encoding of information provided by populations of neurons in primate temporal
visual cortex. Exp Brain Res 114: 149-162
Singer W (1995) Development and plasticity of cortical processing architectures. Science
270: 758-764
Stringer SM, Rolls ET (2000) Position invariant recognition in the visual system with
cluttered environments. Neural Networks 13: 305-315
Stringer SM, Rolls ET (2002) Invariant object recognition in the visual system with novel
views of 3D objects. Neural Computation 14: 2585-2596
Tanaka K, Saito H, Fukada Y, Moriya M (1991) Coding visual images of objects in the
inferotemporal cortex of the macaque monkey. J Neurophysiol 66: 170-189
Tarr MJ, Bülthoff HH (1995) Is human object recognition better described by geon structural
descriptions or by multiple views? Comment on Biederman and Gerhardstein (1993).
J Exp Psychol Hum Percept Perform 21: 1494-1505
Van Essen DC, Anderson CH, Felleman DJ (1992) Information processing in the primate
visual system: an integrated systems perspective. Science 255: 419-423
von Hofsten C, Rosander K (1997) Development of smooth pursuit tracking in young infants.
Vision Res 37: 1799-1810
Wallis G, Rolls ET (1997) Invariant face and object recognition in the visual system. Prog
Neurobiol 51: 167-194