Letter perception emerges from unsupervised deep learning
and recycling of natural image features
Alberto Testolin 1, Ivilin Stoianov 2,3, Marco Zorzi 1,4 *
1 Department of General Psychology and Padova Neuroscience Center,
University of Padova, Italy
2 Centre National de la Recherche Scientifique, Aix-Marseille Université, France
3 Institute of Cognitive Sciences and Technologies, CNR Padova, Italy
4 IRCCS San Camillo Neurorehabilitation Hospital, Venice-Lido, Italy
*Correspondence concerning this article should be addressed to Marco Zorzi, Department of General
Psychology, University of Padova, Via Venezia 12, Padova 35131, Italy. E-mail: [email protected]
Supplementary Figures
Supplementary Figure 1: The complete set of receptive fields developed in the first hidden layer.
Supplementary Figure 2: The complete set of receptive fields developed in the second hidden layer.
Supplementary Figure 3: Progressive refinement of read-out accuracy following unsupervised
learning on the reduced dataset. Accuracy was computed both on the Arial and Times test patterns
(left panel) and on the full set of test patterns (right panel), which included all fonts.
Supplementary Figure 4: Letter similarity matrix obtained on the model internal representations
(left panel) and by averaging the human similarity judgments of three published studies (right panel).
Ordering of the letters is optimized by hierarchical clustering. Lighter colors indicate higher
similarity: clusters of similar letters are highlighted by the yellow-colored groups along the main
diagonal.
Supplementary Figure 5: (a) The set of Arial letters, followed by noisy versions created with
increasing levels of Gaussian noise (std.dev. = 0.1, 0.4, 0.7, 1.1). (b) The set of Arial letters after
whitening. (c) Pseudoletters produced by rotating uppercase letters using the same procedure
adopted in the study of Chang and colleagues1. Their original set of stimuli is reported in panel (d).
Supplementary Tables
Supplementary Table 1: Pearson correlation coefficients between the empirical confusion matrices and the confusion matrix derived from the model's errors when read-out is applied to layer H1.

Empirical study                  Model correlation
Townsend-1 (1971)                .62
Townsend-2 (1971)                .46
Gilmore et al. (1979)            .26
Loomis (1982)                    .29
Phillips et al. (1983)           .40
van der Heijden et al. (1984)    .58
Average correlation              .45
Supplementary Methods
Natural images and printed letters datasets. We used a published, freely available natural image dataset containing a large number of gray-scale pictures of three subjects: Yosemite park, the Statue of Liberty and the Notre Dame cathedral2. Though it might seem counterintuitive to also treat human-made artifacts as natural scenes, it has been shown that the types of spatial structure present in "wild" environments give rise to statistical visual features similar to those learned from environments shaped by human activity3. Moreover, datasets that include human artifacts might better reflect the everyday visual experience of people living in developed countries. Gray-scale, 40x40 pixel bitmaps of the 26 Latin uppercase letters were created using the getframe MATLAB function. Pixel values ranged between 0.2 (plain background) and 0.8 (maximum signal strength). This made it possible to manipulate the original stimuli by adding Gaussian noise during the simulations, as explained below. A small amount of variability in the size and location of each letter was introduced to make learning more robust: our model is primarily concerned with shape invariance and geometric similarity among input patterns, whereas scale and position invariance could be obtained by including other processing mechanisms such as convolution and max-pooling operations4.
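As an illustration of this rendering step, a minimal MATLAB sketch is given below; the figure geometry, font size and variable names are our own assumptions, and the exact rendering settings used in the simulations may differ:

% Sketch: render one letter as a small gray-scale bitmap via getframe.
% Background and foreground colors match the 0.2/0.8 pixel range above.
fig = figure('Position', [100 100 40 40], 'Color', [0.2 0.2 0.2]);
axes('Position', [0 0 1 1]); axis off;
text(0.5, 0.5, 'A', 'FontName', 'Arial', 'FontSize', 24, ...
    'HorizontalAlignment', 'center', 'Color', [0.8 0.8 0.8]);
frame = getframe(fig);                      % capture the rendered figure
img = double(frame.cdata(:, :, 1)) / 255;   % gray-scale values near 0.2/0.8
close(fig);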
Whitening algorithm. Following previous research5, pre-processing occurring in the retina and LGN was implemented as a 1/f whitening algorithm, which used a filter in the frequency domain designed to flatten the spectrum of natural images. Since the power spectrum of natural images tends to fall as 1/f^2, the amplitude spectrum falls as 1/f. The amplitude spectrum of the whitening filter therefore rose linearly with frequency, to compensate for the 1/f amplitude spectrum of natural images. Moreover, to avoid amplifying high-frequency noise, the filter was multiplied by a two-dimensional Gaussian, thereby obtaining a center-surround type of filter. This filter was applied to the images in the frequency domain. Local contrast normalization was then obtained by dividing the value of each pixel by the standard deviation of the total activity of its neighborhood, using a Gaussian neighborhood with a diameter of 20 pixels. Whitened letter images are shown in Supplementary Fig. 5b.
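A minimal MATLAB sketch of this pipeline is given below; the cutoff frequency, the Gaussian widths and all variable names are our own assumptions rather than the exact values used in the simulations:

% Sketch: 1/f whitening followed by local contrast normalization.
% img: gray-scale image with even dimensions, values in [0, 1].
[rows, cols] = size(img);
[fx, fy] = meshgrid(-cols/2:cols/2-1, -rows/2:rows/2-1);
rho  = sqrt(fx.^2 + fy.^2);            % radial spatial frequency
f0   = 0.4 * min(rows, cols) / 2;      % assumed cutoff of the Gaussian
filt = rho .* exp(-(rho / f0).^2);     % linear ramp times Gaussian low-pass
IMG  = fftshift(fft2(img));            % spectrum, zero frequency centered
imgW = real(ifft2(ifftshift(IMG .* filt)));

% Local contrast normalization: divide each pixel by the standard
% deviation of activity in a Gaussian neighborhood (20-pixel diameter).
r = 10; sigma = r / 2;                 % assumed Gaussian width
[kx, ky] = meshgrid(-r:r, -r:r);
g = exp(-(kx.^2 + ky.^2) / (2 * sigma^2));
g = g / sum(g(:));
localMean = conv2(imgW, g, 'same');
localVar  = conv2(imgW.^2, g, 'same') - localMean.^2;
imgN = imgW ./ sqrt(max(localVar, eps));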
Unsupervised deep learning details. Our unsupervised deep learning model was implemented as a deep belief network6,7 composed of a stack of Restricted Boltzmann Machines (RBMs). The dynamics of each RBM are driven by an energy function E, which determines which configurations of the neurons are more likely to occur by assigning them a probability value:
p(v, h) = e^{-E(v, h)} / Z
where v and h are, respectively, the visible and hidden neurons and Z is a normalizing factor known
as the partition function, which ensures that the values of p constitute a proper probability distribution (i.e., they sum to one). The restricted connectivity of RBMs does not allow intra-layer
connections, resulting in a particularly simple form for the energy function:
E(v, h) = -b^{T} v - c^{T} h - h^{T} W v
where W is the matrix of connection weights and b and c are the biases of visible and hidden
neurons, respectively.
RBMs were trained in a greedy, layer-wise fashion using 1-step contrastive divergence8. This learning
procedure approximately minimizes the Kullback-Leibler divergence between the data distribution and the model
distribution. Accordingly, for each pattern the network performs a data-driven, positive phase (+)
and a model-driven, negative phase (-). In the positive phase all the visible neurons are clamped to
the current pattern, and the activation of hidden neurons is computed as a conditional probability:
P(h | v) = \prod_{j=1}^{n} P(h_j | v)
where n is the total number of hidden neurons, and the activation probabilities for each individual
neuron are given by the logistic function:
P(h_j = 1 | v) = 1 / (1 + e^{-c_j - \sum_{i=1}^{m} w_{ji} v_i})
where m is the total number of visible neurons, c_j is the bias of the hidden neuron h_j and w_ji represents the weight of its connection with each visible neuron v_i. During the negative phase, the
activation of the hidden neurons corresponding to the clamped data pattern is used in an analogous
way to perform top-down inference over the visible neurons (model’s reconstruction), which are in
turn used to update the state of the hidden neurons. Connection weights are then updated by
contrasting the visible-hidden correlations computed on the data vector (v^+, h^+) with the visible-hidden correlations computed on the model's reconstruction (v^-, h^-):

\Delta W = \eta ( h^{+} (v^{+})^{T} - h^{-} (v^{-})^{T} )
where η is the learning rate.
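For concreteness, one CD-1 update can be sketched in MATLAB as follows; variable names are our own assumptions, and the momentum, weight-decay and sparsity terms described below are omitted for brevity:

% Sketch: 1-step contrastive divergence update for a binary RBM.
% v: m-by-k mini-batch (one pattern per column); W: n-by-m weights;
% b: visible biases (m-by-1); c: hidden biases (n-by-1); eta: learning rate.
% (uses implicit expansion, MATLAB R2016b or later)
k  = size(v, 2);
hP = 1 ./ (1 + exp(-(W * v + c)));      % positive phase: P(h = 1 | v)
hS = double(hP > rand(size(hP)));       % sample binary hidden states
vR = 1 ./ (1 + exp(-(W' * hS + b)));    % negative phase: reconstruction
hR = 1 ./ (1 + exp(-(W * vR + c)));     % hidden probabilities on reconstruction
W  = W + eta * (hP * v' - hR * vR') / k;  % contrast the two correlations
b  = b + eta * mean(v - vR, 2);
c  = c + eta * mean(hP - hR, 2);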
For the layer trained with patches of natural images, learning was performed for 200 epochs with
learning rate of 0.03, momentum coefficient of 0.8 and weight decay factor of 0.0001. Patterns were
learned in a mini-batch scheme with 100 examples per batch. For the layer trained on printed
letters, learning was performed for 120 epochs with learning rate of 0.01, momentum coefficient of
0.9 and decay factor of 0.000004. Patterns were learned in a mini-batch scheme of size 91. Learning
in this layer was also weakly constrained by a sparsity factor that forced the network’s internal
representations to rely on a limited number of active hidden neurons. Sparsity was implemented by
driving the probability of a unit being active towards a given low target value, which was set to 0.1 (refs 9-11). The two layers required different learning hyperparameters because the training distributions were different in nature and complexity. Although there exist automatic procedures that try to set the values of some hyperparameters optimally12, we preferred not to employ them, in order to keep the learning algorithm as simple as possible. We also note that some authors have recently
used real-valued RBMs to model natural image patches13,14, resulting in low-level features
comparable to those learned by our model.
Supervised read-out details. A read-out, linear classifier was used to associate data patterns P =
{P1, P2, …, Pn} with desired categories L = {L1, L2, …, Ln} by means of the following linear mapping:
L = W P
where P and L are matrices containing n column vectors that encode, respectively, the patterns Pi and the binary class labels Li, and W is the weight matrix of the linear classifier. If an exact solution to
this linear system does not exist, a least-mean-square approximation can be found by computing the
weight matrix as:
W = L P^{+}
where P^+ is the Moore-Penrose pseudo-inverse15,16. In our implementation, we used the efficient least-squares solution provided by the "backslash" operator in MATLAB.
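A minimal sketch of this read-out in MATLAB, under assumed variable names (P holds one pattern per column, L the corresponding one-hot labels):

% Least-squares read-out via the backslash operator.
W = (P' \ L')';                   % solves L = W * P in the least-squares sense
% equivalently: W = L * pinv(P);  % explicit Moore-Penrose pseudo-inverse
[~, predicted] = max(W * Ptest);  % predicted class index per test pattern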
The drop in performance following input degradation was measured by adding to the test patterns an
increasing amount of zero-mean Gaussian noise with standard deviation ranging from 0.1 up to 1.5,
with a step of 0.1 (samples of letters at different noise levels are reported in Supplementary Fig. 5a).
Noise was always truncated at two standard deviations. Generalization was improved by extending
the classifier training dataset with a noisy copy of each pattern, which was created by independently
adding to each image pixel a noise value sampled from a zero-mean Gaussian distribution with
standard deviation of 0.3.
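The degradation procedure can be sketched as follows (Xtest and the noise level shown are assumptions for illustration):

% Corrupt test patterns with truncated zero-mean Gaussian noise.
sd = 0.4;                                  % one of the tested noise levels
noise = sd * randn(size(Xtest));
noise = max(min(noise, 2 * sd), -2 * sd);  % truncate at two standard deviations
XtestNoisy = Xtest + noise;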
Overall activity for natural images, letters and pseudoletters. Representational selectivity at
different levels of the hierarchy was tested by analyzing how responses in H1 and H2 were
modulated by the type of visual input. We probed the network with three different types of visual
input: randomly selected stimuli from the natural images dataset; a set of uppercase letters; and a
set of corresponding “pseudoletters”. To this aim, from the test set we selected the patterns
containing the letters used in the study of Chang and colleagues1: A, K, Y, H, F, T and L. The letter X
was excluded for simplicity, because its rotated version was not produced using a canonical angle
(multiple of 90 degrees). To more closely match the type of stimuli used by Chang and colleagues,
we only selected letters printed in the Arial font, with no variations in weight, size, style and
position. In order to increase variability, we then created 5 copies of each letter by adding a small
amount of Gaussian noise (std.dev. = 0.01), resulting in a total of 35 patterns. For each pattern, we
created the corresponding pseudoletter by performing the same transformations (flipping and
rotations) applied by Chang and colleagues (one sample for each pseudoletter is shown in
Supplementary Fig. 5c; the original set of pseudoletters used by Chang and colleagues is shown in
Supplementary Fig. 5d). For each type of stimulus, we computed the corresponding mean activation
norm (L2) of hidden neurons in layers H1 and H2, and performed paired t-tests to assess activation
difference at each layer. Activation norm was used to acknowledge the fact that cerebral activation,
via neurovascular coupling, is driven by both inhibitory and excitatory neurons. Theoretically, this
proxy for neuronal activity appeals to the fact that any deviation from (non-equilibrium) steady-state
will increase cerebral metabolism, through the equivalence between thermodynamic and
informational free energy17,18. For the comparison with natural images, the mean activation norm
was computed on a random sample of 35 patches.
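The comparison of activation norms can be sketched as follows (variable names are assumptions; ttest, from the MATLAB Statistics Toolbox, performs the paired test):

% Mean L2 activation norms for letters vs. pseudoletters in layer H1.
normLetters = sqrt(sum(H1letters .^ 2, 1));        % one norm per pattern (column)
normPseudo  = sqrt(sum(H1pseudo  .^ 2, 1));
[~, p, ~, stats] = ttest(normLetters, normPseudo); % paired t-test across patterns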
Supplementary Results
Representational selectivity for letters vs. mirror letters. We tested whether the effect reported in Fig. 2c was also present for mirror images of letters. We selected 7 horizontally asymmetric letters (F, J, K, L, N, R, Z; three were the same letters used to create the pseudoletters) and flipped them along the vertical axis. We did not find a significant difference in the H1 activation norm for mirror vs. canonical letters (t(34) = 1.861, p > .05, d = 0.315). The difference was still present at layer H2 (t(34) = 3.040, p < .01, d = 0.514), which was expected given that this layer learned to represent canonical letters.
Read-out performance with random networks. Networks with randomly generated weight matrices
were used as a baseline. Indeed, it has been shown that random networks can support surprisingly
good performance in classification tasks19,20. In one set of control simulations, we used the internal representations of a single-layer random network as input to the read-out. The network had 1000
hidden units and its weight matrix was initialized using a Gaussian distribution with zero mean and
several different values of standard deviation (std.dev. = 10, 1, 0.5, 0.1, 0.01), thereby yielding 5 different versions of the random network. The results reported in Fig. 2e represent the random network
that achieved the best recognition performance (random weights with std.dev. = 0.1). In a second
control simulation we used a two-layer architecture obtained by stacking a Restricted Boltzmann
Machine (RBM) on top of the single-layer random network described above. This additional RBM
was trained on the letter dataset as in the main simulation. We found that read-out performance
from the RBM's internal representations (i.e., the top layer of the network) never improved in
comparison to the single-layer random network. These results show that the features obtained by
projecting the image through a random matrix are inadequate, both for decoding and as an intermediate level for learning letter representations.
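The random-network baseline can be sketched as follows (the logistic activation and the variable names are our assumptions; X holds the 40x40 = 1600-dimensional input patterns as columns):

% Single-layer random projection followed by the usual read-out.
sd = 0.1;                        % best-performing weight scale
Wr = sd * randn(1000, 1600);     % 1000 hidden units, fixed random weights
H  = 1 ./ (1 + exp(-(Wr * X)));  % random internal representations
W  = (H' \ L')';                 % same pseudo-inverse read-out as above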
Read-out performance without recycling natural image features. To better assess the importance
of H1 features as an intermediate representation level, we also tested the read-out accuracy on a
deep belief network trained directly on the whitened letter images. The deep network was composed of a stack of two RBMs using the same learning scheme described above. The only
difference was that the connection weights of the first-level RBM were not learned on natural
images, but rather all weights were adjusted by generative learning on the printed letter dataset. To
make the comparison easier, we adopted the same processing architecture, with 1000 neurons in
the first hidden layer and 1300 neurons in the second hidden layer. Though read-out from the
deepest layer was better than from the first hidden layer, performance was always worse in
comparison to the model that recycled natural image features (see Fig. 2e).
Reduced training set for the unsupervised learning phase. Results reported in Fig. 2e show the
read-out performance when the classifier was trained on the reduced dataset (i.e., Arial and Times
fonts). However, unsupervised learning in the deep belief network still relied on prolonged exposure
(120 epochs) to the full training dataset (32760 patterns). Indeed, the high dimensionality of the
parameter space in deep neural networks normally implies that thousands of training examples must
be used to avoid overfitting issues21,22, which is in sharp contrast with the limited amount of
experience often required by human learners23. In a final set of control simulations, we therefore
tested the deep network's performance when the amount of unsupervised learning was also strongly reduced. To
this aim, only the two prototypical fonts (Arial and Times) were selected, and only 50% of the
resulting patterns were included in the training dataset. This reduced set included 4680 patterns,
less than 15% of the original training set. Moreover, the learning trajectory of the network was
tracked by measuring read-out accuracy after every 40 epochs. The read-out classifier was trained
on the same training set used for the unsupervised learning (see previous simulations for all other
details). Classification accuracy was then measured on two different test sets: one including only the
remaining 50% of Arial and Times patterns, and one with all test patterns used in the main
simulations, thereby including all fonts. Test images were corrupted by a fixed level of Gaussian
noise (std.dev. = 0.4). As shown in Supplementary Fig. 3, read-out performance for the deep network
trained with recycling (blue curves) was remarkable even at early learning stages, especially when
the read-out involved the same fonts seen during learning (left panel of Supplementary Fig. 3). The
network trained without recycling (red curves) showed a gap in performance that was particularly
marked at the early stages of learning. This shows that natural image features constitute a privileged
starting point for learning visual symbols. Note that transfer of perceptual knowledge in deep
networks has also been simulated across different writing scripts, such as Latin and Farsi24.
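The construction of the reduced training set described above can be sketched as follows (fontLabels and the other variable names are assumptions for illustration):

% Keep only Arial and Times patterns, then sample a random 50% for training.
keep = ismember(fontLabels, {'Arial', 'Times'});
X2 = X(:, keep);
n = size(X2, 2);
trainIdx = randperm(n, round(n / 2));    % random half of the patterns
Xtrain = X2(:, trainIdx);
Xtest  = X2(:, setdiff(1:n, trainIdx));  % remaining 50% used for testing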
Supplementary References
1. Chang, C. H. C. et al. Adaptation of the human visual system to the statistics of letters and line
configurations. Neuroimage 120, 428–440 (2015).
2. Snavely, N., Seitz, S. M. & Szeliski, R. Photo tourism: Exploring Photo Collections in 3D. ACM
Trans. Graph. 25, 835–846 (2006).
3. Hyvärinen, A., Hurri, J. & Hoyer, P. O. Natural Image Statistics: A Probabilistic Approach to
Early Computational Vision. (Springer London, 2009).
4. Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nat.
Neurosci. 2, 1019–25 (1999).
5. Simoncelli, E. P. & Olshausen, B. A. Natural image statistics and neural representation. Annu. Rev. Neurosci. 24, 1193–1216 (2001).
6. Hinton, G. E. & Salakhutdinov, R. Reducing the dimensionality of data with neural networks.
Science. 313, 504–7 (2006).
7. Hinton, G. E., Osindero, S. & Teh, Y. A fast learning algorithm for deep belief nets. Neural
Comput. 18, 1527–1554 (2006).
8. Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural
Comput. 14, 1771–1800 (2002).
9. Zorzi, M., Testolin, A. & Stoianov, I. Modeling language and cognition with deep unsupervised
learning: a tutorial overview. Front. Psychol. 4, 515 (2013).
10. Lee, H., Ekanadham, C. & Ng, A. Y. Sparse deep belief net models for visual area V2. Adv. Neural Inf. Process. Syst. 20, 873–880 (2008).
11. Testolin, A., De Filippo De Grazia, M. & Zorzi, M. The role of architectural and learning
constraints in neural network models: A case study on visual space coding. Front. Comput.
Neurosci. 11 (2017).
12. Cho, K., Raiko, T. & Ilin, A. Enhanced Gradient and Adaptive Learning Rate for Training
Restricted Boltzmann Machines. Int. Conf. Mach. Learn. 105–112 (2011).
13. Wang, N., Melchior, J. & Wiskott, L. Gaussian-binary restricted Boltzmann machines for modeling natural image statistics. PLoS One 12, e0171015 (2017).
14. Xiong, H., Rodríguez-Sánchez, A. J., Szedmak, S. & Piater, J. Diversity priors for learning early
visual features. Front. Comput. Neurosci. 9, 104 (2015).
15. Albert, A. Regression and the Moore-Penrose pseudoinverse. (Academic Press, 1972).
16. Hertz, J. A., Krogh, A. S. & Palmer, R. G. Introduction to the theory of neural computation. (Addison-Wesley, 1991).
17. Friston, K. J. et al. Dynamic causal modelling revisited. Neuroimage 1273–1302 (2017).
18. Sengupta, B., Stemmler, M. B. & Friston, K. J. Information and Efficiency in the Nervous
System - A Synthesis. PLoS Comput. Biol. 9 (2013).
19. Jaeger, H., Maass, W. & Principe, J. Special issue on echo state networks and liquid state
machines. Neural Networks 20, 287–289 (2007).
20. Widrow, B., Greenblatt, A., Kim, Y. & Park, D. The No-Prop algorithm: A new learning
algorithm for multilayer neural networks. Neural Networks 37, 182–188 (2013).
21. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
22. Ciresan, D., Meier, U., Gambardella, L. M. & Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. Neural Comput. 22, 3207–3220 (2010).
23. Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. (2017).
24. Sadeghi, Z. & Testolin, A. Learning representation hierarchies by sharing visual features: A
computational investigation of Persian character recognition with unsupervised deep learning.
Cogn. Process. 14, 1–12 (2017).