Letter perception emerges from unsupervised deep learning
and recycling of natural image features
Alberto Testolin 1, Ivilin Stoianov 2,3, Marco Zorzi 1,4 *
1 Department of General Psychology and Padova Neuroscience Center,
University of Padova, Italy
2 Centre National de la Recherche Scientifique, Aix-Marseille Université, France
3 Institute of Cognitive Sciences and Technologies, CNR Padova, Italy
4 IRCCS San Camillo Neurorehabilitation Hospital, Venice-Lido, Italy
*Correspondence concerning this article should be addressed to Marco Zorzi, Department of General
Psychology, University of Padova, Via Venezia 12, Padova 35131, Italy. E-mail: [email protected]
Supplementary Figures
Supplementary Figure 1: The complete set of receptive fields developed in the first hidden layer.
Supplementary Figure 2: The complete set of receptive fields developed in the second hidden layer.
Supplementary Figure 3: Progressive refinement of read-out accuracy following unsupervised
learning on the reduced dataset. Accuracy was computed both on the Arial and Times test patterns
(left panel) and on the full set of test patterns (right panel), which included all fonts.
Supplementary Figure 4: Letter similarity matrix obtained on the model internal representations
(left panel) and by averaging the human similarity judgments of three published studies (right panel).
Ordering of the letters is optimized by hierarchical clustering. Lighter colors indicate higher
similarity: clusters of similar letters are highlighted by the yellow-colored groups along the main
diagonal.
Supplementary Figure 5: (a) The set of Arial letters, followed by noisy versions created with
increasing levels of Gaussian noise (std.dev. = 0.1, 0.4, 0.7, 1.1). (b) The set of Arial letters after
whitening. (c) Pseudoletters produced by rotating uppercase letters using the same procedure
adopted in the study of Chang and colleagues1. Their original set of stimuli is reported in panel (d).
Supplementary Tables
Supplementary Table 1: Pearson correlation coefficients between the empirical confusion matrices and the confusion matrix derived from the model's errors when read-out is applied to layer H1.

Empirical study                  Model correlation
Townsend-1 (1971)                .62
Townsend-2 (1971)                .46
Gilmore et al. (1979)            .26
Loomis (1982)                    .29
Phillips et al. (1983)           .40
van der Heijden et al. (1984)    .58
Average correlation              .45
Supplementary Methods
Natural images and printed letters datasets. We used a published, freely available natural image dataset containing a large number of gray-scale pictures of three subjects: Yosemite park, the Statue of Liberty and the Notre Dame cathedral2. Though it might seem counterintuitive to also treat human-made artifacts as natural scenes, it has been shown that the types of spatial structure present in "wild" environments give rise to statistical visual features similar to those learned from environments shaped by human activity3. Moreover, datasets that include human artifacts might better reflect the everyday visual experience of people living in developed countries. Gray-scale, 40x40 pixel bitmaps of the 26 Latin uppercase letters were created using the getframe MATLAB function. Pixel values ranged between 0.2 (plain background) and 0.8 (maximum signal strength). This made it possible to manipulate the original stimuli by adding Gaussian noise during the simulations, as explained below. A small amount of variability in the size and location of each letter was introduced to make learning more robust: our model is primarily concerned with shape invariance and geometric similarity among input patterns, whereas scale and position invariance could be obtained by including other processing mechanisms such as convolution and max-pooling operations4.
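As an illustration of this rendering step, a minimal MATLAB sketch is given below; the figure geometry, font size and variable names are our own assumptions, and the exact rendering settings used in the simulations may differ:

% Sketch: render one letter as a small gray-scale bitmap via getframe.
% Background and foreground colors match the 0.2/0.8 pixel range above.
fig = figure('Position', [100 100 40 40], 'Color', [0.2 0.2 0.2]);
axes('Position', [0 0 1 1]); axis off;
text(0.5, 0.5, 'A', 'FontName', 'Arial', 'FontSize', 24, ...
    'HorizontalAlignment', 'center', 'Color', [0.8 0.8 0.8]);
frame = getframe(fig);                      % capture the rendered figure
img = double(frame.cdata(:, :, 1)) / 255;   % gray-scale values near 0.2/0.8
close(fig);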
Whitening algorithm. Following previous research5, pre-processing occurring in the retina and LGN was implemented as a 1/f whitening algorithm, which used a filter in the frequency domain designed to flatten the spectrum of natural images. Since the power spectrum of natural images tends to fall as 1/f^2, the amplitude spectrum falls as 1/f. The amplitude spectrum of the whitening filter therefore rose linearly with frequency, to compensate for the 1/f amplitude spectrum of natural images. Moreover, to avoid amplifying high-frequency noise, the filter was multiplied by a two-dimensional Gaussian, thereby obtaining a center-surround type of filter. This filter was applied to the images in the frequency domain. Local contrast normalization was then obtained by dividing the value of each pixel by the standard deviation of the total activity of its neighborhood, using a Gaussian neighborhood with a diameter of 20 pixels. Whitened letter images are shown in Supplementary Fig. 5b.
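A minimal MATLAB sketch of this pipeline is given below; the cutoff frequency, the Gaussian widths and all variable names are our own assumptions rather than the exact values used in the simulations:

% Sketch: 1/f whitening followed by local contrast normalization.
% img: gray-scale image with even dimensions, values in [0, 1].
[rows, cols] = size(img);
[fx, fy] = meshgrid(-cols/2:cols/2-1, -rows/2:rows/2-1);
rho  = sqrt(fx.^2 + fy.^2);            % radial spatial frequency
f0   = 0.4 * min(rows, cols) / 2;      % assumed cutoff of the Gaussian
filt = rho .* exp(-(rho / f0).^2);     % linear ramp times Gaussian low-pass
IMG  = fftshift(fft2(img));            % spectrum, zero frequency centered
imgW = real(ifft2(ifftshift(IMG .* filt)));

% Local contrast normalization: divide each pixel by the standard
% deviation of activity in a Gaussian neighborhood (20-pixel diameter).
r = 10; sigma = r / 2;                 % assumed Gaussian width
[kx, ky] = meshgrid(-r:r, -r:r);
g = exp(-(kx.^2 + ky.^2) / (2 * sigma^2));
g = g / sum(g(:));
localMean = conv2(imgW, g, 'same');
localVar  = conv2(imgW.^2, g, 'same') - localMean.^2;
imgN = imgW ./ sqrt(max(localVar, eps));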
Unsupervised deep learning details. Our unsupervised deep learning model was implemented as a deep belief network6,7 composed of a stack of Restricted Boltzmann Machines (RBMs). The dynamics of each RBM are driven by an energy function E, which determines which configurations of the neurons are more likely to occur by assigning them a probability value:
p(v, h) = e^{-E(v, h)} / Z
where v and h are, respectively, the visible and hidden neurons and Z is a normalizing factor known
as the partition function, which ensures that the values of p constitute a proper probability distribution (i.e., they sum to one). The restricted connectivity of RBMs does not allow intra-layer
connections, resulting in a particularly simple form for the energy function:
E(v, h) = -b^{T} v - c^{T} h - h^{T} W v
where W is the matrix of connection weights and b and c are the biases of visible and hidden
neurons, respectively.
RBMs were trained in a greedy, layer-wise fashion using 1-step contrastive divergence8. This learning
procedure approximately minimizes the Kullback-Leibler divergence between the data distribution and the model
distribution. Accordingly, for each pattern the network performs a data-driven, positive phase (+)
and a model-driven, negative phase (-). In the positive phase all the visible neurons are clamped to
the current pattern, and the activation of hidden neurons is computed as a conditional probability:
P(h | v) = \prod_{j=1}^{n} P(h_j | v)
where n is the total number of hidden neurons, and the activation probabilities for each individual
neuron are given by the logistic function:
P(h_j = 1 | v) = 1 / (1 + e^{-c_j - \sum_{i=1}^{m} w_{ji} v_i})
where m is the total number of visible neurons, c_j is the bias of the hidden neuron h_j and w_ji represents the weight of its connection with each visible neuron v_i. During the negative phase, the
activation of the hidden neurons corresponding to the clamped data pattern is used in an analogous
way to perform top-down inference over the visible neurons (model’s reconstruction), which are in
turn used to update the state of the hidden neurons. Connection weights are then updated by
contrasting the visible-hidden correlations computed on the data vector (v^+, h^+) with the visible-hidden correlations computed on the model's reconstruction (v^-, h^-):

\Delta W = \eta ( h^{+} (v^{+})^{T} - h^{-} (v^{-})^{T} )
where η is the learning rate.
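For concreteness, one CD-1 update can be sketched in MATLAB as follows; variable names are our own assumptions, and the momentum, weight-decay and sparsity terms described below are omitted for brevity:

% Sketch: 1-step contrastive divergence update for a binary RBM.
% v: m-by-k mini-batch (one pattern per column); W: n-by-m weights;
% b: visible biases (m-by-1); c: hidden biases (n-by-1); eta: learning rate.
% (uses implicit expansion, MATLAB R2016b or later)
k  = size(v, 2);
hP = 1 ./ (1 + exp(-(W * v + c)));      % positive phase: P(h = 1 | v)
hS = double(hP > rand(size(hP)));       % sample binary hidden states
vR = 1 ./ (1 + exp(-(W' * hS + b)));    % negative phase: reconstruction
hR = 1 ./ (1 + exp(-(W * vR + c)));     % hidden probabilities on reconstruction
W  = W + eta * (hP * v' - hR * vR') / k;  % contrast the two correlations
b  = b + eta * mean(v - vR, 2);
c  = c + eta * mean(hP - hR, 2);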
For the layer trained with patches of natural images, learning was performed for 200 epochs with
learning rate of 0.03, momentum coefficient of 0.8 and weight decay factor of 0.0001. Patterns were
learned in a mini-batch scheme with 100 examples per batch. For the layer trained on printed
letters, learning was performed for 120 epochs with learning rate of 0.01, momentum coefficient of
0.9 and decay factor of 0.000004. Patterns were learned in a mini-batch scheme of size 91. Learning
in this layer was also weakly constrained by a sparsity factor that forced the network’s internal
representations to rely on a limited number of active hidden neurons. Sparsity was implemented by
driving the probability of a unit being active towards a given low target value, which was set to 0.1 (refs 9-11). The two layers required different learning hyperparameters because the training distributions were different in nature and complexity. Although there exist automatic procedures that try to set the values of some hyperparameters optimally12, we preferred not to employ them, in order to keep the learning algorithm as simple as possible. We also note that some authors have recently
used real-valued RBMs to model natural image patches13,14, resulting in low-level features
comparable to those learned by our model.
Supervised read-out details. A read-out, linear classifier was used to associate data patterns P =
{P1, P2, …, Pn} with desired categories L = {L1, L2, …, Ln} by means of the following linear mapping:
L = W P
where P and L are matrices containing n column vectors that encode, respectively, the patterns Pi and the binary class labels Li, and W is the weight matrix of the linear classifier. If an exact solution to
this linear system does not exist, a least-mean-square approximation can be found by computing the
weight matrix as:
W = L P^{+}
where P^+ is the Moore-Penrose pseudo-inverse15,16. In our implementation, we used the efficient least-squares solution provided by the "backslash" operator in MATLAB.
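A minimal sketch of this read-out in MATLAB, under assumed variable names (P holds one pattern per column, L the corresponding one-hot labels):

% Least-squares read-out via the backslash operator.
W = (P' \ L')';                   % solves L = W * P in the least-squares sense
% equivalently: W = L * pinv(P);  % explicit Moore-Penrose pseudo-inverse
[~, predicted] = max(W * Ptest);  % predicted class index per test pattern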
The drop in performance following input degradation was measured by adding to the test patterns an
increasing amount of zero-mean Gaussian noise with standard deviation ranging from 0.1 up to 1.5,
with a step of 0.1 (samples of letters at different noise levels are reported in Supplementary Fig. 5a).
Noise was always truncated at two standard deviations. Generalization was improved by extending
the classifier training dataset with a noisy copy of each pattern, which was created by independently
adding to each image pixel a noise value sampled from a zero-mean Gaussian distribution with
standard deviation of 0.3.
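The degradation procedure can be sketched as follows (Xtest and the noise level shown are assumptions for illustration):

% Corrupt test patterns with truncated zero-mean Gaussian noise.
sd = 0.4;                                  % one of the tested noise levels
noise = sd * randn(size(Xtest));
noise = max(min(noise, 2 * sd), -2 * sd);  % truncate at two standard deviations
XtestNoisy = Xtest + noise;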
Overall activity for natural images, letters and pseudoletters. Representational selectivity at
different levels of the hierarchy was tested by analyzing how responses in H1 and H2 were
modulated by the type of visual input. We probed the network with three different types of visual
input: randomly selected stimuli from the natural images dataset; a set of uppercase letters; and a
set of corresponding “pseudoletters”. To this aim, from the test set we selected the patterns
containing the letters used in the study of Chang and colleagues1: A, K, Y, H, F, T and L. The letter X
was excluded for simplicity, because its rotated version was not produced using a canonical angle
(multiple of 90 degrees). To more closely match the type of stimuli used by Chang and colleagues,
we only selected letters printed in the Arial font, with no variations in weight, size, style and
position. In order to increase variability, we then created 5 copies of each letter by adding a small
amount of Gaussian noise (std.dev. = 0.01), resulting in a total of 35 patterns. For each pattern, we
created the corresponding pseudoletter by performing the same transformations (flipping and
rotations) applied by Chang and colleagues (one sample for each pseudoletter is shown in
Supplementary Fig. 5c; the original set of pseudoletters used by Chang and colleagues is shown in
Supplementary Fig. 5d). For each type of stimulus, we computed the corresponding mean activation
norm (L2) of hidden neurons in layers H1 and H2, and performed paired t-tests to assess activation
difference at each layer. Activation norm was used to acknowledge the fact that cerebral activation,
via neurovascular coupling, is driven by both inhibitory and excitatory neurons. Theoretically, this
proxy for neuronal activity appeals to the fact that any deviation from (non-equilibrium) steady-state
will increase cerebral metabolism, through the equivalence between thermodynamic and
informational free energy17,18. For the comparison with natural images, the mean activation norm
was computed on a random sample of 35 patches.
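The comparison of activation norms can be sketched as follows (variable names are assumptions; ttest, from the MATLAB Statistics Toolbox, performs the paired test):

% Mean L2 activation norms for letters vs. pseudoletters in layer H1.
normLetters = sqrt(sum(H1letters .^ 2, 1));        % one norm per pattern (column)
normPseudo  = sqrt(sum(H1pseudo  .^ 2, 1));
[~, p, ~, stats] = ttest(normLetters, normPseudo); % paired t-test across patterns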
Supplementary Results
Representational selectivity for letters vs. mirror letters. We tested whether the effect reported in Fig. 2c was also present for mirror images of letters. We selected 7 horizontally asymmetric letters (F, J, K, L, N, R, Z; three were the same letters used to create the pseudoletters) and flipped them along the vertical axis. We did not find a significant difference in the H1 activation norm for mirror vs. canonical letters (t(34) = 1.861, p > .05, d = 0.315). The difference was still present at layer H2 (t(34) = 3.040, p < .01, d = 0.514), which was expected given that this layer learned to represent canonical letters.
Read-out performance with random networks. Networks with randomly generated weight matrices
were used as a baseline. Indeed, it has been shown that random networks can support surprisingly
good performance in classification tasks19,20. In one set of control simulations, we used the internal representations of a single-layer random network as input to the read-out. The network had 1000
hidden units and its weight matrix was initialized using a Gaussian distribution with zero mean and
several different values of standard deviation (std.dev. = 10, 1, 0.5, 0.1, 0.01), thereby yielding 5 different versions of the random network. The results reported in Fig. 2e represent the random network
that achieved the best recognition performance (random weights with std.dev. = 0.1). In a second
control simulation we used a two-layer architecture obtained by stacking a Restricted Boltzmann
Machine (RBM) on top of the single-layer random network described above. This additional RBM
was trained on the letter dataset as in the main simulation. We found that read-out performance
from the RBM's internal representations (i.e., the top layer of the network) never improved in
comparison to the single-layer random network. These results show that the features obtained by
projecting the image through a random matrix are inadequate, both for decoding and as an intermediate level for learning letter representations.
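The random-network baseline can be sketched as follows (the logistic activation and the variable names are our assumptions; X holds the 40x40 = 1600-dimensional input patterns as columns):

% Single-layer random projection followed by the usual read-out.
sd = 0.1;                        % best-performing weight scale
Wr = sd * randn(1000, 1600);     % 1000 hidden units, fixed random weights
H  = 1 ./ (1 + exp(-(Wr * X)));  % random internal representations
W  = (H' \ L')';                 % same pseudo-inverse read-out as above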
Read-out performance without recycling natural image features. To better assess the importance
of H1 features as an intermediate representation level, we also tested the read-out accuracy on a
deep belief network trained directly on the whitened letter images. The deep network was composed of a stack of two RBMs using the same learning scheme described above. The only
difference was that the connection weights of the first-level RBM were not learned on natural
images, but rather all weights were adjusted by generative learning on the printed letter dataset. To
make the comparison easier, we adopted the same processing architecture, with 1000 neurons in
the first hidden layer and 1300 neurons in the second hidden layer. Though read-out from the
deepest layer was better than from the first hidden layer, performance was always worse in
comparison to the model that recycled natural image features (see Fig. 2e).
Reduced training set for the unsupervised learning phase. Results reported in Fig. 2e show the
read-out performance when the classifier was trained on the reduced dataset (i.e., Arial and Times
fonts). However, unsupervised learning in the deep belief network still relied on prolonged exposure
(120 epochs) to the full training dataset (32760 patterns). Indeed, the high dimensionality of the
parameter space in deep neural networks normally implies that thousands of training examples must
be used to avoid overfitting issues21,22, which is in sharp contrast with the limited amount of
experience often required by human learners23. In a final set of control simulations, we therefore
tested the deep network's performance when the amount of unsupervised learning was also strongly reduced. To
this aim, only the two prototypical fonts (Arial and Times) were selected, and only 50% of the
resulting patterns were included in the training dataset. This reduced set included 4680 patterns,
less than 15% of the original training set. Moreover, the learning trajectory of the network was
tracked by measuring read-out accuracy after every 40 epochs. The read-out classifier was trained
on the same training set used for the unsupervised learning (see previous simulations for all other
details). Classification accuracy was then measured on two different test sets: one including only the
remaining 50% of Arial and Times patterns, and one with all test patterns used in the main
simulations, thereby including all fonts. Test images were corrupted by a fixed level of Gaussian
noise (std.dev. = 0.4). As shown in Supplementary Fig. 3, read-out performance for the deep network
trained with recycling (blue curves) was remarkable even at early learning stages, especially when
the read-out involved the same fonts seen during learning (left panel of Supplementary Fig. 3). The
network trained without recycling (red curves) showed a gap in performance that was particularly
marked at the early stages of learning. This shows that natural image features constitute a privileged
starting point for learning visual symbols. Note that transfer of perceptual knowledge in deep
networks has also been simulated across different writing scripts, such as Latin and Farsi24.
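The construction of the reduced training set described above can be sketched as follows (fontLabels and the other variable names are assumptions for illustration):

% Keep only Arial and Times patterns, then sample a random 50% for training.
keep = ismember(fontLabels, {'Arial', 'Times'});
X2 = X(:, keep);
n = size(X2, 2);
trainIdx = randperm(n, round(n / 2));    % random half of the patterns
Xtrain = X2(:, trainIdx);
Xtest  = X2(:, setdiff(1:n, trainIdx));  % remaining 50% used for testing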
Supplementary References
1. Chang, C. H. C. et al. Adaptation of the human visual system to the statistics of letters and line
configurations. Neuroimage 120, 428–440 (2015).
2. Snavely, N., Seitz, S. M. & Szeliski, R. Photo tourism: Exploring Photo Collections in 3D. ACM
Trans. Graph. 25, 835–846 (2006).
3. Hyvärinen, A., Hurri, J. & Hoyer, P. O. Natural Image Statistics: A Probabilistic Approach to
Early Computational Vision. (Springer London, 2009).
4. Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nat.
Neurosci. 2, 1019–25 (1999).
5. Simoncelli, E. P. & Olshausen, B. A. Natural image statistics and neural representation. Annu. Rev. Neurosci. 24, 1193–1216 (2001).
6. Hinton, G. E. & Salakhutdinov, R. Reducing the dimensionality of data with neural networks.
Science. 313, 504–7 (2006).
7. Hinton, G. E., Osindero, S. & Teh, Y. A fast learning algorithm for deep belief nets. Neural
Comput. 18, 1527–1554 (2006).
8. Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural
Comput. 14, 1771–1800 (2002).
9. Zorzi, M., Testolin, A. & Stoianov, I. Modeling language and cognition with deep unsupervised
learning: a tutorial overview. Front. Psychol. 4, 515 (2013).
10. Lee, H., Ekanadham, C. & Ng, A. Y. Sparse deep belief net models for visual area V2. Adv. Neural Inf. Process. Syst. 20, 873–880 (2008).
11. Testolin, A., De Filippo De Grazia, M. & Zorzi, M. The role of architectural and learning
constraints in neural network models: A case study on visual space coding. Front. Comput.
Neurosci. 11 (2017).
12. Cho, K., Raiko, T. & Ilin, A. Enhanced Gradient and Adaptive Learning Rate for Training
Restricted Boltzmann Machines. Int. Conf. Mach. Learn. 105–112 (2011).
13. Wang, N., Melchior, J. & Wiskott, L. Gaussian-binary restricted Boltzmann machines for modeling natural image statistics. PLoS One 12, e0171015 (2017).
14. Xiong, H., Rodríguez-Sánchez, A. J., Szedmak, S. & Piater, J. Diversity priors for learning early
visual features. Front. Comput. Neurosci. 9, 104 (2015).
15. Albert, A. Regression and the Moore-Penrose pseudoinverse. (Academic Press, 1972).
16. Hertz, J. A., Krogh, A. S. & Palmer, R. G. Introduction to the theory of neural computation. (Addison-Wesley, 1991).
17. Friston, K. J. et al. Dynamic causal modelling revisited. Neuroimage 1273–1302 (2017).
18. Sengupta, B., Stemmler, M. B. & Friston, K. J. Information and Efficiency in the Nervous
System - A Synthesis. PLoS Comput. Biol. 9 (2013).
19. Jaeger, H., Maass, W. & Principe, J. Special issue on echo state networks and liquid state
machines. Neural Networks 20, 287–289 (2007).
20. Widrow, B., Greenblatt, A., Kim, Y. & Park, D. The No-Prop algorithm: A new learning
algorithm for multilayer neural networks. Neural Networks 37, 182–188 (2013).
21. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
22. Ciresan, D., Meier, U., Gambardella, L. M. & Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. Neural Comput. 22, 3207–3220 (2010).
23. Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. (2017).
24. Sadeghi, Z. & Testolin, A. Learning representation hierarchies by sharing visual features: A
computational investigation of Persian character recognition with unsupervised deep learning.
Cogn. Process. 14, 1–12 (2017).