Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future
Grace W. Lindsay
Abstract
Convolutional neural networks (CNNs) were inspired by early findings in the study of biological vision. They have since become successful tools in computer vision and state-of-the-art models of both neural activity and behavior on visual tasks. This review highlights what, in the context of CNNs, it means to be a good model in computational neuroscience and the various ways models can provide insight. Specifically, it covers the origins of CNNs and the methods by which we validate them as models of biological vision. It then goes on to elaborate on what we can learn about biological vision by understanding and experimenting on CNNs and discusses emerging opportunities for the use of CNNs in vision research beyond basic object recognition.
INTRODUCTION
Computational models serve several purposes in neuroscience. They can validate intuitions about how a system works by providing a way to test those intuitions directly. They also offer a means to explore new hypotheses in an ideal experimental testing ground, wherein every detail can be controlled and measured. In addition, models open the system in question up to a new realm of understanding through the use of mathematical analysis. In recent years, convolutional neural networks (CNNs) have performed all of these roles as a model of the visual system.

This review covers the origins of CNNs, the methods by which we validate them as models of the visual system, what we can find by experimenting on them, and emerging opportunities for their use in vision research. Importantly, this review is not intended to be a thorough overview of CNNs or to cover all uses of deep learning in the study of vision (other reviews may be of use to the reader for this; Kietzmann, McClure, & Kriegeskorte, 2019; Kriegeskorte & Golan, 2019; Serre, 2019; Storrs & Kriegeskorte, 2019; Yamins & DiCarlo, 2016; Kriegeskorte, 2015). Rather, it is meant to demonstrate the strategies by which CNNs as a model can be used to gain insight and understanding about biological vision.

According to Kay (2018), "a functional model attempts only to match the outputs of a system given the same inputs provided to the system, whereas a mechanistic model attempts to also use components that parallel the actual physical components of the system." Using these definitions, this review is concerned with the use of CNNs as "mechanistic" models of the visual system. That is, it will be assumed and argued that, in addition to an overall match between the outputs of the two systems, subparts of a CNN are intended to match subparts of the visual system.
WHERE CNNS CAME FROM
The history of CNNs threads through both neuroscience and artificial intelligence. Like artificial neural networks in general, they are an example of brain-inspired ideas coming to fruition through an interaction with computer science and engineering.
Origins of the Model
In the mid-twentieth century, Hubel and Wiesel discovered two major cell types in the primary visual cortex (V1) of cats (Hubel & Wiesel, 1962). The first type, the simple cells, responds to bars of light or dark placed at specific spatial locations. Each cell has an orientation of the bar at which it fires most, with its response falling off as the angle of the bar changes from this preferred orientation (creating an orientation "tuning curve"). The second type, complex cells, has less strict response profiles; these cells still have preferred orientations but can respond just as strongly to a bar in several different nearby locations. Hubel and Wiesel concluded that these complex cells are likely receiving input from several simple cells, all with the same preferred orientation but with slightly different preferred locations (Figure 1, left).
In 1980, Fukushima transformed Hubel and Wiesel's findings into a functioning model of the visual system (Fukushima, 1980). This model, the Neocognitron, is the precursor to modern CNNs. It contains two main cell types. The S-cells are named after simple cells and replicate their basic features: Specifically, a 2-D grid of weights is applied at each location in the input image to create the S-cell responses. A "plane" of S-cells thus has a retinotopic layout, with all cells sharing the same preferred visual features and multiple planes existing at a layer. The response of the C-cells (named after complex cells) is a nonlinear function of several S-cells coming from the same plane but at different locations.

University College London

© 2020 Massachusetts Institute of Technology. Journal of Cognitive Neuroscience X:Y, pp. 1–15. https://doi.org/10.1162/jocn_a_01544
After a layer of simple and complex cells representing the basic computations of V1, the Neocognitron simply repeats the process again. That is, the output of the first layer of complex cells serves as the input to the second simple cell layer, and so on. With several repeats, this creates a hierarchical model that mimics not just the operations of V1 but the ventral visual pathway as a whole. The network is "self-organized," meaning weights change with repeated exposure to unlabeled images.

By the 1990s, many similar hierarchical models of the visual system were being explored and related back to data (Riesenhuber & Poggio, 2000). One of the most prominent ones, HMAX, used the simple "max" operation over the activity of a set of simple cells to determine the response of the C-cells and was very robust to image variations. Because these models could be applied to the same images used in human psychophysics experiments, the behavior of a model could be directly compared with the ability of humans to perform rapid visual categorization. Through this, a correspondence between these hierarchical models and the first 100–150 msec of visual processing was found (Serre et al., 2007). Such models were also fit to capture the responses of V4 neurons to complex shape stimuli (Cadieu et al., 2007).
CNNs in Computer Vision
The CNN as we know it today comes from the field of computer vision, yet the inspiration from the work of Hubel and Wiesel is clearly visible in it (Figure 1; Rawat & Wang, 2017). Modern CNNs start by convolving a set of filters with an input image and rectifying the outputs, leading to "feature maps" akin to the planes of S-cells in the Neocognitron. Max pooling is then applied, creating complex cell-like responses. After several iterations of this pattern, nonconvolutional fully connected layers are added, and the last layer contains as many units as the number of categories in the task, to output a category label for the image (Figure 2, bottom).
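As a concrete illustration, this feedforward pattern (convolve, rectify, pool, then a fully connected readout) can be sketched in a few lines of NumPy. Everything here, the image, filters, and readout weights, is random and untrained; the sketch only shows the shape of the computation, not a working classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectification: negative responses are set to zero."""
    return np.maximum(x, 0.0)

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the largest response in each patch."""
    h, w = fmap.shape
    h, w = h - h % size, w - w % size
    return fmap[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A toy forward pass: one 12x12 "image", 4 random 3x3 filters, then a fully
# connected readout over 10 hypothetical categories.
image = rng.standard_normal((12, 12))
filters = rng.standard_normal((4, 3, 3))
feature_maps = [max_pool(relu(convolve2d(image, f))) for f in filters]
features = np.concatenate([m.ravel() for m in feature_maps])  # flatten
W_fc = rng.standard_normal((10, features.size))               # readout weights
logits = W_fc @ features
print(logits.argmax())  # predicted category index
```

In practice, each convolutional layer has many input channels and learned filters, and libraries such as PyTorch or TensorFlow implement the same operations efficiently, but the composition of operations is exactly this.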
Figure 1. The relationship between components of the visual system and the base operations of a CNN. Hubel and Wiesel (1962) discovered that simple cells (left, blue) have preferred locations in the image (dashed ovals) wherein they respond most strongly to bars of particular orientation. Complex cells (green) receive input from many simple cells and thus have more spatially invariant responses. These operations are replicated in a CNN (right). The first convolutional layer (blue) is produced by applying a convolution operation to the image. Specifically, the application of a small filter (gray box) to every location in the image creates a feature map. A convolutional layer has as many feature maps as filters applied (only one shown here). Taking the maximal activation in a small section of each feature map (gray box) downsamples the image and leads to complex cell-like responses (green). This is known as a "max-pooling" operation, and the downsampled feature maps produced by it comprise a pooling layer.
Figure 2. Validating CNNs as a model by comparing their representations to those of the brain. A CNN trained to perform object recognition (bottom) contains several convolutional layers (Conv-1 and Conv-2 shown here, each followed by max-pooling layers Pool-1 and Pool-2) and ends with a fully connected layer with a number of units equal to the number of categories. When the same image is shown to a CNN and an animal (top), the response from each can be compared. Here, the activity of single units recorded from a visual area in the brain can be predicted from the activity of artificial units at a particular layer.
The first major demonstration of the power of CNNs came in 1989, when it was shown that a small CNN trained with supervision using the backpropagation algorithm could perform handwritten digit classification (LeCun et al., 1989). However, these networks did not really take off until 2012, when an eight-layer network (dubbed "AlexNet"; "Standard Architecture" in Figure 3) trained with backpropagation far exceeded state-of-the-art performance on the ImageNet challenge. The ImageNet data set is composed of over a million real-world images, and the challenge requires classifying an image into one of a thousand object categories. The success of this network demonstrated that the basic features of the visual system found by neuroscientists were indeed capable of supporting vision; they simply needed appropriate learning algorithms and data.

In the years since this demonstration, many different CNN architectures have been explored, with the main parameters varied including network depth, placement of pooling layers, number of feature maps per layer, training procedures, and whether or not residual connections that skip layers exist (Rawat & Wang, 2017). The goal of exploring these parameters in the computer vision community is to create a model that performs better on standard image benchmarks, with secondary goals of making networks that are smaller or train with less data. Correspondence with biology is not a driving factor.
VALIDATING CNNS AS A MODEL OF THE VISUAL SYSTEM
The architecture of a CNN has (by design) direct parallels to the architecture of the visual system. Images fed into these networks are usually first normalized and separated into three different color channels (red, green, blue), which captures certain computations done by the retina. Each stacked bundle of convolution, nonlinearity, and pooling can then be thought of as an approximation to a single visual area (usually the ones along the ventral stream, such as V1, V2, V4, and IT), each with its own retinotopy and feature maps. This stacking creates receptive fields for individual neurons that increase in size deeper in the network, and the features of the image that they respond to become more complex. As has been mentioned, when trained to, these architectures can take in an image and output a category label in agreement with human judgment.

All of these features make CNNs good candidates for models of the visual system. However, all of these features have been explicitly built into CNNs. To validate that CNNs are performing computations similar to those of the visual system, they should match the visual system in additional, nonengineered ways. That is, from the assumptions put into the model, further features of the data should fall out. Indeed, many further correspondences have been found at the neural and behavioral levels.
Comparison at the Neural Level
One of the major causes of the recent resurgence of interest in artificial neural networks among neuroscientists is the finding that they can recapitulate the representation of visual information along the ventral stream. In particular, when CNNs and animals are shown the same
Figure 3. Example architectures. The standard architecture shown here is similar to the original AlexNet, where a red/green/blue image is taken as input and the activity at the final layer determines the category label. In the second architecture, local recurrence is shown at each convolutional layer, and feedback recurrence goes from Layer 5 to Layer 3. This network would be trained with the backpropagation-through-time algorithm. The third network is an autoencoder. Here, the goal of the network is to recreate the input image. Crucially, the fully connected layer is of significantly lower dimensionality than the image, forcing the network to find a more compact representation of relevant image statistics. Finally, an architecture for reinforcement learning is shown. Here, the aim of the network is to turn an input image representing information about the current state into an action for the network to take. This architecture is augmented with a specific type of local recurrent unit known as an LSTM, which imbues the network with memory.
image (Figure 2), the activity of the artificial units can be used to predict the activity of real neurons, with accuracy beyond that of any previous methods.
One of the early studies to show this (Yamins et al., 2014), published in 2014, recorded extracellular activity in macaque during the viewing of complex object images. Regressing the activity of a real V4 or IT neuron onto the activity of the artificial units in a network (and cross-validating the predictive ability with a held-out test set), the authors found that networks that performed better on an object recognition task also better predicted neural activity (a relationship also found using video classification; Tacchetti, Isik, & Poggio, 2017). Furthermore, the activity of units from the last layer of the network best predicted IT activity, and the penultimate layer best predicted V4. This relationship between models and the brain, wherein later layers in the network better predict higher areas of the ventral stream, has been found in several other studies, including using human fMRI (Güçlü & van Gerven, 2015), MEG (Seeliger et al., 2018), and video instead of static images as the stimuli (Eickenberg, Gramfort, Varoquaux, & Thirion, 2017).
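A minimal version of this regression-with-held-out-validation analysis can be sketched as follows. Both the "CNN unit activity" and the "neuron" here are synthetic stand-ins (the neuron is simulated as a noisy linear readout of the units) so that the sketch is self-contained; the numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: activity of 100 units from one CNN layer in response to
# 500 images, and one recorded neuron's responses to the same images.
n_images, n_units = 500, 100
unit_activity = rng.standard_normal((n_images, n_units))
neuron_rates = (unit_activity @ rng.standard_normal(n_units)
                + 0.5 * rng.standard_normal(n_images))  # linear readout + noise

# Fit on some images, then evaluate predictions on held-out images.
train, test = slice(0, 400), slice(400, None)

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

w = ridge_fit(unit_activity[train], neuron_rates[train])
predicted = unit_activity[test] @ w

# Predictivity is typically summarized as the correlation (or explained
# variance) between predicted and measured responses on held-out images.
r = np.corrcoef(predicted, neuron_rates[test])[0, 1]
print(f"held-out prediction correlation: {r:.2f}")
```

In the actual studies, the regressors are activations from a particular network layer and the targets are recorded firing rates or voxel responses, with regularization strength chosen by cross-validation.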
Another method to check for a correspondence between different populations is representational similarity analysis (RSA; Kriegeskorte, Mur, & Bandettini, 2008). This method starts by creating a matrix for each population that represents how dissimilar the responses of that population are for every pair of images. This serves as a signature of the population's representational properties. The similarity between two different populations is then measured as the correlation between their dissimilarity matrices. This method was used in 2014 (Khaligh-Razavi & Kriegeskorte, 2014) to demonstrate that the later layers of an AlexNet network trained on ImageNet match multiple higher areas of the human visual system, along with monkey IT, better than previously used models.
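The core of RSA is short enough to sketch directly. The populations below are hypothetical: a "model layer" and a "brain area" that share an underlying five-dimensional image code (so their dissimilarity structures should agree), plus an unrelated control population:

```python
import numpy as np

rng = np.random.default_rng(2)

def rdm(responses):
    """Representational dissimilarity matrix: 1 - correlation between the
    population response vectors for every pair of images (rows = images)."""
    return 1.0 - np.corrcoef(responses)

def rsa_score(responses_a, responses_b):
    """Correlate the upper triangles of two RDMs: the RSA similarity."""
    rdm_a, rdm_b = rdm(responses_a), rdm(responses_b)
    iu = np.triu_indices_from(rdm_a, k=1)
    return float(np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1])

# Hypothetical populations responding to 50 images: a model layer with 300
# units and a brain area with 80 neurons, both driven by the same latent
# image code, plus a population with no shared structure.
n_images, latent_dim = 50, 5
latent = rng.standard_normal((n_images, latent_dim))   # shared image code
model_layer = latent @ rng.standard_normal((latent_dim, 300))
brain_area = latent @ rng.standard_normal((latent_dim, 80))
unrelated = rng.standard_normal((n_images, 80))        # control population

print(rsa_score(model_layer, brain_area))  # high: shared representation
print(rsa_score(model_layer, unrelated))   # near zero
```

Note that the RDM comparison never requires matching individual units to individual neurons, which is why RSA can bridge single units, voxels, and behavior so easily.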
Because dissimilarity matrices can be created from any kind of responses, including behavioral outputs, RSA is widely applicable as a means to compare data from different experimental methods and models. It is also a straightforward way to incorporate and compare full population responses, whereas regression techniques focus on a single neuron or voxel at a time. On the other hand, regression techniques allow for selective weighting of the model features most relevant for fitting the data, which may be more informative than the more "unsupervised" RSA approach. In all cases, details of the methodology and interpretation should be carefully considered (Kornblith, Norouzi, Lee, & Hinton, 2019; Thompson, Bengio, Formisano, & Schönwiesner, 2018).
Many of the studies comparing CNNs to biological data have highlighted their ability to explain later visual areas such as V4 and IT. This is a notable feat of CNNs, because the complex response properties of these areas have made them notoriously difficult to fit compared with primary visual cortex. Recent work, however, has shown that early-to-middle layers of task-trained CNNs can also predict V1 activity beyond the ability of more traditional V1 models (Cadena, Denfield, et al., 2019).

Beyond predicting neural activity or matching overall representations, CNNs can also be compared with neural data using more specific features traditionally used in systems neuroscience. Similarities have been found between units in the network and individual neurons in terms of response sparseness and size tuning, but differences have been found for object selectivity and orientation tuning (Tripp, 2017). Other studies have explored tuning to shape and categories (Zeman, Ritchie, Bracci, & Op de Beeck, 2019) and how responses change with changes in pose, location, and size (Murty & Arun, 2017, 2018; Hong, Yamins, Majaj, & DiCarlo, 2016).

Generally, the similarities with real neural activity that emerge from training a CNN to perform object recognition suggest that this architecture and task do indeed have some similarity to the architecture and purpose of the visual system.
Comparison at the Behavioral Level
Insofar as CNNs outperform any previous model of the visual system on real-world image classification, they can be considered a good match to human behavior. However, overall accuracy on the standard ImageNet task is only one measure of CNN behavior, and it is the one for which the network is explicitly optimized.

A deeper comparison can be made by looking at the errors these networks make. Although there is only one way to be correct, there are many ways humans and models can make mistakes in classification, and this can provide rich insight into their respective workings. Confusion matrices, for example, are used to represent how frequently images from one category are classified as belonging to another and can be compared between models and animal behavior. A large-scale study showed that several different deep CNN architectures show similar matches to human and monkey object classification, though the models are not predictive down to the image level (Rajalingham et al., 2018).

Other studies have explicitly asked participants to rate similarity between images and compared human judgment to model representations (King, Groen, Steel, Kravitz, & Baker, 2019; Jozwik, Kriegeskorte, Storrs, & Mur, 2017). In the study of Rosenfeld, Solbach, and Tsotsos (2018), for example, a large data set was taken from the Web site "Totally Looks Like." Although many similarity studies find good matches from CNNs, this study demonstrated more challenging elements of similarity that CNNs struggled to replicate.

In addition to similarity, other psychological concepts that have been tested in CNNs include typicality (Lake, Zaremba, Fergus, & Gureckis, 2015), Gestalt principles (Kim, Reif, Wattenberg, & Bengio, 2019), and animacy (Bracci, Ritchie, Kalfas, & Op de Beeck, 2019). In Jacob, Pramod, Katti, and Arun (2019), a battery of tests inspired
by findings in visual psychophysics was applied to CNNs, and CNNs were found to be similar to biological vision according to roughly half of them.

Another way to probe animal and CNN behavior is to make the classification task more challenging by degrading image quality. Several studies have added various types of noise, occlusion, or blur to standard images and observed a decrease in classification performance (Geirhos, Temme, et al., 2018; Tang et al., 2018; Geirhos et al., 2017; Wichmann et al., 2017; Ghodrati, Farzmahdi, Rajaei, Ebrahimpour, & Khaligh-Razavi, 2014). Importantly, this performance decrease is usually more severe in CNNs than it is in humans, suggesting biological vision has mechanisms for overcoming degradation. These points of mismatch are thus important to identify to steer future research (e.g., see the Alternative Data Sets section). A particular CNN architecture, known as a capsule network (Roy, Ghosh, Bhattacharya, & Pal, 2018), was shown to be more robust to degradation; however, it is not exactly clear how to relate the "capsules" in this architecture to parts of the visual system.

Another emerging finding in the behavioral analysis of CNNs is their reliance on texture. Although an argument has been made for CNNs as a model of human shape sensitivity (Kubilius, Bracci, & Op de Beeck, 2016), other studies have demonstrated that CNNs rely too much on texture and not enough on shape when classifying images (Baker, Lu, Erlikhman, & Kellman, 2018; Geirhos, Rubisch, et al., 2018).

Interestingly, it is possible for certain deep networks to outperform humans on some tasks (Kheradpisheh, Ghodrati, Ganjtabesh, & Masquelier, 2016). This highlights an important tension between the goals of computer vision and those of neuroscience: In the former, exceeding human performance is desirable; in the latter, it is still counted as a mismatch between the model and the data.
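The logic of the degradation experiments above can be caricatured with a toy nearest-prototype classifier; the categories, "images," and noise model are all stand-ins, but the accuracy-versus-noise curve is exactly the quantity these studies compare between CNNs and humans:

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy stand-in for a classifier: 5 hypothetical categories, each defined
# by a prototype vector, evaluated on inputs corrupted by additive Gaussian
# noise of increasing severity.
n_classes, dim, n_test = 5, 64, 100
prototypes = rng.standard_normal((n_classes, dim))

def classify(x):
    """Assign the class whose prototype is nearest in Euclidean distance."""
    return int(np.argmin(((prototypes - x) ** 2).sum(axis=1)))

def degrade(x, sigma):
    """Additive Gaussian pixel noise, one common form of image degradation."""
    return x + sigma * rng.standard_normal(x.shape)

# Test "images": noisy samples drawn around the class prototypes.
labels = rng.integers(0, n_classes, n_test)
clean = prototypes[labels] + 0.3 * rng.standard_normal((n_test, dim))

accs = []
for sigma in [0.0, 2.0, 6.0]:
    preds = np.array([classify(degrade(x, sigma)) for x in clean])
    accs.append(float(np.mean(preds == labels)))
    print(f"noise sigma={sigma}: accuracy={accs[-1]:.2f}")
```

The empirical finding is that CNN accuracy curves of this kind fall off much faster with degradation than human accuracy curves do, which is what motivates the robustness-focused training described later.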
Something to keep in mind when studying CNN behavior is that standard feedforward CNN architectures are believed to represent the very initial stages of visual processing, before various kinds of recurrent processing can take place. Therefore, when comparing the behavior of CNNs to animal behavior, fast stimulus presentation and backward masking are advised, as these are known to prevent many stages of recurrent processing (Tang et al., 2018).
Other Forms of Validation
Although neural and behavioral comparisons are the main methods by which CNNs are validated as models of the visual system, other approaches can provide further support.
Methods for visualizing the image features that drive units at different layers in the network (Figure 4; Olah, Mordvintsev, & Schubert, 2017; Zeiler & Fergus, 2014) have revealed preferred visual patterns that align with those found in neuroscience. For example, the first layer in a CNN has filters that look like Gabors, whereas later layers respond to partial object features and eventually fuller features such as faces. This supports the idea that the processing steps in CNNs match those in the ventral stream.
CNNs have also been used to produce optimal stimuli for real neurons. Starting from the above-mentioned procedure for predicting neural activity with CNNs, stimuli that were intended to maximally drive a neuron's firing rate were produced (Bashivan, Kar, & DiCarlo, 2019). The fact that the resulting stimuli were indeed effective at driving the neuron beyond its normal rates, despite being unnatural and unrelated to the images the network was trained on, further supports the notion that CNNs are capturing something fundamental about the visual processing stream.
Figure 4. Visualizing preferences of different network components. The general method for creating these images is shown at the top. Specifically, a component of the model is selected for optimization, and gradient calculations sent back through the network indicate how pixels in an initially noisy image should change to best activate the chosen component. Components can be individual units, whole feature maps, layers (using the DeepDream algorithm), or units representing a specific class in the final layer. Feature visualizations taken from Olah et al. (2017).
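The gradient-ascent loop behind these visualizations can be illustrated on a deliberately tiny stand-in model, a single linear filter followed by a ReLU, where the input gradient can be written by hand. Real feature visualization runs the same loop through an entire CNN using automatic differentiation, usually with extra regularizers to keep the image natural-looking:

```python
import numpy as np

rng = np.random.default_rng(4)

# A toy stand-in "network": one linear filter with a ReLU, acting on a
# flattened 64-pixel "image".
filter_w = rng.standard_normal(64)

def activation(x):
    return float(max(filter_w @ x, 0.0))

def grad_wrt_input(x):
    """d(activation)/dx: the filter itself wherever the ReLU is active."""
    return filter_w if filter_w @ x > 0 else np.zeros_like(filter_w)

# Start from a faint copy of the filter so the ReLU is active (a pure noise
# start also works, with some care around the ReLU's dead zone).
x = 0.01 * filter_w
start = activation(x)

lr = 0.1
for _ in range(100):
    x = x + lr * grad_wrt_input(x)       # gradient ascent on the activation
    x = x / max(np.linalg.norm(x), 1.0)  # keep "pixel" magnitudes bounded

print(activation(x) > start)  # the optimized input drives the unit harder
```

For this linear-filter unit, the optimized input simply converges to the filter itself, which is the one-layer analogue of the Gabor-like and object-part-like images recovered from trained CNNs.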
In a related vein, studies have also used CNNs to decode neural activity to recreate the stimuli being presented to participants (Shen, Horikawa, Majima, & Kamitani, 2019).
WHAT WE LEARN FROM VARYING THE MODEL
The above-mentioned studies have focused mainly on standard feedforward CNN architectures, trained via supervised learning to perform object recognition. Yet, with full control over these models, it is possible to explore many variations in data sets, architectures, and training procedures. By observing how these changes make the model a better or worse fit to data, we can gain insight into how and why certain features of biological visual processing exist.
Alternative Data Sets
The ImageNet data set has proven very useful for learning a set of basic visual features that can be adapted to many different tasks, in a way that previous smaller and simpler data sets such as MNIST and CIFAR-10 were not. However, it contains images where objects are the focus and, as a result, is best suited for studying object recognition pathways. To study scene processing areas such as the occipital place area, several studies have instead trained on scene images. In the study of Bonner and Epstein (2018), the responses of the occipital place area could be predicted using a network trained to classify scenes. What's more, the authors were able to relate the features captured by the network to navigational affordances in the scene. In the study of Cichy, Khosla, Pantazis, and Oliva (2017), scene-trained networks were able to capture representations of scene size collected using MEG. Such studies can in general help to identify the evolutionarily or developmentally determined roles of different brain areas, based on the extent to which their representations align with networks trained on different specialized data sets. For an open data set of fMRI responses to different image sets, see Chang et al. (2019).
Data sets have also been developed with the explicit intent of correcting ways in which CNNs do not match human behavior. For example, training with different image degradations can make networks more robust to those degradations (Geirhos, Temme, et al., 2018). However, this only works for the noise model specifically trained on and does not generalize to new noise types. In addition, in the study of Geirhos, Rubisch, et al. (2018), a data set that keeps abstract shapes but varies low-level textural elements was shown to decrease a CNN's texture bias. Whether animals owe their good visual performance to exposure to similarly diverse data during development or to built-in priors remains to be determined.
Alternative Architectures
A variety of purely feedforward architectures have been explored, both in the machine learning community and by neuroscientists. In comparison to data, AlexNet has been commonly used and performs well, but it can be beaten by somewhat deeper architectures such as VGG models and ResNets (Schrimpf et al., 2018; Wen, Shi, Chen, & Liu, 2018). With very deep networks, though they may perform better on image tasks, the relationship between layers in the network and areas in the brain may break down, making them a worse fit to the data. However, in some cases, the processing in very deep networks can be thought of as equivalent to the multiple stages of recurrent processing in the brain (Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019).

In primate vision, the first two processing stages, the retina and the lateral geniculate nucleus, contain cells that respond preferentially to on-center or off-center light patterns, and only in V1 does orientation tuning become dominant. Yet, Gabor filters are frequently learned at the first layer of a CNN, creating a V1-like representation. Using a modified architecture wherein the connections from early layers are constrained, as they are biologically by the optic nerve, creates a CNN with on- and off-center responses at early stages and orientation tuning only later (Lindsey, Ocko, Ganguli, & Deny, 2019). Thus, the pattern of selectivity seen in primate vision is potentially a consequence of anatomical constraints.

In the study of Kell, Yamins, Shook, Norman-Haignere, and McDermott (2018), various architectures were trained to perform a pair of auditory tasks (speech and music recognition). Through this, the authors found that the two tasks could share three layers of processing before the network needed to split into specialized streams to perform well on both tasks. Recently, a similar procedure was applied to visual tasks and used to explain why specialized pathways for face processing arise in the visual system (Dobs, Kell, Palmer, Cohen, & Kanwisher, 2019). A similar question was also approached by training a single network to perform two tasks and looking for emergent subnetworks in the trained network (Scholte, Losch, Ramakrishnan, de Haan, & Bohte, 2018). Such studies can explain the existence of dorsal and ventral streams and other details of visual architectures.

Taking further inspiration from biology, many studies have explored the beneficial role of both local and feedback recurrence (Figure 3). Local recurrence refers to horizontal connections within a single visual area. By adding these connections to CNNs, studies have found that recurrent connections make networks better at more challenging tasks (Hasani, Soleymani, & Aghajan, 2019; Montobbio, Bonnasse-Gahot, Citti, & Sarti, 2019; Spoerer, Kietzmann, & Kriegeskorte, 2019; Tang et al., 2018). These connections can also help make CNN representations a better match to neural data, particularly for challenging images and at later time points in the response (Kar et al., 2019; Kubilius et al., 2019; Rajaei, Mohsenzadeh, Ebrahimpour, & Khaligh-Razavi, 2019; Shi, Wen, Zhang, Han, & Liu, 2018; McIntosh, Maheswaranathan, Nayebi, Ganguli, & Baccus, 2016). These studies make a strong argument for the computational role of these local connections.
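A minimal unrolled sketch of local recurrence follows; the weights here are random (untrained) and the sizes arbitrary, purely to show how lateral connections let a layer's response evolve over time steps, as in the recurrent CNNs cited above (which learn these weights with backpropagation through time):

```python
import numpy as np

rng = np.random.default_rng(5)

def relu(x):
    return np.maximum(x, 0.0)

# A locally recurrent layer unrolled over time: its state is driven by a
# fixed feedforward input plus lateral connections from its own activity
# at the previous time step.
n_units, n_steps = 32, 5
W_ff = rng.standard_normal((n_units, 64)) / 8.0            # feedforward weights
W_lat = rng.standard_normal((n_units, n_units)) / n_units  # lateral weights
                                                           # (scaled for stability)

image_features = rng.standard_normal(64)
ff_drive = W_ff @ image_features  # constant bottom-up input for this "image"

h = np.zeros(n_units)
states = []
for t in range(n_steps):
    h = relu(ff_drive + W_lat @ h)  # lateral recurrence refines the state
    states.append(h.copy())

# The response changes across time steps, as in neural data, where late
# responses to challenging images are best explained by recurrent models.
print([round(float(np.linalg.norm(states[t] - states[t - 1])), 3)
       for t in range(1, n_steps)])
```

Feedback recurrence (Figure 3) has the same unrolled form, except the extra input to a layer comes from a higher layer's previous state rather than from the layer itself.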
Feedback connections go from frontal or parietal areas to regions in the visual system, or from higher visual areas back to lower areas (Wyatte, Jilk, & O'Reilly, 2014). Like horizontal recurrence, they are known to be common in biological vision. Connections from frontal and parietal areas are believed to implement goal-directed selective attention. Such feedback has been added to network models to implement cued detection tasks (Thorat, van Gerven, & Peelen, 2019; Wang, Zhang, Song, & Zhang, 2014). Feedback from higher visual areas back to lower ones is thought to implement more immediate and general image processing, such as denoising. Some studies have added these connections in addition to local recurrence and found that they can aid performance as well (Kim, Linsley, Thakkar, & Serre, 2019; Spoerer, McClure, & Kriegeskorte, 2017). A study comparing different feedback and feedforward architectures suggests that feedback can help in part by increasing the effective receptive field size of cells (Jarvers & Neumann, 2019).
Alternative Training Procedures
Supervised learning using backpropagation is the most common method of training CNNs; however, other methods have the potential to result in a good model of the visual system. The 2014 study that initially showed a correlation between performance on object recognition and the ability to capture neural responses (Yamins et al., 2014), for example, did not use backpropagation but rather a modular optimization procedure.

Unsupervised learning, wherein networks aim to capture relevant statistics of the input data rather than match inputs to outputs, can also be used to train neural networks (Fleming & Storrs, 2019; Figure 3). These methods may help identify a low-dimensional set of features that underlie high-dimensional visual inputs and thus allow an animal to make better sense of the world and possibly build useful causal models (Lake, Ullman, Tenenbaum, & Gershman, 2017). Furthermore, because of the large number of labeled data points required for supervised training, it is assumed that the brain must make use of unsupervised learning. However, as of yet, unsupervised methods do not produce models that capture neural representations as well as supervised methods do. Behaviorally, a generative model was shown to perform much worse than supervised models at capturing human image categorization (Peterson, Abbott, & Griffiths, 2018). In addition, a model trained based on the concept of predictive coding (Lotter, Kreiman, & Cox, 2016) was able to predict object movement and replicate motion illusions (Watanabe, Kitaoka, Sakamoto, Yasugi, & Tanaka, 2018). Overall, the limits and benefits of unsupervised training for vision models need further exploration.

Interestingly, a recent study found a class-conditioned generative model to be in better agreement with human judgment on controversial stimuli than models trained to perform classification (Golan, Raju, & Kriegeskorte, 2019). Although this generative model still relied on class labels for training and was thus not unsupervised, its ability to replicate aspects of human perception supports the notion that the visual system aims in part to capture a distribution of the visual world rather than merely funnel images into object categories.
A compromise between unsupervised and supervised methods is "semisupervised" learning, which has recently been explored as a means of making more biologically realistic networks (Zhuang, Yan, Nayebi, & Yamins, 2019; Zhuang, Zhai, & Yamins, 2019).
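The unsupervised objective sketched in Figure 3, the autoencoder, can be illustrated with a minimal linear bottleneck network trained by hand-written gradient descent. All sizes and learning settings here are arbitrary choices for the sketch; real autoencoders are deep, nonlinear, and trained with autodiff:

```python
import numpy as np

rng = np.random.default_rng(6)

# A linear autoencoder: compress inputs through a narrow bottleneck, then
# reconstruct them. The training signal is the input itself (no labels).
n_samples, input_dim, bottleneck = 200, 20, 3

# Synthetic data with genuine low-dimensional structure for the bottleneck
# to discover (3 latent factors mixed into 20 observed dimensions).
latent = rng.standard_normal((n_samples, bottleneck))
data = latent @ rng.standard_normal((bottleneck, input_dim))

W_enc = 0.1 * rng.standard_normal((input_dim, bottleneck))
W_dec = 0.1 * rng.standard_normal((bottleneck, input_dim))
lr = 0.01

def mse(x):
    """Mean squared reconstruction error under the current weights."""
    recon = (x @ W_enc) @ W_dec
    return float(np.mean((recon - x) ** 2))

loss_start = mse(data)
for _ in range(1000):
    code = data @ W_enc                # encode to the bottleneck
    recon = code @ W_dec               # decode back to input space
    err = recon - data                 # reconstruction error
    grad_dec = code.T @ err / n_samples
    grad_enc = data.T @ (err @ W_dec.T) / n_samples
    W_dec -= lr * grad_dec             # gradient descent on both weight sets
    W_enc -= lr * grad_enc

print(loss_start, mse(data))  # reconstruction error drops with training
```

Because the data are exactly rank 3, the 3-unit bottleneck can recover them almost perfectly; with real images, the bottleneck instead learns a compact code of image statistics, which is the representation of interest for comparison to the brain.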
Reinforcement learning is the third major class of training in machine learning. In these systems, an artificial agent must learn to produce actions in response to information from the environment, including rewards. Several such artificial systems have used convolutional architectures on the front end to process visual information about the world (Figure 3; Merel et al., 2018; Paine et al., 2018; Zhu et al., 2018). It would be interesting to compare the representations learned in the context of these models to those trained by other mechanisms, as well as to data.
A simple way to understand the importance of training for a network is to compare it to a network with the same architecture but random weights. In Kim, Reif, et al. (2019), the ability of a network to perform perceptual closure existed only in networks that had been trained with natural images, not in random networks.
The above methods rely on training data, such as a set of images, that do not come from neural recordings. However, it is possible to train these architectures to replicate neural activity directly. Doing so can help identify the features of the input image most responsible for a neuron's firing (Cadena, Denfield, et al., 2019; Kindel, Christensen, & Zylberberg, 2019; Sinz et al., 2018). Ideally, the components of the model can also be related back to anatomical features of the circuit in question (Günthner et al., 2019; Maheswaranathan et al., 2018), as is done when networks are trained on classification tasks.
A final hybrid option is to train on a classification task while also using neural data to constrain the intermediate representations to be brain-like (Fong, Scheirer, & Cox, 2018), resulting in a network that better matches neural data and can perform the task.
HOW TO UNDERSTAND CNNS
Varying a network's structure and training is one way to explore how it functions. Additionally, trained networks can be probed directly using techniques normally applied to brains or those only available in models (Samek & Müller, 2019). In either case, if we believe CNNs have been validated as a model of the visual system, any insight gained from probing how they work may apply to biological vision as well.
Empirical Methods
The tools of the standard neuroscientific toolbox, such as lesions, recordings, anatomical tracings, stimulations, silencing, and
Lindsay 7
so forth, are all readily available in artificial neural networks (Barrett, Morcos, & Macke, 2019) and can be used to answer questions about the workings of these networks.
“Ablating” individual units in a CNN, for example, can have an impact on classification accuracy; however, the impact of ablating a particular unit does not have a strong relationship with the unit's selectivity properties (Morcos, Barrett, Rabinowitz, & Botvinick, 2018; Zhou, Sun, Bau, & Torralba, 2018).
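The mechanics of such an ablation study can be sketched in a few lines. The toy network, weights, and inputs below are all hypothetical (random and untrained, unlike the trained networks in the cited studies); the point is only to show how zeroing one hidden unit at a time yields a per-unit measure of functional impact, here scored as agreement with the intact network's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: 8 inputs -> 16 hidden (ReLU) -> 3 classes.
W1 = rng.normal(0, 0.5, (16, 8))
W2 = rng.normal(0, 0.5, (3, 16))

def forward(x, ablate=None):
    """Run the network; optionally zero ('ablate') one hidden unit."""
    h = np.maximum(0, W1 @ x)          # hidden-layer activations
    if ablate is not None:
        h[ablate] = 0.0                # lesion the chosen unit
    return np.argmax(W2 @ h)           # predicted class

# A fixed set of random test inputs and the intact network's labels.
X = rng.normal(0, 1, (200, 8))
baseline = np.array([forward(x) for x in X])

# Agreement with the intact network after ablating each unit in turn.
impact = []
for unit in range(16):
    preds = np.array([forward(x, ablate=unit) for x in X])
    impact.append(np.mean(preds == baseline))
```

Units whose ablation leaves agreement near 1.0 contribute little to the output; as the cited work shows for real networks, this impact measure need not track a unit's selectivity.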
In the study of Bonner and Epstein (2018), images were manipulated or occluded to determine which features were responsible for the CNN's response to scenes, akin to how the function of real neurons is explored.
Other studies (Lindsey et al., 2019; Spoerer et al., 2019) have analyzed the connectivity properties of trained networks to see if they replicate features of visual cortical anatomy.
One conceptual framework to describe what the stages of visual processing are doing is that of “untangling.” High-level concepts that are entangled in the pixel or retinal representation get pulled apart to form easily separable clusters in later representations. This theory has been developed using biological data (DiCarlo & Cox, 2007); recently, however, techniques for describing the geometry of these clusters (or manifolds) have been developed and used to understand the untangling process in deep neural networks (Cohen, Chung, Lee, & Sompolinsky, 2019; Chung, Lee, & Sompolinsky, 2018). This work highlights the features of these manifolds relevant for classification and how they change through learning and processing stages. This can help identify which response features to look for in data.
Mathematical Analyses
Given our full access to the equations that define a CNN, many mathematical techniques not currently applicable to brains can be performed. One common tool is the calculation of gradients. Gradients indicate how certain components in the network affect others, potentially far away. They are used to train these networks by determining how a weight at one layer can be changed to decrease error at the output. However, they can also be used to visualize the preferred features of a unit in a network. In that case, the gradient calculation goes all the way through to the input image, where it can be determined how individual pixels should change to increase the activity of a particular unit (Olah et al., 2017; Figure 4). Multiple variants of feature visualization have also been used to probe neural network functions (Nguyen, Yosinski, & Clune, 2019), for example, by showing which invariances exist at different layers (Cadena, Weis, Gatys, Bethge, & Ecker, 2018).
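A minimal sketch of this gradient-based feature visualization, using a hypothetical one-layer model (not any network from the cited papers) so the input gradient can be written analytically rather than via automatic differentiation: repeated gradient ascent on the pixels drives up the chosen unit's activation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model: 10 ReLU units reading a flattened 64-pixel "image".
W = rng.normal(0, 0.1, (10, 64))

def activation(x, unit):
    return max(0.0, W[unit] @ x)

def grad_wrt_input(x, unit):
    """d(activation)/d(pixels): the unit's weights where the ReLU is active."""
    return W[unit] if W[unit] @ x > 0 else np.zeros_like(x)

unit = 3
x = 0.01 * W[unit]                 # seed slightly toward the unit's weights
before = activation(x, unit)
for _ in range(100):               # gradient ascent on the input image
    x += 0.1 * grad_wrt_input(x, unit)
after = activation(x, unit)        # activation grows along the preferred feature
```

In a deep network the same idea applies, but the gradient is backpropagated through all layers to the pixels, and regularizers are typically added to keep the resulting image natural-looking.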
Gradients can also be used to determine the role an artificial neuron plays in a classification task. In the study of Lindsay and Miller (2018), it was found that the role a unit plays in classifying an image as being of a certain category is not tightly correlated with how strongly it responds to images of that category. Much like the ablation study above, this demonstrates a dissociation between tuning and function that should give neuroscientists pause about how informative a tuning-based analysis of the brain is.

Machine learning researchers and mathematicians also
aim to cast CNNs in a light that makes them more amenable to traditional mathematical analysis. The concepts of information theory (Tishby & Zaslavsky, 2015) and wavelet scattering, for example, have been used toward this end (Mallat, 2016). Another fruitful approach has been to study deep linear neural networks, as this approximation makes more analyses possible (Saxe, McClelland, & Ganguli, 2013).

Several studies in computational neuroscience have
trained simple recurrent neural networks to perform tasks and then interrogated the properties of the trained networks for clues as to how they work (Barak, 2017; Foerster, Gilmer, Sohl-Dickstein, Chorowski, & Sussillo, 2017; Sussillo & Barak, 2013). Going forward, more sophisticated techniques for understanding the learned representations and connectivity patterns of CNNs should be developed. This can both provide insight into how these networks work and indicate which experimental techniques and data analysis methods would be fruitful to pursue.
Are They Understandable?
For some researchers, CNNs represent the unavoidable trade-off between complexity and interpretability. To have a model complex enough to perform real-world tasks, we must sacrifice the desire to make simple statements about how each stage of it works, a goal inherent in much of systems neuroscience. For this reason, an alternative way to describe the network, compact without relying on simple statements about computations, has been proposed (Lillicrap & Kording, 2019; Richards et al., 2019). This viewpoint focuses on describing the architecture, optimization function, and learning algorithm of the network, rather than attempting to describe specific computations, because specific computations and representations can be seen as simply emerging from these three factors.

It is true that the historical aim of language-based descriptions of the roles of individual neurons or groups of neurons (e.g., “tuned to orientation” or “face detectors”) seems woefully incomplete as a way to capture the essential computations of CNNs. Yet it also seems that there are still more compact ways to describe the functioning of these networks and that finding these simpler descriptions could provide a further sense of understanding. Only certain sets of weights, for example, allow a network to perform well on real-world classification tasks. As of yet, the main thing we know about these “good” weights is that they can be found through optimization procedures. But are there essential properties we could
8 Journal of Cognitive Neuroscience Volume X, Number Y
identify that would result in a more condensed description of the network? The “lottery ticket” method of training can produce networks that work as well as dense networks using only a fraction of the weights, and one study noted that as many as 95% of model weights could be predicted from the remaining 5% (Frankle & Carbin, 2018; Denil, Shakibi, Dinh, Ranzato, & de Freitas, 2013). Such findings suggest that more condensed (and thus potentially more understandable) descriptions of the computations of high-performing networks are possible.

A similar analysis of architectures could be done to
determine which broad features of connectivity are sufficient to create good performance. Recent work using random weights, for example, helps to isolate the role of architecture versus specific weight values (Gaier & Ha, 2019). A more compact description of the essential features of a high-performing network is a goal for both machine learning and neuroscience.

One undeniably noncompact component of deep
learning is the data set. As has been mentioned, large real-world data sets are required to match the performance and neural representations of biological vision. Are we doomed to carry around the full ImageNet data set as part of our description of how vision works? Or can a set of sufficient statistics be defined? Natural scene statistics have historically been a large part of both computational neuroscience and computer vision (Simoncelli & Olshausen, 2001). Although much of that work has focused on lower order correlations that are insufficient to capture the relevant features for object recognition, the time seems ripe for explorations of higher order image statistics, particularly as advances in generative modeling (Salakhutdinov, 2015) point to the ability to condense full and complex features of natural images into a model.

In any case, nearly all the critiques against the interpretability of CNNs could apply equally to biological networks. CNNs therefore make a good testing ground for deciding what understanding looks like in neural systems.
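To make the sparse-weights observation above concrete, here is a toy sketch of magnitude pruning on a random matrix. This illustrates only the pruning step, not the full lottery-ticket retraining procedure; the matrix, keep fraction, and thresholding scheme are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(0, 1, (32, 32))          # a dense weight matrix

def prune(weights, keep_fraction):
    """Zero all but the largest-magnitude weights (magnitude pruning)."""
    k = int(weights.size * keep_fraction)
    threshold = np.sort(np.abs(weights), axis=None)[-k]
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

sparse_W, mask = prune(W, keep_fraction=0.1)

# Only ~10% of weights survive, yet they carry much of the matrix norm.
survivors = mask.mean()
norm_kept = np.linalg.norm(sparse_W) / np.linalg.norm(W)
```

A condensed description of a network might then consist of the surviving weights plus a rule for regenerating (or ignoring) the rest, rather than the full dense matrix.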
BEYOND THE BASICS
Since 2014, an explosion of research has answered many questions about how varying different features of a CNN can change its properties and its ability to match data. These findings, in turn, have aided progress in understanding both the “how” and the “why” of biological vision.
Exploring Cognitive Tasks
The relationship between image encoding and memorability was explored in Jaegle et al. (2019). The authors showed that the overall magnitude of the response of later layers of a CNN correlated with how memorable images were found to be experimentally.
In Devereux, Clarke, and Tyler (2018), an attractor network based on semantic features was added to the end of a CNN architecture. This additional processing stage was able to account for perirhinal cortical activity during a semantic task.
Visual attention is known to enhance performance on challenging visual tasks. Applying the neuromodulatory effects of attention to the units in a CNN was shown to increase performance in these networks as well, more so when applied at later rather than earlier layers (Lindsay & Miller, 2018). This use of task-performing models has also led to better theories of how attention can work than those stemming solely from neural data (Thorat et al., 2019).
Finally, CNNs were also able to recapitulate several behavioral and neural effects of fine-grained perceptual learning (Wenliang & Seitz, 2018).
Adding Biological Details
Some of the architectural variants described above, particularly the addition of local and feedback recurrence, are biologically inspired and assumed to enhance the ability of the network to perform challenging tasks. Other brain-inspired details have been explored by the machine learning community in the hopes that these additions would be useful for difficult tasks. This includes foveation and saccading (Mnih, Heess, & Graves, 2014) for image classification.
Another reason to add biological detail would be out of a belief that it may make the network “worse.” The fact that CNNs lack many biological details does not necessarily make them a poor model of the visual system; it simply makes them an abstract one. However, an ultimate aim should be to bring abstract and detailed models together, showing how the high-level computations of a CNN are implemented using the machinery available to the brain. To this end, work has been done on spiking CNNs (Cao, Chen, & Khosla, 2015) and CNNs with stochastic noise added (McIntosh et al., 2016). These attempts can also identify aspects of these biological details that are useful for computation.
The long history of modeling the circuitry of visual cortex can provide more ideas about what details to incorporate and how. A preexisting circuit model of V1 anatomy and function (Rubin, Van Hooser, & Miller, 2015), for example, was placed into the architecture of a CNN and used to replicate effects of visual attention (Lindsay, Rubin, & Miller, 2019). A more extreme approach to adding biological detail can be found in the study of Tschopp, Reiser, and Turaga (2018), where the connectome of the fly visual system defined the architecture of the model, which was then trained to perform visual tasks. Beyond this, having access to an “image computable” model of the visual system opens the door to many more modeling opportunities that can explore more than just core vision (Figure 5).
LIMITATIONS AND FUTURE DIRECTIONS
As with any model, the current limitations and flaws of CNNs should point the way to future research directions that will bring these models more in line with biology.
The basic structure of a CNN assumes weight sharing. That is, a feature map is the result of the exact same filter weights applied at each location in the layer below. Although selectivity to visual features like orientation can appear all over the retinotopic map, it is clear that this is not the result of any explicit weight sharing. Either genetic programming ensures the same features are detected throughout space, or this property is learned through exposure. Studies on “translation tolerance” have shown the latter may be true (Dill & Fahle, 1998). Weight sharing makes CNNs easier to train; ideally, however, the same results could be found using a more biologically plausible way of fitting filters.
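As a concrete illustration of what weight sharing buys, the hypothetical 1-D example below applies one 3-weight filter at every position of a small input, and compares its parameter count to an unshared (locally connected) version of the same map. The signal and kernel values are arbitrary.

```python
import numpy as np

# Weight sharing: one 3-pixel filter applied at every position of a
# 1-D "retina", versus an unshared (locally connected) feature map.
signal = np.array([0., 1., 2., 3., 2., 1., 0., 1.])
kernel = np.array([1., 0., -1.])        # a single shared edge detector

def conv1d_valid(x, k):
    """Slide the same weights across every input location."""
    return np.array([x[i:i + k.size] @ k for i in range(x.size - k.size + 1)])

feature_map = conv1d_valid(signal, kernel)

# Parameter count: one shared filter vs one independent filter per location.
shared_params = kernel.size                       # 3
unshared_params = kernel.size * feature_map.size  # 18
```

The biological question is how cortex could arrive at the shared case: with no mechanism to copy weights across locations, each position would have to learn (or be genetically given) a filter like the others independently.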
Furthermore, in most CNNs, Dale's law is not respected. That is, the same neuron can be weighted by both inhibitory (negative) and excitatory (positive) weights. In the visual system, connections between areas tend to come only from excitatory cells. To be consistent with biology, a negative feedforward weight could be interpreted as an excitatory feedforward connection that acts on local inhibitory neurons. But this relationship between excitatory feedforward connections and the need for local inhibitory recurrence points to a complication of adding biological details to these networks: Some of these biological details may only function well or make sense in light of others and thus need to be added together.
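This reinterpretation has a simple algebraic form: any signed weight matrix can be split into two nonnegative pathways, a direct excitatory one and an excitatory projection onto local inhibitory neurons whose effect is subtractive. A toy sketch with random, hypothetical weights:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(0, 1, (4, 6))   # signed feedforward weights, as in a CNN

# Split into two nonnegative pathways: direct excitation, and excitation
# of local inhibitory neurons that then subtract from the target.
W_exc = np.maximum(W, 0)       # excitatory feedforward connections
W_inh = np.maximum(-W, 0)      # excitatory drive onto local inhibition

x = rng.uniform(0, 1, 6)       # nonnegative presynaptic firing rates
direct = W @ x
via_dale = W_exc @ x - W_inh @ x   # same net effect, sign-segregated
```

The net computation is identical, which is why the abstraction is harmless on its own; the complication noted above arises because the inhibitory pathway implies local recurrent circuitry with its own dynamics.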
To some, the way in which these networks are trained also poses a problem. The backpropagation algorithm is not considered biologically plausible enough to be an approximation of how the visual system actually learns. However, most methods for model fitting in computational neuroscience do not intend to mimic biological learning, and backpropagation could be thought of as just another parameter-fitting technique. That being said, several researchers are investigating means by which the brain could perform something like backpropagation (Whittington & Bogacz, 2019; Bartunov et al., 2018; Sacramento, Costa, Bengio, & Senn, 2018; Roelfsema, van Ooyen, & Watanabe, 2010). Comparing models trained using more biologically plausible techniques to standard supervised learning (as well as to the unsupervised and reinforcement learning approaches discussed above) could offer insights into the role of learning in determining representations.

The vast majority of studies comparing CNNs to biological vision have used data from humans or nonhuman primates. Where attempts have been made to compare CNNs to one of the most commonly used animal groups in neuroscience research, rodents, results are not nearly as strong as they are for primates (Cadena, Sinz, et al., 2019; de Vries et al., 2019; Matteucci, Marotti, Riggi, Rosselli, & Zoccolan, 2019). Understanding what can turn CNNs into a good model of rodent vision would go a long way toward understanding the difference between primate and rodent vision and would open rodent vision up to the exploration tactics described here. CNNs have also been compared with the behavioral patterns of pigeons on a classification task (Love, Guest, Slomka, Navarro, & Wasserman, 2017).

Even in the context of primate vision, simple object or scene classification tasks represent only a small fraction of what visual systems are capable of and used for naturally. More ethologically relevant and embodied tasks, such as navigation, object manipulation, and visual reasoning, may be needed to capture the full diversity of visual processing and its relation to other brain areas. Early versions of this idea are already being explored (Cichy, Kriegeskorte, Jozwik, van den Bosch, & Charest, 2019; Dwivedi & Roig, 2018). The study of insect vision has historically taken this more holistic approach and may make for useful inspiration (Turner, Giraldo, Schwartz, & Rieke, 2019).
Conclusions
The story of CNNs started with a study on the tuning properties of individual neurons in primary visual cortex. Yet, one of the impacts of using CNNs to study the visual system has been to push the field away from focusing on interpretable responses of single neurons and toward population-level descriptions of how visual information is represented and transformed to perform visual tasks. The shift toward models that actually “do” something has forced a reshaping of the questions around the study of the visual system. Neuroscientists are adapting to this new style of explanation and the different expectations that come with it (Hasson, Nastase, & Goldstein, 2019).

Importantly, these models have also made it possible
to reach some of the preexisting goals in the study of
Figure 5. A sampling of the many uses of visual information by the brain. Much of the use of CNNs as a model of visual processing has focused on the ventral visual stream. However, the visual system interacts with many other parts of the brain to achieve many different goals. Future models should work to integrate CNNs into this larger picture.
vision. In 2007 (Pinto, Cox, & DiCarlo, 2008), for example, a perspective piece on the study of object recognition claimed that “progress in understanding the brain's solution to object recognition requires the construction of artificial recognition systems that ultimately aim to emulate our own visual abilities, often with biological inspiration” and that “instantiation of a working recognition system represents a particularly effective measure of success in understanding object recognition.” In this way, CNNs as a model of the visual system are a success.

Of course, nothing can be learned about biological vision through CNNs in isolation, but rather only through iteration. The insights gained from experimenting with CNNs should shape future experiments in the lab, which, in turn, should inform the next generation of models.
Acknowledgments
We thank SciDraw.io for providing brain and neuron drawings. Feature visualizations in Figure 4 taken from Olah et al. (2017) are licensed under Creative Commons Attribution CC-BY 4.0. This work was supported by a Marie Skłodowska-Curie Individual Fellowship and a Sainsbury Wellcome Centre/Gatsby Computational Unit Research Fellowship.
Reprint requests should be sent to Grace W. Lindsay, Gatsby Computational Unit/Sainsbury Wellcome Centre, University College London, 25 Howland St., London W1T 4JG, United Kingdom, or via e-mail: [email protected].
REFERENCES
Baker, N., Lu, H., Erlikhman, G., & Kellman, P. J. (2018). Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 14, e1006613. https://doi.org/10.1371/journal.pcbi.1006613. PMID: 30532273. PMCID: PMC6306249.
Barak, O. (2017). Recurrent neural networks as versatile tools of neuroscience research. Current Opinion in Neurobiology, 46, 1–6. https://doi.org/10.1016/j.conb.2017.06.003. PMID: 28668365.
Barrett, D. G. T., Morcos, A. S., & Macke, J. H. (2019). Analyzing biological and artificial neural networks: Challenges with opportunities for synergy? Current Opinion in Neurobiology, 55, 55–64. https://doi.org/10.1016/j.conb.2019.01.007. PMID: 30785004.
Bartunov, S., Santoro, A., Richards, B. A., Marris, L., Hinton, G. E., & Lillicrap, T. P. (2018). Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in neural information processing systems (pp. 9390–9400). Montreal, Canada.
Bashivan, P., Kar, K., & DiCarlo, J. J. (2019). Neural population control via deep image synthesis. Science, 364, eaav9436. https://doi.org/10.1126/science.aav9436. PMID: 31048462.
Bonner, M. F., & Epstein, R. A. (2018). Computational mechanisms underlying cortical responses to the affordance properties of visual scenes. PLoS Computational Biology, 14, e1006111. https://doi.org/10.1371/journal.pcbi.1006111. PMID: 29684011. PMCID: PMC5933806.
Bracci, S., Ritchie, J. B., Kalfas, I., & Op de Beeck, H. P. (2019). The ventral visual pathway represents animal appearance over animacy, unlike human behavior and deep neural networks. Journal of Neuroscience, 39, 6513–6525. https://doi.org/10.1523/JNEUROSCI.1714-18.2019. PMID: 31196934. PMCID: PMC6697402.
Cadena, S. A., Denfield, G. H., Walker, E. Y., Gatys, L. A., Tolias, A. S., Bethge, M., et al. (2019). Deep convolutional models improve predictions of macaque V1 responses to natural images. PLoS Computational Biology, 15, e1006897. https://doi.org/10.1371/journal.pcbi.1006897. PMID: 31013278. PMCID: PMC6499433.
Cadena, S. A., Sinz, F. H., Muhammad, T., Froudarakis, E., Cobos, E., Walker, E. Y., et al. (2019). How well do deep neural networks trained on object recognition characterize the mouse visual system? In NeurIPS Workshop on Real Neurons and Hidden Units. Vancouver, Canada.
Cadena, S. A., Weis, M. A., Gatys, L. A., Bethge, M., & Ecker, A. S. (2018). Diverse feature visualizations reveal invariances in early layers of deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 217–232). Munich, Germany. https://doi.org/10.1007/978-3-030-01258-8_14.
Cadieu, C., Kouh, M., Pasupathy, A., Connor, C. E., Riesenhuber, M., & Poggio, T. (2007). A model of V4 shape selectivity and invariance. Journal of Neurophysiology, 98, 1733–1750. https://doi.org/10.1152/jn.01265.2006. PMID: 17596412.
Cao, Y., Chen, Y., & Khosla, D. (2015). Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113, 54–66. https://doi.org/10.1007/s11263-014-0788-3.
Chang, N., Pyles, J. A., Marcus, A., Gupta, A., Tarr, M. J., & Aminoff, E. M. (2019). BOLD5000, a public fMRI dataset while viewing 5000 visual images. Scientific Data, 6, 49. https://doi.org/10.1038/s41597-019-0052-3. PMID: 31061383. PMCID: PMC6502931.
Chung, S., Lee, D. D., & Sompolinsky, H. (2018). Classification and geometry of general perceptual manifolds. Physical Review X, 8, 031003. https://doi.org/10.1103/PhysRevX.8.031003.
Cichy, R. M., Khosla, A., Pantazis, D., & Oliva, A. (2017). Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks. Neuroimage, 153, 346–358. https://doi.org/10.1016/j.neuroimage.2016.03.063. PMID: 27039703. PMCID: PMC5542416.
Cichy, R. M., Kriegeskorte, N., Jozwik, K. M., van den Bosch, J. J. F., & Charest, I. (2019). The spatiotemporal neural dynamics underlying perceived similarity for real-world objects. Neuroimage, 194, 12–24. https://doi.org/10.1016/j.neuroimage.2019.03.031. PMID: 30894333. PMCID: PMC6547050.
Cohen, U., Chung, S., Lee, D. D., & Sompolinsky, H. (2019). Separability and geometry of object manifolds in deep neural networks. bioRxiv, 644658. https://doi.org/10.1101/644658.
Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., & de Freitas, N. (2013). Predicting parameters in deep learning. In Advances in neural information processing systems (pp. 2148–2156). Lake Tahoe, NV.
Devereux, B. J., Clarke, A., & Tyler, L. K. (2018). Integrated deep visual and semantic attractor neural networks predict fMRI pattern-information along the ventral object processing pathway. Scientific Reports, 8, 10636. https://doi.org/10.1038/s41598-018-28865-1. PMID: 30006530. PMCID: PMC6045572.
de Vries, S. E. J., Lecoq, J. A., Buice, M. A., Groblewski, P. A., Ocker, G. K., Oliver, M., et al. (2019). A large-scale, standardized physiological survey reveals functional organization of the mouse visual cortex. Nature Neuroscience, 23, 138–151. https://doi.org/10.1038/s41593-019-0550-9. PMID: 31844315. PMCID: PMC6948932.
DiCarlo, J. J., & Cox, D. D. (2007). Untangling invariant object recognition. Trends in Cognitive Sciences, 11, 333–341. https://doi.org/10.1016/j.tics.2007.06.010. PMID: 17631409.
Dill, M., & Fahle, M. (1998). Limited translation invariance of human visual pattern recognition. Perception & Psychophysics, 60, 65–81. https://doi.org/10.3758/BF03211918. PMID: 9503912.
Dobs, K., Kell, A. J. E., Palmer, I., Cohen, M., & Kanwisher, N. (2019). Why are face and object processing segregated in the human brain? Testing computational hypotheses with deep convolutional neural networks. Oral presentation at Cognitive Computational Neuroscience Conference, Berlin, Germany. https://doi.org/10.32470/CCN.2019.1405-0.
Dwivedi, K., & Roig, G. (2018). Task-specific vision models explain task-specific areas of visual cortex. bioRxiv, 402735. https://doi.org/10.1101/402735.
Eickenberg, M., Gramfort, A., Varoquaux, G., & Thirion, B. (2017). Seeing it all: Convolutional network layers map the function of the human visual system. Neuroimage, 152, 184–194. https://doi.org/10.1016/j.neuroimage.2016.10.001. PMID: 27777172.
Fleming, R. W., & Storrs, K. R. (2019). Learning to see stuff. Current Opinion in Behavioral Sciences, 30, 100–108. https://doi.org/10.1016/j.cobeha.2019.07.004. PMID: 31886321. PMCID: PMC6919301.
Foerster, J. N., Gilmer, J., Sohl-Dickstein, J., Chorowski, J., & Sussillo, D. (2017). Input switched affine networks: An RNN architecture designed for interpretability. In Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 1136–1145). Sydney, Australia.
Fong, R. C., Scheirer, W. J., & Cox, D. D. (2018). Using human brain activity to guide machine learning. Scientific Reports, 8, 5397. https://doi.org/10.1038/s41598-018-23618-6. PMID: 29599461. PMCID: PMC5876362.
Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. https://doi.org/10.1007/BF00344251. PMID: 7370364.
Gaier, A., & Ha, D. (2019). Weight agnostic neural networks. arXiv preprint arXiv:1906.04358.
Geirhos, R., Janssen, D. H., Schütt, H. H., Rauber, J., Bethge, M., & Wichmann, F. A. (2017). Comparing deep neural networks against humans: Object recognition when the signal gets weaker. arXiv preprint arXiv:1706.06969.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231.
Geirhos, R., Temme, C. R., Rauber, J., Schütt, H. H., Bethge, M., & Wichmann, F. A. (2018). Generalisation in humans and deep neural networks. In Advances in neural information processing systems (pp. 7538–7550).
Ghodrati, M., Farzmahdi, A., Rajaei, K., Ebrahimpour, R., & Khaligh-Razavi, S.-M. (2014). Feedforward object-vision models only tolerate small image variations compared to human. Frontiers in Computational Neuroscience, 8, 74. https://doi.org/10.3389/fncom.2014.00074. PMID: 25100986. PMCID: PMC4103258.
Golan, T., Raju, P. C., & Kriegeskorte, N. (2019). Controversial stimuli: Pitting neural networks against each other as models of human recognition. arXiv preprint arXiv:1911.09288.
Güçlü, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35, 10005–10014. https://doi.org/10.1523/JNEUROSCI.5023-14.2015. PMID: 26157000. PMCID: PMC6605414.
Günthner, M. F., Cadena, S. A., Denfield, G. H., Walker, E. Y., Tolias, A. S., Bethge, M., et al. (2019). Learning divisive normalization in primary visual cortex. bioRxiv, 767285. https://doi.org/10.32470/CCN.2019.1211-0.
Hasani, H., Soleymani, M., & Aghajan, H. (2019). Surround modulation: A bio-inspired connectivity structure for convolutional neural networks. In Advances in neural information processing systems (pp. 15877–15888). Vancouver, Canada.
Hasson, U., Nastase, S. A., & Goldstein, A. (2019). Robust-fit to nature: An evolutionary perspective on biological (and artificial) neural networks. bioRxiv, 764258. https://doi.org/10.1101/764258.
Hong, H., Yamins, D. L. K., Majaj, N. J., & DiCarlo, J. J. (2016). Explicit information for category-orthogonal object properties increases along the ventral stream. Nature Neuroscience, 19, 613–622. https://doi.org/10.1038/nn.4247. PMID: 26900926.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154. https://doi.org/10.1113/jphysiol.1962.sp006837. PMID: 14449617. PMCID: PMC1359523.
Jacob, G., Pramod, R. T., Katti, H., & Arun, S. P. (2019). Do deep neural networks see the way we do? bioRxiv, 860759. https://doi.org/10.1101/860759.
Jaegle, A., Mehrpour, V., Mohsenzadeh, Y., Meyer, T., Oliva, A., & Rust, N. (2019). Population response magnitude variation in inferotemporal cortex predicts image memorability. eLife, 8, e47596. https://doi.org/10.7554/eLife.47596. PMID: 31464687. PMCID: PMC6715346.
Jarvers, C., & Neumann, H. (2019). Incorporating feedback in convolutional neural networks. In Proceedings of the Cognitive Computational Neuroscience Conference (pp. 395–398). Berlin, Germany. https://doi.org/10.32470/CCN.2019.1191-0.
Jozwik, K. M., Kriegeskorte, N., Storrs, K. R., & Mur, M. (2017). Deep convolutional neural networks outperform feature-based but not categorical models in explaining object similarity judgments. Frontiers in Psychology, 8, 1726. https://doi.org/10.3389/fpsyg.2017.01726. PMID: 29062291. PMCID: PMC5640771.
Kar, K., Kubilius, J., Schmidt, K., Issa, E. B., & DiCarlo, J. J. (2019). Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior. Nature Neuroscience, 22, 974–983. https://doi.org/10.1038/s41593-019-0392-5. PMID: 31036945.
Kay, K. N. (2018). Principles for models of neural information processing. Neuroimage, 180, 101–109. https://doi.org/10.1016/j.neuroimage.2017.08.016. PMID: 28793238.
Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V., & McDermott, J. H. (2018). A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98, 630–644. https://doi.org/10.1016/j.neuron.2018.03.044. PMID: 29681533.
Khaligh-Razavi, S.-M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10, e1003915. https://doi.org/10.1371/journal.pcbi.1003915. PMID: 25375136. PMCID: PMC4222664.
Kheradpisheh, S. R., Ghodrati, M., Ganjtabesh, M., & Masquelier, T. (2016). Deep networks can resemble human feed-forward vision in invariant object recognition. Scientific Reports, 6, 32672. https://doi.org/10.1038/srep32672. PMID: 27601096. PMCID: PMC5013454.
Kietzmann, T. C., McClure, P., & Kriegeskorte, N. (2019). Deep neural networks in computational neuroscience. In Oxford research encyclopedia of neuroscience. Oxford: Oxford University Press. https://doi.org/10.1093/acrefore/9780190264086.013.46.
Kim, J., Linsley, D., Thakkar, K., & Serre, T. (2019). Disentangling neural mechanisms for perceptual grouping. arXiv preprint arXiv:1906.01558. https://doi.org/10.32470/CCN.2019.1130-0.
Kim, B., Reif, E., Wattenberg, M., & Bengio, S. (2019). Do neural networks show Gestalt phenomena? An exploration of the law of closure. arXiv preprint arXiv:1903.01069.
12 Journal of Cognitive Neuroscience Volume X, Number Y
Kindel, W. F., Christensen, E. D., & Zylberberg, J. (2019). Using deep learning to probe the neural code for images in primary visual cortex. Journal of Vision, 19, 29. https://doi.org/10.1167/19.4.29. PMID: 31026016. PMCID: PMC6485988.
King, M. L., Groen, I. I. A., Steel, A., Kravitz, D. J., & Baker, C. I. (2019). Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. Neuroimage, 197, 368–382. https://doi.org/10.1016/j.neuroimage.2019.04.079. PMID: 31054350. PMCID: PMC6591094.
Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019). Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning (pp. 3519–3529). Long Beach, CA.
Kriegeskorte, N. (2015). Deep neural networks: A new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1, 417–446. https://doi.org/10.1146/annurev-vision-082114-035447. PMID: 28532370.
Kriegeskorte, N., & Golan, T. (2019). Neural network models and deep learning. Current Biology, 29, R231–R236. https://doi.org/10.1016/j.cub.2019.02.034. PMID: 30939301.
Kriegeskorte, N., Mur, M., & Bandettini, P. (2008). Representational similarity analysis–Connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4. https://doi.org/10.3389/neuro.06.004.2008. PMID: 19104670. PMCID: PMC2605405.
Kubilius, J., Bracci, S., & Op de Beeck, H. P. (2016). Deep neural networks as a computational model for human shape sensitivity. PLoS Computational Biology, 12, e1004896. https://doi.org/10.1371/journal.pcbi.1004896. PMID: 27124699. PMCID: PMC4849740.
Kubilius, J., Schrimpf, M., Kar, K., Hong, H., Majaj, N. J., Rajalingham, R., et al. (2019). Brain-like object recognition with high-performing shallow recurrent ANNs. arXiv preprint arXiv:1909.06161.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, e253. https://doi.org/10.1017/S0140525X16001837. PMID: 27881212.
Lake, B. M., Zaremba, W., Fergus, R., & Gureckis, T. M. (2015). Deep neural networks predict category typicality ratings for images. In R. Dale, C. Jennings, P. Maglio, T. Matlock, D. Noelle, A. Warlaumont, & J. Yoshimi (Eds.), Proceedings of the 37th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551. https://doi.org/10.1162/neco.1989.1.4.541.
Lillicrap, T. P., & Kording, K. P. (2019). What does it mean to understand a neural network? arXiv preprint arXiv:1907.06374.
Lindsay, G. W., & Miller, K. D. (2018). How biological attention mechanisms improve task performance in a large-scale visual system model. eLife, 7, e38105. https://doi.org/10.7554/eLife.38105.
Lindsay, G. W., Rubin, D. B., & Miller, K. D. (2019). A simple circuit model of visual cortex explains neural and behavioral aspects of attention. bioRxiv, 875534. https://doi.org/10.1101/2019.12.13.875534.
Lindsey, J., Ocko, S. A., Ganguli, S., & Deny, S. (2019). A unified theory of early visual representations from retina to cortex through anatomically constrained deep CNNs. bioRxiv, 511535. https://doi.org/10.1101/511535.
Lotter, W., Kreiman, G., & Cox, D. (2016). Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104.
Love, B. C., Guest, O., Slomka, P., Navarro, V. M., & Wasserman, E. (2017). Deep networks as models of human and animal categorization. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society (pp. 1457–1458). London, UK.
Maheswaranathan, N., McIntosh, L. T., Kastner, D. B., Melander, J., Brezovec, L., Nayebi, A., et al. (2018). Deep learning models reveal internal structure and diverse computations in the retina under natural scenes. bioRxiv, 340943.
Mallat, S. (2016). Understanding deep convolutional networks. Philosophical Transactions of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 374, 20150203. https://doi.org/10.1098/rsta.2015.0203. PMID: 26953183. PMCID: PMC4792410.
Matteucci, G., Marotti, R. B., Riggi, M., Rosselli, F. B., & Zoccolan, D. (2019). Nonlinear processing of shape information in rat lateral extrastriate cortex. Journal of Neuroscience, 39, 1649–1670. https://doi.org/10.1523/JNEUROSCI.1938-18.2018. PMID: 30617210. PMCID: PMC6391565.
McIntosh, L. T., Maheswaranathan, N., Nayebi, A., Ganguli, S., & Baccus, S. A. (2016). Deep learning models of the retinal response to natural scenes. In Advances in neural information processing systems (pp. 1361–1369). Barcelona, Spain.
Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., et al. (2018). Hierarchical visuomotor control of humanoids. arXiv preprint arXiv:1811.09656.
Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204–2212). Montreal, Canada.
Montobbio, N., Bonnasse-Gahot, L., Citti, G., & Sarti, A. (2019). KerCNNs: Biologically inspired lateral connections for classification of corrupted images. arXiv preprint arXiv:1910.08336.
Morcos, A. S., Barrett, D. G. T., Rabinowitz, N. C., & Botvinick, M. (2018). On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.
Murty, N. A. R., & Arun, S. P. (2017). A balanced comparison of object invariances in monkey IT neurons. eNeuro, 4, ENEURO.0333-16.2017. https://doi.org/10.1523/ENEURO.0333-16.2017. PMID: 28413827. PMCID: PMC5390242.
Murty, N. A. R., & Arun, S. P. (2018). Multiplicative mixing of object identity and image attributes in single inferior temporal neurons. Proceedings of the National Academy of Sciences, U.S.A., 115, E3276–E3285. https://doi.org/10.1073/pnas.1714287115. PMID: 29559530. PMCID: PMC5889630.
Nguyen, A., Yosinski, J., & Clune, J. (2019). Understanding neural networks via feature visualization: A survey. arXiv preprint arXiv:1904.08939. https://doi.org/10.1007/978-3-030-28954-6_4.
Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature visualization. Distill, 2, e7. https://doi.org/10.23915/distill.00007.
Paine, T. L., Colmenarejo, S. G., Wang, Z., Reed, S., Aytar, Y., Pfaff, T., et al. (2018). One-shot high-fidelity imitation: Training large-scale deep nets with RL. arXiv preprint arXiv:1810.05017.
Peterson, J. C., Abbott, J. T., & Griffiths, T. L. (2018). Evaluating (and improving) the correspondence between deep neural networks and human representations. Cognitive Science, 42, 2648–2669. https://doi.org/10.1111/cogs.12670. PMID: 30178468.
Pinto, N., Cox, D. D., & DiCarlo, J. J. (2008). Why is real-world visual object recognition hard? PLoS Computational Biology, 4, e27. https://doi.org/10.1371/journal.pcbi.0040027. PMID: 18225950. PMCID: PMC2211529.
Rajaei, K., Mohsenzadeh, Y., Ebrahimpour, R., & Khaligh-Razavi, S.-M. (2019). Beyond core object recognition: Recurrent processes account for object recognition under occlusion. PLoS Computational Biology, 15, e1007001. https://doi.org/10.1371/journal.pcbi.1007001. PMID: 31091234. PMCID: PMC6538196.
Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38, 7255–7269. https://doi.org/10.1523/JNEUROSCI.0388-18.2018. PMID: 30006365. PMCID: PMC6096043.
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29, 2352–2449. https://doi.org/10.1162/neco_a_00990. PMID: 28599112.
Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., et al. (2019). A deep learning framework for neuroscience. Nature Neuroscience, 22, 1761–1770. https://doi.org/10.1038/s41593-019-0520-2. PMID: 31659335. PMCID: PMC7115933.
Riesenhuber, M., & Poggio, T. (2000). Computational models of object recognition in cortex: A review (CBCL Paper 190/AI Memo 1695). Cambridge, MA: MIT Press. https://doi.org/10.21236/ADA458109.
Roelfsema, P. R., van Ooyen, A., & Watanabe, T. (2010). Perceptual learning rules based on reinforcers and attention. Trends in Cognitive Sciences, 14, 64–71. https://doi.org/10.1016/j.tics.2009.11.005. PMID: 20060771. PMCID: PMC2835467.
Rosenfeld, A., Solbach, M. D., & Tsotsos, J. K. (2018). Totally looks like–How humans compare, compared to machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 1961–1964). Salt Lake City, UT. https://doi.org/10.1109/CVPRW.2018.00262.
Roy, P., Ghosh, S., Bhattacharya, S., & Pal, U. (2018). Effects of degradations on deep neural network architectures. arXiv preprint arXiv:1807.10108.
Rubin, D. B., Van Hooser, S. D., & Miller, K. D. (2015). The stabilized supralinear network: A unifying circuit motif underlying multi-input integration in sensory cortex. Neuron, 85, 402–417. https://doi.org/10.1016/j.neuron.2014.12.026. PMID: 25611511. PMCID: PMC4344127.
Sacramento, J., Costa, R. P., Bengio, Y., & Senn, W. (2018). Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in neural information processing systems (pp. 8735–8746). Montreal, Canada.
Salakhutdinov, R. (2015). Learning deep generative models. Annual Review of Statistics and Its Application, 2, 361–385. https://doi.org/10.1146/annurev-statistics-010814-020120.
Samek, W., & Müller, K.-R. (2019). Towards explainable artificial intelligence. In W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, & K.-R. Müller (Eds.), Explainable AI: Interpreting, explaining and visualizing deep learning (pp. 5–22). Cham, Switzerland: Springer. https://doi.org/10.1007/978-3-030-28954-6_1.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
Scholte, H. S., Losch, M. M., Ramakrishnan, K., de Haan, E. H. F., & Bohte, S. M. (2018). Visual pathways from the perspective of cost functions and multi-task deep neural networks. Cortex, 98, 249–261. https://doi.org/10.1016/j.cortex.2017.09.019. PMID: 29150140.
Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., et al. (2018). Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv, 407007. https://doi.org/10.1101/407007.
Seeliger, K., Fritsche, M., Güçlü, U., Schoenmakers, S., Schoffelen, J.-M., Bosch, S. E., et al. (2018). Convolutional neural network-based encoding and decoding of visual object recognition in space and time. Neuroimage, 180, 253–266. https://doi.org/10.1016/j.neuroimage.2017.07.018. PMID: 28723578.
Serre, T. (2019). Deep learning: The good, the bad, and the ugly. Annual Review of Vision Science, 5, 399–426. https://doi.org/10.1146/annurev-vision-091718-014951. PMID: 31394043.
Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., & Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, 165, 33–56. https://doi.org/10.1016/S0079-6123(06)65004-8.
Shen, G., Horikawa, T., Majima, K., & Kamitani, Y. (2019). Deep image reconstruction from human brain activity. PLoS Computational Biology, 15, e1006633. https://doi.org/10.1371/journal.pcbi.1006633. PMID: 30640910. PMCID: PMC6347330.
Shi, J., Wen, H., Zhang, Y., Han, K., & Liu, Z. (2018). Deep recurrent neural network reveals a hierarchy of process memory during dynamic natural vision. Human Brain Mapping, 39, 2269–2282. https://doi.org/10.1002/hbm.24006. PMID: 29436055. PMCID: PMC5895512.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216. https://doi.org/10.1146/annurev.neuro.24.1.1193. PMID: 11520932.
Sinz, F., Ecker, A. S., Fahey, P., Walker, E., Cobos, E., Froudarakis, E., et al. (2018). Stimulus domain transfer in recurrent models for large scale cortical population prediction on video. In Advances in neural information processing systems (pp. 7199–7210). Montreal, Canada. https://doi.org/10.1101/452672.
Spoerer, C. J., Kietzmann, T. C., & Kriegeskorte, N. (2019). Recurrent networks can recycle neural resources to flexibly trade speed for accuracy in visual recognition. bioRxiv, 677237. https://doi.org/10.32470/CCN.2019.1068-0.
Spoerer, C. J., McClure, P., & Kriegeskorte, N. (2017). Recurrent convolutional neural networks: A better model of biological object recognition. Frontiers in Psychology, 8, 1551. https://doi.org/10.3389/fpsyg.2017.01551.
Storrs, K. R., & Kriegeskorte, N. (2019). Deep learning for cognitive neuroscience. In M. Gazzaniga (Ed.), The cognitive neurosciences (6th ed.). Cambridge, MA: MIT Press.
Sussillo, D., & Barak, O. (2013). Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25, 626–649. https://doi.org/10.1162/NECO_a_00409. PMID: 23272922.
Tacchetti, A., Isik, L., & Poggio, T. (2017). Invariant recognition drives neural representations of action sequences. PLoS Computational Biology, 13, e1005859. https://doi.org/10.1371/journal.pcbi.1005859. PMID: 29253864. PMCID: PMC5749869.
Tang, H., Schrimpf, M., Lotter, W., Moerman, C., Paredes, A., Caro, J. O., et al. (2018). Recurrent computations for visual pattern completion. Proceedings of the National Academy of Sciences, U.S.A., 115, 8835–8840. https://doi.org/10.1073/pnas.1719397115. PMID: 30104363. PMCID: PMC6126774.
Thompson, J. A. F., Bengio, Y., Formisano, E., & Schönwiesner, M. (2018). How can deep learning advance computational modeling of sensory information processing? arXiv preprint arXiv:1810.08651.
Thorat, S., van Gerven, M., & Peelen, M. (2019). The functional role of cue-driven feature-based feedback in object recognition. arXiv preprint arXiv:1903.10446. https://doi.org/10.32470/CCN.2018.1044-0.
Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW) (pp. 1–5). Jerusalem, Israel. https://doi.org/10.1109/ITW.2015.7133169.
Tripp, B. P. (2017). Similarities and differences between stimulus tuning in the inferotemporal visual cortex and convolutional networks. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 3551–3560). Anchorage, AK. https://doi.org/10.1109/IJCNN.2017.7966303.
Tschopp, F. D., Reiser, M. B., & Turaga, S. C. (2018). A connectome based hexagonal lattice convolutional network model of the Drosophila visual system. arXiv preprint arXiv:1806.04793.
Turner, M. H., Giraldo, L. G. S., Schwartz, O., & Rieke, F. (2019). Stimulus- and goal-oriented frameworks for understanding natural vision. Nature Neuroscience, 22, 15–24. https://doi.org/10.1038/s41593-018-0284-0. PMID: 30531846.
Wang, Q., Zhang, J., Song, S., & Zhang, Z. (2014). Attentional neural network: Feature selection using cognitive feedback. In Advances in neural information processing systems (pp. 2033–2041). Montreal, Canada.
Watanabe, E., Kitaoka, A., Sakamoto, K., Yasugi, M., & Tanaka, K. (2018). Illusory motion reproduced by deep neural networks trained for prediction. Frontiers in Psychology, 9, 345. https://doi.org/10.3389/fpsyg.2018.00345. PMID: 29599739. PMCID: PMC5863044.
Wen, H., Shi, J., Chen, W., & Liu, Z. (2018). Deep residual network predicts cortical representation and organization of visual features for rapid categorization. Scientific Reports, 8, 3752. https://doi.org/10.1038/s41598-018-22160-9. PMID: 29491405. PMCID: PMC5830584.
Wenliang, L. K., & Seitz, A. R. (2018). Deep neural networks for modeling visual perceptual learning. Journal of Neuroscience, 38, 6028–6044. https://doi.org/10.1523/JNEUROSCI.1620-17.2018. PMID: 29793979. PMCID: PMC6031581.
Whittington, J. C. R., & Bogacz, R. (2019). Theories of error back-propagation in the brain. Trends in Cognitive Sciences, 23, 235–250. https://doi.org/10.1016/j.tics.2018.12.005. PMID: 30704969. PMCID: PMC6382460.
Wichmann, F. A., Janssen, D. H. J., Geirhos, R., Aguilar, G., Schütt, H. H., Maertens, M., et al. (2017). Methods and measurements to compare men against machines. Electronic Imaging, Human Vision and Electronic Imaging, 2017, 36–45. https://doi.org/10.2352/ISSN.2470-1173.2017.14.HVEI-113.
Wyatte, D., Jilk, D. J., & O'Reilly, R. C. (2014). Early recurrent feedback facilitates visual object recognition under challenging conditions. Frontiers in Psychology, 5, 674. https://doi.org/10.3389/fpsyg.2014.00674. PMID: 25071647. PMCID: PMC4077013.
Yamins, D. L. K., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19, 356–365. https://doi.org/10.1038/nn.4244. PMID: 26906502.
Yamins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, U.S.A., 111, 8619–8624. https://doi.org/10.1073/pnas.1403112111. PMID: 24812127. PMCID: PMC4060707.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), 13th European Conference on Computer Vision, ECCV 2014 (pp. 818–833). Cham, Switzerland: Springer.
Zeman, A. A., Ritchie, J. B., Bracci, S., & Op de Beeck, H. P. (2019). Orthogonal representations of object shape and category in deep convolutional neural networks and human visual cortex. bioRxiv, 555193. https://doi.org/10.1101/555193.
Zhou, B., Sun, Y., Bau, D., & Torralba, A. (2018). Revisiting the importance of individual units in CNNs via ablation. arXiv preprint arXiv:1806.02891.
Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., et al. (2018). Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564. https://doi.org/10.15607/RSS.2018.XIV.009.
Zhuang, C., Yan, S., Nayebi, A., & Yamins, D. L. K. (2019). Self-supervised neural network models of higher visual cortex development. In 2019 Conference on Cognitive Computational Neuroscience (pp. 566–569). Berlin, Germany. https://doi.org/10.32470/CCN.2019.1393-0.
Zhuang, C., Zhai, A. L., & Yamins, D. L. K. (2019). Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision (pp. 6002–6012). Seoul, South Korea. https://doi.org/10.1109/ICCV.2019.00610.
title: "Convolutional Neural Networks as a Model of the Visual System: Past, Present, and Future"
author: "Grace W. Lindsay"
journal: Journal of Cognitive Neuroscience
year: 2020
Abstract
Convolutional neural networks (CNNs) were inspired by early findings in the study of biological vision. They have since become successful tools in computer vision and state-of-the-art models of both neural activity and behavior on visual tasks. This review highlights what, in the context of CNNs, it means to be a good model in computational neuroscience and the various ways models can provide insight. Specifically, it covers the origins of CNNs and the methods by which we validate them as models of biological vision. It then goes on to elaborate on what we can learn about biological vision by understanding and experimenting on CNNs and discusses emerging opportunities for the use of CNNs in vision research beyond basic object recognition.
Introduction
Computational models serve several purposes in neuroscience. They can validate intuitions about how a system works by providing a way to test those intuitions directly. They also offer a means to explore new hypotheses in an ideal experimental testing ground, wherein every detail can be controlled and measured. In addition, models open the system in question up to a new realm of understanding through the use of mathematical analysis. In recent years, convolutional neural networks (CNNs) have performed all of these roles as a model of the visual system.
This review covers the origins of CNNs, the methods by which we validate them as models of the visual system, what we can find by experimenting on them, and emerging opportunities for their use in vision research. Importantly, this review is not intended to be a thorough overview of CNNs or to extensively cover all uses of deep learning in the study of vision (other reviews may be of use to the reader for this; Kietzmann, McClure, & Kriegeskorte, 2019; Kriegeskorte & Golan, 2019; Serre, 2019; Storrs & Kriegeskorte, 2019; Yamins & DiCarlo, 2016; Kriegeskorte, 2015). Rather, it is meant to demonstrate the strategies by which CNNs as a model can be used to gain insight and understanding about biological vision.
According to Kay (2018), a "functional model" merely attempts to match the outputs of a system given the inputs it receives, whereas a "mechanistic model" attempts to use components that parallel the actual physical components of the system. Under these definitions, this review concerns the use of CNNs as mechanistic models of the visual system. That is, beyond an overall match between the outputs of the two systems, subcomponents of the CNN are assumed to be intended to match subcomponents of the visual system.
The History of CNNs

The history of CNNs spans both neuroscience and artificial intelligence. Like artificial neural networks in general, CNNs are an example of brain-inspired ideas brought to fruition through interaction with computer science and engineering.
Origins of the Model
In the mid-20th century, Hubel and Wiesel discovered two prominent cell types in the primary visual cortex (V1) of cats (Hubel & Wiesel, 1962). The first type, simple cells, respond to bars of light or dark placed at specific spatial locations. Each of these cells has an orientation of the bar at which it fires most, and its response falls off as the orientation of the bar deviates from this preferred orientation (an orientation "tuning curve"). The second type, complex cells, have less strict response profiles; these cells still have orientation tuning but respond equally strongly to a preferred bar at several different locations near the preferred stimulus. Hubel and Wiesel concluded that these complex cells likely pool input from multiple simple cells, all with the same preferred orientation but slightly different preferred locations (Figure 1, left).
Figure 1. The relationship between components of the visual system and the basic operations of CNNs. Hubel and Wiesel (1962) found that simple cells (left, blue) respond most strongly to bars of a particular orientation and have a preferred location in the image (dashed ovals). Complex cells (green) receive input from many simple cells and therefore have spatially invariant responses. These operations are replicated in CNNs (right). The first convolutional layer (blue) is produced by applying a convolution to the image: a feature map is created by applying the same filter (gray box) at all locations in the image. A convolutional layer has as many feature maps as filters applied (here, only one). Taking the maximal activation in a small region of each feature map (gray box) downsamples the image and yields complex-cell-like responses (green). This operation is known as "max pooling," and the downsampled feature maps it produces make up a pooling layer.
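As a concrete illustration of the motif in this caption, the toy sketch below (the filter and images are invented for illustration, not taken from the paper) implements a "simple cell" as a dot product between a small oriented filter and an image patch, and a "complex cell" as a max over the simple-cell responses at all positions, which makes its output invariant to where the bar appears.

```python
import numpy as np

# A 3x3 vertical-edge filter stands in for a simple cell's receptive field.
VERTICAL_FILTER = np.array([[1., 0., -1.],
                            [1., 0., -1.],
                            [1., 0., -1.]])

def simple_cell_responses(image, filt=VERTICAL_FILTER):
    """Valid convolution: one simple-cell response per image position."""
    h, w = filt.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
    return out

def complex_cell_response(image):
    """Max over all positions: tolerant to where the bar appears."""
    return simple_cell_responses(image).max()

# The same vertical bar at two different horizontal positions.
img_left, img_right = np.zeros((5, 7)), np.zeros((5, 7))
img_left[:, 1] = 1.0
img_right[:, 4] = 1.0

# The simple-cell maps differ, but the complex-cell output is identical.
print(complex_cell_response(img_left), complex_cell_response(img_right))
```

Stacking this pair of operations, with learned filters in place of the hand-built one, is exactly the pattern repeated in the convolutional and pooling layers of a CNN.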
In 1980, Fukushima translated Hubel and Wiesel's findings into a working model of the visual system (Fukushima, 1980). This model, the Neocognitron, was the forerunner of modern CNNs and contained two main cell types. S cells, named after simple cells, replicated their basic functionality: a 2-D grid of weights was applied at each location of the input image to create the S-cell responses. S-cell "planes" form a retinotopic layout in which all of the cells in one plane share the same preferred visual feature and a single layer contains multiple planes. The responses of C cells (named after complex cells) are a nonlinear function of multiple S cells coming from different locations in the same plane.
After this layer of simple and complex cells representing the basic computations of V1, the Neocognitron simply repeats the process: the output of the first layer of complex cells serves as the input to a second layer of simple cells, and so on. Repeating this several times yields a hierarchical model that mimics the operation not just of V1 but of the entire ventral visual pathway. The network is self-organizing, and its weights change with repeated exposure to unlabeled images.
By the 1990s, many similar hierarchical models of the visual system were being explored and related to data (Riesenhuber & Poggio, 2000). One of the most prominent, "HMAX," used a simple "max" operation over the activity of simple cells to determine the responses of C cells, making them highly robust to variations in the image. These models could be applied to the same images used in human psychophysics experiments, and model behavior could be compared directly to the ability of humans to perform rapid visual categorization. In this way, a correspondence was identified between these hierarchical models and the first 100-150 msec of visual processing (Serre et al., 2007). Such models were also well suited to capturing the responses of V4 neurons to complex shape stimuli (Cadieu et al., 2007).
CNNs in Computer Vision
The CNNs we know today emerged from the field of computer vision, but the inspiration from Hubel and Wiesel's work is clearly visible in them (Figure 1; Rawat & Wang, 2017). Modern CNNs first convolve a series of filters with the input image and rectify the output, producing features analogous to the Neocognitron's S-cell planes. Max pooling is then applied to create complex-cell-like responses. After this pattern is repeated several times, nonconvolutional fully connected layers are added; the final layer has as many units as there are categories in the task, and the network outputs a category label for the image (Figure 2, top).
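The pipeline just described, convolution and rectification followed by max pooling, repeated and then capped with a fully connected readout, can be sketched in plain numpy. All sizes below (a 16x16 RGB input, four filters, ten categories) are illustrative, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, filters):
    """image: (H, W, C_in); filters: (k, k, C_in, C_out)."""
    k, _, _, c_out = filters.shape
    H, W, _ = image.shape
    out = np.zeros((H - k + 1, W - k + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output channel is one feature map (an S-cell plane).
            out[i, j] = np.tensordot(image[i:i + k, j:j + k], filters, axes=3)
    return out

def maxpool2x2(x):
    """Complex-cell-like downsampling: max over 2x2 neighborhoods."""
    H, W, C = x.shape
    return x[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

image = rng.normal(size=(16, 16, 3))                # RGB input image
filters = rng.normal(size=(3, 3, 3, 4))             # four 3x3 filters
x = np.maximum(conv2d_valid(image, filters), 0.0)   # convolution + rectification
x = maxpool2x2(x)                                   # pooling layer
readout = rng.normal(size=(x.size, 10))             # fully connected: 10 categories
logits = x.ravel() @ readout
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                # softmax over category labels
print(probs.argmax())                               # the predicted category
```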
Figure 2. Validating CNNs as a model by comparing their representations to those of the brain. A CNN trained to perform object recognition (top) contains several convolutional layers (here, Conv-1 and Conv-2, each followed by the max-pooling layers Pool-1 and Pool-2) and ends with a fully connected layer that has as many units as categories. When the same images are shown to the CNN and to an animal (bottom), their responses can be compared; here, the activity of each unit recorded from a visual area of the brain is predicted from the activity of the artificial units at a particular layer.
The capabilities of CNNs were first demonstrated in 1989, when an early CNN trained with supervision using the backpropagation algorithm was shown to be able to classify handwritten digits (LeCun et al., 1989). However, CNNs did not come into their own until 2012, when an eight-layer network trained with backpropagation ("AlexNet," labeled "Standard Architecture" in Figure 3) far exceeded state-of-the-art performance on the ImageNet challenge. The ImageNet data set consists of over one million real-world images, each of which must be classified into one of 1,000 object categories. The success of this network demonstrated that the basic features of the visual system identified by neuroscientists, given only an appropriate learning algorithm and data, could actually support vision.
Figure 3. Example architectures. The standard architecture shown here resembles the original AlexNet: a red/green/blue image is taken as input, and the activity at the final layer determines the category label. In the second architecture, local recurrence is added at each convolutional layer, and feedback recurrence runs from the fifth layer to the third; this network is trained with the backpropagation-through-time algorithm. The third network is an autoencoder, in which the objective is to recreate the input image. Importantly, the fully connected layers are of lower dimensionality than the image, so the network must find a compact representation of the relevant image statistics. Finally, an architecture for reinforcement learning is shown. Here, the objective of the network is to turn an input image conveying information about the current state into the action the network should take. This architecture is augmented with a particular type of local recurrent unit known as an LSTM, which endows the network with memory.
In the years since this demonstration, many variants of the CNN architecture have been explored. The main parameters varied include the depth of the network, the placement of pooling layers, the number of feature maps per layer, the learning procedure, the width of the layers, and whether residual connections are present (Rawat & Wang, 2017). In exploring these parameters, the goal of the computer vision community is to produce models that perform well on standard image benchmarks; making networks smaller, faster, and trainable with less data are secondary objectives. Correspondence with biology is not a driving factor.
Validating CNNs as a Model of the Visual System

The architecture of CNNs bears a direct (and intentional) resemblance to the architecture of the visual system. Images fed into these networks are usually first normalized and separated into three color channels (red, green, blue). Each stacked bundle of convolution, nonlinearity, and pooling can be considered an approximation of a single visual area, usually one along the ventral stream such as V1, V2, V4, or IT, each with its own retinotopy and feature maps. Through this stacking, the receptive fields of individual neurons are built up, growing larger and responsive to more complex image features deeper in the network. As described above, once trained, these architectures can take in images and output category labels that agree with human judgments.
All of these features make CNNs an apt candidate model of the visual system. Yet not all of these features were explicitly built into CNNs. To validate that CNNs carry out computations like those of the visual system, they need to be matched to it in ways that go beyond this initial similarity. That is, further features of the data, ones not assumed by the model, should fall out of it. Indeed, many such correspondences have been found at both the neural and behavioral levels.
Comparisons at the Neural Level
One major reason for the recent resurgence of interest in artificial neural networks among neuroscientists was the finding that CNNs can replicate the representation of visual information along the ventral stream. In particular, when a CNN and an animal are shown the same images (Figure 2), the activity of the network's units can be used to predict the activity of real neurons, with an accuracy that surpasses previous methods.
One of the first studies to demonstrate this, published in 2014 (Yamins et al., 2014), recorded extracellular activity from macaques viewing images of complex objects. The activity of real V4 and IT neurons was regressed against the activity of artificial units in the network (with predictive ability cross-validated on held-out test sets). Networks that performed better on the object recognition task also proved better at predicting neural activity (a relationship that has also been found using video classification: Tacchetti, Isik, & Poggio, 2017). Furthermore, the activity of units at the final layers of the network best predicted IT activity, whereas intermediate layers best predicted V4. This relationship between model and brain, in which later layers of the network best predict later regions of the ventral stream, has also been found in several other studies, including ones using human fMRI (Güçlü & van Gerven, 2015) and MEG (Seeliger et al., 2018) and ones using movies rather than static images as stimuli (Eickenberg, Gramfort, Varoquaux, & Thirion, 2017).
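A minimal sketch of this regression procedure, with randomly generated stand-ins for the CNN unit activations and a synthetic "neuron" in place of recordings (nothing here reproduces the cited study's data or exact method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in design matrix: rows = images, columns = CNN unit activations.
n_images, n_units = 200, 50
X = rng.normal(size=(n_images, n_units))
# A synthetic "neuron" whose response is a noisy linear readout of the units.
true_w = rng.normal(size=n_units)
y = X @ true_w + rng.normal(scale=2.0, size=n_images)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Cross-validate: fit on half of the images, predict the held-out half.
train, test = slice(0, 100), slice(100, 200)
w = ridge_fit(X[train], y[train])
prediction = X[test] @ w
held_out_r = np.corrcoef(prediction, y[test])[0, 1]
print(round(held_out_r, 2))  # predictivity on images never used for fitting
```

Predictivity is always reported on held-out images, because with enough units a regression can fit any set of training responses.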
Another method for identifying correspondences between different populations is representational similarity analysis (RSA; Kriegeskorte, Mur, & Bandettini, 2008). In this method, a matrix is first created for each population indicating how dissimilar that population's responses are for each pair of images; this functions as a signature of the representational properties of the population. The similarity between two different populations is then measured as the correlation between their dissimilarity matrices. Using this method, a 2014 study (Khaligh-Razavi & Kriegeskorte, 2014) demonstrated that the later layers of an AlexNet network trained on ImageNet matched multiple higher areas of the human visual system and monkey IT better than previously used models.
Because dissimilarity matrices can be created from many kinds of responses, including behavioral output, RSA is widely applicable as a means of comparing data across different experimental methods and models. Also, whereas regression focuses on one neuron or voxel at a time, RSA offers a simple way to compare the responses of entire populations at once. On the other hand, because regression can selectively weight the model features most relevant for fitting the data, it can be more informative than the "unsupervised" RSA method. In either case, the details of these methods and their interpretation must be considered with care (Kornblith, Norouzi, Lee, & Hinton, 2019; Thompson, Bengio, Formisano, & Schönwiesner, 2018).
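The two steps of RSA can be written compactly. The populations below are invented, a "model" population nonlinearly read out by a smaller "neural" one, just to show that the comparison works across populations of different sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

def rdm(responses):
    """Representational dissimilarity matrix: 1 - correlation between the
    population's response vectors to each pair of images (rows = images)."""
    return 1.0 - np.corrcoef(responses)

def rsa_similarity(resp_a, resp_b):
    """Correlate the upper triangles of the two RDMs."""
    iu = np.triu_indices(resp_a.shape[0], k=1)
    return np.corrcoef(rdm(resp_a)[iu], rdm(resp_b)[iu])[0, 1]

# Made-up populations of different sizes viewing the same 20 images:
# 25 "neurons" read out 40 "model units" through a noisy nonlinearity.
n_images = 20
model_units = rng.normal(size=(n_images, 40))
neurons = np.tanh(model_units @ rng.normal(size=(40, 25)))
neurons += 0.1 * rng.normal(size=neurons.shape)
unrelated = rng.normal(size=(n_images, 25))  # control population

sim_related = rsa_similarity(model_units, neurons)
sim_control = rsa_similarity(model_units, unrelated)
print(round(sim_related, 2), round(sim_control, 2))
```

The related population should score higher than the unrelated control, even though neither population has the same number of units as the model.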
Most studies comparing CNNs to biological data have emphasized their ability to explain later visual areas such as V4 and IT, presumably because the complex response properties of these regions have made them, in contrast to primary visual cortex, difficult to fit. Recent work, however, has shown that the middle layers of task-trained CNNs can predict V1 activity beyond the abilities of classic V1 models (Cadena, Denfield, et al., 2019).
Beyond predicting neural activity and matching representations at the population level, CNNs can also be compared to neural data using the more specific features traditionally employed in systems neuroscience. Between units in the network and individual neurons, similarities have been found in response sparseness and size tuning, but differences have been found in object selectivity and orientation tuning (Tripp, 2017). Other work has examined tuning for shape and category (Zeman, Ritchie, Bracci, & Op de Beeck, 2019) and how responses change with variations in pose, location, and size (Murty & Arun, 2017, 2018; Hong, Yamins, Majaj, & DiCarlo, 2016).
In general, the similarities to real neural activity that emerge when CNNs are trained to perform object recognition indicate that this architecture and task do, to some degree, genuinely resemble the architecture and objectives of the visual system.
Comparisons at the Behavioral Level
Insofar as CNNs outstrip previous models of the visual system at classifying real-world images, they can be considered a better match to human behavior. Overall accuracy on the standard ImageNet task, however, is only one measure of a CNN's behavior, and it is the one the networks are explicitly optimized for.
A deeper comparison becomes possible by looking at the errors these networks make. There is only one way to be correct, but many ways in which humans and models can misclassify, and these errors help us understand how each system works. For example, confusion matrices, which indicate how frequently images of one category are classified as belonging to another, can be used to compare the behavior of models and animals. A large-scale study found that several different deep CNN architectures showed similar agreement with human and monkey object classification, even though none of the models could predict behavior down to the level of individual images (Rajalingham et al., 2018).
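As a minimal illustration of the confusion-matrix comparison (the labels below are invented; real studies use thousands of trials), one can tabulate each system's errors and then correlate the off-diagonal entries:

```python
import numpy as np

def confusion_matrix(true_labels, predicted, n_classes):
    """counts[i, j] = how often true class i was called class j."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted):
        counts[t, p] += 1
    return counts

# Hypothetical example: 3 categories, model vs. human judgments on the same trials.
true_y  = [0, 0, 0, 1, 1, 1, 2, 2, 2]
model_y = [0, 0, 1, 1, 1, 0, 2, 2, 2]
human_y = [0, 0, 1, 1, 1, 1, 2, 2, 0]

cm_model = confusion_matrix(true_y, model_y, 3)
cm_human = confusion_matrix(true_y, human_y, 3)

# Compare the *pattern of errors*: correlate the off-diagonal entries.
off = ~np.eye(3, dtype=bool)
error_agreement = np.corrcoef(cm_model[off], cm_human[off])[0, 1]
print(cm_model)
print(round(error_agreement, 2))
```

Two systems with identical overall accuracy can still disagree entirely about *which* categories get confused, which is what this comparison detects.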
Other studies have compared model representations with human judgments obtained by explicitly asking participants to rate the similarity between images (King, Groen, Steel, Kravitz, & Baker, 2019; Jozwik, Kriegeskorte, Storrs, & Mur, 2017). For example, Rosenfeld, Solbach, and Tsotsos (2018) obtained a large data set from the Web site "Totally Looks Like." Whereas many similarity studies have found reasonably good matches with CNNs, this study identified challenging elements of similarity that CNNs struggled to replicate.
In addition to similarity, other psychological concepts tested in CNNs include typicality (Lake, Zaremba, Fergus, & Gureckis, 2015), Gestalt principles (Kim, Reif, Wattenberg, & Bengio, 2019), and animacy (Bracci, Ritchie, Kalfas, & Op de Beeck, 2019). Jacob, Pramod, Katti, and Arun (2019) applied a battery of tests inspired by findings in visual psychophysics to CNNs and found that, on roughly half of them, CNNs resembled biological vision.
One way to probe the behavior of animals and CNNs alike is to make the classification task harder by degrading image quality. Several studies have added various kinds of noise, occlusion, or blur to standard images and observed the resulting decline in classification performance (Geirhos, Temme, et al., 2018; Tang et al., 2018; Geirhos et al., 2017; Wichmann et al., 2017; Ghodrati, Farzmahdi, Rajaei, Ebrahimpour, & Khaligh-Razavi, 2014). Importantly, this decline is usually more severe in CNNs than in humans, indicating that biological vision has mechanisms for coping with degradation that CNNs lack. Such tests provide important guidance for future work (see the Alternative Data Sets section). A particular CNN architecture known as capsule networks (Roy, Ghosh, Bhattacharya, & Pal, 2018) has been shown to be more robust to degradation, although how the "capsules" of this architecture relate to parts of the visual system is not precisely clear.
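The logic of such degradation experiments can be sketched in a few lines. The "classifier" below is a trivial stand-in based on mean brightness rather than a CNN, and all names and parameters are illustrative; with a real network, accuracy would be measured on a labeled image set at each noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "images": two categories distinguished only by mean brightness.
def make_image(label):
    base = 0.3 if label == 0 else 0.7
    return np.full((8, 8), base) + rng.normal(0, 0.02, (8, 8))

def classify(img):
    # Nearest-prototype classifier on mean brightness (toy proxy for a CNN).
    return 0 if img.mean() < 0.5 else 1

def accuracy_under_noise(sigma, n_trials=200):
    correct = 0
    for _ in range(n_trials):
        label = rng.integers(0, 2)
        noisy = make_image(label) + rng.normal(0, sigma, (8, 8))  # degrade the image
        correct += classify(noisy) == label
    return correct / n_trials

for sigma in [0.0, 0.5, 2.0]:
    print(f"noise sigma={sigma}: accuracy={accuracy_under_noise(sigma):.2f}")
```

Plotting accuracy against degradation level for humans and for a model yields the robustness curves compared in the studies above.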
Another novel finding to come out of analyses of CNN behavior is their reliance on texture. Although arguments have been made that CNNs are models of human shape sensitivity (Kubilius, Bracci, & Op de Beeck, 2016), other work has demonstrated that CNNs rely heavily on texture when classifying images and surprisingly little on shape (Baker, Lu, Erlikhman, & Kellman, 2018; Geirhos, Rubisch, et al., 2018).
Interestingly, it is possible for certain deep networks to outperform humans on some tasks (Kheradpisheh, Ghodrati, Ganjtabesh, & Masquelier, 2016). This highlights an important tension between the goals of computer vision and those of neuroscience: in the former, exceeding human performance is desirable, whereas in the latter, it counts as a mismatch between model and data.
A caveat when studying CNN behavior is that the standard feedforward CNN architecture is believed to represent only the early stages of visual processing, before various kinds of recurrent processing take effect. For this reason, brief stimulus presentation and backward masking, which are known to interfere with many stages of recurrent processing, are recommended when comparing CNN behavior with that of animals (Tang et al., 2018).
Other Forms of Validation
Comparisons of neural activity and behavior are the main methods by which CNNs are validated as models of the visual system, but other approaches provide further support.
Techniques for visualizing the image features that drive units in different layers of the network (Figure 4; Olah, Mordvintsev, & Schubert, 2017; Zeiler & Fergus, 2014) have revealed preferred visual patterns consistent with those found in neuroscience. For example, the first layers of CNNs contain Gabor-like filters, later layers respond to partial object features, and the final layers respond to full features such as faces. This supports the notion that the processing steps of a CNN align with those of the ventral stream. CNNs have also been used to generate optimal stimuli for real neurons. Starting from the procedure for predicting neural activity with CNNs described above, stimuli were synthesized with the intent of driving neural firing maximally (Bashivan, Kar, & DiCarlo, 2019). Although the resulting stimuli were unnatural and unrelated to the images the network was trained on, the fact that they did indeed drive neurons beyond their normal rates further supports the notion that CNNs capture something fundamental about the visual processing stream.
Figure 4. Visualizing the preferences of different network components. The general method for creating these images is shown at the top: a component of the model is chosen for optimization, and gradient computations passed back through the network indicate how the pixels of an initially noisy image should change to most activate the chosen component. Components can be individual units, entire feature maps, layers (as used in the DeepDream algorithm), or units representing particular classes at the final layer. From Olah et al. (2017).
Relatedly, CNNs have been used to decode neural activity and reconstruct the stimuli shown to participants (Shen, Horikawa, Majima, & Kamitani, 2019).
Learning From Model Alterations
The studies described above have mainly focused on standard feedforward CNN architectures trained to perform object recognition via supervised learning. However, because we have full control over these models, many variations can be explored: in data set, architecture, and training procedure. By observing how each of these alterations makes the model's fit to data better or worse, we can gain insight into why biological visual processing has the features it does.
Alternative Data Sets
The ImageNet data set, unlike smaller and simpler predecessors such as MNIST and CIFAR-10, has proven extremely useful for learning a basic set of visual features that can be adapted to many tasks. However, the data set consists of object-focused images and is consequently best suited to studying the object recognition pathway. To study scene-processing areas such as the occipital place area, some studies have instead trained on scene images. Bonner and Epstein (2018) used a network trained to classify scenes to predict responses in the occipital place area; the authors could further relate the features the network captured to navigational affordances in the scene. Cichy, Khosla, Pantazis, and Oliva (2017) used a scene-trained network to capture representations of scene size collected with MEG. More generally, such studies can help identify the evolutionarily or developmentally determined roles of different brain regions, based on how well their representations align with networks trained on different specialized data sets. For an open data set of fMRI responses to a different image set, see Chang et al. (2019).
Data sets have also been developed with the explicit intent of correcting the ways in which CNN behavior diverges from that of humans. For example, training on different image degradations can make networks more robust to them (Geirhos, Temme, et al., 2018). However, this works only for the specific noise model trained on and does not generalize to new noise types. In addition, Geirhos, Rubisch, et al. (2018) showed that a data set that varies low-level texture elements while preserving abstract shape reduces the texture bias of CNNs. Whether animals owe their superior visual abilities to exposure to similarly diverse data during development, or instead to built-in priors, has yet to be determined.
Alternative Architectures
A variety of purely feedforward architectures have been explored by the machine learning community and by neuroscientists. In comparisons with data, AlexNet is frequently used and, despite their better task performance, deeper architectures such as VGG models and ResNets can fare worse (Schrimpf et al., 2018; Wen, Shi, Chen, & Liu, 2018). Very deep networks may perform better on image tasks, but they may also break the correspondence between layers in the network and regions in the brain, worsening the fit to data. Alternatively, the processing done by very deep networks can in some cases be considered equivalent to multiple stages of recurrent processing in the brain (Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019).
In primate vision, the first two processing stages (the retina and lateral geniculate nucleus) contain cells with center-surround receptive fields, responding preferentially to light in the center or in the surround, and orientation tuning becomes prominent only in V1. Yet Gabor-like oriented filters are usually learned in the first layer of CNNs, making the first layer's representation V1-like. Using an architecture in which the connections out of the initial layers are restricted, as the optic nerve biologically constrains them, CNNs can be produced in which the early stages have center-surround responses and similar orientation tuning appears only later (Lindsey, Ocko, Ganguli, & Deny, 2019). Thus, the patterns of selectivity seen in primate vision may be a consequence of anatomical constraints.
In Kell, Yamins, Shook, Norman-Haignere, and McDermott (2018), a variety of architectures were trained to perform a pair of auditory tasks (speech recognition and music recognition). Through this, the authors found that the two tasks could share three layers of processing, after which the network needed to split into specialized streams for good performance on both. Recently, a similar procedure was applied to visual tasks to explain why a pathway specialized for face processing exists in the visual system (Dobs, Kell, Palmer, Cohen, & Kanwisher, 2019). A similar question can also be approached by training a single network to perform two tasks and looking for the subnetworks that emerge within the trained network (Scholte, Losch, Ramakrishnan, de Haan, & Bohte, 2018). Such studies may be able to explain the existence of the dorsal and ventral streams, along with other details of visual architecture.
Taking hints from biology, many studies have investigated the potentially beneficial roles of both local recurrence and feedback recurrence (Figure 3). Local recurrence refers to horizontal connections within a single visual area. Studies adding these connections to CNNs have found that they make networks better suited to more challenging tasks (Hasani, Soleymani, & Aghajan, 2019; Montobbio, Bonnasse-Gahot, Citti, & Sarti, 2019; Spoerer, Kietzmann, & Kriegeskorte, 2019; Tang et al., 2018). These connections also help CNN representations better match neural data, particularly for challenging images and at later time points in the response (Kar et al., 2019; Kubilius et al., 2019; Rajaei, Mohsenzadeh, Ebrahimpour, & Khaligh-Razavi, 2019; Shi, Wen, Zhang, Han, & Liu, 2018; McIntosh, Maheswaranathan, Nayebi, Ganguli, & Baccus, 2016). Together, these studies make a strong case for the computational role of these local connections.
Feedback connections run from frontal or parietal areas to regions of the visual system, or from higher visual areas back to lower ones (Wyatte, Jilk, & O'Reilly, 2014). Like horizontal recurrence, they are known to be prevalent in biological vision. Connections from frontal and parietal areas are believed to implement goal-directed selective attention, and such feedback has been added to network models to implement, for example, cued detection tasks (Thorat, van Gerven, & Peelen, 2019; Wang, Zhang, Song, & Zhang, 2014). Feedback from higher to lower visual areas is believed to implement more immediate and general image processing, such as denoising. Several studies have added these connections alongside local recurrence and found that they can similarly aid performance (Kim, Linsley, Thakkar, & Serre, 2019; Spoerer, McClure, & Kriegeskorte, 2017). A study comparing different feedback and feedforward architectures suggested that feedback helps in part by increasing the effective receptive field size of cells (Jarvers & Neumann, 2019).
Alternative Training Procedures
Although supervised learning via backpropagation is the most common way to train CNNs, other methods may also yield good models of the visual system. For example, the 2014 study that first demonstrated a correlation between object recognition performance and the ability to capture neural responses (Yamins et al., 2014) used a modular optimization procedure rather than backpropagation.
Unsupervised learning, which aims to capture the relevant statistics of the input data rather than to relate inputs to outputs, can also be used to train neural networks (Fleming & Storrs, 2019; Figure 3). These methods may help identify the set of low-level features underlying high-dimensional visual inputs, which could in turn allow animals to understand the world more readily and perhaps to build effective causal models of it (Lake, Ullman, Tenenbaum, & Gershman, 2017). Furthermore, because supervised learning requires large numbers of labeled data points, the brain is believed to have to exploit unsupervised learning. So far, however, unsupervised methods have not produced models that capture neural representations as well as supervised ones do. Behaviorally, generative models have been shown to perform much worse than supervised models at capturing human image classification (Peterson, Abbott, & Griffiths, 2018). On the other hand, a model trained according to the concept of predictive coding (Lotter, Kreiman, & Cox, 2016) predicts the movement of objects and can replicate motion illusions (Watanabe, Kitaoka, Sakamoto, Yasugi, & Tanaka, 2018). Overall, the limits and benefits of unsupervised training of vision models require further investigation.
Interestingly, a recent study found that class-conditioned generative models better matched human judgments on controversial stimuli than models trained to classify (Golan, Raju, & Kriegeskorte, 2019). These generative models still rely on class labels for training and so are not unsupervised; however, their ability to replicate aspects of human perception supports the idea that the visual system aims, in part, to capture the distribution of the visual world rather than merely to map images onto object categories.
Semisupervised learning, a compromise between unsupervised and supervised methods, has recently been explored as a means of producing more biologically realistic networks (Zhuang, Yan, Nayebi, & Yamins, 2019; Zhuang, Zhai, & Yamins, 2019).
Reinforcement learning is the third major training method in machine learning. In these systems, an artificial agent must learn to produce behavioral outputs in response to information from its environment, including rewards. Some of these artificial systems use convolutional architectures on the front end to process visual information about the world (Figure 3; Merel et al., 2018; Paine et al., 2018; Zhu et al., 2018). It would be interesting to compare the representations learned under these models with representations learned through other mechanisms and with data.
A simple way to appreciate the importance of network training is to compare a trained network with one that has the same architecture but random weights. In Kim, Reif, et al. (2019), the ability of networks to perform perceptual closure was present only in networks trained on natural images, not in random networks.
The methods above rely on training data, such as sets of images, not derived from neural recordings. However, it is also possible to train these architectures to reproduce neural activity directly. Doing so can help identify the features of the input image most responsible for driving neural firing (Cadena, Denfield, et al., 2019; Kindel, Christensen, & Zylberberg, 2019; Sinz et al., 2018). Ideally, as happens when networks are trained on classification tasks, the components of the model can also be related to anatomical features of the circuit in question (GÃŒnthner et al., 2019; Maheswaranathan et al., 2018).
A final hybrid option is to train on a classification task while using neural data to constrain the intermediate representations to be brain-like (Fong, Scheirer, & Cox, 2018). The result may be a network that both matches neural data and can perform the task.
Understanding CNNs
Altering the structure and training of networks is one way to explore how they function. In addition, trained networks can be probed directly, using techniques that are applicable to models but usually unavailable for brains (Samek & MÃŒller, 2019). In either case, insofar as CNNs are validated as models of the visual system, the insights gained by exploring their mechanisms may apply to biological vision as well.
Experimental Methods
The standard neuroscience toolbox, including lesions, recordings, anatomical tracing, stimulation, and silencing, can all be readily applied to artificial neural networks (Barrett, Morcos, & Macke, 2019) and used to answer questions about how these networks function.
For example, individual units in a CNN can be "ablated" to observe the effect on classification accuracy. The impact of ablating a given unit turns out not to be strongly related to that unit's selectivity properties (Morcos, Barrett, Rabinowitz, & Botvinick, 2018; Zhou, Sun, Bau, & Torralba, 2018).
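A minimal sketch of this ablation logic, using a hand-wired toy network rather than a trained CNN (all weights and names here are invented for illustration): zeroing a hidden unit and remeasuring accuracy reveals its causal contribution, which can dissociate from its selectivity. Here, unit 1 is strongly selective for negative inputs, yet ablating it leaves accuracy intact:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "trained" network for the task: label = 1 if x[0] > 0.
# Units 0 and 1 read the signal dimension; unit 2 reads only the
# task-irrelevant dimension x[1].
W1 = np.array([[1.0, 0.0],
               [-1.0, 0.0],
               [0.0, 1.0]])
w2 = np.array([1.0, -1.0, 0.0])

def predict(X, ablate=None):
    H = np.maximum(W1 @ X.T, 0.0)   # ReLU hidden activations
    if ablate is not None:
        H[ablate] = 0.0             # "lesion" one hidden unit
    return (w2 @ H > 0).astype(int)

X = rng.normal(0, 1, (1000, 2))
y = (X[:, 0] > 0).astype(int)

for unit in [None, 0, 1, 2]:
    acc = (predict(X, ablate=unit) == y).mean()
    print(f"ablated unit: {unit}, accuracy: {acc:.2f}")
```

Ablating unit 0 collapses accuracy to chance, whereas ablating the equally selective unit 1 (or the noise-tuned unit 2) has no effect, mirroring the dissociation reported in the cited studies.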
In Bonner and Epstein (2018), images were manipulated and occluded to determine which features were driving the CNN's responses to scenes, just as one would probe the function of a real neuron.
Other studies (Lindsey et al., 2019; Spoerer et al., 2019) have analyzed the connectivity properties of trained networks to see whether they replicate anatomical features of the visual cortex.
One conceptual framework for explaining what happens at each stage of visual processing is "untangling": high-level concepts that are entangled in the representations of pixels or the retina are pulled apart so as to form separable clusters in later representations. This theory was developed using biological data (DiCarlo & Cox, 2007), and more recently, techniques have been developed to describe the shapes of these clusters (or manifolds) and applied to understanding the untangling process in deep neural networks (Cohen, Chung, Lee, & Sompolinsky, 2019; Chung, Lee, & Sompolinsky, 2018). This work reveals which features of these manifolds matter for classification and how they change through learning and across processing stages. It can thus help identify which features of responses to look for in data.
Mathematical Analysis
With full access to the equations that define a CNN, many mathematical techniques that cannot currently be applied to the brain become available. One such general tool is the calculation of gradients, which indicate how particular components of the network influence other, even distant, components. Gradients that determine how the weights at a given layer should change to decrease the error at the output are used to train the network. But they are also used to visualize the features of units in the network: in this case, the gradient calculation reaches all the way back to the input image and can determine how individual pixels should change to increase the activity of a particular unit (Olah et al., 2017; Figure 4). Multiple variants of feature visualization have also been used to probe the functioning of neural networks, for example, by indicating what invariances exist at different layers (Nguyen, Yosinski, & Clune, 2019; Cadena, Weis, Gatys, Bethge, & Ecker, 2018).
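The gradient ascent at the heart of feature visualization can be sketched with a linear stand-in "unit," whose gradient with respect to the image is simply its fixed template; in a real CNN, this gradient would be supplied by automatic differentiation, and the template here is an invented example:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in "unit": its activation is a dot product between the image and a
# fixed template. The unit "prefers" a horizontal bar in the middle row.
template = np.zeros((5, 5))
template[2, :] = 1.0

def activation(img):
    return np.sum(template * img)

def grad_wrt_image(img):
    # For a linear unit, d(activation)/d(img) is the template itself;
    # autodiff would compute this for an arbitrary network.
    return template

# Gradient ascent on the pixels of an initially noisy image.
img = rng.normal(0, 0.1, (5, 5))
for _ in range(100):
    img += 0.1 * grad_wrt_image(img)
    img = np.clip(img, -1, 1)

print("activation of optimized image:", activation(img))
```

Iterating this update turns the noisy starting image into the unit's preferred stimulus, which is the procedure illustrated in Figure 4.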
Gradients can also be used to determine the role that individual neurons play in a classification task. In Lindsay and Miller (2018), the role a unit plays in classifying images as members of a particular category was found to be only loosely correlated with how strongly the unit responds to images of that category. Like the ablation work above, this points to a dissociation between tuning and function that should make neuroscientists cautious about how informative analyses based on a neuron's tuning will be.
Machine learning researchers and mathematicians have aimed to make CNNs more amenable to traditional mathematical analysis. For example, concepts from information theory (Tishby & Zaslavsky, 2015) and tools such as wavelet scattering transforms have been used toward this goal (Mallat, 2016). Another promising approach to mathematical analysis is to study deep linear neural networks, whose approximations permit many more analyses (Saxe, McClelland, & Ganguli, 2013).
Several studies in computational neuroscience have trained simple recurrent neural networks to perform tasks and then probed the properties of the trained networks to discover how they work (Barak, 2017; Foerster, Gilmer, Sohl-Dickstein, Chorowski, & Sussillo, 2017; Sussillo & Barak, 2013). Going forward, more refined techniques need to be developed for understanding the representations and connectivity patterns that CNNs learn. These would offer insight into how these networks function while also indicating which experimental and data analysis methods would be fruitful to pursue.
Are CNNs Understandable?
Some researchers believe that CNNs exemplify an unavoidable trade-off between complexity and interpretability. To build models complex enough to perform real-world tasks, we may need to sacrifice the desire, inherent in much of systems neuroscience, for simple descriptions of how each stage works. For such reasons, a different way of describing networks compactly, without relying on simple descriptions of their computations, has been proposed (Lillicrap & Kording, 2019; Richards et al., 2019). This view focuses on describing a network's architecture, objective function, and learning algorithm rather than its specific computations, because the specific computations and representations follow straightforwardly from those three factors.
It does seem true that the historical goal of describing the role of individual neurons or neuron populations in words (e.g., "tuned to orientation" or "face detector") is a very incomplete way to capture the essential computations of a CNN. But perhaps more compact ways of describing the function of these networks exist, and finding such simple descriptions could yield further understanding. For example, only certain sets of weights allow a network to perform well on real-world classification tasks. So far, what we know about these "good" weights is that they come from optimization. But are there essential, identifiable properties that would make the description of a network more compressed? Indeed, certain training procedures can produce networks that use only a fraction of their weights yet function as well as a dense network. Recent work suggests that 95% of a model's weights can be predicted from the remaining 5% (Frankle & Carbin, 2018; Denil, Shakibi, Dinh, Ranzato, & de Freitas, 2013). Such findings indicate that a more compressed (and therefore possibly more understandable) description of the computations of a high-performing network is possible.
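The idea that good performance can survive the removal of most weights can be illustrated with magnitude pruning on a toy regression; this is a sketch of the general principle, not the lottery-ticket procedure itself, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy regression in which only a few weights actually matter.
n, d = 200, 20
w_true = np.zeros(d)
w_true[[3, 11]] = [2.0, -1.5]
X = rng.normal(0, 1, (n, d))
y = X @ w_true + rng.normal(0, 0.1, n)

# Fit a dense model, then zero all but the 4 largest-magnitude weights.
w_fit, *_ = np.linalg.lstsq(X, y, rcond=None)
threshold = np.sort(np.abs(w_fit))[-4]
w_pruned = np.where(np.abs(w_fit) >= threshold, w_fit, 0.0)

def mse(w):
    return np.mean((X @ w - y) ** 2)

print(f"dense MSE: {mse(w_fit):.4f}, MSE with 80% of weights removed: {mse(w_pruned):.4f}")
```

Because the informative weights are few, the pruned model performs essentially as well as the dense one, a compressed description of the same computation.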
Similar analyses of architecture can be used to determine which broad features of connectivity suffice to produce good performance. For example, recent work using random weights helps separate the role of the architecture from that of specific weight values (Gaier & Ha, 2019). Compactly describing the essential features of high-performing networks is a goal for both machine learning and neuroscience.
Another decidedly noncompact component at the heart of deep learning is the data set. As discussed, matching the performance and neural representations of biological vision requires large, real-world data sets. Are we fated to lug around the ImageNet data set to explain how vision works? Or can we instead define a sufficient set of statistics? The statistics of natural scenes have historically played a large role in both computational neuroscience and computer vision (Simoncelli & Olshausen, 2001). Much of that work focused on low-level correlations unlikely to capture the features relevant for object recognition, but it seems too early to give up on exploring higher-order image statistics, particularly as advances in generative modeling (Salakhutdinov, 2015) suggest an ability to compress the full, complex features of natural images into models.
In any case, nearly all criticisms of the interpretability of CNNs apply equally to biological networks. CNNs can therefore serve as a testing ground for deciding what counts as understanding in a neural system.
Beyond the Basics
Since 2014, an explosion of research has addressed many questions about how altering various features of CNNs changes their properties and their ability to match data. These findings have aided our understanding of both the "how" and the "why" of biological vision. Furthermore, now that image-computable models of the visual system are available, opportunities open up for modeling not just the core of vision but the many ways visual information is used.
Figure 5. Examples of the many ways the brain makes use of visual information. Most CNN models of visual processing have focused on the ventral visual pathway. However, the visual system interacts with many other parts of the brain to achieve diverse goals. Future models will need to integrate CNNs into this larger diagram.
REFERENCES
Baker, N., Lu, H., Erlikhman, G., & Kellman, P. J. (2018). Deep convolutional networks do not classify based on global object shape. PLoS Computational Biology, 14, e1006613. https://doi.org/10.1371/journal.pcbi.1006613. PMID: 30532273. PMCID: PMC6306249.
Barak, O. (2017). Recurrent neural networks as versatile tools of neuroscience research. Current Opinion in Neurobiology, 46, 1–6. https://doi.org/10.1016/j.conb.2017.06.003. PMID: 28668365.
Barrett, D. G. T., Morcos, A. S., & Macke, J. H. (2019). Analyzing biological and artificial neural networks: Challenges with opportunities for synergy? Current Opinion in Neurobiology, 55, 55–64. https://doi.org/10.1016/j.conb.2019.01.007. PMID: 30785004.
Bartunov, S., Santoro, A., Richards, B. A., Marris, L., Hinton, G. E., & Lillicrap, T. P. (2018). Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in neural information processing systems (pp. 9390–9400). Montreal, Canada.
Bashivan, P., Kar, K., & DiCarlo, J. J. (2019). Neural population control via deep image synthesis. Science, 364, eaav9436. https://doi.org/10.1126/science.aav9436. PMID: 31048462.
Bonner, M. F., & Epstein, R. A. (2018). Computational mechanisms underlying cortical responses to the affordance properties of visual scenes. PLoS Computational Biology, 14, e1006111. https://doi.org/10.1371/journal.pcbi.1006111. PMID: 29684011. PMCID: PMC5933806.
Bracci, S., Ritchie, J. B., Kalfas, I., & Op de Beeck, H. P. (2019). The ventral visual pathway represents animal appearance over animacy, unlike human behavior and deep neural networks. Journal of Neuroscience, 39, 6513–6525. https://doi.org/10.1523/JNEUROSCI.1714-18.2019. PMID: 31196934. PMCID: PMC6697402.
Cadena, S. A., Denfield, G. H., Walker, E. Y., Gatys, L. A., Tolias, A. S., Bethge, M., et al. (2019). Deep convolutional models improve predictions of macaque V1 responses to natural images. PLoS Computational Biology, 15, e1006897. https://doi.org/10.1371/journal.pcbi.1006897. PMID: 31013278. PMCID: PMC6499433.
Cadena, S. A., Sinz, F. H., Muhammad, T., Froudarakis, E., Cobos, E., Walker, E. Y., et al. (2019). How well do deep neural networks trained on object recognition characterize the mouse visual system? In NeurIPS Workshop on Real Neurons and Hidden Units. Vancouver, Canada.
Cadena, S. A., Weis, M. A., Gatys, L. A., Bethge, M., & Ecker, A. S. (2018). Diverse feature visualizations reveal invariances in early layers of deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 217–232). Munich, Germany. https://doi.org/10.1007/978-3-030-01258-8_14.
Cadieu, C., Kouh, M., Pasupathy, A., Connor, C. E., Riesenhuber, M., & Poggio, T. (2007). A model of V4 shape selectivity and invariance. Journal of Neurophysiology, 98, 1733–1750. https://doi.org/10.1152/jn.01265.2006. PMID: 17596412.
Cao, Y., Chen, Y., & Khosla, D. (2015). Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113, 54–66. https://doi.org/10.1007/s11263-014-0788-3.
Chang, N., Pyles, J. A., Marcus, A., Gupta, A., Tarr, M. J., & Aminoff, E. M. (2019). BOLD5000, a public fMRI dataset while viewing 5000 visual images. Scientific Data, 6, 49. https://doi.org/10.1038/s41597-019-0052-3. PMID: 31061383. PMCID: PMC6502931.
Chung, S., Lee, D. D., & Sompolinsky, H. (2018). Classification and geometry of general perceptual manifolds. Physical Review X, 8, 031003. https://doi.org/10.1103/PhysRevX.8.031003.
Cichy, R. M., Khosla, A., Pantazis, D., & Oliva, A. (2017). Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks. Neuroimage, 153, 346–358. https://doi.org/10.1016/j.neuroimage.2016.03.063. PMID: 27039703. PMCID: PMC5542416.
Cichy, R. M., Kriegeskorte, N., Jozwik, K. M., van den Bosch, J. J. F., & Charest, I. (2019). The spatiotemporal neural dynamics underlying perceived similarity for real-world objects. Neuroimage, 194, 12–24. https://doi.org/10.1016/j.neuroimage.2019.03.031. PMID: 30894333. PMCID: PMC6547050.
Cohen, U., Chung, S., Lee, D. D., & Sompolinsky, H. (2019). Separability and geometry of object manifolds in deep neural networks. bioRxiv, 644658. https://doi.org/10.1101/644658.
Denil, M., Shakibi, B., Dinh, L., Ranzato, M. A., & de Freitas, N. (2013). Predicting parameters in deep learning. In Advances in neural information processing systems (pp. 2148–2156). Lake Tahoe, NV.
Devereux, B. J., Clarke, A., & Tyler, L. K. (2018). Integrated deep visual and semantic attractor neural networks predict fMRI pattern-information along the ventral object processing pathway. Scientific Reports, 8, 10636. https://doi.org/10.1038/s41598-018-28865-1. PMID: 30006530. PMCID: PMC6045572.
de Vries, S. E. J., Lecoq, J. A., Buice, M. A., Groblewski, P. A., Ocker, G. K., Oliver, M., et al. (2019). A large-scale, standardized physiological survey reveals functional organization of the mouse visual cortex. Nature Neuroscience, 23, 138–151. https://doi.org/10.1038/s41593-019-0550-9. PMID: 31844315. PMCID: PMC6948932.
DiCarlo, J. J., & Cox, D. D. (2007). Untangling invariant object recognition. Trends in Cognitive Sciences, 11, 333–341. https://doi.org/10.1016/j.tics.2007.06.010. PMID: 17631409.
Dill, M., & Fahle, M. (1998). Limited translation invariance of human visual pattern recognition. Perception & Psychophysics, 60, 65–81. https://doi.org/10.3758/BF03211918. PMID: 9503912.
Dobs, K., Kell, A. J. E., Palmer, I., Cohen, M., & Kanwisher, N. (2019). Why are face and object processing segregated in the human brain? Testing computational hypotheses with deep convolutional neural networks. Oral presentation at Cognitive Computational Neuroscience Conference, Berlin, Germany. https://doi.org/10.32470/CCN.2019.1405-0.
Dwivedi, K., & Roig, G. (2018). Task-specific vision models explain task-specific areas of visual cortex. bioRxiv, 402735. https://doi.org/10.1101/402735.
Eickenberg, M., Gramfort, A., Varoquaux, G., & Thirion, B. (2017). Seeing it all: Convolutional network layers map the function of the human visual system. Neuroimage, 152, 184–194. https://doi.org/10.1016/j.neuroimage.2016.10.001. PMID: 27777172.
Fleming, R. W., & Storrs, K. R. (2019). Learning to see stuff. Current Opinion in Behavioral Sciences, 30, 100–108. https://doi.org/10.1016/j.cobeha.2019.07.004. PMID: 31886321. PMCID: PMC6919301.
Foerster, J. N., Gilmer, J., Sohl-Dickstein, J., Chorowski, J., & Sussillo, D. (2017). Input switched affine networks: An RNN architecture designed for interpretability. In Proceedings of the 34th International Conference on Machine Learning (Vol. 70, pp. 1136–1145). Sydney, Australia.
Fong, R. C., Scheirer, W. J., & Cox, D. D. (2018). Using human brain activity to guide machine learning. Scientific Reports, 8, 5397. https://doi.org/10.1038/s41598-018-23618-6. PMID: 29599461. PMCID: PMC5876362.
Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. https://doi.org/10.1007/BF00344251. PMID: 7370364.
Gaier, A., & Ha, D. (2019). Weight agnostic neural networks. arXiv preprint arXiv:1906.04358.
Geirhos, R., Janssen, D. H., SchÃŒtt, H. H., Rauber, J., Bethge, M., & Wichmann, F. A. (2017). Comparing deep neural networks against humans: Object recognition when the signal gets weaker. arXiv preprint arXiv:1706.06969.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231.
Geirhos, R., Temme, C. R., Rauber, J., SchÃŒtt, H. H., Bethge, M., & Wichmann, F. A. (2018). Generalisation in humans and deep neural networks. In Advances in neural information processing systems (pp. 7538–7550).
Ghodrati, M., Farzmahdi, A., Rajaei, K., Ebrahimpour, R., & Khaligh-Razavi, S.-M. (2014). Feedforward object-vision models only tolerate small image variations compared to human. Frontiers in Computational Neuroscience, 8, 74. https://doi.org/10.3389/fncom.2014.00074. PMID: 25100986. PMCID: PMC4103258.
Golan, T., Raju, P. C., & Kriegeskorte, N. (2019). Controversial stimuli: Pitting neural networks against each other as models of human recognition. arXiv preprint arXiv:1911.09288.
GÃŒçlÃŒ, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35, 10005–10014. https://doi.org/10.1523/JNEUROSCI.5023-14.2015. PMID: 26157000. PMCID: PMC6605414.
GÃŒnthner, M. F., Cadena, S. A., Denfield, G. H., Walker, E. Y., Tolias, A. S., Bethge, M., et al. (2019). Learning divisive normalization in primary visual cortex. bioRxiv, 767285. https://doi.org/10.32470/CCN.2019.1211-0.
Hasani, H., Soleymani, M., & Aghajan, H. (2019). Surround modulation: A bio-inspired connectivity structure for convolutional neural networks. In Advances in neural information processing systems (pp. 15877–15888). Vancouver, Canada.
Hasson, U., Nastase, S. A., & Goldstein, A. (2019). Robust-fit to nature: An evolutionary perspective on biological (and artificial) neural networks. bioRxiv, 764258. https://doi.org/10.1101/764258.
Hong, H., Yamins, D. L. K., Majaj, N. J., & DiCarlo, J. J. (2016). Explicit information for category-orthogonal object properties increases along the ventral stream. Nature Neuroscience, 19, 613–622. https://doi.org/10.1038/nn.4247. PMID: 26900926.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154. https://doi.org/10.1113/jphysiol.1962.sp006837. PMID: 14449617. PMCID: PMC1359523.
Jacob, G., Pramod, R. T., Katti, H., & Arun, S. P. (2019). Do deep neural networks see the way we do? bioRxiv, 860759. https://doi.org/10.1101/860759.
Jaegle, A., Mehrpour, V., Mohsenzadeh, Y., Meyer, T., Oliva, A., & Rust, N. (2019). Population response magnitude variation in inferotemporal cortex predicts image memorability. eLife, 8, e47596. https://doi.org/10.7554/eLife.47596. PMID: 31464687. PMCID: PMC6715346.
Jarvers, C., & Neumann, H. (2019). Incorporating feedback in convolutional neural networks. In Proceedings of the Cognitive Computational Neuroscience Conference (pp. 395–398). Berlin, Germany. https://doi.org/10.32470/CCN.2019.1191-0.
Jozwik, K. M., Kriegeskorte, N., Storrs, K. R., & Mur, M. (2017). Deep convolutional neural networks outperform feature-based but not categorical models in explaining object similarity judgments. Frontiers in Psychology, 8, 1726. https://doi.org/10.3389/fpsyg.2017.01726. PMID: 29062291. PMCID: PMC5640771.
Kar, K., Kubilius, J., Schmidt, K., Issa, E. B., & DiCarlo, J. J. (2019). Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior. Nature Neuroscience, 22, 974–983. https://doi.org/10.1038/s41593-019-0392-5. PMID: 31036945.
Kay, K. N. (2018). Principles for models of neural information processing. Neuroimage, 180, 101–109. https://doi.org/10.1016/j.neuroimage.2017.08.016. PMID: 28793238.
Kell, A. J. E., Yamins, D. L. K., Shook, E. N., Norman-Haignere, S. V., & McDermott, J. H. (2018). A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98, 630–644. https://doi.org/10.1016/j.neuron.2018.03.044. PMID: 29681533.
Khaligh-Razavi, S.-M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10, e1003915. https://doi.org/10.1371/journal.pcbi.1003915. PMID: 25375136. PMCID: PMC4222664.
Kheradpisheh, S. R., Ghodrati, M., Ganjtabesh, M., & Masquelier, T. (2016). Deep networks can resemble human feed-forward vision in invariant object recognition. Scientific Reports, 6, 32672. https://doi.org/10.1038/srep32672. PMID: 27601096. PMCID: PMC5013454.
Kietzmann, T. C., McClure, P., & Kriegeskorte, N. (2019). Deep neural networks in computational neuroscience. In Oxford research encyclopedia of neuroscience. Oxford: Oxford University Press. https://doi.org/10.1093/acrefore/9780190264086.013.46.
Kim, J., Linsley, D., Thakkar, K., & Serre, T. (2019). Disentangling neural mechanisms for perceptual grouping. arXiv preprint arXiv:1906.01558. https://doi.org/10.32470/CCN.2019.1130-0.
Kim, B., Reif, E., Wattenberg, M., & Bengio, S. (2019). Do neural networks show Gestalt phenomena? An exploration of the law of closure. arXiv preprint arXiv:1903.01069.
Kindel, W. F., Christensen, E. D., & Zylberberg, J. (2019). Using deep learning to probe the neural code for images in primary visual cortex. Journal of Vision, 19, 29. https://doi.org/10.1167/19.4.29. PMID: 31026016. PMCID: PMC6485988.
King, M. L., Groen, I. I. A., Steel, A., Kravitz, D. J., & Baker, C. I. (2019). Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. Neuroimage, 197, 368–382. https://doi.org/10.1016/j.neuroimage.2019.04.079. PMID: 31054350. PMCID: PMC6591094.
Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. (2019). Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning (pp. 3519–3529). Long Beach, CA.
Kriegeskorte, N. (2015). Deep neural networks: A new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1, 417–446. https://doi.org/10.1146/annurev-vision-082114-035447. PMID: 28532370.
Kriegeskorte, N., & Golan, T. (2019). Neural network models and deep learning. Current Biology, 29, R231–R236. https://doi.org/10.1016/j.cub.2019.02.034. PMID: 30939301.
Kriegeskorte, N., Mur, M., & Bandettini, P. (2008). Representational similarity analysis: Connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4. https://doi.org/10.3389/neuro.06.004.2008. PMID: 19104670. PMCID: PMC2605405.
Kubilius, J., Bracci, S., & Op de Beeck, H. P. (2016). Deep neural networks as a computational model for human shape sensitivity. PLoS Computational Biology, 12, e1004896. https://doi.org/10.1371/journal.pcbi.1004896. PMID: 27124699. PMCID: PMC4849740.
Kubilius, J., Schrimpf, M., Kar, K., Hong, H., Majaj, N. J., Rajalingham, R., et al. (2019). Brain-like object recognition with high-performing shallow recurrent ANNs. arXiv preprint arXiv:1909.06161.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40, e253. https://doi.org/10.1017/S0140525X16001837. PMID: 27881212.
Lake, B. M., Zaremba, W., Fergus, R., & Gureckis, T. M. (2015). Deep neural networks predict category typicality ratings for images. In R. Dale, C. Jennings, P. Maglio, T. Matlock, D. Noelle, A. Warlaumont, & J. Yoshimi (Eds.), Proceedings of the 37th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551. https://doi.org/10.1162/neco.1989.1.4.541.
Lillicrap, T. P., & Kording, K. P. (2019). What does it mean to understand a neural network? arXiv preprint arXiv:1907.06374.
Lindsay, G. W., & Miller, K. D. (2018). How biological attention mechanisms improve task performance in a large-scale visual system model. eLife, 7, e38105. https://doi.org/10.7554/eLife.38105.
Lindsay, G. W., Rubin, D. B., & Miller, K. D. (2019). A simple circuit model of visual cortex explains neural and behavioral aspects of attention. bioRxiv, 875534. https://doi.org/10.1101/2019.12.13.875534.
Lindsey, J., Ocko, S. A., Ganguli, S., & Deny, S. (2019). A unified theory of early visual representations from retina to cortex through anatomically constrained deep CNNs. bioRxiv, 511535. https://doi.org/10.1101/511535.
Lotter, W., Kreiman, G., & Cox, D. (2016). Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104.
Love, B. C., Guest, O., Slomka, P., Navarro, V. M., & Wasserman, E. (2017). Deep networks as models of human and animal categorization. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society (pp. 1457–1458). London, UK.
Maheswaranathan, N., McIntosh, L. T., Kastner, D. B., Melander, J., Brezovec, L., Nayebi, A., et al. (2018). Deep learning models reveal internal structure and diverse computations in the retina under natural scenes. bioRxiv, 340943.
Mallat, S. (2016). Understanding deep convolutional networks. Philosophical Transactions of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 374, 20150203. https://doi.org/10.1098/rsta.2015.0203. PMID: 26953183. PMCID: PMC4792410.
Matteucci, G., Marotti, R. B., Riggi, M., Rosselli, F. B., & Zoccolan, D. (2019). Nonlinear processing of shape information in rat lateral extrastriate cortex. Journal of Neuroscience, 39, 1649–1670. https://doi.org/10.1523/JNEUROSCI.1938-18.2018. PMID: 30617210. PMCID: PMC6391565.
McIntosh, L. T., Maheswaranathan, N., Nayebi, A., Ganguli, S., & Baccus, S. A. (2016). Deep learning models of the retinal response to natural scenes. In Advances in neural information processing systems (pp. 1361–1369). Barcelona, Spain.
Merel, J., Ahuja, A., Pham, V., Tunyasuvunakool, S., Liu, S., Tirumala, D., et al. (2018). Hierarchical visuomotor control of humanoids. arXiv preprint arXiv:1811.09656.
Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204–2212). Montreal, Canada.
Montobbio, N., Bonnasse-Gahot, L., Citti, G., & Sarti, A. (2019). KerCNNs: Biologically inspired lateral connections for classification of corrupted images. arXiv preprint arXiv:1910.08336.
Morcos, A. S., Barrett, D. G. T., Rabinowitz, N. C., & Botvinick, M. (2018). On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.
Murty, N. A. R., & Arun, S. P. (2017). A balanced comparison of object invariances in monkey IT neurons. eNeuro, 4, ENEURO.0333-16.2017. https://doi.org/10.1523/ENEURO.0333-16.2017. PMID: 28413827. PMCID: PMC5390242.
Murty, N. A. R., & Arun, S. P. (2018). Multiplicative mixing of object identity and image attributes in single inferior temporal neurons. Proceedings of the National Academy of Sciences, U.S.A., 115, E3276–E3285. https://doi.org/10.1073/pnas.1714287115. PMID: 29559530. PMCID: PMC5889630.
Nguyen, A., Yosinski, J., & Clune, J. (2019). Understanding neural networks via feature visualization: A survey. arXiv preprint arXiv:1904.08939. https://doi.org/10.1007/978-3-030-28954-6_4.
Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature visualization. Distill, 2, e7. https://doi.org/10.23915/distill.00007.
Paine, T. L., Colmenarejo, S. G., Wang, Z., Reed, S., Aytar, Y., Pfaff, T., et al. (2018). One-shot high-fidelity imitation: Training large-scale deep nets with RL. arXiv preprint arXiv:1810.05017.
Peterson, J. C., Abbott, J. T., & Griffiths, T. L. (2018). Evaluating (and improving) the correspondence between deep neural networks and human representations. Cognitive Science, 42, 2648–2669. https://doi.org/10.1111/cogs.12670. PMID: 30178468.
Pinto, N., Cox, D. D., & DiCarlo, J. J. (2008). Why is real-world visual object recognition hard? PLoS Computational Biology, 4, e27. https://doi.org/10.1371/journal.pcbi.0040027. PMID: 18225950. PMCID: PMC2211529.
Rajaei, K., Mohsenzadeh, Y., Ebrahimpour, R., & Khaligh-Razavi, S.-M. (2019). Beyond core object recognition: Recurrent processes account for object recognition under occlusion. PLoS Computational Biology, 15, e1007001. https://doi.org/10.1371/journal.pcbi.1007001. PMID: 31091234. PMCID: PMC6538196.
Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale, high-resolution comparisons of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38, 7255–7269. https://doi.org/10.1523/JNEUROSCI.0388-18.2018. PMID: 30006365. PMCID: PMC6096043.
Rawat, W., & Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29, 2352–2449. https://doi.org/10.1162/neco_a_00990. PMID: 28599112.
Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., et al. (2019). A deep learning framework for neuroscience. Nature Neuroscience, 22, 1761–1770. https://doi.org/10.1038/s41593-019-0520-2. PMID: 31659335. PMCID: PMC7115933.
Riesenhuber, M., & Poggio, T. (2000). Computational models of object recognition in cortex: A review (CBCL Paper 190/AI Memo 1695). Cambridge, MA: MIT Press.https://doi.org/10.21236/ADA458109.
Roelfsema, P. R., van Ooyen, A., & Watanabe, T. (2010). Perceptual learning rules based on reinforcers and attention. Trends in Cognitive Sciences, 14, 64–71. https://doi.org/10.1016/j.tics.2009.11.005. PMID: 20060771. PMCID: PMC2835467.
Rosenfeld, A., Solbach, M. D., & Tsotsos, J. K. (2018). Totally looks like – How humans compare, compared to machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 1961–1964). Salt Lake City, UT. https://doi.org/10.1109/CVPRW.2018.00262.
Roy, P., Ghosh, S., Bhattacharya, S., & Pal, U. (2018). Effects of degradations on deep neural network architectures. arXiv preprint arXiv:1807.10108.
Rubin, D. B., Van Hooser, S. D., & Miller, K. D. (2015). The stabilized supralinear network: A unifying circuit motif underlying multi-input integration in sensory cortex. Neuron, 85, 402–417. https://doi.org/10.1016/j.neuron.2014.12.026. PMID: 25611511. PMCID: PMC4344127.
Sacramento, J., Costa, R. P., Bengio, Y., & Senn, W. (2018). Dendritic cortical microcircuits approximate the backpropagation algorithm. In Advances in neural information processing systems (pp. 8735–8746). Montreal, Canada.
Salakhutdinov, R. (2015). Learning deep generative models. Annual Review of Statistics and Its Application, 2, 361–385. https://doi.org/10.1146/annurev-statistics-010814-020120.
Samek, W., & Müller, K.-R. (2019). Towards explainable artificial intelligence. In W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, & K.-R. Müller (Eds.), Explainable AI: Interpreting, explaining and visualizing deep learning (pp. 5–22). Cham, Switzerland: Springer. https://doi.org/10.1007/978-3-030-28954-6_1.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
Scholte, H. S., Losch, M. M., Ramakrishnan, K., de Haan, E. H. F., & Bohte, S. M. (2018). Visual pathways from the perspective of cost functions and multi-task deep neural networks. Cortex, 98, 249–261. https://doi.org/10.1016/j.cortex.2017.09.019. PMID: 29150140.
Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., et al. (2018). Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv, 407007. https://doi.org/10.1101/407007.
Seeliger, K., Fritsche, M., Güçlü, U., Schoenmakers, S., Schoffelen, J.-M., Bosch, S. E., et al. (2018). Convolutional neural network-based encoding and decoding of visual object recognition in space and time. Neuroimage, 180, 253–266. https://doi.org/10.1016/j.neuroimage.2017.07.018. PMID: 28723578.
Serre, T. (2019). Deep learning: The good, the bad, and the ugly. Annual Review of Vision Science, 5, 399–426. https://doi.org/10.1146/annurev-vision-091718-014951. PMID: 31394043.
Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., & Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, 165, 33–56. https://doi.org/10.1016/S0079-6123(06)65004-8.
Shen, G., Horikawa, T., Majima, K., & Kamitani, Y. (2019). Deep image reconstruction from human brain activity. PLoS Computational Biology, 15, e1006633. https://doi.org/10.1371/journal.pcbi.1006633. PMID: 30640910. PMCID: PMC6347330.
Shi, J., Wen, H., Zhang, Y., Han, K., & Liu, Z. (2018). Deep recurrent neural network reveals a hierarchy of process memory during dynamic natural vision. Human Brain Mapping, 39, 2269–2282. https://doi.org/10.1002/hbm.24006. PMID: 29436055. PMCID: PMC5895512.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216. https://doi.org/10.1146/annurev.neuro.24.1.1193. PMID: 11520932.
Sinz, F., Ecker, A. S., Fahey, P., Walker, E., Cobos, E., Froudarakis, E., et al. (2018). Stimulus domain transfer in recurrent models for large scale cortical population prediction on video. In Advances in neural information processing systems (pp. 7199–7210). Montreal, Canada. https://doi.org/10.1101/452672.
Spoerer, C. J., Kietzmann, T. C., & Kriegeskorte, N. (2019). Recurrent networks can recycle neural resources to flexibly trade speed for accuracy in visual recognition. bioRxiv, 677237. https://doi.org/10.32470/CCN.2019.1068-0.
Spoerer, C. J., McClure, P., & Kriegeskorte, N. (2017). Recurrent convolutional neural networks: A better model of biological object recognition. Frontiers in Psychology, 8, 1551. https://doi.org/10.3389/fpsyg.2017.01551.
Storrs, K. R., & Kriegeskorte, N. (2019). Deep learning for cognitive neuroscience. In M. Gazzaniga (Ed.), The cognitive neurosciences (6th ed.). Cambridge, MA: MIT Press.
Sussillo, D., & Barak, O. (2013). Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25, 626–649. https://doi.org/10.1162/NECO_a_00409. PMID: 23272922.
Tacchetti, A., Isik, L., & Poggio, T. (2017). Invariant recognition drives neural representations of action sequences. PLoS Computational Biology, 13, e1005859. https://doi.org/10.1371/journal.pcbi.1005859. PMID: 29253864. PMCID: PMC5749869.
Tang, H., Schrimpf, M., Lotter, W., Moerman, C., Paredes, A., Caro, J. O., et al. (2018). Recurrent computations for visual pattern completion. Proceedings of the National Academy of Sciences, U.S.A., 115, 8835–8840. https://doi.org/10.1073/pnas.1719397115. PMID: 30104363. PMCID: PMC6126774.
Thompson, J. A. F., Bengio, Y., Formisano, E., & Schönwiesner, M. (2018). How can deep learning advance computational modeling of sensory information processing? arXiv preprint arXiv:1810.08651.
Thorat, S., van Gerven, M., & Peelen, M. (2019). The functional role of cue-driven feature-based feedback in object recognition. arXiv preprint arXiv:1903.10446. https://doi.org/10.32470/CCN.2018.1044-0.
Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW) (pp. 1–5). Jerusalem, Israel. https://doi.org/10.1109/ITW.2015.7133169.
Tripp, B. P. (2017). Similarities and differences between stimulus tuning in the inferotemporal visual cortex and convolutional networks. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 3551–3560). Anchorage, AK. https://doi.org/10.1109/IJCNN.2017.7966303.
Tschopp, F. D., Reiser, M. B., & Turaga, S. C. (2018). A connectome based hexagonal lattice convolutional network model of the Drosophila visual system. arXiv preprint arXiv:1806.04793.