Temporal Hierarchical Memory: Collected Papers


Belief Propagation and Wiring Length Optimization as Organizing Principles for Cortical Microcircuits

Dileep George
Electrical Engineering, Stanford University & Numenta Inc., Menlo Park, CA
[email protected]

Jeff Hawkins
Redwood Neuroscience Institute & Numenta Inc., Menlo Park, CA
[email protected]

Abstract

In this paper we explore how functional and anatomical constraints and resource optimization could be combined to obtain a canonical cortical micro-circuit and an explanation for its laminar organization. We start with the assumption that cortical regions are involved in Bayesian Belief Propagation. This imposes a set of constraints on the type of neurons and the connection patterns between neurons in that region. In addition, there are anatomical constraints that a region has to adhere to. There are several different configurations of neurons consistent with both these constraints. Among all such configurations, it is reasonable to expect that Nature has chosen the configuration with the minimum wiring length. We cast the problem of finding the optimum configuration as a combinatorial optimization problem. A near-optimal solution to this problem matched anatomical and physiological data. As the result of this investigation, we propose a canonical cortical micro-circuit that will support Bayesian Belief Propagation computation and whose laminar organization is near optimal in its wiring length. We describe how the details of this circuit match many of the anatomical and physiological findings and discuss the implications of these results for experimenters and theorists.

1 Introduction

Perceptual systems have to deal with uncertain information in the world. Thus Bayesian techniques have come to be widely viewed as learning and inference mechanisms employed by the cortex. Bayesian Belief Propagation (BBP), introduced by Pearl [6], is among the most successful inference algorithms in computer vision and machine learning. In [5], Lee and Mumford suggest that cortical regions could actually be doing BBP computations, without giving details of the required mechanisms. Recent work by Rao [7] and Deneve [2] shows that Bayesian Belief Propagation can be implemented in spiking neurons. They did not investigate an anatomical connection, and treated single neurons as the BBP computation engine, thereby restricting them to encode binary states. What are the neural and anatomical substrates of the Bayesian computations employed by neo-cortex?

The neo-cortex in mammals is believed by many to have a surprisingly prototypical architecture that remains consistent across different species [3]. In all the examined species, the neurons in the cortical sheet are organized into six layers, with the top layer mostly filled with axons [9]. Several researchers have proposed canonical cortical circuits [3] that are replicated all over cortex. Is there a canonical cortical circuit for Bayesian inference? If yes, is that circuit related to the prototypical laminar organization of the cortex?

Many researchers have explored the role of wiring length minimization in the organization of neocortex [10, 1, 8]. The positioning of cortical regions in three-dimensional space obtained as a result of wiring length optimization matched the positioning of cortical areas on the cortical surface [1]. Does the laminar organization of neurons within cortical regions also follow from such a principle?

In this paper we investigate these questions by combining the principles of BBP computations with anatomical constraints and wiring length optimization. The requirement that a cortical region should implement Bayesian Belief Propagation imposes a set of constraints on the type of neurons and the connection patterns between neurons in that region (Section 2). Moreover, there are anatomical constraints that a region has to adhere to (Section 3). There are several different configurations of neurons consistent with both these constraints (Section 3). Among all such configurations, it is reasonable to expect that Nature has chosen the configuration with the minimum wiring length. We show how to calculate the wiring lengths for these configurations and explore the solution space (Section 4). As a result of this investigation, we propose a canonical cortical micro-circuit that will support BBP computation and whose laminar organization is near optimal in its wiring length. We discuss how the details of this circuit match many of the anatomical and physiological findings (Section 5). These results have several anatomical and physiological implications (Section 6).

2 Bayesian Belief Propagation, Cell Types and Connections

In this section we describe the assumptions involved in the mapping of a Bayesian network to the cortical hierarchy. Every region of the cortex can be thought of as maintaining a set of hypotheses in relation to the concepts encoded by its surrounding regions. The hypotheses at a higher region in the cortex are causally linked to the ones in the lower level. The set of hypotheses encoded by a region can be considered a random variable, with cortical columns encoding its particular values. Each region maintains the association of its hypotheses with the causes in a probability table. Observed information anywhere in the cortex can alter the probability values associated with hypotheses maintained everywhere else. This is done through Bayesian Belief Propagation. In general, the networks can have loops, and we assume that the inference is done through loopy BP. This does not affect our results.

With these broad assumptions, the inputs and outputs of a region of cortex can be mapped to Belief Propagation messages. A cortical region receives input messages from regions hierarchically above and below it, through feed-forward and feedback connections. The role of the cortical region is to update its Belief based on these messages and to derive the messages to be sent to its parents and children using outgoing connections. These computations are performed using the Bayesian Belief Propagation equations shown below. These equations were adapted from [6] and are for singly connected, tree-structured networks. We assume this type of topology for the rest of this paper.

$$\lambda(x_k) = \prod_j \lambda_{Y_j}(x_k) \qquad (1)$$

$$\lambda_X(u_m) = \sum_x \lambda(x)\, P(x \mid u_m) \qquad (2)$$


Figure 1: (a) Inputs and outputs of a node in Bayesian Belief Propagation. (b) Cortical sheet and its position with respect to the white matter and the skull. (c) Idealization of a slice of a cortical region corresponding to the rectangle in (b). At the top is the skull and at the bottom the white matter. Different cortical layers are oriented horizontally. Each square in the grid can hold one neuron (cell).

$$\pi(x_k) = \sum_u P(x_k \mid u)\, \pi_X(u) \qquad (3)$$

$$\mathrm{BEL}(x_k) = \lambda(x_k)\, \pi(x_k) \qquad (4)$$

$$\pi_{Y_j}(x_k) = \pi(x_k) \prod_{i \neq j} \lambda_{Y_i}(x_k) \qquad (5)$$

These equations are described with respect to a region which encodes X, with two child regions encoding Y1 and Y2 and a parent region encoding U, as illustrated in figure 1(a). The feed-forward input axons and feed-forward output axons carry the $\lambda$ messages, and the feedback axons carry the $\pi$ messages.
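As a concrete illustration, equations (1)-(5) can be sketched numerically for a single region. This is our own minimal sketch, not code from the paper; the function and variable names are hypothetical, and it assumes strictly positive messages so the per-child division implementing equation (5) is safe.

```python
import numpy as np

def update_region(P, lam_Y, pi_X_in):
    """One Belief Propagation update for a region encoding X.

    P        : (M, M) array with P[i, j] = P(x_i | u_j)
    lam_Y    : list of (M,) arrays, lambda messages from the child regions
    pi_X_in  : (M,) array, pi message from the parent region U
    Assumes all messages are strictly positive.
    """
    lam = np.prod(lam_Y, axis=0)           # eq. (1): lambda(x) = prod_j lambda_Yj(x)
    lam_X_out = P.T @ lam                  # eq. (2): lambda_X(u) = sum_x lambda(x) P(x|u)
    pi = P @ pi_X_in                       # eq. (3): pi(x) = sum_u P(x|u) pi_X(u)
    bel = lam * pi                         # eq. (4): BEL(x) = lambda(x) pi(x)
    bel /= bel.sum()                       # normalization (see the Discussion on inhibition)
    # eq. (5): pi_Yj(x) = pi(x) * prod_{i != j} lambda_Yi(x) = pi(x) lambda(x) / lambda_Yj(x)
    pi_Y_out = [pi * lam / lY for lY in lam_Y]
    return lam_X_out, bel, pi_Y_out
```

Each output maps to an axon group in the circuit below: `lam_X_out` is the feed-forward output, `pi_Y_out` the per-child feedback outputs, and `bel` the Belief of the region.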

Implementing the Bayesian Belief Propagation equations in a cortical region will require a diverse set of neurons with different formats of connections. We postulate the existence of 5 types of cells for the implementation of the 5 equations given above. Cell type 1, C1, is the recipient of the feed-forward messages from child regions. A set of such cells do the operation defined in equation 1. This is illustrated in figure 2(a). These cells multiply their inputs together, and they do not have weights associated with their synapses.

C2 is a set of cells which implement equation 2. These cells receive inputs from the C1 cells defined above. The synapses of these cells implement a sum-product operation. The synapse from cell C1,i to cell C2,j stores the weight P(x_i | u_j). This is illustrated in figure 2(b). C2 cells send their feed-forward outputs to higher level cortical regions via the message $\lambda_X$.

The inputs to C3 cells are the feedback messages from higher level cortical areas. These messages are converted to the language of the local cortical region according to equation 3. Similar to C2 cells, these cells also have synaptic weights. The synapse between cell C3,i and the feedback axon u_j stores the weight P(x_i | u_j). The outputs of cells C3 are processed internal to the region.

The Belief of a region, according to equation 4, is calculated at the output of cells C4. These cells receive their inputs from cells C1 and cells C3 and combine them multiplicatively to obtain the Belief value according to equation 4. This is illustrated in figure 2(c).

Cells of type 5, C5, project to child regions and carry the feedback messages to those regions. According to equation 5, the feedback message is specific to each child region. Hence there will be as many groups of C5 cells as there are child regions. These cells combine a selected portion of the feed-forward messages along with the outputs of cells C3 to form the feedback messages.

As described above, implementing the Belief Propagation Equations in a neuronal circuit


Figure 2: (a) to (d) show the different neuron types and the connections between them arising out of equations 1 to 4. These connection constraints can be encoded as a matrix B (see text).

automatically imposes a constraint on the connectivity between the elements of the circuit. This connectivity constraint can be expressed in the form of a connection matrix B, with the neuronal elements and the input and output axon labels along the rows and columns of this matrix.
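As an illustration, the connection matrix B implied by equations (1)-(5) can be written out explicitly for a small region. The terminal labels and the M = 2 example below are our own; the paper does not give B in this form.

```python
import numpy as np

# Hypothetical sketch of the connection matrix B for a region with M = 2
# states and one child per feed-forward input: rows/cols index the circuit
# terminals (cells C1..C5 plus input and output axons), and B[a, b] = 1
# iff equations (1)-(5) require a connection from terminal a to terminal b.
M = 2
labels = []
for t in ("in_ff", "C1", "C2", "C3", "C4", "C5", "in_fb"):
    labels += [f"{t}{i}" for i in range(M)]
idx = {name: k for k, name in enumerate(labels)}

B = np.zeros((len(labels), len(labels)), dtype=int)
for i in range(M):
    B[idx[f"in_ff{i}"], idx[f"C1{i}"]] = 1        # eq. (1): feed-forward input -> C1
    for j in range(M):
        B[idx[f"C1{i}"], idx[f"C2{j}"]] = 1       # eq. (2): C1 -> C2, weight P(x_i|u_j)
        B[idx[f"in_fb{j}"], idx[f"C3{i}"]] = 1    # eq. (3): feedback axon -> C3
    B[idx[f"C1{i}"], idx[f"C4{i}"]] = 1           # eq. (4): C1 -> C4
    B[idx[f"C3{i}"], idx[f"C4{i}"]] = 1           # eq. (4): C3 -> C4
    B[idx[f"C3{i}"], idx[f"C5{i}"]] = 1           # eq. (5): C3 -> C5
    B[idx[f"in_ff{i}"], idx[f"C5{i}"]] = 1        # eq. (5): selected feed-forward -> C5
```

A matrix of this form is what enters the wiring-length objective of the next sections: each nonzero entry is a wire whose length depends on where its two terminals are placed.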

3 Spatial arrangement of cells within a cortical region

Cortical regions are arranged on a thin sheet covering the white matter. On one side of a cortical sheet is the skull and on the other side the white matter. Hence all the input (output) axons of a cortical region enter (exit) the cortical region via the white-matter side (figure 1b). We can approximate a slice of a region of cortex as a rectangular box with all the inputs and outputs interfacing at the bottom of the rectangle. What would be the best way to arrange the cells within this slice of cortex?

Note that we can achieve the functionality of Bayesian Belief Propagation as long as we maintain the correct connectivity among the neurons. However, different spatial arrangements of these neurons will use different wiring lengths to maintain this connectivity. It is reasonable to assume that among all configurations with the same functionality, Nature would choose the one with the minimum requirement of resources. This enables us to cast the problem of placing the neurons within a cortical region as an optimization problem.

Minimize

$$\sum_{(i,j) \in B} \|x_i - x_j\|_1 \qquad (6)$$

subject to no overlap between cells $\qquad (7)$

where $x_i$ and $x_j$ correspond to the spatial locations of terminals and B is the connection constraints matrix. Although the objective in this problem is convex, solving this problem with non-overlap constraints involves a combinatorial optimization.
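As a sketch, objective (6) can be evaluated directly for any candidate placement. The representation below (a 0/1 terminal-to-terminal matrix B and one grid coordinate per terminal) is our own simplification of the formulation above.

```python
import numpy as np

def wiring_length(B, pos):
    """Objective (6): total L1 wire length over the required connections.

    B   : (n, n) 0/1 matrix; B[i, j] = 1 if terminals i and j must be connected
    pos : (n, 2) array of grid coordinates for the n terminals
    """
    total = 0.0
    for i in range(len(pos)):
        for j in range(len(pos)):
            if B[i, j]:
                total += np.abs(pos[i] - pos[j]).sum()   # ||x_i - x_j||_1
    return total
```

The non-overlap constraint (7) is handled implicitly by only evaluating placements that assign distinct grid cells to distinct terminals.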

We can use known facts about cortical organization to reduce the complexity of this problem. The vertical dimension of the cortical rectangle is only a few layers deep. The horizontal dimension is variable. The number of states that a cortical region will have to represent is typically much more than the number of cells that can be accommodated along the vertical dimension of the cortical region. Thus it is reasonable to assume that the states of the region are represented by neurons along the horizontal dimension of the cortical region. We thus divide the horizontal dimension of the cortical region into a number of compartments. We make the simplifying assumption that each compartment corresponds to a particular state of the region. Note that this arrangement corresponds to a columnar organization of the cortex, as has been observed using several anatomical and physiological experiments.

This leaves the vertical dimension for cells to support the Belief Propagation operations related to various states of the region. Since we have five different equations and types of cells as given by the Belief Propagation equations, we divide the vertical dimension of the cortical region into five compartments. This gives us a grid over which we can place cells. Each rectangle in the grid can accommodate one cell.

With no constraints on placement we have $M!(5!)^M$ different arrangements of the cells within a grid on the idealized cortical region with M states. Knowing that the labeling of the states is arbitrary, we can reduce this to a number of meaningful arrangements equal to $(5!)^M$. From the pattern of connections illustrated in figure 2, and from the fact that we have exactly 5 compartments in the vertical direction, we can conclude that the optimal solution will have the same type of cell in any particular row of the grid. This insight, combined with the columnar organization constraint, helps us reduce the number of search points from $(5!)^M$ to 5! = 120. Thus the approximate optimization problem can be solved by exhaustive search over these 120 possible configurations.
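The exhaustive search itself is trivial to express. The sketch below enumerates the 120 row orderings of the five cell types; the stand-in cost function is our own toy placeholder, not the length function L derived in Section 4.

```python
from itertools import permutations

def toy_length(order):
    # hypothetical placeholder for the total length function L of Section 4
    return sum(abs(a - b) for a, b in zip(order, order[1:]))

def rank_configurations(length_fn):
    # each configuration assigns cell types C1..C5 to the five grid rows
    configs = list(permutations(range(1, 6)))
    assert len(configs) == 120
    return sorted(configs, key=length_fn)   # minimum wiring length first

ranked = rank_configurations(toy_length)
```

With the true L plugged in, `ranked[0]` would be the wiring-optimal ordering; the paper inspects the top-ranked configurations against anatomical data.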

4 Length Function

For each of these configurations, we calculated the length function as follows. In the equations below, we let the symbol for a neuron type mean its position within the configuration, in terms of the number of grid positions from the lower edge of the rectangle (figure 1(c)). Let M be the number of states of the region. We assume that the parent region also has the same number of states. N_ch is the number of children, and h and w are the height and width of a grid position. In the calculations below, we assume that the axons branch so as to achieve the least-cost wiring. Thus, the equations derived here depend on the order in which different neuron types are placed on the grid (but the total wiring length does not). We assume that the cells are placed in the order we describe here.

The total length of axons and dendrites required for taking the feed-forward inputs from child regions to obtain $\lambda(X)$ is calculated as

$$l_1 = M N_{ch} h (C_1 - 0.5) \qquad (8)$$

Calculation of the feed-forward messages to be sent out to a higher level region requires taking the outputs of C1, operating on them according to equation 2 and figure 2, and then sending the outputs to the bottom of the cortical region to be sent out to higher level regions. This gives the total length of axons and dendrites required for this operation as

$$l_2 = M h\,|C_1 - C_2| + M \sum_{i=1}^{M-1} (i-1) w + 2 M h (C_2 - 0.5) \qquad (9)$$
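Equations (8) and (9) are straightforward to evaluate once the row positions are fixed. The sketch below is our own transcription of the two closed forms, with C1 and C2 counted in grid rows from the white-matter edge.

```python
def l1(M, N_ch, h, C1):
    # eq. (8): feed-forward input axons rising from the bottom edge to the C1 row
    return M * N_ch * h * (C1 - 0.5)

def l2(M, h, w, C1, C2):
    # eq. (9): C1 -> C2 connections, lateral fan-in across columns,
    # and C2 output axons descending back to the bottom edge
    vertical = M * h * abs(C1 - C2)
    lateral = M * sum((i - 1) * w for i in range(1, M))
    output = 2 * M * h * (C2 - 0.5)
    return vertical + lateral + output
```

The remaining terms l3, l4 and l5 follow the same pattern but branch on the relative ordering of the cell rows, as the piecewise equations below show.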

Calculating the internal values $\pi(X)$ involves taking the feedback messages from a higher level and operating on them according to equation 3. The required length for this operation can be calculated as

$$l_3 = \begin{cases}
2Mh\,|C_3 - C_2| + M \sum_{i=1}^{M-1}(i-1)w & \text{if } C_3 > C_2 > C_1 \\
Mh\,|C_3 - C_1| + Mh\,|C_3 - C_2| + M \sum_{i=1}^{M-1}(i-1)w & \text{if } C_3 > C_1 > C_2 \\
M \sum_{i=1}^{M-1}(i-1)w & \text{if } C_2 > C_3 > C_1 \\
Mh\,|C_3 - C_1| + M \sum_{i=1}^{M-1}(i-1)w & \text{if } C_2 > C_1 > C_3 \\
Mh\,|C_3 - C_2| + M \sum_{i=1}^{M-1}(i-1)w & \text{if } C_1 > C_2 > C_3 \\
Mh\,|C_3 - C_2| + M \sum_{i=1}^{M-1}(i-1)w & \text{if } C_1 > C_3 > C_2
\end{cases} \qquad (10)$$

The Belief states of a region are calculated according to equation 4. This requires taking the outputs of cells C3 and cells C1 and multiplying them element-wise in cells C4.

Figure 3: (a) Plot of wiring length vs. configuration, measured with the optimum configuration at the zero baseline. The position of C (see text) is marked. (b) The laminar arrangement of neurons and the connections between them corresponding to the C configuration. Shown are two cortical columns. The location of cells in this configuration and the connections between them match anatomical data. (c) Laminar organization of cortical micro-circuits: anatomical data adapted from [9] (permission pending). This is included here for the purpose of comparison.

The total length of wires required for this operation is

$$l_4 = \begin{cases}
2Mh\,|C_4 - C_3| + Mh(C_4 - 0.5) & \text{if } C_3 \in (C_1, C_4) \\
Mh\,|C_4 - C_1| + Mh\,|C_4 - C_3| + Mh(C_4 - 0.5) & \text{if } C_1 \in (C_3, C_4) \\
Mh\,|C_4 - C_3| + Mh(C_4 - 0.5) & \text{if } C_4 \in (C_1, C_3)
\end{cases} \qquad (11)$$

Calculation for l5 is similar, but involves enumeration of 24 different cases.

Finally, the total length of connections for a particular configuration is calculated as

$$L = l_1 + l_2 + l_3 + l_4 + l_5 \qquad (12)$$

This is the objective function we attempt to minimize over all configurations.

5 Results: Near-Optimal Solution Matches Anatomical and Physiological Data

In order to find how the wiring length varies as a function of the spatial arrangement of the neurons, we evaluated the objective function described above at all 120 configurations. We sorted these configurations based on their wiring lengths and examined the configurations starting with the ones with the least wiring lengths. We found that a configuration with near-optimal wiring length matched the anatomical data to a great extent. This configuration, denoted henceforth by C, is the second best in wiring length among all configurations, and is only slightly worse (10%) than the best solution when measured as a fraction of the difference between the best and the worst. The spatial arrangement of cells and the anatomical connections resulting from this configuration are shown in figure 3(b). The position of this configuration among other configurations is shown in a wiring-length vs. configuration plot in figure 3(a).

In C, the feed-forward inputs from cortical regions in a lower level of the hierarchy rise to the layer-4 cells. Layer 4 cells then project to layer 2 cells and also send an axon to layer 5 cells. The feedback axons coming from higher level cortical areas rise to layer 1 and spread laterally. Layer 3 neurons with synapses in layer 1 are the targets of this feedback. These layer 3 neurons project to layer 5 and layer 6. The layer 5 neurons project down sub-cortically, and the layer 6 neurons are the source of feedback to cortical areas hierarchically below. The feed-forward axons that project to layer 4 also synapse in layer 6 to perform the computations in equation 5. These details map onto the anatomical data obtained from [9]. In figure 3(c) we compare this configuration with a schematic of anatomical data adapted from [9].

The results of the wiring length minimization were independent of the number of states of the region and the number of children. There were three optimal solutions, and these configurations shared many properties of the near-optimal solution we chose here for its conformance to anatomical data. These solutions differ from C by exactly one interchange operation.

6 Implications

The mapping of Bayesian Belief Propagation onto the cortical micro-circuit has potential implications for experimenters. This framework can help understand and guide physiological experiments. In this section we explore some of the predictions of this mapping and their potential implications.

Anatomical data describe a class of layer 5 neurons that project to sub-cortical areas [9]. Our results imply that these neurons carry the current Belief of a region. The current Belief of a region could be used for making actions or decisions. Emotionally relevant beliefs could be stored for later recall. There are several sub-cortical modules that could make use of the current Belief states of cortical regions. Although the details of the sub-cortical projections are not known, the prediction that these projections carry the Beliefs of a region is potentially significant.

In C, the cortical columns correspond to states of a cortical region. Typically, these cells are divided into two categories: simple and complex. However, this mapping tells us that a more sophisticated explanation, still consistent with the simple/complex mapping, is possible. Layer 4 cells are consistently characterized as simple cells, and layer 2 cells are consistently characterized as complex cells because they pool information from different layer 4 cells. However, layer 3 cells, layer 5 cells and layer 6 cells, which are normally characterized as complex cells, have a richer meaning. They combine contextual information from higher levels with local information. If higher level context is ignored, and the receptive fields of these cells are mapped using a pure feed-forward technique, they will correspond to the complex cell characterization. However, when contextual effects are taken into account, these cells will have a more sophisticated meaning. For example, in a related study [4], we showed how to explain the illusory contour effect [5] and the end-stopping effect using Belief Propagation. Results from the current study show that the illusory contour cells and end-stopping cells will be prevalent in layers 2-3 and layer 5. This is consistent with experimental results [5].

This mapping also provides another way to interpret population coding. It is known that cortical columns show a graded response to stimuli. This finding has largely been interpreted as a coarse coding mechanism. C too predicts a graded response to stimuli [4]. However, in this setting the graded responses correspond to the degree to which the stimulus is likely to belong to the different states of a cortical region. This is applicable at all levels of the cortical hierarchy.

7 Discussion

We derived a cortical micro-circuit and its layout within a laminar cortical architecture based on the principles of Bayesian Belief Propagation and wiring length optimization. The discovery of a near-optimal solution that matches anatomical data is an encouraging development. Several reasons can be cited for the sub-optimality of the solution. The major reason is our ignorance of the exact constraints and objectives that are involved in cortical organization.

We took into account only the role of excitatory neurons and connections in this study. We think that inhibitory neurons play a very significant role in cortical computations. However, we think that these roles are more in terms of keeping a good operating point for the computationally relevant circuits. It is known that normalizing the messages and intermediate values in BBP is required for numerical stability. Such normalization computations would require inhibitory circuits. We think that inhibitory neurons play a significant role during learning as well. We are currently investigating how to include these as part of the optimization. Omitting the contributions from inhibitory neurons could also be one reason for the sub-optimality of our solution.

Deciphering the functional connectivity of the cortical micro-circuit is a formidable task. Several insights can be drawn by comparing it to reverse-engineering an electronic circuit. Although a single transistor can function as an amplifier, a good amplifier is seldom constructed from a single transistor. A good construction involves biasing circuitry which makes sure that the amplifier works properly despite changing temperature conditions, different device characteristics, feedback instabilities, etc. It is reasonable to expect that a similar situation exists within the cortical sheet, where a multitude of neurons are involved in biasing a canonical cortical circuit to function. If the circuit is tested for connectivity when it is not properly biased, one would end up missing some important connections and logging some spurious connections. Hence, deciphering the functional connectivity from an increasing amount of anatomical data will require theories about cortical functions and how they map onto anatomy. We believe that our work is a contribution in that direction.

References

[1] C. Cherniak, Z. Mokhtarzada, R. Rodriguez-Esteban, and K. Changizi. Global optimization of cerebral cortex layout. Proceedings of the National Academy of Sciences of the United States of America, 101(4):1081-1086, Jan 2004.

[2] Sophie Deneve. Bayesian inference in spiking neurons. In Lawrence K. Saul, Yair Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 353-360. MIT Press, Cambridge, MA, 2005.

[3] R. J. Douglas and K. A. Martin. Neuronal circuits of the neocortex. Annual Review of Neuroscience, 27:419-451, 2004.

[4] Dileep George and Jeff Hawkins. A hierarchical Bayesian model of invariant pattern recognition in the visual cortex. In Proceedings of the International Joint Conference on Neural Networks 2005, in press. IEEE Press, 2005.

[5] Tai Sing Lee and David Mumford. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7):1434-1448, Jul 2003.

[6] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers, San Francisco, California, 1988.

[7] Rajesh P. N. Rao. Hierarchical Bayesian inference in networks of spiking neurons. In Lawrence K. Saul, Yair Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1113-1120. MIT Press, Cambridge, MA, 2005.

[8] O. Sporns, G. Tononi, and G. M. Edelman. Theoretical neuroanatomy: relating anatomical and functional connectivity in graphs and cortical connection matrices. Cerebral Cortex, 10(2):127-141, Feb 2000.

[9] Alex M. Thomson and A. Peter Bannister. Interlaminar connections in the neocortex. Cerebral Cortex, 13(1):5-14, 2003.

[10] M. P. Young. Objective analysis of the topological organization of the primate cortical visual system. Nature, 358(6382):152-155, Jul 1992.

Invariant Pattern Recognition using Bayesian Inference on Hierarchical Sequences

Dileep George
Electrical Engineering, Stanford University and Redwood Neuroscience Institute, Menlo Park, CA
[email protected]

Jeff Hawkins
Redwood Neuroscience Institute, Menlo Park, CA
[email protected]

Abstract

Real world objects have persistent structure. However, as we move about in the world the spatio-temporal patterns coming from our sensory organs vary continuously. How the brain creates invariant representations from the always-changing input patterns is a major unanswered question. We propose that the neocortex solves the invariance problem by using a hierarchical structure. Each region in the hierarchy learns and recalls sequences of inputs. Temporal sequences at each level of the hierarchy become the spatial inputs to the next higher regions. Thus the entire memory system stores sequences in sequences. The hierarchical model is highly efficient in that object representations at any level in the hierarchy can be shared among multiple higher order objects; therefore, transformations learned for one set of objects will automatically apply to others. Assuming a hierarchy of sequences, and assuming that each region in the hierarchy behaves equivalently, we derive the optimal Bayes inference rules for any level in the cortical hierarchy, and we show how feedforward and feedback can be understood within this probabilistic framework. We discuss how the hierarchical nested structure of sequences can be learned. We show that static group formation and probability density formation are special cases of remembering sequences. Thus, although normal vision is a temporal process, we are able to recognize flashed static images as well. We use the most basic form of one of these special cases to train an object recognition system that exhibits robust invariant recognition.

1 Introduction

Look at any object in front of you. As you move your head, your eyes, or move towards that object while still looking at it, the images that fall on your retina vary significantly from one instant to another. However, your percept of the object remains stable despite this variation. This is known as the invariance property. Your cortex does not want your perception of an object to vary with every small eye movement or neck tremor. We consider this invariance property as a technique evolved by the cortex to produce stable percepts of this world. How does the cortex achieve this invariance property?

Think of the different retinal images formed by an object. Although the retinal images are different, the underlying cause of all those images is the same: the object itself. An object is composed of several parts. And those parts are tied to the object in a particular way. When the object moves, it produces a particular motion pattern of the parts. The parts themselves causally influence sub-parts. For example, a contour which moves to the left causes a line segment that is part of it to move in a particular way. A particular sequence of movement of a line segment can be caused by a contour or a corner. A particular sequence of movement of a corner could be due to a table or a chair. The same lower level sequences are reused as part of different high level contexts. Thus the world seems to be naturally organized into a hierarchy of sequences. We believe that the cortex is capturing this causal hierarchical structure of the world using its own hierarchical cortical structure to solve the invariance problem.

Suppose that a region of cortex which can see only a small patch of any image learns all possible ways a line segment can move when it is part of a corner. Now, whenever one of those sequences of movement of that line segment occurs, the region would be able to say that although the inputs are changing they all belong to the same corner. It seems plausible that by learning the sequences in the context of their causal influences, the invariance problem can be tackled. By doing that in a hierarchy, the same lower level representations can be shared among multiple higher level objects. Therefore, invariances learned for one set of objects will automatically apply to others.

The known anatomy of the visual cortex seems to be conducive to this idea. Visual cortex is organized in a hierarchy, and the receptive field size of neurons increases as you go up the hierarchy. Each region in the cortex receives input from below as well as feedback from above. The feedback pathways can provide the contexts of higher level sequences. There are also recurrent connections within a region and between regions via the thalamus, and such connections could store sequences of different durations.

These ideas are discussed and elaborated in [6], and we consider that as the starting point for this work. The rest of this paper is organized as follows. Section 2 is a mathematical description of how Bayesian inference can be done on hierarchical sequences. In this section we show that the large scale and small scale anatomical structure of the visual cortex is consistent with the idea of Bayesian inference on hierarchical sequences. In section 3, we discuss how such hierarchical structures can be learned. In section 4 we describe an invariant pattern recognition system that we built based on a subset of the principles described in sections 2 and 3. We conclude the paper in section 5 with a discussion of related and future work.

2 Inference and Prediction using Hierarchical Sequences

The goal of this section is to illustrate how Bayesian inference and prediction can occur in a hierarchical sequences framework and how it relates to the known anatomical structure of the visual cortex. We assume that images in this world are generated by a hierarchy of causes. A particular cause at one level influences a sequence of causes to unfold in time at a lower level. For clarity and for notational convenience, we consider a three-level hierarchy. Let the random variables $X_i$, $Y_i$ and $Z_i$ denote the highest level, intermediate level and lowest level of causes respectively, where $i$ indexes different regions in space active at the same time. We restrict our analysis to cases with only one highest level cause active at any time. We assume that a particular highest level cause $x_k$ makes a set of sequences $S^{(k)}_{Y_1}$ of $Y_1$s and $S^{(k)}_{Y_2}$ of $Y_2$s more likely to simultaneously occur in the child regions $Y_1$ and $Y_2$ of $X_1$. In other words, the higher level cause $x_k$ is identified as the co-occurrence of a sequence

Figure 1: (A) A particular instantiation of hierarchical sequences. The high level cause x1 of regionX causes either sequence y1y2y3 or sequence y1y1y1 in region Y1 along with sequence y1y1y1 inregion Y2. Elements of these sequences, say for example y2, act as causes for sequences at lowerlevels. A particular sequence at any level (y1y3y2 in Y1) can be part of multiple higher level causes(x2 and x3 in X1). (B). Bayesian inference-prediction architecture of the visual cortex based on thederivations in section 2.

in the set S(1)Y1 and a sequence in the set S(1)Y2

in adjacent intermediate level regions. Thehigh level causes vary at a slower rate compared to the lower level causes. For examplethe higher level cause xk on an average would stay active for a substantial duration of asequence in S(k)Y1 . In a similar fashion the intermediate level causes Yis influence their cor-responding lowest level Z variables and vary at a slower rate compared to the Z sequences.A particular instantiation of these ideas is illustrated in figure 1(A).We assume that the cortical hierarchy matches the causal sequences hierarchy of image gen-eration. This means that there are cortical regions corresponding to the random variablesXi, Yi and Zi. For the rest of the discussion we use these labels also to denote their cor-responding cortical regions. To simplify the analysis, we assume markovity of sequencesat each level. Thus, learning the structure of sequences of region Y1 would mean learningthe probability transition matrix PY1Y1|X1=x1,k for all k. The highest level propagates itselfforward according to PX1X1 . In order to obviate complicated time indexing, we assumethat the slower time variation of the high level sequences are captured within their proba-bility transition matrices. Whenever we condition a sequence of causes in a lower level ona particular cause at the higher level, we implicity assume that the higher level cause hasnot changed for the duration of the lower level sequence.Lets say that at time t, the region X1 wants to make a prediction about the probabilitydistribution of X1(t + 1) assuming that it knows X1(t), Y1(t), Y2(t) and Z1(t), ..., Z4(t).This can be done as

Pred(X1) = P_{X1(t+1) | X1(t), Y1(t), Y2(t), Z1(t), ..., Z4(t)} = P_{X1(t+1) | X1(t)}    (1)

Thus the region X1 needs only its learned and locally stored matrix P_{X1 X1} to make predictions. Similarly, region Yi can make a prediction about the probability distribution of Yi(t+1) according to

Pred(Yi) = P_{Yi(t+1) | Yi(t), X1(t), Z1(t), ..., Z4(t)}    (2)
         = sum_j P_{Yi(t+1) | X1(t+1) = j, Yi(t)} P_{X1(t+1) = j | X1(t)}    (3)
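Equations (1)-(3) amount to chaining locally stored transition matrices: a region's prediction is a mixture of its context-conditional transitions, weighted by the parent's own prediction. A minimal sketch in Python (all names and shapes are illustrative assumptions, not the authors' code):

```python
import numpy as np

def pred_x(P_xx, x_t):
    # Equation (1): Pred(X1) depends only on the local matrix P_{X1 X1}.
    # x_t is a probability (or one-hot) row-vector over the states of X1.
    return x_t @ P_xx

def pred_y(P_yy_given_x, P_xx, x_t, y_t):
    # Equations (2)-(3): region Yi combines the parent's prediction
    # Pred(X1) (fed down to it) with its own locally stored matrices
    # P_{Yi Yi | X1 = j}, stacked here as P_yy_given_x[j].
    px = pred_x(P_xx, x_t)                   # distribution over X1(t+1)
    out = np.zeros(P_yy_given_x.shape[-1])
    for j, pj in enumerate(px):              # the sum over j in eq. (3)
        out += pj * (y_t @ P_yy_given_x[j])
    return out
```

Here `P_yy_given_x[j]` plays the role of P_{Yi Yi | X1 = j}, and the loop implements the sum over j in equation (3).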

Figure 2: The laminar structure of cortical regions is conducive to the Bayesian inference-predictionarchitecture based on hierarchical sequences.

Note that the second term on the right hand side of the above equation, P_{X1(t+1) = j | X1(t)}, is the same as the predictive probability distribution calculated by region X1 in equation 1. Thus, for region Yi to make a prediction about its next state, it has to combine information computed by its parent region with its own locally stored P_{Yi(t+1) | X1(t+1) = j, Yi(t)}. The Pred(X1) information computed by the region X1 therefore has to be fed down to the regions Yi for those regions to make predictions.

Now let's consider the case when, after having observed a sequence of Z1s and Z2s, region Y1 decides to update its estimate of its current state. The optimal estimate of the current state Y1(t+1) is obtained according to the MAP rule.

Y1(t+1) = argmax_{Y1(t+1)} P_{Y1(t+1) | Y1(t), Z1^{t0:t+1}, Z2^{t0:t+1}, X1(t)}    (4)

        = argmax_{Y1(t+1)} P_{Z1^{t0+1:t+1}, Z2^{t0+1:t+1} | Y1(t+1), Z1(t0), Z2(t0)} P_{Y1(t+1) | Y1(t), X1(t)}    (5)

        = argmax_{Y1(t+1)} [P_{Z1^{t0+1:t+1} | Y1(t+1), Z1(t0)} (P_{Y1(t+1) | Y1(t), X1(t)})^{1/2}]
                           [P_{Z2^{t0+1:t+1} | Y1(t+1), Z2(t0)} (P_{Y1(t+1) | Y1(t), X1(t)})^{1/2}]    (6)

where Z_i^{t0:t+1} is a sequence of Zis extending from time t0 to time t+1. In the above equation, the terms within square brackets can be computed in the regions Z1 and Z2 using local information, given that they have the Pred(Y1) information fed down to them. If these regions send up a set of winning arguments Y1(t+1) (let us denote it Infer(Y | Zi)), then the region Y1 can finish the argmax computation exactly and decide the most likely Y1 state based on that information.

The analysis above shows that the idea of hierarchical sequences and the idea of a hierarchical cortex with feedforward and feedback connections are consistent with each other in a Bayesian inference-prediction framework. The feedforward pathways are used to carry the inferences made based on current as well as locally stored past observations. The feedback pathways carry expectations/predictions to lower levels (figure 1(B)). Higher level regions have converging inputs from multiple lower level regions. Thus feedback information from a higher level of the cortex can be used as context to interpret/disambiguate an observed pattern at a lower level. Recognition occurs when the highest level converges on a cause by disambiguating several parallel and competing hypotheses at all levels. This derivation also suggests roles for different cells in the known laminar architecture of cortical regions

Figure 3: (A) Examples of most likely and most unlikely sequences of length 4 observed in a line-drawing movie by a region of size 4x4. The movie was generated by simulated straight-line motions of images drawn using vertical and horizontal lines. The sequences s1, s2, s3 and s4 (read left to right) occurred much more frequently compared to the sequences s5, s6, s7 and s8 (read top to bottom). (B) The higher level object/cause shown, moving left and up, is learned by a Y (Level 2) region receiving inputs from 4 Z (Level 1) regions. In this case the object corresponds to the simultaneous occurrence of s2, s4, s3 and s2 in the top-left, top-right, bottom-right and bottom-left regions respectively.

as shown in figure 2. First-order Markovity was assumed so that we do not have to carry too many terms in the conditional probability calculations. We believe that similar conclusions as above could be drawn if this assumption is relaxed.
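The factorization in equation (6) lets each child region compute its bracketed likelihood term locally and send only that vector up; the parent multiplies the incoming messages with its prediction term and finishes the argmax. A hedged sketch of this combination step (function and variable names are assumptions):

```python
import numpy as np

def map_estimate(child_likelihoods, pred_y1):
    # child_likelihoods: one vector per child region, giving
    #   P(Z_i^{t0+1:t+1} | Y1(t+1), Z_i(t0)) for each candidate Y1(t+1).
    # pred_y1: the prediction term P(Y1(t+1) | Y1(t), X1(t)).
    score = pred_y1.copy()
    for lik in child_likelihoods:
        score = score * lik          # product of bracketed terms, eq. (6)
    return int(np.argmax(score)), score
```

The 1/2 exponents in equation (6) merely split the prediction term evenly between the two bracketed child factors; multiplying the factors back together, as above, recovers the full product of equation (5).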

3 Learning Hierarchical Sequences

How can a region of cortex learn sequences within sequences? Consider a region Y1 receiving the context information of a high level cause X1 = xk. If this region now learns to associate with this context xk all sequences of Y1 that occur at its input while xk is active, then that region is essentially learning sequences within sequences. After learning, whenever a sequence of Y1s occurs at the input, this region can produce the corresponding X1 at its output. For example, if the sequences were Markov, then learning would correspond to learning the matrices P_{Y1 Y1 | X1 = xk} for every k. In this way, a region of cortex can learn to collapse a sequence at its input to one or more higher level causes based on its learning.

The high level causes themselves have to be learned from the low level inputs. This could be done as follows. Lower level regions learn the most frequent sequences of their inputs. After learning, whenever a part of one of those sequences occurs, those regions pass up a pattern corresponding to the sequence. A higher level region with converging inputs from several lower level regions then looks at sequences occurring simultaneously in the low level regions. Patterns of sequences which consistently occur in multiple low level regions become the objects at the next higher level. This process can be repeated between levels of the hierarchy to obtain causes at the highest level. For example, if region Y1 observes that the sequence s^j_{Z1} of region Z1 and the sequence s^k_{Z2} of region Z2 occur at the same time very often, then their combination becomes an object or cause at region Y1. Examples of learned sequences and higher level causes for line drawing movies are shown in figure 3.

Note that if, under the context X1 = xk, a Y region stored only the frequency of occurrences of its inputs Y1, then this corresponds to learning the conditional probability distribution P_{Y1 | X1 = xk}. This is a special case of sequence learning where sequences are of length 1. In the Markov case this would correspond to learning the steady state distribution of the Markov chain P_{Y1 Y1 | X1 = xk}. Now, if under the context X1 = xk we just group all the

Figure 4: Examples of images used to train and test the invariant pattern recognition system. The rows correspond to 3 out of the 91 categories for which the system was trained. Columns A and B correspond to the two different images of a category that were used for training. Columns C and D are examples of test patterns for which the system recognized the image category correctly. Note that the system shows a high degree of shift, scale and distortion invariance. Some of these test images were drawn by our lab mates using a mouse. Column E gives examples of patterns for which the system made an error in recognition, and column F gives the category the column E pattern was incorrectly classified as. The complete set of training and test patterns and the MATLAB code for the system can be downloaded from http://www.rni.org/nips2004/

inputs Y1, then it becomes a special case of learning the probability distribution. In this case the probability distribution is uniform over all Y1s having non-zero probability under the context xk.
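Learning the context-conditional transition matrices P_{Y1 Y1 | X1 = xk} described above reduces to counting transitions separately under each active context and then normalizing. An illustrative sketch, not the authors' implementation:

```python
import numpy as np
from collections import defaultdict

def learn_transitions(observations, n_states):
    # observations: iterable of (context k, y_prev, y_next) triples
    # gathered while the higher level cause x_k was active.
    counts = defaultdict(lambda: np.zeros((n_states, n_states)))
    for context, y_prev, y_next in observations:
        counts[context][y_prev, y_next] += 1
    # Normalize rows to get P_{Y Y | X = x_k}; unseen rows stay uniform.
    matrices = {}
    for k, c in counts.items():
        rows = c.sum(axis=1, keepdims=True)
        matrices[k] = np.where(rows > 0,
                               c / np.maximum(rows, 1e-12),
                               1.0 / n_states)
    return matrices
```

The special case mentioned in the text (sequences of length 1) corresponds to keeping only the marginal counts of y_next per context instead of the full transition matrix.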

4 Simulation of a Line Drawing Recognition System

Using a subset of the principles outlined above, we simulated a hierarchical system for line drawing recognition and measured various aspects of its performance. Instead of storing sequence information, we considered sequences as groups (sets), thus dropping timing information within a sequence. As noted in the previous section, this can be considered as a special case of learning sequences. We do not make use of feedback in this implementation. This can also be considered as a special case where all feedback probability densities are uniform. Using the full set of principles outlined in sections 2 and 3 can only improve the performance of the system.

The system consisted of 3 levels: L1, L2 and L3. The lowest level, L1, consisted of regions receiving inputs from a 4x4 patch of images which were of size 32 pixels by 32 pixels. These 4x4 regions tiled an input image with 2 pixels of overlap between adjacent regions. This overlap between regions ensured that spatial continuity constraints are maintained. Learning started at L1 and proceeded to the higher levels. The L1 regions learned by obtaining the most likely sequences caused by simulated motion of black and white straight-line drawings (figures 3 and 4). For example, vertical lines and all shifts of vertical lines within an L1 region became the vertical line group, and left-bottom corners and all shifts of them formed the left-bottom corner group. With this, an L1 region presented with a vertical line at its input would produce the output vertical line group irrespective of the position of the vertical line within that region. (In our implementation, with 13 groups in L1, it set one out of 13 bits to 1.) For novel patterns appearing at the input, the output was set as the group of the closest (Euclidean distance) familiar pattern.
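The L1 quantization step just described can be sketched as a nearest-neighbor lookup: a novel patch is mapped to the group of the closest stored pattern (Euclidean distance) and emitted as a 1-of-N bit vector. All names and shapes here are hypothetical:

```python
import numpy as np

def assign_group(patch, stored_patterns, pattern_groups, n_groups):
    # stored_patterns: (num_patterns, patch_size) array of familiar patterns
    # pattern_groups:  group index of each stored pattern
    d = np.linalg.norm(stored_patterns - patch.ravel(), axis=1)
    g = pattern_groups[int(np.argmin(d))]   # group of the closest pattern
    out = np.zeros(n_groups)
    out[g] = 1.0                            # one out of n_groups bits set
    return out
```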

Figure 5: Variation of perceptual strength with pattern distortion: For this test we gradually distorted a test pattern of one category (category b) into a pattern of another category (category p). None of the test patterns were identical to the ones used for training the b and p categories. Plotted along the Y axis is the score obtained at L3 for categories b and p when the patterns along the X axis were shown to the system. In the region marked as the b region the pattern was identified as belonging to category b, and similarly for the p region. In the ambiguous region the system classified the pattern as neither b nor p but (erroneously) identified it as various other categories.

An L2 region received its inputs from 16 L1 regions. In our implementation this pattern is of length 208. The groups at L2 were formed in a semi-supervised manner, with learning context coming from L3. We showed the network moving images of objects of a particular category, all the while setting a constant context from L3 to all L2 regions. Thus the L2 regions learned to associate all the inputs from an L1 region that occurred under a particular category context with that category. During the recognition phase, an L2 region set at its output the category memberships of the pattern it received at its input. If the membership was null, it output an all-zero pattern. Thus, during the recognition phase, each L2 region sent up its multiple hypotheses regarding the possible L3 causes. A single L3 region pooled all such hypotheses from the 16 L2 regions below it. The L3 region would make a decision regarding the category of the object by counting the votes from all L2 regions.

We observed in our introduction that the perception of an object should remain stable despite eye movements as long as the object remains within the field of view and is attended to. If the input is ambiguous, the brain can gather further information from the input by making small eye movements. Between these eye movements, the correct hypothesis would remain stable while the competing incorrect hypotheses will vary in a random manner. We made use of this idea to improve the signal-to-noise ratio for detecting novel patterns.

The system was trained on simulated motions of 91 objects. Two examples of every object were shown to the system during training (figure 4 A, B). Accuracy of detection on the training set was 100% without any eye movement. For test cases, we limited the maximum number of eye movements to 12. The system showed a high degree of invariance to position, scale and distortion on novel patterns, as displayed in figures 4 and 5.
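The L3 decision rule described above is plain vote counting over the hypothesis vectors sent up by the L2 regions; a minimal sketch under assumed shapes (one binary row per L2 region, one column per category):

```python
import numpy as np

def l3_decide(l2_hypotheses):
    # l2_hypotheses: (num_l2_regions, num_categories) binary array;
    # a 1 means that L2 region considers the category possible.
    votes = np.sum(l2_hypotheses, axis=0)   # tally votes per category
    return int(np.argmax(votes))            # winning category index
```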

5 Discussion

Invariant pattern recognition has been an area of active research for a long time. Earlier efforts used only the spatial information in images to achieve invariant representations [5, 12, 11]. However, the performance of these systems was limited and their generalization questionable. We believe that continuity of time is the cue that the brain uses to solve the invariance problem [4, 13]. Some recent models have used temporal slowness as a criterion to learn representations [7, 14, 2]. However, those systems lacked a Bayesian inference-prediction framework [8] and did not have any particular role for feedback.

Our model captures multiple temporal and spatial scales at the same time. This goes beyond the use of Hierarchical Hidden Markov Models (HHMMs) [3] to capture structure at multiple scales either in space or in time. Moreover, algorithmic structures like HHMMs and Markov Random Fields [9] have remained abstract computer vision models because they haven't made any connections with known cortical structure. Several other models [1, 10] attempt to solve the invariance problem by explicitly applying different scalings, rotations and translations in a very efficient manner. However, as our test cases in section 4 indicate, none of the novel patterns we receive are pure scalings or translations of stored patterns.

We demonstrated invariant pattern recognition using only a subset of the principles outlined in sections 2 and 3. We believe that, using the full strength of the outlined theory, we will be able to demonstrate other well known cortical phenomena [8]. Although we used supervised learning in our simulation of the pattern recognition system, this is not a necessary component of the theory. We believe that it is possible to learn high level causes in an unsupervised fashion by learning sequences of sequences, as demonstrated in figure 3. Future work will include application of these ideas to natural videos.

References

[1] David W. Arathorn. Map-Seeking Circuits in Visual Cognition: A Computational Mechanism for Biological and Machine Vision. Stanford University Press, Stanford, CA, 2002.

[2] Suzanna Becker. Implicit learning in 3D object recognition: The importance of temporal context. Neural Computation, 11(2):347-374, 1999.

[3] Shai Fine, Yoram Singer, and Naftali Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1):41-62, 1998.

[4] Peter Foldiak. Learning invariance from transformation sequences. Neural Computation, 3(2):194-200, 1991.

[5] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193-202, 1980.

[6] Jeff Hawkins and Sandra Blakeslee. On Intelligence. Times Books, Henry Holt and Company, New York, NY, 2004.

[7] Aapo Hyvärinen, Jarmo Hurri, and Jaakko Väyrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20(7):1237-1252, 2003.

[8] Tai Sing Lee and David Mumford. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7):1434-1448, 2003.

[9] Kevin Murphy, Antonio Torralba, and William T. Freeman. Using the forest to see the trees: A graphical model relating features, objects and scenes. In Advances in Neural Information Processing Systems 16, 2004.

[10] Bruno A. Olshausen, Charles H. Anderson, and David C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. The Journal of Neuroscience, 13(11):4700-4719, 1993.

[11] Rajesh P. N. Rao and Dana H. Ballard. Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9(2):219-234, 1998.

[12] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019-1025, 1999.

[13] Simon M. Stringer and Edmund T. Rolls. Invariant object recognition in the visual system with novel views of 3D objects. Neural Computation, 14(11):2585-2596, 2002.

[14] Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715-770, 2002.

A Hierarchical Bayesian Model of Invariant Pattern Recognition in the Visual Cortex

Dileep George
Department of Electrical Engineering, Stanford University and
Redwood Neuroscience Institute, Menlo Park, CA 94305
E-mail: [email protected]

Jeff Hawkins
Redwood Neuroscience Institute, Menlo Park, CA 94305
E-mail: [email protected]

Abstract — We describe a hierarchical model of invariant visual pattern recognition in the visual cortex. In this model, the knowledge of how patterns change when objects move is learned and encapsulated in terms of high probability sequences at each level of the hierarchy. The configuration of object parts is captured by the patterns of coincident high probability sequences. This knowledge is then encoded in a highly efficient Bayesian network structure. The learning algorithm uses a temporal stability criterion to discover object concepts and movement patterns. We show that the architecture and algorithms are biologically plausible. The large scale architecture of the system matches the large scale organization of the cortex, and the micro-circuits derived from the local computations match the anatomical data on cortical circuits. The system exhibits invariance across a wide variety of transformations and is robust in the presence of noise. Moreover, the model also offers alternative explanations for various known cortical phenomena.

I. INTRODUCTION

Recognizing objects despite different scalings, rotations and translations is something humans perform without conscious effort, but it is still a hard problem for computer vision algorithms.

We believe that the geometric invariances that humans so effectively handle are intimately linked to motion in this world. When we move in this world while still looking at an object, the patterns that fall on our retina change continuously while the underlying cause for those patterns (the object itself) remains the same. Rigid objects have the property that they produce the same change of patterns for the same pattern of motion. Rigid objects in this world can thus be thought of as the underlying causes for persistent patterns on our retina. Learning persistent patterns on the retina would therefore correspond to learning objects in the visual world, and associating these patterns with their causes corresponds to invariant pattern recognition.

In this model we use many concepts which are familiar and accepted in neuroscience and computer vision. It is well known that the visual cortex is organized in a hierarchy, and several models of invariant pattern recognition [6], [16] make use of this. Temporal slowness has been shown to be a plausible criterion for learning invariances [19], and our idea of most likely sequences can be related to this. We derive our architecture and algorithms based on the idea that the goal of the cortex is to make predictions [7]. Predictive models [14] can explain the role of feedback connections in the cortex. However, there are no predictive models available in the literature that do invariant pattern recognition as well. The framework of ideas that we use here was described in [7], and we consider that as the starting point of the work we describe here.

The rest of this paper is organized as follows. In section 2, we describe our system architecture, and in section 3 the learning algorithm. In section 4 we describe how the system performs invariant pattern recognition. In section 5 we connect the architecture and algorithms to biology. In section 6 we describe the simulation setup and performance results. Our model provides alternative explanations for some cortical phenomena, and these are explained in section 7. We conclude the paper in section 8 with a discussion of related work.

II. ARCHITECTURE AND ASSUMPTIONS

The system we describe here is organized in a hierarchy, and our learning and recognition algorithms exploit this hierarchical structure. Each level in our system hierarchy has several modules. These modules model cortical regions. A module can have several children and one parent. Thus the modules are arranged in a tree structure. The bottom-most level is called level 1, and the level number increases as you go up the hierarchy. Inputs go directly to the modules at level 1. The level 1 modules have small receptive fields compared to the size of the total image, i.e., these modules receive their inputs from a small patch of the visual field. Several such level 1 modules tile the visual field, possibly with overlap. A module at level 2 is connected to several adjoining level 1 modules below. Thus a level 2 module covers more of the visual field compared to a level 1 module. However, a level 2 module gets its information only through level 1 modules. This pattern is repeated in the hierarchy. Thus the receptive field sizes increase as one goes up the hierarchy. The module at the root of the tree covers the entire visual field, by pooling inputs from its child modules. The set of level 1 modules can be considered analogous to V1, the set of level 2 modules analogous to V2, and so on.
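The module hierarchy described above can be sketched as a simple tree data structure in which each module has one parent and several children, and a parent's receptive field is the union of its children's (illustrative only, not the authors' code):

```python
class Module:
    """One node of the tree-structured hierarchy of cortical modules."""

    def __init__(self, level, receptive_field):
        self.level = level
        self.receptive_field = set(receptive_field)  # input pixel ids
        self.parent = None
        self.children = []

    def add_child(self, child):
        # A child has exactly one parent; the parent's receptive field
        # grows to cover everything its children see.
        child.parent = self
        self.children.append(child)
        self.receptive_field |= child.receptive_field
```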

Fig. 1. Learning stages: Learning starts at the bottom of the hierarchy and proceeds to the top. The modules at the very bottom of the hierarchy receive their inputs from a small section of the visual field. During Stage 1, these modules observe their inputs in time and learn the most likely input sequences of a particular length. Once Stage 1 learning is finished, these modules start passing up the index of the sequence whenever they observe one of the most likely sequences at their inputs. A higher level module gets its inputs from several lower level modules. During Stage 2, the higher level module learns frequent coincidences of sequence indices. These become the alphabet, or concepts, for the higher level. Note that this alphabet abstracts what patterns occur together in space and time. During the third stage of the learning procedure, the higher level concepts are fed down to the lower regions so that they learn the occurrences of the lower level patterns in the context of the higher level concepts. Repeating this in a hierarchy, we obtain a graphical model as shown in figure 2.

III. LEARNING ALGORITHM

We describe our learning algorithm taking a two level hierarchical arrangement, as shown in figure 1, as the example. The inputs to the system are given to the modules at the bottom-most level. Let the random variable prefix X indicate all the inputs to level 1 modules. Let {X(1)_n} and {X(2)_n} denote the sequences of inputs to modules 1 and 2 in figure 1.

Learning in this model occurs in three stages. During the first stage of learning, a module learns the most likely sequences of its inputs. Let B(l) = {S^(m)_{X,1}, S^(m)_{X,2}, ..., S^(m)_{X,N}} be the set of sequences of length l whose fraction of occurrences is greater than a threshold. A module learns this set empirically by observing its sequence of inputs. Once a module has learned B(l), any high probability sequence seen by this module can be uniquely represented by its index k into the set B(l). At the end of learning Stage 1, a module has learned B(l), and it produces at its output the index of the high probability sequences that it observes on its input.
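Stage 1 as described reduces to counting length-l windows of the input stream and keeping those whose empirical frequency exceeds a threshold (the threshold symbol itself is not recoverable from this text, so it appears as a named parameter here). A hedged sketch:

```python
from collections import Counter

def learn_frequent_sequences(inputs, l, threshold):
    # Slide a length-l window over the input stream and count each
    # distinct window; keep those whose fraction of occurrences
    # exceeds the threshold. The returned list plays the role of B(l):
    # a sequence's position in it serves as its index k.
    windows = [tuple(inputs[i:i + l]) for i in range(len(inputs) - l + 1)]
    counts = Counter(windows)
    total = len(windows)
    return [s for s, c in counts.most_common() if c / total > threshold]
```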

A module enters the second stage of the learning process once its child modules have finished the first stage of learning and are communicating with this module in terms of the indices of their high probability sequences. Let's consider module number 3 at the second level in figure 1. The input to this module consists of the concatenation of the outputs from its child modules 1 and 2. A particular concatenation represents a simultaneous occurrence of a combination of high probability sequences in the child modules. Depending on the spatio-temporal statistics of the inputs seen by the lower level modules, some of these coincidences will occur frequently and some will not. During the second stage of learning, a parent module learns the most frequent coincidences (according to a threshold criterion) of sequences in the levels below it. We denote the most frequent patterns at this level 2 module by Y and the number of such patterns by M. These patterns become the alphabet for this module.

The third stage, called contextual embedding, involves feedback from the level 2 module to its child modules to embed the lower level patterns in the context of the higher level patterns. This stage is initiated once the level 2 module has formed its alphabet Y as we described above. Assume that at a particular point in time the higher level pattern Y = yk is active. (This pattern was made active by the simultaneous occurrence of a combination of sequences in lower levels.) This information, i.e., the index of the high level concept, is fed back to the level 1 modules. This information is used by the level 1 modules to obtain a conditional probability distribution (CPD) matrix of their patterns given the patterns at a higher level. During the learning process, this CPD matrix is updated by incrementing the count for all level 1 patterns that were part of the sequence which caused the high level pattern yk. At the end of the learning process, the rows of this matrix are normalized to obtain the conditional probability distribution P(X(1)|Y) for module 1 at level 1. This process is identical for all the modules at level 1.
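The Stage 3 bookkeeping described above is a count matrix over (higher-level pattern, lower-level pattern) pairs whose rows are normalized at the end of learning to give P(X(1)|Y); a minimal sketch with illustrative names:

```python
import numpy as np

def build_cpd(counts):
    # counts[y, x]: how often lower-level pattern x occurred while
    # higher-level pattern y was active. Row-normalize to get P(X | Y);
    # rows with no counts are left as all zeros.
    rows = counts.sum(axis=1, keepdims=True)
    safe = np.where(rows > 0, rows, 1.0)
    return counts / safe
```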

The learning process defined above can be repeated in a hierarchy. This is done by considering the frequent spatial patterns seen by a module at any level to be the alphabet of that region and then repeating Stages 1, 2 and 3 of the learning algorithm in a manner identical to the description above. In our example, the learning can be continued between levels 2 and 3 by considering the frequent spatial patterns Y of the level 2 module as its alphabet and then learning the high probability sequences on this alphabet to continue to Stages 2 and 3 of the algorithm.

Fig. 2. Structure of a typical Bayesian network obtained at the end of the learning procedure. The random variables at the bottom level nodes correspond to quantizations of input patterns. The random variables at intermediate levels represent object parts which move together persistently. The random variables at the top node correspond to objects. During training, the definitions of the intermediate object parts and then the top-level objects are obtained using the algorithms described in figure 1. The probability tables are also filled according to Stage 3 of figure 1.

IV. RECOGNITION AS INFERENCE IN A BAYESIAN NETWORK

Once all the modules in the hierarchy have learned according to the algorithms described in section 3, we get a tree-structured Bayesian network [13], an example of which is shown in figure 2. The modules correspond to the nodes in a probabilistic graphical model, and each node stores a conditional probability distribution. Every module can be thought of as having a set of states. The CPDs at each node encode the probability distribution of the states of that module given the state of its parent module.

If we assume that the learning algorithm has produced meaningful states at each module of the hierarchy, with the states at the top of the hierarchy corresponding to object categories, then the recognition problem can be defined as follows. Given any image I, find the most likely set of states at each module, the combination of which best explains the given image in a probabilistic sense. Specifically, if W is the set of all random variables corresponding to node states, then the most probable explanation for the image I is a set of instantiations w* of the random variables such that

P(w* | I) = max_w P(w | I)    (1)

If Z is the random variable representing the states at the top of the hierarchy, then the category label that best explains any given image is the index of z*, where z* is the assignment to Z which maximizes the above equation. It is a well known result that, given an acyclic Bayesian network such as the one we have here, inference can be performed using local message passing.

Fig. 3. Belief propagation and cortical anatomy: The belief propagation equations that we used for inference in our model have an anatomical mapping which matches anatomical data [18] to a large extent. Shown here is the cortical circuit resulting from such a mapping. This mapping enabled us to replicate some of the physiological experiments in our system.

We use Pearl's Bayesian belief propagation algorithm [13] to obtain the most likely explanation given an image.
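As a toy illustration of the local message passing mentioned above, consider max-product propagation on a two-node tree: a child node with observed evidence sends a message to its parent, which combines it with its prior to find the most probable explanation. This is only a sketch of the idea, not the circuit of figure 3; all names are assumptions:

```python
import numpy as np

def mpe_two_node(prior_y, cpd_z_given_y, evidence_z):
    # cpd_z_given_y[y, z] = P(Z = z | Y = y). Since Z is observed,
    # the max-product message from Z to Y reduces to the likelihood
    # column P(z_obs | y) for each candidate y.
    msg = cpd_z_given_y[:, evidence_z]
    belief = prior_y * msg                  # combine prior and message
    y_star = int(np.argmax(belief))         # most probable explanation
    return y_star, belief / belief.sum()
```

In a full tree each internal node would recurse this step over all of its children, which is what Pearl's algorithm organizes.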

V. CONNECTION TO BIOLOGY

It is well known that the cortical system is organized in a hierarchy and that, by virtue of the connections, some regions are hierarchically above others [3]. Moreover, it is well known that the receptive field size increases as you go up the hierarchy. It is generally accepted that neurons in the higher levels of the visual cortex represent more complex features, with neurons/columns in IT representing objects or object parts. The lateral connections within layers 2-3 of the cortex and the connections between layers 1, 2 and 3 through the thalamus could provide adequate mechanisms for the learning of sequences [7]. Thus the large scale organization of our system is in agreement with the structure of the visual cortex.

We also found a fine mapping of these algorithms to cortical anatomy by mapping the Bayesian Belief Propagation (BBP) [13] equations to a neural instantiation. A cortical region can be thought of as encoding a set of concepts in relation to the concepts encoded in regions hierarchically above it. The set of concepts encoded by a region can be thought of as a random variable. A cortical column represents a particular value of this random variable. At every time instant, the activity of a set of cells in a column represents the probability that a particular hypothesis is active. The feedforward and feedback connections to a cortical region carry the belief propagation messages. Observed information anywhere in the cortex is propagated to other regions through these messages and can alter the probability values associated with the hypotheses maintained by other regions. Figure 3 shows the detailed cortical micro-circuitry derived from the BBP equations. The anatomical details of this circuit match the known anatomical data [18] to a great extent. The BBP equations that we used for deriving this micro-circuit are given as part of the appendix.

Fig. 4. Recognition: Shown here are examples of test images that the system could recognize correctly, along with their labels. The system shows very robust scale, translation and distortion invariance and works well with very noisy inputs. Note that some patterns (table lamp, dog) are recognized irrespective of their orientation. The invariances developed in the system are the ones to which the system is exposed during the training phase. Our system has the feature that small eye-movements during the recognition stage improve performance. With eye-movements we have a recognition accuracy of 97 percent for viewer-drawn images.

Fig. 5. Prediction/Filling-in: This experiment demonstrates the predictive capabilities of the system. The raw input (top left) is very noisy, and an immediate reconstruction using the information in a 4x4 window gets all the features wrong (top right). The intermediate reconstruction (bottom left) is obtained by running belief propagation up to the second level in the hierarchy and then passing the beliefs down to the lower level again. Thus the intermediate-level reconstruction reflects the statistics of patterns in an 8x8 neighborhood. The global reconstruction (bottom right) is obtained by doing belief propagation globally. This reconstruction is consistent with the recognition of the input as a helicopter.

Fig. 6. In this experiment we showed the system snapshots of 3 novel images at 10 randomly chosen positions. Plotted is the number of positions to which the system generalized for each of these novel images (shown along the X axis). It can be seen that generalization performance goes down with increasing complexity of the novel pattern.

VI. SIMULATION SETUP AND RESULTS

We simulated the above algorithms on a data set of line-drawing movies. These movies were created by simulating straight-line motions of line drawings of objects belonging to 91 classes. There were 10 exemplar images of different scales for each category. Each image was of size 32 pixels by 32 pixels. The movies were created by picking a random category index and then moving the picture belonging to that category in straight lines. Once a particular direction of motion was picked, the object moved in that direction for a minimum of 10 time steps before changing directions. An object that was picked remained in the field of view for at least 30 time steps before a new object category was picked at random. This way of simulating a movie gives us an infinitely long input sequence with which to verify various performance aspects of the algorithms described above. We describe the results of these investigations in the following subsections.
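The movie-generation procedure above can be sketched as follows. The unit step size, the four motion directions, and the wrap-around boundary handling are assumptions not specified in the text:

```python
import random

def make_movie(images, num_steps=200, min_run=10, min_dwell=30):
    """Generate a training movie: pick a random category, move its
    pattern in straight lines (each direction held for at least
    min_run steps), and keep it in view for at least min_dwell steps
    before a new category is picked. `images` maps category -> 32x32
    pattern; positions wrap around the 32-pixel field of view."""
    frames = []
    t = 0
    while t < num_steps:
        cat = random.choice(list(images))      # new object, chosen at random
        x, y = 0, 0
        dwell = 0
        while dwell < min_dwell and t < num_steps:
            # pick a straight-line direction and hold it for min_run steps
            dx, dy = random.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
            for _ in range(min_run):
                if t >= num_steps:
                    break
                x, y = (x + dx) % 32, (y + dy) % 32
                frames.append((cat, x, y))     # frame = pattern at offset (x, y)
                dwell += 1
                t += 1
    return frames
```

Because a new category is drawn only after the dwell time elapses, running this generator indefinitely yields the "infinitely long input sequence" the text describes.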

All these simulations are based on a hierarchical arrangement of modules as described in section 2. The system consisted of 3 levels. The lowest level, level 1, consisted of modules receiving inputs from a 4x4 patch of the image. Sixty-four level 1 modules tiled an input image. Learning started at L1 and proceeded to the higher levels. A level 2 module received its inputs from 4 adjoining level 1 modules. There were a total of 16 level 2 modules. A single level 3 module received all the information from these level 2 modules.
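The module counts above follow directly from the tiling arithmetic; a small sketch, where the fan-in tuple simply restates the numbers in the text (4 children per level 2 module, 16 per level 3 module):

```python
import math

def hierarchy_layout(image_side=32, patch=4, fanin=(4, 16)):
    """Receptive-field side and module count per level for the 3-level
    system described above. Level 1 tiles the image with patch x patch
    modules; each higher level pools a square block of fanin[k] children.
    Returns a list of (level, receptive-field side, number of modules)."""
    rf, count = patch, (image_side // patch) ** 2
    layout = [(1, rf, count)]
    for lvl, f in enumerate(fanin, start=2):
        rf *= math.isqrt(f)      # a square block of children widens the RF
        count //= f
        layout.append((lvl, rf, count))
    return layout
```

This reproduces the text's figures: 64 level 1 modules over 4x4 patches, 16 level 2 modules with 8x8 receptive fields, and one level 3 module covering the whole 32x32 image.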

A. Recognition, Prediction and Generalization

The full network was trained according to the algorithm described in section 3. Recognition is performed according to the inference procedure outlined in section 4. An input image to be recognized is converted to uncertain evidence using a Hamming-distance metric on each module (at the level of 4x4 pixels) as described in section 4. Recognition is defined as obtaining the most probable explanation (MPE) of the evidence given the conditional probabilities that we learned on the graph. We used Pearl's Bayesian Belief Propagation algorithm for inference [13].
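The conversion of an input patch to uncertain evidence could look like the following sketch. The paper specifies only that a Hamming-distance metric is used; the exponential soft-max form and the sharpness parameter `beta` are assumptions:

```python
import numpy as np

def soft_evidence(patch, stored_patterns, beta=1.0):
    """Convert a 4x4 binary input patch into an evidence (lambda)
    vector over a module's stored patterns, using Hamming distance:
    patterns closer to the observed patch get higher likelihood."""
    patch = np.asarray(patch).ravel()
    dists = np.array([np.count_nonzero(patch != np.asarray(p).ravel())
                      for p in stored_patterns])
    lam = np.exp(-beta * dists)      # distance 0 -> weight 1, decaying with distance
    return lam / lam.sum()           # normalized evidence vector
```

A noiseless patch then produces evidence concentrated on its matching stored pattern, while a corrupted patch yields a spread-out vector, which is exactly the "uncertain evidence" that belief propagation resolves.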

The system exhibited robust recognition performance invariant to large scale changes, deformations, translations and noise. Figure 4 shows examples of correctly recognized images. Note that some categories are recognized irrespective

Fig. 7. Neurons Responding to Illusory Contours/Contour Continuation: Such neurons were observed in V1 [9]. Here we show the results of an experiment which demonstrates analogous results. Illusory contours are the result of the higher levels imposing their knowledge of higher-level structures onto the lower levels. To test this we deleted a small portion of a familiar pattern and gave that as the input to the system. This pattern (an incomplete 'a') is shown in the figure. We then recorded the activities of neurons in the regions marked A and B as a function of time. The image is shown to the system at t = 0. Neuron 15 of region B shows a robust response at t = 0 because this region receives a perfect input that is tuned to neuron 15. In contrast, neuron 76 of region A does not show any response at this time. At time t = 2 the information has propagated one level up and has propagated back down. This forces region A to change its current belief about its state, thus increasing the activity of neuron 76. At t = 4, the global feedback information reaches all level 1 regions and, for region A, this increases the belief in neuron 76. Note that the pattern corresponding to neuron 76 correctly fills the missing portion of the input pattern. Neuron 15 is an example of a neuron in region A whose activity was not affected by the feedback information. At t = 4, all regions have received feedback from everywhere and hence the responses do not change after this point.

of a flip about the vertical axis. This is because for those categories we included sequences which had a flip as part of the training sequences. This shows the capability of the algorithm to learn the transformations it is exposed to during training. If the input is ambiguous, the cortex can gather further information from the input by making small eye movements. Many psycho-physical studies show that recognition performance is greatly hampered when eye movements are not allowed [11]. Between the eye movements, the correct hypothesis would remain stable while the competing incorrect hypotheses vary in a random manner. Our system exhibits this property and we used it to improve the signal-to-noise ratio during recognition.
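One way to exploit this stability is to combine the per-fixation belief vectors multiplicatively; treating the fixations as independent evidence is an assumption, since the paper does not specify how the snapshots were combined:

```python
import numpy as np

def accumulate_over_fixations(belief_snapshots):
    """Combine belief vectors from several eye-movement fixations.
    The correct hypothesis stays stable across fixations while the
    wrong ones fluctuate randomly, so summing log-beliefs (i.e.
    multiplying the snapshots) boosts the signal-to-noise ratio."""
    logs = np.log(np.asarray(belief_snapshots) + 1e-12)  # guard against log(0)
    logsum = logs.sum(axis=0)
    combined = np.exp(logsum - logsum.max())             # stable re-normalization
    return combined / combined.sum()
```

With snapshots like [0.4, 0.5, 0.1], [0.4, 0.1, 0.5], [0.4, 0.3, 0.3], the stable first hypothesis dominates after combination even though it never had the single highest value.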

A critical test of whether a system generalizes correctly is whether it can correct noisy or missing inputs using its knowledge of the statistics of patterns. We tested this for our system and the results are shown in figure 5.

We also tested that our system generalizes well when trained on novel patterns. Generalization occurs in our system due to the hierarchy. Objects are made of the same lower-level components. After having seen many images, the lower levels of the system have seen everything that is there (a sufficient statistic) in the visual world at the spatial and temporal scales

Fig. 8. Shape perception reduces activity in lower levels [10]: Our model offers an alternative explanation for this phenomenon compared to the subtraction theory [10]. Reduction in activity occurs because incorporating global information narrows the hypothesis space maintained by a lower-level region. In this experiment, we showed our system a highly noisy picture of a helicopter and recorded the activity of the cells which represent the current belief in a rectangular level 1 (V1) region (indicated by the arrow). At t = 0, the input is highly ambiguous, as shown, and hence the belief of the region is highly spread out. At t = 1 the level 2 regions integrate the information from multiple level 1 regions and feed information back to the level 1 regions. At t = 2, the level 1 region uses this information to update its belief. The figure shows that this reduces the spread of the belief compared to t = 0. The corresponding picture of the helicopter is the reconstruction at this stage if you take the best guesses from all level 1 regions. At t = 4, the level 1 regions get feedback which incorporates the global information. This further narrows the posterior distribution. Note also that the reconstruction at this point is the correct one.

of those modules. Thus if a new pattern is to be learned, most of the lower-level connections do not need to be changed at all to learn that pattern. Figure 6 shows the generalization performance of the system in learning new patterns.

VII. ALTERNATIVE EXPLANATIONS FOR BIOLOGICAL PHENOMENA

Some physiological experiments [9] found that neurons in V1 of the visual cortex respond to illusory contours in a Kanizsa figure. This means that a neuron is responding to a contour that does not exist in its receptive field. Another way of interpreting this is that the activity of the neuron represents the probability that a contour should be present in its input, given its own input and the contextual information from above. We found such neurons in our model using the anatomical mapping we described in section 5. See figure 7 for the results of our experiment.

Functional MRI studies [10] report that the perception of an object in the infero-temporal cortex reduces the activity in lower levels of the hierarchy. We could observe this in our model and we offer a Bayesian explanation for this phenomenon, as opposed to the current subtraction hypothesis [10]. See figure 8 for details.

VIII. DISCUSSION

Invariant pattern recognition has been an area of active research for a long time. Earlier efforts used only the spatial information in images to achieve invariant representations [6], [16], [15]. However, the performance of these systems was limited and their generalization questionable. We believe that continuity of time is the cue that the brain uses to solve the invariance problem [5], [17]. Some recent models have used temporal slowness as a criterion to learn representations [8], [19], [2]. However, those systems lacked a Bayesian inference-prediction framework [9] and did not have any particular role for feedback.

Our model captures multiple temporal and spatial scales at the same time. This goes beyond the use of Hierarchical Hidden Markov Models (HHMMs) [4] to capture structure at multiple scales in either space or time. Several other models [1], [12] attempt to solve the invariance problem by explicitly applying different scalings, rotations and translations in a very efficient manner. However, as our test cases in section 4 indicate, none of the novel patterns we receive are pure scalings or translations of stored patterns.

In our current system, sequence information is used only during the training stage, to form concepts at intermediate levels. Future work will include methods for preserving this sequence information so that the system can predict forward in time. The current model deals only with the ventral visual pathway of the cortex. Dealing with the dorsal pathway will require integrating motor information with visual information. This is also part of future work.

APPENDIX: BAYESIAN BELIEF PROPAGATION EQUATIONS

The following equations, adapted from [13], were used for the derivation of the circuit shown in figure 3.

\lambda(y_k) = \prod_j \lambda_{X_j}(y_k) \qquad (2)

\pi(y_k) = \sum_z P(y_k \mid z)\, \pi_Y(z) \qquad (3)

\mathrm{BEL}(y_k) = \alpha\, \lambda(y_k)\, \pi(y_k) \qquad (4)

\lambda_Y(z_m) = \sum_y \lambda(y)\, P(y \mid z_m) \qquad (5)

\pi_{X_j}(y_k) = \pi(y_k) \prod_{i \neq j} \lambda_{X_i}(y_k) \qquad (6)

These equations are specified with respect to a module/region that encodes the random variable Y. Equations 2 to 4 describe how the internal values \lambda(Y), \pi(Y) and \mathrm{BEL}(Y) are calculated from the incoming messages and locally stored probability tables. Equations 5 and 6 describe how to derive the messages that are sent as the feed-forward and feedback outputs of this region. See [13] for more details on Belief Propagation.
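A minimal sketch of equations (2)-(6) for a single region encoding Y, with one parent Z and several children X_j. The conditional probability table is stored as P[z, y] = P(y|z); normalizing the outgoing pi messages is an assumption (the normalization constant can be folded in anywhere):

```python
import numpy as np

def node_update(lam_msgs_in, pi_msg_in, P):
    """One belief-propagation update for the region encoding Y.
    lam_msgs_in: lambda messages from each child X_j, shape (n_children, |Y|)
    pi_msg_in:   pi message from the parent, shape (|Z|,)
    P:           CPT with P[z, y] = P(y|z), shape (|Z|, |Y|)"""
    lam_msgs_in = np.asarray(lam_msgs_in, dtype=float)
    lam = lam_msgs_in.prod(axis=0)           # (2): combine evidence from children
    pi = P.T @ pi_msg_in                     # (3): sum_z P(y|z) pi_Y(z)
    bel = lam * pi
    bel /= bel.sum()                         # (4): normalized belief over Y
    lam_out = P @ lam                        # (5): lambda message to the parent
    pi_outs = []
    for j in range(lam_msgs_in.shape[0]):    # (6): pi message to each child X_j
        others = np.delete(lam_msgs_in, j, axis=0).prod(axis=0)
        msg = pi * others
        pi_outs.append(msg / msg.sum())
    return bel, lam_out, pi_outs
```

In the cortical mapping of figure 3, lam_msgs_in and pi_msg_in correspond to the feed-forward and feedback inputs of a region, while lam_out and pi_outs correspond to its feed-forward and feedback outputs.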

REFERENCES

[1] David W. Arathorn. Map-Seeking Circuits in Visual Cognition: A Computational Mechanism for Biological and Machine Vision. Stanford University Press, Stanford, CA, 2002.

[2] Suzanna Becker. Implicit learning in 3D object recognition: The importance of temporal context. Neural Computation, 11(2):347-374, 1999.

[3] D. C. Van Essen, C. H. Anderson, and D. J. Felleman. Information processing in the primate visual system: an integrated systems perspective. Science, 255(5043):419-423, 1992.

[4] Shai Fine, Yoram Singer, and Naftali Tishby. The Hierarchical Hidden Markov Model: Analysis and Applications. Machine Learning, 32(1):41-62, 1998.

[5] Peter Foldiak. Learning invariance from transformation sequences. Neural Computation, 3(2):194-200, 1991.

[6] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193-202, 1980.

[7] Jeff Hawkins and Sandra Blakeslee. On Intelligence. Times Books, Henry Holt and Company, New York, NY, 2004.

[8] Aapo Hyvärinen, Jarmo Hurri, and Jaakko Väyrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, 20(7):1237-1252, 2003.

[9] Tai Sing Lee and David Mumford. Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A, 20(7):1434-1448, 2003.

[10] S. O. Murray, D. Kersten, B. A. Olshausen, P. Schrater, and D. L. Woods. Shape perception reduces activity in human primary visual cortex. Proceedings of the National Academy of Sciences, 99(23):15164-15169, 2002.

[11] T. A. Nazir and J. K. O'Regan. Some results on translation invariance in the human visual system. Spatial Vision, 5(2):81-100, 1990.

[12] Bruno A. Olshausen, Charles H. Anderson, and David C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. The Journal of Neuroscience, 13(11):4700-4719, 1993.

[13] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Publishers, San Francisco, CA, 1988.

[14] R. P. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79-87, 1999.

[15] Rajesh P. N. Rao and Dana H. Ballard. Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9(2):219-234, 1998.

[16] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019-1025, 1999.

[17] Simon M. Stringer and Edmund T. Rolls. Invariant object recognition in the visual system with novel views of 3D objects. Neural Computation, 14(11):2585-2596, 2002.

[18] Alex M. Thomson and A. Peter Bannister. Interlaminar connections in the neocortex. Cerebral Cortex, 13(1):5-14, 2003.

[19] Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715-770, 2002.

Numenta_HTM_Concepts.doc

[Russian-language document on Hierarchical Temporal Memory (HTM) by Numenta Inc., organized as a series of numbered questions with figures 1-4 and a pointer to On Intelligence (Times Books, 2004). The Cyrillic text was lost during extraction; only isolated Latin terms (HTM, Numenta, ASCII) survive, and the content is not recoverable here.]