
Motion Integration and Disambiguation by Spiking V1-MT-MSTl Feedforward-Feedback Interaction

Maximilian P. R. Löhr
Inst. of Neural Information Processing, Ulm University, Ulm, Germany
[email protected]

Daniel Schmid
Inst. of Neural Information Processing, Ulm University, Ulm, Germany
[email protected]

Heiko Neumann
Inst. of Neural Information Processing, Ulm University, Ulm, Germany
[email protected]

Proc. Int'l Joint Conf. on Neural Networks, IJCNN'19, Budapest, Hungary, July 14-19, 2019

Abstract—Motion detection registers items within restricted regions in the visual field. Early stages of cortical processing of motion advance this estimate by integrating spatio-temporal input responses in area V1 to build feature representations of direction and speed in area MT of primate cortex. The neural mechanisms underlying such processes are not yet fully understood. We propose a neural model of hierarchically organized areas V1, MT, and MSTl, with feedforward and feedback connections. Each area serves a distinct purpose and is formally represented by layers of model cortical columns composed of excitatory and inhibitory spiking neurons with conductance-based activation dynamics. Recurrent connections enhance activations by modulatory interaction and divisive normalization. MT population activities allow the estimation of motion direction and speed, which we show for various stimuli. The importance of the feedback connections for disambiguation is demonstrated in simulated lesion studies.

I. INTRODUCTION

The reliable detection of moving objects in the surround and the estimation of self-motion and body posture are prerequisites of robust visual space and object perception. The primate visual system achieves such performance over a sequence of neural transformations of the retinal input. Different stages along the dorsal pathway in visual cortex build representations of feature and shape motion in the scene. In order to accomplish this, the visual system must integrate information derived from a set of local detectors which are distributed over the visual field. Such cortical processing of motion starts in area V1, subsequently feeding into area MT and then forwarding the resulting activity distributions to MST with its lateral and dorsal subdivisions [1]. Building coherent neural representations of moving shapes takes time, such that localized 2D intrinsic shape features and 1D extended contour outlines are suitably weighted according to their relevance [2, 3]. Motion processing is driven by the feedforward sensory stream but is influenced by re-entrant feedback signals from higher-level areas which control the gain of neurons in V1 and MT [4, 5].

Here, we propose a detailed neural architecture of spiking neurons for motion detection and integration at the initial stages of the dorsal pathway in visual cortex. Spike-encoding is included for increased biological plausibility and for later model implementation on event-driven neuromorphic hardware platforms. Results of initial spatio-temporal filtering of driving visual input are laterally integrated and further enhanced by context-driven modulating feedback and down-modulating divisive outer-surround inhibition [6]. The model considers a hierarchy of stages corresponding to areas V1, MT, and MSTl. We suggest that different stages of hierarchical processing of input motion contribute different functions for the detection and integration of motion information. Our proposal builds upon a canonical neural circuit model (CNCM) whose properties have been described and mathematically analyzed in [7]. This encoding scheme is here extended by transforming the circuit model into a spiking network architecture that incorporates the adaptive exponential firing (AdEx) mechanism to generate discrete events as outputs [8]. Along the feedforward pathway, cells process spatio-temporal input signals over a hierarchy of increasing feature selectivity and spatial size. The hierarchical bottom-up stream of driving signals is combined with top-down modulatory feedback signals to build a counterstream network for top-down prediction. We demonstrate that V1-MT-MSTl feedforward and feedback interaction redistributes the neural responses to disambiguate initial motion responses in a context-dependent way and generates coherent shape motion representations that reliably encode the direction and speed of moving objects. The work thus further contributes to revealing the roles different processing pathways play in feature detection, shape registration, and the delivery of context and prediction by top-down feedback signals [9].

The research reported here has been conducted as part of the VA-MORPH project financed by the Baden-Württemberg foundation ("Neurorobotik" program, project no. NEU012).

II. PREVIOUS WORK

Several works investigated the initial stages of motion detection along the cascade of areas V1 and MT of the dorsal pathway. These investigations identified canonical neural operations for the sequential filtering and competition to generate representations of moving patterns, such as spatio-temporal filtering at different scales, non-linear response integration, and opponent competition [10, 11]. Divisive normalization has been suggested to explain several nonlinear response properties of neurons at different neocortical levels [12, 13]. A cascaded model of linear-nonlinear operations was proposed by [14] in order to account for MT cell selectivities to component and pattern motion. The initial filtering of raw input built upon earlier investigations by [15-17] showing the formal equivalence of filtering with competition and spatio-temporal correlation mechanisms [18]. Initial motion detection by V1 receptive fields (RF) can only measure flow that is normal to local contrast orientation. This so-called aperture problem is resolved over time in area MT, with neuron population responses switching from signaling normal flow to pattern motion direction [19]. Motion can be signaled accurately already by V1 end-stopped neurons given localized features that are intrinsic to the target shape [20]. The contributions of initial filtering for feature motion detection in V1 and subsequent non-linear normalization operations for reliable MT motion integration were investigated in [21]. The authors demonstrated for oriented bars shorter than MT cell RFs that divisive inhibition plays a key role in the re-weighting of input strength for localized feature motions and aperture responses. More recently, hierarchical models have been suggested using hierarchical feedforward pooling schemes to process complex input motion for multiscale spatio-temporal filtering and pattern response integration in V1 and MT [22] and for visual heading computation in the presence of occluding moving objects [23]. Several models suggested building speed selective representations in area MT operating either in the spatial or the frequency domain [24-26], with the latter suggesting to combine sustained and transient output responses from V1 complex cells to generate a speed sensitive sensor.

Several investigations emphasized that feedback signals are generated at higher stages along the feedforward pathway and reenter the motion computation at lower stages. Such computations can potentially help to resolve the uncertainties from initial feature detection. For example, in [27] the model proposed by [23] was extended by integrating inhibitory feedback connections from pooling cells in MT to V1 complex cells to improve their feature selectivity. An earlier model by [28] demonstrated that the integration of bidirectional signal flow between areas V1, MT, and MST helps to integrate and segregate local motion signals by incorporating non-local context information, such as figure-ground properties and mutual occlusions. Similarly, motion feature selection and integration with and without guiding attention has been investigated by [29, 30], suggesting that re-entrant feedback signals are required to segregate semi-transparent motion. The latter models build upon a core model of bidirectionally coupled V1 and MT [31]. This work suggested disambiguation of normal flow in apertures via activity propagation initiated from localized features. The V1-MT counterstream interaction, with MT cells pooling over a larger spatial neighborhood, predicted motion disambiguation in both the V1 and MT representations simultaneously. This was not confirmed experimentally [32]. The model presented here resolves this discrepancy, predicting the growth cone of disambiguation in MT while V1 cells only improve their feature selectivity and reduce their responses to aperture motion. While the models mentioned above were not speed sensitive, [33] proposed a mechanism that incorporated different speed channels based on spatio-temporal filters.

The models discussed so far utilize rate-coded mechanisms to simulate population activities in response to motion inputs. A few spiking models have been suggested as well. For example, [34] investigated how a velocity-sensitive cell in MT might be learned using an STDP mechanism, and [35] developed a spiking two-layer neural model to detect moving object boundaries in image sequences. The architecture in [36] suggested a feedforward spiking network model for sequence analysis in action recognition including areas V1 and MT, and [37] hypothesized that specific connectivity patterns in top-down pathways implement the generative model activation patterns in predictive spike coding for motion tracking. More recently, initial event-based (spiking) motion detector schemes have been realized and implemented on dedicated neuromorphic hardware platforms (e.g., [38]).

We here build upon such previous investigations and develop a spiking neural architecture of model V1, MT, and MSTl processing, incorporating driving feedforward processing together with top-down feedback signal pathways. Detailed investigations by selective lesioning of the model allow us to assess the individual contributions of these processing principles. The spiking network architecture is also well suited for later implementation on neuromorphic hardware platforms.

III. METHODS

A. Architecture

The proposed model architecture includes three cortical areas, namely V1 (primary visual area), MT (middle temporal area), and MSTl (lateral part of the medial superior temporal area). Each is defined as a topographically organized layer of CNCMs corresponding to cortical columns and their lateral connections. The areas are hierarchically located on consecutive levels having different selectivities, such that the output of one level is fed as input to the subsequent layer (cf. Fig. 1). Output activation at one level in the hierarchy generates re-entrant feedback signals to the stage below in that hierarchy. Each model area includes three computational operations, namely an input filtering using static spatio-temporal filters (see III-C), a re-entrant top-down modulation of activation, and a local spatio-temporal competition (see III-B) leading to response normalization over a pool of cells in a space-feature domain [7]. The input filtering and further mutual cell interaction play a different computational role in each layer. Model V1 detects local movements in the input stream within a small RF. The input is a stream of events generated from temporal changes in the input luminance distribution, as registered in the retina. Local competition between cells at this stage reduces or suppresses responses at locations of intrinsically 1D structure where local normal flow is signaled. This aids subsequent layers in detecting the true motion direction, which is mainly guided by locations of intrinsically 2D structure, e.g. corners and end-points. This first layer consists of 128×128×8 pairs of excitatory-inhibitory (E-I) neurons. Each corresponds to a single spatial location in a 128×128 px image, which is the input size of all further experiments, and to one of eight discrete directions of motion, in increments of Δ = 45°.

Model MT builds the first velocity-selective, i.e. direction and speed selective, representation of the image features. At this level the output activity of V1 is integrated over a larger spatial and temporal domain utilizing velocity-sensitive filters. Several selective channels make it possible to discern different speeds of motion along an actual movement direction. Feedback signals are generated from MT velocity-selective cells by marginalizing over the speed dimension. This feedback re-enters the more localized V1 direction-selective activity distributions to enhance the gain of those components that might explain the current motion hypotheses. The net effect of such enhanced V1 responses is that the increase in local gain gives these cells a competitive advantage in the subsequent competition for response normalization, such that cells not receiving any re-entrant amplification will be down-modulated. The topographic structure of this second layer is formed by 128×128×6×8 E-I pairs where spatial locations and directions of motion correspond to those of layer V1. The additional explicit speed dimension is discretized into six channels to build explicit velocity representations. Speed-selective cells have Gaussian tuning on log-speed input.
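As an illustration of the speed selectivity described above, the following minimal sketch implements a Gaussian tuning curve on log-speed. The six preferred channel speeds are taken from Table II; the tuning width sigma_log and the function name are assumptions for illustration, not values reported by the authors.

```python
import numpy as np

# Preferred speeds of the six MT channels in px/s (cf. Table II).
MT_SPEEDS = np.array([24.0, 40.0, 56.0, 72.0, 88.0, 104.0])

def log_speed_tuning(stimulus_speed, preferred_speed, sigma_log=0.5):
    """Gaussian tuning on log-speed: the response is maximal when the
    stimulus speed matches the preferred speed of the channel.
    sigma_log (tuning width in log units) is an assumed value."""
    return np.exp(-((np.log(stimulus_speed) - np.log(preferred_speed)) ** 2)
                  / (2.0 * sigma_log ** 2))

# Example: responses of all six speed channels to a 64 px/s stimulus.
responses = log_speed_tuning(64.0, MT_SPEEDS)
print(dict(zip(MT_SPEEDS, np.round(responses, 3))))
```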

Inspired by [28], the model architecture includes area MSTl cells at the highest level which integrate MT output activity to find regions of homogeneous movement independent of the details and variations of speed. Again, output responses form re-entrant feedback signals to model MT neurons. At this stage the loopy feedforward and feedback interaction consolidates the responses of those MT cells that encode directions matching the movement hypothesis formed at the level of MSTl. In turn, this facilitates an activity spreading, or guided filling-in, for MT cell activation which evolves dynamically like a growth cone that is initiated at locations with unambiguous feature motion. The topographic structure of this last layer consists of 128×128×8 linear neurons to discretely sample movements along eight directions. Any output visualization of motion is calculated from the population of model MT space-velocity representations.

B. Model Neurons

Fig. 1. Model architecture consisting of areas V1, MT and MSTl. Excitatory (E) cells are driven by the responses of their level-dependent feedforward filters, which are driven by the output of the preceding layer. Modulatory input to E-cells is generated by top-down feedback and, in the case of MT, from lateral interaction as well. Inhibitory (I) cells are driven by integrated E-cell activity pooled over a larger neighborhood in the space-feature domain. Such pool activation of I-cells exerts inhibitory influence on E-cells at the corresponding location. Motion is estimated from area MT activity. Insets show examples of the spatio-temporal feedforward filters of V1 and MT, with light regions denoting positive and dark regions denoting negative coefficients.

Model layers V1 and MT are populated by pairs of E-I cells resembling the interaction within a cortical column [7]. Their states and their respective temporal evolution are defined by membrane potentials, r and p, and their first-order temporal changes. The spiking activity u_r of an E subpopulation forms the output of the respective layer. The net input current to each type of model neuron is denoted by

I_r = (E^{ex}_r - r) \cdot g^{ex}_r \cdot (1 + \lambda \cdot g^{mod}_r) + (E^{in}_r - \gamma_r \cdot r) \cdot g^{in}_r    (1)
I_p = (E^{ex}_p - p) \cdot g^{ex}_p + (E^{in}_p - \gamma_p \cdot p) \cdot g^{in}_p    (2)

which are driven by excitatory and inhibitory input conductances g. In the model, E-cells r¹ receive excitatory input g^{ex}_r via feedforward connections, modulatory input g^{mod}_r from top-down feedback as well as from lateral E-cells², and inhibitory input g^{in}_r from I-cells in the same layer. Similarly, I-cells p receive excitatory input g^{ex}_p from E-cells of the respective layer, as well as inhibitory input g^{in}_p from I-cells of the same layer. The constants E^{ex} and E^{in} denote the excitatory and inhibitory reversal potentials, while λ controls the strength of the gain modulation by feedback signals. We emphasize that feedforward and feedback signals have an asymmetric influence on the excitatory cells: while driving input signals are necessary to generate activity, feedback signals can only amplify the response in the presence of coinciding input; they cannot generate activity when driving input is absent. The integration of the different inputs specifies the temporal change of the membrane potentials following first-order dynamics (using the notation \partial_t x \equiv \dot{x}):

\tau_r \dot{r} = f_r(r) - w_r + I_r    (3)
\tau_p \dot{p} = f_p(p) - w_p + I_p    (4)

with a leak term f_r(r) (f_p(p), respectively) and an adaptation term w_r (w_p, respectively) for the spike-rate adaptation (see below). Steering parameters for the strength of divisive inhibition are given by γ_r and γ_p, and τ_r, τ_p denote the membrane time constants.

¹For better readability, subscripts denoting the E-cell instance with respect to its layer (e.g. MT), features (e.g. direction φ and speed s), and its position in retinotopic space x, y are dropped, r_MT(x, y, φ, s) ≡ r.
²In the case of MT, lateral and feedback streams are additively combined.

For the definition of a spike-output based dynamical neuron model, we incorporated mechanisms of the AdEx model [8], which considers leak and spike-rate adaptation terms, respectively, particularly

f_r(r) = (E^{leak}_r - r) \cdot g^{leak}_r + g^{leak}_r \cdot \Delta_r \cdot \exp\left(\frac{r - V_r}{\Delta_r}\right)    (5)
f_p(p) = (E^{leak}_p - p) \cdot g^{leak}_p + g^{leak}_p \cdot \Delta_p \cdot \exp\left(\frac{p - V_p}{\Delta_p}\right)    (6)

for the leak, and the additional dynamics for the adaptation

\tau_{w_r} \dot{w}_r = a_r \cdot (r - E^{leak}_r) - w_r    (7)
\tau_{w_p} \dot{w}_p = a_p \cdot (p - E^{leak}_p) - w_p    (8)

with \Delta_\bullet and V_\bullet as in [8], g^{leak}_\bullet denoting the leak strengths and E^{leak}_\bullet the reversal potentials in the spike-generation mechanisms. Spikes u_r are generated whenever the potential of the E-/I-cells exceeds the threshold V^\Theta_r or V^\Theta_p, respectively. After spike generation the membrane potential is set to a reset potential and the adaptation variable w_r is incremented:

if r > V^\Theta_r:  u_r \to 1,  r \to E^{leak}_r,  w_r \to w_r + b_r    (9)
else:  u_r \to 0,    (10)
if p > V^\Theta_p:  u_p \to 1,  p \to E^{leak}_p,  w_p \to w_p + b_p    (11)
else:  u_p \to 0.
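For concreteness, the following minimal sketch steps an excitatory cell through equations (1), (3), (5), (7) and the reset rules (9)-(10) with the explicit Euler scheme used in Section IV. Parameter values follow Table I; the input conductance arguments, the unit conventions, and the function name are simplifying assumptions rather than the authors' exact implementation.

```python
import numpy as np

# Parameters taken from Table I (values in mV, ms, or dimensionless).
P = dict(g_leak=30.0, E_ex=0.0, E_in=-75.0, E_leak=-70.6, V_T=-50.4,
         Delta=2.0, gamma=1.0, lam=0.5, a=4.0, b=80.5, V_theta=20.0,
         tau_r=281.0, tau_w=144.0)

def step_e_cell(r, w, g_ex, g_in, g_mod, dt=2.0, prm=P):
    """One explicit-Euler step (dt in ms) of an excitatory AdEx cell.
    Returns the updated (r, w, spike) triple."""
    # Eq. (1): conductance-based input current with modulatory gain.
    I = ((prm['E_ex'] - r) * g_ex * (1.0 + prm['lam'] * g_mod)
         + (prm['E_in'] - prm['gamma'] * r) * g_in)
    # Eq. (5): leak plus exponential spike-initiation term.
    f = ((prm['E_leak'] - r) * prm['g_leak']
         + prm['g_leak'] * prm['Delta'] * np.exp((r - prm['V_T']) / prm['Delta']))
    # Eqs. (3) and (7): membrane and adaptation dynamics.
    r_new = r + dt / prm['tau_r'] * (f - w + I)
    w_new = w + dt / prm['tau_w'] * (prm['a'] * (r - prm['E_leak']) - w)
    # Eqs. (9)-(10): threshold crossing, reset, and adaptation increment.
    if r_new > prm['V_theta']:
        return prm['E_leak'], w_new + prm['b'], 1
    return r_new, w_new, 0
```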

Spikes are transmitted to other neural sites by a filtering process via their overall connectivities and the associated weight coefficients (cf. Section III-C)

y^S(x, y, t, θ) = (\Lambda^S \ast u)(x, y, t, θ),    (12)

with \ast denoting a convolution operation over the respective spatial (x, y), temporal (t), and feature (θ) dimensions, and u being the spikes emitted by a source (r or p cells, respectively). The \Lambda^S are filter matrices for the different synapse types per layer and neuron population (S ∈ {ex, in, mod}). These filtered spikes y^S are then integrated postsynaptically via

\tau^S \dot{g}^S = -g^S + k^S \cdot y^S,    (13)

with the synaptic time constants \tau^S, which differ by connection, or synapse, and a scaling parameter k^S that defines the impact of spikes. The time constants are parameterized according to [39]. The MSTl cell responses are defined by a simplified scheme utilizing a linear mapping

r_{MSTl} = y^{MSTl,ex}_r.    (14)

The parameters for instantiating the different neuron models are defined in Table I.
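To illustrate equations (12)-(13), the sketch below turns a binary spike train into a postsynaptic conductance with an exponential synapse. The spatial/feature convolution with Λ^S is reduced to a single scalar weight, and the example spike train is an arbitrary assumption; the synaptic time constants come from Table I.

```python
import numpy as np

TAU = dict(ex=1.7, inh=6.5, mod=26.0)  # synaptic time constants in ms (Table I)

def synaptic_conductance(spikes, synapse='ex', k=1.0, weight=1.0, dt=2.0):
    """Integrate filtered spikes into a conductance trace, eq. (13).
    'weight' stands in for the kernel Lambda^S of eq. (12); k is the
    scaling factor k^S (cf. Table III)."""
    tau = TAU[synapse]
    g = np.zeros(len(spikes))
    for t in range(1, len(spikes)):
        y = weight * spikes[t]                            # eq. (12), scalar case
        g[t] = g[t - 1] + dt / tau * (-g[t - 1] + k * y)  # explicit Euler of eq. (13)
    return g

# Example: a sparse, assumed spike train filtered by an excitatory synapse.
spike_train = np.zeros(50)
spike_train[[5, 6, 20]] = 1.0
trace = synaptic_conductance(spike_train, synapse='ex', k=1.0)
```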

C. Parameterization of feedforward and feedback filters

TABLE I
NEURON PARAMETERIZATION OF MODEL V1 & MT

Variable | Value | Unit | Remarks
g^{leak}_r, g^{leak}_p | 30 | 1 | leak conductance
E^{ex}_r, E^{ex}_p | 0 | mV | excitatory reversal potential
E^{in}_r, E^{in}_p | −75 | mV | inhibitory reversal potential
E^{leak}_r, E^{leak}_p | −70.6 | mV | leak reversal potential
V_r, V_p | −50.4 | mV | exponential term offset
Δ_r, Δ_p | 2 | mV | exponential term slope
γ_r, γ_p | 1 | 1 | divisive inhibition scaling
λ | 0.5 | 1 | modulation scaling
a_r, a_p | 4 | 1 | sub-threshold adaptation strength
b_r, b_p | 80.5 | 1 | super-threshold adaptation strength
V^Θ_r, V^Θ_p | 20 | mV | spike threshold
τ_r | 281 | ms | membrane time constant E-cell
τ_{w_r} | 144 | ms | adaptation time constant E-cell
τ_p | τ_r/2 | ms | membrane time constant I-cell
τ_{w_p} | τ_{w_r}/2 | ms | adaptation time constant I-cell
τ^{ex} | 1.7 | ms | excitatory synaptic time constant
τ^{in} | 6.5 | ms | inhibitory synaptic time constant
τ^{mod} | 26 | ms | modulatory synaptic time constant

In the proposed model, the weighted connections between neuron populations are represented by filter kernels which are specific to the input stage of the different levels. Such filters are applied to their respective input activity distribution using a convolution operation. Feedforward filter banks at different layers generate input to the E-cells of V1 and higher layers (see g^{ex}_r in (1)). Feedback filters act along the reverse direction of the hierarchy, with the roles of the cells and their receptive and projective fields exchanged (cf. g^{mod}_r). The I-cells of a layer are fed by the E-cells at this level (e.g., via g^{ex}_p in (2)), which is the result of a convolution as well. Finally, there is lateral interaction within MT to support spreading of activity within the E-cell population. The qualitative shape of these kernels is described below and exact parameters can be found in Table II. All types of filter kernels are normalized such that their coefficients sum to one. Gabor filter coefficients sum to zero to achieve a zero DC-level response. Feedback kernels are created such that their maximum coefficient amounts to one. Scaling factors k^S, cf. (13), for filter responses are listed in Table III. Model parameters and kernel weights have been chosen according to physiologically plausible measures and selected to ensure stable dynamics.
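The kernel normalization conventions just described can be summarized in a short sketch; the helper function names are ours and the example kernel is an assumed spatial Gaussian (σ = 7.5 px, the V1 inhibitory pool from Table II).

```python
import numpy as np

def normalize_sum_to_one(kernel):
    """Feedforward, inhibitory, and lateral kernels: coefficients sum to one."""
    return kernel / kernel.sum()

def normalize_zero_dc(gabor_kernel):
    """Gabor kernels: remove the mean so coefficients sum to zero (no DC response)."""
    return gabor_kernel - gabor_kernel.mean()

def normalize_max_to_one(feedback_kernel):
    """Feedback kernels: scale so the maximum coefficient equals one."""
    return feedback_kernel / feedback_kernel.max()

# Example with an assumed 2D Gaussian kernel (sigma = 7.5 px).
xx, yy = np.meshgrid(np.arange(-10, 11), np.arange(-10, 11))
gauss = np.exp(-(xx**2 + yy**2) / (2 * 7.5**2))
pool_kernel = normalize_sum_to_one(gauss)
fb_kernel = normalize_max_to_one(gauss)
```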

TABLE II
FILTER PARAMETERIZATION OF MODEL V1, MT & MSTl

Layer | Filter | Parameters
V1 | feedforward | Gabor: σ_{x,y} = 1.22 px, ω_x = 2.86 · 2π/px, σ_t = 33 ms, s_tuning = 100 px/s
V1 | inhibitory | Gaussian over space: σ_{x,y} = 7.5 px
MT | feedforward | DoG: σ^{cen}_{x,y} = 1.22 px, σ^{sur}_{x,y} = 1.96 px; s_tuning ∈ {24, 40, 56, 72, 88, 104} px/s
MT | lateral | Gaussian over space: σ_{x,y} = 7 px, coefficient at mode set to 0; uniform over speed; Gaussian over direction: FWHM = 90°
MT | inhibitory | Gaussian over space: σ_{x,y} = 10.5 px; uniform over features φ, s
MT | feedback | Gaussian over space: σ_{x,y} = 13.44 px; sum over speed channels for V1 motion cells
MSTl | feedback | Gaussian over space: σ_{x,y} = 6.11 px; average of MT speed channels replicated

In V1, driving feedforward activity is generated by a bank of spatio-temporal Gabor filters of four different orientations, each with direction selectivities orthogonal to the orientation of a Gabor RF (yielding eight movement directions). We also include oriented filters selective to zero motion, which are separable in space-time. To derive phase-invariant responses for the retinal image, both sine and cosine Gabors are used in quadrature. The vector of wave propagation for the filters is slanted towards the temporal axis (see the lower inset in Fig. 1). In order to yield causal impulse response properties, filter weights are set to zero for negative times. The output of the spatio-temporally separable filters is used to subtractively inhibit the inseparable filters responding to normal flow orthogonal to their orientation preference. This rectified difference drives V1 E-cells, forming a simple mechanism for enhancing end-stop responses. V1 inhibition pools excitatory activation over a Gaussian-weighted spatial neighborhood for each direction tuning separately. Thus, V1 cells with different direction selectivities do not interact.
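The following deliberately simplified sketch (1D space plus time rather than the full 2D filter bank) illustrates the two ingredients described above: phase-invariant energy from a quadrature Gabor pair, and the rectified difference between a motion-tuned and a static (zero-motion) filter that drives the V1 E-cells. Frequencies, sampling grid, and function names are assumptions for illustration, not the authors' exact parameterization.

```python
import numpy as np

def gabor_st(x, t, omega_x, omega_t, sigma_x, sigma_t, phase):
    """Spatio-temporal Gabor (1D space x time); a carrier tilted in
    space-time yields direction/speed selectivity."""
    envelope = np.exp(-x**2 / (2 * sigma_x**2) - t**2 / (2 * sigma_t**2))
    carrier = np.cos(2 * np.pi * (omega_x * x + omega_t * t) + phase)
    return envelope * carrier * (t >= 0)   # causal: zero weight for negative times

def quadrature_energy(stimulus, omega_x, omega_t, sigma_x=1.22, sigma_t=33.0):
    """Phase-invariant energy from a sine/cosine Gabor pair."""
    x = np.arange(-5, 6)[:, None]          # px
    t = np.arange(0, 100, 2.0)[None, :]    # ms
    even = gabor_st(x, t, omega_x, omega_t, sigma_x, sigma_t, 0.0)
    odd = gabor_st(x, t, omega_x, omega_t, sigma_x, sigma_t, np.pi / 2)
    return np.sqrt(np.sum(even * stimulus)**2 + np.sum(odd * stimulus)**2)

def v1_drive(stimulus, omega_x=0.1, speed=100.0):
    """Rectified difference between a motion-tuned and a static filter
    response; this difference drives the V1 E-cells."""
    omega_t = -omega_x * speed / 1000.0    # space-time tilt sets the speed tuning
    moving = quadrature_energy(stimulus, omega_x, omega_t)
    static = quadrature_energy(stimulus, omega_x, 0.0)
    return max(moving - static, 0.0)
```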

In MT, feedforward driving activity is generated by a filter bank consisting of 48 space-velocity selective filters similar to those shown in the upper inset of Fig. 1. These causal kernels are cylinders aligned along the temporal axis that have a Difference-of-Gaussians (DoG) cross-section (similar to [40]). Different speed tuning is generated by rotation towards or away from the t-axis, and a rotation around the t-axis tunes the direction selectivity. The cylindrical weighting is extended to cover approximately 300 ms for low speeds. Its shape was chosen to sample just a few cells of V1 over its past temporal window while keeping a high sensitivity for direction and speed, since the inhibitory Gaussian component punishes spatio-temporal misalignment. The MT inhibitory kernel for integrating over the pool of activations is Gaussian over space but uniform in both speed and direction. This allows strong activity of E-cells of a given feature combination to suppress nearby E-cells of a competing feature combination. To facilitate lateral integration of E-cell activities, a kernel is specified that is Gaussian in space and over directions but uniform over speeds. Activities from cells with different speed tuning can facilitate each other, as can activities of directly adjacent direction channels. To prevent neuronal self-excitation, the central spatial coefficient of the kernel is zero.
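The DoG cross-section of these MT feedforward kernels can be sketched as follows, using the center/surround widths from Table II. The grid size and the mass-balanced weighting of center and surround are assumptions; the full kernel would extend this profile along a speed- and direction-dependent axis in space-time.

```python
import numpy as np

def dog_cross_section(size=9, sigma_cen=1.22, sigma_sur=1.96):
    """Difference-of-Gaussians cross-section of an MT feedforward kernel
    (spatial part only)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    rr2 = xx**2 + yy**2
    center = np.exp(-rr2 / (2 * sigma_cen**2))
    surround = np.exp(-rr2 / (2 * sigma_sur**2))
    # Balance center and surround so both lobes integrate to the same
    # mass before subtraction (an assumption of this sketch).
    return center / center.sum() - surround / surround.sum()

kernel = dog_cross_section()
print(kernel.shape, round(kernel.sum(), 6))   # sums to ~0 by construction
```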

Re-entrant feedback signals from levels higher in the processing hierarchy are generated by convolving the output activity distribution with kernels that specify the projective field of the modulating signals. MT feedback signals are generated by convolving E-cell activations with a spatial Gaussian to define the modulatory cone for V1 cells. Speed-selective responses in MT are marginalized over the different speed channels to facilitate V1 spatio-temporal cell responses of matching direction selectivity. Finally, direction-selective model MSTl cells pool MT output using spatial Gaussian weights and uniform weighting over speeds. The speed integration makes MSTl selective for homogeneous directional motion fields. The resulting activation is fed back to enhance MT velocity representations for corresponding directions regardless of their speed characteristics.
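The sketch below illustrates how such a modulatory feedback signal could be formed from MT activity: responses are summed over the speed channels and blurred spatially before entering the (1 + λ·g^{mod}) gain of eq. (1). The array layout and the use of SciPy's Gaussian filter are assumptions for illustration; the Gaussian width and λ follow Tables II and I.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mt_to_v1_feedback(mt_activity, sigma=13.44):
    """MT activity has shape (H, W, n_speeds, n_dirs); marginalize over the
    speed axis and blur over space to obtain the modulatory cone for the
    matching V1 direction channels (cf. Table II, MT feedback kernel)."""
    summed = mt_activity.sum(axis=2)                          # (H, W, n_dirs)
    return gaussian_filter(summed, sigma=(sigma, sigma, 0))   # blur x, y only

# Example with random (assumed) MT activity on a 128x128 grid.
mt = np.random.rand(128, 128, 6, 8)
g_mod = mt_to_v1_feedback(mt)
gain = 1.0 + 0.5 * g_mod    # modulatory gain term of eq. (1), lambda = 0.5
```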

TABLE III
FILTER NORMALIZATION COEFFICIENTS

Layer | Variable | Value | Unit | Post-multiplier type
V1 | k^{V1,exc}_r | 20000 | 1 | E-cell excitatory
V1 | k^{V1,inh}_r | 1250 | 1 | E-cell inhibitory
V1 | k^{V1,mod}_r | 0.02 | 1 | E-cell modulatory
V1 | k^{V1,exc}_p | 10000 | 1 | I-cell excitatory
V1 | k^{V1,inh}_p | 1250 | 1 | I-cell inhibitory
MT | k^{MT,exc}_r | 20000 | 1 | E-cell excitatory
MT | k^{MT,inh}_r | 5000 | 1 | E-cell inhibitory
MT | k^{MT,lat}_r | 1 | 1 | E-cell lateral modulatory
MT | k^{MT,fb}_r | 0.05 | 1 | E-cell feedback modulatory
MT | k^{MT,exc}_p | 10000 | 1 | I-cell excitatory
MT | k^{MT,inh}_p | 5000 | 1 | I-cell inhibitory

IV. RESULTS

To evaluate the function of the proposed model we ran simulations on several different inputs. We used synthetic input configurations similar to those in animal electrophysiology or behavioral studies [19, 41] as well as real-world address-event representation (AER) data from [42] and our own recordings made with a neuromorphic DVS camera [43]. Using a moving oriented bar we investigate how spatio-temporal changes of intrinsically 1D and 2D intensity features are initially detected, how these responses are differently weighted to link them into a coherent moving shape, and how the different stages along the model visual pathway interact in order to disambiguate local measures to represent the target movement. We show how the aperture problem for the bar is resolved dynamically and how a representation of the moving shape is generated in model MT. Our simulations show that reliable shape motion representations for different directions and speeds are only achieved at the level of MT but not in V1.

To determine the relative importance of each model component we conducted simulated lesion studies eliminating selected connections between layers and show how the generation of moving shape representations is impaired. We also demonstrate that the model architecture is capable of processing real camera input from a database provided in [42] (see Fig. 4). Finally, the representational differences of the modeled areas are compared.

For all simulations the model equations have been discretized using the explicit Euler scheme with a step size of 2 ms. To read out robust quantities for the response estimates and plots of this section we apply an exponential filter (τ_res = 13 time steps) on the output spikes. The resulting activation g^{MT,res}_r resembles a postsynaptic potential similar to (13).
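A minimal sketch of this readout filter, assuming a per-neuron binary spike raster sampled at the 2 ms Euler step; the variable names are ours and the scaling factor of eq. (13) is omitted.

```python
import numpy as np

def readout_trace(spikes, tau_res=13):
    """Exponentially filter output spikes to obtain g^{MT,res}_r, the
    smoothed activation used for the plots; tau_res is in time steps."""
    g = np.zeros_like(spikes, dtype=float)
    for t in range(1, spikes.shape[0]):
        g[t] = g[t - 1] + (-g[t - 1] + spikes[t]) / tau_res
    return g

# Example: an assumed spike raster of shape (time_steps, n_neurons).
raster = (np.random.rand(700, 16) < 0.05).astype(float)
g_res = readout_trace(raster)
```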

A. Direction & Speed Estimation

To estimate individual motion vectors we employ a simple readout scheme of MT activity. At each spatial location (x, y), neural responses of input filtering for combined direction (φ) and speed (s) selectivity have been generated to yield velocity votes \vec{v}_e(φ, s) = s \cdot (\cos φ, \sin φ). The final velocity estimate is calculated as the sum of all votes weighted by the responses of all model MT neurons at that location:

\vec{v} = \frac{\sum_{φ,s} \vec{v}_e(φ, s) \cdot g^{MT,res}_r(φ, s)}{\sum_{φ,s} g^{MT,res}_r(φ, s)}    (15)
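A direct transcription of eq. (15) for one spatial location. The direction and speed grids follow the model discretization (eight directions in 45° steps, six speed channels from Table II); the activity array in the example is an assumed placeholder.

```python
import numpy as np

DIRECTIONS = np.deg2rad(np.arange(8) * 45.0)              # eight direction channels
SPEEDS = np.array([24.0, 40.0, 56.0, 72.0, 88.0, 104.0])  # px/s, MT speed channels

def velocity_estimate(g_res):
    """Eq. (15): population-vector readout of MT activity at one location.
    g_res has shape (n_dirs, n_speeds) and holds g^{MT,res}_r(phi, s)."""
    phi = DIRECTIONS[:, None]
    s = SPEEDS[None, :]
    votes_x = s * np.cos(phi)     # x-component of v_e(phi, s)
    votes_y = s * np.sin(phi)     # y-component of v_e(phi, s)
    total = g_res.sum()
    vx = (votes_x * g_res).sum() / total
    vy = (votes_y * g_res).sum() / total
    return np.array([vx, vy])

# Example: activity concentrated on the 45-degree / 56 px/s channel (assumed).
g = np.zeros((8, 6))
g[1, 2] = 1.0
print(velocity_estimate(g))   # approx. (39.6, 39.6), i.e. 56 px/s along 45 degrees
```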

We employ the average angular error (AAE) as a metric to evaluate the deviation between velocity estimates and the ground truth motion. Since high neuronal responses are interpreted as high confidence, we weight the calculated average by the motion energy, i.e. the sum of all cell responses at each pixel location.
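A sketch of this error metric under the conventions just described; the per-pixel weighting by summed responses ("motion energy") follows the text, while the array shapes are assumptions.

```python
import numpy as np

def weighted_aae(v_est, v_gt, energy):
    """Average angular error between estimated and ground-truth flow,
    weighted per pixel by the summed MT responses ('motion energy').
    v_est, v_gt: arrays of shape (H, W, 2); energy: shape (H, W)."""
    dot = (v_est * v_gt).sum(axis=-1)
    norm = np.linalg.norm(v_est, axis=-1) * np.linalg.norm(v_gt, axis=-1)
    cos_ang = np.clip(dot / np.maximum(norm, 1e-12), -1.0, 1.0)
    ang_err = np.degrees(np.arccos(cos_ang))
    return (ang_err * energy).sum() / energy.sum()
```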

Fig. 2. AAE of MT motion estimation for different bar stimulus lengths over time. Each bar stimulus is 1 px wide and moves along 45° at 64 px/s. (a) AAE for six different bar lengths (11, 21, 31, 41, 51, and 61 px); larger stimuli take longer to minimize the error. (b)-(e) Spatial distribution of angular error for the 61 px bar over time steps 10 to 700. (f)-(i) Same as above but for the 21 px bar. (j) AAE legend ranging from 0° (white) to ±22.5° (dark gray).

Fig. 2(a) shows the AAE for bars of different lengths moving along 45° at 64 px/s. Longer stimuli need more time to reduce their AAE, and the shortest bar's error settles at ΔAAE = 3.2°. Responses are stronger at end-points, and from such localized intrinsic feature locations the correct motion hypothesis is propagated through existing activity along the complete edge [41]. Localized features are enhanced in V1 as well, while responses signaling locally ambiguous motion (normal flow along elongated contrast) are reduced.

The movement speed s_{x,y} is the length of the estimated motion vector. Fig. 3 shows how such speed estimates evolve over time along the 51 px bar stimulus from the previous experiment. The rates in blue show the responses of all MT neurons which are selective to either the direction of 0° or 45° at the respective position along the bar. As the stimulus is moving diagonally upwards to the right, the overall neural activity shifts from left to right in the plots. To compare the velocity estimate with the activity of responsive neurons of both subpopulations we projected \vec{v} onto their tuning directions (depicted in red in Fig. 3). The first row shows MT responses for the actual stimulus direction of 45°. The ground truth of 64 px/s is marked in green. The second row shows the responses of neurons sensitive to rightward motion. This direction reflects the normal flow as signaled by most V1 neurons (which suffer from the aperture problem due to their small RF sizes). The ground truth speed is thus scaled by cos(45°). Speed estimates approach ground truth after half the simulation time at those locations where strong MT activity exists. Where activity is weak (indicating unreliable estimates), speed estimates diverge from ground truth due to a higher sensitivity to perturbations, see Fig. 3(d).

Fig. 3. Progression of speed component estimates for a 51 px vertical bar stimulus. MT responses (blue) are shown along a vertical cut through the population at the respective horizontal location of the stimulus. Darker shades depict responses of neurons selective to higher speeds. Ground truth speed components (green) and estimated speed components (red) are the projected lengths of the stimulus motion vector and of \vec{v} onto the depicted direction tunings. (a)-(b) Responses of MT neurons sensitive to direction 45° (the stimulus movement direction) at time steps 50 and 350. (c)-(d) Responses of MT neurons sensitive to direction 0° (the stimulus orientation and direction of normal flow) at time steps 50 and 350.

B. Simulated Lesion Studies

To ascertain the importance of each of the recurrent lateral and feedback pathways in our model we conducted several lesion studies. In such simulations the connections of either the MT-V1 feedback, the MSTl-MT feedback, the lateral connections within MT, or combinations of those have been set to zero.

Fig. 4. Exemplary MT flow estimates on real-world AER data. On- and off-events are marked either black and white (a) or red and green (d). (a)-(c) Windmill rotating counterclockwise with a tangential velocity of 148 px/s at the tips (input events and MT flow at steps 51 and 101). (d)-(e) Moving checkerboard setup of sequence 3 of [42] (input events and MT flow at step 345). The ground truth motion direction is upwards to the left along 118°. (f) Color wheel encoding the direction of the flow estimates.

Fig. 5(a) shows that eliminating any feedback signals re-entering V1 and MT, akin to considering a pure feedforward network, drastically impairs the computation of shape motion. The error reaches only ΔAAE = 27° after about 100 time steps and barely improves after that. Feedback from MSTl into MT alone (no feedback from MT into V1) converges to ΔAAE = 25.6° after 150 time steps. Feedback from MT into V1 (but removing re-entrant signals from MSTl to MT) leaves a final ΔAAE = 17.3°, which shows that feedback at the lower levels has a large impact. However, all feedback signaling contributes to generating robust moving shape representations: using both feedback streams but no lateral interaction in MT leads to a ΔAAE = 2.1°. Including lateral interaction as well, the full model yields ΔAAE = 1.6° at the end of the simulation and is still improving. When any feedback stream is missing, lateral connections alone show no noticeable effect.

C. Differences in V1, MT and MSTl Representations

All three layers of the model architecture serve different purposes (Section III-A). Fig. 6 shows the neuronal responses of each layer and the AAE estimated near the end of the simulations. The detailed investigation considers the diagonally moving vertical bar (51 px) used above.

Fig. 5. AAE of MT motion estimation in the lesion studies for a 31 px bar stimulus. (a) From bottom to top: baseline with no lesions (red), no lateral interaction in MT (yellow), no feedback from MSTl (blue), no feedback from MT (green), and no feedback from either MT or MSTl (gray). (b)-(f) From left to right: baseline, no lateral interaction in MT, no feedback from MSTl, no feedback from MT, and lateral interaction only (no feedback). Spatial distribution of the angular error shown at time step 700. Same AAE legend as in Fig. 2(j) (brighter is better).

V1 neurons have a small RF and detect the true motion direction only at points of intrinsically 2D structure (e.g., at the endpoints of the bar) or at the localized feature points in complex patterns. The AAE there is low. Feedback from MT facilitates correct responses; however, the normal flow components of V1 are significantly reduced but not replaced by activity corresponding to the true direction. V1 cells are selective to the individual components of a moving pattern but insensitive to whole patterns [44]. This is reflected by the gray patches in Fig. 6(a) and the response strengths along the outline boundary shown in Fig. 6(d,e).

MT neurons have larger RFs and integrate V1 activity over a neighborhood. They fill in gaps along existing activity distributions and also have the capacity for precise speed estimation. The AAE is low over the complete stimulus length and its vicinity. The responses of both neuron populations with their tuning speed closest to the stimulus speed are nearly constant over the bar's extent. Also, the relative amount of response for normal flow is smaller than for V1 neurons (Fig. 6(d-g)).

Neurons in MSTl integrate MT activities over a larger region to enforce consistency among the directional hypotheses generated by lower-layer neurons, here yielding a homogeneous translatory field. The responses in MSTl can thus be interpreted as votes for the presence of a given direction of motion, as shown in Fig. 6(h,i) for 45° (ground truth). The spread of activation (positional uncertainty) over the neighborhood of stimulus locations increases due to the increasing RF sizes in higher layers, but modulating feedback together with normalization effectively restricts the amount of spread.

V. DISCUSSION & CONCLUSION

The proposed architecture explains how motion can be estimated in model area MT, utilizing feedforward and feedback streams between areas V1, MT and MSTl along the dorsal pathway. Each layer has a different feature space and plays a different role. Feedback and lateral interaction govern the spread of information. This way, motion hypotheses can be integrated over distances that exceed the receptive fields of MT and disambiguated through competitive normalization.

Fig. 6. Representational differences of the three model areas. The stimulus is a vertical 51 px bar and activity was sampled at time step 650. (a)-(c) Angular error of estimates from the neuronal activity in layers V1, MT and MSTl. The same error legend as in Fig. 2(j) is used (white means zero error). The stimulus is shown in red. Increasing activation spread reflects the larger RF sizes. (d)-(e) Activities of V1 neurons selective to direction 45° (the real stimulus movement) and 0° (the stimulus orientation and direction of normal flow). Dark and light blue mark neurons selective to movement and to stationary contrasts, respectively, at the current horizontal position of the bar stimulus. (f)-(g) Activities of MT neurons selective to directions 45° and 0°. Darker shades denote neurons of higher tuning speeds. (h)-(i) Activities of MSTl neurons for directions 45° and 0°. All ordinate axes show the maximum range of neuronal activity.

The importance of these recurrent connections has been demonstrated in simulated lesion studies. For a diagonally moving bar stimulus the error can drop below ΔAAE = 2°, which is comparable to the results in [21], but removing any of the feedback streams drastically magnifies the error.

Still, there are limitations to the model. MT needs strong responses of V1 neurons at the end-points of the stimulus, where the actual motion can be discerned in V1. Here, we decrease the response of all V1 movement neurons along the bar by subtracting the response of static contour neurons, which is weaker at the endpoints because the RF is only partially filled there. This works well for the end-points of a bar and for 90° corners in a contour. At T-junctions or crossings of long bars the inhibition becomes too strong and disambiguation takes significantly longer. Mechanisms for detecting intrinsic 2D features would stabilize the solution. These could either be signaled by a separate population of end-stopped neurons [21, 29] or be generated by scale-sensitive filter interactions within area V1 or between V1 and V2 [45].

A second issue is the current choice of feedforward filters in V1. Spatio-temporal Gabor filters impose a speed tuning, so the contour inhibition mechanism will not work over an arbitrary range of stimulus speeds. This can be seen in Fig. 4(b,c), where there is nearly no response close to the center of the windmill. One could switch to correlation detectors that have a constant response for movements of any positive speed. Another possibility is to create multiple filters with different tuning speeds and average their responses.

Future work can be pursued along several directions. The spatial size of each layer was chosen to be the same number of pixels. However, higher areas along the dorsal pathway have less spatial resolution than V1, thus approximating a hierarchical pyramid of spatio-temporal filtering. Such a pyramidal representation might improve efficiency in simulation time. This will also support more rigorous testing of the proposed model on a greater variety of input data sets. Finally, creating a fully spike-based model and realizing it on neuromorphic hardware is intended as a further extension.

REFERENCES

[1] R. T. Born and D. C. Bradley, "Structure and function of visual area MT," Annu. Rev. Neurosci., vol. 28, pp. 157–189, 2005.
[2] C. C. Pack and R. T. Born, "Cortical mechanisms for the integration of visual motion," in The Senses: A Comprehensive Reference, R. H. Masland, T. D. Albright, et al., Eds., vol. 2, New York: Academic Press, 2008, pp. 189–218.
[3] R. T. Born, J. M. G. Tsui, and C. C. Pack, "Temporal dynamics of motion integration," in Dynamics of Visual Motion Processing: Neuronal, Behavioral, and Computational Approaches, U. J. Ilg and G. S. Masson, Eds., Boston, MA: Springer US, 2010, pp. 37–54.
[4] J.-M. Hupe, A. C. James, B. R. Payne, S. G. Lomber, P. Girard, and J. Bullier, "Cortical feedback improves discrimination between figure and background by V1, V2 and V3 neurons," Nature, vol. 394, pp. 784–787, 1998.
[5] A. K. Seth, J. L. McKinstry, G. M. Edelman, and J. L. Krichmar, "Visual binding through reentrant connectivity and dynamic synchronization in a brain-based device," Cereb. Cortex, vol. 14, no. 11, pp. 1185–1199, 2004.
[6] W. A. Phillips, "Cognitive functions of intracellular mechanisms for contextual amplification," Brain Cogn., vol. 112, pp. 39–53, 2017.
[7] T. Brosch and H. Neumann, "Computing with a canonical neural circuits model with pool normalization and modulating feedback," Neural Comput., vol. 26, no. 12, pp. 2735–2789, 2014.
[8] R. Brette and W. Gerstner, "Adaptive exponential integrate-and-fire model as an effective description of neuronal activity," J. Neurophysiol., vol. 94, no. 5, pp. 3637–3642, 2005.
[9] S. Shipp, "Neural elements for predictive coding," Front. Psychol., vol. 7, 2016.
[10] D. J. Heeger, E. P. Simoncelli, and J. A. Movshon, "Computational models of cortical visual processing," Proc. Natl. Acad. Sci. U.S.A., vol. 93, no. 2, pp. 623–627, 1996.
[11] E. P. Simoncelli and D. J. Heeger, "A model of neuronal responses in visual area MT," Vision Res., vol. 38, no. 5, pp. 743–761, 1998.
[12] D. J. Heeger, "Normalization of cell responses in cat striate cortex," Visual Neurosci., vol. 9, no. 2, pp. 181–197, 1992.
[13] M. Carandini and D. J. Heeger, "Normalization as a canonical neural computation," Nature Reviews Neurosci., vol. 13, pp. 51–62, 2012.
[14] N. C. Rust, V. Mante, E. P. Simoncelli, and J. A. Movshon, "How MT cells analyze the motion of visual patterns," Nat. Neurosci., vol. 9, no. 11, pp. 1421–1431, 2006.
[15] E. H. Adelson and J. R. Bergen, "Spatiotemporal energy models for the perception of motion," J. Opt. Soc. Am. A, vol. 2, no. 2, pp. 284–299, 1985.
[16] A. B. Watson and A. J. Ahumada, "Model of human visual-motion sensing," J. Opt. Soc. Am. A, vol. 2, no. 2, pp. 322–341, 1985.
[17] J. P. H. van Santen and G. Sperling, "Elaborated Reichardt detectors," J. Opt. Soc. Am. A, vol. 2, no. 2, pp. 300–321, 1985.
[18] W. Reichardt, "Autokorrelations-Auswertung als Funktionsprinzip des Zentralnervensystems," Zeitschrift für Naturforschung B, vol. 12, no. 7, pp. 448–457, 1957.
[19] C. C. Pack and R. T. Born, "Temporal dynamics of a neural solution to the aperture problem in visual area MT of macaque brain," Nature, vol. 409, no. 6823, pp. 1040–1042, 2001.
[20] C. C. Pack, M. S. Livingstone, K. R. Duffy, and R. T. Born, "End-stopping and the aperture problem," Neuron, vol. 39, no. 4, pp. 671–680, 2003.
[21] J. M. G. Tsui, J. N. Hunter, R. T. Born, and C. C. Pack, "The role of V1 surround suppression in MT motion integration," J. Neurophysiol., vol. 103, no. 6, pp. 3123–3138, 2010.
[22] F. Solari, M. Chessa, N. K. Medathati, and P. Kornprobst, "What can we expect from a V1-MT feedforward architecture for optical flow estimation?" Image Commun., vol. 39, no. PB, pp. 342–354, 2015.
[23] O. W. Layton, E. Mingolla, and N. A. Browning, "A motion pooling model of visually guided navigation explains human behavior in the presence of independently moving objects," J. Vis., vol. 12, no. 1, pp. 1–19, 2012.
[24] J. Chey, S. Grossberg, and E. Mingolla, "Neural dynamics of motion processing and speed discrimination," Vision Res., vol. 38, no. 18, pp. 2769–2786, 1998.
[25] N. J. Priebe, C. R. Cassanello, and S. G. Lisberger, "The neural representation of speed in macaque area MT/V5," J. Neurosci., vol. 23, no. 13, pp. 5650–5661, 2003.
[26] J. A. Perrone, "Economy of scale: A motion sensor with variable speed tuning," J. Vis., vol. 5, pp. 28–33, 2005.
[27] O. W. Layton and B. R. Fajen, "Competitive dynamics in MSTd: A mechanism for robust heading perception based on optic flow," PLOS Comp. Biol., vol. 12, no. 6, 2016.
[28] S. Grossberg, E. Mingolla, and L. Viswanathan, "Neural dynamics of motion integration and segmentation within and across apertures," Vision Res., vol. 41, no. 19, pp. 2521–2553, 2001.
[29] C. Beck and H. Neumann, "Combining feature selection and integration—a neural model for MT motion selectivity," PLoS ONE, vol. 6, no. 7, e21254, 2011.
[30] F. Raudies, E. Mingolla, and H. Neumann, "A model of motion transparency processing with local center-surround interactions and feedback," Neural Comput., vol. 23, no. 11, pp. 2868–2914, 2011.
[31] P. Bayerl and H. Neumann, "Disambiguating visual motion through contextual feedback modulation," Neural Comput., vol. 16, no. 10, pp. 2041–2066, 2004.
[32] K. Guo, R. Robertson, A. Nevado, M. Pulgarin, S. Mahmoodi, and M. P. Young, "Primary visual cortex neurons that contribute to resolve the aperture problem," Neuroscience, vol. 138, no. 4, pp. 1397–1406, 2006.
[33] L. I. Abdul-Kreem and H. Neumann, "Neural mechanisms of cortical motion computation based on a neuromorphic sensory system," PLoS ONE, vol. 10, no. 11, e0142488, 2015.
[34] A. P. Shon, R. P. N. Rao, and T. J. Sejnowski, "Motion detection and prediction through spike-timing dependent plasticity," Netw. Comput. Neural Syst., vol. 15, no. 3, pp. 179–198, 2004.
[35] Q. Wu, T. M. McGinnity, L. Maguire, J. Cai, and G. D. Valderrama-Gonzalez, "Motion detection using spiking neural network model," ICIC, pp. 76–83, 2008.
[36] M.-J. Escobar, G. S. Masson, T. Vieville, and P. Kornprobst, "Action recognition using a bio-inspired feedforward spiking network," Int. J. Comput. Vision, vol. 82, no. 3, pp. 284–301, 2009.
[37] B. A. Kaplan, A. Lansner, G. S. Masson, and L. U. Perrinet, "Anisotropic connectivity implements motion-based prediction in a spiking neural network," Front. Comput. Neurosci., vol. 7, 2013.
[38] M. Giulioni, X. Lagorce, F. Galluppi, and R. B. Benosman, "Event-based computation of motion flow on a neuromorphic analog neural platform," Front. Neurosci., vol. 10, 2016.
[39] A. Roth and M. C. W. van Rossum, "Modeling synapses," in Computational Modeling Methods for Neuroscientists, E. de Schutter, Ed., The MIT Press, 2009, pp. 139–160.
[40] S. Nishimoto and J. L. Gallant, "A three-dimensional spatiotemporal receptive field model explains responses of area MT neurons to naturalistic movies," J. Neurosci., vol. 31, no. 41, pp. 14551–14564, 2011.
[41] R. T. Born, C. C. Pack, and R. Zhao, "Integration of motion cues for the initiation of smooth pursuit eye movements," in The Brain's Eye: Neurobiological and Clinical Aspects of Oculomotor Research, ser. Progress in Brain Research, J. Hyona, D. Munoz, W. Heide, and R. Radach, Eds., vol. 140, Elsevier, 2002, pp. 225–237.
[42] F. Barranco, C. Fermuller, Y. Aloimonos, and T. Delbruck, "A dataset for visual navigation with neuromorphic methods," Front. Neurosci., vol. 10, 2016.
[43] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128×128 120 dB 15 µs latency asynchronous temporal contrast vision sensor," J. Solid-State Circuits, vol. 43, no. 2, 2008.
[44] C. C. Pack, A. J. Gartland, and R. T. Born, "Integration of contour and terminator signals in visual area MT of alert macaque," J. Neurosci., vol. 24, no. 13, pp. 3268–3280, 2004.
[45] J. Bolz and C. D. Gilbert, "Generation of end-inhibition in the visual cortex via interlaminar connections," Nature, vol. 320, pp. 362–365, 1986.