Visual Attention and Recognition Through Neuromorphic Modeling of “Where” and “What” Pathways
Zhengping Ji
Embodied Intelligence Laboratory
Computer Science and Engineering
Michigan State University, East Lansing, USA
Outline
- Attention and recognition: the chicken-egg problem
- Motivation: brain-inspired, neuromorphic modeling of the brain’s visual pathway
- Saliency-based attention
- Where-What Network (WWN):
  - How to integrate saliency-based attention and top-down attention control
  - How attention and recognition help each other
- Conclusions and future work
What is attention?
Bottom-up Attention (Saliency)

Attention Shifting
Spatial Top-down Attention Control
e.g. pay attention to the center

Object-based Top-down Attention Control
e.g. pay attention to the square
Chicken-egg Problem
Without attention, recognition cannot do well: recognition requires attended areas for further processing.
Without recognition, attention is limited: attention must draw not only on bottom-up saliency-based cues, but also on top-down object-dependent signals and top-down spatial control.
Problem
Challenge
- High-dimensional space
- Background noise
- Large variance: scale, shape, illumination, viewpoint, ...
Saliency-based Attention (I)
[Diagram: IHDR trees, heading direction, and attention windows Win1–Win6 with errors e1–e6 along a desired path]
- Boundary detection part: the mapping from two visual images to the correct road boundary type for each sub-window (reinforcement learning).
- Action generation part: the mapping from road boundary type to the correct heading direction (supervised learning).
- Naïve way: choose the attention windows by guessing.
Saliency-based Attention (II)
Low-level image processing (Itti & Koch et al. 1998)
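The Itti & Koch-style saliency map is built from low-level center-surround contrasts. A minimal NumPy sketch of that idea, assuming a single grayscale channel and using box filters in place of the Gaussian pyramid and the color/orientation channels of the actual model:

```python
import numpy as np

def box_blur(img, k):
    """Separable box blur of width k (a crude stand-in for one pyramid level)."""
    kern = np.ones(k) / k
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kern, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kern, mode='same'), 0, tmp)

def center_surround_saliency(img, center=3, surround=15):
    """Absolute difference between a fine-scale and a coarse-scale blur,
    normalized to [0, 1]: regions that differ from their surround pop out."""
    diff = np.abs(box_blur(img, center) - box_blur(img, surround))
    return diff / (diff.max() + 1e-12)

# A lone bright pixel on a dark field is most salient at its own location.
img = np.zeros((40, 40))
img[20, 20] = 1.0
sal = center_surround_saliency(img)
```

This is only the saliency front end; the full model also normalizes and sums maps across features and scales before the winner-take-all stage.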
Review
- Attention and recognition: the chicken-egg problem
- Motivation: brain-inspired, neuromorphic modeling of the brain’s visual pathway
- Saliency-based attention
- Where-What Network (WWN):
  - How to integrate saliency-based attention and top-down attention control
  - How attention and recognition help each other
- Conclusions and future work
Biological Motivations
Challenge: Foreground Teaching
How does a neuron separate a foreground from a complex background?
- No teacher is needed to hand-segment the foreground.
- Fixed foreground, changing background (e.g., during a baby’s object tracking).
- The background weights are averaged out (no effect during neuronal competition).
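The claim that a changing background "averages out" of a neuron's weights can be checked with a toy simulation. This is not the WWN learning rule itself, just an incremental mean over inputs with a fixed foreground and a freshly drawn random background each presentation:

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.zeros((8, 8))                 # one neuron's weights over an 8x8 input
for t in range(1, 2001):
    x = rng.random((8, 8))           # background redrawn every presentation
    x[2:6, 2:6] = 1.0                # fixed 4x4 foreground (e.g., tracked object)
    w += (x - w) / t                 # incremental mean of all inputs so far

# Foreground weights lock onto the object; background weights flatten
# toward the background's mean (~0.5), carrying almost no discriminative
# signal for the neuronal competition.
fg_w = w[2:6, 2:6]
bg_mask = np.ones((8, 8), bool)
bg_mask[2:6, 2:6] = False
bg_w = w[bg_mask]
```

After a few hundred presentations the background weights are nearly uniform, which is why no hand-segmentation of the foreground is needed.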
Novelty
- Bottom-up attention: Koch & Ullman 1985; Itti & Koch et al. 1998; Baker et al. 2001; etc.
- Position-based top-down control: Olshausen et al. 1993; Tsotsos et al. 1995; Mozer et al. 1996; Schill et al. 2001; Rao et al. 2004; etc.
- Object-based top-down control: Deco & Rolls 2004 (no performance evaluation); etc.
Our work:
- Saliency is built from developed features.
- Both bottom-up and top-down attention control.
- Top-down control: by object, by position, or none.
- Attention and recognition form a single process.
ICDL Architecture
[Diagram: Image (40*40) → V1 → V2 → motors, with receptive fields of 11*11, 11*11, and 21*21. “What”-motor: global connection. “Where”-motor: (r, c), 40*40, pixel-based; foreground size fixed at 20*20; global connection.]
Multi-level Receptive Fields
Layer Computation
- Compute the pre-response of cell (i, j) at time t.
- Sort: z1 ≥ z2 ≥ … ≥ zk ≥ … ≥ zm.
- Only the top-k neurons respond, to keep selectiveness and long-term memory.
- The response range is normalized.
- Update the local winners.
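The top-k competition step above can be sketched as follows, assuming the normalization shifts by the (k+1)-th pre-response and scales by the winner's range (the exact normalization varies across WWN versions):

```python
import numpy as np

def topk_compete(z, k):
    """Top-k spatial competition: only the k neurons with the largest
    pre-responses fire; their responses are rescaled to (0, 1] and the
    rest are silenced, preserving selectiveness and long-term memory."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    order = np.argsort(z)[::-1]                  # indices by pre-response, descending
    winners = order[:k]
    z_ref = z[order[k]] if k < z.size else 0.0   # (k+1)-th largest as the floor
    span = z[winners[0]] - z_ref
    if span > 0:
        out[winners] = (z[winners] - z_ref) / span
    else:
        out[winners] = 1.0                       # all winners tied
    return out
```

For example, pre-responses [0.9, 0.1, 0.5, 0.3] with k = 2 keep only neurons 0 and 2, with normalized responses 1.0 and 1/3.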
In-place Learning Rule
- Do not use back-prop: it is not biologically plausible and does not give long-term memory.
- Do not use any distribution model (e.g., Gaussian mixture): avoid the high complexity of the covariance matrix.
- New Hebbian-like rule with automatic plasticity scheduling: only winners update.
- Minimum error toward the target in every incremental estimation stage (local first principal component).
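The Hebbian-like rule can be sketched as an amnesic incremental average of response-weighted inputs; only a winning neuron calls this, and its own update count n implements the automatic plasticity scheduling. The amnesic parameters (t1, t2, c, r) here are illustrative, not the published values:

```python
import numpy as np

def inplace_update(w, x, y, n, t1=20, t2=200, c=2.0, r=2000.0):
    """One in-place (no back-prop) update of a winner's weight vector.

    w: synaptic weights; x: input; y: this neuron's normalized response;
    n: the neuron's own firing count. The amnesic term mu(n) boosts the
    learning rate for mature neurons so late samples are not drowned out,
    giving a minimum-error incremental estimate that tracks the local
    first principal component of the neuron's input cluster.
    """
    if n < t1:
        mu = 0.0
    elif n < t2:
        mu = c * (n - t1) / (t2 - t1)
    else:
        mu = c + (n - t2) / r
    lr = (1.0 + mu) / n                  # retention rate is 1 - lr
    return (1.0 - lr) * w + lr * y * x
```

For n below t1 this is an exact running mean: repeatedly feeding the same input with response 1 leaves w equal to that input, with no covariance matrix ever formed.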
Top-down Attention
- Recruit and identify class-invariant features.
- Recruit and identify position-invariant features.
Experiment
- Foreground objects defined by the “what” motor (20*20).
- Attended areas defined by the “where” motor.
- Randomly selected background patches (40*40).
Developed Layer 1
Bottom-up synaptic weights of neurons in Layer 1, developed from randomly selected patches of natural images.
Developed Layer 2
Bottom-up synaptic weights of neurons in Layer 2.
Not intuitive to interpret!
Response-Weighted Stimuli for Layer 2
Experimental Result I
Recognition rate with incremental learning
Experimental Result II
(a) Examples of input images; (b) responses of the attention (“where”) motor when supervised by the “what” motor; (c) responses of the attention (“where”) motor when “what” supervision is not available.
Summary
- The “what” motor helps direct the network’s attention to features of a particular object.
- The “where” motor helps direct attention to positional information (from 45% to 100% accuracy when “where” information is present).
- Saliency-based bottom-up attention, location-based top-down attention, and object-based top-down attention are integrated in the top-k spatial competition rule.
Problems
- The accuracy of the “where” motor is not good: 45.53%.
- Layer 1 was developed offline.
- More layers are needed to handle more positions.
- The “where” motor should be given externally, instead of as a retina-based representation.
- No internal iterations, especially when the number of hidden layers is larger than one.
- No cross-level projections.
Fully Implemented WWN (Original Design)
[Diagram: Image (40*40) → V1 (40*40) → V2 (40*40) → V4 (40*40), plus MT, V3/LIP (31*31), PP, and IT (40*40); receptive fields 11*11, 11*11, 11*11, and 21*21; global connections. “Where”-motor: (r, c), 25 centers. “What”-motor: 4 objects, fixed-size motor.]
Problems
- The accuracies of the “where” and “what” motors are not good: 25.53% for the “what” motor and 4.15% for the “where” motor.
- Too many parameters to tune.
- Training is extremely slow.
- How to do the internal iterations?
  - “Sweeping” way: always use the most recently updated weights and responses.
  - Alternative: always use the weights and responses from iteration p-1, where p is the current iteration count.
  - The response should not be normalized in each lateral inhibition neighborhood.
Modified Simple Architecture
[Diagram: Image (40*40) → V1 → V2 → motors, with receptive fields of 11*11, 11*11, and 21*21; retina-based supervision. “What”-motor: 5 objects, global connection. “Where”-motor: (r, c), 5 centers, global connection; foreground size fixed at 20*20.]
Advantage
- Internal iterations are not necessary.
- The network runs much faster.
- It is easier to track neural representations and evaluate performance.
- Performance evaluation:
  - The “what” motor reaches 100% accuracy on the disjoint test.
  - The “where” motor reaches 41.09% accuracy on the disjoint test.
Problems
[Figure: top-down projection from the motor + bottom-up responses → top-down responses and total responses]
Dominance by the top-down projection: the total responses are dominated by the top-down projection rather than by the bottom-up responses.
Solution
- Sparsify the bottom-up responses by keeping only the local top-k winners of the bottom-up responses.
- The performance of the “where” motor increases from around 40% to 91%.
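The fix for top-down dominance, keeping only the local top-k bottom-up winners before the top-down input is added, might look like the following sketch (the neighborhood radius and k are illustrative, not the deck's values):

```python
import numpy as np

def local_topk_sparsify(resp, radius=1, k=1):
    """Zero every bottom-up response that is not among the top-k in its
    (2*radius+1)^2 neighborhood. A dense bottom-up map is easily drowned
    out by a strong top-down projection; sparsifying it first preserves
    the stimulus-driven peaks in the total response."""
    h, w = resp.shape
    out = np.zeros_like(resp)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            hood = resp[i0:i1, j0:j1].ravel()
            kth = np.sort(hood)[-min(k, hood.size)]   # k-th largest nearby
            if resp[i, j] >= kth:
                out[i, j] = resp[i, j]
    return out
```

Only response peaks that win locally survive to compete with the top-down projection.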
Fully Implemented WWN (Latest)
[Diagram: Image (40*40) → V1 (40*35) → V2 (40*40) → V4 (40*40), plus MT; receptive fields 11*11, 11*11, 11*11, and 21*21; each cortex uses the modified ADAST. “Where”-motor: (r, c), 3*3 centers, fixed size 20*20, smoothed by a Gaussian (40*40). “What”-motor: 5 objects, smoothed by a Gaussian.]
Modified ADAST
[Diagram: previous cortex → L4 → L2/3 → next cortex, with L6 (ranking) and L5 (ranking) layers.]
Other Improvements
- Smooth the external motors using a Gaussian function.
- “Where” motors are evaluated by regression errors.
- The local top-k is adaptive to neuron positions.
- The network does not converge through internal iterations.
- The learning rate for top-down excitation is adaptive over internal iterations.
- Use context information.
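Smoothing an external motor with a Gaussian can be sketched as replacing the one-hot supervision vector with a bump centered on the target unit, so neighboring motor neurons receive graded rather than all-or-nothing teaching signals (the sigma value is illustrative):

```python
import numpy as np

def gaussian_motor(target, n, sigma=1.0):
    """Gaussian-smoothed supervision for an n-unit motor: the target unit
    gets 1.0 and its neighbors fall off smoothly instead of being zero."""
    idx = np.arange(n)
    v = np.exp(-0.5 * ((idx - target) / sigma) ** 2)
    return v / v.max()
```

This graded target also makes a regression-error evaluation of the "where" motor natural, since nearby positions are no longer penalized as hard misses.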
Layer 1 – Bottom-up Weights
Layer 2 – Response-weighted Stimuli
Layer 3 (Where) – Top-down Weights
Layer 3 (What) – Top-down Weights
Test Samples
[Figure: input; “where” motor (ground truth); “what” motor (ground truth); “where” output (saliency-based); “where” output (“what”-supervised); “what” output (saliency-based); “what” output (“where”-supervised)]
Performance Evaluation
Average error for the “where” and “what” motors (250 test samples):

Motor                            | Without supervision | Supervise “where” | Supervise “what”
“Where” (regression error: MSE)  | 4.137 pixels        | N/A               | 4.137 pixels
“What” (classification error: %) | 12.7%               | 12.1%             | N/A
Discussions