Modular Neural Networks II - University of Calgary
TRANSCRIPT
Modular Neural Networks II
Presented by: David Brydon, Karl Martens, David Pereira
CPSC 533 - Artificial Intelligence, Winter 2000. Instructor: C. Jacob. Date: 16-March-2000
Presentation Agenda
A Reiteration Of Modular Neural Networks
Hybrid Neural Networks
Maximum Entropy
Counterpropagation Networks
Spline Networks
Radial Basis Functions
Note: The information contained in this presentation has been obtained from Neural Networks: A Systematic Introduction by R. Rojas.
A Reiteration of Modular Neural Networks
There are many different types of neural networks - linear, recurrent, supervised, unsupervised, self-organizing, etc. Each of these neural networks has a different theoretical and practical approach.
However, each of these different models can be combined.
How? Each of the afore-mentioned neural networks can be transformed into a module that can be freely intermixed with modules of other types of neural networks.
Thus, we have Modular Neural Networks.
A Reiteration of Modular Neural Networks
But WHY do we have Modular Neural Network Systems?
To Reduce Model Complexity
To Incorporate Knowledge
To Fuse Data and Predict Averages
To Combine Techniques
To Learn Different Tasks Simultaneously
To Incrementally Increase Robustness
To Emulate Its Biological Counterpart
Hybrid Neural Networks
A very well-known and promising family of architectures was developed by Stephen Grossberg. It is called ART - Adaptive Resonance Theory.
It is closer to the biological paradigm than feed-forward networks or standard associative memories.
The dynamics of the networks resembles learning in humans.
One-shot learning can be recreated with this model.
There are three different architectures in this family:
ART-1: Uses Boolean values
ART-2: Uses real values
ART-3: Uses differential equations
Hybrid Neural Networks
Each category in the input space is represented by a vector.
The ART networks classify a stochastic series of vectors into clusters.
All vectors located inside the cone around each weight vector are considered members of a specific cluster.
Each unit fires only for vectors located inside its associated ‘cone’ of radius ‘r’.
The value ‘r’ is inversely proportional to the attention parameter of the unit.
Large ‘r’ means the classification of the input space is coarse.
Small ‘r’ means the classification of the input space is fine.
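The cone membership test above can be sketched in a few lines. This is my own illustration (the function name and the use of an angular radius are assumptions, not taken from Rojas): a unit fires only when the input vector falls within a cone of radius r around its weight vector.

```python
import math

def in_cone(x, w, r):
    """Return True if input vector x lies inside the cone of angular
    radius r (radians) around weight vector w.
    Both vectors are assumed to be non-zero."""
    dot = sum(xi * wi for xi, wi in zip(x, w))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    # Clamp to guard against tiny floating-point overshoot before acos
    cos_angle = max(-1.0, min(1.0, dot / (norm_x * norm_w)))
    return math.acos(cos_angle) <= r

# A wide cone (large r) accepts more vectors -> coarser clustering.
print(in_cone([1.0, 0.1], [1.0, 0.0], r=0.5))  # True: small angle
print(in_cone([0.0, 1.0], [1.0, 0.0], r=0.5))  # False: ~90 degrees apart
```

Widening r here makes more inputs fall into the same cluster, which is the coarse/fine trade-off the attention parameter controls.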
Hybrid Neural Networks
Fig. 1. Vector clusters and attention parameters
Hybrid Neural Networks
Once the weight vectors have been found, the network computes whether new data can or cannot be classified by the existing clusters.
If not, a new cluster is created with a new associated weight vector.
ART networks have two major advantages:
Plasticity: the network can always react to unknown inputs (by creating a new cluster with a new weight vector, if the given input cannot be classified by existing clusters).
Stability: existing clusters are not deleted by the introduction of new inputs (new clusters will just be created in addition to the old ones).
However, enough potential weight vectors must be provided.
Hybrid Neural Networks
Fig. 2. The ART-1 Architecture
Hybrid Neural Networks
The Structure of ART-1 (Part 1 of 2):
There are two basic layers of computing units.
Layer F1 receives binary input vectors from the input sites.
As soon as an input vector arrives, it is passed to layer F1 and from there to layer F2.
Layer F2 contains elements which fire according to the “winner-takes-all” method. (Only the element receiving the maximal scalar product of its weight vector and the input vector fires.)
When a unit in layer F2 has fired, the negative weight turns off the attention unit. Also, the winning unit in layer F2 sends back a 1 through the connections between layers F2 and F1.
Now each unit in layer F1 receives as input the corresponding component of the input vector x and of the weight vector w.
Hybrid Neural Networks
The Structure of ART-1 (Part 2 of 2):
The i-th F1 unit compares xi with wi and outputs the product xi·wi.
The reset unit receives this information and also the components of x, weighted by p, the attention parameter, so that its own computation is

p(x1 + x2 + … + xn) - x·w > 0

which is the same as

x·w / (x1 + x2 + … + xn) < p
The reset unit fires only if the input lies outside the attention cone of the winning unit. A reset signal is sent to layer F2, but only the winning unit is inhibited.
This in turn activates the attention unit and a new round of computation begins. Hence, there is resonance.
Hybrid Neural Networks
The Structure of ART-1 (Some Final Details):
The weight vectors in layer F2 are initialized with all components equal to 1, and p is selected to satisfy 0 < p < 1. This ensures that eventually an unused vector will be recruited to represent a new cluster.
The selected weight vector w is updated by pulling it in the direction of x. This is done in ART-1 by turning off all components of w which are zero in x.
The purpose of the reset signal is to inhibit all units that do not resonate with the input. A unit in layer F2 which is still unused can then be selected for the new cluster containing x. In this way, sufficiently different input data can create a new cluster. By modifying the value of the attention parameter p, we can control the number of clusters and how wide they are.
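The search-and-resonance cycle described on these slides can be condensed into a minimal sketch. Function and variable names are illustrative; this compresses the F1/F2 signal exchange into a plain loop, so it is a caricature of ART-1's dynamics, not a faithful implementation:

```python
def art1_present(x, weights, p):
    """One presentation of a binary input x to a minimal ART-1 sketch.
    weights: list of binary weight vectors (fresh units are all ones);
    p: attention (vigilance) parameter, with 0 < p < 1.
    Returns the index of the resonating unit after updating its weights."""
    inhibited = set()
    norm_x = sum(x)  # x1 + x2 + ... + xn for a binary vector
    while True:
        # F2 winner-takes-all among units not yet hit by a reset signal
        candidates = [j for j in range(len(weights)) if j not in inhibited]
        winner = max(candidates,
                     key=lambda j: sum(xi * wi for xi, wi in zip(x, weights[j])))
        score = sum(xi * wi for xi, wi in zip(x, weights[winner]))
        if score / norm_x >= p:
            # Resonance: pull w toward x by turning off components
            # of w that are zero in x
            weights[winner] = [wi & xi for wi, xi in zip(weights[winner], x)]
            return winner
        inhibited.add(winner)  # reset signal: inhibit only the winner, search again

weights = [[1, 1, 1, 1], [1, 1, 1, 1]]   # fresh units, all components 1
print(art1_present([1, 0, 1, 0], weights, p=0.6))  # 0: first unused unit resonates
print(weights[0])                                  # [1, 0, 1, 0]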
Hybrid Neural Networks
The Structure of ART-2 and ART-3
ART-2 uses vectors that have real-valued components instead of Boolean components.
The dynamics of the ART-2 and ART-3 models are governed by differential equations.
However, computer simulations consume too much time.
Consequently, implementations using analog hardware or a combination of optical and electronic elements are more suited to this kind of model.
Hybrid Neural Networks
Maximum entropy
So what’s the problem with ART? It tries to build clusters of the same size, independently of the distribution of the data.
So, is there a better solution? Yes: allow the clusters to have varying radii with a technique called the “Maximum Entropy Method”.
What is “entropy”? The entropy H of a data set of N points assigned to k different clusters c1, c2, c3, …, ck is given by

H = -[ p(c1)log(p(c1)) + p(c2)log(p(c2)) + … + p(ck)log(p(ck)) ]

where p(ci) denotes the probability of hitting the i-th cluster when an element of the data set is picked at random.
Since the probabilities add up to 1, the clustering that maximizes the entropy is one for which all cluster probabilities are identical. This means that the clusters will tend to cover the same number of points.
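The entropy formula is easy to check numerically. A small sketch (names are my own) estimating p(ci) from cluster counts shows that balanced clusters maximize H:

```python
import math

def entropy(assignments):
    """Entropy H of a cluster assignment, where assignments[i] is the
    cluster label of the i-th data point. p(ci) is estimated as the
    fraction of points landing in cluster ci."""
    n = len(assignments)
    counts = {}
    for c in assignments:
        counts[c] = counts.get(c, 0) + 1
    # H = -sum_i p(ci) * log(p(ci))
    return -sum((k / n) * math.log(k / n) for k in counts.values())

# Equal-sized clusters maximize H: log(2) ≈ 0.693 for two balanced clusters
print(entropy([0, 0, 1, 1]))   # ≈ 0.693
print(entropy([0, 0, 0, 1]))   # smaller: unbalanced clusters
```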
Hybrid Neural Networks
Maximum entropy
However, there is still a problem whenever the number of elements of each class in the data set is different. Consider the case of unlabeled speech data: some phonemes are more frequent than others, and if a maximum entropy method is used, the boundaries between clusters will deviate from the natural solution and classify some data erroneously.
So how do we solve this problem? With the “Bootstrapped Iterative Algorithm”:
cluster: Compute a maximum entropy clustering with the training data. Label the original data according to this clustering.
select: Build a new training set by selecting from each class the same number of points (random selection with replacement). Go to the previous step.
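The “select” step above can be sketched as follows. The source does not fix the per-class sample size, so drawing the minority-class count from each class is my assumption, as are the function names:

```python
import random

def balanced_resample(points, labels, rng=None):
    """'select' step of the bootstrapped iterative algorithm: draw the
    same number of points from each labeled class, with replacement.
    The common count is taken (as an assumption) to be the size of the
    smallest class."""
    rng = rng or random.Random(0)
    by_class = {}
    for pt, lab in zip(points, labels):
        by_class.setdefault(lab, []).append(pt)
    m = min(len(v) for v in by_class.values())   # equal count per class
    sample = []
    for pts in by_class.values():
        sample.extend(rng.choice(pts) for _ in range(m))  # with replacement
    return sample

points = [0.1, 0.2, 0.3, 0.9, 1.0]   # three in class 'a', two in 'b'
labels = ['a', 'a', 'a', 'b', 'b']
print(len(balanced_resample(points, labels)))  # 4: two per class
```

Re-clustering on such a balanced sample keeps frequent classes (like common phonemes) from dragging the cluster boundaries away from the natural solution.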
Hybrid Neural Networks
Counterpropagation network
Are there any other hybrid network models? Yes, the counterpropagation network as proposed by Hecht-Nielsen.
So what are counterpropagation networks designed for? To approximate a continuous mapping f and its inverse f^-1.
A counterpropagation network consists of an n-dimensional input vector which is fed to a hidden layer consisting of h cluster vectors. The output is generated by a single linear associator unit. The weights of the output unit are adjusted using supervised learning.
The above network can successfully approximate functions of the form f: R^n -> R.
Hybrid Neural Networks
Fig. 3 Simplified counterpropagation network
Hybrid Neural Networks
Counterpropagation network
The training phase is completed in two parts:
Training of the hidden layer into a clustering of input space that corresponds to an n-dimensional Voronoi tiling. The hidden layer’s output needs to be controlled so that only the element with the highest activation fires.
The zi weights are then adjusted to represent the value of the approximation for the cluster region.
This network can be extended to handle multiple output units.
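After training, the forward pass of such a network is simple: find the winning cluster unit (the Voronoi cell containing the input) and emit its stored z weight. A minimal sketch, with names of my own choosing:

```python
def counterprop_forward(x, clusters, z):
    """Forward pass of a simplified counterpropagation network:
    the hidden layer is a winner-takes-all clustering of input space,
    and the linear associator outputs z[winner], i.e. the stored
    approximation of f over the winning Voronoi region."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    winner = min(range(len(clusters)), key=lambda j: dist2(x, clusters[j]))
    return z[winner]

# Two cluster centres tiling the input space, each with a stored z value
clusters = [[0.0, 0.0], [1.0, 1.0]]
z = [0.25, 0.75]
print(counterprop_forward([0.1, 0.2], clusters, z))  # 0.25
print(counterprop_forward([0.9, 0.8], clusters, z))  # 0.75
```

The output is piecewise constant over the Voronoi tiles, which is exactly the staircase approximation the spline-network slide improves on.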
Hybrid Neural Networks
Fig. 4 Function approximation with a counterpropagation network.
Hybrid Neural Networks
Spline networks
Can the approximation created by a counterpropagation network be improved on? Yes.
In the counterpropagation network the Voronoi tiling is composed of a series of horizontal tiles, each of which represents an average of the function in that region.
The spline network solves this problem by extending the hidden layer of the counterpropagation network. Each cluster unit is paired with a linear associator; the cluster unit is used to inhibit or activate the linear associator, which is connected to all inputs.
This modification allows the resulting set of tiles to be oriented differently with respect to each other, creating an approximation with a smaller quadratic error and a better solution to the problem.
Training proceeds as before, except that the newly added linear associators are also trained for their cluster regions.
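The pairing of cluster units with linear associators can be sketched as below (names and the per-tile slope/offset parameterization are my own illustration): the winning cluster unit activates its paired linear associator, so each Voronoi tile carries a linear piece of the approximation instead of a constant one.

```python
def spline_forward(x, clusters, A, b):
    """Spline-network sketch: the winning cluster unit activates its
    paired linear associator, which is connected to all inputs, so the
    output over the winning tile is A[winner] . x + b[winner]."""
    def dist2(a, c):
        return sum((ai - ci) ** 2 for ai, ci in zip(a, c))
    winner = min(range(len(clusters)), key=lambda j: dist2(x, clusters[j]))
    return sum(aj * xj for aj, xj in zip(A[winner], x)) + b[winner]

clusters = [[0.0], [1.0]]
A = [[1.0], [-1.0]]          # slope per tile: tiles can tilt independently
b = [0.0, 2.0]
print(spline_forward([0.2], clusters, A, b))  # 0.2  (tile 0: y = x)
print(spline_forward([0.9], clusters, A, b))  # ≈1.1 (tile 1: y = 2 - x)
```

Because each tile can tilt independently, the piecewise-linear surface tracks the target function more closely than the piecewise-constant counterpropagation output, giving the smaller quadratic error mentioned above.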
Hybrid Neural Networks
Fig. 5 Function approximation with linear associators
Hybrid Neural Networks
Radial basis functions
Radial basis function networks have a similar structure to that of the counterpropagation network. The difference is that the activation function used for each unit is Gaussian instead of sigmoidal.
The Gaussian approach uses locally concentrated functions.
The sigmoidal approach uses a smooth step function.
Which is better depends on the specific problem at hand. If the target function is a smooth step, then the Gaussian approach would require more units, whereas if the target function is Gaussian, then the sigmoidal approach will require more units.
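The two activation shapes contrasted above can be written out directly (the width parameter sigma and function names are my own illustration):

```python
import math

def gaussian(x, c, sigma=1.0):
    """Locally concentrated: large only near the centre c,
    falling off to zero in both directions."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

def sigmoid(x):
    """Smooth step: transitions once from 0 to 1 as x increases."""
    return 1.0 / (1.0 + math.exp(-x))

print(gaussian(0.0, 0.0))            # 1.0 at the centre
print(gaussian(5.0, 0.0))            # ~0 far from the centre: a bump
print(sigmoid(-5.0), sigmoid(5.0))   # ~0 and ~1: a step, not a bump
```

Building a step out of bumps (or a bump out of steps) requires superposing several units, which is why the mismatch between the activation shape and the target function costs extra units.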