A Matlab implementation of neural networks
Jeroen van Grondelle
July 1997
Contents
Preface

1 An introduction to neural networks

2 Associative memory
  2.1 What is associative memory?
  2.2 Implementing Associative Memory using Neural Networks
  2.3 Matlab functions implementing associative memory
      2.3.1 Storing information
      2.3.2 Recalling information

3 The perceptron model
  3.1 Simple perceptrons
  3.2 The XOR-problem
  3.3 Solving the XOR-problem using multi-layered perceptrons

4 Multi-layered networks
  4.1 Learning
  4.2 Training
  4.3 Generalizing

5 The Back-Propagation Network
  5.1 The idea of a BPN
  5.2 Updating the output-layer weights
  5.3 Updating the hidden-layer weights

6 A BPN algorithm
  6.1 Choice of the activation function
  6.2 Configuring the network
  6.3 An algorithm: train221.m

7 Application I: the XOR-gate
  7.1 Results and performance
  7.2 Some criteria for stopping training
      7.2.1 Train until SSE < a
      7.2.2 Finding an SSE-minimum
  7.3 Forgetting

8 Application II: Curve fitting
  8.1 A parabola
  8.2 The sine function
  8.3 Overtraining
  8.4 Some new criteria for stopping training
  8.5 Evaluating the curve fitting results

9 Application III: Time Series Forecasting
  9.1 Results

Conclusions

A Source of the used M-files
  A.1 Associative memory: assostore.m, assorecall.m
  A.2 An example session
  A.3 BPN: train221.m

Bibliography
Preface
Although conventional computers have been shown to be effective at a lot of demanding tasks, they still seem unable to perform certain tasks that our brains do so easily. These are tasks like for instance pattern recognition and various kinds of forecasting. That we do these tasks so easily has a lot to do with our learning capabilities. Conventional computers do not seem to learn very well.
In January 1997, the NRC Handelsblad, in its weekly science subsection, published a series of four columns on neural networks, a technique that overcomes some of the above-mentioned problems. These columns aroused my interest in neural networks, of which I knew practically nothing at the time. As I was just looking for a subject for a paper, I decided to find out more about neural networks.
In this paper, I will start with giving a brief introduction to the theory of neural networks. Section 2 discusses associative memory, which is a simple application of neural networks. It is a flexible way of information storage, allowing retrieval in an associative way.
In sections 3 to 5, general neural networks are discussed. Section 3 shows the behaviour of elementary nets, and in sections 4 and 5 this theory is extended to larger nets. The back-propagation rule is introduced and a general training algorithm is derived from this rule.
Sections 7 to 9 deal with three applications of the back-propagation network. Using this type of net, we solve the XOR-problem and we use this technique for curve fitting. Time series forecasting also deals with predicting function values, but is shown to be a more general technique than the introduced technique of curve fitting.
Using these applications, I demonstrate several interesting phenomena and criteria concerning implementing and training networks, such as stopping criteria, overtraining and forgetting.
Finally, I'd like to thank Rob Bisseling for his supervision during the process and Els Vermij for her numerous suggestions for improving this text.
Jeroen van Grondelle
Utrecht, July 1997
1 An introduction to neural networks
In this section a brief introduction is offered to the theory of neural networks. This theory is based on the actual physiology of the human brain and shows a great resemblance to the way our brains work.
The building blocks of neural networks are neurons. These neurons are nodes in the network and they have a state that acts as output to other neurons. This state depends on the input the neuron is given by other neurons.
Figure 1: A neuron (input, activation function with threshold, output)
A neural network is a set of connected neurons. The connections are called synapses. If two neurons are connected, one neuron takes the output of the other neuron as input, according to the direction of the connection.
Neurons are grouped in layers. Neurons in one layer only take input from the previous layer and give output to the next layer.[1]
Every synapse is associated with a weight. This weight indicates the impact of the output on the receiving neuron. The state of neuron i is defined as:

    s_i = f\left( \sum_k w_{ik} r_k - \theta \right)    (1)

where r_k are the states of the neurons that give input to neuron i and w_{ik} represents the weight associated with the connection. f(x) is the activation function. This function is often linear, or a sign-function when we require binary output. The sign-function is generally replaced by a continuous representation of this function. The value θ is called the threshold.
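The state rule above is easy to express in code. The implementations in this paper are Matlab M-files (see Appendix A); purely as an illustration, a single neuron could be sketched in Python like this, with a sign-function as the default activation:

```python
import math

def neuron_state(weights, inputs, theta, f=None):
    """s = f(sum_k w_k * r_k - theta); `f` defaults to a sign function."""
    if f is None:
        f = lambda h: math.copysign(1.0, h)
    h = sum(w * r for w, r in zip(weights, inputs)) - theta
    return f(h)

# With weights (1, 1) and threshold 1, this neuron outputs 1 only
# when both +/-1 inputs are 1:
print(neuron_state([1.0, 1.0], [1.0, 1.0], 1.0))   # 1.0
print(neuron_state([1.0, 1.0], [1.0, -1.0], 1.0))  # -1.0
```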
Figure 2: A single- and a multi-layered network (input, output, hidden layer)
[1] Networks with connections skipping layers are possible, but we will not discuss them in this paper.
A neural net is based on layers of neurons. Because the number of neurons is finite, there is always an input layer and an output layer, which only give output or take input respectively. All other layers are called hidden layers. A two-layered net is called a simple perceptron and the other nets multi-layered perceptrons. Examples are given in figure 2.
2 Associative memory

2.1 What is associative memory?

In general, a memory is a system that both stores information and allows us to recall this information. In computers, a memory will usually look like an array. An array consists of pairs (i, α), where α is the information we are storing, and i is the index assigned to it by the memory on storage. We can recall the information by giving the index as input to the memory.
Figure 3: Memory recall (the memory M maps an index to the stored information)
This is not a very flexible technique: we have to know exactly the right index to recall the stored information.
Associative memory works much more like our mind does. If we are for instance looking for someone's name, it will help to know where we met this person or what he looks like. With this information as input, our memory will usually come up with the right name. A memory is called an associative memory if it permits the recall of information based on partial knowledge of its contents.
2.2 Implementing Associative Memory using Neural Networks
Neural networks are very well suited to create an associative memory. Say we wish to store p bitwords[2] of length N. We want to recall in an associative way, so we want to give as input a bitword and want as output the stored bitword that most resembles the input.

So it seems the obvious thing to do is to take an N-neuron layer as both input and output layer and find a set of weights so that the system behaves like a memory for the bitwords ξ^1, ..., ξ^p.
Figure 4: An associative memory configuration (an N-neuron input layer connected to an N-neuron output layer)
If now a pattern s is given as input to the system, we want ξ^λ to be the output, so that s and ξ^λ differ in as few places as possible. So we want the error H_j,

    H_j = \sum_{i=1}^{N} (s_i - \xi_i^j)^2    (2)

to be minimal if j = λ. This H_j is called the Hamming distance.[3]

[2] For later convenience, we will work with binary numbers that consist of 1's and -1's, where -1 replaces the usual zero.
We will have a look at a simple case first. Say we want to store one pattern ξ. We will give an expression for w and check that it suits our purposes:

    w_{ij} = \frac{1}{N} \xi_i \xi_j    (3)
If we give an arbitrary pattern s as input, where s differs in n places from the stored pattern ξ, we get:

    S_i = \mathrm{sign}\left( \sum_{j=1}^{N} w_{ij} s_j \right) = \mathrm{sign}\left( \frac{1}{N} \xi_i \sum_{j=1}^{N} \xi_j s_j \right)    (4)
Now examine \sum_{j=1}^{N} \xi_j s_j. If s_j = ξ_j, then ξ_j s_j = 1; otherwise it is -1. Therefore, the sum equals (N - n) - n, and:

    \mathrm{sign}\left( \frac{1}{N} \xi_i \sum_{j=1}^{N} \xi_j s_j \right) = \mathrm{sign}\left( \frac{1}{N} \xi_i (N - 2n) \right) = \mathrm{sign}\left( \left( 1 - \frac{2n}{N} \right) \xi_i \right)    (5)
There are two important features to check. First, we can see that if we choose s = ξ, the output will be ξ. This is obvious, because ξ and ξ differ in 0 places. We call this stability of the stored pattern. Secondly, we want to check that if we give an input reasonably close to ξ, we get ξ as output. Obviously, if n < N/2, the output will equal ξ: then (1 - 2n/N) does not affect the sign of ξ_i. This is called convergence to a stored pattern.
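Both properties are easy to check numerically. The fragment below (Python, for illustration only; the paper's own code is Matlab, see Appendix A) stores a single pattern with the weights defined above and verifies stability and convergence for n < N/2:

```python
import math

N = 8
xi = [1, -1, 1, 1, -1, -1, 1, -1]   # the stored pattern
# Weights for one stored pattern: w_ij = xi_i * xi_j / N
w = [[xi[i] * xi[j] / N for j in range(N)] for i in range(N)]

def recall(s):
    """Output S_i = sign(sum_j w_ij * s_j)."""
    return [int(math.copysign(1, sum(w[i][j] * s[j] for j in range(N))))
            for i in range(N)]

print(recall(xi) == xi)        # stability: True
s = xi[:]
s[0], s[3] = -s[0], -s[3]      # flip n = 2 bits, so n < N/2
print(recall(s) == xi)         # convergence: True
```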
We now want to store all the words ξ^1, ..., ξ^p. And again we will give an expression and prove that it serves our purpose. Define

    w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu    (6)
The method will be roughly the same. We will not give a full proof here: this would be too complex and is not of great importance to our argument. What is important is that we are proving stability of stored patterns and the convergence to a stored pattern of other input patterns. We did this for the case of one stored pattern; the method for multiple stored patterns is similar. Only, proving the error terms to be small enough will take some advanced statistics. Therefore, we will prove up to the error terms here and then quote [Müller].
Because the problem is becoming a little more complex now, we will discuss the activation value for an arbitrary output neuron i, usually referred to as h_i. First we will look at the output when a stored pattern (say ξ^1) is given as input:

    h_i = \sum_{j=1}^{N} w_{ij} \xi_j^1 = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \sum_{j=1}^{N} \xi_j^\mu \xi_j^1 = \frac{1}{N} \left( \xi_i^1 \sum_{j=1}^{N} \xi_j^1 \xi_j^1 + \sum_{\mu \neq 1} \xi_i^\mu \sum_{j=1}^{N} \xi_j^\mu \xi_j^1 \right)    (7)
[3] Actually, this is the Hamming distance when bits are represented by 0's and 1's. The square then acts as absolute-value operator. So we should scale results by a constant factor 1/4 to obtain the Hamming distance.
The first part of the last expression is equal to ξ_i^1, due to similar arguments as in the previous one-pattern case. The second expression is dealt with using laws of statistics; see [Müller].
Now we give the system an input s where n neurons start out in the wrong state. Then generalizing the previous expression, in the same way as in the one-pattern case, gives:

    h_i = \left( 1 - \frac{2n}{N} \right) \xi_i^1 + \frac{1}{N} \sum_{\mu \neq 1} \xi_i^\mu \sum_{j=1}^{N} \xi_j^\mu s_j    (8)
The first term is equal to that of the single-pattern storage case. And the second is again proven small by [Müller]. Moreover, it is proven that

    h_i = \left( 1 - \frac{2n}{N} \right) \xi_i^1 + O\left( \sqrt{\frac{p-1}{N}} \right)    (9)
So if p ≪ N, the system will still function as a memory for the p patterns. In [Müller], it is proven that as long as p < 0.138N the system will function well.
2.3 Matlab functions implementing associative memory

In Appendix A.1, two Matlab functions are given for both storing and recalling information in an associative memory as described above. Here we will make some short remarks on how this is done.
2.3.1 Storing information
The assostore function works as follows. The function gets a binary matrix S as input, where the rows of S are the patterns to store. After determining its size, the program fills a matrix w with zeros. The values of S are transformed from (0,1) to (-1,1) notation. Now all values of w are computed using the storage rule for w derived in section 2.2; this formula is implemented using the inner product of two columns in S. The division by N is delayed until the end of the routine.
2.3.2 Recalling information
assorecall.m is also a straightforward implementation of the procedure described above. After transforming from (0,1) to (-1,1) notation, s is computed as w times the transposed input. The sign of this s is transformed back to (0,1) notation.
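In outline, the two routines do the following. The sketch below is a hypothetical Python rendering of the same steps, for illustration only; the real assostore.m and assorecall.m are listed in Appendix A.1:

```python
def assostore(S):
    """Build the weight matrix from the rows of the 0/1 pattern matrix S."""
    p, N = len(S), len(S[0])
    B = [[2 * b - 1 for b in row] for row in S]   # (0,1) -> (-1,1) notation
    w = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # inner product of columns i and j of the +/-1 matrix,
            # with the division by N delayed to the end
            w[i][j] = sum(B[mu][i] * B[mu][j] for mu in range(p)) / N
    return w

def assorecall(w, s):
    """Recall: the sign of w times the +/-1 version of s, mapped back to 0/1."""
    b = [2 * x - 1 for x in s]
    N = len(b)
    h = [sum(w[i][j] * b[j] for j in range(N)) for i in range(N)]
    return [1 if x > 0 else 0 for x in h]

patterns = [[1, 0, 1, 0, 1, 0, 1, 0], [1, 1, 1, 1, 0, 0, 0, 0]]
w = assostore(patterns)
# A probe one bit away from the first pattern recalls it exactly:
print(assorecall(w, [0, 0, 1, 0, 1, 0, 1, 0]) == patterns[0])  # True
```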
3 The perceptron model

3.1 Simple perceptrons

In the previous section, we have been looking at two-layered networks, which are also known as simple perceptrons. We did not really go into the details: an expression for w was given and we simply checked that it worked for us. In this section we will look closer at what these simple perceptrons really do.

Let us look at a 2-neuron input, 1-neuron output simple perceptron, as shown in figure 5.
Figure 5: A (2,1) simple perceptron
This net has only two synapses, with weights w_1 and w_2, and we assume S_1 has threshold θ. We allow inputs from the reals and take as activation function the sign-function. Then the output is given by:

    S_1 = \mathrm{sign}(w_1 s_1 + w_2 s_2 - \theta)    (10)

There is also another way of looking at S_1. The inner product of w and s actually defines the direction of a line in the input space; θ determines the location of this line, and taking the sign over this expression determines whether the input is on one side of the line or at the other side. This can be seen more easily if we rewrite (10) as:

    S_1 = \begin{cases} 1 & \text{if } w_1 s_1 + w_2 s_2 > \theta \\ -1 & \text{if } w_1 s_1 + w_2 s_2 < \theta \end{cases}    (11)
So (2,1) simple perceptrons just divide the input space in two and return 1 at one half and -1 at the other. We visualize this in figure 6.
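This dividing behaviour takes only a few lines to reproduce (Python, for illustration):

```python
def perceptron(w1, w2, theta, s1, s2):
    """A (2,1) simple perceptron: 1 on one side of the line
    w1*s1 + w2*s2 = theta, -1 on the other."""
    return 1 if w1 * s1 + w2 * s2 - theta > 0 else -1

# With w = (1, 1) and theta = 0 the separating line is the diagonal s2 = -s1:
print(perceptron(1, 1, 0, 2.0, 1.0))    # 1  (one side of the line)
print(perceptron(1, 1, 0, -2.0, 1.0))   # -1 (the other side)
```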
Figure 6: A simple perceptron dividing the input space
We can of course generalize this to (n,1) simple perceptrons, in which case the perceptron defines an (n-1)-dimensional hyperplane in the n-dimensional input space. The hyperplane view of simple perceptrons also allows looking at not too complex multi-layered nets. As we saw before, every neuron in the first hidden layer is an indicator of a hyperplane. But the next hidden layer again consists of indicators of hyperplanes, defined this time on the output of the first hidden layer. Multi-layered nets soon become far too complex to study in such a concrete way. In the literature we see that multi-layered nets are often regarded as black boxes: you know what goes in, you train until the output is right and you do not bother about the exact actions inside the box. But for relatively small nets, it can be very interesting to study the exact mechanism, as it can show whether or not a net is able to do the required job. This is exactly what we will do in the next subsection.
3.2 The XOR-problem

As we have seen, simple perceptrons are quite easy to understand and their behaviour is very well modelled: we can visualize their input-output relation through the hyperplane method. But simple perceptrons are very limited in the sort of problems they can solve. If we look for instance at logical operators, we can instantly see one of their limits. Although a simple perceptron is able to adopt the input-output relation of both the OR and AND operator, it is unable to do the same for the Exclusive-Or gate, the XOR-operator.
     s_1   s_2  |  S_1
     -1    -1   |  -1
     -1     1   |   1
      1    -1   |   1
      1     1   |  -1

Table 1: The truth table of the XOR-function
We examine first the AND-implementation on a simple perceptron. The input-output relation would be:
Figure 7: Input-output relation for the AND-gate
Here the input is on the axes, and a black dot means output 1 and a white dot means output -1. As we have seen in section 3.1, a simple perceptron will define a hyperplane, returning 1 at one side and -1 at the other. In figure 8, we choose a hyperplane for both the AND- and the OR-gate input space. We immediately see why a simple perceptron will never simulate an XOR-gate, as this would take two hyperplanes, which a simple perceptron can not define.
Figure 8: A hyperplane choice for all three gates (AND, OR and XOR)
It is now almost trivial to find the simple perceptron solution to the first two gates. Obviously, (w_1, w_2) = (1, 1) defines the direction of the chosen line. It follows that for the AND-gate θ = 1 works well. In the same way we compute values for the OR-gate: w_1 = 1, w_2 = 1 and θ = -1.
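These values are quickly verified against the truth tables; a small check (Python, for illustration):

```python
def gate(w1, w2, theta, s1, s2):
    # sign(w1*s1 + w2*s2 - theta), on +/-1 logic values
    return 1 if w1 * s1 + w2 * s2 - theta > 0 else -1

inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
# AND-gate: (w1, w2) = (1, 1), theta = 1
print([gate(1, 1, 1, a, b) for a, b in inputs])    # [-1, -1, -1, 1]
# OR-gate: (w1, w2) = (1, 1), theta = -1
print([gate(1, 1, -1, a, b) for a, b in inputs])   # [-1, 1, 1, 1]
```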
When neural nets were only just invented and these obvious limits were discovered, most scientists regarded neural nets as a dead end: if problems this simple could not be solved, neural nets were never going to be very useful. The answer to these limits was multi-layered nets.
3.3 Solving the XOR-problem using multi-layered perceptrons
Although the XOR-problem can not be solved by simple perceptrons, it is easy to show that it can be solved by a (2,2,1) perceptron. We could prove this by giving a set of suitable synapses and proving its functioning. We could also go deeper into the hyperplane method. Instead of these options, we will use some logical rules and express the XOR operator in terms of OR and AND operators, which we have seen we can handle. It can be easily shown that:

    (s_1 \text{ XOR } s_2) = (s_1 \wedge \neg s_2) \vee (\neg s_1 \wedge s_2)    (12)

We have neural net implementations of the OR and AND operator. Because we are using 1 and -1 as logical values, ¬s_1 is equal to -s_1. This makes it easy to put s_1 and ¬s_2 in a neural AND-gate: we just negate the synapse that carries s_2 and use s_2 as input instead of ¬s_2. This suggests the following (2,2,1)-solution. The input layer is used as usual and feeds the hidden layer, consisting of hs_1 and hs_2. These function as AND-gates as indicated in (12). S_1, the only element in the output layer, implements the OR-symbol in (12).
By writing down the truth table for the system, it can easily be shown that the given net is correct.
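One set of weights consistent with this construction can be checked directly. The fragment below (Python, for illustration) hard-codes two hidden AND-gates, each with one negated synapse, and an OR output gate:

```python
def sgn(x):
    return 1 if x > 0 else -1

def xor_net(s1, s2):
    # Hidden layer: AND-gates (theta = 1) with one negated input synapse
    h1 = sgn(s1 - s2 - 1)    # s1 AND (NOT s2)
    h2 = sgn(-s1 + s2 - 1)   # (NOT s1) AND s2
    # Output layer: an OR-gate (theta = -1) on the hidden states
    return sgn(h1 + h2 + 1)

for s1, s2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(s1, s2, xor_net(s1, s2))   # matches the XOR truth table
```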
Figure 9: A (2,2,1) solution of the XOR-gate
4 Multi-layered networks
In the previous section, we studied a very specific case of multi-layered networks. We could determine its synaptic strengths because it was a combination of several simple perceptrons, which we had studied quite thoroughly before, and because we could reduce the original problem to several subproblems that we had already solved using neural nets. In the preface, several tasks were mentioned, such as character recognition, time series forecasting, etc. These are all very demanding tasks, which need considerably larger nets. These tasks are also problems we do not understand so well, so we are not able to define subproblems which we could solve first. The strong feature of neural nets that we are going to use here is that, by training, the net will learn the input-output relation we are looking for. We are not concerned with the individual function of neurons; in this section we will consider the net as the earlier mentioned black box.
4.1 Learning

Let us discuss a concrete example here. A widely used application of neural nets is that of character recognition. The input of our black box could then be for instance a matrix of ones and zeros, representing a scan of a character. The output could consist of 26 neurons, representing the 26 characters of the alphabet.

Since we do not have a concrete solution in for instance hyperplane or logical terms to implement in a net, we choose more or less at random a net configuration and synaptic strengths for this net. Not all net configurations are able to learn all problems (we have seen a very obvious example of that before), but there are guidelines and rough estimations on how large a net has to be. We will not go into that right now.

Given our net, every scan given as input will result in an output. It is not very likely that this net will do what we want from the start, since we initiated it randomly. It all depends on finding the right values for the synaptic strengths. We need to train the net: we give it an input and compare the output with the result we wanted to get, and then we adjust the synaptic strengths. This is done by learning algorithms, of which the earlier mentioned Back-Propagation rule is an example. We will discuss the BPN-rule later. By repeating this procedure often with different examples, the net will learn to give the right output for a given input.
4.2 Training

We have mentioned the word training several times now. It refers to the situation where we show the system several inputs and provide the required output as well. The net is then adjusted; by doing this the net learns. The contents of the training set are of crucial importance. First of all, it has to be large enough: in order to get the system to generalize, a large set of examples has to be available. Probably, a network trained with a small set will behave like a memory, but a limited training set will never evoke the behaviour we are looking for: adopting an error-tolerant, generalizing input-output relation.

The set also has to be sufficiently rich. The notion we want the neural net to recognize has to be the only notion that is present everywhere in our training set. As this may sound a bit vague, an example might be necessary. If we have a set of pictures of blond men and dark women, we could teach a neural net to determine the sex of a person. But it might very well be that on showing this trained system a blond girl, the net would say it's a boy. There are obviously two notions in order here: someone's sex and the colour of his or her hair.
In the theory of neural nets, one comes across more of these rather vague problems. The non-deterministic nature of training means that trained systems can get overtrained and can even forget. We will not pay too much attention to these phenomena now; we will discuss them later, when we have practical examples to illustrate them.
4.3 Generalizing

There is an aspect of learning that we have not yet discussed. We defined training as adjusting a neural net to the right input-output relation, and this relation is then defined by the training set. This suggests that we train the network to give the right output at every input from the training set. If this is all that the system can achieve, it would be nothing more than a memory, which we discussed in section 2. We also want the system to give output on input that is not in the training set, and we want this output to be correct. By giving the system a training set, we want the system to learn about other inputs as well. Of course these will have to be close enough to the ones in the training set.

The right network configuration is crucial for the system to learn to generalize. If the network is too large, it will be able to memorize the training set; if it is too small, it simply will not be able to master the problem. So configuring a net is very important. There are basically two ways of achieving the right size. One is to begin with a rather big net; after some training, the non-active neurons and synapses are removed, thus leaving a smaller net, which can be further trained. This technique is called pruning. The other way is rather the opposite: start with a small net and enlarge it if it does not succeed in solving the problem. This guarantees you to get the smallest net that does the job, but you will have to train a whole new net every time you add some neurons.
5 The Back-Propagation Network

5.1 The idea of a BPN

In the previous section we mentioned a learning algorithm. This algorithm updated the synaptic strengths after comparing an expected output with an actual output; the algorithm should alter the weights to minimize the error next time. One of the algorithms developed is the Error Back-Propagation algorithm. This is the algorithm we will describe here and implement in the next section. We will discuss a specific case in detail: we will derive and implement this rule for a three-layered network.
Figure 10: The network configuration we will solve: N input neurons x_1, ..., x_N; a hidden layer of L neurons with states i_j = f^h(h^h_j); an output layer of M neurons with states o_k = f^o(h^o_k)
We want to minimize the error between expected output y and actual output o. From now on we will be looking at a fixed training-set pair: an input vector x and an expected output y. The actual output o is the output that the net gives for the input vector. We define the total error:

    E = \frac{1}{2} \sum_k \varepsilon_k^2    (13)

where ε_k is the difference between the expected and actual output of output neuron k: ε_k = (y_k - o_k).

Since all the information of the net is in its weights, we could look at E as a function of all its weights. We could regard the error to be a surface in W × R, where W is the weights space. This weights space has as dimension the number of synapses in the entire network. Every possible state of this network is represented by a point (w^h, w^o) in W.
Now we can look at the derivative of E with respect to W. This gives us the gradient of E, which always points in the direction of steepest ascent of the surface. So -grad(E) points in the direction of steepest descent. Adjusting the net to a point (w^h, w^o) in the direction of -grad(E) secures that the net will perform better next time. This procedure is visualized in figure 11.
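The descent idea itself is independent of neural nets and can be watched on a toy error surface. In the sketch below (Python, for illustration), the surface E(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2 is made up for this example; its minimum lies at (1, -2):

```python
def grad(w):
    # gradient of E(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]

w = [0.0, 0.0]
eta = 0.1
for _ in range(100):
    g = grad(w)
    # step in the direction of -grad(E)
    w = [w[0] - eta * g[0], w[1] - eta * g[1]]
print(w)   # close to [1.0, -2.0]
```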
Figure 11: The error as a function of the weights (the surface E over W-space, and the direction -grad(E))
5.2 Updating the output-layer weights

We will calculate the gradient of E in two parts and start with the output-layer weights:

    \frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k) \frac{\partial f^o_k(h^o_k)}{\partial h^o_k} \frac{\partial h^o_k}{\partial w^o_{kj}}    (14)

Because we have not yet chosen an activation function f, we can not yet evaluate \partial f^o_k / \partial h^o_k. We will refer to it as f^{o'}_k(h^o_k). What we do know is:

    \frac{\partial h^o_k}{\partial w^o_{kj}} = \frac{\partial}{\partial w^o_{kj}} \left( \sum_{l=1}^{L} w^o_{kl} i_l - \theta^o_k \right) = i_j    (15)

Combining the previous equations gives:

    \frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k) f^{o'}_k(h^o_k) \, i_j    (16)

Now we want to change w^o_{kj} in the direction of -\partial E / \partial w^o_{kj}. We define:

    \delta^o_k = (y_k - o_k) f^{o'}_k(h^o_k)    (17)

Then we can update w^o according to:

    w^o_{kj}(t+1) = w^o_{kj}(t) + \eta \, \delta^o_k i_j    (18)

where η is called the learning-rate parameter. η determines the learning speed: the extent to which w is adjusted in the gradient's direction. If it is too small, the system will learn very slowly. If it is too big, the algorithm will adjust w too strongly and the optimal situation will not be reached. The effects of different values of η are discussed further in section 7.
5.3 Updating the hidden-layer weights

To update the hidden-layer weights we will follow a procedure roughly the same as in section 5.2. In section 5.2 we looked at E as a function of the output-neuron values. Now we will look at E as a function of the hidden-neuron values i_j:

    E = \frac{1}{2} \sum_k (y_k - o_k)^2 = \frac{1}{2} \sum_k \left( y_k - f^o_k(h^o_k) \right)^2 = \frac{1}{2} \sum_k \left( y_k - f^o_k\left( \sum_j w^o_{kj} i_j - \theta^o_k \right) \right)^2

And now we examine \partial E / \partial w^h_{ji}:

    \frac{\partial E}{\partial w^h_{ji}} = \frac{1}{2} \sum_k \frac{\partial}{\partial w^h_{ji}} (y_k - o_k)^2 = -\sum_k (y_k - o_k) \frac{\partial o_k}{\partial h^o_k} \frac{\partial h^o_k}{\partial i_j} \frac{\partial i_j}{\partial h^h_j} \frac{\partial h^h_j}{\partial w^h_{ji}}

We can deal with these four derivatives the same way as in section 5.2. The first and the third are clearly equal to the unknown derivatives of f. The second is equal to:

    \frac{\partial h^o_k}{\partial i_j} = \frac{\partial}{\partial i_j} \left( \sum_{j=1}^{L} w^o_{kj} i_j - \theta^o_k \right) = w^o_{kj}    (19)

For the same reason, the last derivative is x_i. So we have:

    \frac{\partial E}{\partial w^h_{ji}} = -\sum_k (y_k - o_k) f^{o'}_k w^o_{kj} f^{h'}_j x_i    (20)

We define a δ^h similar to δ^o:

    \delta^h_j = f^{h'}_j(h^h_j) \sum_k (y_k - o_k) f^{o'}_k(h^o_k) w^o_{kj} = f^{h'}_j(h^h_j) \sum_k \delta^o_k w^o_{kj}    (21)

Looking at the definition of δ^h, we can see that updating w^h_{ji} in the direction of -\partial E / \partial w^h_{ji} is equal to:

    w^h_{ji}(t+1) = w^h_{ji}(t) + \eta \, \delta^h_j x_i    (22)

where η is again the learning parameter.
6 A BPN algorithm

In the next sections we will demonstrate a few phenomena as described in section 4, using an application of a (2,2,1) back-propagation network. We have seen this relatively simple network before in subsection 3.3. The XOR-gate described there will be the first problem we solve with an application of the BPN. In this section we will formulate a general (2,2,1)-BPN training algorithm.
6.1 Choice of the activation function

Since we will be simulating the XOR-gate, which has outputs -1 and 1 only, it is an obvious choice to use a sigmoidal activation function. We will use f(x) = tanh(x).
Figure 12: A sigmoidal activation function: f(x) = tanh(x), with derivative df/dx = 1 - tanh^2(x)
We will also need its derivative. Since tanh(x) = sinh(x)/cosh(x), we have:

    \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Differentiating this expression yields:

    \tanh'(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x)
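The identity can be confirmed numerically (Python, for illustration), by comparing against a central finite difference:

```python
import math

def f(x):        # the activation function
    return math.tanh(x)

def fprime(x):   # its derivative, 1 - tanh^2(x)
    return 1.0 - math.tanh(x) ** 2

eps = 1e-6
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
    print(abs(numeric - fprime(x)) < 1e-8)   # True at every point
```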
6.2 Configuring the network

We are going to use a three-layer net, with two input neurons, two hidden neurons and one output neuron. As we have already chosen the activation function, we now only have to decide how to implement the thresholds. In section 5 we paid no special attention to them. This was not necessary, since we will show here that they are easily treated as ordinary weights.
We add a special neuron to both the input and the hidden layer and we define the state of this neuron equal to 1. This neuron therefore takes no input from previous layers, since that would have no impact anyway. The weight of a synapse between this special neuron and a neuron in the next layer then acts as the threshold for that neuron. When the activation value for a neuron is computed, it now looks like:

    h_j = \sum_{i=1}^{k} w_{i,j} s_i + w_{k+1,j} \cdot 1 = \sum_{i=1}^{k+1} w_{i,j} s_i

Neuron k+1 is the special neuron that always has a state equal to 1.
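In code, the trick amounts to appending a constant 1 to the state vector (Python, for illustration); the weight attached to that entry plays the role of the threshold:

```python
def activation(weights, states):
    """h = sum_{i=1}^{k+1} w_i * s_i, where s_{k+1} is the always-1 neuron."""
    return sum(w * s for w, s in zip(weights, states + [1.0]))

# Regular weights (1, 1) plus a bias weight of -1 reproduce s1 + s2 - 1,
# i.e. an ordinary activation with threshold 1:
print(activation([1.0, 1.0, -1.0], [1.0, 1.0]))   # 1.0
print(activation([1.0, 1.0, -1.0], [1.0, -1.0]))  # -1.0
```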
In figure 13 we give an example of such a net.
Figure 13: A (2,2,1) neural net with weights as thresholds
This approach enables us to implement the network by using the techniques from section 5, without paying special attention to the thresholds.
6.3 An algorithm: train221.m

Given the above-mentioned choices and the explicit method described in section 5, we can now implement a training function for the given situation. Appendix A.3 gives the source of train221.m. This function is used as follows:

    [WH, WO, E] = train221(Wh, Wo, x, y, eta)

where the inputs Wh and Wo represent the current weights in the network, (x, y) is a training input-output pair and eta is the learning parameter. The outputs WH and WO are the updated weight matrices and E is the error, as computed before the update.
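As an outline of what such a function has to do, here is a hypothetical Python counterpart with the same signature (an illustration only; the authoritative source is train221.m in Appendix A.3, whose details may differ):

```python
import math

def train221(Wh, Wo, x, y, eta):
    """One back-propagation step for a (2,2,1) net with f = tanh.

    Wh: 2x3 hidden-layer weights (third column: weight of the always-1 neuron),
    Wo: three output-layer weights. Returns the updated weights and the error E,
    computed before the update."""
    xs = [x[0], x[1], 1.0]                                   # input plus bias neuron
    hh = [sum(Wh[j][i] * xs[i] for i in range(3)) for j in range(2)]
    hid = [math.tanh(h) for h in hh] + [1.0]                 # hidden states plus bias
    o = math.tanh(sum(Wo[i] * hid[i] for i in range(3)))
    E = (y - o) ** 2

    delta_o = (y - o) * (1.0 - o * o)                        # (y - o) * f'(h_o)
    WO = [Wo[i] + eta * delta_o * hid[i] for i in range(3)]
    WH = [row[:] for row in Wh]
    for j in range(2):
        delta_h = (1.0 - hid[j] ** 2) * delta_o * Wo[j]      # f'(h_h) * delta_o * w_o
        for i in range(3):
            WH[j][i] += eta * delta_h * xs[i]
    return WH, WO, E

# A single step on a training pair reduces the error on that pair (small eta):
Wh = [[0.3, -0.2, 0.1], [0.2, 0.4, -0.3]]
Wo = [0.25, -0.15, 0.05]
WH, WO, E0 = train221(Wh, Wo, (1, -1), 1, 0.05)
_, _, E1 = train221(WH, WO, (1, -1), 1, 0.05)
print(E1 < E0)   # True
```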
7 Application I: the XOR-gate

7.1 Results and performance
We will now use the algorithm to solve the XOR-gate problem. First, we define our training set:

    S_T = \{ ((1,1), -1),\ ((1,-1), 1),\ ((-1,1), 1),\ ((-1,-1), -1) \}

The elements of this set are given as input to the training algorithm introduced in the previous section. This is done by a special M-file, which also stores the error terms. These error terms enable us to analyse the training behaviour of the net. In the rest of the section, we will describe several phenomena, using the information the error terms give us.
When looking at the performance of the net, we can look, for instance, at the error of the net on an input-output pair (x_i, y_i) of the training set:

    E_i = (y_i - o_i)^2

with y_i as the expected output and o_i as the output of the net with x_i as input. A measure of performance on the entire training set is the Sum of Squared Errors (SSE):

    SSE = \sum_i E_i

Clearly, the SSE is an upper bound for every E_i. We will use this frequently when examining the net's performance: if we want the error on every training-set element to converge to zero, we just compute the SSE and check that it does this.
Now we will have a first look at the results of training the net on S_T. Figure 14 shows some of the results.

Figure 14: Some training results: the number of iterations with the errors E_1, ..., E_4 and the SSE, for two training sessions
As we see in figure 14, both training sessions are successful, as the SSE becomes very small. We see that with larger η, the SSE converges to zero faster. This suggests taking large values for η. To see if this strategy would be successful, we repeat the experiment with various values of η.

In figure 15, the SSE is plotted versus the number of training laps for various η. We can see that, for η = 0.1, the SSE converges to zero. For η = 0.2, the SSE converges faster, but less smoothly: after a number of training laps, the SSE has a little peak. Taking larger η, as suggested above, does not seem very profitable. When η is 0.4, the SSE has
��
eta = .1
eta = .2
eta = .4
eta = .6
eta = .8
0 50 100 150 200 2500
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Figure �� SSE vs� number of training laps� for various
strong oscillations and with � ��� SSE does not even converge to zero�
This non-convergence for large eta can be explained by the error-surface view used earlier. We regard the error as a function of all the weights. This leads to an error surface on the weight space. We used the gradient of E in this space to minimize the error; eta expresses the extent to which we change the weights in the direction opposite to the gradient. In this way we hope to find a minimum in this space. If eta is too large, we can jump over this minimum, approach it from the other side and jump over it again. Thus, we will never reach it and the error will not converge.
The conclusion seems to be that the choice of eta is important. If it is too small, the network will learn very slowly. A larger eta leads to faster learning, but the network might not reach the optimal solution.
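The jumping-over-the-minimum effect is easy to reproduce in one dimension. A Python sketch minimizing the toy error surface E(w) = w^2, whose gradient is 2w (this is an illustration of the mechanism, not the paper's network):

```python
def descend(eta, w=1.0, steps=50):
    """Gradient descent on E(w) = w^2: each lap moves w against the
    gradient 2w by a step of size eta."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(abs(descend(0.1)))  # small eta: w creeps towards the minimum at 0
print(abs(descend(1.1)))  # too-large eta: w jumps over 0, further out each time
```

With eta below 1 the iterates contract towards the minimum; above 1 every step overshoots by more than the previous distance, so the error grows without bound.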
We have now trained a network to give the right outputs on inputs from the training set. And in this specific case, these are the only inputs we are interested in. But the net does give outputs on other inputs. The figure below shows the output on inputs in the square between the training-set inputs. The graph clearly shows the XOR-gate's outputs on the four corners of the surface.
In this case, we were only interested in the training-set elements. What the net does by representing these four states correctly is actually only remembering by learning. Later, we will look at cases where we are interested in the outputs on inputs outside the training set. Then we are investigating the generalizing capabilities of neural networks.
[Figure: The output of the XOR-net]
7.2 Some criteria for stopping training
When using neural networks in applications, we will in general not be interested in all the SSE curves etc. In these cases, training is just a way to get a well-performing network, which, after stopping training(1), can be used for the required purposes. There are several criteria to stop training.
7.2.1 Train until SSE < a
A very obvious method is to choose a value a and stop training as soon as the SSE gets below this value. In the examples of possible SSE-curves we have seen so far, the SSE, for suitable eta, converges more or less monotonically to zero. So it is bound to decrease below any value required.
Choosing this value depends on the accuracy you demand. As we saw before, the SSE is an upper bound for the Ei, which was the square of y - o. So if we tolerate a difference of c between the expected output and the net's output, we want:
Ei < c^2 for all i
Since the SSE is an upper bound, we could use SSE < c^2 as a stopping criterion.
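This criterion is a simple loop. A Python sketch (the `train_lap` function, performing one training lap and returning the SSE, is a hypothetical stand-in; a cap on the number of laps guards against the non-stopping case discussed below):

```python
def train_until(train_lap, c, max_laps=10000):
    """Train until SSE < c^2 (so every |y_i - o_i| < c); cap the number
    of laps, since some situations never reach the required accuracy."""
    target = c ** 2
    sse = float("inf")
    for lap in range(1, max_laps + 1):
        sse = train_lap()          # one training lap, returning the SSE
        if sse < target:
            return lap, sse
    return max_laps, sse

# Toy stand-in for training: the SSE halves on every lap, starting at 2.0.
state = {"sse": 2.0}
def fake_lap():
    state["sse"] *= 0.5
    return state["sse"]

print(train_until(fake_lap, c=0.1))  # stops as soon as SSE < 0.01
```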
The advantage of this criterion is that you know a lot about the performance of the net if training is stopped by it(1). A disadvantage is that training might not stop in some situations: some situations are too complex for a net to reach the given accuracy.

(1) Unless the input-output relation is changing through time and we will have to continue training in the new situations.

[Figure: Stopping training when the SSE drops below the chosen value a]
7.2.2 Finding an SSE-minimum
The disadvantage of the previous method suggests another method: if the SSE does not converge to zero, we want to at least stop training at its minimum. We might train a net for a very long time, plot the SSE and look for its global minimum. Then we retrain the net under the same circumstances and stop at this optimal point. This is not realistic, however, since training in complex situations can take a considerable amount of time and complete retraining would take too long.
Another approach is to stop training as soon as the SSE starts growing. For small eta, this might work, since we noticed before that choosing a small eta leads to very smooth and monotonic SSE-curves. But there is still a big risk of ending up in a local SSE-minimum: training would stop just before a little peak in the SSE-curve, although training on would soon lead to even better results.
The advantages are obvious: given a complex situation with a non-convergent SSE, we still reach a relatively optimal situation. The disadvantages are obvious too: this method might very well lead to suboptimal stopping. We can limit this risk by choosing eta small and perhaps combining the two techniques: train the network through the first phase with the first criterion and then find a minimum in the second, smoother phase with the second criterion.
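The risk of stopping just before a little peak can be softened by allowing the SSE a few laps without improvement before giving up. A Python sketch of this minimum-finding criterion (`train_lap` is again a hypothetical stand-in):

```python
def train_to_minimum(train_lap, patience=0, max_laps=10000):
    """Stop when the SSE has not improved for more than `patience`
    consecutive laps; patience=0 stops at the first increase, a larger
    patience survives the little peaks in the SSE-curve."""
    best, bad = float("inf"), 0
    for _ in range(max_laps):
        sse = train_lap()
        if sse < best:
            best, bad = sse, 0
        else:
            bad += 1
            if bad > patience:
                break
    return best

# Toy SSE-curve with a little local peak, then a further descent:
curve = iter([1.0, 0.6, 0.4, 0.5, 0.3, 0.2, 0.25, 0.3, 0.35])
print(train_to_minimum(lambda: next(curve), patience=1))  # survives the peak
```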
7.3 Forgetting
Forgetting is another phenomenon we will demonstrate here. So far our training has consisted of subsequently showing the net all the training-set elements an equal number of times. We will show that this is very important.
The figure below shows the error during training on all the individual training-set elements.
[Figure: Finding an SSE-minimum]
It is clear that these functions Ei do not converge monotonically. While the error on some elements decreases, the error on others increases. This suggests that training the net on element a might negatively influence the performance of the net on another element b.
This is the basis for the process of forgetting: if we stop training an element, training the other elements influences the performance on this element negatively and causes the net to forget the element.
In the figure below we see the results of the following experiment. We start training the net on one element only. We can see that the performance on two of the other elements becomes worse. Surprisingly, the performance on the remaining element improves along with the trained element. After a number of rounds of training, we stop training the first element and start training the other three elements. Clearly, the error on the first element increases dramatically and the net ends up performing well on the other three. The net forgot the first element.
[Figure: Training per element; the curves E1, E2, E3, E4 and the SSE]
[Figure: The net forgets the element whose training is stopped; the curves E1, E2, E3, E4 and the SSE]
8 Application II: Curve fitting
In this section we will look at another application of three-layered networks: we will try to use a network to represent a function f: R -> R. We use a network with one input and one output neuron, and we take five sigmoidal hidden neurons. The output neuron will have a linear activation function, because we want it to have outputs not just in the [-1, 1] interval. The rest of the network is similar to that used in the previous section. Also, the training algorithm is analogous and therefore not printed in the appendices. The matter of choosing eta was discussed in the previous section and we will let it rest now; for the rest of the section, we will use eta = 0.1, which turns out to give just as smooth and convergent training results as in the previous section.
[Figure: A 1-5-1 neural network, with input x and output f(x)]
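A 1-5-1 net of this kind, with its back-propagation step, can be sketched in a few lines. This is a Python illustration under the stated assumptions (five tanh hidden neurons, a linear output neuron, per-pattern updates), not the author's MATLAB code; as a smoke test it is trained on a handful of parabola points:

```python
import math
import random

# A 1-5-1 net: five tanh hidden neurons and a linear output neuron, so the
# output is not confined to the [-1, 1] interval.
random.seed(0)
H = 5
wh = [random.uniform(-1, 1) for _ in range(H)]  # hidden weights
bh = [random.uniform(-1, 1) for _ in range(H)]  # hidden biases
wo = [random.uniform(-1, 1) for _ in range(H)]  # output weights
bo = 0.0                                        # output bias

def forward(x):
    """Hidden outputs i_k = tanh(wh_k x + bh_k); the output is linear in them."""
    i = [math.tanh(wh[k] * x + bh[k]) for k in range(H)]
    return sum(wo[k] * i[k] for k in range(H)) + bo, i

def train_pair(x, y, eta=0.1):
    """One back-propagation step; with a linear output neuron the output
    delta is simply the error y - o."""
    global bo
    o, i = forward(x)
    delta_o = y - o
    for k in range(H):
        delta_h = (1 - i[k] ** 2) * delta_o * wo[k]  # tanh' = 1 - tanh^2
        wo[k] += eta * delta_o * i[k]
        wh[k] += eta * delta_h * x
        bh[k] += eta * delta_h
    bo += eta * delta_o

# Smoke test: five training pairs on the parabola y = x^2 in [0, 1].
train_set = [(x / 4, (x / 4) ** 2) for x in range(5)]
sse = lambda: sum((y - forward(x)[0]) ** 2 for x, y in train_set)
before = sse()
for _ in range(500):
    for x, y in train_set:
        train_pair(x, y)
print(before, sse())  # SSE before and after 500 training laps
```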
8.1 A parabola
We will try to fit the parabola y = x^2 and train the network with several inputs from the [0, 1] interval. The training set we use is:
ST = {(x1, x1^2), ..., (x5, x5^2)} for five values xi in the [0, 1] interval
Training the network shows that the SSE converges to zero smoothly. In this section we will focus less on the SSE and more on the behaviour of the trained network. In the previous section, we wanted the network to perform well on the training set; in this section we want the network to give accurate predictions of the value of x^2, with x the input value, and not just on the five training pairs. So we will not show the SSE graph here; instead we will plot the network's prediction of the parabola.
As we can see, the network predicts the function really well: after enough training runs we have a fairly accurate prediction of the parabola. It is interesting whether the network also has any knowledge of what happens outside the [0, 1] interval, i.e. whether it can predict the value outside that interval. The figures below show that the network fails to do this: outside its training set, its performance is bad.
[Figures: The network's prediction of the parabola vs. the actual value, at two stages of training]
[Figure: The network does not extrapolate]
8.2 The sine function
In this subsection we will repeat the experiment from the previous subsection for the sine function. Our training set is:
ST = {(x1, sin x1), ..., (x5, sin x5)} for five values xi in the [0, 3] interval

These are the results of training a net on ST:
[Figures: The network's prediction of the sine vs. the actual value, after a small and after a large number of training runs]
Obviously, this problem is a lot harder for the network to solve. After a small number of runs the performance is not good yet, and even after many more runs there is a noticeable difference between the prediction and the actual value of the sine function.
8.3 Overtraining
An interesting phenomenon is that of overtraining. So far, the only measure of performance has been the SSE on the training set, on which the two suggested stopping criteria were based. In this section, we abandon the SSE-approach, because we are interested in the performance on sets larger than just the training set. SSE-stopping criteria combined with this new objective of performance on larger sets can lead to overtraining. We give an example. We trained two networks on:
ST = {(x1, x1^2), ..., (x5, x5^2)}, with four values xi in the [0, 1] interval and one well outside it

Here are the training results:
[Figure: Network A predicting the parabola]
[Figure: Network B predicting the parabola]
The question is which of the above networks functions best. With the SSE on ST in mind, the answer is obvious: network B has a very small SSE on the training set. But we mentioned before that we wanted the network to perform on a wider set. So maybe we should prefer network A after all.
In fact, network B is just a longer-trained version of network A. We call network B overtrained. Using the discussed methods of stopping training can lead to situations like this, so these criteria might not be satisfactory.
8.4 Some new criteria for stopping training
We are looking for a criterion to stop training which avoids the illustrated problems. But the SSE is the only measure of performance we have so far. We will therefore use a combination of the two.
As we are interested in the performance of the net on a wider set than just ST, we introduce a reference set SR with input-output elements that are not in ST but represent the area on which we want the network to perform well. Now we define the performance of the net as the SSE on SR. When we start training a net with ST, the SSE on SR is likely to decrease, due to the generalizing capabilities of neural networks. As soon as the network becomes overtrained, the SSE on SR increases.
Now we can use the stopping criteria from subsection 7.2 with the SSE on SR.
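A Python sketch of this reference-set criterion (nowadays usually called early stopping with a validation set; `train_lap` and `sse_on_ref` are hypothetical stand-ins for one lap of training on ST and for computing the SSE on SR):

```python
def train_with_reference(train_lap, sse_on_ref, patience=2, max_laps=10000):
    """Train on ST, but judge performance by the SSE on the reference set
    SR; stop once that SSE has not improved for more than `patience` laps
    and report the best lap, i.e. roughly the minimum of the SR-curve."""
    best, best_lap, bad = float("inf"), 0, 0
    for lap in range(1, max_laps + 1):
        train_lap()               # one lap of training on ST
        ref = sse_on_ref()        # performance measured on SR, not on ST
        if ref < best:
            best, best_lap, bad = ref, lap, 0
        else:
            bad += 1
            if bad > patience:
                break
    return best_lap, best

# Toy SR-SSE curve: it decreases, then rises again as the net overtrains.
ref_curve = iter([5.0, 3.0, 2.0, 1.5, 1.7, 2.0, 2.6, 3.3, 4.1])
print(train_with_reference(lambda: None, lambda: next(ref_curve)))
```

In practice one would also save the weights at the best lap, so that the network can be restored to its state at the minimum of the SR-curve.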
We illustrate this technique in the case of the previous subsection. We define:
SR = {(x1', x1'^2), ..., (xm', xm'^2)}, with values xi' spread over the area on which we want the network to perform

and we calculate the SSE on both ST and SR:
[Figure: The SSE on the training set and the reference set]
Using the old stopping criteria would obviously lead to network B. A stopping criterion that terminates training somewhere close to the minimum of the SSE on SR would lead to network A.
In this case, the overtraining is caused by a bad training set ST: it contains all training pairs on the [0, 1] interval and one quite far from that interval. Training the net on SR would have given a much better result.
What we wanted to show, however, is what happens if we keep training too long on a too limited training set: the net indeed memorizes the entries of the training set, but its performance on the neighbourhood of this training set gets worse after longer training.
8.5 Evaluating the curve fitting results
In the last few sections, we have not been interested in the individual neurons. Instead, we just looked at the entire network and its performance. We did this because we wanted the network to solve the problem. The strong feature of neural networks is that we do not have to divide the problem into subproblems for the individual neurons.
It can be interesting, though, to look back. We will now analyze the role of every neuron in the two trained curve-fitting networks.
We start with the hidden neurons. Their output is the tanh over their activation value:
i_k = tanh(w^h_k x + theta^h_k)
The output neuron takes a linear combination of these tanh-curves:
o = Σ (l=1..5) w^o_l i_l + theta^o = Σ (l=1..5) w^o_l tanh(w^h_l x + theta^h_l) + theta^o
So the network is trying to fit a sum of tanh-curves to the required curve as accurately as possible. We can plot these curves for both the fitted parabola and the sine.
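This decomposition is easy to compute from the trained weights. A Python sketch, where the weight values below are made-up illustration values, not taken from the nets trained in this paper:

```python
import math

# Hypothetical weights of a trained 1-5-1 net (illustration values only).
wh = [4.0, 0.1, -0.2, 0.05, 0.0]   # hidden weights
bh = [-2.0, 0.3, 0.1, -0.4, 0.2]   # hidden biases
wo = [1.5, 0.2, 0.1, -0.3, 0.4]    # output weights
bo = 0.6                           # output bias

def component(k, x):
    """The k-th tanh-curve as seen by the output: wo_k * tanh(wh_k x + bh_k)."""
    return wo[k] * math.tanh(wh[k] * x + bh[k])

def output(x):
    """The net's output: the sum of the component curves plus the bias."""
    return sum(component(k, x) for k in range(5)) + bo

def swing(k):
    """How much the k-th curve varies over the [0, 1] interval."""
    return abs(component(k, 1.0) - component(k, 0.0))

print([round(swing(k), 3) for k in range(5)])  # one large swing, four near-flat curves
```

Plotting `component(k, x)` over the input interval gives exactly the per-neuron curves discussed below; a near-zero swing marks a neuron whose role the output bias could take over.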
In this case, only one neuron has a non-trivial output; the other four are more or less constant, a role theta^o could have fulfilled easily. This leads us to assume that the parabola could have been fitted by a 1-1-1 network.
The sine function is more complex: fitting a non-monotonic function with monotonic ones obviously takes more functions. One neuron has a strongly increasing function as output. Because of the symmetry of the sine, we would expect another neuron to have an equally decreasing output function. It appears that this task has been divided between two other neurons: they both have a decreasing output function and they would probably add up to the symmetric function we expected. The two remaining neurons have a more or less constant value.
For the same reasons as we mentioned with the parabola, we might expect that this problem could have been solved by a 1-3-1 or even a 1-2-1 network.
[Figure: The tanh-curves used to fit the parabola]
[Figure: The tanh-curves used to fit the sine]

Analyzing the output of the neurons after training can give a good idea of the minimum size of the network required to solve the problem. And we saw earlier that over-dimensioned networks can lose their generalizing capabilities fast. Analyzing the neurons could lead to removing neurons from the network and improving its generalizing capabilities.
There is another interesting evaluation method: we could replace the hidden-neuron outputs with their Taylor polynomials. This would lead to a polynomial as output function. The question is whether this polynomial would be identical to the Taylor polynomial of the required output function. Since the functions coincide on an interval, the polynomials would probably be identical in their first few coefficients.
This could lead to a theory on how big a network needs to be in order to fit a function with a given Taylor polynomial. But this would take further research.
9 Application III: Time Series Forecasting
In the previous section, we trained a neural network to adopt the input-output relation of two familiar functions. We used training pairs (x, f(x)). And although performance was acceptable after a small number of training runs, this application had one shortcoming: it did not extrapolate at all. Neural networks will in general perform weakly outside their training set, but a smart input-output choice can overcome these limits.
In this section, we will look at time series. A time series is a vector (y_t), with fixed distances between subsequent t_i. Examples of time series are daily stock prices, rainfall in the last twenty years, and actually every quantity measured over discrete time intervals.
Predicting a future value of y, say y_t, is now done based on, for instance, y_{t-1}, ..., y_{t-n}, but not on t. In this application we will take n = 2 and try to train a network to give valuable predictions of y_t.
[Figure: The network used for TSF, with inputs y_{t-1} and y_{t-2} and output y_t]
We take a network similar to the network we used in the previous section, only we now have two input neurons. The hidden neurons still have sigmoidal activation functions and the output neuron has a linear activation function.
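Building training pairs for such a network amounts to sliding a window of length n = 2 over the series. A Python sketch (names hypothetical):

```python
def windows(series, n=2):
    """Turn a time series [y0, y1, ...] into training pairs
    ((y_{t-n}, ..., y_{t-1}), y_t) by sliding a window over it."""
    return [(tuple(series[t - n:t]), series[t]) for t in range(n, len(series))]

series = [0.0, 0.1, 0.2, 0.3, 0.4]
print(windows(series))
# [((0.0, 0.1), 0.2), ((0.1, 0.2), 0.3), ((0.2, 0.3), 0.4)]
```

Each pair feeds the two previous values into the network and expects the next value as output; nothing in the pair refers to t itself.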
9.1 Results
Of course we can look at any function f(x) as a time series: we associate with every entry t_i of a vector t the value f(t_i). We will first try to train the network on the sine function again.
We take t = {0, 0.1, 0.2, ...} and y_t = sin(t). Training this network enables us to predict the sine at time t given the sines at the two previous time steps. But we could also predict the sine at t based on the sines at the second- and third-to-last steps: these two values give us a prediction of the sine at the previous step, and thus we can predict sin(t). Of course, basing a prediction on a prediction is less accurate than a prediction based on two actual sine values. The results of the network are plotted in the figure below.
[Figure: The network's performance after training; predictions 1, 2 and 3 deep vs. the actual value]

Because we trained the net to predict based on previous behaviour, this network will extrapolate, since the sine curve's behaviour is periodic.
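Predicting "deeper" than one step means feeding predictions back in as inputs. A Python sketch (`net` is a hypothetical stand-in taking y_{t-2} and y_{t-1}):

```python
def predict_deep(net, y1, y2, depth):
    """Predict `depth` steps ahead starting from two known values; every
    step past the first feeds a prediction back in as an input."""
    for _ in range(depth):
        y1, y2 = y2, net(y1, y2)
    return y2

# Toy stand-in net that extrapolates linearly: y_t = 2*y_{t-1} - y_{t-2}.
lin = lambda a, b: 2 * b - a
print(predict_deep(lin, 0.0, 1.0, 1))  # 2.0: one step, from two actual values
print(predict_deep(lin, 0.0, 1.0, 3))  # 4.0: deeper steps reuse predictions
```

With a trained net in place of the toy stand-in, each extra level of depth compounds the one-step prediction error, which is why the 2- and 3-deep curves are less accurate.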
[Figure: This network does extrapolate]
Conclusions
In this paper we introduced a technique that in theory should lead to good training behaviour for three-layered neural networks. In order to achieve this behaviour, a number of important choices have to be made:
1. the choice of the learning parameter eta
2. the choice of the training set ST
3. the configuration of the network
4. the choice of a stopping criterion
In application I, we focussed on measuring the SSE and saw that its behaviour was strongly dependent on the choice of eta. A small eta leads to smooth and convergent SSE-curves and therefore to satisfying training results. In our example, eta = 0.1 was small enough, but the maximum value of eta may vary. If an SSE-curve is not convergent or is not smooth, one should always try a smaller eta.
Also, choosing ST is crucial. In application II we saw that with a non-representative training set, a trained network will not generalize well. And if you are not only interested in performance on ST, just getting the SSE small is not enough. The reference-set-SSE method is a good way to reach a compromise: acceptable performance on ST combined with a reasonable performance on its neighbourhood.
Neural networks seem to be a useful technique to learn the relation between data sets in cases where we have no knowledge of what the characteristics of the relation will be. The parameters determining the network's success are not always clear, but there are enough techniques to make these choices.
A Source of the used M-files
A.1 Associative memory: assostore.m, assorecall.m
function w = assostore(S)
% ASSOSTORE(S) has as output the synaptic strength
% matrix w for the associative memory with contents
% the rowvectors of S.
[p,N] = size(S);
w = zeros(N);
S = S*2 - 1;                 % map {0,1} patterns to {-1,1}
for i = 1:N
  for j = 1:N
    w(i,j) = S(1:p,i)' * S(1:p,j);
  end
end
w = w/N;
function s = assorecall(sigma,w)
% ASSORECALL(sigma,w) returns the closest contents of
% memory w, stored by ASSOSTORE.
[N,N] = size(w);
s = zeros(1,N);
sigma = sigma*2 - 1;         % map the {0,1} input to {-1,1}
s = w*sigma';
s = sign(s);
s = (s'+1)/2;                % map the result back to {0,1}
A.2 An example session
M A T L A B
(c) Copyright The MathWorks, Inc.
All Rights Reserved

>> S = [...];                 % two stored patterns, as row vectors of 0s and 1s
S =
   [the two stored patterns]
>> w = assostore(S);
>> assorecall([...], w)       % recall from a corrupted input vector
ans =
   [the stored pattern closest to the input]
>> assorecall([...], w)
ans =
   [the stored pattern closest to the input]
A.3 BPN: train221.m
function [Wh,Wo,E] = train221(Wh,Wo, x, y, eta)
% train221 trains a 2-2-1 neural net with sigmoidal
% activation functions. It updates the weights Wh
% and Wo for input x and expected output y; eta is
% the learning parameter.
% Returns the updated matrices and the error E
%
% Usage: [Wh,Wo,E] = train221(Wh,Wo, x, y, eta)

%% Computing the network's output %%
hi = Wh*[x;1];
i = tanh(hi);
ho = Wo*[i;1];
o = tanh(ho);
E = y - o;

%% Back Propagation %%
% Computing deltas
deltao = (1 - o^2)*E;
deltah1 = (1 - i(1)^2)*deltao*Wo(1);
deltah2 = (1 - i(2)^2)*deltao*Wo(2);

% Updating output-layer weights
Wo(1) = Wo(1) + eta*deltao*i(1);
Wo(2) = Wo(2) + eta*deltao*i(2);
Wo(3) = Wo(3) + eta*deltao;

% Updating hidden-layer weights
Wh(1,1) = Wh(1,1) + eta*deltah1*x(1);
Wh(1,2) = Wh(1,2) + eta*deltah1*x(2);
Wh(1,3) = Wh(1,3) + eta*deltah1;
Wh(2,1) = Wh(2,1) + eta*deltah2*x(1);
Wh(2,2) = Wh(2,2) + eta*deltah2*x(2);
Wh(2,3) = Wh(2,3) + eta*deltah2;
Bibliography

[Freeman] James A. Freeman and David M. Skapura, Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley, 1991.

[Müller] B. Müller, J. Reinhardt, M.T. Strickland, Neural Networks: An Introduction. Berlin: Springer Verlag, 1995.

[Nørgaard] Magnus Nørgaard, The NNSYSID Toolbox. http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html