A Matlab implementation of neural networks
Jeroen van Grondelle
July 1997
Contents
Preface

1 An introduction to neural networks

2 Associative memory
  2.1 What is associative memory?
  2.2 Implementing Associative Memory using Neural Networks
  2.3 Matlab functions implementing associative memory
      2.3.1 Storing information
      2.3.2 Recalling information

3 The perceptron model
  3.1 Simple perceptrons
  3.2 The XOR-problem
  3.3 Solving the XOR-problem using multi-layered perceptrons

4 Multi-layered networks
  4.1 Learning
  4.2 Training
  4.3 Generalizing

5 The Back-Propagation Network
  5.1 The idea of a BPN
  5.2 Updating the output-layer weights
  5.3 Updating the hidden-layer weights

6 A BPN algorithm
  6.1 Choice of the activation function
  6.2 Configuring the network
  6.3 An algorithm: train221.m

7 Application I: the XOR-gate
  7.1 Results and performance
  7.2 Some criteria for stopping training
      7.2.1 Train until SSE < a
      7.2.2 Finding an SSE-minimum
  7.3 Forgetting

8 Application II: Curve fitting
  8.1 A parabola
  8.2 The sine function
  8.3 Overtraining
  8.4 Some new criteria for stopping training
  8.5 Evaluating the curve fitting results

9 Application III: Time Series Forecasting
  9.1 Results

Conclusions

A Source of the used M-files
  A.1 Associative memory: assostore.m, assorecall.m
  A.2 An example session
  A.3 BPN: train221.m

Bibliography
Preface
Although conventional computers have been shown to be effective at a lot of demanding tasks, they still seem unable to perform certain tasks that our brains do so easily. These are tasks like for instance pattern recognition and various kinds of forecasting. That we do these tasks so easily has a lot to do with our learning capabilities. Conventional computers do not seem to learn very well.
In January 1997, the NRC Handelsblad, in its weekly science subsection, published a series of four columns on neural networks, a technique that overcomes some of the above-mentioned problems. These columns aroused my interest in neural networks, of which I knew practically nothing at the time. As I was just looking for a subject for a paper, I decided to find out more about neural networks.
In this paper, I will start with giving a brief introduction to the theory of neural networks. Section 2 discusses associative memory, which is a simple application of neural networks. It is a flexible way of information storage, allowing retrieval in an associative way.
In sections 3 to 5, general neural networks are discussed. Section 3 shows the behaviour of elementary nets, and in sections 4 and 5 this theory is extended to larger nets. The back-propagation rule is introduced and a general training algorithm is derived from this rule.
Sections 7 to 9 deal with three applications of the back-propagation network. Using this type of net, we solve the XOR-problem and we use this technique for curve fitting. Time series forecasting also deals with predicting function values, but is shown to be a more general technique than the introduced technique of curve fitting.
Using these applications, I demonstrate several interesting phenomena and criteria concerning implementing and training networks, such as stopping criteria, overtraining and forgetting.
Finally, I'd like to thank Rob Bisseling for his supervision during the process and Els Vermij for her numerous suggestions for improving this text.
Jeroen van Grondelle
Utrecht, July 1997
1 An introduction to neural networks
In this section a brief introduction is offered to the theory of neural networks. This theory is based on the actual physiology of the human brain and shows a great resemblance to the way our brains work.
The building blocks of neural networks are neurons. These neurons are nodes in the network and they have a state that acts as output to other neurons. This state depends on the input the neuron is given by other neurons.
Figure 1: A neuron (input, activation function with threshold, output)
A neural network is a set of connected neurons. The connections are called synapses. If two neurons are connected, one neuron takes the output of the other neuron as input, according to the direction of the connection.
Neurons are grouped in layers. Neurons in one layer only take input from the previous layer and give output to the next layer.[1]
Every synapse is associated with a weight. This weight indicates the impact of the output on the receiving neuron. The state of neuron i is defined as:

    s_i = f\left( \sum_k w_{ik} r_k - \theta \right)    (1)

where r_k are the states of the neurons that give input to neuron i and w_{ik} represents the weight associated with the connection. f(x) is the activation function. This function is often linear, or a sign-function when we require binary output. The sign-function is generally replaced by a continuous representation of this function. The value θ is called the threshold.
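The state rule above is easy to express in code. The implementations in this paper are Matlab M-files (see Appendix A); purely as an illustration, a single neuron could be sketched in Python like this, with a sign-function as the default activation:

```python
import math

def neuron_state(weights, inputs, theta, f=None):
    """s = f(sum_k w_k * r_k - theta); `f` defaults to a sign function."""
    if f is None:
        f = lambda h: math.copysign(1.0, h)
    h = sum(w * r for w, r in zip(weights, inputs)) - theta
    return f(h)

# With weights (1, 1) and threshold 1, this neuron outputs 1 only
# when both +/-1 inputs are 1:
print(neuron_state([1.0, 1.0], [1.0, 1.0], 1.0))   # 1.0
print(neuron_state([1.0, 1.0], [1.0, -1.0], 1.0))  # -1.0
```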
Figure 2: A single- and a multi-layered network (input, output, hidden layer)
[1] Networks with connections skipping layers are possible, but we will not discuss them in this paper.
A neural net is based on layers of neurons. Because the number of neurons is finite, there is always an input layer and an output layer, which only give output or take input respectively. All other layers are called hidden layers. A two-layered net is called a simple perceptron and the other nets multi-layered perceptrons. Examples are given in figure 2.
2 Associative memory

2.1 What is associative memory?

In general, a memory is a system that both stores information and allows us to recall this information. In computers, a memory will usually look like an array. An array consists of pairs (i, α), where α is the information we are storing, and i is the index assigned to it by the memory on storage. We can recall the information by giving the index as input to the memory.
Figure 3: Memory recall (the memory M maps an index to the stored information)
This is not a very flexible technique: we have to know exactly the right index to recall the stored information.
Associative memory works much more like our mind does. If we are for instance looking for someone's name, it will help to know where we met this person or what he looks like. With this information as input, our memory will usually come up with the right name. A memory is called an associative memory if it permits the recall of information based on partial knowledge of its contents.
2.2 Implementing Associative Memory using Neural Networks
Neural networks are very well suited to create an associative memory. Say we wish to store p bitwords[2] of length N. We want to recall in an associative way, so we want to give as input a bitword and want as output the stored bitword that most resembles the input.

So it seems the obvious thing to do is to take an N-neuron layer as both input and output layer and find a set of weights so that the system behaves like a memory for the bitwords ξ^1, ..., ξ^p.
Figure 4: An associative memory configuration (an N-neuron input layer connected to an N-neuron output layer)
If now a pattern s is given as input to the system, we want ξ^λ to be the output, so that s and ξ^λ differ in as few places as possible. So we want the error H_j,

    H_j = \sum_{i=1}^{N} (s_i - \xi_i^j)^2    (2)

to be minimal if j = λ. This H_j is called the Hamming distance.[3]

[2] For later convenience, we will work with binary numbers that consist of 1's and -1's, where -1 replaces the usual zero.
We will have a look at a simple case first. Say we want to store one pattern ξ. We will give an expression for w and check that it suits our purposes:

    w_{ij} = \frac{1}{N} \xi_i \xi_j    (3)
If we give an arbitrary pattern s as input, where s differs in n places from the stored pattern ξ, we get:

    S_i = \mathrm{sign}\left( \sum_{j=1}^{N} w_{ij} s_j \right) = \mathrm{sign}\left( \frac{1}{N} \xi_i \sum_{j=1}^{N} \xi_j s_j \right)    (4)
Now examine \sum_{j=1}^{N} \xi_j s_j. If s_j = ξ_j, then ξ_j s_j = 1; otherwise it is -1. Therefore, the sum equals (N - n) - n, and:

    \mathrm{sign}\left( \frac{1}{N} \xi_i \sum_{j=1}^{N} \xi_j s_j \right) = \mathrm{sign}\left( \frac{1}{N} \xi_i (N - 2n) \right) = \mathrm{sign}\left( \left( 1 - \frac{2n}{N} \right) \xi_i \right)    (5)
There are two important features to check. First, we can see that if we choose s = ξ, the output will be ξ. This is obvious, because ξ and ξ differ in 0 places. We call this stability of the stored pattern. Secondly, we want to check that if we give an input reasonably close to ξ, we get ξ as output. Obviously, if n < N/2, the output will equal ξ: then (1 - 2n/N) does not affect the sign of ξ_i. This is called convergence to a stored pattern.
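Both properties are easy to check numerically. The fragment below (Python, for illustration only; the paper's own code is Matlab, see Appendix A) stores a single pattern with the weights defined above and verifies stability and convergence for n < N/2:

```python
import math

N = 8
xi = [1, -1, 1, 1, -1, -1, 1, -1]   # the stored pattern
# Weights for one stored pattern: w_ij = xi_i * xi_j / N
w = [[xi[i] * xi[j] / N for j in range(N)] for i in range(N)]

def recall(s):
    """Output S_i = sign(sum_j w_ij * s_j)."""
    return [int(math.copysign(1, sum(w[i][j] * s[j] for j in range(N))))
            for i in range(N)]

print(recall(xi) == xi)        # stability: True
s = xi[:]
s[0], s[3] = -s[0], -s[3]      # flip n = 2 bits, so n < N/2
print(recall(s) == xi)         # convergence: True
```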
We now want to store all the words ξ^1, ..., ξ^p. And again we will give an expression and prove that it serves our purpose. Define

    w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu    (6)
The method will be roughly the same. We will not give a full proof here: this would be too complex and is not of great importance to our argument. What is important is that we are proving stability of stored patterns and the convergence to a stored pattern of other input patterns. We did this for the case of one stored pattern; the method for multiple stored patterns is similar. Only, proving the error terms to be small enough will take some advanced statistics. Therefore, we will prove up to the error terms here and then quote [Müller].
Because the problem is becoming a little more complex now, we will discuss the activation value for an arbitrary output neuron i, usually referred to as h_i. First we will look at the output when a stored pattern (say ξ^1) is given as input:

    h_i = \sum_{j=1}^{N} w_{ij} \xi_j^1 = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \sum_{j=1}^{N} \xi_j^\mu \xi_j^1 = \frac{1}{N} \left( \xi_i^1 \sum_{j=1}^{N} \xi_j^1 \xi_j^1 + \sum_{\mu \neq 1} \xi_i^\mu \sum_{j=1}^{N} \xi_j^\mu \xi_j^1 \right)    (7)
[3] Actually, this is the Hamming distance when bits are represented by 0's and 1's. The square then acts as absolute-value operator. So we should scale results by a constant factor 1/4 to obtain the Hamming distance.
The first part of the last expression is equal to ξ_i^1, due to similar arguments as in the previous one-pattern case. The second expression is dealt with using laws of statistics; see [Müller].
Now we give the system an input s where n neurons start out in the wrong state. Then generalizing the previous expression, in the same way as in the one-pattern case, gives:

    h_i = \left( 1 - \frac{2n}{N} \right) \xi_i^1 + \frac{1}{N} \sum_{\mu \neq 1} \xi_i^\mu \sum_{j=1}^{N} \xi_j^\mu s_j    (8)
The first term is equal to that of the single-pattern storage case. And the second is again proven small by [Müller]. Moreover, it is proven that

    h_i = \left( 1 - \frac{2n}{N} \right) \xi_i^1 + O\left( \sqrt{\frac{p-1}{N}} \right)    (9)
So if p ≪ N, the system will still function as a memory for the p patterns. In [Müller], it is proven that as long as p < 0.138N the system will function well.
2.3 Matlab functions implementing associative memory

In Appendix A.1, two Matlab functions are given for both storing and recalling information in an associative memory as described above. Here we will make some short remarks on how this is done.
2.3.1 Storing information
The assostore function works as follows. The function gets a binary matrix S as input, where the rows of S are the patterns to store. After determining its size, the program fills a matrix w with zeros. The values of S are transformed from (0,1) to (-1,1) notation. Now all values of w are computed using the storage rule for w derived in section 2.2; this formula is implemented using the inner product of two columns in S. The division by N is delayed until the end of the routine.
2.3.2 Recalling information
assorecall.m is also a straightforward implementation of the procedure described above. After transforming from (0,1) to (-1,1) notation, s is computed as w times the transposed input. The sign of this s is transformed back to (0,1) notation.
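In outline, the two routines do the following. The sketch below is a hypothetical Python rendering of the same steps, for illustration only; the real assostore.m and assorecall.m are listed in Appendix A.1:

```python
def assostore(S):
    """Build the weight matrix from the rows of the 0/1 pattern matrix S."""
    p, N = len(S), len(S[0])
    B = [[2 * b - 1 for b in row] for row in S]   # (0,1) -> (-1,1) notation
    w = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # inner product of columns i and j of the +/-1 matrix,
            # with the division by N delayed to the end
            w[i][j] = sum(B[mu][i] * B[mu][j] for mu in range(p)) / N
    return w

def assorecall(w, s):
    """Recall: the sign of w times the +/-1 version of s, mapped back to 0/1."""
    b = [2 * x - 1 for x in s]
    N = len(b)
    h = [sum(w[i][j] * b[j] for j in range(N)) for i in range(N)]
    return [1 if x > 0 else 0 for x in h]

patterns = [[1, 0, 1, 0, 1, 0, 1, 0], [1, 1, 1, 1, 0, 0, 0, 0]]
w = assostore(patterns)
# A probe one bit away from the first pattern recalls it exactly:
print(assorecall(w, [0, 0, 1, 0, 1, 0, 1, 0]) == patterns[0])  # True
```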
3 The perceptron model

3.1 Simple perceptrons

In the previous section, we have been looking at two-layered networks, which are also known as simple perceptrons. We did not really go into the details: an expression for w was given and we simply checked that it worked for us. In this section we will look closer at what these simple perceptrons really do.

Let us look at a 2-neuron input, 1-neuron output simple perceptron, as shown in figure 5.
Figure 5: A (2,1) simple perceptron
This net has only two synapses, with weights w_1 and w_2, and we assume S_1 has threshold θ. We allow inputs from the reals and take as activation function the sign-function. Then the output is given by:

    S_1 = \mathrm{sign}(w_1 s_1 + w_2 s_2 - \theta)    (10)

There is also another way of looking at S_1. The inner product of w and s actually defines the direction of a line in the input space; θ determines the location of this line, and taking the sign over this expression determines whether the input is on one side of the line or at the other side. This can be seen more easily if we rewrite (10) as:

    S_1 = \begin{cases} 1 & \text{if } w_1 s_1 + w_2 s_2 > \theta \\ -1 & \text{if } w_1 s_1 + w_2 s_2 < \theta \end{cases}    (11)
So (2,1) simple perceptrons just divide the input space in two and return 1 at one half and -1 at the other. We visualize this in figure 6.
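This dividing behaviour takes only a few lines to reproduce (Python, for illustration):

```python
def perceptron(w1, w2, theta, s1, s2):
    """A (2,1) simple perceptron: 1 on one side of the line
    w1*s1 + w2*s2 = theta, -1 on the other."""
    return 1 if w1 * s1 + w2 * s2 - theta > 0 else -1

# With w = (1, 1) and theta = 0 the separating line is the diagonal s2 = -s1:
print(perceptron(1, 1, 0, 2.0, 1.0))    # 1  (one side of the line)
print(perceptron(1, 1, 0, -2.0, 1.0))   # -1 (the other side)
```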
Figure 6: A simple perceptron dividing the input space
We can of course generalize this to (n,1) simple perceptrons, in which case the perceptron defines an (n-1)-dimensional hyperplane in the n-dimensional input space. The hyperplane view of simple perceptrons also allows looking at not too complex multi-layered nets. As we saw before, every neuron in the first hidden layer is an indicator of a hyperplane. But the next hidden layer again consists of indicators of hyperplanes, defined this time on the output of the first hidden layer. Multi-layered nets soon become far too complex to study in such a concrete way. In the literature we see that multi-layered nets are often regarded as black boxes: you know what goes in, you train until the output is right and you do not bother about the exact actions inside the box. But for relatively small nets, it can be very interesting to study the exact mechanism, as it can show whether or not a net is able to do the required job. This is exactly what we will do in the next subsection.
3.2 The XOR-problem

As we have seen, simple perceptrons are quite easy to understand and their behaviour is very well modelled: we can visualize their input-output relation through the hyperplane method. But simple perceptrons are very limited in the sort of problems they can solve. If we look for instance at logical operators, we can instantly see one of their limits. Although a simple perceptron is able to adopt the input-output relation of both the OR and AND operator, it is unable to do the same for the Exclusive-Or gate, the XOR-operator.
     s_1   s_2  |  S_1
     -1    -1   |  -1
     -1     1   |   1
      1    -1   |   1
      1     1   |  -1

Table 1: The truth table of the XOR-function
We examine first the AND-implementation on a simple perceptron. The input-output relation would be:
Figure 7: Input-output relation for the AND-gate
Here the input is on the axes, and a black dot means output 1 and a white dot means output -1. As we have seen in section 3.1, a simple perceptron will define a hyperplane, returning 1 at one side and -1 at the other. In figure 8, we choose a hyperplane for both the AND- and the OR-gate input space. We immediately see why a simple perceptron will never simulate an XOR-gate, as this would take two hyperplanes, which a simple perceptron can not define.
Figure 8: A hyperplane choice for all three gates (AND, OR and XOR)
It is now almost trivial to find the simple perceptron solution to the first two gates. Obviously, (w_1, w_2) = (1, 1) defines the direction of the chosen line. It follows that for the AND-gate θ = 1 works well. In the same way we compute values for the OR-gate: w_1 = 1, w_2 = 1 and θ = -1.
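These values are quickly verified against the truth tables; a small check (Python, for illustration):

```python
def gate(w1, w2, theta, s1, s2):
    # sign(w1*s1 + w2*s2 - theta), on +/-1 logic values
    return 1 if w1 * s1 + w2 * s2 - theta > 0 else -1

inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
# AND-gate: (w1, w2) = (1, 1), theta = 1
print([gate(1, 1, 1, a, b) for a, b in inputs])    # [-1, -1, -1, 1]
# OR-gate: (w1, w2) = (1, 1), theta = -1
print([gate(1, 1, -1, a, b) for a, b in inputs])   # [-1, 1, 1, 1]
```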
When neural nets were only just invented and these obvious limits were discovered, most scientists regarded neural nets as a dead end: if problems this simple could not be solved, neural nets were never going to be very useful. The answer to these limits was multi-layered nets.
3.3 Solving the XOR-problem using multi-layered perceptrons
Although the XOR-problem can not be solved by simple perceptrons, it is easy to show that it can be solved by a (2,2,1) perceptron. We could prove this by giving a set of suitable synapses and proving its functioning. We could also go deeper into the hyperplane method. Instead of these options, we will use some logical rules and express the XOR operator in terms of OR and AND operators, which we have seen we can handle. It can be easily shown that:

    (s_1 \text{ XOR } s_2) = (s_1 \wedge \neg s_2) \vee (\neg s_1 \wedge s_2)    (12)

We have neural net implementations of the OR and AND operator. Because we are using 1 and -1 as logical values, ¬s_1 is equal to -s_1. This makes it easy to put s_1 and ¬s_2 in a neural AND-gate: we just negate the synapse that carries s_2 and use s_2 as input instead of ¬s_2. This suggests the following (2,2,1)-solution. The input layer is used as usual and feeds the hidden layer, consisting of hs_1 and hs_2. These function as AND-gates as indicated in (12). S_1, the only element in the output layer, implements the OR-symbol in (12).
By writing down the truth table for the system, it can easily be shown that the given net is correct.
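One set of weights consistent with this construction can be checked directly. The fragment below (Python, for illustration) hard-codes two hidden AND-gates, each with one negated synapse, and an OR output gate:

```python
def sgn(x):
    return 1 if x > 0 else -1

def xor_net(s1, s2):
    # Hidden layer: AND-gates (theta = 1) with one negated input synapse
    h1 = sgn(s1 - s2 - 1)    # s1 AND (NOT s2)
    h2 = sgn(-s1 + s2 - 1)   # (NOT s1) AND s2
    # Output layer: an OR-gate (theta = -1) on the hidden states
    return sgn(h1 + h2 + 1)

for s1, s2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(s1, s2, xor_net(s1, s2))   # matches the XOR truth table
```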
Figure 9: A (2,2,1) solution of the XOR-gate
4 Multi-layered networks
In the previous section, we studied a very specific case of multi-layered networks. We could determine its synaptic strengths because it was a combination of several simple perceptrons, which we had studied quite thoroughly before, and because we could reduce the original problem to several subproblems that we had already solved using neural nets. In the preface, several tasks were mentioned, such as character recognition, time series forecasting, etc. These are all very demanding tasks, which need considerably larger nets. These tasks are also problems we do not understand so well, so we are not able to define subproblems which we could solve first. The strong feature of neural nets that we are going to use here is that, by training, the net will learn the input-output relation we are looking for. We are not concerned with the individual function of neurons; in this section we will consider the net as the earlier mentioned black box.
4.1 Learning

Let us discuss a concrete example here. A widely used application of neural nets is that of character recognition. The input of our black box could then be for instance a matrix of ones and zeros, representing a scan of a character. The output could consist of 26 neurons, representing the 26 characters of the alphabet.

Since we do not have a concrete solution in for instance hyperplane or logical terms to implement in a net, we choose more or less at random a net configuration and synaptic strengths for this net. Not all net configurations are able to learn all problems (we have seen a very obvious example of that before), but there are guidelines and rough estimations on how large a net has to be. We will not go into that right now.

Given our net, every scan given as input will result in an output. It is not very likely that this net will do what we want from the start, since we initiated it randomly. It all depends on finding the right values for the synaptic strengths. We need to train the net: we give it an input and compare the output with the result we wanted to get, and then we adjust the synaptic strengths. This is done by learning algorithms, of which the earlier mentioned Back-Propagation rule is an example. We will discuss the BPN-rule later. By repeating this procedure often with different examples, the net will learn to give the right output for a given input.
4.2 Training

We have mentioned the word training several times now. It refers to the situation where we show the system several inputs and provide the required output as well. The net is then adjusted; by doing this the net learns. The contents of the training set are of crucial importance. First of all, it has to be large enough: in order to get the system to generalize, a large set of examples has to be available. Probably, a network trained with a small set will behave like a memory, but a limited training set will never evoke the behaviour we are looking for: adopting an error-tolerant, generalizing input-output relation.

The set also has to be sufficiently rich. The notion we want the neural net to recognize has to be the only notion that is present everywhere in our training set. As this may sound a bit vague, an example might be necessary. If we have a set of pictures of blond men and dark women, we could teach a neural net to determine the sex of a person. But it might very well be that on showing this trained system a blond girl, the net would say it's a boy. There are obviously two notions in order here: someone's sex and the colour of his or her hair.
In the theory of neural nets, one comes across more of these rather vague problems. The non-deterministic nature of training means that trained systems can get overtrained and can even forget. We will not pay too much attention to these phenomena now; we will discuss them later, when we have practical examples to illustrate them.
4.3 Generalizing

There is an aspect of learning that we have not yet discussed. We defined training as adjusting a neural net to the right input-output relation, and this relation is then defined by the training set. This suggests that we train the network to give the right output at every input from the training set. If this is all that the system can achieve, it would be nothing more than a memory, which we discussed in section 2. We also want the system to give output on input that is not in the training set, and we want this output to be correct. By giving the system a training set, we want the system to learn about other inputs as well. Of course these will have to be close enough to the ones in the training set.

The right network configuration is crucial for the system to learn to generalize. If the network is too large, it will be able to memorize the training set; if it is too small, it simply will not be able to master the problem. So configuring a net is very important. There are basically two ways of achieving the right size. One is to begin with a rather big net; after some training, the non-active neurons and synapses are removed, thus leaving a smaller net, which can be further trained. This technique is called pruning. The other way is rather the opposite: start with a small net and enlarge it if it does not succeed in solving the problem. This guarantees you to get the smallest net that does the job, but you will have to train a whole new net every time you add some neurons.
5 The Back-Propagation Network

5.1 The idea of a BPN

In the previous section we mentioned a learning algorithm. This algorithm updated the synaptic strengths after comparing an expected output with an actual output; the algorithm should alter the weights to minimize the error next time. One of the algorithms developed is the Error Back-Propagation algorithm. This is the algorithm we will describe here and implement in the next section. We will discuss a specific case in detail: we will derive and implement this rule for a three-layered network.
Figure 10: The network configuration we will solve: N input neurons x_1, ..., x_N; a hidden layer of L neurons with states i_j = f^h(h^h_j); an output layer of M neurons with states o_k = f^o(h^o_k)
We want to minimize the error between expected output y and actual output o. From now on we will be looking at a fixed training-set pair: an input vector x and an expected output y. The actual output o is the output that the net gives for the input vector. We define the total error:

    E = \frac{1}{2} \sum_k \varepsilon_k^2    (13)

where ε_k is the difference between the expected and actual output of output neuron k: ε_k = (y_k - o_k).

Since all the information of the net is in its weights, we could look at E as a function of all its weights. We could regard the error to be a surface in W × R, where W is the weights space. This weights space has as dimension the number of synapses in the entire network. Every possible state of this network is represented by a point (w^h, w^o) in W.
Now we can look at the derivative of E with respect to W. This gives us the gradient of E, which always points in the direction of steepest ascent of the surface. So -grad(E) points in the direction of steepest descent. Adjusting the net to a point (w^h, w^o) in the direction of -grad(E) secures that the net will perform better next time. This procedure is visualized in figure 11.
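The descent idea itself is independent of neural nets and can be watched on a toy error surface. In the sketch below (Python, for illustration), the surface E(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2 is made up for this example; its minimum lies at (1, -2):

```python
def grad(w):
    # gradient of E(w1, w2) = (w1 - 1)^2 + (w2 + 2)^2
    return [2 * (w[0] - 1), 2 * (w[1] + 2)]

w = [0.0, 0.0]
eta = 0.1
for _ in range(100):
    g = grad(w)
    # step in the direction of -grad(E)
    w = [w[0] - eta * g[0], w[1] - eta * g[1]]
print(w)   # close to [1.0, -2.0]
```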
Figure 11: The error as a function of the weights (the surface E over W-space, and the direction -grad(E))
5.2 Updating the output-layer weights

We will calculate the gradient of E in two parts and start with the output-layer weights:

    \frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k) \frac{\partial f^o_k(h^o_k)}{\partial h^o_k} \frac{\partial h^o_k}{\partial w^o_{kj}}    (14)

Because we have not yet chosen an activation function f, we can not yet evaluate \partial f^o_k / \partial h^o_k. We will refer to it as f^{o'}_k(h^o_k). What we do know is:

    \frac{\partial h^o_k}{\partial w^o_{kj}} = \frac{\partial}{\partial w^o_{kj}} \left( \sum_{l=1}^{L} w^o_{kl} i_l - \theta^o_k \right) = i_j    (15)

Combining the previous equations gives:

    \frac{\partial E}{\partial w^o_{kj}} = -(y_k - o_k) f^{o'}_k(h^o_k) \, i_j    (16)

Now we want to change w^o_{kj} in the direction of -\partial E / \partial w^o_{kj}. We define:

    \delta^o_k = (y_k - o_k) f^{o'}_k(h^o_k)    (17)

Then we can update w^o according to:

    w^o_{kj}(t+1) = w^o_{kj}(t) + \eta \, \delta^o_k i_j    (18)

where η is called the learning-rate parameter. η determines the learning speed: the extent to which w is adjusted in the gradient's direction. If it is too small, the system will learn very slowly. If it is too big, the algorithm will adjust w too strongly and the optimal situation will not be reached. The effects of different values of η are discussed further in section 7.
5.3 Updating the hidden-layer weights

To update the hidden-layer weights we will follow a procedure roughly the same as in section 5.2. In section 5.2 we looked at E as a function of the output-neuron values. Now we will look at E as a function of the hidden-neuron values i_j:

    E = \frac{1}{2} \sum_k (y_k - o_k)^2 = \frac{1}{2} \sum_k \left( y_k - f^o_k(h^o_k) \right)^2 = \frac{1}{2} \sum_k \left( y_k - f^o_k\left( \sum_j w^o_{kj} i_j - \theta^o_k \right) \right)^2

And now we examine \partial E / \partial w^h_{ji}:

    \frac{\partial E}{\partial w^h_{ji}} = \frac{1}{2} \sum_k \frac{\partial}{\partial w^h_{ji}} (y_k - o_k)^2 = -\sum_k (y_k - o_k) \frac{\partial o_k}{\partial h^o_k} \frac{\partial h^o_k}{\partial i_j} \frac{\partial i_j}{\partial h^h_j} \frac{\partial h^h_j}{\partial w^h_{ji}}

We can deal with these four derivatives the same way as in section 5.2. The first and the third are clearly equal to the unknown derivatives of f. The second is equal to:

    \frac{\partial h^o_k}{\partial i_j} = \frac{\partial}{\partial i_j} \left( \sum_{j=1}^{L} w^o_{kj} i_j - \theta^o_k \right) = w^o_{kj}    (19)

For the same reason, the last derivative is x_i. So we have:

    \frac{\partial E}{\partial w^h_{ji}} = -\sum_k (y_k - o_k) f^{o'}_k w^o_{kj} f^{h'}_j x_i    (20)

We define a δ^h similar to δ^o:

    \delta^h_j = f^{h'}_j(h^h_j) \sum_k (y_k - o_k) f^{o'}_k(h^o_k) w^o_{kj} = f^{h'}_j(h^h_j) \sum_k \delta^o_k w^o_{kj}    (21)

Looking at the definition of δ^h, we can see that updating w^h_{ji} in the direction of -\partial E / \partial w^h_{ji} is equal to:

    w^h_{ji}(t+1) = w^h_{ji}(t) + \eta \, \delta^h_j x_i    (22)

where η is again the learning parameter.
6 A BPN algorithm

In the next sections we will demonstrate a few phenomena as described in section 4, using an application of a (2,2,1) back-propagation network. We have seen this relatively simple network before in subsection 3.3. The XOR-gate described there will be the first problem we solve with an application of the BPN. In this section we will formulate a general (2,2,1)-BPN training algorithm.
6.1 Choice of the activation function

Since we will be simulating the XOR-gate, which has outputs -1 and 1 only, it is an obvious choice to use a sigmoidal activation function. We will use f(x) = tanh(x).
Figure 12: A sigmoidal activation function: f(x) = tanh(x), with derivative df/dx = 1 - tanh^2(x)
We will also need its derivative. Since tanh(x) = sinh(x)/cosh(x), we have:

    \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Differentiating this expression yields:

    \tanh'(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x)
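The identity can be confirmed numerically (Python, for illustration), by comparing against a central finite difference:

```python
import math

def f(x):        # the activation function
    return math.tanh(x)

def fprime(x):   # its derivative, 1 - tanh^2(x)
    return 1.0 - math.tanh(x) ** 2

eps = 1e-6
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
    print(abs(numeric - fprime(x)) < 1e-8)   # True at every point
```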
6.2 Configuring the network

We are going to use a three-layer net, with two input neurons, two hidden neurons and one output neuron. As we have already chosen the activation function, we now only have to decide how to implement the thresholds. In section 5 we paid no special attention to them. This was not necessary, since we will show here that they are easily treated as ordinary weights.
We add a special neuron to both the input and the hidden layer and we define the state of this neuron equal to 1. This neuron therefore takes no input from previous layers, since that would have no impact anyway. The weight of a synapse between this special neuron and a neuron in the next layer then acts as the threshold for that neuron. When the activation value for a neuron is computed, it now looks like:

    h_j = \sum_{i=1}^{k} w_{i,j} s_i + w_{k+1,j} \cdot 1 = \sum_{i=1}^{k+1} w_{i,j} s_i

Neuron k+1 is the special neuron that always has a state equal to 1.
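In code, the trick amounts to appending a constant 1 to the state vector (Python, for illustration); the weight attached to that entry plays the role of the threshold:

```python
def activation(weights, states):
    """h = sum_{i=1}^{k+1} w_i * s_i, where s_{k+1} is the always-1 neuron."""
    return sum(w * s for w, s in zip(weights, states + [1.0]))

# Regular weights (1, 1) plus a bias weight of -1 reproduce s1 + s2 - 1,
# i.e. an ordinary activation with threshold 1:
print(activation([1.0, 1.0, -1.0], [1.0, 1.0]))   # 1.0
print(activation([1.0, 1.0, -1.0], [1.0, -1.0]))  # -1.0
```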
In figure 13 we give an example of such a net.
Figure 13: A (2,2,1) neural net with weights as thresholds
This approach enables us to implement the network by using the techniques from section 5, without paying special attention to the thresholds.
6.3 An algorithm: train221.m

Given the above-mentioned choices and the explicit method described in section 5, we can now implement a training function for the given situation. Appendix A.3 gives the source of train221.m. This function is used as follows:

    [WH, WO, E] = train221(Wh, Wo, x, y, eta)

where the inputs Wh and Wo represent the current weights in the network, (x, y) is a training input-output pair and eta is the learning parameter. The outputs WH and WO are the updated weight matrices and E is the error, as computed before the update.
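As an outline of what such a function has to do, here is a hypothetical Python counterpart with the same signature (an illustration only; the authoritative source is train221.m in Appendix A.3, whose details may differ):

```python
import math

def train221(Wh, Wo, x, y, eta):
    """One back-propagation step for a (2,2,1) net with f = tanh.

    Wh: 2x3 hidden-layer weights (third column: weight of the always-1 neuron),
    Wo: three output-layer weights. Returns the updated weights and the error E,
    computed before the update."""
    xs = [x[0], x[1], 1.0]                                   # input plus bias neuron
    hh = [sum(Wh[j][i] * xs[i] for i in range(3)) for j in range(2)]
    hid = [math.tanh(h) for h in hh] + [1.0]                 # hidden states plus bias
    o = math.tanh(sum(Wo[i] * hid[i] for i in range(3)))
    E = (y - o) ** 2

    delta_o = (y - o) * (1.0 - o * o)                        # (y - o) * f'(h_o)
    WO = [Wo[i] + eta * delta_o * hid[i] for i in range(3)]
    WH = [row[:] for row in Wh]
    for j in range(2):
        delta_h = (1.0 - hid[j] ** 2) * delta_o * Wo[j]      # f'(h_h) * delta_o * w_o
        for i in range(3):
            WH[j][i] += eta * delta_h * xs[i]
    return WH, WO, E

# A single step on a training pair reduces the error on that pair (small eta):
Wh = [[0.3, -0.2, 0.1], [0.2, 0.4, -0.3]]
Wo = [0.25, -0.15, 0.05]
WH, WO, E0 = train221(Wh, Wo, (1, -1), 1, 0.05)
_, _, E1 = train221(WH, WO, (1, -1), 1, 0.05)
print(E1 < E0)   # True
```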
7 Application I: the XOR-gate

7.1 Results and performance
We will now use the algorithm to solve the XOR-gate problem. First, we define our training set:

    S_T = \{ ((1,1), -1),\ ((1,-1), 1),\ ((-1,1), 1),\ ((-1,-1), -1) \}

The elements of this set are given as input to the training algorithm introduced in the previous section. This is done by a special M-file, which also stores the error terms. These error terms enable us to analyse the training behaviour of the net. In the rest of the section, we will describe several phenomena, using the information the error terms give us.
When looking at the performance of the net, we can look, for instance, at the error of the net on an input-output pair (x_i, y_i) of the training set:

    E_i = (y_i - o_i)^2

with y_i as the expected output and o_i as the output of the net with x_i as input. A measure of performance on the entire training set is the Sum of Squared Errors (SSE):

    SSE = \sum_i E_i

Clearly, the SSE is an upper bound for every E_i. We will use this frequently when examining the net's performance: if we want the error on every training-set element to converge to zero, we just compute the SSE and check that it does this.
Now we will have a first look at the results of training the net on S_T. Figure 14 shows some of the results.

Figure 14: Some training results: the number of iterations with the errors E_1, ..., E_4 and the SSE, for two training sessions
As we see in figure 14, both training sessions are successful, as the SSE becomes very small. We see that with larger η, the SSE converges to zero faster. This suggests taking large values for η. To see if this strategy would be successful, we repeat the experiment with various values of η.

In figure 15, the SSE is plotted versus the number of training laps for various η. We can see that, for η = 0.1, the SSE converges to zero. For η = 0.2, the SSE converges faster, but less smoothly: after a number of training laps, the SSE has a little peak. Taking larger η, as suggested above, does not seem very profitable. When η is 0.4, the SSE has
��
eta = .1
eta = .2
eta = .4
eta = .6
eta = .8
0 50 100 150 200 2500
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Figure �� SSE vs� number of training laps� for various
strong oscillations and with � ��� SSE does not even converge to zero�
This non-convergence for large eta can be explained by the error-surface view used earlier. We regard the error as a function of all the weights. This leads to an error surface on the weight space. We used the gradient of E in this space to minimize the error; eta expresses the extent to which we change the weights in the direction opposite to the gradient. In this way we hope to find a minimum in this space. If eta is too large, we can jump over this minimum, approach it from the other side and jump over it again. Thus, we will never reach it and the error will not converge.
The conclusion seems to be that the choice of eta is important. If it is too small, the network will learn very slowly. A larger eta leads to faster learning, but the network might not reach the optimal solution.
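The jumping-over-the-minimum effect is easy to reproduce in one dimension. A Python sketch minimizing the toy error surface E(w) = w^2, whose gradient is 2w (this is an illustration of the mechanism, not the paper's network):

```python
def descend(eta, w=1.0, steps=50):
    """Gradient descent on E(w) = w^2: each lap moves w against the
    gradient 2w by a step of size eta."""
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(abs(descend(0.1)))  # small eta: w creeps towards the minimum at 0
print(abs(descend(1.1)))  # too-large eta: w jumps over 0, further out each time
```

With eta below 1 the iterates contract towards the minimum; above 1 every step overshoots by more than the previous distance, so the error grows without bound.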
We have now trained a network to give the right outputs on inputs from the training set. And in this specific case, these are the only inputs we are interested in. But the net does give outputs on other inputs. The figure below shows the output on inputs in the square between the training-set inputs. The graph clearly shows the XOR-gate's outputs on the four corners of the surface.
In this case, we were only interested in the training-set elements. What the net does by representing these four states correctly is actually only remembering by learning. Later, we will look at cases where we are interested in the outputs on inputs outside the training set. Then we are investigating the generalizing capabilities of neural networks.
[Figure: The output of the XOR-net]
7.2 Some criteria for stopping training
When using neural networks in applications, we will in general not be interested in all the SSE curves etc. In these cases, training is just a way to get a well-performing network, which, after stopping training(1), can be used for the required purposes. There are several criteria to stop training.
7.2.1 Train until SSE < a
A very obvious method is to choose a value a and stop training as soon as the SSE gets below this value. In the examples of possible SSE-curves we have seen so far, the SSE, for suitable eta, converges more or less monotonically to zero. So it is bound to decrease below any value required.
Choosing this value depends on the accuracy you demand. As we saw before, the SSE is an upper bound for the Ei, which was the square of y - o. So if we tolerate a difference of c between the expected output and the net's output, we want:
Ei < c^2 for all i
Since the SSE is an upper bound, we could use SSE < c^2 as a stopping criterion.
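This criterion is a simple loop. A Python sketch (the `train_lap` function, performing one training lap and returning the SSE, is a hypothetical stand-in; a cap on the number of laps guards against the non-stopping case discussed below):

```python
def train_until(train_lap, c, max_laps=10000):
    """Train until SSE < c^2 (so every |y_i - o_i| < c); cap the number
    of laps, since some situations never reach the required accuracy."""
    target = c ** 2
    sse = float("inf")
    for lap in range(1, max_laps + 1):
        sse = train_lap()          # one training lap, returning the SSE
        if sse < target:
            return lap, sse
    return max_laps, sse

# Toy stand-in for training: the SSE halves on every lap, starting at 2.0.
state = {"sse": 2.0}
def fake_lap():
    state["sse"] *= 0.5
    return state["sse"]

print(train_until(fake_lap, c=0.1))  # stops as soon as SSE < 0.01
```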
The advantage of this criterion is that you know a lot about the performance of the net if training is stopped by it(1). A disadvantage is that training might not stop in some situations: some situations are too complex for a net to reach the given accuracy.

(1) Unless the input-output relation is changing through time and we will have to continue training in the new situations.

[Figure: Stopping training when the SSE drops below the chosen value a]
7.2.2 Finding an SSE-minimum
The disadvantage of the previous method suggests another method: if the SSE does not converge to zero, we want to at least stop training at its minimum. We might train a net for a very long time, plot the SSE and look for its global minimum. Then we retrain the net under the same circumstances and stop at this optimal point. This is not realistic, however, since training in complex situations can take a considerable amount of time and complete retraining would take too long.
Another approach is to stop training as soon as the SSE starts growing. For small eta, this might work, since we noticed before that choosing a small eta leads to very smooth and monotonic SSE-curves. But there is still a big risk of ending up in a local SSE-minimum: training would stop just before a little peak in the SSE-curve, although training on would soon lead to even better results.
The advantages are obvious: given a complex situation with a non-convergent SSE, we still reach a relatively optimal situation. The disadvantages are obvious too: this method might very well lead to suboptimal stopping. We can limit this risk by choosing eta small and perhaps combining the two techniques: train the network through the first phase with the first criterion and then find a minimum in the second, smoother phase with the second criterion.
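The risk of stopping just before a little peak can be softened by allowing the SSE a few laps without improvement before giving up. A Python sketch of this minimum-finding criterion (`train_lap` is again a hypothetical stand-in):

```python
def train_to_minimum(train_lap, patience=0, max_laps=10000):
    """Stop when the SSE has not improved for more than `patience`
    consecutive laps; patience=0 stops at the first increase, a larger
    patience survives the little peaks in the SSE-curve."""
    best, bad = float("inf"), 0
    for _ in range(max_laps):
        sse = train_lap()
        if sse < best:
            best, bad = sse, 0
        else:
            bad += 1
            if bad > patience:
                break
    return best

# Toy SSE-curve with a little local peak, then a further descent:
curve = iter([1.0, 0.6, 0.4, 0.5, 0.3, 0.2, 0.25, 0.3, 0.35])
print(train_to_minimum(lambda: next(curve), patience=1))  # survives the peak
```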
7.3 Forgetting
Forgetting is another phenomenon we will demonstrate here. So far our training has consisted of subsequently showing the net all the training-set elements an equal number of times. We will show that this is very important.
The figure below shows the error during training on all the individual training-set elements.
[Figure: Finding an SSE-minimum]
It is clear that these functions Ei do not converge monotonically. While the error on some elements decreases, the error on others increases. This suggests that training the net on element a might negatively influence the performance of the net on another element b.
This is the basis for the process of forgetting: if we stop training an element, training the other elements influences the performance on this element negatively and causes the net to forget the element.
In the figure below we see the results of the following experiment. We start training the net on one element only. We can see that the performance on two of the other elements becomes worse. Surprisingly, the performance on the remaining element improves along with the trained element. After a number of rounds of training, we stop training the first element and start training the other three elements. Clearly, the error on the first element increases dramatically and the net ends up performing well on the other three. The net forgot the first element.
[Figure: Training per element; the curves E1, E2, E3, E4 and the SSE]
[Figure: The net forgets the element whose training is stopped; the curves E1, E2, E3, E4 and the SSE]
8 Application II: Curve fitting
In this section we will look at another application of three-layered networks: we will try to use a network to represent a function f: R -> R. We use a network with one input and one output neuron, and we take five sigmoidal hidden neurons. The output neuron will have a linear activation function, because we want it to have outputs not just in the [-1, 1] interval. The rest of the network is similar to that used in the previous section. Also, the training algorithm is analogous and therefore not printed in the appendices. The matter of choosing eta was discussed in the previous section and we will let it rest now; for the rest of the section, we will use eta = 0.1, which turns out to give just as smooth and convergent training results as in the previous section.
[Figure: A 1-5-1 neural network, with input x and output f(x)]
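A 1-5-1 net of this kind, with its back-propagation step, can be sketched in a few lines. This is a Python illustration under the stated assumptions (five tanh hidden neurons, a linear output neuron, per-pattern updates), not the author's MATLAB code; as a smoke test it is trained on a handful of parabola points:

```python
import math
import random

# A 1-5-1 net: five tanh hidden neurons and a linear output neuron, so the
# output is not confined to the [-1, 1] interval.
random.seed(0)
H = 5
wh = [random.uniform(-1, 1) for _ in range(H)]  # hidden weights
bh = [random.uniform(-1, 1) for _ in range(H)]  # hidden biases
wo = [random.uniform(-1, 1) for _ in range(H)]  # output weights
bo = 0.0                                        # output bias

def forward(x):
    """Hidden outputs i_k = tanh(wh_k x + bh_k); the output is linear in them."""
    i = [math.tanh(wh[k] * x + bh[k]) for k in range(H)]
    return sum(wo[k] * i[k] for k in range(H)) + bo, i

def train_pair(x, y, eta=0.1):
    """One back-propagation step; with a linear output neuron the output
    delta is simply the error y - o."""
    global bo
    o, i = forward(x)
    delta_o = y - o
    for k in range(H):
        delta_h = (1 - i[k] ** 2) * delta_o * wo[k]  # tanh' = 1 - tanh^2
        wo[k] += eta * delta_o * i[k]
        wh[k] += eta * delta_h * x
        bh[k] += eta * delta_h
    bo += eta * delta_o

# Smoke test: five training pairs on the parabola y = x^2 in [0, 1].
train_set = [(x / 4, (x / 4) ** 2) for x in range(5)]
sse = lambda: sum((y - forward(x)[0]) ** 2 for x, y in train_set)
before = sse()
for _ in range(500):
    for x, y in train_set:
        train_pair(x, y)
print(before, sse())  # SSE before and after 500 training laps
```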
8.1 A parabola
We will try to fit the parabola y = x^2 and train the network with several inputs from the [0, 1] interval. The training set we use is:
ST = {(x1, x1^2), ..., (x5, x5^2)} for five values xi in the [0, 1] interval
Training the network shows that the SSE converges to zero smoothly. In this section we will focus less on the SSE and more on the behaviour of the trained network. In the previous section, we wanted the network to perform well on the training set; in this section we want the network to give accurate predictions of the value of x^2, with x the input value, and not just on the five training pairs. So we will not show the SSE graph here; instead we will plot the network's prediction of the parabola.
As we can see, the network predicts the function really well: after enough training runs we have a fairly accurate prediction of the parabola. It is interesting whether the network also has any knowledge of what happens outside the [0, 1] interval, i.e. whether it can predict the value outside that interval. The figures below show that the network fails to do this: outside its training set, its performance is bad.
[Figures: The network's prediction of the parabola vs. the actual value, at two stages of training]
[Figure: The network does not extrapolate]
8.2 The sine function
In this subsection we will repeat the experiment from the previous subsection for the sine function. Our training set is:
ST = {(x1, sin x1), ..., (x5, sin x5)} for five values xi in the [0, 3] interval

These are the results of training a net on ST:
[Figures: The network's prediction of the sine vs. the actual value, after a small and after a large number of training runs]
Obviously, this problem is a lot harder for the network to solve. After a small number of runs the performance is not good yet, and even after many more runs there is a noticeable difference between the prediction and the actual value of the sine function.
8.3 Overtraining
An interesting phenomenon is that of overtraining. So far, the only measure of performance has been the SSE on the training set, on which the two suggested stopping criteria were based. In this section, we abandon the SSE-approach, because we are interested in the performance on sets larger than just the training set. SSE-stopping criteria combined with this new objective of performance on larger sets can lead to overtraining. We give an example. We trained two networks on:
ST = {(x1, x1^2), ..., (x5, x5^2)}, with four values xi in the [0, 1] interval and one well outside it

Here are the training results:
[Figure: Network A predicting the parabola]
[Figure: Network B predicting the parabola]
The question is which of the above networks functions best. With the SSE on ST in mind, the answer is obvious: network B has a very small SSE on the training set. But we mentioned before that we wanted the network to perform on a wider set. So maybe we should prefer network A after all.
In fact, network B is just a longer-trained version of network A. We call network B overtrained. Using the discussed methods of stopping training can lead to situations like this, so these criteria might not be satisfactory.
8.4 Some new criteria for stopping training
We are looking for a criterion to stop training which avoids the illustrated problems. But the SSE is the only measure of performance we have so far. We will therefore use a combination of the two.
As we are interested in the performance of the net on a wider set than just ST, we introduce a reference set SR with input-output elements that are not in ST but represent the area on which we want the network to perform well. Now we define the performance of the net as the SSE on SR. When we start training a net with ST, the SSE on SR is likely to decrease, due to the generalizing capabilities of neural networks. As soon as the network becomes overtrained, the SSE on SR increases.
Now we can use the stopping criteria from subsection 7.2 with the SSE on SR.
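A Python sketch of this reference-set criterion (nowadays usually called early stopping with a validation set; `train_lap` and `sse_on_ref` are hypothetical stand-ins for one lap of training on ST and for computing the SSE on SR):

```python
def train_with_reference(train_lap, sse_on_ref, patience=2, max_laps=10000):
    """Train on ST, but judge performance by the SSE on the reference set
    SR; stop once that SSE has not improved for more than `patience` laps
    and report the best lap, i.e. roughly the minimum of the SR-curve."""
    best, best_lap, bad = float("inf"), 0, 0
    for lap in range(1, max_laps + 1):
        train_lap()               # one lap of training on ST
        ref = sse_on_ref()        # performance measured on SR, not on ST
        if ref < best:
            best, best_lap, bad = ref, lap, 0
        else:
            bad += 1
            if bad > patience:
                break
    return best_lap, best

# Toy SR-SSE curve: it decreases, then rises again as the net overtrains.
ref_curve = iter([5.0, 3.0, 2.0, 1.5, 1.7, 2.0, 2.6, 3.3, 4.1])
print(train_with_reference(lambda: None, lambda: next(ref_curve)))
```

In practice one would also save the weights at the best lap, so that the network can be restored to its state at the minimum of the SR-curve.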
We illustrate this technique in the case of the previous subsection. We define:
SR = {(x1', x1'^2), ..., (xm', xm'^2)}, with values xi' spread over the area on which we want the network to perform

and we calculate the SSE on both ST and SR:
[Figure: The SSE on the training set and the reference set]
Using the old stopping criteria would obviously lead to network B. A stopping criterion that terminates training somewhere close to the minimum of the SSE on SR would lead to network A.
In this case, the overtraining is caused by a bad training set ST: it contains all training pairs on the [0, 1] interval and one quite far from that interval. Training the net on SR would have given a much better result.
What we wanted to show, however, is what happens if we keep training too long on a too limited training set: the net indeed memorizes the entries of the training set, but its performance on the neighbourhood of this training set gets worse after longer training.
8.5 Evaluating the curve fitting results
In the last few sections, we have not been interested in the individual neurons. Instead, we just looked at the entire network and its performance. We did this because we wanted the network to solve the problem. The strong feature of neural networks is that we do not have to divide the problem into subproblems for the individual neurons.
It can be interesting, though, to look back. We will now analyze the role of every neuron in the two trained curve-fitting networks.
We start with the hidden neurons. Their output is the tanh over their activation value:
i_k = tanh(w^h_k x + theta^h_k)
The output neuron takes a linear combination of these tanh-curves:
o = Σ (l=1..5) w^o_l i_l + theta^o = Σ (l=1..5) w^o_l tanh(w^h_l x + theta^h_l) + theta^o
So the network is trying to fit a sum of tanh-curves to the required curve as accurately as possible. We can plot these curves for both the fitted parabola and the sine.
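This decomposition is easy to compute from the trained weights. A Python sketch, where the weight values below are made-up illustration values, not taken from the nets trained in this paper:

```python
import math

# Hypothetical weights of a trained 1-5-1 net (illustration values only).
wh = [4.0, 0.1, -0.2, 0.05, 0.0]   # hidden weights
bh = [-2.0, 0.3, 0.1, -0.4, 0.2]   # hidden biases
wo = [1.5, 0.2, 0.1, -0.3, 0.4]    # output weights
bo = 0.6                           # output bias

def component(k, x):
    """The k-th tanh-curve as seen by the output: wo_k * tanh(wh_k x + bh_k)."""
    return wo[k] * math.tanh(wh[k] * x + bh[k])

def output(x):
    """The net's output: the sum of the component curves plus the bias."""
    return sum(component(k, x) for k in range(5)) + bo

def swing(k):
    """How much the k-th curve varies over the [0, 1] interval."""
    return abs(component(k, 1.0) - component(k, 0.0))

print([round(swing(k), 3) for k in range(5)])  # one large swing, four near-flat curves
```

Plotting `component(k, x)` over the input interval gives exactly the per-neuron curves discussed below; a near-zero swing marks a neuron whose role the output bias could take over.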
In this case, only one neuron has a non-trivial output; the other four are more or less constant, a role theta^o could have fulfilled easily. This leads us to assume that the parabola could have been fitted by a 1-1-1 network.
The sine function is more complex: fitting a non-monotonic function with monotonic ones obviously takes more functions. One neuron has a strongly increasing function as output. Because of the symmetry of the sine, we would expect another neuron to have an equally decreasing output function. It appears that this task has been divided between two other neurons: they both have a decreasing output function and they would probably add up to the symmetric function we expected. The two remaining neurons have a more or less constant value.
For the same reasons as we mentioned with the parabola, we might expect that this problem could have been solved by a 1-3-1 or even a 1-2-1 network.
[Figure: The tanh-curves used to fit the parabola]
[Figure: The tanh-curves used to fit the sine]

Analyzing the output of the neurons after training can give a good idea of the minimum size of the network required to solve the problem. And we saw earlier that over-dimensioned networks can lose their generalizing capabilities fast. Analyzing the neurons could lead to removing neurons from the network and improving its generalizing capabilities.
There is another interesting evaluation method: we could replace the hidden-neuron outputs with their Taylor polynomials. This would lead to a polynomial as output function. The question is whether this polynomial would be identical to the Taylor polynomial of the required output function. Since the functions coincide on an interval, the polynomials would probably be identical in their first few coefficients.
This could lead to a theory on how big a network needs to be in order to fit a function with a given Taylor polynomial. But this would take further research.
9 Application III: Time Series Forecasting
In the previous section, we trained a neural network to adopt the input-output relation of two familiar functions. We used training pairs (x, f(x)). And although performance was acceptable after a small number of training runs, this application had one shortcoming: it did not extrapolate at all. Neural networks will in general perform weakly outside their training set, but a smart input-output choice can overcome these limits.
In this section, we will look at time series. A time series is a vector (y_t), with fixed distances between subsequent t_i. Examples of time series are daily stock prices, rainfall in the last twenty years, and actually every quantity measured over discrete time intervals.
Predicting a future value of y, say y_t, is now done based on, for instance, y_{t-1}, ..., y_{t-n}, but not on t. In this application we will take n = 2 and try to train a network to give valuable predictions of y_t.
[Figure: The network used for TSF, with inputs y_{t-1} and y_{t-2} and output y_t]
We take a network similar to the network we used in the previous section, only we now have two input neurons. The hidden neurons still have sigmoidal activation functions and the output neuron has a linear activation function.
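Building training pairs for such a network amounts to sliding a window of length n = 2 over the series. A Python sketch (names hypothetical):

```python
def windows(series, n=2):
    """Turn a time series [y0, y1, ...] into training pairs
    ((y_{t-n}, ..., y_{t-1}), y_t) by sliding a window over it."""
    return [(tuple(series[t - n:t]), series[t]) for t in range(n, len(series))]

series = [0.0, 0.1, 0.2, 0.3, 0.4]
print(windows(series))
# [((0.0, 0.1), 0.2), ((0.1, 0.2), 0.3), ((0.2, 0.3), 0.4)]
```

Each pair feeds the two previous values into the network and expects the next value as output; nothing in the pair refers to t itself.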
9.1 Results
Of course we can look at any function f(x) as a time series: we associate with every entry t_i of a vector t the value f(t_i). We will first try to train the network on the sine function again.
We take t = {0, 0.1, 0.2, ...} and y_t = sin(t). Training this network enables us to predict the sine at time t given the sines at the two previous time steps. But we could also predict the sine at t based on the sines at the second- and third-to-last steps: these two values give us a prediction of the sine at the previous step, and thus we can predict sin(t). Of course, basing a prediction on a prediction is less accurate than a prediction based on two actual sine values. The results of the network are plotted in the figure below.
[Figure: The network's performance after training; predictions 1, 2 and 3 deep vs. the actual value]

Because we trained the net to predict based on previous behaviour, this network will extrapolate, since the sine curve's behaviour is periodic.
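Predicting "deeper" than one step means feeding predictions back in as inputs. A Python sketch (`net` is a hypothetical stand-in taking y_{t-2} and y_{t-1}):

```python
def predict_deep(net, y1, y2, depth):
    """Predict `depth` steps ahead starting from two known values; every
    step past the first feeds a prediction back in as an input."""
    for _ in range(depth):
        y1, y2 = y2, net(y1, y2)
    return y2

# Toy stand-in net that extrapolates linearly: y_t = 2*y_{t-1} - y_{t-2}.
lin = lambda a, b: 2 * b - a
print(predict_deep(lin, 0.0, 1.0, 1))  # 2.0: one step, from two actual values
print(predict_deep(lin, 0.0, 1.0, 3))  # 4.0: deeper steps reuse predictions
```

With a trained net in place of the toy stand-in, each extra level of depth compounds the one-step prediction error, which is why the 2- and 3-deep curves are less accurate.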
[Figure: This network does extrapolate]
Conclusions
In this paper we introduced a technique that in theory should lead to good training behaviour for three-layered neural networks. In order to achieve this behaviour, a number of important choices have to be made:
1. the choice of the learning parameter eta
2. the choice of the training set ST
3. the configuration of the network
4. the choice of a stopping criterion
In application I, we focussed on measuring the SSE and saw that its behaviour was strongly dependent on the choice of eta. A small eta leads to smooth and convergent SSE-curves and therefore to satisfying training results. In our example, eta = 0.1 was small enough, but the maximum value of eta may vary. If an SSE-curve is not convergent or is not smooth, one should always try a smaller eta.
Also, choosing ST is crucial. In application II we saw that with a non-representative training set, a trained network will not generalize well. And if you are not only interested in performance on ST, just getting the SSE small is not enough. The reference-set-SSE method is a good way to reach a compromise: acceptable performance on ST combined with a reasonable performance on its neighbourhood.
Neural networks seem to be a useful technique to learn the relation between data sets in cases where we have no knowledge of what the characteristics of the relation will be. The parameters determining the network's success are not always clear, but there are enough techniques to make these choices.
A Source of the used M-files
A.1 Associative memory: assostore.m, assorecall.m
function w = assostore(S)
% ASSOSTORE(S) has as output the synaptic strength
% matrix w for the associative memory with contents
% the rowvectors of S.
[p,N] = size(S);
w = zeros(N);
S = S*2 - 1;                 % map {0,1} patterns to {-1,1}
for i = 1:N
  for j = 1:N
    w(i,j) = S(1:p,i)' * S(1:p,j);
  end
end
w = w/N;
function s = assorecall(sigma,w)
% ASSORECALL(sigma,w) returns the closest contents of
% memory w, stored by ASSOSTORE.
[N,N] = size(w);
s = zeros(1,N);
sigma = sigma*2 - 1;         % map the {0,1} input to {-1,1}
s = w*sigma';
s = sign(s);
s = (s'+1)/2;                % map the result back to {0,1}
A.2 An example session
M A T L A B
(c) Copyright The MathWorks, Inc.
All Rights Reserved

>> S = [...];                 % two stored patterns, as row vectors of 0s and 1s
S =
   [the two stored patterns]
>> w = assostore(S);
>> assorecall([...], w)       % recall from a corrupted input vector
ans =
   [the stored pattern closest to the input]
>> assorecall([...], w)
ans =
   [the stored pattern closest to the input]
A.3 BPN: train221.m
function [Wh,Wo,E] = train221(Wh,Wo, x, y, eta)
% train221 trains a 2-2-1 neural net with sigmoidal
% activation functions. It updates the weights Wh
% and Wo for input x and expected output y; eta is
% the learning parameter.
% Returns the updated matrices and the error E
%
% Usage: [Wh,Wo,E] = train221(Wh,Wo, x, y, eta)

%% Computing the network's output %%
hi = Wh*[x;1];
i = tanh(hi);
ho = Wo*[i;1];
o = tanh(ho);
E = y - o;

%% Back Propagation %%
% Computing deltas
deltao = (1 - o^2)*E;
deltah1 = (1 - i(1)^2)*deltao*Wo(1);
deltah2 = (1 - i(2)^2)*deltao*Wo(2);

% Updating output-layer weights
Wo(1) = Wo(1) + eta*deltao*i(1);
Wo(2) = Wo(2) + eta*deltao*i(2);
Wo(3) = Wo(3) + eta*deltao;

% Updating hidden-layer weights
Wh(1,1) = Wh(1,1) + eta*deltah1*x(1);
Wh(1,2) = Wh(1,2) + eta*deltah1*x(2);
Wh(1,3) = Wh(1,3) + eta*deltah1;
Wh(2,1) = Wh(2,1) + eta*deltah2*x(1);
Wh(2,2) = Wh(2,2) + eta*deltah2*x(2);
Wh(2,3) = Wh(2,3) + eta*deltah2;
Bibliography

[Freeman] James A. Freeman and David M. Skapura, Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley, 1991.

[Müller] B. Müller, J. Reinhardt, M.T. Strickland, Neural Networks: An Introduction. Berlin: Springer Verlag, 1995.

[Nørgaard] Magnus Nørgaard, The NNSYSID Toolbox. http://kalman.iau.dtu.dk/Projects/proj/nnsysid.html