Artificial Neural Networks
Javier Bejar
LSI - FIB
Term 2012/2013
Outline
1 Artificial Neural Networks: Introduction
2 Single Layer Neural Networks
3 Multiple Layer Neural Networks
4 Radial Basis Functions Networks
5 Application
Artificial Neural Networks: Introduction
The brain and the neurons
Neurons are the building blocks of the brain
Their interconnectivity forms the programming that allows us to solve all our everyday tasks
They are able to perform parallel and fault-tolerant computation
Theoretical models of how the neurons in the brain work and how they learn have been developed since the beginning of Artificial Intelligence
Most of these models are really simple (yet powerful) and bear only a slight resemblance to real brain neurons
A neuron model
The task that a single neuron performs is very simple
Given a set of inputs, compute an output when their combined value exceeds a threshold
[Figure: neuron model: input links with weights, a bias weight, an input function, an activation function and the output links]
A neuron model
A neuron has a bias input (a0) with value 1
The weights of the neuron (wij) are the parameters of the model
The input of the unit is computed as the weighted sum of its inputs
The activation function g(x) determines the output; different functions can be used:
Threshold function (perceptron)
Logistic function (sigmoid perceptron)
The output aj is obtained by applying the activation function to the input
aj = g( Σ_{i=0}^{n} wij · ai ) = hα(E)
Organization of neurons/Networks
Usually neurons are interconnected forming networks; there are basically two architectures:
Feed forward networks: neurons are connected only in one direction (DAG)
Recurrent networks: outputs can be connected back to the inputs
Feed forward networks are organized in layers, one connected to the other
Single layer neural networks (perceptron networks): input layer, output layer
Multiple layer neural networks: input layer, hidden layers, output layer
Neurons as logic gates
Initial research in ANN defined neurons as functions capable of emulating logic gates (threshold logic units, TLU, [McCulloch & Pitts])
Inputs xi ∈ {0, 1}, weights wi ∈ {+1, −1}, threshold w0 ∈ R, activation function ⇒ threshold function: g(x) = 1 if x ≥ w0, 0 otherwise
Sets of neurons can compute Boolean functions by composing TLUs that compute the OR, AND and NOT functions
A Turing machine can be emulated using these networks
Neurons as logic gates
OR (w0 = 1, w1 = 1, w2 = 1)
x1 x2 | x1 OR x2 | g(x)
 0  0 |    0     | g(0) = 0
 0  1 |    1     | g(1) = 1
 1  0 |    1     | g(1) = 1
 1  1 |    1     | g(2) = 1

AND (w0 = 2, w1 = 1, w2 = 1)
x1 x2 | x1 AND x2 | g(x)
 0  0 |     0     | g(0) = 0
 0  1 |     0     | g(1) = 0
 1  0 |     0     | g(1) = 0
 1  1 |     1     | g(2) = 1

NOT (w0 = 0, w1 = −1)
x1 | NOT x1 | g(x)
 0 |   1    | g(0) = 1
 1 |   0    | g(−1) = 0
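A minimal sanity check of these weight/threshold choices, written as a short Python sketch (the function name tlu and the test style are illustrative choices, not from the slides):

def tlu(inputs, weights, threshold):
    """Threshold logic unit: output 1 if the weighted sum reaches the threshold w0."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# OR: w0 = 1, w1 = w2 = 1
assert [tlu((x1, x2), (1, 1), 1) for x1 in (0, 1) for x2 in (0, 1)] == [0, 1, 1, 1]
# AND: w0 = 2, w1 = w2 = 1
assert [tlu((x1, x2), (1, 1), 2) for x1 in (0, 1) for x2 in (0, 1)] == [0, 0, 0, 1]
# NOT: w0 = 0, w1 = -1
assert [tlu((x1,), (-1,), 0) for x1 in (0, 1)] == [1, 0]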
Single Layer Neural Networks
The perceptron
The perceptron extends the model by allowing the inputs and the weights to be real values
A single unit perceptron can be used for classification when we have two classes (several units can be used for more classes)
The model that we obtain is a linear discriminant function that divides the examples into two classes
We can use different activation functions; commonly the threshold function sign(x) is used:
sign(x) = { +1 if x ≥ 0; −1 if x < 0 }
The perceptron
Perceptron unit for two inputs
[Figure: perceptron unit with two inputs and a bias input fixed to +1]
The perceptron
The hypothesis obtained is (for two inputs, two dimensions):
w0 + w1x1 + w2x2 = 0
That is the same as:
x2 = −(w1/w2) x1 − (w0/w2)
where the weights of the inputs determine the slope and the weight of the bias determines the offset
Our goal is to learn adequate weights from a given sample
The perceptron model
In order to learn in this classification problem we have to minimize the loss function L0/1(y, ŷ) (the number of errors)
In order to minimize this function we have to move the linear discriminant:
If we misclassify an example of class +1 we have to change the weights so that the discriminant moves towards the negative side
If we misclassify an example of class −1 we have to change the weights so that the discriminant moves towards the positive side
Otherwise, the weights are correct for the example
The change of the weights will be proportional to the values of the input
The perceptron learning rule
The algorithm used to update the linear discriminant function is based on gradient descent (a hill-climbing style algorithm)
For each misclassified example we obtain a direction of movement
We move a step in that direction until convergence (no improvement)
The magnitude of the movement is controlled by a parameter (α), called the learning rate
w(t + 1) = w(t) + α · y · x
This algorithm converges if the examples are linearly separable and the value of α is sufficiently small
The perceptron learning rule
Algorithm: Gradient Descent Thresholded Perceptron
Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set, with binary labels yi ∈ {+1, −1})
Output: w (vector of weights)
t ← 0
w(t) ← random weights
repeat
    foreach xi ∈ E do
        if g(xi) ≠ yi then
            Δw(t) = α · yi · xi
            w(t + 1) = w(t) + Δw(t)
            t ← t + 1
        end
    end
until no change in w
For the bias weight the value of x is always 1
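A possible implementation of this learning rule as a Python sketch (NumPy); the function name, the zero initialization and the toy AND-like data are illustrative choices, not from the slides:

import numpy as np

def perceptron_train(X, y, alpha=0.5, max_epochs=100):
    X = np.hstack([X, np.ones((X.shape[0], 1))])    # append the bias input, always 1
    w = np.zeros(X.shape[1])                        # zeros for reproducibility; random init as in the slides also works
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(X, y):
            if np.sign(xi @ w) != yi:               # g(xi) != yi (np.sign(0) is 0, so it counts as a mistake)
                w += alpha * yi * xi                # w(t+1) = w(t) + alpha * y * x
                changed = True
        if not changed:                             # no change in w
            break
    return w

# Linearly separable toy problem (AND-like labels in {-1, +1})
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w = perceptron_train(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))   # reproduces y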
The perceptron learning rule - Example
[Figure: one-dimensional training examples on a number line at −3, −1, 0, 2 and 3]
In this simple example we have a perceptron with one input (two weights, w1 and w0, the bias weight)
The initial weights are w1 = 2, w0 = −1
Only the example x = 0 is misclassified ((0; 1) · (2; −1) = −1); assuming α = 0.5, the first iteration of the perceptron learning rule obtains:
w(2) = w(1) + α · y · x = (2; −1) + 0.5 (0; 1) = (2; −0.5)
And it is still misclassified, so the next iteration:
w(3) = w(2) + α · y · x = (2; −0.5) + 0.5 (0; 1) = (2; 0)
And now all examples are classified correctly
The perceptron learning rule - Example - OR function
[Figure: the four examples of the OR function (e1 to e4) at the corners (±1, ±1) and the current separator]
In this example we have a set of examples that represents the OR function; we have three weights initialized to w2 = 1, w1 = 1, w0 = −0.5, which define the separator x2 + x1 − 0.5 = 0.
The example e1 is classified correctly: h(e1) = −1 − 1 − 0.5 = negative
The perceptron learning rule - Example - OR function
The example e2 is classified incorrectly: h(e2) = −1 + 1 − 0.5 = negative.
We apply the perceptron rule with α = 0.5:
w(2) = w(1) + α · y · x = (1; 1; −0.5) + 0.5 · 1 · (−1; 1; 1) = (0.5; 1.5; 0)
This defines the separator 0.5 x2 + 1.5 x1 = 0.
The perceptron learning rule - Example - OR function
The example e3 is classified correctly: h(e3) = 0.5 + 1.5 = positive.
The example e4 is classified incorrectly: h(e4) = 0.5 − 1.5 = negative.
We apply the perceptron rule with α = 0.5:
w(3) = w(2) + α · y · x = (0.5; 1.5; 0) + 0.5 · 1 · (1; −1; 1) = (1; 1; 0.5)
This defines the separator x2 + x1 + 0.5 = 0, which classifies all the examples correctly.
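The two updates of this worked example can be reproduced with a few lines of Python (the vector ordering (x2; x1; bias) follows the slides; variable names are illustrative):

import numpy as np

alpha = 0.5
w = np.array([1.0, 1.0, -0.5])          # w2, w1, w0

e2 = np.array([-1.0, 1.0, 1.0])         # x2 = -1, x1 = 1, bias = 1, label +1
w = w + alpha * 1 * e2
print(w)                                 # [0.5 1.5 0. ]

e4 = np.array([1.0, -1.0, 1.0])         # x2 = 1, x1 = -1, bias = 1, label +1
w = w + alpha * 1 * e4
print(w)                                 # [1.  1.  0.5]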
The delta rule
The main problem of this update rule is that it may not converge if the examples are not linearly separable
Instead of using the misclassification error, we can obtain convergence guarantees if we minimize the square error loss function of an unthresholded perceptron unit
In this case we want to minimize the square distance from the misclassified instances to the linear discriminant function defined by the perceptron weights
The delta rule
We want to minimize:
Eemp(hα) = E(w) = (1/2) Σ_{x∈E} (yx − ŷx)²
where yx is the label of example x, ŷx is the output obtained by the unthresholded perceptron, and w are the weights
In order to obtain the update for the gradient descent algorithm we have to compute the derivative of the error with respect to the parameters, ∇E(w)
The update rule will be:
w ← w + Δw
where Δw = −α ∇E(w) (steepest descent)
Derivation of the delta rule
∂E/∂wi = ∂/∂wi [ (1/2) Σ_{x∈E} (yx − ŷx)² ]
       = (1/2) Σ_{x∈E} ∂/∂wi (yx − ŷx)²
       = (1/2) Σ_{x∈E} 2 (yx − ŷx) ∂/∂wi (yx − ŷx)    (chain rule)
       = Σ_{x∈E} (yx − ŷx) ∂/∂wi (yx − w · x)
       = Σ_{x∈E} (yx − ŷx) (−xi)
Gradient Descent Linear Perceptron
Now the update rule:
Δwi = α · Σ_{x∈E} (yx − ŷx) xi
The error surface for linear units is a hyperparaboloid (a quadratic function), which means that a global minimum exists
The parameter α has to be small enough so we do not overshoot the minimum
Usually this parameter α is decreased over the iterations (decay)
Gradient Descent Linear Perceptron
Algorithm: Gradient Descent Linear Perceptron
Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set, with binary labels yi ∈ {+1, −1})
Output: w (vector of weights)
t ← 0
w(t) ← random weights
repeat
    Δw(t) = α · Σ_{x∈E} (yx − ŷx) x
    w(t + 1) = w(t) + Δw(t)
    t ← t + 1
until no change in w
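A Python sketch of this batch delta rule (the learning rate, the number of epochs and the OR-like toy data are illustrative choices, not from the slides):

import numpy as np

def delta_rule_train(X, y, alpha=0.05, epochs=200):
    X = np.hstack([X, np.ones((X.shape[0], 1))])    # bias input fixed to 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = X @ w                               # unthresholded output
        w += alpha * X.T @ (y - y_hat)              # Delta w_i = alpha * sum_x (y - y_hat) x_i
    return w

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., 1.])                     # OR-like targets
w = delta_rule_train(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w)) # thresholding the linear output recovers the classes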
The perceptron rule vs the delta rule
The perceptron rule
Is based on the error of the thresholded output
Converges after a finite number of iterations to a linear discriminant function if the examples are linearly separable
The delta rule
Is based on the error of the unthresholded output
Converges asymptotically to a linear discriminant function (it may require unbounded time) regardless of whether the examples are linearly separable or not
Limitations of linear perceptrons
With linear perceptrons we can only correctly classify linearly separable problems
The hypothesis space is not powerful enough for real problems
Example: the XOR function:
[Figure: the four XOR examples; no single line can separate the two classes]
x1 x2 | x1 XOR x2
 0  0 |    0
 0  1 |    1
 1  0 |    1
 1  1 |    0
Multiple Layer Neural Networks
Multilayer Perceptron
Perceptron units can be organized in layers forming feed forward networks
Each layer performs a transformation that is passed to the next layer as input
The combination of linear units only allows learning linear functions
To learn nonlinear functions we have to change the threshold function
Usually the sigmoid (logistic) function is used:
sigmoid(x) = 1 / (1 + e^(−x))
[Figure: the sigmoid curve, increasing from 0 to 1]
Sigmoid units
The sigmoid function allows for soft thresholding and makes nonlinear regression possible
The combination of sigmoid functions allows us to approximate highly nonlinear functions
It has the advantage of being differentiable (unlike the sign function)
∂sigmoid(x)/∂x = sigmoid(x) (1 − sigmoid(x))
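For reference, a minimal Python version of the sigmoid and its derivative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_prime(0.0))   # 0.5 and 0.25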
Function approximation
The use of networks with several layers allows us to approximate any function
Any Boolean function can be represented as a two-layer feed forward network, including XOR (the number of hidden units grows exponentially with the number of inputs; a hand-built XOR network is sketched below)
Any continuous function can be approximated by a two-layer network (sigmoid hidden units, linear output units)
Any arbitrary function can be approximated by a three-layer network (sigmoid hidden units, linear output units)
The architecture of the network defines the functions that can be approximated (the hypothesis space)
But it is difficult to characterize, for a particular network structure, what functions can be represented and what functions cannot
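As an illustration of the first point, a two-layer network with hand-picked weights that represents XOR (the weights are an illustrative choice, not taken from the slides):

import numpy as np

def step(x):
    return (x >= 0).astype(float)

def xor_net(x1, x2):
    x = np.array([x1, x2], dtype=float)
    h_or  = step(x @ np.array([1., 1.]) - 0.5)   # hidden unit acting as OR
    h_and = step(x @ np.array([1., 1.]) - 1.5)   # hidden unit acting as AND
    return step(1.0 * h_or - 1.0 * h_and - 0.5)  # OR and not AND = XOR

print([int(xor_net(a, b)) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 0]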
Hypothesis Space - MLP Linear units
[Figure: decision regions representable by a perceptron, a 2-layer network and a 3-layer network]
Learning Multilayer Networks
In the case of single layer networks the parameters to learn are the weights of only one layer
In the multilayer case we have a set of parameters for each layer, and each layer is fully connected to the next layer
For single layer networks with multiple outputs we can learn each output separately
In the case of multilayer networks the different outputs are interconnected
[Figure: a network with two inputs X1 and X2, organized into an input layer, a hidden layer and an output layer]
Backpropagation - Intuitively
The error of the single layer perceptron directly links the transformation of the input to the output
In the case of multiple layers each layer has its own error
The error of the output layer is directly the error computed from the true values
The error for the hidden layers is more difficult to define
The idea is to use the error of the next layer to influence the weights of the previous layer
We are propagating the output error backwards, hence the name backpropagation
Backpropagation - Intuitively
[Figure: the error computed at the output layer is propagated backwards through the hidden layer, and the weights of each layer are updated]
Definition of the error
We are learning multioutput networks; if we define the error as the square error loss:
E(w) = (1/2) Σ_{x∈E} Σ_{k∈outputs} (yxk − ŷxk)²
We can define the gradient of the error as:
∂/∂w E(w) = ∂/∂w Σ_{k∈outputs} (yxk − ŷxk)² = Σ_{k∈outputs} ∂/∂w (yxk − ŷxk)²
This means that the total error can be expressed as the sum of the individual errors of the output units, so we can update each output unit independently
Backpropagation - Update rules
Generalized update rules can be derived from the gradient of the error, similarly to the delta rule for single layer networks
The update rules are different depending on whether we are on the output layer or on the hidden layers
Backpropagation - Output Layer
For each node k of the output layer, with ŷxk the output given by the node using the sigmoid function, and given that the derivative of the output is ŷxk(1 − ŷxk), the update term for weight j is:
Δk = ∇ŷxk × Errork = ŷxk (1 − ŷxk) × (yxk − ŷxk)
And the update rule is:
wjk ← wjk + α × xk × Δk
where in this case xk is the input received from the previous layer (not the example attributes)
Backpropagation - Hidden Layers
For each unit of the hidden layers, we update the weights using the error from the next layer
This error is computed as the weighted sum of the update terms of the next layer (the composed error)
Δh = ∇ŷxh × Errorh = ŷxh (1 − ŷxh) × Σ_{k∈next layer} whk Δk
And the update rule is:
wjh ← wjh + α × xh × Δh
Backpropagation - Algorithm
The backpropagation algorithm works in two steps:
1 Propagate the examples through the network to obtain the output (forward propagation)
2 Propagate the output error layer by layer, updating the weights of the neurons (back propagation)
Backpropagation - Algorithm
Algorithm: Backpropagation
Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set)
foreach weight wij do wij ← small random number
repeat
    foreach x ∈ E do
        foreach input node i in the input layer do ai ← xi
        for l ← 2 to L do
            foreach node j in layer l do
                inj ← Σi wij ai
                aj ← g(inj)
        foreach node j in the output layer do Δ[j] ← ∇g(inj) × (yj − aj)
        for l ← L − 1 to 1 do
            foreach node i in layer l do Δ[i] ← ∇g(ini) × Σj wij Δ[j]
        foreach weight wij in the network do wij ← wij + α × ai × Δ[j]
until termination condition holds
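A compact NumPy sketch of this algorithm for a single hidden layer of sigmoid units, trained in batch mode on XOR; the layer sizes, learning rate, number of epochs and random seed are illustrative choices, not values from the slides:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR data
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(0.0, 0.5, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0.0, 0.5, (4, 1)); b2 = np.zeros(1)   # hidden -> output
alpha = 0.5

for _ in range(5000):
    # forward propagation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward propagation of the error (Delta terms as in the slides)
    delta_out = out * (1 - out) * (y - out)
    delta_h = h * (1 - h) * (delta_out @ W2.T)
    # batch weight updates
    W2 += alpha * h.T @ delta_out; b2 += alpha * delta_out.sum(axis=0)
    W1 += alpha * X.T @ delta_h;   b1 += alpha * delta_h.sum(axis=0)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0]; another seed or more epochs may be needed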
Backpropagation - Convergence
For multilayer neural networks the hypothesis space is the set of all possible weights that the network can have
The surface determined by these parameters can have several local optima
This means that the solution can change depending on the initialization
Also the convergence can be very slow for certain problems
Backpropagation - Momentum
One improvement introduced to obtain faster convergence is the use of a parameter called momentum
This parameter alters the update of the weights by including a weighted part of the previous update; if Δ(wjk, n − 1) is the previous update:
wjk ← wjk + α × xk × Δk + μ × Δ(wjk, n − 1), with 0 ≤ μ < 1
This term has the effect of keeping the exploration in the same direction, allowing the search to skip local minima and to take longer steps, which helps convergence
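In code, the momentum update only requires keeping the previous weight change around; a minimal sketch (the helper and its names are hypothetical, not from the slides):

def momentum_update(w, grad_term, prev_delta, alpha=0.1, mu=0.9):
    # grad_term plays the role of x_k * Delta_k in the rule above
    delta = alpha * grad_term + mu * prev_delta
    return w + delta, delta          # return delta so it can be reused at the next step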
Backpropagation - Termination condition
The most obvious condition for terminating the backpropagation algorithm is when the error cannot be reduced any more, but usually this leads to overfitting
This overfitting occurs in the last iterations of the algorithm
At this point the weights try to minimize the error by fitting specific characteristics of the data (usually the noise)
A good idea is to limit the number of iterations of the algorithm and perform what is called early termination
Other error functions
Other error functions can be used to obtain the update rules for training the weights
Regularization: we put a penalty on the complexity of the model, for example the magnitude of the weights:
E(w) = (1/2) Σ_{x∈E} Σ_{k∈outputs} (yxk − ŷxk)² + γ Σ_{i,j} wij²
Cross entropy: we assume that the output is a probabilistic function
E(w) = − Σ_{x∈E} Σ_{k∈outputs} ( yxk log(ŷxk) + (1 − yxk) log(1 − ŷxk) )
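Both error functions are straightforward to compute; an illustrative NumPy sketch (targets in {0, 1}, outputs in (0, 1); names are illustrative choices):

import numpy as np

def squared_error_l2(y, y_hat, weights, gamma=0.01):
    # square error loss plus a penalty on the magnitude of all weight matrices
    return 0.5 * np.sum((y - y_hat) ** 2) + gamma * sum(np.sum(W ** 2) for W in weights)

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))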
Other optimization algorithms
Gradient descent can sometimes be very slow for optimizing the weights; other algorithms exist, such as:
Line search: not only the direction is computed, but also the magnitude of the movement
Conjugate gradient: the first step follows the gradient descent; successive line search steps are performed making the components of the gradient zero one by one (selecting conjugate directions)
Radial Basis Functions Networks
Local function approximation
Feedforward networks approximate functions through a global approach using hyperplanes
Functions can also be approximated using a local approach
This is similar to what is done by the K-nearest neighbors algorithm
This approximation can be performed using a set of functions as building blocks (think, for example, of the Fourier transform)
Radial Basis Function Networks
Radial Basis Function (RBF) networks are a specific kind of feed forward neural network
Two layers:
One hidden layer that uses a specific set of functions (basis functions)
One output layer that uses linear units
The effect of the hidden layer is to map the input examples to a different space where the problem is more likely to be linearly separable
The second layer computes the separation hyperplane
Radial Basis Functions Networks
The computation performed by this network is as follows:
The hidden layer
φk(x) = φk(‖x − xk‖)
where ‖·‖ is a distance function (usually the Euclidean distance)
The output layer
gi(x) = Σk wik φk(x)
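A small Python sketch of this forward computation using Gaussian basis functions (the centers, widths and weights are illustrative values, not from the slides):

import numpy as np

def rbf_forward(x, centers, sigmas, w):
    phi = np.exp(-np.linalg.norm(x - centers, axis=1) ** 2 / (2 * sigmas ** 2))
    return w @ phi                      # linear output unit: sum_k w_k * phi_k(x)

centers = np.array([[0., 0.], [1., 1.]])
sigmas = np.array([0.5, 0.5])
w = np.array([1.0, -1.0])
print(rbf_forward(np.array([0.2, 0.1]), centers, sigmas, w))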
Radial Basis Functions Networks
This computation can be expressed in matrix notation as:
y = Φ × w, with Φij = φj(xi):

⎡ y1 ⎤   ⎡ φ1(x1) φ2(x1) ··· φn(x1) ⎤   ⎡ w1 ⎤
⎢ y2 ⎥ = ⎢ φ1(x2) φ2(x2) ··· φn(x2) ⎥ × ⎢ w2 ⎥
⎢ ⋮  ⎥   ⎢   ⋮      ⋮            ⋮  ⎥   ⎢ ⋮  ⎥
⎣ yn ⎦   ⎣ φ1(xn) φ2(xn) ··· φn(xn) ⎦   ⎣ wn ⎦
and can be exactly solved as:
w = Φ⁻¹ × y
Unfortunately matrix inversion is O(n³)
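An illustrative NumPy check of this exact solution on a tiny dataset (np.linalg.solve is used instead of forming the inverse explicitly, but the cost is still cubic; the data and width are illustrative choices):

import numpy as np

def gaussian(r, sigma=0.5):
    return np.exp(-r ** 2 / sigma ** 2)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., 1.])
Phi = gaussian(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))  # Phi[i, j] = phi_j(x_i)
w = np.linalg.solve(Phi, y)
print(Phi @ w)          # reproduces y exactly (up to rounding)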
Radial Basis Functions
To perform this local approximation it is interesting to use functions that are sensitive to the distance to the input example (radial functions)
Several functions have this characteristic, among them:
Gaussian: φ(x) = exp(−x²/σ²)
Multiquadric: φ(x) = (x² + c²)^(1/2)
Inverse multiquadric: φ(x) = 1 / (x² + c²)^(1/2)
These functions include local parameters (σ, c) that can be tuned independently for each unit
Local Prototypes
This locality sensitivity allows us to obtain a more understandable interpretation of the network
Each hidden unit is activated when an example is near it, while the other units remain inactive
Each unit can be interpreted as a prototype that describes the behavior of a region of the problem
Radial Basis Functions Networks
The usual choice for φk(x) is the unnormalized Gaussian function with center μk and variance σk², representing the area of the unit as a hypersphere:
φk(x) = exp( −‖x − μk‖² / (2σk²) )
A covariance matrix can be used to obtain arbitrarily oriented hyperellipsoids, but at the cost of a quadratic increase in the number of parameters
RBFs vs MLPs
[Figure: RBF network vs MLP]
Training of RBFs
One possible approach to training RBFs is to modify the updating rule of backpropagation to learn the parameters of the RBF
The main problems are:
This is a nonlinear optimization problem, expensive to solve and with many local minima
The large number of parameters
The lack of control of the complexity of the model can result in losing locality (regularization can be used)
Training of RBFs
One of the key points of the model is that each layer performs a different task
Training can be done by decoupling both layers
Hybrid training strategies can be used:
1 Estimate the parameters of the RBFs in an unsupervised way using the input examples (no labels)
2 Train the linear units using a supervised method
Training of RBFs
For estimating the parameters of the RBF there are different options:
Use one unit per example (expensive)
Choose k centers that represent random subsets of examples
Use the K-means algorithm¹ and adjust σ² (a sketch combining both steps is given below)
For estimating the parameters of the linear units we can:
Use the delta rule (it is usually faster)
Apply backpropagation
¹ We will talk about this algorithm in unsupervised learning
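A hedged sketch of this hybrid strategy using scikit-learn's KMeans for the centers and least squares for the output weights (a batch alternative to the delta rule); the width heuristic and all names are illustrative choices, not from the slides:

import numpy as np
from sklearn.cluster import KMeans

def train_rbf(X, y, k=3, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)   # unsupervised step: centers
    centers = km.cluster_centers_
    sigma = np.mean(np.linalg.norm(X - centers[km.labels_], axis=1)) + 1e-8   # simple width heuristic
    Phi = np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
                 / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                      # supervised step: linear weights
    return centers, sigma, w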
Application
Glass Identification
Glass type identification (CSI application)
9 attributes (continuous)
Attributes: refractive index, Na oxide percentage, Al oxide percentage, Fe oxide percentage, . . .
214 instances
7 classes
Methods: Neural networks
Validation: 10 fold cross validation
Glass Identification: Models
MLP (15 neurons, 900 iterations, learning rate 0.3, momentum 0.2): accuracy 71.2%
RBF network (3 clusters): accuracy 69.2%
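For comparison, a hedged scikit-learn sketch of a similar experiment (not the tool used in the slides; it assumes the UCI Glass data is available locally as glass.csv with the class in the last column, and mirrors the quoted MLP parameters):

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("glass.csv")
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(15,), solver="sgd", max_iter=900,
                  learning_rate_init=0.3, momentum=0.2, random_state=0),
)
print(cross_val_score(mlp, X, y, cv=10).mean())   # 10-fold cross-validation accuracy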