
Artificial Neural Networks

Javier Béjar

LSI - FIB

Term 2012/2013


Outline

1 Artificial Neural Networks: Introduction

2 Single Layer Neural Networks

3 Multiple Layer Neural Networks

4 Radial Basis Functions Networks

5 Application



Artificial Neural Networks: Introduction

The brain and the neurons

Neurons are the building blocks of the brain

Their interconnectivity forms the programming that allows us to solve all our everyday tasks

They are able to perform parallel and fault tolerant computation

Theoretical models of how the neurons in the brain work and how they learn have been developed since the beginning of Artificial Intelligence

Most of these models are quite simple (yet powerful) and bear only a slim resemblance to real brain neurons


Artificial Neural Networks: Introduction

A neuron model

The task that a single neuron performs is very simple

Given a set of inputs, compute an output when their combined value exceeds a threshold

[Figure: neuron model with input links, weights, bias weight, input function, activation function, output and output links]


Artificial Neural Networks: Introduction

A neuron model

A neuron has a bias input (a0) with value 1

The weights of the neuron (wij) are the parameters of the model

The input is computed as a weighted sum of the inputs

The activation function g(x) determines the output and can be different functions:

Threshold function (perceptron)
Logistic function (sigmoid perceptron)

The output aj is obtained by applying the activation function to the input:

a_j = g\left(\sum_{i=0}^{n} w_{ij}\, a_i\right) = h_\alpha(E)
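
A minimal sketch of this computation in Python (the helper name and the example weights are illustrative, not taken from the slides):

```python
import numpy as np

def neuron_output(weights, inputs, g):
    """Compute a_j = g(sum_i w_ij * a_i), where a_0 = 1 is the bias input."""
    a = np.concatenate(([1.0], inputs))   # prepend the bias input a_0 = 1
    return g(np.dot(weights, a))

# Two possible activation functions g(x)
threshold = lambda x: 1.0 if x >= 0 else 0.0       # perceptron
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))       # sigmoid perceptron

# weights = (bias weight w_0, w_1, w_2); inputs = (a_1, a_2)
print(neuron_output(np.array([-1.5, 1.0, 1.0]), np.array([1.0, 1.0]), threshold))  # 1.0
print(neuron_output(np.array([-1.5, 1.0, 1.0]), np.array([1.0, 0.0]), sigmoid))    # ~0.38
```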


Artificial Neural Networks: Introduction

Organization of neurons/Networks

Usually neurons are interconnected forming networks; there are basically two architectures:

Feed forward networks: neurons are connected only in one direction (DAG)
Recurrent networks: outputs can be connected back to the inputs

Feed forward networks are organized in layers, one connected to the other

Single layer neural networks (perceptron networks): input layer, output layer
Multiple layer neural networks: input layer, hidden layers, output layer


Artificial Neural Networks: Introduction

Neurons as logic gates

Initial research in ANN defined neurons as functions capable of emulating logic gates (threshold logic units, TLU [McCulloch & Pitts])

Inputs xi ∈ {0, 1}, weights wi ∈ {+1, −1}, threshold w0 ∈ R; the activation function is the threshold function g(x) = 1 if x ≥ w0, 0 otherwise

Sets of neurons can compute boolean functions by composing TLUs that compute the OR, AND and NOT functions

A Turing machine can be emulated using these networks


Artificial Neural Networks: Introduction

Neurons as logic gates

(OR ⇒ w0 = 1, w1 = 1, w2 = 1)

x1  x2  x1 OR x2  g(x)
0   0   0         g(0) = 0
0   1   1         g(1) = 1
1   0   1         g(1) = 1
1   1   1         g(2) = 1

(AND ⇒ w0 = 2, w1 = 1, w2 = 1)

x1  x2  x1 AND x2  g(x)
0   0   0          g(0) = 0
0   1   0          g(1) = 0
1   0   0          g(1) = 0
1   1   1          g(2) = 1

(NOT ⇒ w0 = 0, w1 = −1)

x1  NOT x1  g(x)
0   1       g(0) = 1
1   0       g(−1) = 0
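
A short check of these three units, assuming the TLU convention of the previous slide (output 1 when the weighted sum reaches the threshold w0); the function names are mine:

```python
def tlu(threshold, weights, inputs):
    """Threshold logic unit: output 1 if sum_i w_i * x_i >= threshold (w_0)."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

OR  = lambda x1, x2: tlu(1, [1, 1], [x1, x2])
AND = lambda x1, x2: tlu(2, [1, 1], [x1, x2])
NOT = lambda x1:     tlu(0, [-1], [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, OR(x1, x2), AND(x1, x2))   # reproduces the truth tables above
print(NOT(0), NOT(1))                            # 1 0
```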


Single Layer Neural Networks

The perceptron

The perceptron extends the model so that the inputs and the weights can be real values

A single unit perceptron can be used for classification when we have two classes (several units can be used for more classes)

The model that we obtain is a linear discriminant function that divides the examples into two classes

We can use different activation functions; commonly the threshold function sign(x) is used:

\mathrm{sign}(x) = \begin{cases} 1 & x \ge 0 \\ -1 & x < 0 \end{cases}


Single Layer Neural Networks

The perceptron

Perceptron unit for two inputs

[Figure: perceptron unit with two inputs and a bias input fixed to +1]


Single Layer Neural Networks

The perceptron

The hypothesis obtained is (for two inputs, two dimensions):

w0 + w1x1 + w2x2 = 0

That is the same as:

x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{w_0}{w_2}

where the weights of the inputs determine the slope and the bias weight the offset

Our goal is to learn the adequate weights given a sample


Single Layer Neural Networks

The perceptron model


Single Layer Neural Networks

The perceptron model

In order to learn in this classification problem we have to minimize the loss function L0/1(y, ŷ) (number of errors)

In order to minimize this function we have to move the linear discriminant:

If we misclassify an example of class +1 we have to change the weights so it moves towards the negative side
If we misclassify an example of class −1 we have to change the weights so it moves towards the positive side
Otherwise, the weights are correct for the example

The change of the weights will be proportional to the values of the input


Single Layer Neural Networks

The perceptron learning rule

The algorithm used to perform the updating of the linear discriminant function is based on gradient descent (hill climbing algorithm)

For each misclassified example we obtain a direction of movement

We move a step in that direction until convergence (no improvement)

The magnitude of the movement is controlled by a parameter (α), called the learning rate

w(t + 1) = w(t) + α · y · x

This algorithm converges if the examples are linearly separable and the value of α is sufficiently small


Single Layer Neural Networks

The perceptron learning rule

Algorithm: Gradient Descent Thresholded Perceptron

Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set, with binary labels yi ∈ {+1, −1})
Output: w (vector of weights)

t ← 0
w(t) ← random weights
repeat
    foreach xi ∈ E do
        if g(xi) ≠ yi then
            ∆w(t) = α · yi · xi
            w(t + 1) = w(t) + ∆w(t)
            t++
        end
    end
until no change in w

For the bias weight the value of x is always 1
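
A runnable sketch of this rule (the function name and the stopping limit are mine; the data below is the OR function with inputs in {−1, +1}, as in the example slides that follow):

```python
import numpy as np

def train_perceptron(X, y, alpha=0.5, max_epochs=100):
    """Perceptron learning rule: w <- w + alpha * y_i * x_i on each misclassified example."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # bias input is always 1
    w = np.zeros(Xb.shape[1])                        # (random values also work)
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(Xb, y):
            if np.sign(xi @ w) != yi:                # g(x_i) != y_i
                w += alpha * yi * xi
                changed = True
        if not changed:                              # convergence: no change in w
            break
    return w

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
print(train_perceptron(X, y))
```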


Single Layer Neural Networks

The perceptron learning rule - Example

[Figure: one-dimensional training examples on a number line at −3, −1, 0, 2, 3]

In this simple example we have a perceptron with one input (two weights, w1 and w0, the bias weight). The initial weights are w1 = 2, w0 = −1

Only the example x = 0 is misclassified ((0; 1) · (2; −1) = −1); assuming α = 0.5, the first iteration of the perceptron learning rule obtains:

w(2) = w(1) + α · y · x = (2; −1) + 0.5 (0; 1) = (2; −0.5)

And it is still misclassified, so the next iteration:

w(3) = w(2) + α · y · x = (2; −0.5) + 0.5 (0; 1) = (2; 0)

And now all examples are classified correctly


Single Layer Neural Networks

The perceptron learning rule - Example - OR function

[Figure: the four examples e1, e2, e3, e4 of the OR function at the corners (±1, ±1) of the plane]

In this example we have a set of examples that represents the OR function; we have three weights initialized to w2 = 1, w1 = 1, w0 = −0.5, which define the separator x2 + x1 − 0.5 = 0.

The example e1 is classified correctly: h(e1) = −1 − 1 − 0.5 = negative


Single Layer Neural Networks

The perceptron learning rule - Example - OR function

[Figure: the same examples with the updated separator]

The example e2 is classified incorrectly: h(e2) = −1 + 1 − 0.5 = negative.

We apply the perceptron rule with α = 0.5:

w(2) = w(1) + α · y · x = (1; 1; −0.5) + 0.5 · 1 · (−1; 1; 1) = (0.5; 1.5; 0)

That defines the separator 0.5x2 + 1.5x1 = 0.


Single Layer Neural Networks

The perceptron learning rule - Example - OR function

[Figure: the same examples with the final separator]

The example e3 is classified correctly: h(e3) = 0.5 + 1.5 = positive.

The example e4 is classified incorrectly: h(e4) = 0.5 − 1.5 = negative.

We apply the perceptron rule with α = 0.5:

w(3) = w(2) + α · y · x = (0.5; 1.5; 0) + 0.5 · 1 · (1; −1; 1) = (1; 1; 0.5)

That defines the separator x2 + x1 + 0.5 = 0, which classifies all the examples correctly.


Single Layer Neural Networks

The delta rule

The main problem of this update rule is that it may not converge if the examples are not linearly separable

Instead of using the misclassification error, we can obtain convergence guarantees if we minimize the squared error loss function of an unthresholded perceptron unit

In this case we want to minimize the squared distance from the misclassified instances to the linear discriminant function that the perceptron weights define


Single Layer Neural Networks

The delta rule


Single Layer Neural Networks

The delta rule

We want to minimize:

E_{emp}(h_\alpha) = E(\vec{w}) = \frac{1}{2} \sum_{x \in E} (y_x - \hat{y}_x)^2

where y_x is the label of the example, ŷ_x is the output obtained by the unthresholded perceptron, and w are the weights

In order to obtain the update for the gradient descent algorithm we have to obtain the derivative of the error with respect to the parameters, ∇E(w)

The update rule will be:

\vec{w} \leftarrow \vec{w} + \Delta\vec{w}

where \Delta\vec{w} = -\alpha \nabla E(\vec{w}) (steepest descent)


Single Layer Neural Networks

Derivation of the delta rule

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{x \in E} (y_x - \hat{y}_x)^2

= \frac{1}{2} \sum_{x \in E} \frac{\partial}{\partial w_i} (y_x - \hat{y}_x)^2

= \frac{1}{2} \sum_{x \in E} 2\,(y_x - \hat{y}_x)\, \frac{\partial}{\partial w_i} (y_x - \hat{y}_x) \qquad \text{(chain rule)}

= \sum_{x \in E} (y_x - \hat{y}_x)\, \frac{\partial}{\partial w_i} (y_x - \vec{w} \cdot \vec{x})

= \sum_{x \in E} (y_x - \hat{y}_x)(-x_i)


Single Layer Neural Networks

Gradient Descent Linear Perceptron

Now the update rule:

\Delta w_i = \alpha \sum_{x \in E} (y_x - \hat{y}_x)\, x_i

The space of parameters for linear units forms a hyperparaboloid surface (quadratic function); this means that a global minimum exists

The parameter α has to be small enough so we do not miss the minimum

Usually this parameter α is decreased with the iterations (decay)


Single Layer Neural Networks

Gradient Descent Linear Perceptron

Algorithm: Gradient Descent Linear Perceptron

Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set, with binary labels yi ∈ {+1, −1})
Output: w (vector of weights)

t ← 0
w(t) ← random weights
repeat
    ∆w(t) = α · Σ_{x∈E} (y_x − ŷ_x) · x
    w(t + 1) = w(t) + ∆w(t)
    t++
until no change in w
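
A minimal sketch of this batch update for an unthresholded linear unit (the function name, the fixed number of epochs and the learning rate are illustrative choices):

```python
import numpy as np

def train_linear_unit(X, y, alpha=0.05, epochs=200):
    """Delta rule / batch gradient descent on the squared error of a linear unit."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append the bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        y_hat = Xb @ w                               # unthresholded outputs
        w += alpha * Xb.T @ (y - y_hat)              # Delta w_i = alpha * sum_x (y_x - y^_x) x_i
    return w

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1], dtype=float)
w = train_linear_unit(X, y)
print(w, np.sign(X @ w[:2] + w[2]))                  # thresholding the output recovers the classes
```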


Single Layer Neural Networks

The perceptron rule vs the delta rule

The perceptron rule

Is based on the error of the thresholded output
Converges after a finite number of iterations to a linear discriminant function if the examples are linearly separable

The delta rule

Is based on the error of the unthresholded output
Converges asymptotically to a linear discriminant function (may require unbounded time) regardless of whether the examples are linearly separable or not


Single Layer Neural Networks

Limitations of linear perceptrons

With linear perceptrons we can only classify correctly linearly separable problems

The hypothesis space is not powerful enough for real problems

Example, the XOR function:

[Figure: the four examples in the plane; no single linear separator classifies them correctly]

x1  x2  g(x)
0   0   1
0   1   0
1   0   0
1   1   1


Multiple Layer Neural Networks

Multilayer Perceptron

Perceptron units can be organized in layers forming feed forward networks

Each layer performs a transformation that is passed to the next layer as input

The combination of linear units only allows learning linear functions

To learn non-linear functions we have to change the threshold function

Usually the sigmoid (logistic) function is used:

\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}

[Figure: plot of the sigmoid function, increasing from 0 to 1]


Multiple Layer Neural Networks

Sigmoid units

The sigmoid function allows for soft thresholding and for performing nonlinear regression

The combination of sigmoid functions allows approximating highly nonlinear functions

It has the advantage of being differentiable (unlike the sign function)

\frac{\partial\, \mathrm{sigmoid}(x)}{\partial x} = \mathrm{sigmoid}(x)\,(1 - \mathrm{sigmoid}(x))
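
A quick numerical sanity check of this identity (the test point x and the finite-difference step h are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigmoid(x) * (1 - sigmoid(x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central finite difference
print(sigmoid_grad(x), numeric)                          # the two values agree closely
```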


Multiple Layer Neural Networks

Function approximation

The use of networks with several layers allows approximating any function

Any boolean function can be represented as a two layer feed forward network (including XOR) (the number of hidden units grows exponentially with the number of inputs)
Any continuous function can be approximated by a two layer network (sigmoid hidden units, linear output units)
Any arbitrary function can be approximated by a three layer network (sigmoid hidden units, linear output units)

The architecture of the network defines the functions that can be approximated (hypothesis space)

But it is difficult to characterize, for a particular network structure, what functions can be represented and what functions cannot


Multiple Layer Neural Networks

Hypothesis Space - MLP Linear units

[Figure: decision regions representable by a perceptron, a 2-layer network and a 3-layer network]


Multiple Layer Neural Networks

Learning Multilayer Networks

In the case of single layer networks the parameters to learn are the weights of only one layer

In the multilayer case we have a set of parameters for each layer, and each layer is fully connected to the next layer

For single layer networks, when we have multiple outputs we can learn each output separately

In the case of multilayer networks the different outputs are interconnected

[Figure: a multilayer network with inputs X1 and X2, an input layer, a hidden layer and an output layer]


Multiple Layer Neural Networks

Backpropagation - Intuitively

The error of the single layer perceptron relates directly the transformation of the input into the output

In the case of multiple layers each layer has its own error

The error of the output layer is directly the error computed from the true values

The error for the hidden layers is more difficult to define

The idea is to use the error of the next layer to influence the weights of the previous layer

We are propagating the output error backwards, hence the name backpropagation


Multiple Layer Neural Networks

Backpropagation - Intuitively

[Figure: the output error is propagated backwards layer by layer, updating the weights of the output and hidden layers]


Multiple Layer Neural Networks

Definition of the error

We are learning multioutput networks; if we define the error as the squared error loss:

E(\vec{w}) = \frac{1}{2} \sum_{x \in E} \sum_{k \in outputs} (y_{xk} - \hat{y}_{xk})^2

We can define the gradient of the error as:

\frac{\partial}{\partial w} E(\vec{w}) = \frac{\partial}{\partial w} \sum_{k \in outputs} (y_{xk} - \hat{y}_{xk})^2 = \sum_{k \in outputs} \frac{\partial}{\partial w} (y_{xk} - \hat{y}_{xk})^2

That means that the total error can be expressed as the sum of the individual errors of the output units, so we can update each output unit independently


Multiple Layer Neural Networks

Backpropagation - Update rules

Generalized update rules can be derived from the gradient of the error, similarly to the delta rule for single layer networks

The update rules are different depending on whether we are in the output layer or in the hidden layers


Multiple Layer Neural Networks

Backpropagation - Output Layer

For the output layer, for each node k, with ŷ_xk the output given by the node using the sigmoid function, and given that the derivative of the output is ŷ_xk (1 − ŷ_xk), the update for weight j is:

\Delta_k = \nabla \hat{y}_{xk} \times Error_k = \hat{y}_{xk}(1 - \hat{y}_{xk}) \times (y_{xk} - \hat{y}_{xk})

And the update rule is:

w_{jk} \leftarrow w_{jk} + \alpha \times x_k \times \Delta_k

in this case x_k is the input received from the previous layer (not the examples)


Multiple Layer Neural Networks

Backpropagation - Hidden Layers

For each unit of the hidden layers, we are going to update the weights using the error from the next layer

This error is computed as the weighted average of the updates of the next layer (the composed error)

\Delta_h = \nabla \hat{y}_{xh} \times Error_h = \hat{y}_{xh}(1 - \hat{y}_{xh}) \times \sum_{k \in \text{next layer}} w_{hk} \Delta_k

And the update rule is:

w_{jh} \leftarrow w_{jh} + \alpha \times x_h \times \Delta_h


Multiple Layer Neural Networks

Backpropagation - Algorithm

The backpropagation algorithm works in two steps:

1 Propagate the examples through the network to obtain the output (forward propagation)

2 Propagate the output error layer by layer, updating the weights of the neurons (back propagation)


Multiple Layer Neural Networks

Backpropagation - Algorithm

Algorithm: Backpropagation

Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set)

foreach weight wij do wij ← small random number
repeat
    foreach x ∈ E do
        foreach input node i in the input layer do ai ← xi
        for l ← 2 to L do
            foreach node j in layer l do
                inj ← Σi wij ai
                aj ← g(inj)
        foreach node j in the output layer do ∆[j] ← ∇g(inj) × (yj − aj)
        for l ← L − 1 to 1 do
            foreach node i in layer l do ∆[i] ← ∇g(ini) × Σj wij ∆[j]
        foreach weight wij in the network do
            wij ← wij + α × ai × ∆[j]
until termination condition holds
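
A compact sketch of these two steps for one hidden layer of sigmoid units; the layer sizes, hyperparameters and the XOR data below are illustrative choices, not the slides':

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, Y, n_hidden=4, alpha=0.5, epochs=5000, seed=0):
    """Backpropagation for a 2-layer network (sigmoid hidden and output units)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    W1 = rng.normal(0, 0.5, (n_in + 1, n_hidden))    # +1 row for the bias weight
    W2 = rng.normal(0, 0.5, (n_hidden + 1, n_out))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # forward propagation
            a0 = np.append(x, 1.0)                    # input layer (+ bias)
            a1 = np.append(sigmoid(a0 @ W1), 1.0)     # hidden layer (+ bias)
            a2 = sigmoid(a1 @ W2)                     # output layer
            # back propagation of the error
            d2 = a2 * (1 - a2) * (y - a2)             # Delta_k for output units
            d1 = a1 * (1 - a1) * (W2 @ d2)            # Delta_h for hidden units (incl. bias slot)
            W2 += alpha * np.outer(a1, d2)
            W1 += alpha * np.outer(a0, d1[:-1])       # bias slot has no incoming weights
    return W1, W2

# XOR, which a single layer cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, Y)
for x in X:
    a1 = np.append(sigmoid(np.append(x, 1.0) @ W1), 1.0)
    print(x, sigmoid(a1 @ W2))    # typically close to 0, 1, 1, 0 after training
```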


Multiple Layer Neural Networks

Backpropagation - Convergence

For multilayer neural networks the hypothesis space is the set of all possible weights that the network can have

The surface determined by these parameters can have several local optima

This means that the solution can change depending on the initialization

Also the convergence can be very slow for certain problems


Multiple Layer Neural Networks

Backpropagation - Momentum

One improvement introduced to obtain more rapid convergence is the use of a parameter called momentum

This parameter alters the update of the weights by including a weighted part of the previous update; if ∆(wjk, n − 1) is the previous update:

w_{jk} \leftarrow w_{jk} + \alpha \times x_k \times \Delta_k + \mu \times \Delta(w_{jk}, n-1) \qquad \text{with } 0 \le \mu < 1

This term has the effect of keeping the search moving in the same direction, allowing it to skip local minima and to take longer steps, helping faster convergence
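
A small sketch of the momentum-augmented update (the variable names and the example step values are mine):

```python
import numpy as np

def momentum_update(w, grad_step, prev_update, mu=0.2):
    """w <- w + (alpha * x * Delta) + mu * previous update, with 0 <= mu < 1."""
    update = grad_step + mu * prev_update
    return w + update, update          # keep the update for the next iteration

w = np.array([0.1, -0.3])
prev = np.zeros_like(w)
# two successive gradient steps (already multiplied by alpha, x and Delta)
for step in [np.array([0.05, 0.02]), np.array([0.04, 0.03])]:
    w, prev = momentum_update(w, step, prev)
print(w)
```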


Multiple Layer Neural Networks

Backpropagation - Termination condition

The most obvious condition for terminating the backpropagation algorithm is when the error cannot be reduced any more; usually this leads to overfitting

This overfitting occurs in the last iterations of the algorithm

At this point the weights try to minimize the error by fitting specific characteristics of the data (usually the noise)

A good idea is to limit the number of iterations of the algorithm and perform what is called early termination


Multiple Layer Neural Networks

Other error functions

Other approximations can be used to obtain the update rules for training the weights

Regularization: We put a penalty on the complexity of the model, for example the magnitude of the weights:

E(\vec{w}) = \frac{1}{2} \sum_{x \in E} \sum_{k \in outputs} (y_{xk} - \hat{y}_{xk})^2 + \gamma \sum_{i,j} w_{ij}^2

Cross entropy: We assume that the output is a probabilistic function

E(\vec{w}) = -\sum_{x \in E} \sum_{k \in outputs} \left( y_{xk} \log(\hat{y}_{xk}) + (1 - y_{xk}) \log(1 - \hat{y}_{xk}) \right)
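
Both error functions written out for a small batch of outputs (a sketch; the function names, γ value and example numbers are arbitrary):

```python
import numpy as np

def regularized_squared_error(Y, Y_hat, weights, gamma=0.01):
    """1/2 * sum of squared errors plus gamma * sum of squared weights."""
    return 0.5 * np.sum((Y - Y_hat) ** 2) + gamma * sum(np.sum(W ** 2) for W in weights)

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross entropy, treating the outputs as probabilities."""
    Y_hat = np.clip(Y_hat, eps, 1 - eps)          # avoid log(0)
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

Y = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_hat = np.array([[0.8, 0.1], [0.3, 0.7]])
print(regularized_squared_error(Y, Y_hat, [np.ones((2, 2))]))
print(cross_entropy(Y, Y_hat))
```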


Multiple Layer Neural Networks

Other optimization algorithms

Gradient descent can sometimes be very slow for optimizing the weights; other algorithms exist, like:

Line search: not only the direction is computed, but also the magnitude of the movement
Conjugate gradient: the first step follows the gradient descent; successive line search steps are performed making the components of the gradient zero one by one (selecting conjugate directions)


Radial Basis Functions Networks

Local function approximation

Feedforward networks approximate functions through a global approach using hyperplanes

Functions can also be approximated using a local approach

This is similar to what is done by the K-nearest neighbors algorithm

This approximation can be performed using a set of functions as building blocks (think for example of the Fourier transform)


Radial Basis Functions Networks

Radial Basis Function Networks

Radial Basis Function (RBF) networks are a specific kind of feed forward neural network

Two layers:
One hidden layer that uses a specific set of functions (basis functions)
One output layer that uses linear units

The effect of this hidden layer is to map the input examples to a different space where the problem is probably linearly separable

The second layer computes the separation hyperplane


Radial Basis Functions Networks

Radial Basis Functions Networks

The computation performed by this network is as follows:

The hidden layer

φk(x) = φk(||x − xk ||)

where || · || is a distance function (usually the Euclidean distance)

The output layer

g_i(x) = \sum_{k} w_{ik}\, \phi_k(x)


Radial Basis Functions Networks

Radial Basis Functions Networks

This computation can be expressed in matrix notation as:

y = \Phi \times w

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_n(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_n(x_2) \\
\vdots & \vdots & & \vdots \\
\phi_1(x_n) & \phi_2(x_n) & \cdots & \phi_n(x_n)
\end{pmatrix}
\times
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}

and can be exactly solved as:

w = \Phi^{-1} \times y

Unfortunately, matrix inversion is O(n³)
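
A sketch of this exact interpolation using Gaussian basis functions centered on the training points themselves (the data, the width σ and the function names are illustrative):

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma=1.0):
    """Phi[i, k] = phi_k(x_i) = exp(-||x_i - c_k||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.sin(X[:, 0])

Phi = rbf_design_matrix(X, X)        # one basis function per example
w = np.linalg.solve(Phi, y)          # w = Phi^{-1} y  (O(n^3), as noted above)
print(Phi @ w - y)                    # residuals ~ 0: training points interpolated exactly
```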


Radial Basis Functions Networks

Radial Basis Functions

To perform this local approximation it is interesting to use functions that are sensitive to the distance to the input example (radial functions)

Several functions have this characteristic, among them:

Gaussian: \phi(x) = \exp\left(-\frac{x^2}{\sigma^2}\right)

Multiquadratic: \phi(x) = (x^2 + c^2)^{1/2}

Inverse multiquadratic: \phi(x) = (x^2 + c^2)^{-1/2}

These functions include local parameters (σ, c) that can be tuned independently for each unit


Radial Basis Functions Networks

Local Prototypes

This locality sensitivity allows obtaining a more understandable interpretation of the network

Each hidden unit is activated when an example is near; other units remain inactive

Each unit can be interpreted as a prototype that describes the behavior of a region of the problem


Radial Basis Functions Networks

Radial Basis Functions Networks

The usual choice for φk(x) is the unnormalized Gaussian function with center µk and variance σk², representing the area of the unit as a hypersphere

\phi_k(x) = \exp\left( -\frac{||x - \mu_k||^2}{2\sigma_k^2} \right)

A covariance matrix can be used to obtain hyperellipsoids arbitrarily oriented, but at the cost of increasing quadratically the number of parameters
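
A short sketch contrasting the hyperspherical unit above with a full-covariance (hyperellipsoidal) variant; the test point, center and Σ used here are just examples:

```python
import numpy as np

def gaussian_unit(x, mu, sigma):
    """phi_k(x) = exp(-||x - mu_k||^2 / (2 sigma_k^2))  -- hyperspherical region."""
    d = x - mu
    return np.exp(-(d @ d) / (2 * sigma ** 2))

def gaussian_unit_full(x, mu, cov):
    """Hyperellipsoidal variant with a full covariance matrix (quadratically more parameters)."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

x = np.array([1.0, 0.5])
mu = np.array([0.0, 0.0])
print(gaussian_unit(x, mu, sigma=1.0))
print(gaussian_unit_full(x, mu, np.array([[2.0, 0.3], [0.3, 0.5]])))
```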


Radial Basis Functions Networks

RBFs vs MLPs

[Figure: comparison of the regions modeled by an RBF network (local) and an MLP (global)]


Radial Basis Functions Networks

Training of RBFs

One possible solution for training RBFs is to modify the update rule of backpropagation to learn the parameters of the RBF

The main problems are:

This is a nonlinear optimization problem, expensive to solve and with many local minima
The large number of parameters
The lack of control of the complexity of the model can result in losing locality (regularization can be used)


Radial Basis Functions Networks

Training of RBFs

One of the key points of the model is that each layer performs a different task

Training can be done by decoupling both layers

Hybrid training strategies can be used:

1 Estimate the parameters of the RBFs unsupervisedly, using the input examples (no labels)

2 Train the linear units using a supervised method


Radial Basis Functions Networks

Training of RBFs

For estimating the parameters of the RBF there are different options:

Use one unit per example (expensive)
Choose k centers to represent random subsets of examples
Use the K-means algorithm¹ and adjust σ²

For estimating the parameters of the linear units we can:

Use the delta rule (it is usually faster)
Apply backpropagation

¹ We will talk about this algorithm in unsupervised learning
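
A sketch of this hybrid strategy using k-means for the centers (step 1, scikit-learn's KMeans) and a least-squares fit of the linear output weights (step 2, used here instead of the delta rule); all other names and the toy data are mine:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rbf_hybrid(X, y, k=5, seed=0):
    # 1) unsupervised step: place the RBF centers with k-means (labels are not used)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    centers = km.cluster_centers_
    sigma = np.mean(np.linalg.norm(X - centers[km.labels_], axis=1)) + 1e-8   # one shared width
    # 2) supervised step: fit the linear output weights on the basis-function activations
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, sigma, w

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])
centers, sigma, w = fit_rbf_hybrid(X, y)
print(centers.ravel(), sigma)
```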


Application

Glass Identification

Glass type identification (CSI application)

9 Attributes (continuous)

Attributes: refractive index, Na oxide percentage, Al oxide percentage, Fe oxide percentage, ...

214 instances

7 classes

Methods: Neural networks

Validation: 10 fold cross validation


Application

Glass Identification: Models

MLP (15 hidden neurons, 900 iterations, learning rate 0.3, momentum 0.2): accuracy 71.2%

RBF network (3 clusters): accuracy 69.2%
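
A sketch of a roughly comparable MLP experiment with scikit-learn; the data-loading path and the "Type" column name are placeholders for the UCI glass data, and the resulting accuracy will not exactly match the slide's numbers:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder path: the UCI "Glass Identification" data (214 instances, 9 attributes)
data = pd.read_csv("glass.csv")
X, y = data.drop(columns=["Type"]).values, data["Type"].values   # "Type" is an assumed column name

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(15,), max_iter=900,        # 15 hidden neurons, 900 iterations
                  solver="sgd", learning_rate_init=0.3, momentum=0.2,
                  random_state=0),
)
scores = cross_val_score(mlp, X, y, cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean())   # 10-fold cross-validation accuracy
```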
