
Artificial Neural Networks

Javier Béjar

LSI - FIB

Term 2012/2013


Outline

1 Artificial Neural Networks: Introduction

2 Single Layer Neural Networks

3 Multiple Layer Neural Networks

4 Radial Basis Functions Networks

5 Application



Artificial Neural Networks: Introduction

The brain and the neurons

Neurons are the building blocks of the brain

Their interconnectivity forms the programming that allows us to solve all our everyday tasks

They are able to perform parallel and fault tolerant computation

Theoretical models of how the neurons in the brain work and how they learn have been developed since the beginning of Artificial Intelligence

Most of these models are quite simple (yet powerful) and bear only a slim resemblance to real brain neurons


Artificial Neural Networks: Introduction

A neuron model

The task that a single neuron performs is very simple

Given a set of inputs, compute an output when their combined value exceeds a threshold

[Figure: neuron model with input links, weights, bias weight, input function, activation function, output and output links]


Artificial Neural Networks: Introduction

A neuron model

A neuron has a bias input (a0) with value 1

The weights of the neuron (wij) are the parameters of the model

The input is computed as a weighted sum of the inputs

The activation function g(x) determines the output and can be different functions:

Threshold function (perceptron)
Logistic function (sigmoid perceptron)

The output aj is obtained by applying the activation function to the input:

a_j = g\left(\sum_{i=0}^{n} w_{ij}\, a_i\right) = h_\alpha(E)
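
A minimal sketch of this computation in Python (the helper name and the example weights are illustrative, not taken from the slides):

```python
import numpy as np

def neuron_output(weights, inputs, g):
    """Compute a_j = g(sum_i w_ij * a_i), where a_0 = 1 is the bias input."""
    a = np.concatenate(([1.0], inputs))   # prepend the bias input a_0 = 1
    return g(np.dot(weights, a))

# Two possible activation functions g(x)
threshold = lambda x: 1.0 if x >= 0 else 0.0       # perceptron
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))       # sigmoid perceptron

# weights = (bias weight w_0, w_1, w_2); inputs = (a_1, a_2)
print(neuron_output(np.array([-1.5, 1.0, 1.0]), np.array([1.0, 1.0]), threshold))  # 1.0
print(neuron_output(np.array([-1.5, 1.0, 1.0]), np.array([1.0, 0.0]), sigmoid))    # ~0.38
```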


Artificial Neural Networks: Introduction

Organization of neurons/Networks

Usually neurons are interconnected forming networks; there are basically two architectures:

Feed forward networks: neurons are connected only in one direction (DAG)
Recurrent networks: outputs can be connected back to the inputs

Feed forward networks are organized in layers, one connected to the other

Single layer neural networks (perceptron networks): input layer, output layer
Multiple layer neural networks: input layer, hidden layers, output layer


Artificial Neural Networks: Introduction

Neurons as logic gates

Initial research in ANN defined neurons as functions capable of emulating logic gates (threshold logic units, TLU [McCulloch & Pitts])

Inputs xi ∈ {0, 1}, weights wi ∈ {+1, −1}, threshold w0 ∈ R; the activation function is the threshold function g(x) = 1 if x ≥ w0, 0 otherwise

Sets of neurons can compute boolean functions by composing TLUs that compute the OR, AND and NOT functions

A Turing machine can be emulated using these networks


Artificial Neural Networks: Introduction

Neurons as logic gates

(OR ⇒ w0 = 1, w1 = 1, w2 = 1)

x1  x2  x1 OR x2  g(x)
0   0   0         g(0) = 0
0   1   1         g(1) = 1
1   0   1         g(1) = 1
1   1   1         g(2) = 1

(AND ⇒ w0 = 2, w1 = 1, w2 = 1)

x1  x2  x1 AND x2  g(x)
0   0   0          g(0) = 0
0   1   0          g(1) = 0
1   0   0          g(1) = 0
1   1   1          g(2) = 1

(NOT ⇒ w0 = 0, w1 = −1)

x1  NOT x1  g(x)
0   1       g(0) = 1
1   0       g(−1) = 0
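
A short check of these three units, assuming the TLU convention of the previous slide (output 1 when the weighted sum reaches the threshold w0); the function names are mine:

```python
def tlu(threshold, weights, inputs):
    """Threshold logic unit: output 1 if sum_i w_i * x_i >= threshold (w_0)."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

OR  = lambda x1, x2: tlu(1, [1, 1], [x1, x2])
AND = lambda x1, x2: tlu(2, [1, 1], [x1, x2])
NOT = lambda x1:     tlu(0, [-1], [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, OR(x1, x2), AND(x1, x2))   # reproduces the truth tables above
print(NOT(0), NOT(1))                            # 1 0
```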


Single Layer Neural Networks

The perceptron

The perceptron extends the model so that the inputs and the weights can be real values

A single unit perceptron can be used for classification when we have two classes (several units can be used for more classes)

The model that we obtain is a linear discriminant function that divides the examples into two classes

We can use different activation functions; commonly the threshold function sign(x) is used:

\mathrm{sign}(x) = \begin{cases} 1 & x \ge 0 \\ -1 & x < 0 \end{cases}


Single Layer Neural Networks

The perceptron

Perceptron unit for two inputs

[Figure: perceptron unit with two inputs and a bias input fixed to +1]


Single Layer Neural Networks

The perceptron

The hypothesis obtained is (for two inputs, two dimensions):

w0 + w1x1 + w2x2 = 0

That is the same as:

x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{w_0}{w_2}

where the weights of the inputs determine the slope and the bias weight the offset

Our goal is to learn the adequate weights given a sample


Single Layer Neural Networks

The perceptron model


Single Layer Neural Networks

The perceptron model

In order to learn in this classification problem we have to minimize the loss function L0/1(y, ŷ) (number of errors)

In order to minimize this function we have to move the linear discriminant:

If we misclassify an example of class +1 we have to change the weights so it moves towards the negative side
If we misclassify an example of class −1 we have to change the weights so it moves towards the positive side
Otherwise, the weights are correct for the example

The change of the weights will be proportional to the values of the input


Single Layer Neural Networks

The perceptron learning rule

The algorithm used to perform the updating of the linear discriminant function is based on gradient descent (hill climbing algorithm)

For each misclassified example we obtain a direction of movement

We move a step in that direction until convergence (no improvement)

The magnitude of the movement is controlled by a parameter (α), called the learning rate

w(t + 1) = w(t) + α · y · x

This algorithm converges if the examples are linearly separable and the value of α is sufficiently small


Single Layer Neural Networks

The perceptron learning rule

Algorithm: Gradient Descent Thresholded Perceptron

Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set, with binary labels yi ∈ {+1, −1})
Output: w (vector of weights)

t ← 0
w(t) ← random weights
repeat
    foreach xi ∈ E do
        if g(xi) ≠ yi then
            ∆w(t) = α · yi · xi
            w(t + 1) = w(t) + ∆w(t)
            t++
        end
    end
until no change in w

For the bias weight the value of x is always 1
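
A runnable sketch of this rule (the function name and the stopping limit are mine; the data below is the OR function with inputs in {−1, +1}, as in the example slides that follow):

```python
import numpy as np

def train_perceptron(X, y, alpha=0.5, max_epochs=100):
    """Perceptron learning rule: w <- w + alpha * y_i * x_i on each misclassified example."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # bias input is always 1
    w = np.zeros(Xb.shape[1])                        # (random values also work)
    for _ in range(max_epochs):
        changed = False
        for xi, yi in zip(Xb, y):
            if np.sign(xi @ w) != yi:                # g(x_i) != y_i
                w += alpha * yi * xi
                changed = True
        if not changed:                              # convergence: no change in w
            break
    return w

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
print(train_perceptron(X, y))
```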


Single Layer Neural Networks

The perceptron learning rule - Example

[Figure: one-dimensional training examples on a number line at −3, −1, 0, 2, 3]

In this simple example we have a perceptron with one input (two weights, w1 and w0, the bias weight). The initial weights are w1 = 2, w0 = −1

Only the example x = 0 is misclassified ((0; 1) · (2; −1) = −1); assuming α = 0.5, the first iteration of the perceptron learning rule obtains:

w(2) = w(1) + α · y · x = (2; −1) + 0.5 (0; 1) = (2; −0.5)

And it is still misclassified, so the next iteration:

w(3) = w(2) + α · y · x = (2; −0.5) + 0.5 (0; 1) = (2; 0)

And now all examples are classified correctly


Single Layer Neural Networks

The perceptron learning rule - Example - OR function

[Figure: the four examples e1, e2, e3, e4 of the OR function at the corners (±1, ±1) of the plane]

In this example we have a set of examples that represents the OR function; we have three weights initialized to w2 = 1, w1 = 1, w0 = −0.5, which define the separator x2 + x1 − 0.5 = 0.

The example e1 is classified correctly: h(e1) = −1 − 1 − 0.5 = negative


Single Layer Neural Networks

The perceptron learning rule - Example - OR function

[Figure: the same examples with the updated separator]

The example e2 is classified incorrectly: h(e2) = −1 + 1 − 0.5 = negative.

We apply the perceptron rule with α = 0.5:

w(2) = w(1) + α · y · x = (1; 1; −0.5) + 0.5 · 1 · (−1; 1; 1) = (0.5; 1.5; 0)

That defines the separator 0.5x2 + 1.5x1 = 0.


Single Layer Neural Networks

The perceptron learning rule - Example - OR function

[Figure: the same examples with the final separator]

The example e3 is classified correctly: h(e3) = 0.5 + 1.5 = positive.

The example e4 is classified incorrectly: h(e4) = 0.5 − 1.5 = negative.

We apply the perceptron rule with α = 0.5:

w(3) = w(2) + α · y · x = (0.5; 1.5; 0) + 0.5 · 1 · (1; −1; 1) = (1; 1; 0.5)

That defines the separator x2 + x1 + 0.5 = 0, which classifies all the examples correctly.


Single Layer Neural Networks

The delta rule

The main problem of this update rule is that it may not converge if the examples are not linearly separable

Instead of using the misclassification error, we can obtain convergence guarantees if we minimize the squared error loss function of an unthresholded perceptron unit

In this case we want to minimize the squared distance from the misclassified instances to the linear discriminant function that the perceptron weights define


Single Layer Neural Networks

The delta rule


Single Layer Neural Networks

The delta rule

We want to minimize:

E_{emp}(h_\alpha) = E(\vec{w}) = \frac{1}{2} \sum_{x \in E} (y_x - \hat{y}_x)^2

where y_x is the label of the example, ŷ_x is the output obtained by the unthresholded perceptron, and w are the weights

In order to obtain the update for the gradient descent algorithm we have to obtain the derivative of the error with respect to the parameters, ∇E(w)

The update rule will be:

\vec{w} \leftarrow \vec{w} + \Delta\vec{w}

where \Delta\vec{w} = -\alpha \nabla E(\vec{w}) (steepest descent)


Single Layer Neural Networks

Derivation of the delta rule

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{x \in E} (y_x - \hat{y}_x)^2

= \frac{1}{2} \sum_{x \in E} \frac{\partial}{\partial w_i} (y_x - \hat{y}_x)^2

= \frac{1}{2} \sum_{x \in E} 2\,(y_x - \hat{y}_x)\, \frac{\partial}{\partial w_i} (y_x - \hat{y}_x) \qquad \text{(chain rule)}

= \sum_{x \in E} (y_x - \hat{y}_x)\, \frac{\partial}{\partial w_i} (y_x - \vec{w} \cdot \vec{x})

= \sum_{x \in E} (y_x - \hat{y}_x)(-x_i)


Single Layer Neural Networks

Gradient Descent Linear Perceptron

Now the update rule:

\Delta w_i = \alpha \sum_{x \in E} (y_x - \hat{y}_x)\, x_i

The space of parameters for linear units forms a hyperparaboloid surface (quadratic function); this means that a global minimum exists

The parameter α has to be small enough so we do not miss the minimum

Usually this parameter α is decreased with the iterations (decay)


Single Layer Neural Networks

Gradient Descent Linear Perceptron

Algorithm: Gradient Descent Linear Perceptron

Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set, with binary labels yi ∈ {+1, −1})
Output: w (vector of weights)

t ← 0
w(t) ← random weights
repeat
    ∆w(t) = α · Σ_{x∈E} (y_x − ŷ_x) · x
    w(t + 1) = w(t) + ∆w(t)
    t++
until no change in w
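
A minimal sketch of this batch update for an unthresholded linear unit (the function name, the fixed number of epochs and the learning rate are illustrative choices):

```python
import numpy as np

def train_linear_unit(X, y, alpha=0.05, epochs=200):
    """Delta rule / batch gradient descent on the squared error of a linear unit."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append the bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        y_hat = Xb @ w                               # unthresholded outputs
        w += alpha * Xb.T @ (y - y_hat)              # Delta w_i = alpha * sum_x (y_x - y^_x) x_i
    return w

X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1], dtype=float)
w = train_linear_unit(X, y)
print(w, np.sign(X @ w[:2] + w[2]))                  # thresholding the output recovers the classes
```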


Single Layer Neural Networks

The perceptron rule vs the delta rule

The perceptron rule

Is based on the error of the thresholded output
Converges after a finite number of iterations to a linear discriminant function if the examples are linearly separable

The delta rule

Is based on the error of the unthresholded output
Converges asymptotically to a linear discriminant function (may require unbounded time) regardless of whether the examples are linearly separable or not


Single Layer Neural Networks

Limitations of linear perceptrons

With linear perceptrons we can only classify correctly linearly separable problems

The hypothesis space is not powerful enough for real problems

Example, the XOR function:

[Figure: the four examples in the plane; no single linear separator classifies them correctly]

x1  x2  g(x)
0   0   1
0   1   0
1   0   0
1   1   1


Multiple Layer Neural Networks

Multilayer Perceptron

Perceptron units can be organized in layers forming feed forward networks

Each layer performs a transformation that is passed to the next layer as input

The combination of linear units only allows learning linear functions

To learn non-linear functions we have to change the threshold function

Usually the sigmoid (logistic) function is used:

\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}

[Figure: plot of the sigmoid function, increasing from 0 to 1]


Multiple Layer Neural Networks

Sigmoid units

The sigmoid function allows for soft thresholding and for performing nonlinear regression

The combination of sigmoid functions allows approximating highly nonlinear functions

It has the advantage of being differentiable (unlike the sign function)

\frac{\partial\, \mathrm{sigmoid}(x)}{\partial x} = \mathrm{sigmoid}(x)\,(1 - \mathrm{sigmoid}(x))
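
A quick numerical sanity check of this identity (the test point x and the finite-difference step h are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigmoid(x) * (1 - sigmoid(x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central finite difference
print(sigmoid_grad(x), numeric)                          # the two values agree closely
```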


Multiple Layer Neural Networks

Function approximation

The use of networks with several layers allows approximating any function

Any boolean function can be represented as a two layer feed forward network (including XOR) (the number of hidden units grows exponentially with the number of inputs)
Any continuous function can be approximated by a two layer network (sigmoid hidden units, linear output units)
Any arbitrary function can be approximated by a three layer network (sigmoid hidden units, linear output units)

The architecture of the network defines the functions that can be approximated (hypothesis space)

But it is difficult to characterize, for a particular network structure, what functions can be represented and what functions cannot


Multiple Layer Neural Networks

Hypothesis Space - MLP Linear units

[Figure: decision regions representable by a perceptron, a 2-layer network and a 3-layer network]


Multiple Layer Neural Networks

Learning Multilayer Networks

In the case of single layer networks the parameters to learn are the weights of only one layer

In the multilayer case we have a set of parameters for each layer, and each layer is fully connected to the next layer

For single layer networks, when we have multiple outputs we can learn each output separately

In the case of multilayer networks the different outputs are interconnected

[Figure: a multilayer network with inputs X1 and X2, an input layer, a hidden layer and an output layer]


Multiple Layer Neural Networks

Backpropagation - Intuitively

The error of the single layer perceptron relates directly the transformation of the input into the output

In the case of multiple layers each layer has its own error

The error of the output layer is directly the error computed from the true values

The error for the hidden layers is more difficult to define

The idea is to use the error of the next layer to influence the weights of the previous layer

We are propagating the output error backwards, hence the name backpropagation


Multiple Layer Neural Networks

Backpropagation - Intuitively

[Figure: the output error is propagated backwards layer by layer, updating the weights of the output and hidden layers]


Multiple Layer Neural Networks

Definition of the error

We are learning multioutput networks; if we define the error as the squared error loss:

E(\vec{w}) = \frac{1}{2} \sum_{x \in E} \sum_{k \in outputs} (y_{xk} - \hat{y}_{xk})^2

We can define the gradient of the error as:

\frac{\partial}{\partial w} E(\vec{w}) = \frac{\partial}{\partial w} \sum_{k \in outputs} (y_{xk} - \hat{y}_{xk})^2 = \sum_{k \in outputs} \frac{\partial}{\partial w} (y_{xk} - \hat{y}_{xk})^2

That means that the total error can be expressed as the sum of the individual errors of the output units, so we can update each output unit independently


Multiple Layer Neural Networks

Backpropagation - Update rules

Generalized update rules can be derived from the gradient of the error, similarly to the delta rule for single layer networks

The update rules are different depending on whether we are in the output layer or in the hidden layers


Multiple Layer Neural Networks

Backpropagation - Output Layer

For the output layer, for each node k, with ŷ_xk the output given by the node using the sigmoid function, and given that the derivative of the output is ŷ_xk (1 − ŷ_xk), the update for weight j is:

\Delta_k = \nabla \hat{y}_{xk} \times Error_k = \hat{y}_{xk}(1 - \hat{y}_{xk}) \times (y_{xk} - \hat{y}_{xk})

And the update rule is:

w_{jk} \leftarrow w_{jk} + \alpha \times x_k \times \Delta_k

in this case x_k is the input received from the previous layer (not the examples)


Multiple Layer Neural Networks

Backpropagation - Hidden Layers

For each unit of the hidden layers, we are going to update the weights using the error from the next layer

This error is computed as the weighted average of the updates of the next layer (the composed error)

\Delta_h = \nabla \hat{y}_{xh} \times Error_h = \hat{y}_{xh}(1 - \hat{y}_{xh}) \times \sum_{k \in \text{next layer}} w_{hk} \Delta_k

And the update rule is:

w_{jh} \leftarrow w_{jh} + \alpha \times x_h \times \Delta_h


Multiple Layer Neural Networks

Backpropagation - Algorithm

The backpropagation algorithm works in two steps:

1 Propagate the examples through the network to obtain the output (forward propagation)

2 Propagate the output error layer by layer, updating the weights of the neurons (back propagation)


Multiple Layer Neural Networks

Backpropagation - Algorithm

Algorithm: Backpropagation

Input: E = {(x1, y1), (x2, y2), ..., (xn, yn)} (training set)

foreach weight wij do wij ← small random number
repeat
    foreach x ∈ E do
        foreach input node i in the input layer do ai ← xi
        for l ← 2 to L do
            foreach node j in layer l do
                inj ← Σi wij ai
                aj ← g(inj)
        foreach node j in the output layer do ∆[j] ← ∇g(inj) × (yj − aj)
        for l ← L − 1 to 1 do
            foreach node i in layer l do ∆[i] ← ∇g(ini) × Σj wij ∆[j]
        foreach weight wij in the network do
            wij ← wij + α × ai × ∆[j]
until termination condition holds
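
A compact sketch of these two steps for one hidden layer of sigmoid units; the layer sizes, hyperparameters and the XOR data below are illustrative choices, not the slides':

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, Y, n_hidden=4, alpha=0.5, epochs=5000, seed=0):
    """Backpropagation for a 2-layer network (sigmoid hidden and output units)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    W1 = rng.normal(0, 0.5, (n_in + 1, n_hidden))    # +1 row for the bias weight
    W2 = rng.normal(0, 0.5, (n_hidden + 1, n_out))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # forward propagation
            a0 = np.append(x, 1.0)                    # input layer (+ bias)
            a1 = np.append(sigmoid(a0 @ W1), 1.0)     # hidden layer (+ bias)
            a2 = sigmoid(a1 @ W2)                     # output layer
            # back propagation of the error
            d2 = a2 * (1 - a2) * (y - a2)             # Delta_k for output units
            d1 = a1 * (1 - a1) * (W2 @ d2)            # Delta_h for hidden units (incl. bias slot)
            W2 += alpha * np.outer(a1, d2)
            W1 += alpha * np.outer(a0, d1[:-1])       # bias slot has no incoming weights
    return W1, W2

# XOR, which a single layer cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_backprop(X, Y)
for x in X:
    a1 = np.append(sigmoid(np.append(x, 1.0) @ W1), 1.0)
    print(x, sigmoid(a1 @ W2))    # typically close to 0, 1, 1, 0 after training
```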


Multiple Layer Neural Networks

Backpropagation - Convergence

For multilayer neural networks the hypothesis space is the set of all possible weights that the network can have

The surface determined by these parameters can have several local optima

This means that the solution can change depending on the initialization

Also the convergence can be very slow for certain problems


Multiple Layer Neural Networks

Backpropagation - Momentum

One improvement introduced to obtain more rapid convergence is the use of a parameter called momentum

This parameter alters the update of the weights by including a weighted part of the previous update; if ∆(wjk, n − 1) is the previous update:

w_{jk} \leftarrow w_{jk} + \alpha \times x_k \times \Delta_k + \mu \times \Delta(w_{jk}, n-1) \qquad \text{with } 0 \le \mu < 1

This term has the effect of keeping the search moving in the same direction, allowing it to skip local minima and to take longer steps, helping faster convergence
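
A small sketch of the momentum-augmented update (the variable names and the example step values are mine):

```python
import numpy as np

def momentum_update(w, grad_step, prev_update, mu=0.2):
    """w <- w + (alpha * x * Delta) + mu * previous update, with 0 <= mu < 1."""
    update = grad_step + mu * prev_update
    return w + update, update          # keep the update for the next iteration

w = np.array([0.1, -0.3])
prev = np.zeros_like(w)
# two successive gradient steps (already multiplied by alpha, x and Delta)
for step in [np.array([0.05, 0.02]), np.array([0.04, 0.03])]:
    w, prev = momentum_update(w, step, prev)
print(w)
```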


Multiple Layer Neural Networks

Backpropagation - Termination condition

The most obvious condition for terminating the backpropagation algorithm is when the error cannot be reduced any more; usually this leads to overfitting

This overfitting occurs in the last iterations of the algorithm

At this point the weights try to minimize the error by fitting specific characteristics of the data (usually the noise)

A good idea is to limit the number of iterations of the algorithm and perform what is called early termination


Multiple Layer Neural Networks

Other error functions

Other approximations can be used to obtain the update rules for training the weights

Regularization: We put a penalty on the complexity of the model, for example the magnitude of the weights:

E(\vec{w}) = \frac{1}{2} \sum_{x \in E} \sum_{k \in outputs} (y_{xk} - \hat{y}_{xk})^2 + \gamma \sum_{i,j} w_{ij}^2

Cross entropy: We assume that the output is a probabilistic function

E(\vec{w}) = -\sum_{x \in E} \sum_{k \in outputs} \left( y_{xk} \log(\hat{y}_{xk}) + (1 - y_{xk}) \log(1 - \hat{y}_{xk}) \right)
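
Both error functions written out for a small batch of outputs (a sketch; the function names, γ value and example numbers are arbitrary):

```python
import numpy as np

def regularized_squared_error(Y, Y_hat, weights, gamma=0.01):
    """1/2 * sum of squared errors plus gamma * sum of squared weights."""
    return 0.5 * np.sum((Y - Y_hat) ** 2) + gamma * sum(np.sum(W ** 2) for W in weights)

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross entropy, treating the outputs as probabilities."""
    Y_hat = np.clip(Y_hat, eps, 1 - eps)          # avoid log(0)
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

Y = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_hat = np.array([[0.8, 0.1], [0.3, 0.7]])
print(regularized_squared_error(Y, Y_hat, [np.ones((2, 2))]))
print(cross_entropy(Y, Y_hat))
```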


Multiple Layer Neural Networks

Other optimization algorithms

Gradient descent can sometimes be very slow for optimizing the weights; other algorithms exist, like:

Line search: not only the direction is computed, but also the magnitude of the movement
Conjugate gradient: the first step follows the gradient descent; successive line search steps are performed making the components of the gradient zero one by one (selecting conjugate directions)


Radial Basis Functions Networks

Local function approximation

Feedforward networks approximate functions through a global approach using hyperplanes

Functions can also be approximated using a local approach

This is similar to what is done by the K-nearest neighbors algorithm

This approximation can be performed using a set of functions as building blocks (think for example of the Fourier transform)


Radial Basis Functions Networks

Radial Basis Function Networks

Radial Basis Function (RBF) networks are a specific kind of feed forward neural network

Two layers:
One hidden layer that uses a specific set of functions (basis functions)
One output layer that uses linear units

The effect of this hidden layer is to map the input examples to a different space where the problem is probably linearly separable

The second layer computes the separation hyperplane


Radial Basis Functions Networks

Radial Basis Functions Networks

The computation performed by this network is as follows:

The hidden layer

φk(x) = φk(||x − xk ||)

where || · || is a distance function (usually the Euclidean distance)

The output layer

g_i(x) = \sum_{k} w_{ik}\, \phi_k(x)


Radial Basis Functions Networks

Radial Basis Functions Networks

This computation can be expressed in matrix notation as:

y = \Phi \times w

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_n(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_n(x_2) \\
\vdots & \vdots & & \vdots \\
\phi_1(x_n) & \phi_2(x_n) & \cdots & \phi_n(x_n)
\end{pmatrix}
\times
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}

and can be exactly solved as:

w = \Phi^{-1} \times y

Unfortunately, matrix inversion is O(n³)
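
A sketch of this exact interpolation using Gaussian basis functions centered on the training points themselves (the data, the width σ and the function names are illustrative):

```python
import numpy as np

def rbf_design_matrix(X, centers, sigma=1.0):
    """Phi[i, k] = phi_k(x_i) = exp(-||x_i - c_k||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.sin(X[:, 0])

Phi = rbf_design_matrix(X, X)        # one basis function per example
w = np.linalg.solve(Phi, y)          # w = Phi^{-1} y  (O(n^3), as noted above)
print(Phi @ w - y)                    # residuals ~ 0: training points interpolated exactly
```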


Radial Basis Functions Networks

Radial Basis Functions

To perform this local approximation it is interesting to use functions that are sensitive to the distance to the input example (radial functions)

Several functions have this characteristic, among them:

Gaussian: \phi(x) = \exp\left(-\frac{x^2}{\sigma^2}\right)

Multiquadratic: \phi(x) = (x^2 + c^2)^{1/2}

Inverse multiquadratic: \phi(x) = (x^2 + c^2)^{-1/2}

These functions include local parameters (σ, c) that can be tuned independently for each unit


Radial Basis Functions Networks

Local Prototypes

This locality sensitivity allows obtaining a more understandable interpretation of the network

Each hidden unit is activated when an example is near; other units remain inactive

Each unit can be interpreted as a prototype that describes the behavior of a region of the problem


Radial Basis Functions Networks

Radial Basis Functions Networks

The usual choice for φk(x) is the unnormalized Gaussian function with center µk and variance σk², representing the area of the unit as a hypersphere

\phi_k(x) = \exp\left( -\frac{||x - \mu_k||^2}{2\sigma_k^2} \right)

A covariance matrix can be used to obtain hyperellipsoids arbitrarily oriented, but at the cost of increasing quadratically the number of parameters
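
A short sketch contrasting the hyperspherical unit above with a full-covariance (hyperellipsoidal) variant; the test point, center and Σ used here are just examples:

```python
import numpy as np

def gaussian_unit(x, mu, sigma):
    """phi_k(x) = exp(-||x - mu_k||^2 / (2 sigma_k^2))  -- hyperspherical region."""
    d = x - mu
    return np.exp(-(d @ d) / (2 * sigma ** 2))

def gaussian_unit_full(x, mu, cov):
    """Hyperellipsoidal variant with a full covariance matrix (quadratically more parameters)."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

x = np.array([1.0, 0.5])
mu = np.array([0.0, 0.0])
print(gaussian_unit(x, mu, sigma=1.0))
print(gaussian_unit_full(x, mu, np.array([[2.0, 0.3], [0.3, 0.5]])))
```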


Radial Basis Functions Networks

RBFs vs MLPs

[Figure: comparison of the regions modeled by an RBF network (local) and an MLP (global)]


Radial Basis Functions Networks

Training of RBFs

One possible solution for training RBFs is to modify the update rule of backpropagation to learn the parameters of the RBF

The main problems are:

This is a nonlinear optimization problem, expensive to solve and with many local minima
The large number of parameters
The lack of control of the complexity of the model can result in losing locality (regularization can be used)


Radial Basis Functions Networks

Training of RBFs

One of the key points of the model is that each layer performs a different task

Training can be done by decoupling both layers

Hybrid training strategies can be used:

1 Estimate the parameters of the RBFs unsupervisedly, using the input examples (no labels)

2 Train the linear units using a supervised method


Radial Basis Functions Networks

Training of RBFs

For estimating the parameters of the RBF there are different options:

Use one unit per example (expensive)
Choose k centers to represent random subsets of examples
Use the K-means algorithm¹ and adjust σ²

For estimating the parameters of the linear units we can:

Use the delta rule (it is usually faster)
Apply backpropagation

¹ We will talk about this algorithm in unsupervised learning
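
A sketch of this hybrid strategy using k-means for the centers (step 1, scikit-learn's KMeans) and a least-squares fit of the linear output weights (step 2, used here instead of the delta rule); all other names and the toy data are mine:

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rbf_hybrid(X, y, k=5, seed=0):
    # 1) unsupervised step: place the RBF centers with k-means (labels are not used)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    centers = km.cluster_centers_
    sigma = np.mean(np.linalg.norm(X - centers[km.labels_], axis=1)) + 1e-8   # one shared width
    # 2) supervised step: fit the linear output weights on the basis-function activations
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2 * sigma ** 2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centers, sigma, w

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])
centers, sigma, w = fit_rbf_hybrid(X, y)
print(centers.ravel(), sigma)
```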


Application

Glass Identification

Glass type identification (CSI application)

9 Attributes (continuous)

Attributes: refractive index, Na oxide percentage, Al oxide percentage, Fe oxide percentage, ...

214 instances

7 classes

Methods: Neural networks

Validation: 10 fold cross validation


Application

Glass Identification: Models

MLP (15 hidden neurons, 900 iterations, learning rate 0.3, momentum 0.2): accuracy 71.2%

RBF network (3 clusters): accuracy 69.2%
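
A sketch of a roughly comparable MLP experiment with scikit-learn; the data-loading path and the "Type" column name are placeholders for the UCI glass data, and the resulting accuracy will not exactly match the slide's numbers:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder path: the UCI "Glass Identification" data (214 instances, 9 attributes)
data = pd.read_csv("glass.csv")
X, y = data.drop(columns=["Type"]).values, data["Type"].values   # "Type" is an assumed column name

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(15,), max_iter=900,        # 15 hidden neurons, 900 iterations
                  solver="sgd", learning_rate_init=0.3, momentum=0.2,
                  random_state=0),
)
scores = cross_val_score(mlp, X, y, cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean())   # 10-fold cross-validation accuracy
```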
