Kevin Swingler - Lecture 4: Multi-Layer Perceptrons


Dept. of Computing Science & Math
8/3/2019

Lecture 4: Multi-Layer Perceptrons
Kevin Swingler, [email protected]


    Review of Gradient Descent Learning

1. The purpose of neural network training is to minimize the output errors on a particular set of training data by adjusting the network weights w.

2. We define a cost function E(w) that measures how far the current network's output is from the desired one.

3. Partial derivatives of the cost function, ∂E(w)/∂w, tell us which direction we need to move in weight space to reduce the error.

4. The learning rate η specifies the step sizes we take in weight space for each iteration of the weight update equation.

5. We keep stepping through weight space until the errors are small enough.
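As a concrete illustration of steps 1-5, here is a minimal gradient descent loop for a single weight. The cost function E(w) = (w - 2)^2 and the learning rate value are illustrative assumptions chosen for the sketch, not values from the lecture.

```python
# Illustrative 1-D example: minimise E(w) = (w - 2)^2 by gradient descent.
# The cost function and the value of eta are assumptions for demonstration.

def dE_dw(w):
    return 2.0 * (w - 2.0)   # partial derivative of E(w) = (w - 2)^2

eta = 0.1                    # learning rate: step size in weight space
w = 5.0                      # arbitrary starting weight
while abs(dE_dw(w)) > 1e-6:  # keep stepping until the gradient is tiny
    w = w - eta * dE_dw(w)   # move against the gradient to reduce E

print(round(w, 4))           # converges close to the minimum at w = 2
```

The same loop structure carries over to networks: only the dimensionality of w and the form of ∂E/∂w change.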


    Review of Perceptron Training

1. Generate a training pair or pattern x that you wish your network to learn.

2. Set up your network with N input units fully connected to M output units.

3. Initialize the weights, w, at random.

4. Select an appropriate error function E(w) and learning rate η.

5. Apply the weight change Δw = -η ∂E(w)/∂w to each weight w for each training pattern p. One set of updates of all the weights for all the training patterns is called one epoch of training.

6. Repeat step 5 until the network error function is small enough.


    Review of XOR and Linear Separability

Recall that it is not possible to find weights that enable Single Layer Perceptrons to deal with non-linearly separable problems like XOR:

XOR truth table:

in1   in2   out
 0     0     0
 0     1     1
 1     0     1
 1     1     0

The proposed solution was to use a more complex network that is able to generate more complex decision boundaries. That network is the Multi-Layer Perceptron.

[Figure: the XOR patterns plotted in the (I1, I2) plane; no single straight line separates the two classes]


    Multi-Layer Perceptrons (MLPs)

[Figure: a fully connected MLP with input layer i (units X1, X2, X3, ..., Xi), hidden layer j (units O1, ..., Oj), and output layer k (units Y1, Y2, ..., Yk)]

The hidden and output layer activations are:

O_j = f( Σ_i w_ij X_i )

Y_k = f( Σ_j w_jk O_j )
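The two-layer computation above can be sketched directly in Python. The network sizes and weight values below are illustrative assumptions; only the two formulas for O_j and Y_k come from the slide.

```python
import math

# Minimal sketch of the MLP forward pass:
#   hidden layer  O_j = f(sum_i w_ij X_i)
#   output layer  Y_k = f(sum_j w_jk O_j)
# The sizes and weight values are illustrative assumptions.

def f(a):
    return 1.0 / (1.0 + math.exp(-a))   # sigmoid activation

def forward(x, w_ij, w_jk):
    # Hidden layer: O_j = f(sum_i w_ij X_i)
    o = [f(sum(w_ij[i][j] * x[i] for i in range(len(x))))
         for j in range(len(w_ij[0]))]
    # Output layer: Y_k = f(sum_j w_jk O_j)
    y = [f(sum(w_jk[j][k] * o[j] for j in range(len(o))))
         for k in range(len(w_jk[0]))]
    return o, y

# A 2-input, 2-hidden, 1-output network with hand-picked example weights
w_ij = [[0.5, -0.5], [0.5, 0.5]]
w_jk = [[1.0], [-1.0]]
o, y = forward([1.0, 0.0], w_ij, w_jk)
print(len(o), len(y))  # 2 hidden activations, 1 output
```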


Can We Use a Generalized Form of the PLR/Delta Rule to Train the MLP?

Recall the PLR/Delta rule: adjust the neuron weights to reduce the error at the neuron's output:

w_new = w_old + η δ x,  where  δ = y_target - y

Main problem: how do we adjust the weights in the hidden layer, so that they reduce the error in the output layer, when there is no specified target response in the hidden layer?

[Figure: the discrete threshold function (step between -1 and +1) alongside the smooth sigmoid function]

Solution: alter the non-linear Perceptron (discrete threshold) activation function to make it differentiable, and hence help derive a Generalized Delta Rule for MLP training.


    Sigmoid (S-shaped) Function Properties

    Approximates the threshold function

Smoothly differentiable (everywhere), and hence the Delta Rule is applicable

    Positive slope

A popular choice is the logistic sigmoid:

y = f(a) = 1 / (1 + e^(-a))

The derivative of the sigmoid function is:

f'(a) = f(a) (1 - f(a))
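The derivative identity can be checked numerically. The test point and step size below are illustrative assumptions; the finite-difference comparison is just a sanity check, not part of the slides.

```python
import math

# The sigmoid and its derivative f'(a) = f(a) * (1 - f(a)), checked
# against a central finite-difference estimate of the slope.

def f(a):
    return 1.0 / (1.0 + math.exp(-a))

def f_prime(a):
    fa = f(a)
    return fa * (1.0 - fa)

a = 0.7                                      # arbitrary test point
h = 1e-6
numeric = (f(a + h) - f(a - h)) / (2 * h)    # central finite difference
print(abs(f_prime(a) - numeric) < 1e-6)      # the two estimates agree
```

Note that f'(a) peaks at a = 0, where f(0) = 0.5 and f'(0) = 0.25, and approaches zero as the unit saturates towards 0 or 1.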


    Weight Update Rule

Generally, the weight change from any unit j to unit k by gradient descent (i.e. a weight change by a small increment in the negative direction of the gradient),

Δw = -η ∂E/∂w,    w_new = w_old + Δw

is now called the Generalized Delta Rule (GDR, or Backpropagation).

The weight change from the input layer unit i to the hidden layer unit j is:

Δw_ij = η δ_j x_i,  where  δ_j = o_j (1 - o_j) Σ_k δ_k w_jk

The weight change from the hidden layer unit j to the output layer unit k is:

Δw_jk = η δ_k o_j,  where  δ_k = (y_target,k - y_k) y_k (1 - y_k)


    Training of a 2-Layer Feed Forward Network

1. Take the set of training patterns you wish the network to learn.

2. Set up the network with N input units fully connected to M non-linear hidden units via connections with weights w_ij, which in turn are fully connected to P output units via connections with weights w_jk.

3. Generate random initial weights, e.g. from the range [-wt, +wt].

4. Select an appropriate error function E(w_jk) and learning rate η.

5. Apply the weight update Δw_jk = -η ∂E(w_jk)/∂w_jk to each weight w_jk for each training pattern p.

6. Do the same for the hidden layer weights w_ij.

7. Repeat steps 5-6 until the network error function is small enough.
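The procedure above can be sketched end to end on the XOR problem. The network size (2 inputs, 3 hidden units, 1 output), the bias inputs, the learning rate, the epoch count, and the random seed are all illustrative assumptions; the delta formulas are the GDR equations from the previous slide.

```python
import math
import random

# Sketch of steps 1-7: training a small 2-layer network on XOR with the
# Generalized Delta Rule. Sizes, biases, eta, epochs and seed are
# illustrative assumptions, not values from the lecture.
random.seed(0)
patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def f(a):
    return 1.0 / (1.0 + math.exp(-a))

N_IN, N_HID = 3, 3          # 2 inputs + 1 bias input; 3 hidden units
wt = 1.0                    # random initial weights from [-wt, +wt]
w_ij = [[random.uniform(-wt, wt) for _ in range(N_HID)] for _ in range(N_IN)]
w_jk = [random.uniform(-wt, wt) for _ in range(N_HID + 1)]  # + hidden bias
eta = 0.5

def forward(x):
    xb = x + [1.0]          # append bias input
    o = [f(sum(w_ij[i][j] * xb[i] for i in range(N_IN))) for j in range(N_HID)]
    ob = o + [1.0]          # append hidden bias unit
    y = f(sum(w_jk[j] * ob[j] for j in range(N_HID + 1)))
    return xb, o, ob, y

for epoch in range(10000):  # on-line training: update after every pattern
    for x, target in patterns:
        xb, o, ob, y = forward(x)
        delta_k = (target - y) * y * (1.0 - y)              # output delta
        delta_j = [o[j] * (1.0 - o[j]) * delta_k * w_jk[j]  # hidden deltas
                   for j in range(N_HID)]
        for j in range(N_HID + 1):                          # hidden -> output
            w_jk[j] += eta * delta_k * ob[j]
        for i in range(N_IN):                               # input -> hidden
            for j in range(N_HID):
                w_ij[i][j] += eta * delta_j[j] * xb[i]

outs = [forward(x)[3] for x, _ in patterns]
print([round(v, 2) for v in outs])  # with luck, close to XOR: [0, 1, 1, 0]
```

Whether the run reaches the XOR targets depends on the random initial weights; as the later slides discuss, some starting points lead to a local minimum instead.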


    Graphical Representation of GDR

[Figure: total error plotted as a function of a weight w_i, showing a local minimum and the global minimum at the ideal weight]


    Practical Considerations for Learning Rules

There are a number of important issues about training multi-layer neural networks that still need resolving:

1. Do we need to pre-process the training data? If so, how?

2. How do we choose the initial weights from which we start the training?

3. How do we choose an appropriate learning rate η?

4. Should we change the weights after each training pattern, or after the whole set?

    5. Are some activation/transfer functions better than others?

    6. How do we avoid local minima in the error function?

    7. How do we know when we should stop the training?

    8. How many hidden units do we need?

    9. Should we have different learning rates for the different layers?

    We shall now consider each of these issues one by one.


Pre-processing of the Training Data

In principle, we can just use any raw input-output data to train our networks. However, in practice, it often helps the network to learn appropriately if we carry out some pre-processing of the training data before feeding it to the network.

We should make sure that the training data is representative: it should not contain too many examples of one type at the expense of another. On the other hand, if one class of pattern is easy to learn, having large numbers of patterns from that class in the training set will only slow down the overall learning process.


    Choosing the Initial Weight Values

The gradient descent learning algorithm treats all the weights in the same way, so if we start them all off with the same values, all the hidden units will end up doing the same thing and the network will never learn properly.

For that reason, we generally start off all the weights with small random values. Usually we take them from a flat distribution around zero, [-wt, +wt], or from a Gaussian distribution around zero with standard deviation wt.

Choosing a good value of wt can be difficult. Generally, it is a good idea to make it as large as you can without saturating any of the sigmoids.

We usually hope that the final network performance will be independent of the choice of initial weights, but we need to check this by training the network from a number of different random initial weight sets.
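The two sampling schemes described above can be sketched with the standard library. The value wt = 0.5, the sample size, and the seed are illustrative assumptions.

```python
import random

# Two common ways to draw small random initial weights: a flat (uniform)
# distribution on [-wt, +wt], or a Gaussian with standard deviation wt.
# The value wt = 0.5, the count and the seed are illustrative assumptions.
random.seed(1)
wt = 0.5
n_weights = 1000

uniform_w = [random.uniform(-wt, wt) for _ in range(n_weights)]
gaussian_w = [random.gauss(0.0, wt) for _ in range(n_weights)]

# All uniform draws stay inside the interval; both samples centre near zero.
print(all(-wt <= w <= wt for w in uniform_w))
print(abs(sum(gaussian_w) / n_weights) < 0.1)
```

Note that, unlike the uniform draw, the Gaussian occasionally produces weights well outside [-wt, +wt], which can saturate a sigmoid if wt is chosen too large.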


Choosing the Learning Rate

Choosing a good value for the learning rate η is constrained by two opposing facts:

1. If η is too small, it will take too long to get anywhere near the minimum of the error function.

2. If η is too large, the weight updates will over-shoot the error minimum and the weights will oscillate, or even diverge.

Unfortunately, the optimal value is very problem and network dependent, so one cannot formulate reliable general prescriptions.

Generally, one should try a range of different values (e.g. η = 0.1, 0.01, 1.0, 0.0001) and use the results as a guide.


    Batch Training vs. On-line Training

Batch Training: update the weights after all training patterns have been presented.

On-line Training (or Sequential Training): a natural alternative is to update all the weights immediately after processing each training pattern.

On-line learning does not perform true gradient descent, and the individual weight changes can be rather erratic. Normally a much lower learning rate η will be necessary than for batch learning. However, because each weight now has N updates per epoch (where N is the number of patterns), rather than just one, the learning is often much quicker overall. This is particularly true if there is a lot of redundancy in the training data, i.e. many training patterns containing similar information.
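The two regimes can be contrasted on a single linear weight trained to fit y = 2x. The data, learning rate, and epoch count below are illustrative assumptions; the point is only where each scheme applies its update.

```python
# Batch vs. on-line updates for one linear weight fitted to y = 2x, with a
# sum-squared-error cost. The data, eta and epoch count are illustrative
# assumptions, not values from the lecture.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def batch_epoch(w, eta):
    # Accumulate the gradient over ALL patterns, then update once.
    grad = sum(-(t - w * x) * x for x, t in data)
    return w - eta * grad

def online_epoch(w, eta):
    # Update immediately after EACH pattern (N updates per epoch).
    for x, t in data:
        w = w + eta * (t - w * x) * x
    return w

wb = wo = 0.0
for _ in range(100):
    wb = batch_epoch(wb, 0.05)
    wo = online_epoch(wo, 0.05)

print(round(wb, 3), round(wo, 3))  # both approach the true weight 2.0
```

Batch descent follows the true gradient of the total error, while the on-line path through weight space wanders pattern by pattern; here both settle on the same answer.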


    Choosing the Transfer Function

We have already seen that having a differentiable transfer/activation function is important for the gradient descent algorithm to work. We have also seen that, in terms of computational efficiency, the standard sigmoid (i.e. the logistic function) is a particularly convenient replacement for the step function of the Simple Perceptron.

The logistic function ranges from 0 to 1. There is some evidence that an anti-symmetric transfer function, i.e. one that satisfies f(-x) = -f(x), enables the gradient descent algorithm to learn faster.

When the outputs are required to be non-binary, i.e. continuous real values, having sigmoidal transfer functions no longer makes sense. In these cases, a simple linear transfer function f(x) = x is appropriate.


Local Minima

Cost functions can quite easily have more than one minimum:

If we start off in the vicinity of a local minimum, we may end up at that local minimum rather than the global minimum. Starting with a range of different initial weight sets increases our chances of finding the global minimum. Any variation from true gradient descent will also increase our chances of stepping into the deeper valley.


When to Stop Training

The sigmoid function only takes on its extreme values of 0 and 1 at x = ±∞. In effect, this means that the network can only achieve its binary targets when at least some of its weights reach ±∞. So, given finite gradient descent step sizes, our networks will never reach their binary targets.

Even if we off-set the targets (to 0.1 and 0.9, say) we will generally require an infinite number of increasingly small gradient descent steps to achieve those targets.

Clearly, if the training algorithm can never actually reach the minimum, we have to stop the training process when it is near enough. What constitutes "near enough" depends on the problem. If we have binary targets, it might be enough that all outputs are within 0.1 (say) of their targets. Or, it might be easier to stop the training when the sum squared error function becomes less than a particular small value (0.2, say).


How Many Hidden Units?

The best number of hidden units depends in a complex way on many factors, including:

1. The number of training patterns
2. The numbers of input and output units
3. The amount of noise in the training data
4. The complexity of the function or classification to be learned
5. The type of hidden unit activation function
6. The training algorithm

Too few hidden units will generally leave high training and generalisation errors due to under-fitting. Too many hidden units will result in low training errors, but will make the training unnecessarily slow, and will result in poor generalisation unless some other technique (such as regularisation) is used to prevent over-fitting.

Virtually all rules of thumb you hear about are actually nonsense. A sensible strategy is to try a range of numbers of hidden units and see which works best.


Different Learning Rates for Different Layers?

A network as a whole will usually learn most efficiently if all its neurons are learning at roughly the same speed. So maybe different parts of the network should have different learning rates η. There are a number of factors that may affect the choices:

1. The later network layers (nearer the outputs) will tend to have larger local gradients (deltas) than the earlier layers (nearer the inputs).

2. The activations of units with many connections feeding into or out of them tend to change faster than units with fewer connections.

3. The activations required for linear units will be different from those for sigmoidal units.

4. There is empirical evidence that it helps to have different learning rates for the thresholds/biases compared with the real connection weights.

In practice, it is often quicker to just use the same rates η for all the weights and thresholds, rather than spending time trying to work out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates.