Lecture 4: Multi-Layer Perceptrons
Kevin Swingler, [email protected]
Dept. of Computing Science & Math
8/3/2019
Review of Gradient Descent Learning
1. The purpose of neural network training is to minimise the output errors on a particular set of training data by adjusting the network weights w
2. We define a cost function E(w) that measures how far the current network's output is from the desired one
3. Partial derivatives of the cost function, ∂E(w)/∂w, tell us which direction we need to move in weight space to reduce the error
4. The learning rate η specifies the step sizes we take in weight space for each iteration of the weight update equation
5. We keep stepping through weight space until the errors are small enough
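The five steps above can be sketched numerically. This is a minimal illustration, not the lecturer's code: the toy cost function E(w) = (w − 3)², the learning rate, and the step count are all illustrative assumptions.

```python
# Toy gradient descent: minimise E(w) = (w - 3)^2 by stepping
# against the gradient dE/dw = 2(w - 3).

def gradient_descent(lr=0.1, steps=100):
    w = 0.0                      # initial weight
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)   # partial derivative dE/dw
        w -= lr * grad           # step in the negative gradient direction
    return w

w_final = gradient_descent()
# w_final converges towards the minimum at w = 3
```

Each iteration shrinks the distance to the minimum by a constant factor (1 − 2η) here, which is why the loop eventually gets "small enough" rather than reaching the minimum exactly.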
Review of Perceptron Training
1. Generate the training pairs or patterns x that you wish your network to learn
2. Set up your network with N input units fully connected to M output units
3. Initialise the weights, w, at random
4. Select an appropriate error function E(w) and learning rate η
5. Apply the weight change Δw = −η ∂E(w)/∂w to each weight w for each training pattern p. One set of updates of all the weights for all the training patterns is called one epoch of training
6. Repeat step 5 until the network error function is small enough
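The procedure above can be sketched for a single-layer perceptron. The AND problem, the learning rate, the epoch count, and the initialisation range are illustrative choices, not from the lecture.

```python
# Minimal single-layer perceptron with a threshold output, trained on
# the (linearly separable) AND problem following the steps above.
import random

def train_perceptron(patterns, lr=0.1, epochs=100):
    random.seed(0)
    w = [random.uniform(-0.5, 0.5) for _ in range(2)]   # step 3: random init
    b = random.uniform(-0.5, 0.5)
    for _ in range(epochs):                             # one pass = one epoch
        for x, target in patterns:                      # step 5: per pattern
            y = 1 if w[0]*x[0] + w[1]*x[1] + b > 0 else 0
            err = target - y
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
# the trained weights classify all four AND patterns correctly
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop finds a separating line; XOR, as the next slide shows, has no such line.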
Review of XOR and Linear Separability
Recall that it is not possible to find weights that enable Single Layer Perceptrons to deal with non-linearly separable problems like XOR:

    in1  in2 | out
     0    0  |  0
     0    1  |  1
     1    0  |  1
     1    1  |  0

[Figure: the four XOR patterns plotted on the I1-I2 plane; no single straight line separates the two classes]

The proposed solution was to use a more complex network that is able to generate more complex decision boundaries. That network is the Multi-Layer Perceptron.
Multi-Layer Perceptrons (MLPs)
[Figure: a two-layer network with input units X1…Xi, hidden units O1…Oj, and output units Y1…Yk]

Input layer, i → Hidden layer, j → Output layer, k

The hidden unit outputs and network outputs are:

    O_j = f( Σ_i w_ij X_i )
    Y_k = f( Σ_j w_jk O_j )
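The two layer equations, O_j = f(Σ_i w_ij X_i) and Y_k = f(Σ_j w_jk O_j), can be sketched as a forward pass. The tiny example weights are illustrative assumptions, not values from the lecture.

```python
# Forward pass through a 2-layer MLP:
#   hidden:  O_j = f(sum_i w_ij * X_i)
#   output:  Y_k = f(sum_j w_jk * O_j)
import math

def f(a):
    return 1.0 / (1.0 + math.exp(-a))     # sigmoid activation

def forward(x, w_ij, w_jk):
    hidden = [f(sum(w[i] * x[i] for i in range(len(x)))) for w in w_ij]
    output = [f(sum(w[j] * hidden[j] for j in range(len(hidden)))) for w in w_jk]
    return output

x = [1.0, 0.0]                            # i = 2 input units
w_ij = [[0.5, -0.5], [1.0, 1.0]]          # j = 2 hidden units
w_jk = [[1.0, -1.0]]                      # k = 1 output unit
y = forward(x, w_ij, w_jk)
# y is a single sigmoid output, so it lies strictly between 0 and 1
```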
Can We Use a Generalized Form of the PLR/Delta Rule to Train the MLP?

Recall the PLR/Delta rule: adjust neuron weights to reduce the error at the neuron's output:

    w_new = w_old + η δ x,  where  δ = y_target − y

Main problem: how do we adjust the weights in the hidden layer so that they reduce the error in the output layer, when there is no specified target response in the hidden layer?

[Figure: threshold (step) function vs. smooth sigmoid function, both ranging between −1 and +1]

Solution: alter the non-linear Perceptron (discrete threshold) activation function to make it differentiable and hence help derive a Generalized Delta Rule for MLP training.
Sigmoid (S-shaped) Function Properties
Approximates the threshold function
Smoothly differentiable (everywhere), and hence DR applicable
Positive slope

A popular choice is

    y = f(a) = 1 / (1 + e^(−a))

The derivative of the sigmoidal function is:

    f'(a) = f(a) (1 − f(a))
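The derivative identity f'(a) = f(a)(1 − f(a)) can be checked against a numerical finite-difference gradient; the test point and step size below are illustrative choices.

```python
# The sigmoid and its analytic derivative, checked against a
# central finite-difference approximation of the slope.
import math

def f(a):
    return 1.0 / (1.0 + math.exp(-a))     # sigmoid

def f_prime(a):
    fa = f(a)
    return fa * (1.0 - fa)                # f'(a) = f(a)(1 - f(a))

a = 0.7
eps = 1e-6
numeric = (f(a + eps) - f(a - eps)) / (2 * eps)
# numeric and f_prime(a) agree to several decimal places
```

This identity is what makes the sigmoid so convenient: the derivative needed for gradient descent comes for free from the activation value already computed in the forward pass.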
Weight Update Rule
Generally, the weight change from any unit j to unit k by gradient descent (i.e. weight change by a small increment in the negative direction of the gradient)

    Δw = −η ∂E/∂w = η δ x,  so  w_new = w_old + η δ x

is now called the Generalized Delta Rule (GDR or Backpropagation).

So, the weight change from the input layer unit i to the hidden layer unit j is:

    Δw_ij = η δ_j x_i,  where  δ_j = o_j (1 − o_j) Σ_k w_jk δ_k

The weight change from the hidden layer unit j to the output layer unit k is:

    Δw_jk = η δ_k o_j,  where  δ_k = (y_target,k − y_k) y_k (1 − y_k)
Training of a 2-Layer Feed Forward Network
1. Take the set of training patterns you wish the network to learn
2. Set up the network with N input units fully connected to M non-linear hidden units via connections with weights w_ij, which in turn are fully connected to P output units via connections with weights w_jk
3. Generate random initial weights, e.g. from the range [−wt, +wt]
4. Select an appropriate error function E(w_jk) and learning rate η
5. Apply the weight update equation Δw_jk = −η ∂E(w_jk)/∂w_jk to each weight w_jk for each training pattern p
6. Do the same for all hidden-layer weights
7. Repeat steps 5-6 until the network error function is small enough
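Steps 1-7 can be sketched end to end on the XOR problem using the Generalized Delta Rule deltas. This is a minimal illustration, not the lecturer's code: the learning rate, epoch count, number of hidden units, initialisation range, and the use of bias weights (treated as an extra always-on input) are all illustrative assumptions.

```python
# A 2-input, 3-hidden, 1-output network trained on XOR with online
# backpropagation: delta_k = (t - y) y (1 - y) at the output, and
# delta_j = o_j (1 - o_j) w_jk delta_k at the hidden layer.
import math
import random

def f(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_xor(n_hidden=3, lr=0.5, epochs=10000, wt=0.5, seed=1):
    random.seed(seed)
    # each weight list has a bias weight as an extra always-on input
    w_ij = [[random.uniform(-wt, wt) for _ in range(3)] for _ in range(n_hidden)]
    w_jk = [random.uniform(-wt, wt) for _ in range(n_hidden + 1)]
    patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    for _ in range(epochs):                        # step 7: repeat
        for x, t in patterns:                      # steps 5-6: per pattern
            xb = x + [1]                           # inputs plus bias
            o = [f(sum(w[i] * xb[i] for i in range(3))) for w in w_ij]
            ob = o + [1]                           # hidden outputs plus bias
            y = f(sum(w_jk[j] * ob[j] for j in range(n_hidden + 1)))
            delta_k = (t - y) * y * (1 - y)        # output-layer delta
            delta_j = [o[j] * (1 - o[j]) * w_jk[j] * delta_k
                       for j in range(n_hidden)]   # hidden-layer deltas
            for j in range(n_hidden + 1):
                w_jk[j] += lr * delta_k * ob[j]
            for j in range(n_hidden):
                for i in range(3):
                    w_ij[j][i] += lr * delta_j[j] * xb[i]
    return w_ij, w_jk

def predict(x, w_ij, w_jk):
    xb = x + [1]
    ob = [f(sum(w[i] * xb[i] for i in range(3))) for w in w_ij] + [1]
    return f(sum(w_jk[j] * ob[j] for j in range(len(w_jk))))

w_ij, w_jk = train_xor()
# the trained outputs can now be compared against the XOR targets
```

Note that the hidden deltas are computed with the old w_jk values before those weights are updated, as the chain rule requires.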
Graphical Representation of GDR
[Figure: total error plotted against a weight w_i, showing the ideal weight at the global minimum and a separate, shallower local minimum]
Practical Considerations for Learning Rules
There are a number of important issues about training multi-layer neural networks that need further resolving:

1. Do we need to pre-process the training data? If so, how?
2. How do we choose the initial weights from which we start the training?
3. How do we choose an appropriate learning rate η?
4. Should we change the weights after each training pattern, or after the whole set?
5. Are some activation/transfer functions better than others?
6. How do we avoid local minima in the error function?
7. How do we know when we should stop the training?
8. How many hidden units do we need?
9. Should we have different learning rates for the different layers?
We shall now consider each of these issues one by one.
Pre-processing of the Training Data

In principle, we can just use any raw input-output data to train our networks. However, in practice, it often helps the network to learn appropriately if we carry out some pre-processing of the training data before feeding it to the network.

We should make sure that the training data is representative: it should not contain too many examples of one type at the expense of another. On the other hand, if one class of pattern is easy to learn, having large numbers of patterns from that class in the training set will only slow down the overall learning process.
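One common pre-processing step, shown as an illustrative choice rather than something the slide prescribes, is to standardise each input feature to zero mean and unit variance so no input dominates the weighted sums.

```python
# Standardise one input feature: subtract the mean, divide by the
# (population) standard deviation.

def standardise(column):
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = var ** 0.5
    return [(x - mean) / std for x in column]

raw = [10.0, 12.0, 14.0, 16.0]
scaled = standardise(raw)
# scaled values have mean 0 and standard deviation 1
```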
Choosing the Initial Weight Values
The gradient descent learning algorithm treats all the weights in the same way, so if we start them all off with the same values, all the hidden units will end up doing the same thing and the network will never learn properly.

For that reason, we generally start off all the weights with small random values. Usually we take them from a flat distribution around zero, [−wt, +wt], or from a Gaussian distribution around zero with standard deviation wt.

Choosing a good value of wt can be difficult. Generally, it is a good idea to make it as large as you can without saturating any of the sigmoids.

We usually hope that the final network performance will be independent of the choice of initial weights, but we need to check this by training the network from a number of different random initial weight sets.
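The flat-distribution initialisation described above can be sketched as follows; the layer sizes, wt value, and seed are illustrative assumptions.

```python
# Draw each weight from a flat distribution around zero, [-wt, +wt],
# so the hidden units start off doing different things.
import random

def init_weights(n_in, n_hidden, wt=0.5, seed=0):
    random.seed(seed)
    return [[random.uniform(-wt, wt) for _ in range(n_in)]
            for _ in range(n_hidden)]

w = init_weights(n_in=4, n_hidden=3)
# every weight lies within [-wt, +wt], and they are not all identical
```

Seeding is only for reproducibility here; in practice one would train from several different random seeds, as the slide recommends.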
Choosing the Learning Rate

Choosing a good value for the learning rate η is constrained by two opposing facts:

1. If η is too small, it will take too long to get anywhere near the minimum of the error function.
2. If η is too large, the weight updates will over-shoot the error minimum and the weights will oscillate, or even diverge.

Unfortunately, the optimal value is very problem and network dependent, so one cannot formulate reliable general prescriptions.

Generally, one should try a range of different values (e.g. η = 0.1, 0.01, 1.0, 0.0001) and use the results as a guide.
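The trade-off can be seen on a toy cost function; which rates converge or diverge below is specific to this toy problem, E(w) = w², not a general prescription.

```python
# Try a range of learning rates on E(w) = w^2 (gradient dE/dw = 2w)
# and compare the error remaining after a fixed number of steps.

def final_error(lr, steps=50):
    w = 1.0
    for _ in range(steps):
        w -= lr * 2.0 * w          # gradient descent step
    return w * w                   # E(w) after training

results = {lr: final_error(lr) for lr in (0.0001, 0.01, 0.1, 1.0)}
# tiny rates barely move; lr = 1.0 flips w to -w each step and
# oscillates forever without reducing the error
```

Here 0.1 wins, 0.0001 is far too slow, and 1.0 oscillates, matching the two opposing failure modes listed above.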
Batch Training vs. On-line Training
Batch Training: update the weights after all the training patterns have been presented.

On-line Training (or Sequential Training): a natural alternative is to update all the weights immediately after processing each training pattern.

On-line learning does not perform true gradient descent, and the individual weight changes can be rather erratic. Normally a much lower learning rate will be necessary than for batch learning. However, because each weight now has N updates per epoch (where N is the number of patterns), rather than just one, overall the learning is often much quicker. This is particularly true if there is a lot of redundancy in the training data, i.e. many training patterns containing similar information.
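The two schemes can be contrasted on a single weight with cost E = Σ_p (w − t_p)². The targets and learning rates are illustrative assumptions; as the slide suggests, the on-line version uses a lower rate.

```python
# Batch: one update per epoch, using the gradient summed over all
# patterns. On-line: one update immediately after each pattern.
targets = [1.0, 2.0, 3.0]          # E(w) is minimised at their mean, 2

def batch_epoch(w, lr=0.1):
    grad = sum(2.0 * (w - t) for t in targets)
    return w - lr * grad

def online_epoch(w, lr=0.05):
    for t in targets:
        w -= lr * 2.0 * (w - t)    # erratic per-pattern step
    return w

wb = wo = 0.0
for _ in range(100):
    wb = batch_epoch(wb)
    wo = online_epoch(wo)
# batch settles on the true minimum w = 2; on-line ends near 2 but
# slightly off, because it is not true gradient descent
```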
Choosing the Transfer Function
We have already seen that having a differentiable transfer/activation function is important for the gradient descent algorithm to work. We have also seen that, in terms of computational efficiency, the standard sigmoid (i.e. the logistic function) is a particularly convenient replacement for the step function of the Simple Perceptron.

The logistic function ranges from 0 to 1. There is some evidence that an anti-symmetric transfer function, i.e. one that satisfies f(−x) = −f(x), enables the gradient descent algorithm to learn faster.

When the outputs are required to be non-binary, i.e. continuous real values, having sigmoidal transfer functions no longer makes sense. In these cases, a simple linear transfer function f(x) = x is appropriate.
Local Minima

Cost functions can quite easily have more than one minimum:

[Figure: a cost function with a shallow local minimum and a deeper global minimum]

If we start off in the vicinity of the local minimum, we may end up at the local minimum rather than the global minimum. Starting with a range of different initial weight sets increases our chances of finding the global minimum. Any variation from true gradient descent will also increase our chances of stepping into the deeper valley.
When to Stop Training

The Sigmoid(x) function only takes on its extreme values of 0 and 1 at x = ±∞. In effect, this means that the network can only achieve its binary targets when at least some of its weights reach ±∞. So, given finite gradient descent step sizes, our networks will never reach their binary targets.

Even if we off-set the targets (to 0.1 and 0.9, say) we will generally require an infinite number of increasingly small gradient descent steps to achieve those targets.

Clearly, if the training algorithm can never actually reach the minimum, we have to stop the training process when it is near enough. What constitutes "near enough" depends on the problem. If we have binary targets, it might be enough that all outputs are within 0.1 (say) of their targets. Or, it might be easier to stop the training when the sum squared error function becomes less than a particular small value (0.2, say).
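The sum-squared-error stopping rule can be sketched as follows. The decaying-error stand-in for a training epoch, the starting error, and the threshold of 0.2 (taken from the example above) are illustrative assumptions.

```python
# Stop training as soon as the sum squared error falls below a
# threshold, since the exact minimum can never be reached.

def train_until(threshold=0.2, max_epochs=1000):
    sse = 4.0                      # illustrative starting error
    for epoch in range(1, max_epochs + 1):
        sse *= 0.95                # stand-in for one epoch of training
        if sse < threshold:        # "near enough": stop here
            return epoch, sse
    return max_epochs, sse

epoch, sse = train_until()
# training halts well before max_epochs, once the error is small enough
```

In a real network the error does not shrink this smoothly, which is one reason a simple threshold (rather than, say, "error stopped changing") is the easiest criterion to apply.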
How Many Hidden Units?

The best number of hidden units depends in a complex way on many factors, including:

1. The number of training patterns
2. The numbers of input and output units
3. The amount of noise in the training data
4. The complexity of the function or classification to be learned
5. The type of hidden unit activation function
6. The training algorithm

Too few hidden units will generally leave high training and generalisation errors due to under-fitting. Too many hidden units will result in low training errors, but will make the training unnecessarily slow, and will result in poor generalisation unless some other technique (such as regularisation) is used to prevent over-fitting.

Virtually all rules of thumb you hear about are actually nonsense. A sensible strategy is to try a range of numbers of hidden units and see which works best.
Different Learning Rates for Different Layers?

A network as a whole will usually learn most efficiently if all its neurons are learning at roughly the same speed, so maybe different parts of the network should have different learning rates η. There are a number of factors that may affect the choices:

1. The later network layers (nearer the outputs) will tend to have larger local gradients (deltas) than the earlier layers (nearer the inputs).
2. The activations of units with many connections feeding into or out of them tend to change faster than units with fewer connections.
3. The activations required for linear units will be different from those required for sigmoidal units.
4. There is empirical evidence that it helps to have different learning rates for the thresholds/biases compared with the real connection weights.

In practice, it is often quicker to just use the same rates for all the weights and thresholds, rather than spending time trying to work out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates.