Kevin Swingler - Lecture 4: Multi-Layer Perceptrons


Dept. of Computing Science & Math
8/3/2019

Lecture 4: Multi-Layer Perceptrons
Kevin Swingler, [email protected]


    Review of Gradient Descent Learning

1. The purpose of neural network training is to minimize the output errors on a particular set of training data by adjusting the network weights w.

2. We define a cost function E(w) that measures how far the current network's output is from the desired one.

3. Partial derivatives of the cost function, ∂E(w)/∂w, tell us which direction we need to move in weight space to reduce the error.

4. The learning rate η specifies the step sizes we take in weight space for each iteration of the weight update equation.

5. We keep stepping through weight space until the errors are small enough.
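As a concrete illustration of steps 1-5, here is a minimal gradient descent loop for a single weight. The cost function E(w) = (w - 2)^2 and the learning rate value are illustrative assumptions chosen for the sketch, not values from the lecture.

```python
# Illustrative 1-D example: minimise E(w) = (w - 2)^2 by gradient descent.
# The cost function and the value of eta are assumptions for demonstration.

def dE_dw(w):
    return 2.0 * (w - 2.0)   # partial derivative of E(w) = (w - 2)^2

eta = 0.1                    # learning rate: step size in weight space
w = 5.0                      # arbitrary starting weight
while abs(dE_dw(w)) > 1e-6:  # keep stepping until the gradient is tiny
    w = w - eta * dE_dw(w)   # move against the gradient to reduce E

print(round(w, 4))           # converges close to the minimum at w = 2
```

The same loop structure carries over to networks: only the dimensionality of w and the form of ∂E/∂w change.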


    Review of Perceptron Training

1. Generate a training pair or pattern x that you wish your network to learn.

2. Set up your network with N input units fully connected to M output units.

3. Initialize the weights, w, at random.

4. Select an appropriate error function E(w) and learning rate η.

5. Apply the weight change Δw = -η ∂E(w)/∂w to each weight w for each training pattern p. One set of updates of all the weights for all the training patterns is called one epoch of training.

6. Repeat step 5 until the network error function is small enough.


    Review of XOR and Linear Separability

Recall that it is not possible to find weights that enable Single Layer Perceptrons to deal with non-linearly separable problems like XOR:

XOR truth table:

in1   in2   out
 0     0     0
 0     1     1
 1     0     1
 1     1     0

The proposed solution was to use a more complex network that is able to generate more complex decision boundaries. That network is the Multi-Layer Perceptron.

[Figure: the XOR patterns plotted in the (I1, I2) plane; no single straight line separates the two classes]


    Multi-Layer Perceptrons (MLPs)

[Figure: a fully connected MLP with input layer i (units X1, X2, X3, ..., Xi), hidden layer j (units O1, ..., Oj), and output layer k (units Y1, Y2, ..., Yk)]

The hidden and output layer activations are:

O_j = f( Σ_i w_ij X_i )

Y_k = f( Σ_j w_jk O_j )
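The two-layer computation above can be sketched directly in Python. The network sizes and weight values below are illustrative assumptions; only the two formulas for O_j and Y_k come from the slide.

```python
import math

# Minimal sketch of the MLP forward pass:
#   hidden layer  O_j = f(sum_i w_ij X_i)
#   output layer  Y_k = f(sum_j w_jk O_j)
# The sizes and weight values are illustrative assumptions.

def f(a):
    return 1.0 / (1.0 + math.exp(-a))   # sigmoid activation

def forward(x, w_ij, w_jk):
    # Hidden layer: O_j = f(sum_i w_ij X_i)
    o = [f(sum(w_ij[i][j] * x[i] for i in range(len(x))))
         for j in range(len(w_ij[0]))]
    # Output layer: Y_k = f(sum_j w_jk O_j)
    y = [f(sum(w_jk[j][k] * o[j] for j in range(len(o))))
         for k in range(len(w_jk[0]))]
    return o, y

# A 2-input, 2-hidden, 1-output network with hand-picked example weights
w_ij = [[0.5, -0.5], [0.5, 0.5]]
w_jk = [[1.0], [-1.0]]
o, y = forward([1.0, 0.0], w_ij, w_jk)
print(len(o), len(y))  # 2 hidden activations, 1 output
```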


Can We Use a Generalized Form of the PLR/Delta Rule to Train the MLP?

Recall the PLR/Delta rule: adjust the neuron weights to reduce the error at the neuron's output:

w_new = w_old + η δ x,  where  δ = y_target - y

Main problem: how do we adjust the weights in the hidden layer, so that they reduce the error in the output layer, when there is no specified target response in the hidden layer?

[Figure: the discrete threshold function (step between -1 and +1) alongside the smooth sigmoid function]

Solution: alter the non-linear Perceptron (discrete threshold) activation function to make it differentiable, and hence help derive a Generalized Delta Rule for MLP training.


    Sigmoid (S-shaped) Function Properties

    Approximates the threshold function

Smoothly differentiable (everywhere), and hence the Delta Rule is applicable

    Positive slope

A popular choice is the logistic sigmoid:

y = f(a) = 1 / (1 + e^(-a))

The derivative of the sigmoid function is:

f'(a) = f(a) (1 - f(a))
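The derivative identity can be checked numerically. The test point and step size below are illustrative assumptions; the finite-difference comparison is just a sanity check, not part of the slides.

```python
import math

# The sigmoid and its derivative f'(a) = f(a) * (1 - f(a)), checked
# against a central finite-difference estimate of the slope.

def f(a):
    return 1.0 / (1.0 + math.exp(-a))

def f_prime(a):
    fa = f(a)
    return fa * (1.0 - fa)

a = 0.7                                      # arbitrary test point
h = 1e-6
numeric = (f(a + h) - f(a - h)) / (2 * h)    # central finite difference
print(abs(f_prime(a) - numeric) < 1e-6)      # the two estimates agree
```

Note that f'(a) peaks at a = 0, where f(0) = 0.5 and f'(0) = 0.25, and approaches zero as the unit saturates towards 0 or 1.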


    Weight Update Rule

Generally, the weight change from any unit j to unit k by gradient descent (i.e. a weight change by a small increment in the negative direction of the gradient),

Δw = -η ∂E/∂w,    w_new = w_old + Δw

is now called the Generalized Delta Rule (GDR, or Backpropagation).

The weight change from the input layer unit i to the hidden layer unit j is:

Δw_ij = η δ_j x_i,  where  δ_j = o_j (1 - o_j) Σ_k δ_k w_jk

The weight change from the hidden layer unit j to the output layer unit k is:

Δw_jk = η δ_k o_j,  where  δ_k = (y_target,k - y_k) y_k (1 - y_k)


    Training of a 2-Layer Feed Forward Network

1. Take the set of training patterns you wish the network to learn.

2. Set up the network with N input units fully connected to M non-linear hidden units via connections with weights w_ij, which in turn are fully connected to P output units via connections with weights w_jk.

3. Generate random initial weights, e.g. from the range [-wt, +wt].

4. Select an appropriate error function E(w_jk) and learning rate η.

5. Apply the weight update Δw_jk = -η ∂E(w_jk)/∂w_jk to each weight w_jk for each training pattern p.

6. Do the same for the hidden layer weights w_ij.

7. Repeat steps 5-6 until the network error function is small enough.
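The procedure above can be sketched end to end on the XOR problem. The network size (2 inputs, 3 hidden units, 1 output), the bias inputs, the learning rate, the epoch count, and the random seed are all illustrative assumptions; the delta formulas are the GDR equations from the previous slide.

```python
import math
import random

# Sketch of steps 1-7: training a small 2-layer network on XOR with the
# Generalized Delta Rule. Sizes, biases, eta, epochs and seed are
# illustrative assumptions, not values from the lecture.
random.seed(0)
patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def f(a):
    return 1.0 / (1.0 + math.exp(-a))

N_IN, N_HID = 3, 3          # 2 inputs + 1 bias input; 3 hidden units
wt = 1.0                    # random initial weights from [-wt, +wt]
w_ij = [[random.uniform(-wt, wt) for _ in range(N_HID)] for _ in range(N_IN)]
w_jk = [random.uniform(-wt, wt) for _ in range(N_HID + 1)]  # + hidden bias
eta = 0.5

def forward(x):
    xb = x + [1.0]          # append bias input
    o = [f(sum(w_ij[i][j] * xb[i] for i in range(N_IN))) for j in range(N_HID)]
    ob = o + [1.0]          # append hidden bias unit
    y = f(sum(w_jk[j] * ob[j] for j in range(N_HID + 1)))
    return xb, o, ob, y

for epoch in range(10000):  # on-line training: update after every pattern
    for x, target in patterns:
        xb, o, ob, y = forward(x)
        delta_k = (target - y) * y * (1.0 - y)              # output delta
        delta_j = [o[j] * (1.0 - o[j]) * delta_k * w_jk[j]  # hidden deltas
                   for j in range(N_HID)]
        for j in range(N_HID + 1):                          # hidden -> output
            w_jk[j] += eta * delta_k * ob[j]
        for i in range(N_IN):                               # input -> hidden
            for j in range(N_HID):
                w_ij[i][j] += eta * delta_j[j] * xb[i]

outs = [forward(x)[3] for x, _ in patterns]
print([round(v, 2) for v in outs])  # with luck, close to XOR: [0, 1, 1, 0]
```

Whether the run reaches the XOR targets depends on the random initial weights; as the later slides discuss, some starting points lead to a local minimum instead.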


    Graphical Representation of GDR

[Figure: total error plotted as a function of a weight w_i, showing a local minimum and the global minimum at the ideal weight]


    Practical Considerations for Learning Rules

There are a number of important issues about training multi-layer neural networks that still need resolving:

1. Do we need to pre-process the training data? If so, how?

2. How do we choose the initial weights from which we start the training?

3. How do we choose an appropriate learning rate η?

4. Should we change the weights after each training pattern, or after the whole set?

    5. Are some activation/transfer functions better than others?

    6. How do we avoid local minima in the error function?

    7. How do we know when we should stop the training?

    8. How many hidden units do we need?

    9. Should we have different learning rates for the different layers?

    We shall now consider each of these issues one by one.


Pre-processing of the Training Data

In principle, we can just use any raw input-output data to train our networks. However, in practice, it often helps the network to learn appropriately if we carry out some pre-processing of the training data before feeding it to the network.

We should make sure that the training data is representative: it should not contain too many examples of one type at the expense of another. On the other hand, if one class of pattern is easy to learn, having large numbers of patterns from that class in the training set will only slow down the overall learning process.


    Choosing the Initial Weight Values

The gradient descent learning algorithm treats all the weights in the same way, so if we start them all off with the same values, all the hidden units will end up doing the same thing and the network will never learn properly.

For that reason, we generally start off all the weights with small random values. Usually we take them from a flat distribution around zero, [-wt, +wt], or from a Gaussian distribution around zero with standard deviation wt.

Choosing a good value of wt can be difficult. Generally, it is a good idea to make it as large as you can without saturating any of the sigmoids.

We usually hope that the final network performance will be independent of the choice of initial weights, but we need to check this by training the network from a number of different random initial weight sets.
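The two sampling schemes described above can be sketched with the standard library. The value wt = 0.5, the sample size, and the seed are illustrative assumptions.

```python
import random

# Two common ways to draw small random initial weights: a flat (uniform)
# distribution on [-wt, +wt], or a Gaussian with standard deviation wt.
# The value wt = 0.5, the count and the seed are illustrative assumptions.
random.seed(1)
wt = 0.5
n_weights = 1000

uniform_w = [random.uniform(-wt, wt) for _ in range(n_weights)]
gaussian_w = [random.gauss(0.0, wt) for _ in range(n_weights)]

# All uniform draws stay inside the interval; both samples centre near zero.
print(all(-wt <= w <= wt for w in uniform_w))
print(abs(sum(gaussian_w) / n_weights) < 0.1)
```

Note that, unlike the uniform draw, the Gaussian occasionally produces weights well outside [-wt, +wt], which can saturate a sigmoid if wt is chosen too large.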


Choosing the Learning Rate

Choosing a good value for the learning rate η is constrained by two opposing facts:

1. If η is too small, it will take too long to get anywhere near the minimum of the error function.

2. If η is too large, the weight updates will over-shoot the error minimum and the weights will oscillate, or even diverge.

Unfortunately, the optimal value is very problem and network dependent, so one cannot formulate reliable general prescriptions.

Generally, one should try a range of different values (e.g. η = 0.1, 0.01, 1.0, 0.0001) and use the results as a guide.


    Batch Training vs. On-line Training

Batch Training: update the weights after all training patterns have been presented.

On-line Training (or Sequential Training): a natural alternative is to update all the weights immediately after processing each training pattern.

On-line learning does not perform true gradient descent, and the individual weight changes can be rather erratic. Normally a much lower learning rate η will be necessary than for batch learning. However, because each weight now has N updates per epoch (where N is the number of patterns), rather than just one, the learning is often much quicker overall. This is particularly true if there is a lot of redundancy in the training data, i.e. many training patterns containing similar information.
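The two regimes can be contrasted on a single linear weight trained to fit y = 2x. The data, learning rate, and epoch count below are illustrative assumptions; the point is only where each scheme applies its update.

```python
# Batch vs. on-line updates for one linear weight fitted to y = 2x, with a
# sum-squared-error cost. The data, eta and epoch count are illustrative
# assumptions, not values from the lecture.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def batch_epoch(w, eta):
    # Accumulate the gradient over ALL patterns, then update once.
    grad = sum(-(t - w * x) * x for x, t in data)
    return w - eta * grad

def online_epoch(w, eta):
    # Update immediately after EACH pattern (N updates per epoch).
    for x, t in data:
        w = w + eta * (t - w * x) * x
    return w

wb = wo = 0.0
for _ in range(100):
    wb = batch_epoch(wb, 0.05)
    wo = online_epoch(wo, 0.05)

print(round(wb, 3), round(wo, 3))  # both approach the true weight 2.0
```

Batch descent follows the true gradient of the total error, while the on-line path through weight space wanders pattern by pattern; here both settle on the same answer.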


    Choosing the Transfer Function

We have already seen that having a differentiable transfer/activation function is important for the gradient descent algorithm to work. We have also seen that, in terms of computational efficiency, the standard sigmoid (i.e. the logistic function) is a particularly convenient replacement for the step function of the Simple Perceptron.

The logistic function ranges from 0 to 1. There is some evidence that an anti-symmetric transfer function, i.e. one that satisfies f(-x) = -f(x), enables the gradient descent algorithm to learn faster.

When the outputs are required to be non-binary, i.e. continuous real values, having sigmoidal transfer functions no longer makes sense. In these cases, a simple linear transfer function f(x) = x is appropriate.


Local Minima

Cost functions can quite easily have more than one minimum:

If we start off in the vicinity of a local minimum, we may end up at that local minimum rather than the global minimum. Starting with a range of different initial weight sets increases our chances of finding the global minimum. Any variation from true gradient descent will also increase our chances of stepping into the deeper valley.


When to Stop Training

The sigmoid function only takes on its extreme values of 0 and 1 at x = ±∞. In effect, this means that the network can only achieve its binary targets when at least some of its weights reach ±∞. So, given finite gradient descent step sizes, our networks will never reach their binary targets.

Even if we off-set the targets (to 0.1 and 0.9, say) we will generally require an infinite number of increasingly small gradient descent steps to achieve those targets.

Clearly, if the training algorithm can never actually reach the minimum, we have to stop the training process when it is near enough. What constitutes "near enough" depends on the problem. If we have binary targets, it might be enough that all outputs are within 0.1 (say) of their targets. Or, it might be easier to stop the training when the sum squared error function becomes less than a particular small value (0.2, say).


How Many Hidden Units?

The best number of hidden units depends in a complex way on many factors, including:

1. The number of training patterns
2. The numbers of input and output units
3. The amount of noise in the training data
4. The complexity of the function or classification to be learned
5. The type of hidden unit activation function
6. The training algorithm

Too few hidden units will generally leave high training and generalisation errors due to under-fitting. Too many hidden units will result in low training errors, but will make the training unnecessarily slow, and will result in poor generalisation unless some other technique (such as regularisation) is used to prevent over-fitting.

Virtually all rules of thumb you hear about are actually nonsense. A sensible strategy is to try a range of numbers of hidden units and see which works best.


Different Learning Rates for Different Layers?

A network as a whole will usually learn most efficiently if all its neurons are learning at roughly the same speed. So maybe different parts of the network should have different learning rates η. There are a number of factors that may affect the choices:

1. The later network layers (nearer the outputs) will tend to have larger local gradients (deltas) than the earlier layers (nearer the inputs).

2. The activations of units with many connections feeding into or out of them tend to change faster than units with fewer connections.

3. The activations required for linear units will be different from those for sigmoidal units.

4. There is empirical evidence that it helps to have different learning rates for the thresholds/biases compared with the real connection weights.

In practice, it is often quicker to just use the same rates η for all the weights and thresholds, rather than spending time trying to work out appropriate differences. A very powerful approach is to use evolutionary strategies to determine good learning rates.