
Page 1: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Neural Networks

AI – Week 23: Sub-symbolic AI – Multi-Layer Neural Networks

Lee McCluskey, room 3/10

Email [email protected]

http://scom.hud.ac.uk/scomtlm/cha2555/

Page 2: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

RECAP: Simple Model of an Artificial Neuron (McCulloch and Pitts 1943)

• A set of synapses (i.e. connections) brings in activations (inputs) from other neurons.

• A processing unit sums the inputs multiplied by their weights, then applies a transfer function with a “threshold value” to decide whether the neuron “fires” (a small code sketch of this model follows these bullet points).

• An output line transmits the result to other neurons (output can be binary or continuous). If the sum does not reach the threshold, output is 0.
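
This simple neuron model can be written directly in a few lines of Python. The sketch below uses arbitrary example weights and an arbitrary threshold; they are not values from the slides:

def neuron(inputs, weights, threshold):
    # Weighted sum of the activations brought in by the synapses
    total = sum(x * w for x, w in zip(inputs, weights))
    # Transfer (step) function: the neuron fires only if the sum reaches the threshold
    return 1 if total >= threshold else 0

print(neuron([1, 0, 1], [0.4, 0.9, 0.3], 0.5))   # 0.4 + 0.3 = 0.7 >= 0.5, so it fires: 1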

Page 3: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Another Look at XOR

We showed in a previous lecture that the XOR truth table cannot be realised using a single-layer perceptron network, because it is not linearly separable.

Multi-layer networks / multi-layer perceptrons (MLNs / MLPs) are able to deal with problems that are not linearly separable.

We can use an MLN to classify the XOR data using two separating lines (and the step function).

Page 4: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Constructing the Network

Consider the following feed-forward, fully connected 2-2-1 network.

In the hidden layer we construct the two required separating lines, i.e. I1 = -x1 - x2 + 1.5 and I2 = -x1 - x2 + 0.5; at the output node we combine this information into a single output.

Page 5: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Evaluating the Network

We can calculate the activations of the hidden layer for the network, using I1 = -x1 - x2 + 1.5 and I2 = -x1 - x2 + 0.5:

Inputs       Total input to hidden layer    Output from hidden layer
x1   x2      I1      I2                     f(I1)   f(I2)
1    1       -0.5    -1.5                   0       0
1    0        0.5    -0.5                   1       0
0    1        0.5    -0.5                   1       0
0    0        1.5     0.5                   1       1
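
To make the construction concrete, here is a minimal Python sketch that evaluates the hand-built network on all four XOR inputs. The hidden-layer lines I1 and I2 are taken from the slides; the way the two hidden outputs are combined at the output node (h1 - h2 - 0.5 fed through the step function) is one plausible choice, assumed here because the original network diagram is not reproduced in this transcript:

def step(v):
    # Step (threshold) activation: the node fires only if its net input exceeds 0
    return 1 if v > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        # Hidden layer: the two separating lines from the slides
        I1 = -x1 - x2 + 1.5
        I2 = -x1 - x2 + 0.5
        h1, h2 = step(I1), step(I2)
        # Output layer: one possible combination that reproduces XOR (assumed weights)
        out = step(h1 - h2 - 0.5)
        print(x1, x2, I1, I2, h1, h2, out)   # 'out' equals XOR(x1, x2)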

Page 6: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Perceptrons

θk is the kth threshold value.

To determine whether the jth output node should fire, we calculate the value

sgn( Σ(i = 1..n) wij · xi  -  θj )

If the weighted sum exceeds the threshold (i.e. the bracketed expression exceeds 0), the neuron fires; otherwise it does not.

Page 7: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Multi-layer Perceptrons (MLPs)

In general, MLPs use the sigmoid activation function:

f(Ij) = 1 / (1 + e^(-Ij))

The sigmoid function is mathematically more “user friendly” than the step function. Due to the asymptotic nature of the sigmoid it is unrealistic to expect output values of 0 and 1 to be realised exactly, so it is usual to relax the output requirements to target values of 0.1 and 0.9.

By adopting the sigmoid function with a more complex architecture, the multi-layer perceptron is able to solve complicated problems.
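
As a quick illustration in Python (the test values below are just examples), the sigmoid approaches 0 and 1 only asymptotically, which is why the targets are relaxed to 0.1 and 0.9:

import math

def sigmoid(I):
    # Sigmoid activation: 1 / (1 + e^(-I))
    return 1.0 / (1.0 + math.exp(-I))

print(sigmoid(0.0))    # 0.5
print(sigmoid(5.0))    # ~0.993 -- approaches but never reaches 1
print(sigmoid(-5.0))   # ~0.007 -- approaches but never reaches 0

# A convenient property used by backpropagation: df/dI = f(I) * (1 - f(I))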

Page 8: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Backpropagation learning

Pseudocode (assume all weights have been initialised randomly in [-1, 1]; a Python sketch of this loop follows the pseudocode):

REPEAT

NOCHANGES = TRUE

For each input pattern

1. Perform a forward sweep to find the actual output

2. Calculate network errors tj – oj

3. If any |tj – oj| > TOLERANCE set NOCHANGES = FALSE

4. DO BACKPROPAGATION to determine weight changes

5. Update weights

UNTIL NOCHANGES
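
Below is a minimal Python sketch of this loop for a 2-2-1 network learning XOR, with targets relaxed to 0.1/0.9 as on the previous slide. The learning rate, tolerance, epoch cap, random seed and bias handling are illustrative choices rather than values from the slides; the delta and weight-update formulas are the ones derived on the next two slides:

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# XOR training patterns, with targets relaxed to 0.1 / 0.9
patterns = [([0.0, 0.0], 0.1), ([0.0, 1.0], 0.9),
            ([1.0, 0.0], 0.9), ([1.0, 1.0], 0.1)]

random.seed(1)
# Weights initialised randomly in [-1, 1]; a constant extra input of 1.0 acts as the bias
w_ih = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # (2 inputs + bias) -> 2 hidden
w_ho = [random.uniform(-1, 1) for _ in range(3)]                      # (2 hidden + bias) -> 1 output

RATE = 0.5        # learning rate
TOLERANCE = 0.1   # acceptable |t - o| for each pattern

for epoch in range(100000):                    # cap on epochs so the sketch always terminates
    no_changes = True
    for inputs, t in patterns:
        x = inputs + [1.0]                     # inputs plus bias
        # 1. Forward sweep
        h = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(3))) for j in range(2)]
        hb = h + [1.0]                         # hidden outputs plus bias
        o = sigmoid(sum(hb[j] * w_ho[j] for j in range(3)))
        # 2./3. Error and tolerance check
        if abs(t - o) > TOLERANCE:
            no_changes = False
        # 4. Backpropagation: delta values, computed before any weights change
        delta_o = (t - o) * o * (1 - o)
        delta_h = [h[j] * (1 - h[j]) * w_ho[j] * delta_o for j in range(2)]
        # 5. Update weights (output layer, then hidden layer)
        for j in range(3):
            w_ho[j] += RATE * hb[j] * delta_o
        for i in range(3):
            for j in range(2):
                w_ih[i][j] += RATE * x[i] * delta_h[j]
    if no_changes:
        break

Whether (and how quickly) training succeeds depends on the random starting weights; with an unlucky start the network can settle in a local minimum, in which case a different seed usually helps.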

Page 9: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

The Backpropagation Algorithm

The change to make to a weight, Δwij, is obtained by “gradient descent”. It is based on the “delta value” δj for an output node j, which represents the error at output j,

Defined by

δj = “difference between output required and output observed” times “gradient of the threshold function”

= (tj - oj) * df/dx

f is the threshold (activation) function 1/(1 + e^(-x)); oj = f(input), the output at j

Hence (do some differentiation)

δj = (tj - oj) * oj *(1- oj)

for an output node j.
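
For example (with illustrative numbers, not values from the slides): if the target is tj = 0.9 and the observed output is oj = 0.6, then δj = (0.9 - 0.6) × 0.6 × (1 - 0.6) = 0.072.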

Page 10: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

The Backpropagation Algorithm

For a weight wij feeding an output node j:

(new weight) wij = (old weight) wij + oi × (learning rate) × δj

For a weight wij feeding a hidden node j, the update has the same form, but the delta value for the hidden node is built from the deltas of the nodes it feeds:

δj = oj × (1 - oj) × Σk ( wjk × δk )

where the sum runs over every node k that receives a link from j (i.e. j → k is a link out of j).
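
Continuing the illustrative numbers from the previous slide: with learning rate 0.5, an incoming activation oi = 1.0 and δj = 0.072 at an output node, the weight change is 1.0 × 0.5 × 0.072 = 0.036.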

Page 11: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Hidden Layers and Hidden Nodes

The question of how many hidden layers to use, and how many nodes each layer should contain, needs to be addressed.

Consider first an n-1-m network: n input nodes, m output nodes and just a single node in the hidden layer. This gives m + n weights. It is useful to regard the weights as degrees of freedom in the network.

Adding a second node to the hidden layer doubles the freedom in the network; producing 2(m+n) weights.

It is not difficult to see the effect that adding a single node has on the size of the problem to be solved.

An n-k-m MLP therefore has k(m + n) = km + kn weights, i.e. degrees of freedom, in the network.
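
For example, a 3-4-2 MLP (3 inputs, 4 hidden nodes, 2 outputs) has 4 × (3 + 2) = 20 weights.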

Page 12: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Hidden Layers and Hidden Nodes

If we assume that training time is proportional to the number of weights in the network, then we can see a need to balance effectiveness (reasonable accuracy) against efficiency (reasonable training time).

A good “rule of thumb” is to start with

hidden nodes ≈ √( #inputs × #outputs )

and increase the number of nodes in the hidden layer if the network has trouble training – experience counts for a lot here.

Further hidden layers are added only when everything else has been tried and has failed (adjusting the number of hidden nodes, the activation function, data scaling, etc.).

Page 13: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Conclusion: RL v ANN

Types of learning:

ANN – learning by example (supervised learning)

RL – learning by observation, low-level cognition

Characterisation of applications:

ANN – learning an approximation to a function where lots of training data are available; particularly good at classification with noisy data, e.g. diagnosis or object recognition

RL – learning low-level reactive behaviour, as found in lower forms of animals; good for low-level cognitive tasks. It has also been used for learning in high-level tasks (e.g. games) where rewards are available but reasoning directly about actions (moves) is too complex.

Page 14: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Conclusion: RL v ANN

Similarities:

- both are classed as "sub-symbolic": they make heavy use of numbers and are rather opaque when functioning

- both are learning approaches that require repeated trials

- both are inspired by natural learning

- both are resistant to noise and degrade gracefully when given degraded inputs

Page 15: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Conclusion: RL v ANN

Differences:

• ANNs have a fixed architecture of layers of neurons, with a simple firing mechanism, weights assigned randomly at the start, and a fixed set of inputs.

• ANNs need supervised TRAINING, i.e. data classified a priori, in the form of input values paired with a correct output.

• RL needs to perform trial-and-error interactions with the environment.

• RL learns a mapping from situations to actions by trial and error: it learns to perform actions that maximise the sum of reinforcements. It is therefore more of a real-time, "hands off" approach than ANNs; it aims to learn policies by assigning blame and learning to avoid situations that lead to poor reinforcement.

Page 16: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Summary of MLPs

• Feed forward.
• Fully connected.
• Sigmoid activation function.
• The restriction to 0, 1 outputs is relaxed to 0.1, 0.9 to accommodate the asymptotic properties of the sigmoid function.
• Backpropagation learning is used to train the network.
• The number of hidden nodes (units) can be chosen using a “rule of thumb”.
• Outputs are continuous rather than binary.

Page 17: AI – Week 23 Sub-symbolic AI   Multi-Layer Neural Networks

Example MLP

[Network diagram, not fully reproduced in this transcript: inputs x1 and x2 plus a bias of 1, a hidden node h1 with its own bias of 1, and output nodes o1 and o2; the example weights shown on the links are 0.5, 0.5, 0.5, -0.3, -0.5, 0.4 and 0.3.]

Inputs: x1 = 0.1, x2 = 0.5
Required outputs: o1 = 0.1, o2 = 0.9