
Page 1: Title Slide

CS 476: Networks of Neural Computation

WK5 – Dynamic Networks: Time Delayed & Recurrent Networks

Dr. Stathis Kasderidis

Dept. of Computer Science

University of Crete

Spring Semester, 2009

Page 2: Contents

•Sequence Learning

•Time Delayed Networks I: Implicit Representation

•Time Delayed Networks II: Explicit Representation

•Recurrent Networks I: Elman + Jordan Networks

•Recurrent Networks II: Back Propagation Through Time

•Conclusions

Page 3: Sequence Learning

•MLP & RBF networks are static networks, i.e. they learn a mapping from a single input signal to a single output response, for an arbitrarily large number of such pairs;

•Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of (signal, sequence) pairs.

•Typically the input signal to a dynamic network is an element of the sequence, and the network then produces the rest of the sequence as its response.

•To learn sequences we need to add some form of (short-term) memory to the network.

Page 4: Sequence Learning II

•We can introduce memory effects in two principal ways:

•Implicit: e.g. time-lagged signals used as input to a static network, or recurrent connections

•Explicit: e.g. Temporal Backpropagation Method

•In the implicit form, we assume that the environment from which we collect examples of (input signal, output sequence) is stationary. For the explicit form the environment could be non-stationary, i.e. the network can track the changes in the structure of the signal.

Page 5: Time Delayed Networks I

•The time delayed approach includes two basic types of networks:

•Implicit Representation of Time: We combine a memory structure in the input layer of the network with a static network model.

•Explicit Representation of Time: We explicitly allow the network to code time, by generalising the network weights from scalars to vectors, as in TBP (Temporal Backpropagation).

•Typical forms of memories that are used are the Tapped Delay Line and the Gamma Memory family.

Page 6: Time Delayed Networks I

•The Tapped Delay Line form of memory is shown below for an input signal x(n):

•The Gamma form of memory is defined by:

$g_p(n) = \binom{n-1}{p-1}\, \mu^{p} (1-\mu)^{n-p}, \quad n \ge p$
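To make the two memory structures concrete, here is a minimal NumPy sketch of a tapped delay line and of the gamma memory taps; the function names, the zero-padding of early samples and the use of the standard gamma recursion $x_k(n) = (1-\mu)x_k(n-1) + \mu x_{k-1}(n-1)$ (with $\mu$ the adjustable memory parameter) are our own illustrative assumptions rather than part of the original slides.

```python
import numpy as np

def tapped_delay_line(x, p):
    """Return, for each time n, the vector [x(n), x(n-1), ..., x(n-p)].

    Samples before the start of the signal are taken to be zero.
    """
    x = np.asarray(x, dtype=float)
    padded = np.concatenate([np.zeros(p), x])
    return np.stack([padded[p + np.arange(len(x)) - l] for l in range(p + 1)], axis=1)

def gamma_memory(x, p, mu):
    """Gamma memory taps: a cascade of p leaky integrators.

    Each tap obeys x_k(n) = (1 - mu) * x_k(n-1) + mu * x_{k-1}(n-1),
    with x_0(n) = x(n); mu trades memory depth against resolution.
    """
    x = np.asarray(x, dtype=float)
    taps = np.zeros((len(x), p + 1))
    state = np.zeros(p + 1)
    for n, xn in enumerate(x):
        prev = state.copy()
        state[0] = xn                      # tap 0 is the raw signal
        for k in range(1, p + 1):
            state[k] = (1.0 - mu) * prev[k] + mu * prev[k - 1]
        taps[n] = state
    return taps

if __name__ == "__main__":
    signal = np.sin(0.3 * np.arange(20))
    print(tapped_delay_line(signal, p=3)[:5])
    print(gamma_memory(signal, p=3, mu=0.5)[:5])
```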

Page 7: Time Delayed Networks I

•The Gamma Memory is shown below:

Page 8: Time Delayed Networks I

•In the implicit representation approach we combine a static network (e.g. MLP / RBF) with a memory structure (e.g. tapped delay line).

•An example is shown below (the NETtalk network):

Page 9: Time Delayed Networks I

•We present the data in a sliding window. For example, in NETtalk the middle group of input neurons presents the letter in focus. The remaining input groups, three before and three after, present context.

•The purpose is to predict, for example, the next symbol in the sequence.

•The NETtalk model (Sejnowski & Rosenberg, 1987) has:

•203 input nodes
•80 hidden neurons
•26 output neurons
•18629 weights
•Used the BP method for training
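As an illustration of the sliding-window idea, the sketch below builds 7-group input vectors (three symbols of left context, the symbol in focus, three of right context) over a 29-symbol alphabet, giving 7 × 29 = 203 inputs as quoted above; the particular alphabet, the padding symbol and the helper names are illustrative assumptions, not the original NETtalk encoding.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."   # illustrative 29-symbol set
SYM_TO_IDX = {s: i for i, s in enumerate(ALPHABET)}

def one_hot(symbol):
    v = np.zeros(len(ALPHABET))
    v[SYM_TO_IDX[symbol]] = 1.0
    return v

def sliding_windows(text, context=3, pad=" "):
    """Yield (window_vector, centre_symbol) pairs.

    Each window covers the symbol in focus plus `context` symbols of left
    and right context, concatenated into one input vector for a static
    network (7 * 29 = 203 inputs for context=3).
    """
    padded = pad * context + text + pad * context
    width = 2 * context + 1
    for i in range(len(text)):
        window = padded[i:i + width]
        x = np.concatenate([one_hot(s) for s in window])
        yield x, text[i]

if __name__ == "__main__":
    for x, centre in sliding_windows("hello world"):
        print(centre, x.shape)   # each input vector has 203 components
```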

Page 10: Time Delayed Networks II

•In the explicit time representation, neurons have a spatio-temporal structure, i.e. a synapse arriving at a neuron is not a scalar weight but a vector of weights, which is used to convolve the time-delayed input signal of the previous neuron with the synapse.

•A schematic representation of a neuron is given below:

Page 11: Time Delayed Networks II

•The output of the neuron in this case is given by:

$y_j(n) = \varphi\left( \sum_{l=0}^{p} w_j(l)\, x(n-l) + b_j \right)$

•In the case of a whole network, for example assuming a single output node and a linear output layer, the response is given by:

$y(n) = \sum_{j=1}^{m} w_j\, y_j(n) + b_0 = \sum_{j=1}^{m} w_j\, \varphi\left( \sum_{l=0}^{p} w_j(l)\, x(n-l) + b_j \right) + b_0$

where p is the depth of the memory and $b_0$ is the bias of the output neuron.
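A minimal sketch of this forward computation for a focused time-lagged network with m hidden neurons and a single linear output; the tanh choice of $\varphi$, the weight shapes and the variable names are assumptions made for illustration.

```python
import numpy as np

def tlfn_output(x_window, W, b, w_out, b0):
    """Response of a focused time-lagged network with one linear output.

    x_window : array (p+1,), the delayed inputs [x(n), ..., x(n-p)]
    W        : array (m, p+1), the weight vector w_j(l) of each hidden neuron j
    b        : array (m,), hidden biases b_j
    w_out    : array (m,), output weights w_j
    b0       : scalar bias of the output neuron
    """
    v = W @ x_window + b          # induced fields of the hidden neurons
    y_hidden = np.tanh(v)         # y_j(n) = phi(sum_l w_j(l) x(n-l) + b_j)
    return w_out @ y_hidden + b0  # y(n) = sum_j w_j y_j(n) + b0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, m = 5, 4
    x_window = rng.standard_normal(p + 1)
    print(tlfn_output(x_window, rng.standard_normal((m, p + 1)),
                      rng.standard_normal(m), rng.standard_normal(m), 0.1))
```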

Page 12: Time Delayed Networks II

•In the more general case, where we have multiple neurons at the output layer, we have for neuron j of any layer:

$y_j(n) = \varphi\left( \sum_{i=1}^{m_0} \sum_{l=0}^{p} w_{ji}(l)\, x_i(n-l) + b_j \right)$

•The output of any synapse is given by the convolution sum:

$s_{ji}(n) = \mathbf{w}_{ji}^{T} \mathbf{x}_i(n) = \sum_{l=0}^{p} w_{ji}(l)\, x_i(n-l)$

Page 13: Time Delayed Networks II

where the state vector $\mathbf{x}_i(n)$ and weight vector $\mathbf{w}_{ji}$ for synapse i are defined as follows:

•$\mathbf{x}_i(n) = [x_i(n),\, x_i(n-1),\, \dots,\, x_i(n-p)]^{T}$

•$\mathbf{w}_{ji} = [w_{ji}(0),\, w_{ji}(1),\, \dots,\, w_{ji}(p)]^{T}$
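In code, the convolution sum of the previous slide then reduces to the inner product of these two vectors; a tiny illustrative sketch (the numerical values and names are ours):

```python
import numpy as np

p = 3
x_i = np.array([0.9, 0.5, -0.2, 0.1])   # x_i(n) = [x_i(n), x_i(n-1), ..., x_i(n-p)]
w_ji = np.array([0.4, 0.3, 0.2, 0.1])   # w_ji = [w_ji(0), w_ji(1), ..., w_ji(p)]

# s_ji(n) = w_ji^T x_i(n) = sum_{l=0}^{p} w_ji(l) * x_i(n-l)
s_ji = w_ji @ x_i
print(s_ji)
```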

Page 14: Time Delayed Networks II: Learning Law

•To train such a network we use the Temporal BackPropagation algorithm. We present the algorithm below.

•Assume that neuron j lies in the output layer and its response is denoted by yj(n) at time n, while its desired response is given by dj(n).

•We can define an instantaneous value for the sum of squared errors produced by the network as follows:

$E(n) = \frac{1}{2} \sum_{j} e_j^{2}(n)$

Page 15: Time Delayed Networks II: Learning Law-1

•The error signal at the output layer is defined by:

$e_j(n) = d_j(n) - y_j(n)$

•The idea is to minimise an overall cost function, calculated over all time:

$E_{\mathrm{total}} = \sum_{n} E(n)$

•We could proceed as usual by calculating the gradient of the cost function with respect to the weights. This implies that we need to calculate the instantaneous gradient:

$\dfrac{\partial E_{\mathrm{total}}}{\partial w_{ji}} = \sum_{n} \dfrac{\partial E(n)}{\partial w_{ji}}$

Page 16: Time Delayed Networks II: Learning Law-2

•However, for this approach to work we need to unfold the network in time (i.e. to convert it to an equivalent static network and then calculate the gradient). This option presents a number of disadvantages:

•A loss of symmetry between forward and backward pass for the calculation of instantaneous gradient;

•No nice recursive formula for propagation of error terms;

•Need for global bookkeeping to keep track of which static weights are actually the same in the equivalent network

Page 17: Time Delayed Networks II: Learning Law-3

•For these reasons we prefer to calculate the gradient of the cost function as follows:

$\dfrac{\partial E_{\mathrm{total}}}{\partial w_{ji}} = \sum_{n} \dfrac{\partial E_{\mathrm{total}}}{\partial v_j(n)}\, \dfrac{\partial v_j(n)}{\partial w_{ji}}$

•Note that, in general:

$\dfrac{\partial E_{\mathrm{total}}}{\partial v_j(n)}\, \dfrac{\partial v_j(n)}{\partial w_{ji}} \neq \dfrac{\partial E(n)}{\partial w_{ji}}$

•The equality is correct only when we take the sum over all time.

Page 18: Time Delayed Networks II: Learning Law-4

•To calculate the weight update we use the steepest descent method:

$w_{ji}(n+1) = w_{ji}(n) - \eta\, \dfrac{\partial E_{\mathrm{total}}}{\partial v_j(n)}\, \dfrac{\partial v_j(n)}{\partial w_{ji}(n)}$

where $\eta$ is the learning rate.

•We calculate the terms in the above relation as follows:

$\dfrac{\partial v_j(n)}{\partial w_{ji}(n)} = x_i(n)$

since the weighted sum of the (time-delayed) synaptic inputs is by definition the induced field $v_j(n)$.

Page 19: Time Delayed Networks II: Learning Law-5

•We define the local gradient as:

$\delta_j(n) = -\dfrac{\partial E_{\mathrm{total}}}{\partial v_j(n)}$

•Thus we can write the weight update equations in the familiar form:

$w_{ji}(n+1) = w_{ji}(n) + \eta\, \delta_j(n)\, x_i(n)$

•We need to calculate $\delta_j(n)$ for the cases of the output and hidden layers.

Page 20: Time Delayed Networks II: Learning Law-6

•For the output layer the local gradient is given by:

$\delta_j(n) = -\dfrac{\partial E_{\mathrm{total}}}{\partial v_j(n)} = -\dfrac{\partial E(n)}{\partial v_j(n)} = e_j(n)\, \varphi'(v_j(n))$

•For a hidden layer we assume that neuron j is connected to a set A of neurons in the next layer (hidden or output). Then we have:

$\delta_j(n) = -\dfrac{\partial E_{\mathrm{total}}}{\partial v_j(n)} = -\sum_{r \in A} \sum_{k} \dfrac{\partial E_{\mathrm{total}}}{\partial v_r(k)}\, \dfrac{\partial v_r(k)}{\partial v_j(n)}$

Page 21: Time Delayed Networks II: Learning Law-7

•By re-writing we get the following:

$\delta_j(n) = \sum_{r \in A} \sum_{k} \delta_r(k)\, \dfrac{\partial v_r(k)}{\partial v_j(n)} = \sum_{r \in A} \sum_{k} \delta_r(k)\, \dfrac{\partial v_r(k)}{\partial y_j(n)}\, \dfrac{\partial y_j(n)}{\partial v_j(n)}$

•Finally, putting it all together, we get:

$\delta_j(n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{k=n}^{n+p} \delta_r(k)\, w_{rj}(k-n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{l=0}^{p} \delta_r(n+l)\, w_{rj}(l)$
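A sketch of how this last formula for the hidden-layer local gradient could be evaluated, assuming the local gradients $\delta_r(n+l)$ of the next layer are already available and taking $\varphi = \tanh$; the array layout and the names are illustrative assumptions.

```python
import numpy as np

def hidden_delta(v_j_n, delta_next, w_next_j):
    """delta_j(n) = phi'(v_j(n)) * sum_{r in A} sum_{l=0}^{p} delta_r(n+l) * w_rj(l).

    v_j_n      : induced field v_j(n) of hidden neuron j (scalar)
    delta_next : array (R, p+1), delta_next[r, l] = delta_r(n+l) for the neurons r in A
    w_next_j   : array (R, p+1), w_next_j[r, l] = w_rj(l), the synaptic filter from j to r
    """
    phi_prime = 1.0 - np.tanh(v_j_n) ** 2   # derivative of tanh (our choice of phi)
    return phi_prime * np.sum(delta_next * w_next_j)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    print(hidden_delta(0.3, rng.standard_normal((2, 4)), rng.standard_normal((2, 4))))
```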

Page 22: Time Delayed Networks II: Learning Law-8

Where l is the layer level

Page 23: Recurrent I

•A network is called recurrent when there are connections which feed back to previous layers or neurons, including self-connections. An example is shown next:

•Successful early models of recurrent networks are:

•Jordan Network

•Elman Network

Page 24: Recurrent I

•The Jordan Network has the structure of an MLP plus additional context units. The output neurons feed back to the context neurons in a 1-1 fashion. The context units also feed back to themselves.

•The network is trained by using the Backpropagation algorithm

•A schematic is shown in the next figure:

Page 25: Recurrent I

•The Elman Network also has the structure of an MLP plus additional context units. The hidden neurons feed back to the context neurons in a 1-1 fashion. The hidden neurons' connections to the context units are constant and equal to 1. It is also called the Simple Recurrent Network (SRN).

•The network is trained by using the Backpropagation algorithm.

•A schematic is shown in the next figure:
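As a complement to the schematic, a minimal sketch of one forward step of an Elman (SRN) network is given below; the tanh hidden activation, linear output and weight names are illustrative assumptions. The Jordan variant would instead feed the previous output back into the context units, which also feed back to themselves.

```python
import numpy as np

def elman_step(x, context, W_in, W_ctx, W_out, b_h, b_y):
    """One time step of a simple recurrent (Elman) network.

    The context units hold an exact copy (fixed weights of 1) of the
    hidden activations from the previous time step.
    """
    h = np.tanh(W_in @ x + W_ctx @ context + b_h)  # hidden state
    y = W_out @ h + b_y                            # output (linear here)
    return y, h                                    # new context = current hidden state

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_in, n_hid, n_out = 3, 5, 2
    W_in, W_ctx = rng.standard_normal((n_hid, n_in)), rng.standard_normal((n_hid, n_hid))
    W_out = rng.standard_normal((n_out, n_hid))
    context = np.zeros(n_hid)
    for x in rng.standard_normal((4, n_in)):       # a short input sequence
        y, context = elman_step(x, context, W_in, W_ctx, W_out,
                                np.zeros(n_hid), np.zeros(n_out))
        print(y)
```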

Page 26: Recurrent II

•More complex forms of recurrent networks are possible. We can start by extending an MLP as a basic building block.

•Typical paradigms of complex recurrent models are:

•Nonlinear Autoregressive with Exogenous Inputs Network (NARX)

•The State Space Model

•The Recurrent Multilayer Perceptron (RMLP)

•Schematic representations of the networks are given in the next slides:

Page 27: Recurrent II-1

•The structure of the NARX model includes:

•An MLP static network;

•A current input u(n) and its delayed versions up to a time q;

•A time delayed version of the current output y(n) which feeds back to the input layer. The memory of the delayed output vector is in general p.

•The output is calculated as:

y(n+1)=F(y(n),…,y(n-p+1),u(n),…,u(n-q+1))
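A sketch of this NARX recursion; the particular one-hidden-layer tanh MLP standing in for F, and all parameter values, are illustrative assumptions.

```python
import numpy as np

def narx_step(y_hist, u_hist, W1, b1, W2, b2):
    """y(n+1) = F(y(n), ..., y(n-p+1), u(n), ..., u(n-q+1)).

    y_hist : array (p,), most recent outputs, y_hist[0] = y(n)
    u_hist : array (q,), most recent inputs,  u_hist[0] = u(n)
    F is taken here to be a one-hidden-layer MLP with tanh units and a linear output.
    """
    z = np.concatenate([y_hist, u_hist])
    return float(W2 @ np.tanh(W1 @ z + b1) + b2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p, q, n_hid = 2, 3, 6
    W1, b1 = rng.standard_normal((n_hid, p + q)), np.zeros(n_hid)
    W2, b2 = rng.standard_normal(n_hid), 0.0
    y_hist, u_hist = np.zeros(p), np.zeros(q)
    for u in np.sin(0.2 * np.arange(10)):
        u_hist = np.concatenate([[u], u_hist[:-1]])        # u(n) enters the input delay line
        y_next = narx_step(y_hist, u_hist, W1, b1, W2, b2)  # y(n+1)
        y_hist = np.concatenate([[y_next], y_hist[:-1]])    # feed the output back
        print(y_next)
```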

Page 28: Recurrent II-2

•A schematic of the NARX model is as follows:

Page 29: Recurrent II-3

•The structure of the State Space model includes:

•An MLP network with a single hidden layer;

•The hidden neurons define the state of the network;

•A linear output layer;

•A feedback of the hidden layer to the input layer assuming a memory of q lags;

•The output is determined by the coupled equations:

x(n+1)=f(x(n),u(n))

y(n+1)=C x(n+1)

Page 30: Recurrent II-4

Where f is a suitable nonlinear function characterising the hidden layer. x is the state vector, as it is produced by the hidden layer. It has q components. y is the output vector and it has p components. The input vector is given by u and it has m components.
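A minimal sketch of one update of this state space model, with f parameterised as a single tanh layer; that parameterisation, the weight names and the dimensions used are illustrative assumptions.

```python
import numpy as np

def state_space_step(x, u, Wa, Wb, C, b):
    """x(n+1) = f(x(n), u(n)),  y(n+1) = C x(n+1).

    x : state vector (q components, the hidden-layer outputs)
    u : input vector (m components)
    y : output vector (p components), produced by the linear output layer C
    """
    x_next = np.tanh(Wa @ x + Wb @ u + b)   # nonlinear hidden (state) update
    y_next = C @ x_next                     # linear output layer
    return x_next, y_next

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, m, p_out = 4, 2, 1
    Wa, Wb = rng.standard_normal((q, q)), rng.standard_normal((q, m))
    C, b = rng.standard_normal((p_out, q)), np.zeros(q)
    x = np.zeros(q)
    for u in rng.standard_normal((5, m)):
        x, y = state_space_step(x, u, Wa, Wb, C, b)
        print(y)
```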

•A schematic representation of the network is given below:

Page 31: Recurrent II-5

•The structure of the RMLP includes:

•One or more hidden layers;

•Feedback around each layer;

•The general structure of a static MLP network;

•The output is calculated as follows (assuming that $x_{I}$, $x_{II}$ and $x_{O}$ are the outputs of the first hidden, second hidden and output layers respectively):

$x_{I}(n+1) = \varphi_{I}(x_{I}(n),\, u(n))$

$x_{II}(n+1) = \varphi_{II}(x_{II}(n),\, x_{I}(n+1))$

$x_{O}(n+1) = \varphi_{O}(x_{O}(n),\, x_{II}(n+1))$

Page 32: Recurrent II-6

where the functions $\varphi_{I}(\cdot)$, $\varphi_{II}(\cdot)$ and $\varphi_{O}(\cdot)$ denote the activation functions of the corresponding layers.
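A compact sketch of one RMLP step with two hidden layers and feedback around each layer, following the equations above; the tanh activations, weight shapes and names are illustrative assumptions.

```python
import numpy as np

def rmlp_step(x1, x2, xo, u, params):
    """x_I(n+1) = phi_I(x_I(n), u(n)); x_II(n+1) = phi_II(x_II(n), x_I(n+1));
    x_O(n+1) = phi_O(x_O(n), x_II(n+1)). Each layer feeds back to itself."""
    W1u, W11, W21, W22, Wo2, Woo = params
    x1_next = np.tanh(W1u @ u + W11 @ x1)       # first hidden layer with self-feedback
    x2_next = np.tanh(W21 @ x1_next + W22 @ x2)  # second hidden layer with self-feedback
    xo_next = np.tanh(Wo2 @ x2_next + Woo @ xo)  # output layer with self-feedback
    return x1_next, x2_next, xo_next

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, h1, h2, o = 2, 4, 3, 1
    params = (rng.standard_normal((h1, m)), rng.standard_normal((h1, h1)),
              rng.standard_normal((h2, h1)), rng.standard_normal((h2, h2)),
              rng.standard_normal((o, h2)), rng.standard_normal((o, o)))
    x1, x2, xo = np.zeros(h1), np.zeros(h2), np.zeros(o)
    for u in rng.standard_normal((3, m)):
        x1, x2, xo = rmlp_step(x1, x2, xo, u, params)
        print(xo)
```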

•A schematic representation is given below:

Page 33: Recurrent II-7

•Some theorems on the computational power of recurrent networks:

•Thm 1: All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions.

•Thm 2: NARX networks with one layer of hidden neurons with bounded, one-sided saturated activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.

Page 34: Recurrent II-8

•Corollary: NARX networks with one hidden layer of neurons with BOSS (bounded, one-sided saturated) activation functions and a linear output neuron are Turing equivalent.

Page 35: Recurrent II-9

•The training of the recurrent networks can be done with two methods:

•BackPropagation Through Time

•Real-Time Recurrent Learning

•We can train a recurrent network with either epoch-based or continuous training operation. However, an epoch in recurrent networks does not mean the presentation of all learning patterns but rather denotes the length of a single sequence that we use for training. So an epoch in a recurrent network corresponds to presenting only one pattern to the network.

•At the end of an epoch the network stabilises.

Page 36: Recurrent II-10

•Some useful heuristics for the training are given below:

•Lexicographic order of training samples should be followed, with the shortest strings of symbols being presented to the network first;

•The training should begin with a small training sample and then its size should be incrementally increased as the training proceeds;

•The synaptic weights of the network should be updated only if the absolute error on the training sample currently being processed by the network is greater than some prescribed criterion;

Page 37: Recurrent II-11

•The use of weight decay during training is recommended; weight decay was discussed in WK3

•The BackPropagation Through Time algorithm proceeds by unfolding a network in time. To be more specific:

•Assume that we have a recurrent network N which is required to learn a temporal task starting from time n0 and going all the way to time n.

•Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N.

Page 38: Recurrent II-12

•The network N* is related to the original network N as follows:

•For each time step in the interval (n0,n], the network N* has a layer containing K neurons, where K is the number of neurons contained in network N;

•In every layer of network N* there is a copy of each neuron in network N;

•For every time step l ∈ [n0, n], the synaptic connection from neuron i in layer l to neuron j in layer l+1 of the network N* is a copy of the synaptic connection from neuron i to neuron j in the network N.

•The following example explains the idea of unfolding:

Page 39: Recurrent II-13

•We assume that we have a network with two neurons which is unfolded for a number of steps, n:

Page 40: Recurrent II-14

• We present now the method of Epochwise BackPropagation Through Time.

• Let the dataset used for training the network be partitioned into independent epochs, with each epoch representing a temporal pattern of interest. Let n0 denote the start time of an epoch and n1 its end time.

• We can define the following cost function:

$E_{\mathrm{total}}(n_0, n_1) = \frac{1}{2} \sum_{n=n_0}^{n_1} \sum_{j \in A} e_j^{2}(n)$

Page 41: Recurrent II-15

Where A is the set of indices j pertaining to those neurons in the network for which desired responses are specified, and ej(n) is the error signal at the output of such a neuron measured with respect to some desired response.

Page 42: Recurrent II-16

• The algorithm proceeds as follows:

1. For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch. The initial state doesn’t have to be the same for each epoch of training. Rather, what is important is for the initial state for the new epoch to be different from the state reached by the network at the end of the previous epoch;

Page 43: Recurrent II-17

2. First a single forward pass of the data through the network for the interval (n0, n1) is performed. The complete record of input data, network state (i.e. synaptic weights), and desired responses over this interval is saved;

3. A single backward pass over this past record is performed to compute the values of the local gradients:

for all $j \in A$ and $n_0 < n \le n_1$. This computation is performed by the formula:

$\delta_j(n) = -\dfrac{\partial E_{\mathrm{total}}(n_0, n_1)}{\partial v_j(n)}$

Page 44: Recurrent II-18

$\delta_j(n) = \begin{cases} \varphi'(v_j(n))\, e_j(n) & \text{for } n = n_1 \\ \varphi'(v_j(n)) \left[ e_j(n) + \sum_{k \in A} w_{kj}\, \delta_k(n+1) \right] & \text{for } n_0 < n < n_1 \end{cases}$

where $\varphi'(\cdot)$ is the derivative of the activation function with respect to its argument, and $v_j(n)$ is the induced local field of neuron j.

The use of the above formula is repeated, starting from time n1 and working back, step by step, to time n0; the number of steps involved here is equal to the number of time steps contained in the epoch.

Page 45: Recurrent II-19

4. Once the computation of back-propagation has been performed back to time n0+1, the following adjustment is applied to the synaptic weight wji of neuron j:

$\Delta w_{ji} = -\eta\, \dfrac{\partial E_{\mathrm{total}}(n_0, n_1)}{\partial w_{ji}} = \eta \sum_{n=n_0+1}^{n_1} \delta_j(n)\, x_i(n-1)$

where $\eta$ is the learning rate parameter and $x_i(n-1)$ is the input applied to the ith synapse of neuron j at time n-1.
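Putting steps 1-4 together, the sketch below runs one epoch of epochwise BPTT on a small fully connected recurrent network of K tanh neurons, all of which receive a desired response (so A is the whole set of neurons); the network parameterisation, the toy target and the omission of an update for the input weights are our own simplifying assumptions.

```python
import numpy as np

def epochwise_bptt(W, W_in, u_seq, d_seq, eta=0.01):
    """One epoch of BPTT for a fully recurrent net y(n) = tanh(W y(n-1) + W_in u(n)).

    u_seq : inputs u(n), shape (T, m)
    d_seq : desired responses d(n), shape (T, K) -- every neuron is an output here
    Returns the updated recurrent weight matrix W.
    """
    T, K = d_seq.shape
    y = np.zeros((T + 1, K))            # y[0] is the initial state y(n0)
    v = np.zeros((T + 1, K))
    # Forward pass over the epoch (n0, n1], recording states and induced fields.
    for n in range(1, T + 1):
        v[n] = W @ y[n - 1] + W_in @ u_seq[n - 1]
        y[n] = np.tanh(v[n])
    e = d_seq - y[1:]                   # error signals e(n)
    # Backward pass: delta(n1) = phi'(v) e; delta(n) = phi'(v)[e + W^T delta(n+1)].
    delta = np.zeros((T + 1, K))
    for n in range(T, 0, -1):
        back = e[n - 1] if n == T else e[n - 1] + W.T @ delta[n + 1]
        delta[n] = (1.0 - np.tanh(v[n]) ** 2) * back
    # Weight adjustment: dW[j, i] = eta * sum_n delta_j(n) * y_i(n-1).
    dW = eta * sum(np.outer(delta[n], y[n - 1]) for n in range(1, T + 1))
    return W + dW

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, m, T = 2, 1, 10
    W, W_in = 0.1 * rng.standard_normal((K, K)), 0.1 * rng.standard_normal((K, m))
    u_seq = rng.standard_normal((T, m))
    d_seq = np.roll(u_seq, 1, axis=0) @ np.ones((m, K))   # a toy target sequence
    W = epochwise_bptt(W, W_in, u_seq, d_seq)
    print(W)
```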

Page 46: Recurrent II-20

• There is a potential problem with the method, which is called the Vanishing Gradients Problem, i.e. the corrections calculated for the weights are not large enough when using methods based on steepest descent.

• However, this is currently a research problem and one has to consult the literature for details.

Page 47: Conclusions

•Dynamic networks learn sequences in contrast to the static mappings of MLP and RBF networks.

•Time representation takes place explicitly or implicitly.

•The implicit form includes time-delayed versions of the input vector and use of a static network model afterwards or the use of recurrent networks.

•The explicit form uses a generalisation of the MLP model where a synapse is modelled now as a weight vector and not as a single number. The synapse activation is not any more the product of the synapse’s weight with the output of a previous neuron but rather the inner product of the synapse’s weight vector with the time-delayed state vector of the previous neuron.

Page 48: Conclusions I

•The extended MLP networks with explicit temporal structure are trained with the Temporal BackPropagation algorithm.

•The recurrent networks include a number of simple and complex architectures. In the simpler cases we train the networks using the standard BackPropagation algorithm.

•In the more complex cases we first unfold the network in time and then train it using the BackPropagation Through Time algorithm.