Barbara Hammer - Perspectives on Learning Symbolic Data with Connectionistic Systems



    Perspectives on Learning Symbolic Data with

    Connectionistic Systems

    Barbara Hammer

University of Osnabrück, Department of Mathematics/Computer Science, D-49069 Osnabrück,

    Germany, e-mail: [email protected].

    Abstract. This paper deals with the connection of symbolic and subsymbolic systems. It focuses

    on connectionistic systems processing symbolic data. We examine the capability of learning sym-

    bolic data with various neural architectures which constitute partially dynamic approaches: dis-

    crete time partially recurrent neural networks as a simple and well established model for process-

    ing sequences, and advanced generalizations like holographic reduced representation, recursive

    autoassociative memory, and folding networks for processing tree structured data. The methods

    share the basic dynamics, but they differ in the specific training methods. We consider the follow-

    ing questions: Which are the representational capabilities of the architectures from an algorithmic

    point of view? Which are the representational capabilities from a statistical point of view? Are

    the architectures learnable in an appropriate sense? Are they efficiently learnable?

    1 Introduction

    Symbolic methods and connectionistic or subsymbolic systems constitute complemen-

    tary approaches in order to automatically process data appropriately. Various learning

    algorithms for learning an unknown regularity based on training examples exist in both

    domains: decision trees, rule induction, inductive logic programming, version spaces,

    . . . on the one side and Bayesian reasoning, vector quantization, clustering algorithms,

    neural networks, . . . on the other side [24]. The specific properties of the learning algo-

    rithms are complementary as well. Symbolic methods deal with high level information

    formulated via logical formulas, for example; data processing is human-understandable;

    hence it is often easy to involve prior knowledge, to adapt the training outputs to specific

    domains, or to retrain the system on additional data; at the same time, training is often

    complex, inefficient, and sensitive to noise. In comparison, connectionistic systems deal

    with low level information. Since they perform pattern recognition, their behavior is not

    human understandable and often, adaptation to specific situations or additional data re-

    quires complete retraining. At the same time, training is efficient, noise tolerant, and

    robust. Common data structures for symbolic methods are formulas or terms, i.e., high

    level data with little redundant information and a priori unlimited structure where lots

    of information lie in the interaction of the single data components. As an example, the

meaning of each of the symbols in the term father(John,Bill) is essentially connected to its respective position in the term. No symbol can be omitted without losing important information. If Bill were the friend of Mary's brother, the above term could

    be substituted by father(John,friend(brother(Mary))), a term with a different length and

    structure. Connectionistic methods process patterns, i.e., real vectors of a fixed dimen-

    sion, which commonly comprise low level, noisy, and redundant information of a fixed


and determined form. The precise value and location of the single components is often unimportant; information comes from the sum of local features. As an example, Fig. 1 depicts various representations of a digit; each picture can be represented by a vector of gray-levels; the various pictures differ considerably in detail while preserving important features such as the two curved lines of the digit.

Fig. 1. Example of subsymbolic data: hand-written digits.

    Often, data possess both, symbolic and subsymbolic aspects: As an example, database

    entries may combine the picture of a person, his income, and his occupation; web sites

    consist of text, pictures, formulas, and links; arithmetical formulas may contain vari-

    ables and symbols as well as real numbers. Hence appropriate machine learning meth-

    ods have to process hybrid data. Moreover, people are capable of dealing with both

    aspects at the same time. It would be interesting to see which mechanisms allow artifi-

    cial learning systems to handle both aspects simultaneously. We will focus on connec-

    tionistic systems capable of dealing with symbolic and hybrid data. Our main interests

    are twofold: On the one hand, we would like to obtain an efficient learning system

    which can be used for practical applications involving hybrid data. On the other hand,

    we would like to gain insight into the questions of how symbolic data can be processed

    with connectionistic systems in principle; do there exist basic limitations; does this point

    of view allow further insight into the black-box dynamics of connectionistic systems?

    Due to the nature of symbolic and hybrid data, there exist two ways of asking questions

    about the theoretical properties of those mechanisms: the algorithmic point of view and

the statistical point of view. One can, for example, consider the question whether symbolic mechanisms can be learned with hybrid systems exactly; alternatively, the focus

    can lie on the property that the probability of poor performance on input data can be

    limited. Generally speaking, one can focus on the symbolic data; alternatively, one can

    focus on the connectionistic systems. It will turn out that this freedom leads to both,

    further insight into the systems as well as additional problems which are to be solved.

    Various mechanisms extend connectionistic systems with symbolic aspects; a ma-

    jor problem of networks dealing with symbolic or hybrid data lies in the necessity of

    processing structures with a priori unlimited size. Mainly three different approaches

    can be found in the literature: Symbolic data may be represented by a fixed number of

    features and further processed with standard neural networks. Time series, as an exam-

    ple, may be represented by a local time window of fixed length and additional global

    features such as the overall trend [23]. Formulas may be represented by the involved

symbols and a measure of their complexity. This approach is explicitly static: Data are

    encoded in a finite dimensional vector space via problem specific features before further

    processing with a connectionistic system. Obviously, the representation of data is not

    fitted to the specific learning task since learning is independent of encoding. Moreover,

    it may be difficult or in general impossible to find a representation in a finite dimen-


    sional vector space such that all relevant information is preserved. As an example, the

    terms equal(a,a), equal(f(a),f(a)), equal(f(f(a)),f(f(a))), . . . could be represented by the

number of occurrences of the symbol f at the first and second position in the respective term. The terms equal(g(a,g(a,a)),g(a,g(a,a))), equal(g(g(a,a),a),g(g(a,a),a)) can no longer be represented in the same way without loss of information; we have to add

    an additional part encoding the order of the symbols.

    Alternatively, the a priori unlimited structure of the inputs can be mapped to a pri-

    ori unlimited processing time of the connectionistic system. Standard neural networks

    are equipped with additional recurrent connections for this purpose. Data are processed

    in a dynamic way involving the additional dimension of time. This can either be fully

    dynamic, i.e., symbolic input and output data are processed over time, the precise dy-

    namics and number of recurrent computation steps being unlimited and correlated to

    the respective computation; or the model can be partially dynamic and implicitly static,

    i.e., the precise dynamics are correlated to the structure of the respective symbolic data

    only. In the first case, complex data may be represented via a limiting trajectory of the

system, via the location of neurons with highest activities in the neural system, or via synchronous spike trains, for example. Processing may be based on Hebbian or compet-

    itive activation such as in LISA or SHRUTI [15,39] or on an underlying potential which

    is minimized such as in Hopfield networks [14]. There exist advanced approaches which

    enable complex reasoning or language processing with fully dynamic systems; how-

    ever, these models are adapted to the specific area of application and require a detailed

    theoretical investigation for each specific approach.

    In the second case, the recurrent dynamics directly correspond to the data structure

and can be determined precisely provided the input or output structure, respectively, is

    known. One can think of the processing as an inherently static approach: The recur-

    rence enables the systems to encode or decode data appropriately. After encoding, a

    standard connectionistic representation is available for the system. The difference to a

    feature based approach consists in the fact that the encoding is adapted to the specificlearning task and need not be separated from the processing part, coding and processing

    constitute one connected system. A simple example of these dynamics are discrete time

    recurrent neural networks or Elman networks which can handle sequences of real vec-

tors [6,9]. Knowledge of the respective structure, i.e., the length of the sequence, allows one

    to substitute the recurrent dynamics by an equivalent standard feedforward network.

    Input sequences are processed step by step such that the computation for each entry

    is based on the context of the already computed coding of the previous entries of the

    sequence. A natural generalization of this mechanism allows neural encoding and de-

    coding of tree structured data as well. Instead of linear sequences, one has to deal with

    branchings. Concrete implementations of this approach are the recursive autoassocia-

    tive memory (RAAM) [30] and labeled RAAM (LRAAM) [40], holographic reduced

    representations (HRR) [29], and recurrent and folding networks [7]. They differ in the

method of how they are trained and in the question as to whether the inputs, the outputs, or both may be structured or real valued, respectively. The basic recurrent dynamics are

the same for all approaches. The possibility to deal with symbolic data, i.e., tree structures,

    relies on some either fixed or trainable recursive encoding and decoding of data with

    simple mappings computed by standard networks. Hence the approaches are uniform


    and a general theory can be developed in contrast to often very specific fully dynamic

    systems. However, the idea is limited to data structures whose dynamics can be mapped

    to an appropriate recursive network. It includes recursive data like sequences or tree

structures; possibly cyclic graphs are not yet covered.

    We will start with the investigation of standard recurrent networks because they are

    a well established and successful method and, at the same time, demonstrate a typical

    behavior. Their in principle capacity as well as their learnability can be investigated

    from an algorithmic as well as a statistical point of view. From an algorithmic point of

    view, the connection to classical approaches like finite automata and Turing machines

    is interesting. Moreover, this connection allows partial insight into the way in which

the networks perform their tasks. There are only a few results concerning the learnability

    of these dynamics from an algorithmic point of view. Afterwards, we will study the

    statistical learnability and approximation ability of recurrent networks. These results

    are transferred to various more general approaches for tree structured data.

    2 Network Dynamics

    First, the basic recurrent dynamics are defined. As usual, a feedforward network con-

    sists of a weighted directed acyclic graph of neurons such that a global processing rule

    is obtained via successive local computations of the neurons. Commonly, the neurons

iteratively compute their activation

$$a_i = \sigma_i\Big(\sum_{j \to i} w_{ji}\, a_j + \theta_i\Big),$$

$j \to i$ denoting the predecessors of neuron $i$, $w_{ji}$ denoting some real-valued weight assigned to connection $j \to i$, $\theta_i \in \mathbb{R}$ denoting the bias of neuron $i$, and $\sigma_i: \mathbb{R} \to \mathbb{R}$ its activation function.
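As a minimal illustration of this computation (my own sketch, not code from the paper; the 3-2-1 architecture and the random weights are purely hypothetical placeholders), a layered feedforward network with a sigmoidal activation function can be evaluated as follows in Python:

    import numpy as np

    def sgd(x):
        # sigmoidal activation function
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x, layers):
        # each layer is a pair (W, theta); every neuron computes
        # sgd(sum_j w_ji * a_j + theta_i) over its predecessor activations a_j
        a = np.asarray(x, dtype=float)
        for W, theta in layers:
            a = sgd(W @ a + theta)
        return a

    # a hypothetical 3-2-1 multilayer architecture with random weights and biases
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(2, 3)), rng.normal(size=2)),
              (rng.normal(size=(1, 2)), rng.normal(size=1))]
    print(forward([0.5, -1.0, 0.25], layers))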

    Starting with the neurons without predecessors, the so called input neurons, which ob-

    tain their activation from outside, the neurons successively compute their activation

    until the output of the network can be found at some specified output neurons. Hence

    feedforward networks compute functions from a finite dimensional real-vector space

    into a finite dimensional real-vector space. A network architecture only specifies the

    directed graph and the activation functions, but not the weights and biases. Often, so-

    called multilayer networks or multilayer architectures are used, meaning that the graph

    decomposes into subsets, so-called layers, such that connections can only be found

    between consecutive layers. It is well known that feedforward neural networks are uni-

    versal approximators in an appropriate sense: Every continuous or measurable function,

respectively, can be approximated by some network with appropriate activation function

    on any compact input domain or for inputs of arbitrarily high probability, respectively.

    Moreover, such mappings can be learned from a finite set of examples. This, in more

    detail, means that two requirements are met. First, neural networks yield valid general-

    ization: The empirical error, i.e., the error on the training data, is representative for the

    real error of the architecture, i.e., the error for unknown inputs, if a sufficiently large

training set has been taken into account. Concrete bounds on the required training set size can be derived. Second, effective training algorithms for minimizing the empirical

    error on concrete training data can be found. Usually, training is performed with some

    modification of backpropagation like the very robust and fast method RProp [32].

    Sequences of real vectors constitute simple symbolic structures. They are difficult

    for standard connectionistic methods due to their unlimited length. We denote the set


of sequences with elements in an alphabet $\Sigma$ by $\Sigma^*$. A common way of processing sequences with standard networks consists in truncating, i.e., a sequence $[x_1, \ldots, x_t]$ with initially unknown length $t$ is substituted by only a part $[x_1, \ldots, x_T]$ with a priori fixed time horizon $T$. Obviously, truncation usually leads to information loss. Alter-

    natively, one can equip feedforward networks with recurrent connections and use the

    further dimension of time. Here, we introduce the general concept of recurrent coding

    functions. Every mapping with appropriate domain and codomain induces a mapping

    on sequences or into sequences, respectively, via recursive application as follows:

Definition 1. Assume $\Sigma$ is some set. Any function $f: \Sigma \times \mathbb{R}^n \to \mathbb{R}^n$ and initial context $y_0 \in \mathbb{R}^n$ induce a recursive encoding

$$\mathrm{enc}_f: \Sigma^* \to \mathbb{R}^n, \qquad \mathrm{enc}_f([x_1, \ldots, x_t]) = \begin{cases} y_0 & \text{if } t = 0, \\ f(x_t, \mathrm{enc}_f([x_1, \ldots, x_{t-1}])) & \text{otherwise.} \end{cases}$$

Any function $g = (g_0, g_1): \mathbb{R}^n \to \Sigma \times \mathbb{R}^n$ and final set $Y \subset \mathbb{R}^n$ induce a recursive decoding

$$\mathrm{dec}_g: \mathbb{R}^n \to \Sigma^*, \qquad \mathrm{dec}_g(x) = \begin{cases} [\,] & \text{if } x \in Y, \\ [g_0(x)] \cdot \mathrm{dec}_g(g_1(x)) & \text{otherwise,} \end{cases}$$

where $\cdot$ denotes concatenation.

Note that $\mathrm{dec}_g(x)$ may not be defined if the decoding does not lead to values in $Y$. Therefore one often restricts decoding to decoding of sequences up to a fixed finite length in practice. Recurrent neural networks compute the composition of up to three functions $\mathrm{dec}_g \circ h \circ \mathrm{enc}_f$ depending on their respective domain and codomain, where $f$, $g$, and $h$ are computed by standard networks. Note that this notation is some-

    what unusual in the literature. Mostly, recurrent networks are defined via their transition

function, referring to the standard dynamics of discrete dynamical systems. However,

    the above definition has the advantage that the role of the single network parts can be

made explicit: Symbolic data are first encoded into a connectionistic representation; this connectionistic representation is further processed with a standard network; finally, the

    implicit representation is decoded to symbolic data. In practice, these three parts are not

    well separated and one can indeed show that the transformation part can be included

    in either encoding or decoding. Encoding and decoding need not compute a precise

    encoding or decoding such that data can be restored perfectly. Encoding and decoding

    are part of a system which as a whole should approximate some function. Hence only

    those parts of the data have to be taken into account which contribute to the specific

    learning task. Recurrent networks are mostly used for time series prediction, i.e., the

decoding $\mathrm{dec}_g$ is dropped. Long term prediction of time series, where the decoding part

    is necessary, is a particularly difficult task and can rarely be found in applications.
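To make Definition 1 concrete, the following small Python sketch shows the recursive encoding and decoding (my own illustration, assuming that f and g are given as ordinary Python functions; the toy function f at the end is purely hypothetical and stands in for a trained network):

    from typing import Callable, List, Tuple

    Code = Tuple[float, ...]          # a connectionistic representation in R^n

    def enc(f: Callable[[object, Code], Code], y0: Code, xs: List[object]) -> Code:
        # Definition 1: enc([]) = y0, enc([x1..xt]) = f(xt, enc([x1..x_{t-1}]))
        code = y0
        for x in xs:
            code = f(x, code)
        return code

    def dec(g: Callable[[Code], Tuple[object, Code]],
            is_final: Callable[[Code], bool], x: Code, max_len: int = 100) -> List[object]:
        # Definition 1: dec(x) = [] if x lies in the final set, otherwise the label
        # g0(x) is emitted and decoding continues with g1(x); the length cap
        # reflects that dec may be a partial function
        out: List[object] = []
        while not is_final(x) and len(out) < max_len:
            label, x = g(x)
            out.append(label)
        return out

    # toy usage with a hypothetical f that packs a bit sequence into one number
    f = lambda bit, ctx: (0.5 * ctx[0] + 0.5 * bit,)
    print(enc(f, (0.0,), [1, 0, 1, 1]))   # a single real number encoding the sequence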

    A second advantage of the above formalism is the possibility to generalize the dy-

namics to tree structured data. Note that terms and formulas possess a natural representation via a tree structure: The single symbols, i.e., the variables, constants, function

    symbols, predicates, and logical symbols are encoded in some real-vector space via

    unique values, e.g., natural numbers or unary vectors; these values correspond to the

    labels of the nodes in a tree. The tree structure directly corresponds to the structure of

    the term or formula; i.e., subterms of a single term correspond to subtrees of a node


equipped with the label encoding the function symbol. See Fig. 2 for an example.

Fig. 2. Example of a tree representation of symbolic data: the symbols are encoded by unary vectors such as (1,0,0), (0,1,0), and (0,0,1), which form the labels of the nodes of the corresponding trees.

In

the following, we restrict the maximum arity of functions and predicates to some fixed value $k$. Hence the data we are interested in are trees where each node has at most $k$ successors. Expanding the tree by empty nodes if necessary, we can restrict ourselves to the case of trees with fan-out exactly $k$ of the nodes. Hence we will deal with tree structures with fan-out $k$ as inputs or outputs of network architectures in the following.

Definition 2. A $k$-tree with labels in some set $\Sigma$ is either the empty tree, which we denote by $\xi$, or it consists of a root labeled with some $a \in \Sigma$ and $k$ subtrees, some of which may be empty, $t_1, \ldots, t_k$. In the latter case we denote the tree by $a(t_1, \ldots, t_k)$. Denote the set of $k$-trees with labels in $\Sigma$ by $(\Sigma)^{*_k}$.

    The recursive nature of trees induces a natural dynamics for recursively encoding

    or decoding trees to real vectors. We can define an induced encoding or decoding, re-

    spectively, for each mapping with appropriate arity in the following way:

Definition 3. Denote by $\Sigma$ a set. Any mapping $f: \Sigma \times (\mathbb{R}^n)^k \to \mathbb{R}^n$ and initial context $y_0 \in \mathbb{R}^n$ induce a recursive encoding

$$\mathrm{enc}_f: (\Sigma)^{*_k} \to \mathbb{R}^n, \qquad t \mapsto \begin{cases} y_0 & \text{if } t = \xi, \\ f(a, \mathrm{enc}_f(t_1), \ldots, \mathrm{enc}_f(t_k)) & \text{if } t = a(t_1, \ldots, t_k). \end{cases}$$

Any mapping $g = (g_0, g_1, \ldots, g_k): \mathbb{R}^n \to \Sigma \times (\mathbb{R}^n)^k$ and set $Y \subset \mathbb{R}^n$ induce a recursive decoding

$$\mathrm{dec}_g: \mathbb{R}^n \to (\Sigma)^{*_k}, \qquad x \mapsto \begin{cases} \xi & \text{if } x \in Y, \\ g_0(x)\big(\mathrm{dec}_g(g_1(x)), \ldots, \mathrm{dec}_g(g_k(x))\big) & \text{otherwise.} \end{cases}$$

Again, $\mathrm{dec}_g$ might be a partial function. Therefore decoding is often restricted to decoding of trees up to a fixed height in practice. The encoding recursively applies a mapping in order to obtain a code for a tree in a real-vector space. One starts at the leaves and recursively encodes the single subtrees. At each level the already computed codes of the respective subtrees are used as context. The recursive decoding is defined in a similar manner: Recursively applying some decoding function to a real vector yields the label of the root and codes for the $k$ subtrees. In the connectionistic setting, the two mappings used for encoding or decoding, respectively, can be computed by standard feedforward neural networks. As in the linear case, i.e., the case of simple recurrent networks, one can combine the mappings $\mathrm{enc}_f$, $\mathrm{dec}_g$, and $h$ depending on the specific learning task.
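The recursive tree encoding of Definition 3 can be sketched in a few lines of Python (my own illustration; the Tree class and the toy encoding function f are hypothetical and stand in for a trained feedforward network):

    from dataclasses import dataclass
    from typing import Callable, Optional, Tuple

    Code = Tuple[float, ...]

    @dataclass
    class Tree:
        # a k-tree: a label and k subtrees; None stands for the empty tree
        label: object
        children: Tuple[Optional["Tree"], ...]

    def enc_tree(f: Callable[..., Code], y0: Code, t: Optional[Tree]) -> Code:
        # Definition 3: the empty tree is mapped to the initial context y0,
        # a(t1, ..., tk) is mapped to f(a, enc(t1), ..., enc(tk))
        if t is None:
            return y0
        return f(t.label, *(enc_tree(f, y0, c) for c in t.children))

    # toy usage: a hypothetical encoding function for 2-trees with 1-dimensional codes
    f = lambda label, c1, c2: (hash((label, c1, c2)) % 997 / 997.0,)
    example = Tree("f", (Tree("a", (None, None)), Tree("a", (None, None))))
    print(enc_tree(f, (0.0,), example))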


    Note that this definition constitutes a natural generalization of standard recurrent

    networks and hence allows for successful practical applications as well as general in-

    vestigations concerning concrete learning algorithms, the connection to classical mech-

    anisms like tree automata, and the theoretical properties of approximation ability and

    learnability. However, it is not biologically motivated compared to standard recurrent

    networks, and though this approach can shed some light on the possibility of dealing

    with structured data in connectionistic systems, it does not necessarily enlighten the

    way in which humans solve these tasks. We will start with a thorough investigation of

    simple recurrent networks since they are biologically plausible and, moreover, signifi-

    cant theoretical difficulties and benefits can already be found at this level.

    3 Recurrent Neural Networks

    Recurrent networks are a natural tool in any domain where time plays a role, such as

speech recognition, control, or time series prediction, to mention just a few [8,9,25,41]. They are also used for the classification of symbolic data such as DNA sequences [31].

    Turing Capabilities

    The fact that their inputs and outputs may be sequences suggests the comparison to

    other mechanisms operating on sequences, such as classical Turing machines. One can

    consider the internal states of the network as a memory or tape of the Turing machine.

    Note that the internal states of the network may consist of real values, hence an infinite

    memory is available in the network. In Turing machines, operations on the tape are per-

    formed. Each operation can be simulated in a network by a recursive computation step

    of the transition function. In a Turing machine, the end of a computation is indicated by

    a specific final state. In a network, this behavior can be mimicked by the activation of

    some specific neuron which indicates whether the computation is finished or still con-

    tinues. The output of the computation can be found at the same time step at some other

    specified neuron of the network. Note that computations of a Turing machine which do

    not halt correspond to recursive computations of the network such that the value of the

    specified halting neuron is different from some specified value. A schematic view of

    such a computation is depicted in Fig. 3. A possible formalization is as follows:

Definition 4. A (possibly partial) function $\phi: \{0,1\}^* \to \{0,1\}$ can be computed by a recurrent neural network if feedforward networks $f: \{0,1\} \times \mathbb{R}^n \to \mathbb{R}^n$, $h: \mathbb{R}^n \to \mathbb{R}^n$, and $g: \mathbb{R}^n \to \mathbb{R}$ exist such that $\phi(x) = g(h^{t_x}(\mathrm{enc}_f(x)))$ for all sequences $x$, where $t_x$ denotes the smallest number of iterations such that the activation of some specified output neuron of the $h$ part is contained in a specified set encoding the end of the computation after iteratively applying $h$ to $\mathrm{enc}_f(x)$.

Note that simulations of Turing machines are merely of theoretical interest; such com-

    putation mechanisms will not be used in practice. However, the results shed some light

on the power of recurrent networks. The networks' capacity naturally depends on the

    choice of the activation functions. Common activation functions in the literature are

    piecewise polynomial or S-shaped functions such as:



    Fig. 3. Turing computation with a recurrent network.

the perceptron function $H(x) = 0$ for $x \le 0$, $H(x) = 1$ for $x > 0$,

the semilinear activation $\mathrm{lin}(x) = x$ for $0 \le x \le 1$, $\mathrm{lin}(x) = H(x)$ otherwise,

the sigmoidal function $\mathrm{sgd}(x) = 1/(1 + e^{-x})$.
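For reference, these three activation functions can be written in Python as follows (a straightforward transcription under the conventions above, not code from the paper):

    import numpy as np

    def perceptron(x):
        # H(x): 0 for x <= 0, 1 otherwise
        return np.where(x > 0.0, 1.0, 0.0)

    def semilinear(x):
        # lin(x): the identity on [0, 1], cut off to 0 below and 1 above
        return np.clip(x, 0.0, 1.0)

    def sgd(x):
        # standard sigmoidal (logistic) function
        return 1.0 / (1.0 + np.exp(-x))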

    Obviously, recurrent networks with a finite number of neurons and the perceptron acti-

vation function have at most the power of finite automata, since their internal stack is finite. In [38] it is shown that recurrent networks with the semilinear activation function

    are Turing universal, i.e., there exists for every Turing machine a finite size recurrent

    network which computes the same function. The proof consists essentially in a simu-

    lation of the stacks corresponding to the left and right half of the Turing tape via the

    activation of two neurons. Additionally, it is shown that standard tape operations like

    push and pop and Boolean operations can be computed with a semilinear network.

    The situation is more complicated for the standard sigmoidal activation function since

exact classical computations which require precise activations $0$ or $1$, as an example,

    can only be approximated within a sigmoidal network. Hence the approximation errors

    which add up in recurrent computations must be controlled. [16] shows the Turing uni-

    versality of sigmoidal recurrent networks via simulating so-called clock machines, a

    Turing-universal formalism which, unfortunately, leads to an exponential delay. How-

    ever, people believe that standard sigmoidal recurrent networks are Turing universal

    with polynomial resources, too, although the formal proof is still missing.

    In [37] the converse direction, simulation of neural network computations with clas-

    sical mechanisms is investigated. The authors relate semilinear recurrent networks to

    so-called non-uniform Boolean circuits. This is particularly interesting due to the fact

that non-uniform circuits are super-Turing universal, i.e., they can compute every (possibly non-computable) function, possibly requiring exponential time. Speaking in terms

    of neural networks: Additionally to the standard operations, networks can use the un-

limited storage capacity of the single digits in their real weights as an oracle; a linear

    number of such digits is available in linear time. Again, the situation is more diffi-

    cult for the sigmoidal activation function. The super-Turing capability is demonstrated

    in [36], for example. [11] shows the super-Turing universality in possibly exponential

    time and which is necessary in all super-Turing capability demonstrations of recurrentnetworks, of course with at least one irrational weight. Note that the latter results rely

    on an additional severe assumption: The operations on the real numbers are performed

    with infinite precision. Hence further investigation could naturally be put in a line with

    the theory of computation on the real numbers [3].


    Finite Automata and Languages

    The transition dynamics of recurrent networks directly correspond to finite automata,

hence comparing to finite automata is a very natural question. For a formal definition, a finite automaton with $N$ states computes a function $P \circ \mathrm{enc}_\delta: \Sigma^* \to \{0,1\}$, where $\Sigma$ is a finite alphabet, $\delta: \Sigma \times \{1, \ldots, N\} \to \{1, \ldots, N\}$ is a transition function mapping an input letter and a context state to a new state, $s_0 \in \{1, \ldots, N\}$ is the initial state, and $P$ is a projection of the states to $\{0,1\}$. A language $L \subseteq \Sigma^*$ is accepted by an automaton if some automaton computing $P \circ \mathrm{enc}_\delta$ exists such that $L = \{x \in \Sigma^* \mid P(\mathrm{enc}_\delta(x)) = 1\}$.

    not surprising that finite automata and some context sensitive languages, too, can be

    simulated by recurrent networks. However, automata simulations have practical conse-

    quences: The constructions lead to effective techniques of automata rule insertion and

    extraction; moreover, the automaton behavior is even learnable from data as demon-

    strated in computer simulations. It has been shown in [27], for example, that finite

automata can be simulated by recurrent networks of the form $g \circ \mathrm{enc}_f$, $g$ being a simple projection and $f$ being a standard feedforward network. The number of neurons which are sufficient in $f$ is upper bounded by a linear term in $N$, the number of states of the automaton. Moreover, the perceptron activation function or the sigmoidal activation function or any other function with similar properties will do. One could ask whether fewer neurons are sufficient, since one could encode $N$ states in the activation of only $\log N$ binary valued neurons. However, an abstract argumentation shows that, at least for perceptron networks, a number of neurons of order at least $\sqrt{N/\log N}$ is necessary. Since this argumentation can be used at several places, we shortly outline the main steps:

The set of finite automata with $N$ states and binary inputs defines the class of functions computable with such a finite automaton, say $\mathcal{F}_N$. Assume a network with at most $n$ neurons could implement every $N$-state finite automaton. Then the class of functions computable with $n$-neuron architectures, say $\mathcal{F}'_n$, would be at least as powerful as $\mathcal{F}_N$. Consider the $N$ sequences of length $N$ with an entry $1$ precisely at the $i$th position. Assume some arbitrary binary function $b$ is fixed on these sequences. Then there exists an automaton with $O(N)$ states which implements $b$ on the sequences: We can use $N$ states for counting the position of the respective input entry and map to a specified final accepting state whenever the corresponding function value is $1$. As a consequence, we would need to find, for those $N$ sequences and every dichotomy, some recurrent network which maps the sequences accordingly, too. However, the number of input sequences which can be mapped to arbitrary values is upper bounded by the so-called pseudodimension, a quantity measuring the richness of function classes, as we will see later. In particular, this quantity can be upper bounded by a term $O(W \log(W T))$ for perceptron networks with input sequences of length $T$ and $W$ weights. Hence the lower bound of order $\sqrt{N/\log N}$ on the number of neurons follows.
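To make the automaton simulation concrete, the following numpy sketch (my own illustration, not the construction of [27]) implements a two-state parity automaton with a two-layer perceptron transition network, using one hidden unit per (state, symbol) pair:

    import numpy as np

    H = lambda x: (x > 0).astype(float)        # perceptron (Heaviside) activation

    # parity automaton over {0,1}: "even" / "odd" number of ones seen so far
    states, symbols = ["even", "odd"], [0, 1]
    delta = {("even", 0): "even", ("even", 1): "odd",
             ("odd", 0): "odd",   ("odd", 1): "even"}

    # two-layer threshold transition network:
    # one hidden unit per (state, symbol) pair (an AND), one output unit per next state (an OR)
    pairs = [(q, a) for q in states for a in symbols]
    W_hid = np.zeros((len(pairs), len(symbols) + len(states)))
    for i, (q, a) in enumerate(pairs):
        W_hid[i, symbols.index(a)] = 1.0
        W_hid[i, len(symbols) + states.index(q)] = 1.0
    theta_hid = -1.5 * np.ones(len(pairs))     # fires only if symbol and state are both active
    W_out = np.zeros((len(states), len(pairs)))
    for i, (q, a) in enumerate(pairs):
        W_out[states.index(delta[(q, a)]), i] = 1.0
    theta_out = -0.5 * np.ones(len(states))

    def run(seq):
        state = np.array([1.0, 0.0])           # start in "even"
        for a in seq:
            x = np.zeros(len(symbols)); x[symbols.index(a)] = 1.0
            hidden = H(W_hid @ np.concatenate([x, state]) + theta_hid)
            state = H(W_out @ hidden + theta_out)
        return states[int(np.argmax(state))]

    print(run([1, 0, 1, 1]))                   # "odd": the sequence contains three ones

For a fixed alphabet the number of hidden units grows linearly with the number of states, in line with the linear upper bound quoted above.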

However, various researchers have demonstrated in theory as well as in practice that sigmoidal recurrent networks can recognize some context sensitive languages as well: It is proved in [13] that they can perform counting, i.e., recognize languages of the form $\{a^n b^n c^n \mid n \in \mathbb{N}\}$ or, generally spoken, languages where the multiplicities of various

    symbols have to match. Approaches like [20] demonstrate that a finite approximation of

    these languages can be learned from a finite set of examples. This capacity is of partic-


    ular interest due to its importance for the capability of understanding natural languages

    with nested structures. The learning algorithms are usually standard algorithms for re-

    current networks which we will explain later. Commonly, they do not guarantee the

    correct long-term behavior of the networks, i.e., they lead only sometimes to the correct

    behavior for long input sequences, although they perform surprisingly well on short

training samples. Learnability, for example in the sense of identification in the limit as introduced by Gold, is not guaranteed. Approaches which explicitly tackle the long-term behavior and which, moreover, allow for a symbolic interpretation of the connectionistic processing are automata rule insertion and extraction: The possibly partial explicit knowledge of the automaton's behavior can be directly encoded in a recurrent network

    used for connectionistic processing, if necessary with further retraining of the network.

    Conversely, automata rules can be extracted from a trained network which describe the

    behavior approximately and generalize to arbitrarily long sequences [5,26].

    However, all these approaches are naturally limited due to the fact that common

    connectionistic data are subject to noise. Adequate recursive processing relies to some

extent on the accuracy of the computation and the input data. The capacity is different if noise is present: At most finite state automata can be found provided the support of the noise is limited. If the support of the noise is not limited, e.g. the noise is Gaussian, then the capacity reduces to the capacity of simple feedforward dynamics with a finite time window [21,22]. Hence, while recurrent networks can algorithmically process symbolic data in a finite approximation, the presence of noise limits their capacities.

    Learning Algorithms

    Naturally, an alternative point of view is the classical statistical scenario, i.e., possibly

noisy data allow one to learn an unknown regularity with high accuracy and confidence for data of high probability. In particular, the behavior need not be correct for every input; the learning algorithms are only guaranteed to work well in typical cases; in unlikely situations the system may fail. The classical PAC setting as introduced by Valiant for-

malizes this approach to learnability [42] as follows: Some unknown regularity $f$, for which only a finite set of examples $(x_i, f(x_i))$ is available, is to be learned. A learning

    algorithm chooses a function from a specified class of functions, e.g. given by a neural

    architecture, based on the training examples. There are two demands: The output of

    the algorithm should nearly coincide with the unknown regularity, mathematically, the

    probability that the algorithm outputs a function which differs considerably from the

    function to be learned should be small. Moreover, the algorithm should run in polyno-

    mial time, the parameters being the desired accuracy and confidence of the algorithm.

    Usually, learning separates into two steps as depicted in Fig. 4: First, a function

    class with limited capacity is chosen, e.g. the number of neurons and weights is fixed,

    such that the function class is large enough to approximate the regularity to be learned

and, at the same time, allows identification of an approximation based on the available training set, i.e., guarantees valid generalization to unseen samples. This is commonly

    addressed by the term structural risk minimization and obtained via a control of the

    so-called pseudodimension of the function class. We will address this topic later. In

    a second step, a concrete regularity is actually searched for in the specified function

    class, commonly via so called empirical risk minimization, i.e., a function is chosen


Fig. 4. Structural and empirical risk minimization: first, a function class is chosen from nested function classes of increasing complexity; second, the empirical error is minimized within this class, yielding an empirical approximation of the function to be learned.

which nearly coincides with the regularity to be learned on the training examples $x_i$. According to these two steps, the generalization error divides into two parts: The

    structural error, i.e., the deviation of the empirical error on a finite set of data from

    the overall error for functions in the specified class, and the empirical error, i.e., the

deviation of the output function from the regularity on the training set.

We will shortly summarize various empirical risk minimization techniques for re-

current neural networks: Assume $(x_i, y_i)_i$ are the training data and some neural architecture computing a function $f_w$ which is parameterized by the weights $w$ is chosen. Often, training algorithms choose appropriate weights $w$ by means of minimizing the quadratic error $\sum_i d(f_w(x_i), y_i)^2$, $d$ being some appropriate distance, e.g. the Euclidean distance. Since in popular cases the above term is differentiable with respect to the weights, a simple gradient descent can be used. The derivative with respect to one weight decomposes into various terms according to the sequential structure of the inputs and outputs, i.e., the number of recursive applications of the transition functions. A direct recursive computation of the single terms has complexity $O(W^2 t)$, $W$ being the number of weights and $t$ being the number of recurrent steps. In so-called

real time recurrent learning, these weight updates are performed immediately after the computation such that initially unlimited time series can be processed. This method can be applied in online learning in robotics, for example. In analogy to standard backpropagation, the most popular learning algorithm for feedforward networks, one can speed up the computation and obtain the derivatives in time $O(W t)$ via first propagating the signals forward through the entire network and recursive steps and afterwards propagating the error signals backwards through the network and all recursive steps. However, the possibility of online adaptation while a sequence is still processed is lost in this so-called backpropagation through time [28,44]. There exist combinations of both methods and variations for training continuous systems [33]. The true gradient is sometimes substituted by a truncated gradient in earlier approaches [6]. Since theoretical investigation suggests that pure gradient descent techniques will likely suffer from numerical instabilities (the gradients will either blow up or vanish during propagation through the recursive steps), alternative methods propose random guessing, statistical approaches like the EM algorithm, or an explicit normalization of the error like LSTM [1,12].

    networks due to numerically ill-behaved gradients as shown in [2]. Hence the com-

    plexity of training recurrent networks is a very interesting topic; moreover, the fact that


    the empirical error can be minimized efficiently is one ingredient of PAC learnabil-

    ity. Unfortunately, precise theoretical investigations can be found only for very limited

    situations: It has been proved that fixed recurrent architectures with the perceptron ac-

    tivation function can be trained in polynomial time [11]. Things change if architectural

    parameters are allowed to vary. This means that the number of input neurons, for exam-

ple, may change from one training problem to the next since most learning algorithms

    are uniform with respect to the architectural size. In this case, almost every realistic

    situation is NP-hard already for feedforward networks, although this has not yet been

    proved for a sufficiently general scenario. One recent result reads as follows: Assume

    there is given a multilayer perceptron architecture where the number of input neurons

    is allowed to vary from one instance to the next instance, the input biases are dropped,

and no solution without errors exists. Then it is NP-hard to find a network such that

    the number of misclassified points of the network compared to the optimum achievable

    number is limited by a term which may even be exponential in the network size [4].

    People are working on adequate generalizations to more general or typical situations.

    Approximation Ability

    The ability of recurrent neural networks to simulate Turing machines manifests their

    enormous capacity. From a statistical point of view, we are interested in a slightly dif-

ferent question: Given some finite set of examples $(x_i, y_i)$, the inputs $x_i$ or outputs $y_i$ may be sequences, does there exist a network which maps each $x_i$ approximately onto the corresponding $y_i$? Which are the required resources? If there is an underlying mapping, can it be approximated in an appropriate sense, too? The difference to the previous argumentation consists in the fact that there need not be a recursive underlying regularity producing $y_i$ from $x_i$. At the same time we do not require to interpolate or simulate

    the underlying possibly non-recursive behavior precisely in the long term limit.

One way to attack the above questions consists in a division of the problem into three parts: It is to be shown that sequences can be encoded or decoded, respectively, with a neural network, and that the induced mapping on the connectionistic representation can be approximated with a standard feedforward network. There exist two natural ways of encoding sequences in a finite dimensional vector space: Sequences of length at most $T$ can be written in a vector space of dimension $T$, filling the empty spaces, if any, with entries $0$; we refer to this coding as vector-coding. Alternatively, the single entries in a sequence can be cut to a fixed precision and concatenated in a single real number; we refer to this method as real-value-coding. Hence a sequence $[x_1, x_2, x_3, x_4]$ with entries in $[0,1)$ becomes the vector $(x_1, x_2, x_3, x_4, 0, 0, 0)$ (for maximum length seven) or a single real number obtained by concatenating truncated expansions of $x_1, \ldots, x_4$, as an example. One can show that both codings can be computed with a recurrent network. Vector-encoding and decoding can be performed with a network whose number of neurons is linear in the maximum input length and which possesses an appropriate activation function. Real-value-encoding is possible with only a fixed number of neurons for purely symbolic data, i.e., inputs from $\Sigma^*$ with $|\Sigma| < \infty$. Sequences in $(\mathbb{R}^n)^*$ require additional neurons which compute the discretization of the real values. Naturally, precise decoding of the discretization is not possible since this information is lost in the coding. Encoding such that unique codes result can be performed with a number of neurons depending on the number of sequences to be encoded. Decoding real-value codes is possible,


    too. However, a standard activation function requires a number of neurons increasing

    with the maximum length even for symbolic data [11].
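The two codings can be sketched in Python as follows (a toy illustration of my own; the digit-concatenation scheme assumes a small alphabet and is only one possible realization of real-value-coding):

    def vector_code(seq, max_len, pad=0.0):
        # vector-coding: write a sequence of reals into a fixed-length vector, padding with 0
        assert len(seq) <= max_len
        return tuple(seq) + (pad,) * (max_len - len(seq))

    def real_value_code(symbols, alphabet):
        # real-value-coding for purely symbolic data: concatenate one digit per symbol
        # (assumes at most 9 symbols) into a single real number in [0, 1)
        digits = "".join(str(alphabet.index(s) + 1) for s in symbols)
        return float("0." + digits) if digits else 0.0

    print(vector_code([0.2, 0.7, 0.1], max_len=7))        # (0.2, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0)
    print(real_value_code(["a", "b", "a"], ["a", "b"]))   # 0.121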

    It is well known that feedforward networks with one hidden layer and appropriate

    activation function are universal approximators. Hence one can conclude that approx-

    imation of general functions is possible if the above encoding or decoding networks

    are combined with a standard feedforward network which approximates the induced

    mappings on the connectionistic codes. To be more precise, approximating measurable

functions on inputs of arbitrarily high probability is possible through real-value encoding.

    Each continuous function can be approximated for inputs from a compact set through

    vector-encoding. In the latter case, the dimension used for the connectionistic represen-

    tation necessarily increases for increasing length of the sequences [11].

    Learnability

    Having settled the universal approximation ability, we should make sure that the struc-

tural risk can be controlled within a fixed neural architecture. I.e., we have to show that a finite number of training examples is sufficient in order to nearly specify the unknown underlying regularity. Assume there is fixed some probability measure $P$ on the inputs. For the moment assume that we deal with real-valued outputs only. Then one standard way to guarantee the above property for a function class $\mathcal{F}$ is via the so-called uniform convergence of empirical distances (UCED) property, i.e.,

$$P^m\Big( x \,\Big|\, \sup_{f \in \mathcal{F}} \big| d_P(f, g) - \hat{d}_m(f, g, x) \big| > \epsilon \Big) \to 0 \quad (m \to \infty)$$

holds for every $\epsilon > 0$, where $d_P(f, g) = \int |f(x) - g(x)| \, dP(x)$ is the real error of $f$ with respect to the regularity $g$ to be learned and $\hat{d}_m(f, g, x) = \frac{1}{m} \sum_{i=1}^m |f(x_i) - g(x_i)|$ is the empirical error on the sample $x = (x_1, \ldots, x_m)$. The UCED property guarantees that the

    empirical error of any learning algorithm is representative for the real generalization

    error. We refer to the above distance as the risk. A standard way to prove the UCED

    property consists in an estimation of a combinatorial quantity, the pseudodimension.

Definition 5. The pseudodimension of a function class $\mathcal{F}$, $\mathrm{VC}(\mathcal{F})$, is the largest cardinality (possibly infinite) of a set of points which can be shattered. A set of points $x_1, \ldots, x_m$ is shattered if reference points $r_1, \ldots, r_m \in \mathbb{R}$ exist such that for every mapping $b: \{x_1, \ldots, x_m\} \to \{0, 1\}$ some function $f \in \mathcal{F}$ exists with $f(x_i) \ge r_i \iff b(x_i) = 1$ for all $i$.

    The pseudodimension measures the richness of a function class. It is the largest set of

    points such that every possible binary function can be realized on these points. No gen-

    eralization can be expected if a training set can be shattered. It is well known that the

    UCED property holds if the pseudodimension of a function class is finite [43]. More-

over, the number of examples required for valid generalization can be explicitly limited by roughly the order $d/\epsilon^2$, $d$ being the pseudodimension and $\epsilon$ the required accuracy.
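For orientation, a generic uniform-convergence bound of this type (a standard textbook form with an unspecified universal constant $c$, not the paper's exact statement) reads:

$$m \;\ge\; \frac{c}{\epsilon^2}\Big( d \log\frac{1}{\epsilon} + \log\frac{1}{\delta} \Big) \quad\Longrightarrow\quad P^m\Big( \sup_{f \in \mathcal{F}} \big| d_P(f, g) - \hat{d}_m(f, g, x) \big| > \epsilon \Big) \;\le\; \delta,$$

where $d$ is the pseudodimension, $m$ the number of training examples, and $\delta$ the allowed failure probability.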

Assume $\mathcal{F}$ is given by a recurrent architecture with $W$ weights. Denote by $\mathcal{F}_T$ the restriction to inputs of length at most $T$. Then one can limit $\mathrm{VC}(\mathcal{F}_T)$ by a polynomial in $T$ and $W$. However, lower bounds exist which show that the pseudodimension necessarily depends on $T$ in most interesting cases [17]. Hence $\mathrm{VC}(\mathcal{F})$ is infinite for unrestricted


    sequences. As a consequence, the above argumentation proves learnability only for re-

    stricted inputs. Moreover, since a finite pseudodimension (more precisely, a finite so-

    called fat-shattering dimension) is necessary for distribution independent learnability

    under realistic conditions, distribution independent bounds for the risk cannot exist in

    principle [43]. Hence one has to add special considerations to the standard argumenta-

    tion for recurrent architectures. Mainly two possibilities can be found in the literature:

    One can either take specific knowledge about the underlying probability into consider-

    ation, or one can derive posterior bounds which depend on the specific training set. The

    results are as follows [11]:

Assume $\epsilon > 0$ and one can find $T$ such that the probability of sequences of length greater than $T$ is bounded from above by $\epsilon$. Then the risk is limited by $\epsilon$ provided that the number of examples is roughly of order $d_T/\epsilon^2$, $d_T$ being the (finite) pseudodimension of the architecture restricted to input sequences of length at most $T$.

Assume training on a set of size $m$ with maximum length $T$ has been performed. Then the risk can be bounded by a term of roughly order $\sqrt{d_T/m}$, $d_T$ being the (finite) pseudodimension of the architecture restricted to input sequences of length at most $T$. A more detailed analysis even allows one to drop the long sequences before measuring $T$ [10].

    Hence one can guarantee valid generalization, although only with additional con-

    siderations compared to the feedforward case. Moreover, there may exist particularly

    ugly situations for recurrent networks where training is possible only with an exponen-

    tially increasing number of training examples [11]. This is the price one has to pay for

    the possibility of dealing with structured data, in particular data with a priori unlimited

    length. Note that the above argumentation holds only for architectures with real val-

    ues as outputs. The case of structured outputs requires a more advanced analysis via so

    called loss functions and yields to similar results [10].

    4 Advanced Architectures

    The next step is to go from sequences to tree structured data. Since trees cover terms

    and formulas, this is a fairly general approach. The network dynamics and theoretical

    investigations are direct generalizations of simple recurrent networks. One can obtain

a recursive neural encoding $\mathrm{enc}_f: (\Sigma)^{*_k} \to \mathbb{R}^n$ and a recursive neural decoding $\mathrm{dec}_g: \mathbb{R}^n \to (\Sigma)^{*_k}$ of trees if $f$ and $g$ are computed by standard networks. These codings

    can be composed with standard networks for the approximation of general functions.

    Depending on whether the inputs, the outputs, or both may be structured and depending

    on which part is trainable, we obtain different connectionistic mechanisms. A sketch of

    the first two mechanisms which are described in the following can be found in Fig. 5.

    Recursive Autoassociative Memory

    The recursive autoassociative memory (RAAM) as introduced by Pollack and gener-

alized by Sperduti and Starita [30,40] consists of a recursive encoding $\mathrm{enc}_f$ and a recursive decoding $\mathrm{dec}_g$, $f$ and $g$ being standard feedforward networks, and a standard feedforward network $h$. An appropriate composition of these parts can approximate mappings where the inputs or the outputs may be $k$-trees or vectors, respectively. Training proceeds in


Fig. 5. Processing tree structures with connectionistic methods: a RAAM encodes a tree into a real vector and decodes the vector back into a tree, whereas a folding network encodes a tree and maps the resulting code to a real vector $x$.

two steps: first, the composition $\mathrm{dec}_g \circ \mathrm{enc}_f$ is trained on the identity on a given training set with truncated gradient descent such that the two parts constitute a proper encoding or decoding, respectively. Afterwards, a standard feedforward network is combined with either the encoding or decoding and trained via standard backpropagation where the weights in the recursive coding are fixed. Hence arbitrary mappings on structured data can be approximated. Note that the encoding is fitted to the specific training set. It is

    not fitted to the specific approximation task. In all cases encoding and decoding must

    be learned even if only the inputs or only the outputs are structured.

    In analogy to simple recurrent networks the following questions arise: Can any map-

    ping be approximated in principle? Do the respective parts show valid generalization?

    Is training efficient? We will not consider the efficiency of training in the following

    since the question is not yet satisfactorily answered for feedforward networks. The other

    questions are to be answered for both, the coding parts and the feedforward approxima-

    tion on the encoded values. Note that the latter task only deals with standard feedfor-

    ward networks whose approximation and generalization properties are well established.

    Concerning the approximation capability of the coding parts we can borrow ideas from

    recurrent networks: A natural encoding of tree structured data consists in the prefix rep-

resentation of a tree. For example, the $2$-tree $f(a, g(a, a))$ can uniquely be represented by the sequence $[f, a, \xi, \xi, g, a, \xi, \xi, a, \xi, \xi]$ including the empty tree $\xi$. Depending on whether real-labeled trees are to be encoded precisely, or only symbolic data, i.e., labels from $\Sigma$ where $|\Sigma| < \infty$, are dealt with, or a finite approximation of the real values is sufficient, the above sequence can be encoded in


    a real-value code with a fixed dimension or a vector code whose dimension depends

    on the maximum height of the trees. The respective encoding or decoding can be com-

    puted with recursive architectures induced by a standard feedforward network [11]. The

    required resources are as follows: Vector-coding requires a number of neurons which

    increases exponentially with the maximum height of the trees. Real-value-encoding re-

    quires only a fixed number of neurons for symbolic data and a number of neurons which

    is quadratic in the number of patterns for real-valued labels. Real-value-decoding re-

    quires a number of neurons which increases exponentially with increasing height of

    the trees; the argument consists in a lower bound of the pseudodimension of function

    classes which perform proper decoding. This number increases more than exponentially

    in the height. Learning the coding yields valid generalization provided prior information

    about the input distribution is available. Alternatively, one can derive posterior bounds

    on the generalization error which depend on the concrete training set. These results

    follow in the same way as for standard recurrent networks. Hence the RAAM consti-

    tutes a promising and in principle applicable mechanism. Due to the difficulty of proper

decoding, applications can be found for small training sets only [40].
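For concreteness, the prefix serialization used above as a natural tree encoding can be sketched in a few lines (the label set and the symbol chosen for the empty tree are illustrative):

```python
# Prefix (depth-first) serialization of binary trees with an explicit symbol for
# the empty tree; labels and the marker are illustrative choices.
EMPTY = "#"

def prefix(tree):
    """A tree is either None (the empty tree) or (label, left subtree, right subtree)."""
    if tree is None:
        return [EMPTY]
    label, left, right = tree
    return [label] + prefix(left) + prefix(right)

# the tree a(c(empty, empty), c(a(empty, empty), empty)) from the text
t = ("a", ("c", None, None), ("c", ("a", None, None), None))
print(prefix(t))   # ['a', 'c', '#', '#', 'c', 'a', '#', '#', '#']
```

Since the entries stem from the finite set of labels plus the empty-tree symbol, each entry can be mapped to a number or to a unit vector, which yields the real-value codes or vector codes discussed above.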

    Folding Networks

    Folding networks use ideas of the LRAAM [19]. They focus on clustering symbolic

    data, i.e., the outputs are not structured, but real vectors. This limitation makes decoding

    superfluous. For training, the encoding part and the feedforward network are composed

    and simultaneously trained on the respective task via a gradient descent method, so-

    called backpropagation through structure, a generalization of backpropagation through

    time. Hence the encoding is fitted to the data and the respective learning task.
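A rough sketch of this joint training (again assuming PyTorch; the cell architecture, the label alphabet, and the toy task are illustrative) composes a shared recursive encoding cell with a feedforward output network and propagates the gradient through the structure of every input tree:

```python
# Illustrative folding network; labels, dimensions and the toy task are made up.
import torch
import torch.nn as nn

DIM, LABELS = 8, 3                               # code dimension, size of the label alphabet

cell = nn.Linear(LABELS + 2 * DIM, DIM)          # shared recursive encoding cell
out_net = nn.Sequential(nn.Linear(DIM, 16), nn.Tanh(), nn.Linear(16, 1))
empty_code = torch.zeros(DIM)                    # fixed code for the empty tree

def fold(tree):
    """Encode a tree; a tree is None or (label index, left subtree, right subtree)."""
    if tree is None:
        return empty_code
    label_idx, left, right = tree
    label = torch.eye(LABELS)[label_idx]         # one-hot representation of the label
    return torch.tanh(cell(torch.cat([label, fold(left), fold(right)])))

# toy data: labeled binary trees with a real-valued target each
data = [((0, (1, None, None), (2, None, None)),  1.0),
        ((2, None, (1, None, None)),            -1.0)]

# encoding cell and output network are trained jointly; the gradient is propagated
# through the structure of every input tree (backpropagation through structure)
opt = torch.optim.Adam(list(cell.parameters()) + list(out_net.parameters()), lr=1e-2)
for _ in range(200):
    loss = sum((out_net(fold(t)).squeeze() - y) ** 2 for t, y in data)
    opt.zero_grad(); loss.backward(); opt.step()
```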

    It follows immediately from the above discussion that folding networks can approx-

    imate every measurable function in probability using real-value codes, and they can

approximate every continuous function on compact input domains with vector codes. Additionally, valid generalization can be guaranteed with a similar argument as

    above with bounds depending on the input distribution or the concrete training set.

    Due to the fact that the difficult part, proper decoding, is dropped, several applica-

    tions of folding networks for large data sets can be found in the literature: classification

    of terms and formulas, logo recognition, drug design, support of automatic theorem

    provers, . . . [19,34,35]. Moreover, they can be related to finite tree automata in analogy

    to the correlation of recurrent networks and finite automata [18].

    Holographic Reduced Representation

    Holographic reduced representation (HRR) is identical to RAAM with a fixed encod-

    ing and decoding: a priori chosen functions given by so-called circular correlation or

convolution, respectively [29]. Correlation (denoted by ⊙) and convolution (denoted by ⊗) constitute a specific way to relate two vectors to a vector of the same dimension such that correlation and convolution are approximately inverse to each other, i.e., a ⊙ (a ⊗ b) ≈ b. Hence one can encode a tree a(x1, x2) via computing the convolution of each entry with a specific vector indicating the role of the component and adding these three vectors: r0 ⊗ a + r1 ⊗ x1 + r2 ⊗ x2, with r0, r1, r2 being the roles. The single entries can be


approximately restored via correlation: r1 ⊙ (r0 ⊗ a + r1 ⊗ x1 + r2 ⊗ x2) ≈ x1. One can

    compute the deviation in the above equation under statistical assumptions. Commonly,

    the restored values are accurate provided the dimension of the vectors is sufficiently

    high, the height of the trees is limited, and the vectors are additionally cleaned up in an

    associative memory. It follows immediately from our above argumentation that these

three conditions are necessary: decoding is a difficult task which, for standard computations, requires exponentially increasing resources. HRR is used in the literature for stor-

    ing and recognizing language [29]. Since encoding and decoding are fixed, no further

    investigation of the approximation or generalization ability is necessary.
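For illustration, circular convolution and correlation can be computed via the fast Fourier transform; the following NumPy sketch (the dimension, the role vectors r0, r1, r2, and the fillers are illustrative choices) encodes a tree a(x1, x2) as above and approximately restores x1:

```python
# Circular convolution/correlation for HRR via the FFT; dimension, roles and
# fillers are illustrative.
import numpy as np

def cconv(a, b):    # circular convolution
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):    # circular correlation, the approximate inverse of convolution
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

n = 2048                                          # a high dimension keeps the noise small
rng = np.random.default_rng(0)
r0, r1, r2 = (rng.normal(0.0, 1.0 / np.sqrt(n), n) for _ in range(3))   # role vectors
a, x1, x2 = (rng.normal(0.0, 1.0 / np.sqrt(n), n) for _ in range(3))    # fillers

code = cconv(r0, a) + cconv(r1, x1) + cconv(r2, x2)   # encode the tree a(x1, x2)
x1_restored = ccorr(r1, code)                         # approximately x1 plus cross-talk

cos = np.dot(x1_restored, x1) / (np.linalg.norm(x1_restored) * np.linalg.norm(x1))
print(cos)   # well above the chance level of about 1/sqrt(n), but clearly below 1
```

The restored vector is clearly correlated with x1 but far from exact, which illustrates why a high dimension, limited tree height, and an associative clean-up memory are needed in practice.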

    5 Conclusions

    Combinations of symbolic and connectionistic systems, more precisely, connectionistic

    systems processing symbolic data have been investigated. A particular difficulty con-

sists in the fact that the informational content of symbolic data is not limited a priori. Hence a priori unlimited length is to be mapped to a connectionistic vector represen-

    tation. We have focused on recurrent systems which map the unlimited length to a

    priori unlimited processing time. Simple recurrent neural networks constitute a well

    established model. Apart from the simplicity of the data they process, sequences, the

    main theoretical properties are the same as for advanced mechanisms. One can inves-

    tigate algorithmic or statistical aspects of learning, the first ones being induced by the

    nature of the data, the second ones by the nature of the connectionistic system. We

covered algorithmic aspects mainly in comparison to standard mechanisms. The enormous capacity of recurrent networks has become apparent, although it is of merely theoretical interest. Concerning statistical learning theory, satisfactory results for the universal

    approximation capability and the generalization ability have been established, although

    generalization can only be guaranteed if specifics of the data are taken into account.

    The idea of coding leads to an immediate generalization to tree structured data.

    Well established approaches like RAAM, HRR, and folding networks fall within this

    general definition. The theory established for recurrent networks can be generalized to

    these advanced approaches immediately. The in-principle statistical learnability of these

    mechanisms follows. However, some specific situations might be extremely difficult:

Decoding requires an increasing amount of resources. Hence the RAAM is applicable to small data sets only and decoding in HRR requires an additional clean-up, whereas folding networks can be found in real-world applications.

    Nevertheless, the results are encouraging since they prove the possibility to process

    symbolic data with neural networks and constitute a theoretical foundation for the suc-

    cess of some of the above mentioned methods. Unfortunately, the general approaches

neither generalize to cyclic structures like graphs, nor do they provide the biological plausibility that could help explain human recognition of such data. For both aspects fully dynamic approaches would be more promising, although it would be more

    difficult to find effective training algorithms for practical applications.


    References

    1. Y. Bengio and P. Frasconi. Credit assignment through time: Alternatives to backpropaga-

    tion. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information

    Processing Systems, Volume 5. Morgan Kaufmann, 1994.

    2. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient de-

    scent is difficult. IEEE Transactions on Neural Networks, 5(2), 1994.

3. L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer,

    1998.

4. B. DasGupta and B. Hammer. On approximate learning by multi-layered feedforward
circuits. In: H. Arimura, S. Jain, A. Sharma (eds.), Algorithmic Learning Theory 2000,

    Springer, 2000.

    5. M. W. Craven and J. W. Shavlik. Using sampling and queries to extract rules from trained

    neural networks. In: Proceedings of the Eleventh International Conference on Machine

    Learning, Morgan Kaufmann, 1994.

    6. J. L. Elman. Finding structure in time. Cognitive Science, 14, 1990.

7. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 1998.

    8. C. L. Giles, G. M. Kuhn, and R. J. Williams. Special issue on dynamic recurrent neural

    networks. IEEE Transactions on Neural Networks, 5(2), 1994.

    9. M. Gori, M. Mozer, A. C. Tsoi, and R. L. Watrous. Special issue on recurrent neural networks

    for sequence processing. Neurocomputing, 15(3-4), 1997.

    10. B. Hammer. Approximation and generalization issues of recurrent networks dealing with

structured data. In: P. Frasconi, M. Gori, F. Kurfess, and A. Sperduti, editors, Proceedings of the ECAI

    workshop on Foundations of connectionist-symbolic integration: representation, paradigms,

    and algorithms, 2000.

    11. B. Hammer. Learning with recurrent neural networks. Lecture Notes in Control and Infor-

    mation Sciences 254, Springer, 2000.

    12. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8),

    1997.

    13. S. Holldobler, Y. Kalinke, and H. Lehmann. Designing a Counter: Another case study of Dy-

    namics and Activation Landscapes in Recurrent Networks. In G. Brewka and C. Habel and

    B. Nebel (eds.): KI97: Advances in Artificial Intelligence, Proceedings of the 21st German

    Conference on Artificial Intelligence, LNAI 1303, Springer, 1997.

    14. J.J. Hopfield and D.W. Tank. Neural computation of decisions in optimization problems.

    Biological Cybernetics, 52, 1985.

    15. J.E. Hummel and K.L. Holyoak. Distributed representation of structure: a theory of analog-

    ical access and mapping. Psychological Review, 104, 1997.

    16. J. Kilian and H. T. Siegelmann. The dynamic universality of sigmoidal neural networks.

    Information and Computation, 128, 1996.

    17. P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. Journal of

    Computer and System Sciences, 54, 1997.

    18. A. Kuchler. On the correspondence between neural folding architectures and tree automata.

Technical report, University of Ulm, 1998.
19. A. Kuchler and C. Goller. Inductive learning in symbolic domains using structure-driven neural

    networks. In G. Gorz and S. Holldobler, editors, KI-96: Advances in Artificial Intelligence.

    Springer, 1996.

    20. S. Lawrence, C.L. Giles, and S. Fong. Can recurrent neural networks learn natural language

grammars? In: International Conference on Neural Networks, IEEE Press, 1996.


    21. W. Maass and P. Orponen. On the effect of analog noise in discrete-time analog computation.

    Neural Computation, 10(5), 1998.

    22. W. Maass and E. D. Sontag. Analog neural nets with Gaussian or other common noise

    distributions cannot recognize arbitrary regular languages. In M. C. Mozer, M. I. Jordan,

    and T. Petsche, editors, Advances in Neural Information Processing Systems, Volume 9. The

    MIT Press, 1998.

23. T. Masters. Neural, Novel, & Hybrid Algorithms for Time Series Prediction. Wiley, 1995.

24. T. Mitchell. Machine Learning. McGraw-Hill, 1997.

    25. M. Mozer. Neural net architectures for temporal sequence processing. In A. Weigend and

    N. Gershenfeld, editors, Predicting the future and understanding the past. Addison-Wesley,

    1993.

    26. C. W. Omlin and C. L. Giles. Extraction of rules from discrete-time recurrent neural net-

    works. Neural Networks, 9(1), 1996.

    27. C. Omlin and C. Giles. Constructing deterministic finite-state automata in recurrent neural

    networks. Journal of the ACM, 43(2), 1996.

    28. B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey.

IEEE Transactions on Neural Networks, 6(5), 1995.
29. T. Plate. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3),

    1995.

    30. J. Pollack. Recursive distributed representation. Artificial Intelligence, 46, 1990.

    31. M. Reczko. Protein secondary structure prediction with partially recurrent neural networks.

    SAR and QSAR in environmental research, 1, 1993.

    32. M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation: The

    RPROP algorithm. In Proceedings of the Sixth International Conference on Neural Net-

    works. IEEE, 1993.

33. J. Schmidhuber. A fixed size storage O(n³) time complexity learning algorithm for fully

    recurrent continually running networks. Neural Computation, 4(2), 1992.

    34. T. Schmitt and C. Goller. Relating chemical structure to activity with the structure processing

    neural folding architecture. In Engineering Applications of Neural Networks, 1998.

    35. S. Schulz, A. Kuchler, and C. Goller. Some experiments on the applicability of folding

architectures to guide theorem proving. In Proceedings of the 10th International FLAIRS Conference, 1997.

    36. H. T. Siegelmann. The simple dynamics of super Turing theories. Theoretical Computer

    Science, 168, 1996.

    37. H. T. Siegelmann and E. D. Sontag. Analog computation, neural networks, and circuits.

    Theoretical Computer Science, 131, 1994.

    38. H. T. Siegelmann and E. D. Sontag. On the computational power of neural networks. Journal

    of Computer and System Sciences, 50, 1995.

39. L. Shastri. Advances in Shruti: a neurally motivated model of relational knowledge repre-

    sentation and rapid inference using temporal synchrony. Applied Intelligence, 11, 1999.

    40. A. Sperduti. Labeling RAAM. Connection Science, 6(4), 1994.

    41. J. Suykens, B. DeMoor, and J. Vandewalle. Static and dynamic stabilizing neural controllers

applicable to transition between equilibrium points. Neural Networks, 7(5), 1994.

42. L. Valiant. A theory of the learnable. Communications of the ACM, 27, 1984.
43. M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.

    44. R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and

    their computational complexity. In Y. Chauvin and D. Rumelhart, editors, Back-propagation:

    Theory, Architectures and Applications. Erlbaum, 1992.