Barbara Hammer - Perspectives on Learning Symbolic Data with Connectionistic Systems



    Perspectives on Learning Symbolic Data with

    Connectionistic Systems

    Barbara Hammer

University of Osnabrück, Department of Mathematics/Computer Science, D-49069 Osnabrück,

    Germany, e-mail: [email protected].

    Abstract. This paper deals with the connection of symbolic and subsymbolic systems. It focuses

    on connectionistic systems processing symbolic data. We examine the capability of learning sym-

    bolic data with various neural architectures which constitute partially dynamic approaches: dis-

    crete time partially recurrent neural networks as a simple and well established model for process-

    ing sequences, and advanced generalizations like holographic reduced representation, recursive

    autoassociative memory, and folding networks for processing tree structured data. The methods

    share the basic dynamics, but they differ in the specific training methods. We consider the follow-

    ing questions: Which are the representational capabilities of the architectures from an algorithmic

    point of view? Which are the representational capabilities from a statistical point of view? Are

    the architectures learnable in an appropriate sense? Are they efficiently learnable?

    1 Introduction

    Symbolic methods and connectionistic or subsymbolic systems constitute complemen-

    tary approaches in order to automatically process data appropriately. Various learning

    algorithms for learning an unknown regularity based on training examples exist in both

    domains: decision trees, rule induction, inductive logic programming, version spaces,

    . . . on the one side and Bayesian reasoning, vector quantization, clustering algorithms,

    neural networks, . . . on the other side [24]. The specific properties of the learning algo-

    rithms are complementary as well. Symbolic methods deal with high level information

    formulated via logical formulas, for example; data processing is human-understandable;

    hence it is often easy to involve prior knowledge, to adapt the training outputs to specific

    domains, or to retrain the system on additional data; at the same time, training is often

    complex, inefficient, and sensitive to noise. In comparison, connectionistic systems deal

    with low level information. Since they perform pattern recognition, their behavior is not

    human understandable and often, adaptation to specific situations or additional data re-

    quires complete retraining. At the same time, training is efficient, noise tolerant, and

    robust. Common data structures for symbolic methods are formulas or terms, i.e., high

    level data with little redundant information and a priori unlimited structure where lots

    of information lie in the interaction of the single data components. As an example, the

meaning of each of the symbols in the term father(John,Bill) is essentially connected to its respective position in the term. No symbol can be omitted without losing important information. If Bill were the friend of Mary's brother, the above term could

    be substituted by father(John,friend(brother(Mary))), a term with a different length and

    structure. Connectionistic methods process patterns, i.e., real vectors of a fixed dimen-

    sion, which commonly comprise low level, noisy, and redundant information of a fixed


and determined form. The precise value and location of the single components is often unimportant; information comes from the sum of local features. As an example, Fig. 1 depicts various representations of a digit; each picture can be represented by a vector of gray-levels; the various pictures differ considerably in detail while preserving important features such as the two curved lines of the digit.

Fig. 1. Example of subsymbolic data: hand-written digits.

    Often, data possess both, symbolic and subsymbolic aspects: As an example, database

    entries may combine the picture of a person, his income, and his occupation; web sites

    consist of text, pictures, formulas, and links; arithmetical formulas may contain vari-

    ables and symbols as well as real numbers. Hence appropriate machine learning meth-

    ods have to process hybrid data. Moreover, people are capable of dealing with both

    aspects at the same time. It would be interesting to see which mechanisms allow artifi-

    cial learning systems to handle both aspects simultaneously. We will focus on connec-

    tionistic systems capable of dealing with symbolic and hybrid data. Our main interests

    are twofold: On the one hand, we would like to obtain an efficient learning system

    which can be used for practical applications involving hybrid data. On the other hand,

    we would like to gain insight into the questions of how symbolic data can be processed

    with connectionistic systems in principle; do there exist basic limitations; does this point

    of view allow further insight into the black-box dynamics of connectionistic systems?

    Due to the nature of symbolic and hybrid data, there exist two ways of asking questions

    about the theoretical properties of those mechanisms: the algorithmic point of view and

the statistical point of view. One can, for example, consider the question whether symbolic mechanisms can be learned with hybrid systems exactly; alternatively, the focus

    can lie on the property that the probability of poor performance on input data can be

    limited. Generally speaking, one can focus on the symbolic data; alternatively, one can

    focus on the connectionistic systems. It will turn out that this freedom leads to both,

    further insight into the systems as well as additional problems which are to be solved.

    Various mechanisms extend connectionistic systems with symbolic aspects; a ma-

    jor problem of networks dealing with symbolic or hybrid data lies in the necessity of

    processing structures with a priori unlimited size. Mainly three different approaches

    can be found in the literature: Symbolic data may be represented by a fixed number of

    features and further processed with standard neural networks. Time series, as an exam-

    ple, may be represented by a local time window of fixed length and additional global

    features such as the overall trend [23]. Formulas may be represented by the involved

symbols and a measure of their complexity. This approach is explicitly static: Data are

    encoded in a finite dimensional vector space via problem specific features before further

    processing with a connectionistic system. Obviously, the representation of data is not

    fitted to the specific learning task since learning is independent of encoding. Moreover,

    it may be difficult or in general impossible to find a representation in a finite dimen-


    sional vector space such that all relevant information is preserved. As an example, the

    terms equal(a,a), equal(f(a),f(a)), equal(f(f(a)),f(f(a))), . . . could be represented by the

number of occurrences of the symbol f at the first and second position in the respective term. The terms equal(g(a,g(a,a)),g(a,g(a,a))), equal(g(g(a,a),a),g(g(a,a),a)) can no longer be represented in the same way without loss of information; we have to add

    an additional part encoding the order of the symbols.

    Alternatively, the a priori unlimited structure of the inputs can be mapped to a pri-

    ori unlimited processing time of the connectionistic system. Standard neural networks

    are equipped with additional recurrent connections for this purpose. Data are processed

    in a dynamic way involving the additional dimension of time. This can either be fully

    dynamic, i.e., symbolic input and output data are processed over time, the precise dy-

    namics and number of recurrent computation steps being unlimited and correlated to

    the respective computation; or the model can be partially dynamic and implicitly static,

    i.e., the precise dynamics are correlated to the structure of the respective symbolic data

    only. In the first case, complex data may be represented via a limiting trajectory of the

system, via the location of neurons with highest activities in the neural system, or via synchronous spike trains, for example. Processing may be based on Hebbian or compet-

    itive activation such as in LISA or SHRUTI [15,39] or on an underlying potential which

    is minimized such as in Hopfield networks [14]. There exist advanced approaches which

    enable complex reasoning or language processing with fully dynamic systems; how-

    ever, these models are adapted to the specific area of application and require a detailed

    theoretical investigation for each specific approach.

    In the second case, the recurrent dynamics directly correspond to the data structure

and can be determined precisely provided the input or output structure, respectively, is

    known. One can think of the processing as an inherently static approach: The recur-

    rence enables the systems to encode or decode data appropriately. After encoding, a

    standard connectionistic representation is available for the system. The difference to a

    feature based approach consists in the fact that the encoding is adapted to the specificlearning task and need not be separated from the processing part, coding and processing

    constitute one connected system. A simple example of these dynamics are discrete time

    recurrent neural networks or Elman networks which can handle sequences of real vec-

tors [6,9]. Knowledge of the respective structure, i.e., the length of the sequence, allows one

    to substitute the recurrent dynamics by an equivalent standard feedforward network.

    Input sequences are processed step by step such that the computation for each entry

    is based on the context of the already computed coding of the previous entries of the

    sequence. A natural generalization of this mechanism allows neural encoding and de-

    coding of tree structured data as well. Instead of linear sequences, one has to deal with

    branchings. Concrete implementations of this approach are the recursive autoassocia-

    tive memory (RAAM) [30] and labeled RAAM (LRAAM) [40], holographic reduced

    representations (HRR) [29], and recurrent and folding networks [7]. They differ in the

method of how they are trained and in the question as to whether the inputs, the outputs, or both may be structured or real valued, respectively. The basic recurrent dynamics are

the same for all approaches. The possibility to deal with symbolic data, i.e., tree structures,

    relies on some either fixed or trainable recursive encoding and decoding of data with

    simple mappings computed by standard networks. Hence the approaches are uniform


    and a general theory can be developed in contrast to often very specific fully dynamic

    systems. However, the idea is limited to data structures whose dynamics can be mapped

    to an appropriate recursive network. It includes recursive data like sequences or tree

structures; possibly cyclic graphs are not yet covered.

    We will start with the investigation of standard recurrent networks because they are

    a well established and successful method and, at the same time, demonstrate a typical

    behavior. Their in principle capacity as well as their learnability can be investigated

    from an algorithmic as well as a statistical point of view. From an algorithmic point of

    view, the connection to classical approaches like finite automata and Turing machines

    is interesting. Moreover, this connection allows partial insight into the way in which

the networks perform their tasks. There are only a few results concerning the learnability

    of these dynamics from an algorithmic point of view. Afterwards, we will study the

    statistical learnability and approximation ability of recurrent networks. These results

    are transferred to various more general approaches for tree structured data.

    2 Network Dynamics

    First, the basic recurrent dynamics are defined. As usual, a feedforward network con-

    sists of a weighted directed acyclic graph of neurons such that a global processing rule

    is obtained via successive local computations of the neurons. Commonly, the neurons

iteratively compute their activation

$$a_i = \sigma_i\Big(\sum_{j \to i} w_{ji}\, a_j + \theta_i\Big),$$

$j \to i$ denoting the predecessors of neuron $i$, $w_{ji}$ denoting some real-valued weight assigned to connection $j \to i$, $\theta_i \in \mathbb{R}$ denoting the bias of neuron $i$, and $\sigma_i: \mathbb{R} \to \mathbb{R}$ its activation function.
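As a minimal illustration of this computation (my own sketch, not code from the paper; the 3-2-1 architecture and the random weights are purely hypothetical placeholders), a layered feedforward network with a sigmoidal activation function can be evaluated as follows in Python:

    import numpy as np

    def sgd(x):
        # sigmoidal activation function
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x, layers):
        # each layer is a pair (W, theta); every neuron computes
        # sgd(sum_j w_ji * a_j + theta_i) over its predecessor activations a_j
        a = np.asarray(x, dtype=float)
        for W, theta in layers:
            a = sgd(W @ a + theta)
        return a

    # a hypothetical 3-2-1 multilayer architecture with random weights and biases
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(2, 3)), rng.normal(size=2)),
              (rng.normal(size=(1, 2)), rng.normal(size=1))]
    print(forward([0.5, -1.0, 0.25], layers))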

    Starting with the neurons without predecessors, the so called input neurons, which ob-

    tain their activation from outside, the neurons successively compute their activation

    until the output of the network can be found at some specified output neurons. Hence

    feedforward networks compute functions from a finite dimensional real-vector space

    into a finite dimensional real-vector space. A network architecture only specifies the

    directed graph and the activation functions, but not the weights and biases. Often, so-

    called multilayer networks or multilayer architectures are used, meaning that the graph

    decomposes into subsets, so-called layers, such that connections can only be found

    between consecutive layers. It is well known that feedforward neural networks are uni-

    versal approximators in an appropriate sense: Every continuous or measurable function,

respectively, can be approximated by some network with appropriate activation function

    on any compact input domain or for inputs of arbitrarily high probability, respectively.

    Moreover, such mappings can be learned from a finite set of examples. This, in more

    detail, means that two requirements are met. First, neural networks yield valid general-

    ization: The empirical error, i.e., the error on the training data, is representative for the

    real error of the architecture, i.e., the error for unknown inputs, if a sufficiently large

training set has been taken into account. Concrete bounds on the required training set size can be derived. Second, effective training algorithms for minimizing the empirical

    error on concrete training data can be found. Usually, training is performed with some

    modification of backpropagation like the very robust and fast method RProp [32].

    Sequences of real vectors constitute simple symbolic structures. They are difficult

    for standard connectionistic methods due to their unlimited length. We denote the set


of sequences with elements in an alphabet $\Sigma$ by $\Sigma^*$. A common way of processing sequences with standard networks consists in truncating, i.e., a sequence $[x_1, \ldots, x_t]$ with initially unknown length $t$ is substituted by only a part $[x_1, \ldots, x_T]$ with a priori fixed time horizon $T$. Obviously, truncation usually leads to information loss. Alter-

    natively, one can equip feedforward networks with recurrent connections and use the

    further dimension of time. Here, we introduce the general concept of recurrent coding

    functions. Every mapping with appropriate domain and codomain induces a mapping

    on sequences or into sequences, respectively, via recursive application as follows:

Definition 1. Assume $\Sigma$ is some set. Any function $f: \Sigma \times \mathbb{R}^n \to \mathbb{R}^n$ and initial context $y_0 \in \mathbb{R}^n$ induce a recursive encoding

$$\mathrm{enc}_f: \Sigma^* \to \mathbb{R}^n, \qquad \mathrm{enc}_f([x_1, \ldots, x_t]) = \begin{cases} y_0 & \text{if } t = 0, \\ f(x_t, \mathrm{enc}_f([x_1, \ldots, x_{t-1}])) & \text{otherwise.} \end{cases}$$

Any function $g = (g_0, g_1): \mathbb{R}^n \to \Sigma \times \mathbb{R}^n$ and final set $Y \subset \mathbb{R}^n$ induce a recursive decoding

$$\mathrm{dec}_g: \mathbb{R}^n \to \Sigma^*, \qquad \mathrm{dec}_g(x) = \begin{cases} [\,] & \text{if } x \in Y, \\ [g_0(x)] \cdot \mathrm{dec}_g(g_1(x)) & \text{otherwise,} \end{cases}$$

where $\cdot$ denotes concatenation.

Note that $\mathrm{dec}_g(x)$ may not be defined if the decoding does not lead to values in $Y$. Therefore one often restricts decoding to decoding of sequences up to a fixed finite length in practice. Recurrent neural networks compute the composition of up to three functions $\mathrm{dec}_g \circ h \circ \mathrm{enc}_f$ depending on their respective domain and codomain, where $f$, $g$, and $h$ are computed by standard networks. Note that this notation is some-

    what unusual in the literature. Mostly, recurrent networks are defined via their transition

function, referring to the standard dynamics of discrete dynamical systems. However,

    the above definition has the advantage that the role of the single network parts can be

made explicit: Symbolic data are first encoded into a connectionistic representation; this connectionistic representation is further processed with a standard network; finally, the

    implicit representation is decoded to symbolic data. In practice, these three parts are not

    well separated and one can indeed show that the transformation part can be included

    in either encoding or decoding. Encoding and decoding need not compute a precise

    encoding or decoding such that data can be restored perfectly. Encoding and decoding

    are part of a system which as a whole should approximate some function. Hence only

    those parts of the data have to be taken into account which contribute to the specific

    learning task. Recurrent networks are mostly used for time series prediction, i.e., the

decoding $\mathrm{dec}_g$ is dropped. Long term prediction of time series, where the decoding part

    is necessary, is a particularly difficult task and can rarely be found in applications.
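To make Definition 1 concrete, the following small Python sketch shows the recursive encoding and decoding (my own illustration, assuming that f and g are given as ordinary Python functions; the toy function f at the end is purely hypothetical and stands in for a trained network):

    from typing import Callable, List, Tuple

    Code = Tuple[float, ...]          # a connectionistic representation in R^n

    def enc(f: Callable[[object, Code], Code], y0: Code, xs: List[object]) -> Code:
        # Definition 1: enc([]) = y0, enc([x1..xt]) = f(xt, enc([x1..x_{t-1}]))
        code = y0
        for x in xs:
            code = f(x, code)
        return code

    def dec(g: Callable[[Code], Tuple[object, Code]],
            is_final: Callable[[Code], bool], x: Code, max_len: int = 100) -> List[object]:
        # Definition 1: dec(x) = [] if x lies in the final set, otherwise the label
        # g0(x) is emitted and decoding continues with g1(x); the length cap
        # reflects that dec may be a partial function
        out: List[object] = []
        while not is_final(x) and len(out) < max_len:
            label, x = g(x)
            out.append(label)
        return out

    # toy usage with a hypothetical f that packs a bit sequence into one number
    f = lambda bit, ctx: (0.5 * ctx[0] + 0.5 * bit,)
    print(enc(f, (0.0,), [1, 0, 1, 1]))   # a single real number encoding the sequence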

    A second advantage of the above formalism is the possibility to generalize the dy-

namics to tree structured data. Note that terms and formulas possess a natural representation via a tree structure: The single symbols, i.e., the variables, constants, function

    symbols, predicates, and logical symbols are encoded in some real-vector space via

    unique values, e.g., natural numbers or unary vectors; these values correspond to the

    labels of the nodes in a tree. The tree structure directly corresponds to the structure of

    the term or formula; i.e., subterms of a single term correspond to subtrees of a node


equipped with the label encoding the function symbol. See Fig. 2 for an example.

Fig. 2. Example of a tree representation of symbolic data: the symbols are encoded by unary vectors such as (1,0,0), (0,1,0), and (0,0,1), which form the labels of the nodes of the corresponding trees.

In

the following, we restrict the maximum arity of functions and predicates to some fixed value $k$. Hence the data we are interested in are trees where each node has at most $k$ successors. Expanding the tree by empty nodes if necessary, we can restrict ourselves to the case of trees with fan-out exactly $k$ of the nodes. Hence we will deal with tree structures with fan-out $k$ as inputs or outputs of network architectures in the following.

Definition 2. A $k$-tree with labels in some set $\Sigma$ is either the empty tree, which we denote by $\xi$, or it consists of a root labeled with some $a \in \Sigma$ and $k$ subtrees, some of which may be empty, $t_1, \ldots, t_k$. In the latter case we denote the tree by $a(t_1, \ldots, t_k)$. Denote the set of $k$-trees with labels in $\Sigma$ by $(\Sigma)^{*_k}$.

    The recursive nature of trees induces a natural dynamics for recursively encoding

    or decoding trees to real vectors. We can define an induced encoding or decoding, re-

    spectively, for each mapping with appropriate arity in the following way:

Definition 3. Denote by $\Sigma$ a set. Any mapping $f: \Sigma \times (\mathbb{R}^n)^k \to \mathbb{R}^n$ and initial context $y_0 \in \mathbb{R}^n$ induce a recursive encoding

$$\mathrm{enc}_f: (\Sigma)^{*_k} \to \mathbb{R}^n, \qquad t \mapsto \begin{cases} y_0 & \text{if } t = \xi, \\ f(a, \mathrm{enc}_f(t_1), \ldots, \mathrm{enc}_f(t_k)) & \text{if } t = a(t_1, \ldots, t_k). \end{cases}$$

Any mapping $g = (g_0, g_1, \ldots, g_k): \mathbb{R}^n \to \Sigma \times (\mathbb{R}^n)^k$ and set $Y \subset \mathbb{R}^n$ induce a recursive decoding

$$\mathrm{dec}_g: \mathbb{R}^n \to (\Sigma)^{*_k}, \qquad x \mapsto \begin{cases} \xi & \text{if } x \in Y, \\ g_0(x)\big(\mathrm{dec}_g(g_1(x)), \ldots, \mathrm{dec}_g(g_k(x))\big) & \text{otherwise.} \end{cases}$$

Again, $\mathrm{dec}_g$ might be a partial function. Therefore decoding is often restricted to decoding of trees up to a fixed height in practice. The encoding recursively applies a mapping in order to obtain a code for a tree in a real-vector space. One starts at the leaves and recursively encodes the single subtrees. At each level the already computed codes of the respective subtrees are used as context. The recursive decoding is defined in a similar manner: Recursively applying some decoding function to a real vector yields the label of the root and codes for the $k$ subtrees. In the connectionistic setting, the two mappings used for encoding or decoding, respectively, can be computed by standard feedforward neural networks. As in the linear case, i.e., the case of simple recurrent networks, one can combine the mappings $\mathrm{enc}_f$, $\mathrm{dec}_g$, and $h$ depending on the specific learning task.
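The recursive tree encoding of Definition 3 can be sketched in a few lines of Python (my own illustration; the Tree class and the toy encoding function f are hypothetical and stand in for a trained feedforward network):

    from dataclasses import dataclass
    from typing import Callable, Optional, Tuple

    Code = Tuple[float, ...]

    @dataclass
    class Tree:
        # a k-tree: a label and k subtrees; None stands for the empty tree
        label: object
        children: Tuple[Optional["Tree"], ...]

    def enc_tree(f: Callable[..., Code], y0: Code, t: Optional[Tree]) -> Code:
        # Definition 3: the empty tree is mapped to the initial context y0,
        # a(t1, ..., tk) is mapped to f(a, enc(t1), ..., enc(tk))
        if t is None:
            return y0
        return f(t.label, *(enc_tree(f, y0, c) for c in t.children))

    # toy usage: a hypothetical encoding function for 2-trees with 1-dimensional codes
    f = lambda label, c1, c2: (hash((label, c1, c2)) % 997 / 997.0,)
    example = Tree("f", (Tree("a", (None, None)), Tree("a", (None, None))))
    print(enc_tree(f, (0.0,), example))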


    Note that this definition constitutes a natural generalization of standard recurrent

    networks and hence allows for successful practical applications as well as general in-

    vestigations concerning concrete learning algorithms, the connection to classical mech-

    anisms like tree automata, and the theoretical properties of approximation ability and

    learnability. However, it is not biologically motivated compared to standard recurrent

    networks, and though this approach can shed some light on the possibility of dealing

    with structured data in connectionistic systems, it does not necessarily enlighten the

    way in which humans solve these tasks. We will start with a thorough investigation of

    simple recurrent networks since they are biologically plausible and, moreover, signifi-

    cant theoretical difficulties and benefits can already be found at this level.

    3 Recurrent Neural Networks

    Recurrent networks are a natural tool in any domain where time plays a role, such as

speech recognition, control, or time series prediction, to mention just a few [8,9,25,41]. They are also used for the classification of symbolic data such as DNA sequences [31].

    Turing Capabilities

    The fact that their inputs and outputs may be sequences suggests the comparison to

    other mechanisms operating on sequences, such as classical Turing machines. One can

    consider the internal states of the network as a memory or tape of the Turing machine.

    Note that the internal states of the network may consist of real values, hence an infinite

    memory is available in the network. In Turing machines, operations on the tape are per-

    formed. Each operation can be simulated in a network by a recursive computation step

    of the transition function. In a Turing machine, the end of a computation is indicated by

    a specific final state. In a network, this behavior can be mimicked by the activation of

    some specific neuron which indicates whether the computation is finished or still con-

    tinues. The output of the computation can be found at the same time step at some other

    specified neuron of the network. Note that computations of a Turing machine which do

    not halt correspond to recursive computations of the network such that the value of the

    specified halting neuron is different from some specified value. A schematic view of

    such a computation is depicted in Fig. 3. A possible formalization is as follows:

Definition 4. A (possibly partial) function $\phi: \{0,1\}^* \to \{0,1\}$ can be computed by a recurrent neural network if feedforward networks $f: \{0,1\} \times \mathbb{R}^n \to \mathbb{R}^n$, $h: \mathbb{R}^n \to \mathbb{R}^n$, and $g: \mathbb{R}^n \to \mathbb{R}$ exist such that $\phi(x) = g(h^{t_x}(\mathrm{enc}_f(x)))$ for all sequences $x$, where $t_x$ denotes the smallest number of iterations such that the activation of some specified output neuron of the $h$ part is contained in a specified set encoding the end of the computation after iteratively applying $h$ to $\mathrm{enc}_f(x)$.

Note that simulations of Turing machines are merely of theoretical interest; such com-

    putation mechanisms will not be used in practice. However, the results shed some light

on the power of recurrent networks. The networks' capacity naturally depends on the

    choice of the activation functions. Common activation functions in the literature are

    piecewise polynomial or S-shaped functions such as:



    Fig. 3. Turing computation with a recurrent network.

the perceptron function $H(x) = 0$ for $x \le 0$, $H(x) = 1$ for $x > 0$,

the semilinear activation $\mathrm{lin}(x) = x$ for $0 \le x \le 1$, $\mathrm{lin}(x) = H(x)$ otherwise,

the sigmoidal function $\mathrm{sgd}(x) = 1/(1 + e^{-x})$.
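For reference, these three activation functions can be written in Python as follows (a straightforward transcription under the conventions above, not code from the paper):

    import numpy as np

    def perceptron(x):
        # H(x): 0 for x <= 0, 1 otherwise
        return np.where(x > 0.0, 1.0, 0.0)

    def semilinear(x):
        # lin(x): the identity on [0, 1], cut off to 0 below and 1 above
        return np.clip(x, 0.0, 1.0)

    def sgd(x):
        # standard sigmoidal (logistic) function
        return 1.0 / (1.0 + np.exp(-x))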

    Obviously, recurrent networks with a finite number of neurons and the perceptron acti-

vation function have at most the power of finite automata, since their internal stack is finite. In [38] it is shown that recurrent networks with the semilinear activation function

    are Turing universal, i.e., there exists for every Turing machine a finite size recurrent

    network which computes the same function. The proof consists essentially in a simu-

    lation of the stacks corresponding to the left and right half of the Turing tape via the

    activation of two neurons. Additionally, it is shown that standard tape operations like

    push and pop and Boolean operations can be computed with a semilinear network.

    The situation is more complicated for the standard sigmoidal activation function since

exact classical computations which require precise activations $0$ or $1$, as an example,

    can only be approximated within a sigmoidal network. Hence the approximation errors

    which add up in recurrent computations must be controlled. [16] shows the Turing uni-

    versality of sigmoidal recurrent networks via simulating so-called clock machines, a

    Turing-universal formalism which, unfortunately, leads to an exponential delay. How-

    ever, people believe that standard sigmoidal recurrent networks are Turing universal

    with polynomial resources, too, although the formal proof is still missing.

    In [37] the converse direction, simulation of neural network computations with clas-

    sical mechanisms is investigated. The authors relate semilinear recurrent networks to

    so-called non-uniform Boolean circuits. This is particularly interesting due to the fact

that non-uniform circuits are super-Turing universal, i.e., they can compute every (possibly non-computable) function, possibly requiring exponential time. Speaking in terms

    of neural networks: Additionally to the standard operations, networks can use the un-

limited storage capacity of the single digits in their real weights as an oracle; a linear

    number of such digits is available in linear time. Again, the situation is more diffi-

    cult for the sigmoidal activation function. The super-Turing capability is demonstrated

    in [36], for example. [11] shows the super-Turing universality in possibly exponential

    time and which is necessary in all super-Turing capability demonstrations of recurrentnetworks, of course with at least one irrational weight. Note that the latter results rely

    on an additional severe assumption: The operations on the real numbers are performed

    with infinite precision. Hence further investigation could naturally be put in a line with

    the theory of computation on the real numbers [3].


    Finite Automata and Languages

    The transition dynamics of recurrent networks directly correspond to finite automata,

hence comparing to finite automata is a very natural question. For a formal definition, a finite automaton with $N$ states computes a function $P \circ \mathrm{enc}_\delta: \Sigma^* \to \{0,1\}$, where $\Sigma$ is a finite alphabet, $\delta: \Sigma \times \{1, \ldots, N\} \to \{1, \ldots, N\}$ is a transition function mapping an input letter and a context state to a new state, $s_0 \in \{1, \ldots, N\}$ is the initial state, and $P$ is a projection of the states to $\{0,1\}$. A language $L \subseteq \Sigma^*$ is accepted by an automaton if some automaton computing $P \circ \mathrm{enc}_\delta$ exists such that $L = \{x \in \Sigma^* \mid P(\mathrm{enc}_\delta(x)) = 1\}$.

    not surprising that finite automata and some context sensitive languages, too, can be

    simulated by recurrent networks. However, automata simulations have practical conse-

    quences: The constructions lead to effective techniques of automata rule insertion and

    extraction; moreover, the automaton behavior is even learnable from data as demon-

    strated in computer simulations. It has been shown in [27], for example, that finite

automata can be simulated by recurrent networks of the form $g \circ \mathrm{enc}_f$, $g$ being a simple projection and $f$ being a standard feedforward network. The number of neurons which are sufficient in $f$ is upper bounded by a linear term in $N$, the number of states of the automaton. Moreover, the perceptron activation function or the sigmoidal activation function or any other function with similar properties will do. One could ask whether fewer neurons are sufficient, since one could encode $N$ states in the activation of only $\log N$ binary valued neurons. However, an abstract argumentation shows that, at least for perceptron networks, a number of neurons of order at least $\sqrt{N/\log N}$ is necessary. Since this argumentation can be used at several places, we shortly outline the main steps:

The set of finite automata with $N$ states and binary inputs defines the class of functions computable with such a finite automaton, say $\mathcal{F}_N$. Assume a network with at most $n$ neurons could implement every $N$-state finite automaton. Then the class of functions computable with $n$-neuron architectures, say $\mathcal{F}'_n$, would be at least as powerful as $\mathcal{F}_N$. Consider the $N$ sequences of length $N$ with an entry $1$ precisely at the $i$th position. Assume some arbitrary binary function $b$ is fixed on these sequences. Then there exists an automaton with $O(N)$ states which implements $b$ on the sequences: We can use $N$ states for counting the position of the respective input entry and map to a specified final accepting state whenever the corresponding function value is $1$. As a consequence, we would need to find, for those $N$ sequences and every dichotomy, some recurrent network which maps the sequences accordingly, too. However, the number of input sequences which can be mapped to arbitrary values is upper bounded by the so-called pseudodimension, a quantity measuring the richness of function classes, as we will see later. In particular, this quantity can be upper bounded by a term $O(W \log(W T))$ for perceptron networks with input sequences of length $T$ and $W$ weights. Hence the lower bound of order $\sqrt{N/\log N}$ on the number of neurons follows.
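To make the automaton simulation concrete, the following numpy sketch (my own illustration, not the construction of [27]) implements a two-state parity automaton with a two-layer perceptron transition network, using one hidden unit per (state, symbol) pair:

    import numpy as np

    H = lambda x: (x > 0).astype(float)        # perceptron (Heaviside) activation

    # parity automaton over {0,1}: "even" / "odd" number of ones seen so far
    states, symbols = ["even", "odd"], [0, 1]
    delta = {("even", 0): "even", ("even", 1): "odd",
             ("odd", 0): "odd",   ("odd", 1): "even"}

    # two-layer threshold transition network:
    # one hidden unit per (state, symbol) pair (an AND), one output unit per next state (an OR)
    pairs = [(q, a) for q in states for a in symbols]
    W_hid = np.zeros((len(pairs), len(symbols) + len(states)))
    for i, (q, a) in enumerate(pairs):
        W_hid[i, symbols.index(a)] = 1.0
        W_hid[i, len(symbols) + states.index(q)] = 1.0
    theta_hid = -1.5 * np.ones(len(pairs))     # fires only if symbol and state are both active
    W_out = np.zeros((len(states), len(pairs)))
    for i, (q, a) in enumerate(pairs):
        W_out[states.index(delta[(q, a)]), i] = 1.0
    theta_out = -0.5 * np.ones(len(states))

    def run(seq):
        state = np.array([1.0, 0.0])           # start in "even"
        for a in seq:
            x = np.zeros(len(symbols)); x[symbols.index(a)] = 1.0
            hidden = H(W_hid @ np.concatenate([x, state]) + theta_hid)
            state = H(W_out @ hidden + theta_out)
        return states[int(np.argmax(state))]

    print(run([1, 0, 1, 1]))                   # "odd": the sequence contains three ones

For a fixed alphabet the number of hidden units grows linearly with the number of states, in line with the linear upper bound quoted above.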

However, various researchers have demonstrated in theory as well as in practice that sigmoidal recurrent networks can recognize some context sensitive languages as well: It is proved in [13] that they can perform counting, i.e., recognize languages of the form $\{a^n b^n c^n \mid n \in \mathbb{N}\}$ or, generally spoken, languages where the multiplicities of various

    symbols have to match. Approaches like [20] demonstrate that a finite approximation of

    these languages can be learned from a finite set of examples. This capacity is of partic-


    ular interest due to its importance for the capability of understanding natural languages

    with nested structures. The learning algorithms are usually standard algorithms for re-

    current networks which we will explain later. Commonly, they do not guarantee the

    correct long-term behavior of the networks, i.e., they lead only sometimes to the correct

    behavior for long input sequences, although they perform surprisingly well on short

training samples. Learnability, for example in the sense of identification in the limit as introduced by Gold, is not guaranteed. Approaches which explicitly tackle the long-term behavior and which, moreover, allow for a symbolic interpretation of the connectionistic processing are automata rule insertion and extraction: The possibly partial explicit knowledge of the automaton's behavior can be directly encoded in a recurrent network

    used for connectionistic processing, if necessary with further retraining of the network.

    Conversely, automata rules can be extracted from a trained network which describe the

    behavior approximately and generalize to arbitrarily long sequences [5,26].

    However, all these approaches are naturally limited due to the fact that common

    connectionistic data are subject to noise. Adequate recursive processing relies to some

extent on the accuracy of the computation and the input data. The capacity is different if noise is present: At most finite state automata can be found provided the support of the noise is limited. If the support of the noise is not limited, e.g. the noise is Gaussian, then the capacity reduces to the capacity of simple feedforward dynamics with a finite time window [21,22]. Hence, while recurrent networks can algorithmically process symbolic data in a finite approximation, the presence of noise limits their capacities.

    Learning Algorithms

    Naturally, an alternative point of view is the classical statistical scenario, i.e., possibly

noisy data allow one to learn an unknown regularity with high accuracy and confidence for data of high probability. In particular, the behavior need not be correct for every input; the learning algorithms are only guaranteed to work well in typical cases; in unlikely situations the system may fail. The classical PAC setting as introduced by Valiant for-

malizes this approach to learnability [42] as follows: Some unknown regularity $f$, for which only a finite set of examples $(x_i, f(x_i))$ is available, is to be learned. A learning

    algorithm chooses a function from a specified class of functions, e.g. given by a neural

    architecture, based on the training examples. There are two demands: The output of

    the algorithm should nearly coincide with the unknown regularity, mathematically, the

    probability that the algorithm outputs a function which differs considerably from the

    function to be learned should be small. Moreover, the algorithm should run in polyno-

    mial time, the parameters being the desired accuracy and confidence of the algorithm.

    Usually, learning separates into two steps as depicted in Fig. 4: First, a function

    class with limited capacity is chosen, e.g. the number of neurons and weights is fixed,

    such that the function class is large enough to approximate the regularity to be learned

and, at the same time, allows identification of an approximation based on the available training set, i.e., guarantees valid generalization to unseen samples. This is commonly

    addressed by the term structural risk minimization and obtained via a control of the

    so-called pseudodimension of the function class. We will address this topic later. In

    a second step, a concrete regularity is actually searched for in the specified function

    class, commonly via so called empirical risk minimization, i.e., a function is chosen


Fig. 4. Structural and empirical risk minimization: first, a function class is chosen from nested function classes of increasing complexity; second, the empirical error is minimized within this class, yielding an empirical approximation of the function to be learned.

which nearly coincides with the regularity to be learned on the training examples $x_i$. According to these two steps, the generalization error divides into two parts: The

    structural error, i.e., the deviation of the empirical error on a finite set of data from

    the overall error for functions in the specified class, and the empirical error, i.e., the

deviation of the output function from the regularity on the training set.

We will shortly summarize various empirical risk minimization techniques for re-

current neural networks: Assume $(x_i, y_i)_i$ are the training data and some neural architecture computing a function $f_w$ which is parameterized by the weights $w$ is chosen. Often, training algorithms choose appropriate weights $w$ by means of minimizing the quadratic error $\sum_i d(f_w(x_i), y_i)^2$, $d$ being some appropriate distance, e.g. the Euclidean distance. Since in popular cases the above term is differentiable with respect to the weights, a simple gradient descent can be used. The derivative with respect to one weight decomposes into various terms according to the sequential structure of the inputs and outputs, i.e., the number of recursive applications of the transition functions. A direct recursive computation of the single terms has complexity $O(W^2 t)$, $W$ being the number of weights and $t$ being the number of recurrent steps. In so-called

real time recurrent learning, these weight updates are performed immediately after the computation such that initially unlimited time series can be processed. This method can be applied in online learning in robotics, for example. In analogy to standard backpropagation, the most popular learning algorithm for feedforward networks, one can speed up the computation and obtain the derivatives in time $O(W t)$ via first propagating the signals forward through the entire network and recursive steps and afterwards propagating the error signals backwards through the network and all recursive steps. However, the possibility of online adaptation while a sequence is still processed is lost in this so-called backpropagation through time [28,44]. There exist combinations of both methods and variations for training continuous systems [33]. The true gradient is sometimes substituted by a truncated gradient in earlier approaches [6]. Since theoretical investigation suggests that pure gradient descent techniques will likely suffer from numerical instabilities (the gradients will either blow up or vanish during propagation through the recursive steps), alternative methods propose random guessing, statistical approaches like the EM algorithm, or an explicit normalization of the error like LSTM [1,12].

    networks due to numerically ill-behaved gradients as shown in [2]. Hence the com-

    plexity of training recurrent networks is a very interesting topic; moreover, the fact that


    the empirical error can be minimized efficiently is one ingredient of PAC learnabil-

    ity. Unfortunately, precise theoretical investigations can be found only for very limited

    situations: It has been proved that fixed recurrent architectures with the perceptron ac-

    tivation function can be trained in polynomial time [11]. Things change if architectural

    parameters are allowed to vary. This means that the number of input neurons, for exam-

ple, may change from one training problem to the next since most learning algorithms

    are uniform with respect to the architectural size. In this case, almost every realistic

    situation is NP-hard already for feedforward networks, although this has not yet been

    proved for a sufficiently general scenario. One recent result reads as follows: Assume

    there is given a multilayer perceptron architecture where the number of input neurons

    is allowed to vary from one instance to the next instance, the input biases are dropped,

and no solution without errors exists. Then it is NP-hard to find a network such that

    the number of misclassified points of the network compared to the optimum achievable

    number is limited by a term which may even be exponential in the network size [4].

    People are working on adequate generalizations to more general or typical situations.

    Approximation Ability

    The ability of recurrent neural networks to simulate Turing machines manifests their

    enormous capacity. From a statistical point of view, we are interested in a slightly dif-

ferent question: Given some finite set of examples $(x_i, y_i)$, the inputs $x_i$ or outputs $y_i$ may be sequences, does there exist a network which maps each $x_i$ approximately onto the corresponding $y_i$? Which are the required resources? If there is an underlying mapping, can it be approximated in an appropriate sense, too? The difference to the previous argumentation consists in the fact that there need not be a recursive underlying regularity producing $y_i$ from $x_i$. At the same time we do not require to interpolate or simulate

    the underlying possibly non-recursive behavior precisely in the long term limit.

One way to attack the above questions consists in a division of the problem into three parts: It is to be shown that sequences can be encoded or decoded, respectively, with a neural network, and that the induced mapping on the connectionistic representation can be approximated with a standard feedforward network. There exist two natural ways of encoding sequences in a finite dimensional vector space: Sequences of length at most $T$ can be written in a vector space of dimension $T$, filling the empty spaces, if any, with entries $0$; we refer to this coding as vector-coding. Alternatively, the single entries in a sequence can be cut to a fixed precision and concatenated in a single real number; we refer to this method as real-value-coding. Hence a sequence $[x_1, x_2, x_3, x_4]$ with entries in $[0,1)$ becomes the vector $(x_1, x_2, x_3, x_4, 0, 0, 0)$ (for maximum length seven) or a single real number obtained by concatenating truncated expansions of $x_1, \ldots, x_4$, as an example. One can show that both codings can be computed with a recurrent network. Vector-encoding and decoding can be performed with a network whose number of neurons is linear in the maximum input length and which possesses an appropriate activation function. Real-value-encoding is possible with only a fixed number of neurons for purely symbolic data, i.e., inputs from $\Sigma^*$ with $|\Sigma| < \infty$. Sequences in $(\mathbb{R}^n)^*$ require additional neurons which compute the discretization of the real values. Naturally, precise decoding of the discretization is not possible since this information is lost in the coding. Encoding such that unique codes result can be performed with a number of neurons depending on the number of sequences to be encoded. Decoding real-value codes is possible,


    too. However, a standard activation function requires a number of neurons increasing

    with the maximum length even for symbolic data [11].
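The two codings can be sketched in Python as follows (a toy illustration of my own; the digit-concatenation scheme assumes a small alphabet and is only one possible realization of real-value-coding):

    def vector_code(seq, max_len, pad=0.0):
        # vector-coding: write a sequence of reals into a fixed-length vector, padding with 0
        assert len(seq) <= max_len
        return tuple(seq) + (pad,) * (max_len - len(seq))

    def real_value_code(symbols, alphabet):
        # real-value-coding for purely symbolic data: concatenate one digit per symbol
        # (assumes at most 9 symbols) into a single real number in [0, 1)
        digits = "".join(str(alphabet.index(s) + 1) for s in symbols)
        return float("0." + digits) if digits else 0.0

    print(vector_code([0.2, 0.7, 0.1], max_len=7))        # (0.2, 0.7, 0.1, 0.0, 0.0, 0.0, 0.0)
    print(real_value_code(["a", "b", "a"], ["a", "b"]))   # 0.121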

    It is well known that feedforward networks with one hidden layer and appropriate

    activation function are universal approximators. Hence one can conclude that approx-

    imation of general functions is possible if the above encoding or decoding networks

    are combined with a standard feedforward network which approximates the induced

    mappings on the connectionistic codes. To be more precise, approximating measurable

functions on inputs of arbitrarily high probability is possible through real-value encoding.

    Each continuous function can be approximated for inputs from a compact set through

    vector-encoding. In the latter case, the dimension used for the connectionistic represen-

    tation necessarily increases for increasing length of the sequences [11].

    Learnability

    Having settled the universal approximation ability, we should make sure that the struc-

tural risk can be controlled within a fixed neural architecture. I.e., we have to show that a finite number of training examples is sufficient in order to nearly specify the unknown underlying regularity. Assume there is fixed some probability measure $P$ on the inputs. For the moment assume that we deal with real-valued outputs only. Then one standard way to guarantee the above property for a function class $\mathcal{F}$ is via the so-called uniform convergence of empirical distances (UCED) property, i.e.,

$$P^m\Big( x \,\Big|\, \sup_{f \in \mathcal{F}} \big| d_P(f, g) - \hat{d}_m(f, g, x) \big| > \epsilon \Big) \to 0 \quad (m \to \infty)$$

holds for every $\epsilon > 0$, where $d_P(f, g) = \int |f(x) - g(x)| \, dP(x)$ is the real error of $f$ with respect to the regularity $g$ to be learned and $\hat{d}_m(f, g, x) = \frac{1}{m} \sum_{i=1}^m |f(x_i) - g(x_i)|$ is the empirical error on the sample $x = (x_1, \ldots, x_m)$. The UCED property guarantees that the

    empirical error of any learning algorithm is representative for the real generalization

    error. We refer to the above distance as the risk. A standard way to prove the UCED

    property consists in an estimation of a combinatorial quantity, the pseudodimension.

Definition 5. The pseudodimension of a function class $\mathcal{F}$, $\mathrm{VC}(\mathcal{F})$, is the largest cardinality (possibly infinite) of a set of points which can be shattered. A set of points $x_1, \ldots, x_m$ is shattered if reference points $r_1, \ldots, r_m \in \mathbb{R}$ exist such that for every mapping $b: \{x_1, \ldots, x_m\} \to \{0, 1\}$ some function $f \in \mathcal{F}$ exists with $f(x_i) \ge r_i \iff b(x_i) = 1$ for all $i$.

    The pseudodimension measures the richness of a function class. It is the largest set of

    points such that every possible binary function can be realized on these points. No gen-

    eralization can be expected if a training set can be shattered. It is well known that the

    UCED property holds if the pseudodimension of a function class is finite [43]. More-

over, the number of examples required for valid generalization can be explicitly limited by roughly the order $d/\epsilon^2$, $d$ being the pseudodimension and $\epsilon$ the required accuracy.
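For orientation, a generic uniform-convergence bound of this type (a standard textbook form with an unspecified universal constant $c$, not the paper's exact statement) reads:

$$m \;\ge\; \frac{c}{\epsilon^2}\Big( d \log\frac{1}{\epsilon} + \log\frac{1}{\delta} \Big) \quad\Longrightarrow\quad P^m\Big( \sup_{f \in \mathcal{F}} \big| d_P(f, g) - \hat{d}_m(f, g, x) \big| > \epsilon \Big) \;\le\; \delta,$$

where $d$ is the pseudodimension, $m$ the number of training examples, and $\delta$ the allowed failure probability.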

Assume $\mathcal{F}$ is given by a recurrent architecture with $W$ weights. Denote by $\mathcal{F}_T$ the restriction to inputs of length at most $T$. Then one can limit $\mathrm{VC}(\mathcal{F}_T)$ by a polynomial in $T$ and $W$. However, lower bounds exist which show that the pseudodimension necessarily depends on $T$ in most interesting cases [17]. Hence $\mathrm{VC}(\mathcal{F})$ is infinite for unrestricted


    sequences. As a consequence, the above argumentation proves learnability only for re-

    stricted inputs. Moreover, since a finite pseudodimension (more precisely, a finite so-

    called fat-shattering dimension) is necessary for distribution independent learnability

    under realistic conditions, distribution independent bounds for the risk cannot exist in

    principle [43]. Hence one has to add special considerations to the standard argumenta-

    tion for recurrent architectures. Mainly two possibilities can be found in the literature:

    One can either take specific knowledge about the underlying probability into consider-

    ation, or one can derive posterior bounds which depend on the specific training set. The

    results are as follows [11]:

Assume $\epsilon > 0$ and one can find $T$ such that the probability of sequences of length greater than $T$ is bounded from above by $\epsilon$. Then the risk is limited by $\epsilon$ provided that the number of examples is roughly of order $d_T/\epsilon^2$, $d_T$ being the (finite) pseudodimension of the architecture restricted to input sequences of length at most $T$.

Assume training on a set of size $m$ with maximum length $T$ has been performed. Then the risk can be bounded by a term of roughly order $\sqrt{d_T/m}$, $d_T$ being the (finite) pseudodimension of the architecture restricted to input sequences of length at most $T$. A more detailed analysis even allows one to drop the long sequences before measuring $T$ [10].

    Hence one can guarantee valid generalization, although only with additional con-

    siderations compared to the feedforward case. Moreover, there may exist particularly

    ugly situations for recurrent networks where training is possible only with an exponen-

    tially increasing number of training examples [11]. This is the price one has to pay for

    the possibility of dealing with structured data, in particular data with a priori unlimited

    length. Note that the above argumentation holds only for architectures with real val-

    ues as outputs. The case of structured outputs requires a more advanced analysis via so

    called loss functions and yields to similar results [10].

    4 Advanced Architectures

    The next step is to go from sequences to tree structured data. Since trees cover terms

    and formulas, this is a fairly general approach. The network dynamics and theoretical

    investigations are direct generalizations of simple recurrent networks. One can obtain

a recursive neural encoding $\mathrm{enc}_f: (\Sigma)^{*_k} \to \mathbb{R}^n$ and a recursive neural decoding $\mathrm{dec}_g: \mathbb{R}^n \to (\Sigma)^{*_k}$ of trees if $f$ and $g$ are computed by standard networks. These codings

    can be composed with standard networks for the approximation of general functions.

    Depending on whether the inputs, the outputs, or both may be structured and depending

    on which part is trainable, we obtain different connectionistic mechanisms. A sketch of

    the first two mechanisms which are described in the following can be found in Fig. 5.

    Recursive Autoassociative Memory

    The recursive autoassociative memory (RAAM) as introduced by Pollack and gener-

alized by Sperduti and Starita [30,40] consists of a recursive encoding $\mathrm{enc}_f$ and a recursive decoding $\mathrm{dec}_g$, $f$ and $g$ being standard feedforward networks, and a standard feedforward network $h$. An appropriate composition of these parts can approximate mappings where the inputs or the outputs may be $k$-trees or vectors, respectively. Training proceeds in


Fig. 5. Processing tree structures with connectionistic methods: a RAAM encodes a tree into a real vector and decodes the vector back into a tree, whereas a folding network encodes a tree and maps the resulting code to a real vector $x$.

two steps: first, the composition $\mathrm{dec}_g \circ \mathrm{enc}_f$ is trained on the identity on a given training set with truncated gradient descent such that the two parts constitute a proper encoding or decoding, respectively. Afterwards, a standard feedforward network is combined with either the encoding or decoding and trained via standard backpropagation where the weights in the recursive coding are fixed. Hence arbitrary mappings on structured data can be approximated. Note that the encoding is fitted to the specific training set. It is

    not fitted to the specific approximation task. In all cases encoding and decoding must

    be learned even if only the inputs or only the outputs are structured.

    In analogy to simple recurrent networks the following questions arise: Can any map-

    ping be approximated in principle? Do the respective parts show valid generalization?

    Is training efficient? We will not consider the efficiency of training in the following

    since the question is not yet satisfactorily answered for feedforward networks. The other

    questions are to be answered for both, the coding parts and the feedforward approxima-

    tion on the encoded values. Note that the latter task only deals with standard feedfor-

    ward networks whose approximation and generalization properties are well established.

    Concerning the approximation capability of the coding parts we can borrow ideas from

    recurrent networks: A natural encoding of tree structured data consists in the prefix rep-

resentation of a tree. For example, the $2$-tree $f(a, g(a, a))$ can uniquely be represented by the sequence $[f, a, \xi, \xi, g, a, \xi, \xi, a, \xi, \xi]$ including the empty tree $\xi$. Depending on whether real-labeled trees are to be encoded precisely, or only symbolic data, i.e., labels from $\Sigma$ where $|\Sigma| < \infty$, are dealt with, or a finite approximation of the real values is sufficient, the above sequence can be encoded in


    a real-value code with a fixed dimension or a vector code whose dimension depends

    on the maximum height of the trees. The respective encoding or decoding can be com-

    puted with recursive architectures induced by a standard feedforward network [11]. The

    required resources are as follows: Vector-coding requires a number of neurons which

    increases exponentially with the maximum height of the trees. Real-value-encoding re-

    quires only a fixed number of neurons for symbolic data and a number of neurons which

    is quadratic in the number of patterns for real-valued labels. Real-value-decoding re-

    quires a number of neurons which increases exponentially with increasing height of

    the trees; the argument consists in a lower bound of the pseudodimension of function

    classes which perform proper decoding. This number increases more than exponentially

    in the height. Learning the coding yields valid generalization provided prior information

    about the input distribution is available. Alternatively, one can derive posterior bounds

    on the generalization error which depend on the concrete training set. These results

    follow in the same way as for standard recurrent networks. Hence the RAAM consti-

    tutes a promising and in principle applicable mechanism. Due to the difficulty of proper

decoding, applications can be found for small training sets only [40].
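For concreteness, the prefix serialization used above as a natural tree encoding can be sketched in a few lines (the label set and the symbol chosen for the empty tree are illustrative):

```python
# Prefix (depth-first) serialization of binary trees with an explicit symbol for
# the empty tree; labels and the marker are illustrative choices.
EMPTY = "#"

def prefix(tree):
    """A tree is either None (the empty tree) or (label, left subtree, right subtree)."""
    if tree is None:
        return [EMPTY]
    label, left, right = tree
    return [label] + prefix(left) + prefix(right)

# the tree a(c(empty, empty), c(a(empty, empty), empty)) from the text
t = ("a", ("c", None, None), ("c", ("a", None, None), None))
print(prefix(t))   # ['a', 'c', '#', '#', 'c', 'a', '#', '#', '#']
```

Since the entries stem from the finite set of labels plus the empty-tree symbol, each entry can be mapped to a number or to a unit vector, which yields the real-value codes or vector codes discussed above.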

    Folding Networks

    Folding networks use ideas of the LRAAM [19]. They focus on clustering symbolic

    data, i.e., the outputs are not structured, but real vectors. This limitation makes decoding

    superfluous. For training, the encoding part and the feedforward network are composed

    and simultaneously trained on the respective task via a gradient descent method, so-

    called backpropagation through structure, a generalization of backpropagation through

    time. Hence the encoding is fitted to the data and the respective learning task.
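A rough sketch of this joint training (again assuming PyTorch; the cell architecture, the label alphabet, and the toy task are illustrative) composes a shared recursive encoding cell with a feedforward output network and propagates the gradient through the structure of every input tree:

```python
# Illustrative folding network; labels, dimensions and the toy task are made up.
import torch
import torch.nn as nn

DIM, LABELS = 8, 3                               # code dimension, size of the label alphabet

cell = nn.Linear(LABELS + 2 * DIM, DIM)          # shared recursive encoding cell
out_net = nn.Sequential(nn.Linear(DIM, 16), nn.Tanh(), nn.Linear(16, 1))
empty_code = torch.zeros(DIM)                    # fixed code for the empty tree

def fold(tree):
    """Encode a tree; a tree is None or (label index, left subtree, right subtree)."""
    if tree is None:
        return empty_code
    label_idx, left, right = tree
    label = torch.eye(LABELS)[label_idx]         # one-hot representation of the label
    return torch.tanh(cell(torch.cat([label, fold(left), fold(right)])))

# toy data: labeled binary trees with a real-valued target each
data = [((0, (1, None, None), (2, None, None)),  1.0),
        ((2, None, (1, None, None)),            -1.0)]

# encoding cell and output network are trained jointly; the gradient is propagated
# through the structure of every input tree (backpropagation through structure)
opt = torch.optim.Adam(list(cell.parameters()) + list(out_net.parameters()), lr=1e-2)
for _ in range(200):
    loss = sum((out_net(fold(t)).squeeze() - y) ** 2 for t, y in data)
    opt.zero_grad(); loss.backward(); opt.step()
```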

    It follows immediately from the above discussion that folding networks can approx-

    imate every measurable function in probability using real-value codes, and they can

approximate every continuous function on compact input domains with vector codes. Additionally, valid generalization can be guaranteed with a similar argument as

    above with bounds depending on the input distribution or the concrete training set.

    Due to the fact that the difficult part, proper decoding, is dropped, several applica-

    tions of folding networks for large data sets can be found in the literature: classification

    of terms and formulas, logo recognition, drug design, support of automatic theorem

    provers, . . . [19,34,35]. Moreover, they can be related to finite tree automata in analogy

    to the correlation of recurrent networks and finite automata [18].

    Holographic Reduced Representation

    Holographic reduced representation (HRR) is identical to RAAM with a fixed encod-

    ing and decoding: a priori chosen functions given by so-called circular correlation or

convolution, respectively [29]. Correlation (denoted by ⊙) and convolution (denoted by ⊗) constitute a specific way to relate two vectors to a vector of the same dimension such that correlation and convolution are approximately inverse to each other, i.e., a ⊙ (a ⊗ b) ≈ b. Hence one can encode a tree a(x1, x2) via computing the convolution of each entry with a specific vector indicating the role of the component and adding these three vectors: r0 ⊗ a + r1 ⊗ x1 + r2 ⊗ x2, with r0, r1, r2 being the roles. The single entries can be


approximately restored via correlation: r1 ⊙ (r0 ⊗ a + r1 ⊗ x1 + r2 ⊗ x2) ≈ x1. One can

    compute the deviation in the above equation under statistical assumptions. Commonly,

    the restored values are accurate provided the dimension of the vectors is sufficiently

    high, the height of the trees is limited, and the vectors are additionally cleaned up in an

    associative memory. It follows immediately from our above argumentation that these

three conditions are necessary: decoding is a difficult task which, for standard computations, requires exponentially increasing resources. HRR is used in the literature for stor-

    ing and recognizing language [29]. Since encoding and decoding are fixed, no further

    investigation of the approximation or generalization ability is necessary.
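For illustration, circular convolution and correlation can be computed via the fast Fourier transform; the following NumPy sketch (the dimension, the role vectors r0, r1, r2, and the fillers are illustrative choices) encodes a tree a(x1, x2) as above and approximately restores x1:

```python
# Circular convolution/correlation for HRR via the FFT; dimension, roles and
# fillers are illustrative.
import numpy as np

def cconv(a, b):    # circular convolution
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):    # circular correlation, the approximate inverse of convolution
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

n = 2048                                          # a high dimension keeps the noise small
rng = np.random.default_rng(0)
r0, r1, r2 = (rng.normal(0.0, 1.0 / np.sqrt(n), n) for _ in range(3))   # role vectors
a, x1, x2 = (rng.normal(0.0, 1.0 / np.sqrt(n), n) for _ in range(3))    # fillers

code = cconv(r0, a) + cconv(r1, x1) + cconv(r2, x2)   # encode the tree a(x1, x2)
x1_restored = ccorr(r1, code)                         # approximately x1 plus cross-talk

cos = np.dot(x1_restored, x1) / (np.linalg.norm(x1_restored) * np.linalg.norm(x1))
print(cos)   # well above the chance level of about 1/sqrt(n), but clearly below 1
```

The restored vector is clearly correlated with x1 but far from exact, which illustrates why a high dimension, limited tree height, and an associative clean-up memory are needed in practice.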

    5 Conclusions

    Combinations of symbolic and connectionistic systems, more precisely, connectionistic

    systems processing symbolic data have been investigated. A particular difficulty con-

sists in the fact that the informational content of symbolic data is not limited a priori. Hence a priori unlimited length is to be mapped to a connectionistic vector represen-

    tation. We have focused on recurrent systems which map the unlimited length to a

    priori unlimited processing time. Simple recurrent neural networks constitute a well

    established model. Apart from the simplicity of the data they process, sequences, the

    main theoretical properties are the same as for advanced mechanisms. One can inves-

    tigate algorithmic or statistical aspects of learning, the first ones being induced by the

    nature of the data, the second ones by the nature of the connectionistic system. We

covered algorithmic aspects mainly in comparison to standard mechanisms. The enormous capacity of recurrent networks has become apparent, although it is of merely theoretical interest. Concerning statistical learning theory, satisfactory results for the universal

    approximation capability and the generalization ability have been established, although

    generalization can only be guaranteed if specifics of the data are taken into account.

    The idea of coding leads to an immediate generalization to tree structured data.

    Well established approaches like RAAM, HRR, and folding networks fall within this

    general definition. The theory established for recurrent networks can be generalized to

    these advanced approaches immediately. The in-principle statistical learnability of these

    mechanisms follows. However, some specific situations might be extremely difficult:

Decoding requires an increasing amount of resources. Hence the RAAM is applicable to small data sets only and decoding in HRR requires an additional clean-up, whereas folding networks can be found in real-world applications.

    Nevertheless, the results are encouraging since they prove the possibility to process

    symbolic data with neural networks and constitute a theoretical foundation for the suc-

    cess of some of the above mentioned methods. Unfortunately, the general approaches

neither generalize to cyclic structures like graphs, nor do they provide the biological plausibility that could help explain human recognition of such data. For both aspects fully dynamic approaches would be more promising, although it would be more

    difficult to find effective training algorithms for practical applications.


    References

    1. Y. Bengio and P. Frasconi. Credit assignment through time: Alternatives to backpropaga-

    tion. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information

    Processing Systems, Volume 5. Morgan Kaufmann, 1994.

    2. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient de-

    scent is difficult. IEEE Transactions on Neural Networks, 5(2), 1994.

3. L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer,

    1998.

4. B. DasGupta and B. Hammer. On approximate learning by multi-layered feedforward
circuits. In: H. Arimura, S. Jain, A. Sharma (eds.), Algorithmic Learning Theory 2000,

    Springer, 2000.

    5. M. W. Craven and J. W. Shavlik. Using sampling and queries to extract rules from trained

    neural networks. In: Proceedings of the Eleventh International Conference on Machine

    Learning, Morgan Kaufmann, 1994.

    6. J. L. Elman. Finding structure in time. Cognitive Science, 14, 1990.

7. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 1998.

    8. C. L. Giles, G. M. Kuhn, and R. J. Williams. Special issue on dynamic recurrent neural

    networks. IEEE Transactions on Neural Networks, 5(2), 1994.

    9. M. Gori, M. Mozer, A. C. Tsoi, and R. L. Watrous. Special issue on recurrent neural networks

    for sequence processing. Neurocomputing, 15(3-4), 1997.

    10. B. Hammer. Approximation and generalization issues of recurrent networks dealing with

structured data. In: P. Frasconi, M. Gori, F. Kurfess, and A. Sperduti, editors, Proceedings of the ECAI

    workshop on Foundations of connectionist-symbolic integration: representation, paradigms,

    and algorithms, 2000.

    11. B. Hammer. Learning with recurrent neural networks. Lecture Notes in Control and Infor-

    mation Sciences 254, Springer, 2000.

    12. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8),

    1997.

    13. S. Holldobler, Y. Kalinke, and H. Lehmann. Designing a Counter: Another case study of Dy-

    namics and Activation Landscapes in Recurrent Networks. In G. Brewka and C. Habel and

    B. Nebel (eds.): KI97: Advances in Artificial Intelligence, Proceedings of the 21st German

    Conference on Artificial Intelligence, LNAI 1303, Springer, 1997.

    14. J.J. Hopfield and D.W. Tank. Neural computation of decisions in optimization problems.

    Biological Cybernetics, 52, 1985.

    15. J.E. Hummel and K.L. Holyoak. Distributed representation of structure: a theory of analog-

    ical access and mapping. Psychological Review, 104, 1997.

    16. J. Kilian and H. T. Siegelmann. The dynamic universality of sigmoidal neural networks.

    Information and Computation, 128, 1996.

    17. P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. Journal of

    Computer and System Sciences, 54, 1997.

    18. A. Kuchler. On the correspondence between neural folding architectures and tree automata.

Technical report, University of Ulm, 1998.
19. A. Kuchler and C. Goller. Inductive learning in symbolic domains using structure-driven neural

    networks. In G. Gorz and S. Holldobler, editors, KI-96: Advances in Artificial Intelligence.

    Springer, 1996.

    20. S. Lawrence, C.L. Giles, and S. Fong. Can recurrent neural networks learn natural language

grammars? In: International Conference on Neural Networks, IEEE Press, 1996.


    21. W. Maass and P. Orponen. On the effect of analog noise in discrete-time analog computation.

    Neural Computation, 10(5), 1998.

    22. W. Maass and E. D. Sontag. Analog neural nets with Gaussian or other common noise

    distributions cannot recognize arbitrary regular languages. In M. C. Mozer, M. I. Jordan,

    and T. Petsche, editors, Advances in Neural Information Processing Systems, Volume 9. The

    MIT Press, 1998.

23. T. Masters. Neural, Novel, & Hybrid Algorithms for Time Series Prediction. Wiley, 1995.

24. T. Mitchell. Machine Learning. McGraw-Hill, 1997.

    25. M. Mozer. Neural net architectures for temporal sequence processing. In A. Weigend and

    N. Gershenfeld, editors, Predicting the future and understanding the past. Addison-Wesley,

    1993.

    26. C. W. Omlin and C. L. Giles. Extraction of rules from discrete-time recurrent neural net-

    works. Neural Networks, 9(1), 1996.

    27. C. Omlin and C. Giles. Constructing deterministic finite-state automata in recurrent neural

    networks. Journal of the ACM, 43(2), 1996.

    28. B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: A survey.

IEEE Transactions on Neural Networks, 6(5), 1995.
29. T. Plate. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3),

    1995.

    30. J. Pollack. Recursive distributed representation. Artificial Intelligence, 46, 1990.

    31. M. Reczko. Protein secondary structure prediction with partially recurrent neural networks.

    SAR and QSAR in environmental research, 1, 1993.

    32. M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation: The

    RPROP algorithm. In Proceedings of the Sixth International Conference on Neural Net-

    works. IEEE, 1993.

33. J. Schmidhuber. A fixed size storage O(n³) time complexity learning algorithm for fully

    recurrent continually running networks. Neural Computation, 4(2), 1992.

    34. T. Schmitt and C. Goller. Relating chemical structure to activity with the structure processing

    neural folding architecture. In Engineering Applications of Neural Networks, 1998.

    35. S. Schulz, A. Kuchler, and C. Goller. Some experiments on the applicability of folding

architectures to guide theorem proving. In Proceedings of the 10th International FLAIRS Conference, 1997.

    36. H. T. Siegelmann. The simple dynamics of super Turing theories. Theoretical Computer

    Science, 168, 1996.

    37. H. T. Siegelmann and E. D. Sontag. Analog computation, neural networks, and circuits.

    Theoretical Computer Science, 131, 1994.

    38. H. T. Siegelmann and E. D. Sontag. On the computational power of neural networks. Journal

    of Computer and System Sciences, 50, 1995.

39. L. Shastri. Advances in Shruti: a neurally motivated model of relational knowledge repre-

    sentation and rapid inference using temporal synchrony. Applied Intelligence, 11, 1999.

    40. A. Sperduti. Labeling RAAM. Connection Science, 6(4), 1994.

    41. J. Suykens, B. DeMoor, and J. Vandewalle. Static and dynamic stabilizing neural controllers

applicable to transition between equilibrium points. Neural Networks, 7(5), 1994.

42. L. Valiant. A theory of the learnable. Communications of the ACM, 27, 1984.
43. M. Vidyasagar. A Theory of Learning and Generalization. Springer, 1997.

    44. R. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and

    their computational complexity. In Y. Chauvin and D. Rumelhart, editors, Back-propagation:

    Theory, Architectures and Applications. Erlbaum, 1992.