
Supervised Neural Networks for the Classification of Structures

Alessandro Sperduti, Antonina Starita
University of Pisa, Dipartimento di Informatica
Corso Italia 40, 56125 Pisa, Italy
E-mail: [email protected]

Abstract

Up to now, neural networks have been used for the classification of unstructured patterns and sequences. Dealing with complex structures, however, standard neural networks, as well as statistical methods, are usually believed to be inadequate because of their feature-based approach. In fact, feature-based approaches usually fail to give satisfactory solutions because of their sensitivity to the a priori selected features and their inability to represent any specific information on the relationships among the components of the structures. On the contrary, we show that neural networks can represent and classify structured patterns. The key idea underpinning our approach is the use of the so-called "complex recursive neuron". A complex recursive neuron can be understood as a generalization to structures of a recurrent neuron. By using complex recursive neurons, essentially all the supervised networks developed for the classification of sequences, such as Back-Propagation Through Time networks, Real-Time Recurrent networks, Simple Recurrent Networks, Recurrent Cascade Correlation networks, and Neural Trees, can be generalized to structures. The results obtained by some of the above networks (with complex recursive neurons) on the classification of logic terms are presented.


1 Introduction

Structured domains are characterized by complex patterns usually represented as lists, trees, and graphs of variable sizes and complexity. The ability to recognize and classify these patterns is fundamental for several applications, such as medical and technical diagnosis (discovery and manipulation of structured dependencies, constraints, explanations), molecular biology and chemistry (classification of chemical structures, DNA analysis, quantitative structure-property relationship (QSPR), quantitative structure-activity relationship (QSAR)), automated reasoning (robust matching, manipulation of logical terms, proof plans, search space reduction), software engineering (quality testing, modularisation of software), geometrical and spatial reasoning (robotics, structured representation of objects in space, figure animation, layout of objects), speech and text processing (robust parsing, semantic disambiguation, organising and finding structure in texts and speech), and other applications that use, generate, or manipulate structures.

While neural networks are able to classify static information or temporal sequences, the current state of the art does not allow the efficient classification of structures of different sizes. Some advances on this issue have recently been presented in [SSG95], where preliminary work on the classification of logical terms, represented as labeled graphs, was reported. The aim, in that case, was the realization of a hybrid (symbolic/connectionist) theorem prover, where the connectionist part had to learn a heuristic for a specific domain in order to speed up the search for a proof. Other related techniques for dealing with structured patterns can be found, for example, in [Pol90, Spe94b, Spe95, Spe94a].

The aim of this report is to show, at least in principle, that neural networks can deal with structured domains. Some basic concepts, which we believe very useful for expanding the computational capabilities of neural networks to structured domains, are given. Specifically, we propose a generalization of the recurrent neuron, i.e., the complex recursive neuron, which is able to build a map from a domain of structures to the set of reals. This newly defined neuron allows the formalization of several supervised models for structures which stem very naturally from well-known models, such as Back-Propagation Through Time networks, Real-Time Recurrent networks, Simple Recurrent Networks, Recurrent Cascade Correlation networks, and Neural Trees.


The report is organized as follows. In Section 2 we introduce structured domains and some preliminary concepts on graphs and neural networks. The complex recursive neuron is defined in Section 3, where some related concepts are discussed. Several supervised learning algorithms, derived from standard and well-known learning techniques, are presented in Section 4. Simulation results for a subset of the proposed algorithms are reported in Section 5, and conclusions are drawn in Section 6.

2 Structured Domains

In this report, we deal with structured patterns which can be represented as labeled graphs. Some examples of these kinds of patterns are shown in Figures 1-4, where we have reported some typical examples from structured domains such as conceptual graphs representing medical concepts (Figure 1), chemical structures (Figure 2), bubble chamber events (Figure 3), and logical terms (Figure 4). All these structures can also be represented by using a feature-based approach. Feature-based approaches, however, have the drawbacks of being very sensitive to the features selected for the representation and incapable of representing any specific information about the relationships among components. Standard neural networks, as well as statistical methods, are usually bound to this kind of representation and are thus considered inadequate for dealing with structured domains. On the contrary, we will show that neural networks can represent and classify structured patterns.

2.1 Preliminaries

2.1.1 Graphs

We consider finite directed node-labeled graphs without multiple edges. For a set of labels $\Sigma$, a graph $X$ (over $\Sigma$) is specified by a finite set $V_X$ of nodes, a set $E_X$ of ordered couples of $V_X \times V_X$ (called the set of edges), and a function $\lambda_X$ from $V_X$ to $\Sigma$ (called the labeling function). Note that graphs may have loops, and labels are not restricted to be binary. Specifically, labels may also be real-valued vectors. A graph $X'$ (over $\Sigma$) is a subgraph of $X$ if $\lambda_{X'} = \lambda_X$, $V_{X'} \subseteq V_X$, and $E_{X'} \subseteq E_X$.


Figure 1: Example of conceptual graph encoding a medical concept. The graph representation uses the concept nodes HUMAN_PROC: statement, ACTOR: physician, TEST_PROC: fibroscopy, BODY_PART: larynx, DISEASE: paralysis, BODY_PART: vocal_cord, and REGION: left; the corresponding linear representation is:

[HUMAN_PROCESS: statement]
-> (AGENT) -> [ACTOR: physician]
-> (ORIGIN) -> [TEST_PROC: fibroscopy]
-> (THEME) -> [BODY_PART: larynx]
-> (THEME) -> [DISEASE: paralysis]
-> (LOC) -> [BODY_PART: vocal_cord]
-> (PREC) -> [REGION: left]


Figure 2: Chemical structures represented as trees.

Figure 3: Bubble chamber events: (a) coded event; (b) corresponding tree representation.


Figure 4: Example of logic terms (f(a,g(X)) and f(X,f(a,g(X)))) represented as labeled directed acyclic graphs.

For a finite set $V$, $\#V$ denotes its cardinality. Given a graph $X$ and any node $x \in V_X$, the function $out\_degree_X(x)$ returns the number of edges leaving from $x$, i.e., $out\_degree_X(x) = \#\{(x,z) \mid (x,z) \in E_X \wedge z \in V_X\}$. Given a total order on the edges leaving from $x$, the node $y = out_X(x,j) \in V_X$ is the node pointed to by the $j$th pointer leaving from $x$. The valence of a graph $X$ is defined as $\max_{x \in V_X} \{out\_degree_X(x)\}$. A labeled directed acyclic graph (labeled DAG) is a graph, as defined above, without loops. A node $s \in V_X$ is called a supersource for $X$ if every node in $X$ can be reached by a path starting from $s$. The root of a tree (which is a special case of directed graph) is always the (unique) supersource of the tree.

2.1.2 Structured Domain, Target Function, and Training Set

We define a structured domain $D$ (over $\Sigma$) as any (possibly infinite) set of graphs (over $\Sigma$). The valence of a domain $D$ is defined as the maximum among the valences of the graphs belonging to $D$. Since we are dealing with learning, we need to define the target function we want to learn. In approximation tasks, a target function $\tau()$ over $D$ is defined as any function $\tau : D \rightarrow \Re^k$, where $k$ is the output dimension, while in (binary) classification tasks we have $\tau : D \rightarrow \{0,1\}$ (or $\tau : D \rightarrow \{-1,1\}$). A training set $T$ over a domain $D$ is defined as a set of couples $(X, \tau(X))$, where $X \in U \subseteq D$ and $\tau()$ is a target function defined over $D$.
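To make these definitions concrete, the following minimal Python sketch (the class and method names are ours, not from the report) stores a labeled directed graph with ordered outgoing edges and computes the out-degree, the valence, and a supersource check:

class LabeledGraph:
    """A finite directed node-labeled graph with a total order on outgoing edges."""
    def __init__(self):
        self.labels = {}    # node id -> label (e.g., a real-valued vector)
        self.children = {}  # node id -> ordered list of pointed nodes

    def add_node(self, v, label):
        self.labels[v] = label
        self.children.setdefault(v, [])

    def add_edge(self, x, z):
        # the position in the list is the pointer index j of out_X(x, j)
        self.children[x].append(z)

    def out_degree(self, x):
        return len(self.children[x])

    def valence(self):
        return max(self.out_degree(x) for x in self.labels)

    def is_supersource(self, s):
        # s is a supersource if every node is reachable from s
        seen, stack = {s}, [s]
        while stack:
            for z in self.children[stack.pop()]:
                if z not in seen:
                    seen.add(z)
                    stack.append(z)
        return len(seen) == len(self.labels)

# Example: the term f(a, g(X)) as a labeled DAG with supersource "f"
g = LabeledGraph()
for v, lab in [("f", [1, 0, 0, 0]), ("a", [0, 1, 0, 0]),
               ("g", [0, 0, 1, 0]), ("X", [0, 0, 0, 1])]:
    g.add_node(v, lab)
g.add_edge("f", "a"); g.add_edge("f", "g"); g.add_edge("g", "X")
assert g.valence() == 2 and g.is_supersource("f")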


2.1.3 Standard and Recurrent Neurons

The output $o^{(s)}$ of a standard neuron is given by

    $o^{(s)} = f\left(\sum_i w_i I_i\right)$,    (1)

where $f()$ is some non-linear squashing function applied to the weighted sum of the inputs $I$ [1]. A recurrent neuron with a single self-recurrent connection, instead, computes its output $o^{(r)}(t)$ as follows:

    $o^{(r)}(t) = f\left(\sum_i w_i I_i(t) + w_s o^{(r)}(t-1)\right)$,    (2)

where $f()$ is applied to the weighted sum of the inputs $I$ plus the self-weight $w_s$ times the previous output. The above formula can be extended by considering several interconnected recurrent neurons and delayed versions of the outputs. For the sake of presentation, we skip these extensions.

[1] The threshold of the neuron is included in the weight vector by expanding the input vector with a component always set to 1.

3 The First-Order Complex Recursive Neuron

Using standard and recurrent neurons, it is very difficult to deal with complex structures. This is because the former was devised for the processing of unstructured patterns, while the latter can only naturally process sequences. Thus, neural networks using these kinds of neurons can face approximation and classification problems in structured domains only by using some complex and very unnatural encoding scheme which maps structures onto fixed-size unstructured patterns or sequences. We propose to solve this inadequacy of the present state of the art in neural networks by introducing the complex recursive neuron. The complex recursive neuron is an extension of the recurrent neuron where, instead of just considering the output of the unit at the previous time step, we have to consider the outputs of the unit for all the nodes pointed to by the current input node (see Figure 5).


Figure 5: Neuron models for different input domains. The standard neuron is suited for the processing of unstructured patterns (a single pattern), the recurrent neuron for the processing of sequences of patterns, and finally the proposed complex recursive neuron can deal very naturally with structured patterns (a tree of patterns).


The output $o^{(c)}(x)$ of the complex recursive neuron for a node $x$ of a graph $X$ is defined as

    $o^{(c)}(x) = f\left(\sum_{i=1}^{N_L} w_i l_i + \sum_{j=1}^{out\_degree_X(x)} \hat{w}_j \, o^{(c)}(out_X(x,j))\right)$,    (3)

where $N_L$ is the number of units encoding the label $l = \lambda_X(x)$ attached to the current input $x$, and the $\hat{w}_j$ are the weights on the recursive connections. Thus, the output of the neuron for a node $x$ is computed recursively from the outputs computed for all the nodes pointed to by it.

Note that, if the valence of the considered domain is $n$, then the complex recursive neuron will have $n$ recursive connections, even if not all of them will be used for computing the output of a node $x$ with $out\_degree_X(x) < n$.

When considering $N_c$ interconnected complex recursive neurons, eq. (3) becomes

    $o^{(c)}(x) = F\left(W l + \sum_{j=1}^{out\_degree_X(x)} \hat{W}_j \, o^{(c)}(out_X(x,j))\right)$,    (4)

where $F_i(v) = f(v_i)$, $l \in \Re^{N_L}$, $W \in \Re^{N_c \times N_L}$, $o^{(c)}(x), o^{(c)}(out_X(x,j)) \in \Re^{N_c}$, and $\hat{W}_j \in \Re^{N_c \times N_c}$.

In the following, we will refer to the output of a complex recursive neuron dropping the upper index.

3.1 Generation of Neural Representations for Graphs

To understand how complex recursive neurons can generate representations for directed graphs, let us consider a single complex recursive neuron $u$ and a single graph $X$. The following conditions must hold:

Number of connections: the complex recursive neuron $u$ must have as many recursive connections as the valence of the graph $X$;

Supersource: the graph $X$ must have a reference supersource.

Note that, if the graph $X$ does not have a supersource, it is always possible to define a convention for adding to the graph $X$ an extra node $s$ (with a minimal number of outgoing edges) such that $s$ is a supersource for the new graph.

If the above conditions are satisfied, we can adopt the convention that the graph $X$ is represented by $o(s)$, i.e., the output of $u$ for $s$.
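As an illustration of eq. (4) and of the representation convention above, here is a small Python sketch (function and variable names are ours; a tanh squashing function and random weights are assumed purely for illustration) that computes $o(s)$ for a DAG by recursing over the nodes pointed to by each node:

import numpy as np

def encode_dag(labels, children, s, W, W_hat, f=np.tanh):
    """Compute o(s), eq. (4), for a DAG given as:
       labels[x]   -> label vector of node x (length NL)
       children[x] -> ordered list of nodes pointed to by x
    W is (Nc, NL); W_hat is a list of (Nc, Nc) matrices, one per recursive connection."""
    memo = {}
    def o(x):
        if x not in memo:
            net = W @ np.asarray(labels[x], dtype=float)
            for j, child in enumerate(children[x]):
                net = net + W_hat[j] @ o(child)   # j-th recursive connection
            memo[x] = f(net)
        return memo[x]
    return o(s)

# Example: the term f(a, g(X)), supersource "f", valence 2, one-hot labels
labels = {"f": [1, 0, 0, 0], "a": [0, 1, 0, 0], "g": [0, 0, 1, 0], "X": [0, 0, 0, 1]}
children = {"f": ["a", "g"], "a": [], "g": ["X"], "X": []}
rng = np.random.default_rng(0)
Nc, NL = 3, 4
W = rng.normal(scale=0.1, size=(Nc, NL))
W_hat = [rng.normal(scale=0.1, size=(Nc, Nc)) for _ in range(2)]
print(encode_dag(labels, children, "f", W, W_hat))   # 3-dimensional representation of the graph

Because the recursion is memoized per node, shared subgraphs are encoded only once, which is exactly the saving exploited by the optimized training set of Section 3.2.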


Figure 6: The encoding network for an acyclic graph. The graph is represented by the output of the encoding network.

Due to the recursive nature of eq. (3), it follows that the neural representation for an acyclic graph is computed by a feedforward network (the encoding network) obtained by replicating the same complex recursive neuron $u$ and connecting these copies according to the topology of the structure (see Figure 6). If the structure contains cycles, the resulting encoding network is recurrent (see Figure 7) and the neural representation is considered to be well-formed only if $o(s)$ converges to a stationary value.

The encoding network fully describes how the representation for the structure is computed, and it will be used in the following to derive the learning rules for the complex recursive connections.

When considering a structured domain, the number of recursive connections of $u$ must be equal to the valence of the domain. The extension to a set of $N_c$ complex recursive neurons is trivial: if the valence of the domain is $k$, each complex recursive neuron will have $k$ groups of $N_c$ recursive connections each.

3.2 Optimized Training Set

When considering DAGs, the training set can be organized so as to improve the computational efficiency of both the reduced representations and the learning rules. In fact, given a training set $T$ of DAGs, if there are graphs $X_1, X_2 \in T$ which share a common subgraph $X$, then we need to explicitly represent $X$ only once.


Figure 7: The encoding network for a cyclic graph. In this case, the encoding network is recurrent and the graph is represented by the output of the encoding network at a fixed point of the network dynamics.

The optimization of the training set can be performed in two stages (see Fig. 8 for an example):

1. all the DAGs in the training set are merged into a single minimal DAG, i.e., a DAG with a minimal number of nodes;

2. a topological sort of the nodes of the minimal DAG is performed to determine the updating order of the nodes for the network.

The two stages can be done in linear time with respect to the size of all the DAGs and the size of the minimal DAG, respectively [2]. Specifically, stage 1 can be done by removing all the duplicated subgraphs through a special subgraph-indexing mechanism (which can be implemented in linear time). The advantage of having a sorted training set is due to the fact that all the reduced representations (and also their derivatives with respect to the weights, as we will see when considering learning) can be computed by a single ordered scan of the training set; a sketch of the two stages is given below.

[2] This analysis is due to Christoph Goller.
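The following Python sketch (our own illustration; the report does not prescribe an implementation) merges a set of trees into a minimal DAG by indexing each node on its label and the identities of its children, and then produces the topological updating order:

def merge_into_minimal_dag(trees):
    """trees: list of nested tuples (label, child, child, ...).
    Returns (labels, children, roots) of the minimal DAG."""
    index = {}            # (label, child ids) -> node id
    labels, children = {}, {}

    def insert(t):
        label, subtrees = t[0], t[1:]
        key = (label, tuple(insert(s) for s in subtrees))
        if key not in index:                    # subgraph seen for the first time
            node = len(index)
            index[key] = node
            labels[node] = label
            children[node] = list(key[1])
        return index[key]

    roots = [insert(t) for t in trees]
    return labels, children, roots

def topological_order(labels, children):
    """Order in which node outputs can be computed by a single scan:
    pointed nodes always come before the nodes pointing to them."""
    order, seen = [], set()
    def visit(v):
        if v not in seen:
            seen.add(v)
            for c in children[v]:
                visit(c)
            order.append(v)
    for v in labels:
        visit(v)
    return order

# Example: f(b, a) and g(f(b, a)) share the subterm f(b, a)
trees = [("f", ("b",), ("a",)), ("g", ("f", ("b",), ("a",)))]
labels, children, roots = merge_into_minimal_dag(trees)
print(len(labels), topological_order(labels, children))   # 4 nodes instead of 5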


Figure 8: Optimization of the training set: the set of structures (in this case, trees) is transformed into the corresponding minimal DAG, which is then used to generate the sorted training set. The sorted training set is then transformed into a set of sorted vectors using the numeric codes for the labels and used as the training set for the network.

3.3 Well-formedness of Neural Representations

When considering cyclic graphs, to guarantee that each encoded graph gets a proper representation through the encoding network, we have to guarantee that for every initial state the trajectory of the encoding network converges to an equilibrium; otherwise it would be impossible to perform the processing of a nonstationary representation. This is particularly important when considering cyclic graphs. In fact, acyclic graphs are guaranteed to get a convergent representation because the resulting encoding network is feedforward, while cyclic graphs are encoded by using a recurrent network. Consequently, the well-formedness of representations can be obtained by defining conditions guaranteeing the convergence of the encoding network. In this regard, it is well known that if the weight matrix is symmetric, an additive network with first-order connections possesses a Liapunov function and is convergent ([Hop84, CG83]). Moreover, Almeida [Alm87] proved a condition more general than symmetry of the weight matrix, i.e., a system satisfying detailed balance

    $w_{ij} f(net_j) = w_{ji} f(net_i)$    (5)

is guaranteed to possess a Liapunov function as well. On the other hand, if the norm of the weight matrix (not necessarily symmetric) is sufficiently small,


e.g., satisfying

    $\sum_i \sum_j w_{ij}^2 < \frac{1}{\max_i f'(net_i)}$,    (6)

the network dynamics can be shown to go to a unique equilibrium for a given input (Atiya [Ati88]).

The above results are sufficient when the encoding network is static; however, we will see that in several cases, e.g., in classification tasks, the encoding network changes with learning. In these cases, the results can be exploited to define the initial weight matrix, but there is no guarantee that learning will preserve the stability of the encoding network.

4 Supervised Models

In this section, we discuss how several standard supervised algorithms for neural networks can be extended to structures.

4.1 Back-propagation Through Structure

The task addressed by back-propagation through time networks is to produce particular output sequences in response to specific input sequences. These networks are, in general, fully recurrent, in which any unit may be connected to any other. Because of that, they cannot be trained by using plain back-propagation. A trick, however, can be used to turn an arbitrary recurrent network into an equivalent feedforward network when the input sequences have a maximum length $T$. In this case, all the units of the network can be duplicated $T$ times (unfolding in time), so that the state of a unit in the recurrent network at time $\tau$ is held by the $\tau$th copy of the same unit in the feedforward network. By preserving the same weight values through the layers of the feedforward network, it is not difficult to see that the two networks will behave identically for $T$ time steps. The feedforward network can be trained by back-propagation, taking care to preserve the identity constraint between weights of different layers. This can be guaranteed by adding together the individual gradient contributions of corresponding copies of the same weight and then changing all copies by the total amount.

Back-propagation Through Time can be extended without difficulty to structures. The basic idea is to use complex recursive neurons for encoding


the structures; the obtained representations are then classified, or used to approximate an unknown function, by a standard feedforward network. An example of this kind of network for classification is given in Figure 9.

Figure 9: A possible architecture for the classification of structures: the complex recursive neurons generate the neural representations of the structures, which are then classified by a standard feedforward network.

Given an input graph $X$, the network output $o(X)$ can be expressed as the composition of the encoding function $\Phi()$ and the classification (or approximation) function $\Psi()$ (see Figure 10):

    $o(X) = \Psi(\Phi(X))$.    (7)

Learning for the set of weights $W_\Psi$ can be implemented by plain back-propagation on the feedforward network realizing $\Psi()$:

    $\Delta W_\Psi = -\eta \, \frac{\partial Error(\Psi(y))}{\partial W_\Psi}$,    (8)

where $y = \Phi(X)$ is the input to the feedforward network.


Figure 10: The functional scheme of the proposed network. The Encoder and the Classifier are considered as two distinct entities which exchange information: the Encoder forwards the neural representations of the structures to the Classifier; in turn, the Classifier returns to the Encoder the deltas which are then used by the Encoder to adapt its weights.


Learning for the set of weights $W$ realizing $\Phi()$ can be implemented by

    $\Delta W = -\eta \, \frac{\partial Error(\Psi(y))}{\partial y} \frac{\partial y}{\partial W}$,    (9)

where the first term represents the error coming from the feedforward network and the second one represents the error due to the encoding network.

How the second term is actually computed depends on the family of structures in the training set. If the training set is composed of DAGs, then plain back-propagation can be used on the encoding network. Otherwise, i.e., if there are graphs with cycles, then recurrent back-propagation [Pin88] must be used. Consequently, we treat these two cases separately.

4.1.1 Case I: DAGs

This case has been treated by Goller and Küchler in [GK95]. Since the training set contains only DAGs, the computation of $\partial y / \partial W$ can be realized by backpropagating the error from the feedforward network through the encoding network of each structure. As in back-propagation through time, the gradient contributions of corresponding copies of the same weight are collected for each structure. The total amount is then used to change all the copies of the same weight. If learning is performed by structure, the weights are updated after the presentation of each single structure; otherwise, the gradient contributions are collected over the whole training set and the weights are changed after all the structures in the training set have been presented to the network. A sketch of this procedure is given below.
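As a concrete illustration of Section 4.1.1, the following Python sketch (our own; it assumes a tanh squashing function and a squared error on the representation of the supersource, which the report does not fix) performs one back-propagation-through-structure pass on a single DAG and sums the gradient contributions of all copies of the shared weights:

import numpy as np

def bpts_gradients(labels, children, s, W, W_hat, target):
    """One back-propagation-through-structure pass for a single DAG: squared error
    between o(s) and a target vector, f = tanh.  The gradient contributions of all
    copies of the shared weights W and W_hat[j] are added together."""
    # forward pass in topological order (pointed nodes come first)
    order, seen = [], set()
    def visit(x):
        if x in seen:
            return
        seen.add(x)
        for c in children[x]:
            visit(c)
        order.append(x)
    visit(s)

    out = {}
    for x in order:
        net = W @ np.asarray(labels[x], float)
        for j, c in enumerate(children[x]):
            net = net + W_hat[j] @ out[c]
        out[x] = np.tanh(net)

    # backward pass in reverse topological order, accumulating dL/d o(x)
    dW = np.zeros_like(W)
    dW_hat = [np.zeros_like(M) for M in W_hat]
    d_out = {x: np.zeros_like(out[x]) for x in order}
    d_out[s] = out[s] - np.asarray(target, float)   # L = 0.5 * ||o(s) - target||^2
    for x in reversed(order):
        d_net = d_out[x] * (1.0 - out[x] ** 2)      # tanh'
        dW += np.outer(d_net, np.asarray(labels[x], float))
        for j, c in enumerate(children[x]):
            dW_hat[j] += np.outer(d_net, out[c])    # contribution of this copy of W_hat[j]
            d_out[c] += W_hat[j].T @ d_net          # error sent to the pointed node
    return out[s], dW, dW_hat

# usage, with labels/children/W/W_hat as in the earlier encoding sketch:
# y, gW, gWhat = bpts_gradients(labels, children, "f", W, W_hat, target=np.zeros(3))
# a gradient step would then be W -= lr * gW and W_hat[j] -= lr * gWhat[j]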


4.1.2 Case II: Cyclic Graphs

When considering cyclic graphs, $\partial y / \partial W$ can be computed only by resorting to recurrent back-propagation. In fact, if the input structure contains cycles, then the resulting encoding network is cyclic.

In the standard formulation, a recurrent back-propagation network $N$ with $n$ units is defined as

    $o^{(rbp)}(t+1) = F(W^{(rbp)} o^{(rbp)}(t) + I^{(rbp)})$,    (10)

where $I^{(rbp)} \in \Re^n$ is the input vector for the network and $W^{(rbp)} \in \Re^{n \times n}$ the weight matrix. The learning rule for a weight of the network is given by

    $\Delta w^{(rbp)}_{rs} = \eta \, o^{(rbp)}_s o'^{(rbp)}_r \sum_k e_k (L^{-1})_{kr}$,    (11)

where all the quantities are taken at a fixed point of the recurrent network, $e_k$ is the error between the current output of unit $k$ and the desired one, $L_{ji} = \delta_{ji} - o'^{(rbp)}_j w_{ji}$ ($\delta_{ji}$ is a Kronecker delta), and the quantity

    $y^{(rbp)}_j = \sum_k e_k (L^{-1})_{kj} = \sum_i o'^{(rbp)}_i w_{ij} y^{(rbp)}_i + e_j$    (12)

can be computed by relaxing the adjoint network $N'$, i.e., a network obtained from $N$ by reversing the direction of the connections. The weight $w_{ji}$ from neuron $i$ to neuron $j$ in $N$ is replaced by $o'^{(rbp)}_i w_{ij}$ from neuron $j$ to neuron $i$ in $N'$. The activation functions of $N'$ are linear, and the output units of $N$ become input units of $N'$ with $e_i$ as input.

Given a cyclic graph $X$, let $m = \#V_X$, let $N_c$ be the number of complex recursive neurons, and let $o_i(t)$ be the output at time $t$ of these neurons for $x_i \in V_X$. Then we can define

    $o(t) = [o_1(t), o_2(t), \ldots, o_m(t)]^T$    (13)

as the global state vector for our recurrent network, where by convention we impose $o_1(t)$ to be the output of the neurons representing the supersource of $X$. To account for the labels, we have to slightly modify eq. (10):

    $o(t+1) = F(\hat{W}_X o(t) + W_X I_X)$,    (14)

where

    $I_X = [l_1, \ldots, l_m]^T$    (15)


with $l_i = \lambda_X(x_i)$, $i = 1, \ldots, m$,

    $W_X = \mathrm{diag}(\underbrace{W, \ldots, W}_{m \text{ repetitions of } W})$,    (16)

and $\hat{W}_X \in \Re^{m N_c \times m N_c}$ is defined according to the topology of the graph. In this context, the input of the adjoint network is $e_k = \left(\frac{\partial Error(\Psi(y))}{\partial y}\right)_k$, and the learning rules become

    $\Delta \hat{w}_{rs} = \eta \, o_s o'_r \sum_k e_k (L^{-1})_{kr}$,    (17)

    $\Delta w_{rs} = \eta \, I_s o'_r \sum_k e_k (L^{-1})_{kr}$.    (18)

To enforce the constraint on the weights, all the changes referring to the same weight are added together and then all copies of the same weight are changed by the total amount. Note that each structure gives rise to a new adjoint network and independently contributes to the variations of the weights. Moreover, the above formulation is more general than the one previously discussed, and it can also be used for DAGs, in which case the adjoint network becomes a feedforward network implementing the back-propagation of the errors.

4.2 Extension of Real-Time Recurrent Learning

The extension of Real-Time Recurrent Learning [WZ89] to complex recursive neurons does not present particular problems when considering graphs without cycles. Cyclic graphs, however, present problems and in general lead to a training algorithm that can be considered only loosely real-time. In the following, we will discuss only acyclic graphs. In order to be concise, we will only show how the derivatives can be computed in real time, leaving to the reader the development of the learning rules according to the chosen network architecture and error function.

Let $N_c$ be the number of complex recursive neurons, $s$ the supersource of $X$, and

    $y = o(s) = F\left(W l_s + \sum_{j=1}^{out\_degree_X(s)} \hat{W}_j \, o(out_X(s,j))\right)$    (19)

the representation of $X$ according to the encoding network, where

    $\mathcal{W} = [W, \hat{W}_1, \ldots, \hat{W}_n]$    (20)

and $n$ is the valence of the domain.


The derivatives of $y$ with respect to $W$ and $\hat{W}_i$ ($i \in [1, \ldots, n]$) can be computed from eq. (19):

    $\frac{\partial y_t}{\partial W_{tk}} = o'_t(s) \left( (l_s)_k + \sum_{j=1}^{out\_degree_X(s)} (\hat{W}_j)_t \frac{\partial o(out_X(s,j))}{\partial W_{tk}} \right)$,    (21)

where $t = 1, \ldots, N_c$, $k = 0, \ldots, N_L$, and $(\hat{W}_j)_t$ is the $t$th row of $\hat{W}_j$;

    $\frac{\partial y_t}{\partial (\hat{W}_i)_{tq}} = o'_t(s) \left( o_q(out_X(s,i)) + \sum_{j=1}^{out\_degree_X(s)} (\hat{W}_j)_t \frac{\partial o(out_X(s,j))}{\partial (\hat{W}_i)_{tq}} \right)$,    (22)

where $t = 1, \ldots, N_c$ and $q = 1, \ldots, N_c$.

These equations are recursive on the structure $X$ and can be computed by noting that if $v$ is such that $out\_degree_X(v) = 0$, then

    $\frac{\partial o_t(v)}{\partial W_{tk}} = o'_t(v) (l_v)_k$,  and  $\frac{\partial o_t(v)}{\partial (\hat{W}_i)_{tq}} = 0$.    (23)

This allows the computation of the derivatives in real time, alongside the computation of the reduced descriptors for the graphs; a sketch of this forward accumulation is given below.
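The recursions (21)-(23) can be implemented as a forward accumulation that carries, for every node, the derivatives of its representation together with the representation itself. The following Python sketch illustrates this (the tensor bookkeeping and all names are ours; f = tanh is assumed):

import numpy as np

def rtrl_forward(labels, children, s, W, W_hat):
    """Forward pass that also accumulates, in real time, the derivatives of every
    node representation with respect to W and to the recursive matrices W_hat[i]."""
    Nc, NL = W.shape
    n = len(W_hat)
    o, dW, dWh = {}, {}, {}   # per node: output, d o(x)/d W, [d o(x)/d W_hat[i]]

    def visit(x):
        if x in o:
            return
        for c in children[x]:
            visit(c)
        l = np.asarray(labels[x], float)
        net = W @ l
        dnet_W = np.einsum('qt,k->qtk', np.eye(Nc), l)             # direct term: delta_qt * l_k
        dnet_Wh = [np.zeros((Nc, Nc, Nc)) for _ in range(n)]
        for j, c in enumerate(children[x]):
            net = net + W_hat[j] @ o[c]
            dnet_W += np.einsum('qp,ptk->qtk', W_hat[j], dW[c])    # chain through child j
            for i in range(n):
                dnet_Wh[i] += np.einsum('qp,ptr->qtr', W_hat[j], dWh[c][i])
            dnet_Wh[j] += np.einsum('qt,p->qtp', np.eye(Nc), o[c]) # direct term o(out_X(x, j))
        out = np.tanh(net)
        fp = 1.0 - out ** 2                                        # tanh'
        o[x] = out
        dW[x] = fp[:, None, None] * dnet_W
        dWh[x] = [fp[:, None, None] * M for M in dnet_Wh]

    visit(s)
    return o[s], dW[s], dWh[s]

# usage, with labels/children/W/W_hat as in the earlier encoding sketch:
# y, dy_dW, dy_dWhat = rtrl_forward(labels, children, "f", W, W_hat)
# dy_dW[q, t, k] holds d y_q / d W_{tk}; dy_dWhat[i][q, t, p] holds d y_q / d (W_hat[i])_{tp}

Note that each node carries on the order of $N_c^2 (N_L + n N_c)$ derivative entries, which is the price paid for computing the derivatives alongside the representations.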


4.3 LRAAM-based Networks and Simple Recurrent Networks

In this section, we present a class of networks which are based on the LRAAM model [SSG95]. Learning in these networks is implemented via the combination of a supervised procedure with an unsupervised one and, since it is based on truncated gradients, it is suited for both acyclic and cyclic graphs. We will see that this class of networks contains, as a special case, a network which turns out to be the extension of a Simple Recurrent Network [Elm90] to structures.

In this type of network (see Figure 11), the first part of the network is constituted by an LRAAM (note the double arrows on the connections), whose task is to devise a compressed representation for each structure. This compressed representation is obtained by using the standard learning algorithm for LRAAM [SSG94]. The classification task is then performed in the second part of the network through a multi-layer feed-forward network with one or more hidden layers (network A) or a simple sigmoidal neuron (network B).

Figure 11: LRAAM-based networks for the classification of structures.

Several options for the training of networks A and B are available. The different options can be characterized by the proportion of the two learning rates (for the classifier and the LRAAM) and by the different degrees x and y of presence of the following two basic features:

- the training of the classifier is not started until x percent of the training set is correctly encoded and successively decoded by the LRAAM;

- the error coming from the classifier is backpropagated across y levels of the structures encoded by the LRAAM [3].

Note that, even if the training of the classifier is started only when all the structures in the training set are properly encoded and decoded, the classifier's error can still change the compressed representations, which, however, are kept consistent [4] by learning in the LRAAM.

The reason for allowing different degrees of interaction between the classification and the representation tasks is the necessity of having different degrees of adaptation of the compressed representations to the requirements of the classification task.

[3] The backpropagation of the error across several levels of the structures can be implemented by unfolding the encoder of the LRAAM (the set of weights from the input to the hidden layer) according to the topology of the structures.

[4] A consistent compressed representation is a representation of a structure which contains all the information sufficient for the reconstruction of the whole structure.


Figure 12: The network B, with x = 0 and y = 1 (right side), can be considered as a generalization of the Simple Recurrent Network by Elman (left side).

If no interaction at all is allowed, i.e., the LRAAM is trained first and then its weights are frozen (y = 0), the compressed representations will be such that similar representations correspond to similar structures; if full interaction is allowed, i.e., the LRAAM and the classifier are trained simultaneously, the compressed representations will be such that structures in the same class get very similar representations [5]. On the other hand, when considering DAGs, by setting y = max_depth, where max_depth is the number of nodes traversed by the longest path in the structures, the classifier error will be backpropagated across the whole structure, thus implementing the back-propagation through structure defined in Section 4.1.

It is interesting to note that the SRN by Elman can be obtained as a special case of network B. In fact, when considering network B (with x = 0 and y = 1) for the classification of lists (sequences), the same architecture is obtained, with the difference that there are connections from the hidden layer of the LRAAM back to the input layer [6], i.e., the decoding part of the LRAAM. Thus, when considering lists, the only difference between an SRN and network B is the unsupervised learning performed by the LRAAM. However, when forcing the learning parameters of the LRAAM to be null, we obtain the same learning algorithm as in an SRN. Consequently, we can claim that the SRN is a special case of network B.

[5] Moreover, in this case there is no guarantee that the LRAAM will be able to encode and decode consistently all the structures in the training set, since the training is stopped when the classification task is performed correctly.

[6] The output layer of the LRAAM can be considered the same as the input layer.


This can be better understood by looking at the right side of Figure 12, where we have represented network B in terms of the elements of an SRN. Of course, the copy function for network B is not as simple as the one used in an SRN, since the right relationships among the components of the structures to be classified must be preserved [7].

4.4 Cascade-Correlation for Structures

The Cascade-Correlation algorithm [FL90] grows a standard neural network using an incremental approach for the classification of unstructured patterns. The starting network $N_0$ is a network without hidden nodes trained with a Least Mean Square algorithm; if network $N_0$ is not able to solve the problem, a hidden unit $u_1$ is added such that the correlation between the output of the unit and the residual error of network $N_0$ is maximized [8]. The weights of $u_1$ are then frozen and the remaining weights are retrained. If the obtained network $N_1$ cannot solve the problem, the network is further grown by adding new hidden units, which are connected (with frozen weights) to all the inputs and to the previously installed hidden units. The resulting network is a cascade of nodes. Fahlman extended the algorithm to the classification of sequences, obtaining good results [Fah91]. In the following, we show that Cascade-Correlation can be further extended to structures by using complex recursive neurons. For the sake of simplicity, we discuss the case of acyclic graphs, leaving to the reader the extension to cyclic graphs (see Section 4.1, cyclic graphs).

The output of the $k$th hidden unit, in our framework, is computed as

    $o^{(k)}(x) = f\left(\sum_{i=1}^{N_L} w^{(k)}_i l_i + \sum_{v=1}^{k} \sum_{j=1}^{out\_degree_X(x)} w^{(k)}_{(v,j)} o^{(v)}(out_X(x,j)) + \sum_{q=1}^{k-1} \bar{w}^{(k)}_q o^{(q)}(x)\right)$,    (24)

where $w^{(k)}_{(v,j)}$ is the weight of the $k$th hidden unit associated with the output of the $v$th hidden unit computed on the $j$th component pointed to by $x$, and $\bar{w}^{(k)}_q$ is the weight of the connection from the $q$th (frozen) hidden unit, $q < k$, to the $k$th hidden unit.

[7] The copy function needs a stack for the memorization of compressed representations. The control signals for the stack are defined by the encoding-decoding task.

[8] Since the maximization of the correlation is obtained using a gradient ascent technique on a surface with several maxima, a pool of hidden units is trained and the best one is selected.


The output of the output neuron $u^{(out)}$ is computed as

    $o^{(out)}(x) = f\left(\sum_{i=1}^{k} \tilde{w}_i o^{(i)}(x)\right)$,    (25)

where $\tilde{w}_i$ is the weight on the connection from the $i$th (frozen) hidden unit to the output unit.

Learning is performed as in standard Cascade-Correlation, with the difference that, according to equation (24), the derivatives of $o^{(k)}(x)$ with respect to the weights are now:

    $\frac{\partial o^{(k)}(x)}{\partial w^{(k)}_i} = \left(l_i + \sum_{j=1}^{out\_degree_X(x)} w^{(k)}_{(k,j)} \frac{\partial o^{(k)}(out_X(x,j))}{\partial w^{(k)}_i}\right) f'$    (26)

    $\frac{\partial o^{(k)}(x)}{\partial \bar{w}^{(k)}_q} = \left(o^{(q)}(x) + \sum_{j=1}^{out\_degree_X(x)} w^{(k)}_{(k,j)} \frac{\partial o^{(k)}(out_X(x,j))}{\partial \bar{w}^{(k)}_q}\right) f'$    (27)

    $\frac{\partial o^{(k)}(x)}{\partial w^{(k)}_{(v,t)}} = \left(o^{(v)}(out_X(x,t)) + \sum_{j=1}^{out\_degree_X(x)} w^{(k)}_{(k,j)} \frac{\partial o^{(k)}(out_X(x,j))}{\partial w^{(k)}_{(v,t)}}\right) f'$    (28)

where $i = 1, \ldots, N_L$, $q = 1, \ldots, (k-1)$, $v = 1, \ldots, k$, $t = 1, \ldots, out\_degree_X(x)$, and $f'$ is the derivative of $f()$. The above equations are recurrent on the structures and can be computed by observing that, for a leaf node $y$, equation (26) reduces to $\frac{\partial o^{(k)}(y)}{\partial w^{(k)}_i} = l_i f'$, and all the remaining derivatives are null. Consequently, we only need to store the output values of the unit and its derivatives for each component of a structure; a sketch of this recursion is given below. Figure 13 shows the evolution of a network with two pointer fields. Note that, if the hidden units have self-recurrent connections only, the matrix defined by the weights $w^{(k)}_{(v,j)}$ is not triangular but diagonal.
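To illustrate eqs. (24) and (26)-(28), here is a sketch of the recursive computation of the candidate unit's output and derivatives (the data layout and names are ours; f = tanh is assumed, and the outputs of the frozen units are assumed to be precomputed and stored per node, as suggested in the text):

import numpy as np

def candidate_unit(labels, children, x, w_label, w_frozen, w_rec, frozen_out, memo):
    """Output of the candidate hidden unit at node x (eq. (24)) together with its
    derivatives (eqs. (26)-(28)).
      w_label    : (NL,)     weights on the label units
      w_frozen   : (K,)      weights from the K frozen hidden units at node x
      w_rec      : (K+1, n)  recursive weights; row v < K refers to frozen unit v,
                             row K to the candidate itself; n = valence
      frozen_out : node -> (K,) stored outputs of the frozen units
      memo       : dict, reused for a fixed weight setting so that shared
                   subgraphs are visited only once."""
    if x in memo:
        return memo[x]
    K = len(w_frozen)
    net = float(np.dot(w_label, labels[x])) + float(np.dot(w_frozen, frozen_out[x]))
    d_label = np.array(labels[x], float)          # direct term l_i
    d_frozen = np.array(frozen_out[x], float)     # direct term o^(q)(x)
    d_rec = np.zeros(w_rec.shape)
    for j, c in enumerate(children[x]):
        o_c, dl_c, df_c, dr_c = candidate_unit(labels, children, c, w_label,
                                               w_frozen, w_rec, frozen_out, memo)
        net += np.dot(w_rec[:K, j], frozen_out[c]) + w_rec[K, j] * o_c
        d_rec[:K, j] += frozen_out[c]             # direct term o^(v)(out_X(x, j)), v < k
        d_rec[K, j] += o_c                        # direct term for the self-recursive weight
        # chain terms, weighted by w^(k)_(k, j) as in eqs. (26)-(28)
        d_label += w_rec[K, j] * dl_c
        d_frozen += w_rec[K, j] * df_c
        d_rec += w_rec[K, j] * dr_c
    o = np.tanh(net)
    fp = 1.0 - o ** 2                             # tanh'
    memo[x] = (o, fp * d_label, fp * d_frozen, fp * d_rec)
    return memo[x]

# during the correlation-maximization phase one would call, per structure:
# memo = {}; o_s, dl, df, dr = candidate_unit(labels, children, "f", ...)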


Figure 13: The evolution of a network with two pointer fields. The units in the label are fully connected with the hidden units and the output unit.

4.4.1 Computational Complexity

The algorithm implements a gradient technique (local search), and thus it is not possible to guess how many iterations are necessary for convergence. In the following, to get a feeling for the complexity, we suppose that each hidden unit is trained for a bounded number of iterations [9]. Let $P$ be the number of nodes in the minimal DAG. The number of free parameters (connections and threshold) $n_k$ for the $k$th hidden unit is

    $n_k = \underbrace{N_L}_{\text{label connections}} + \underbrace{N_V \frac{k(k+1)}{2}}_{\text{recursive connections}} + \underbrace{(k-1)}_{\text{frozen-unit connections}} + \underbrace{1}_{\text{threshold}}$,    (29)

where $N_V$ is the valence of the domain. Consequently, observing that during learning the outputs of the frozen hidden units computed on the training set can be stored, the output of the $k$th hidden unit on a single pattern (i.e., graph node) can be calculated in $O(k^2 N_V)$ time. This complexity dominates the cost of computing the output of the output unit, which is proportional to $N_L + k$ (label units plus frozen units). Concerning learning, it is trivial to note that the cost of training a single hidden unit dominates the cost of training the output unit. According to eqs. (26)-(28), the derivatives for the $k$th hidden unit with respect to a single pattern can be computed in $O(k^2 N_V^2)$ time, since $out\_degree_X(x) \leq N_V$. Thus, considering the full training set, it takes $O(P k^2 N_V^2)$ time. Finally, when building a network with $k$ hidden units, the computation of all the derivatives takes $O(P k^3 N_V^2)$ time [10]. The complexity in space is dominated by the space necessary for storing the derivatives of the currently trained unit, i.e., $O(P k^2 N_V)$.

[9] In practice, a bound on the number of iterations for the training of a single hidden unit is always used.

[10] The total number of computations in one step is proportional to $\sum_{i=1}^{k} P i^2 N_V^2 = P N_V^2 \sum_{i=1}^{k} i^2 = P N_V^2 \frac{k(k+1)(2k+1)}{6}$.
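A quick numeric check of eq. (29) and of the count in footnote 10 (the helper names are ours):

def free_parameters(k, NL, NV):
    """Free parameters (connections and threshold) of the k-th hidden unit, eq. (29)."""
    return NL + NV * k * (k + 1) // 2 + (k - 1) + 1

def derivative_ops_bound(P, k, NV):
    """Order of the number of derivative computations when building a network with
    k hidden units (cf. footnote 10): P * NV^2 * k(k+1)(2k+1)/6."""
    return P * NV ** 2 * k * (k + 1) * (2 * k + 1) // 6

print(free_parameters(3, NL=5, NV=2), derivative_ops_bound(P=100, k=3, NV=2))   # 20 5600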


When considering a diagonal connection matrix, i.e., hidden units with self-recursive connections only, the complexity of learning becomes $O(P k^2 N_V^2)$ in time and $O(P k N_V)$ in space.

4.5 Extension of Neural Trees

Neural Trees (NT) have recently been proposed as a fast learning method for classification tasks. They are decision trees [BFOS84] where the splitting of the data at each node, i.e., the classification of a pattern according to some features, is performed by a perceptron [SN90] or a more complex neural network [SM91]. After learning, each node at every level of the tree corresponds to an exclusive subset of the training data, and the leaf nodes of the tree completely partition the training set. In the operative mode, the internal nodes route the input pattern to the appropriate leaf node, which represents its class. An example of how a binary neural tree splits the training set is shown in Figure 14.

Figure 14: An example of a neural tree.

One advantage of the neural tree approach is that the tree structure is constructed dynamically during learning and is not linked to a static structure like a standard feed-forward network.


Moreover, it allows incremental learning, since subtrees can be added as well as deleted in order to recognize new classes of patterns or to improve generalization. Both supervised [SN90, Set90, Aa92, SM91] and unsupervised [PI92, LFJ92, Per92] splitting of the data have been proposed. We are interested in supervised methods.

The learning and classification algorithms for binary neural trees are described in Figure 15. Algorithms for general trees can be found in [SM91]. The extension of these algorithms to structures is straightforward: the standard discriminator (or network) associated with each node of the tree is replaced by a complex recursive discriminator (or network), which can be trained with any of the learning algorithms we have presented so far.

5 Experimental Results

We have tested some of the proposed architectures on several classification tasks involving logic terms. In the next subsection, we present the classification problems. Then, we discuss the results obtained with LRAAM-based networks and with Cascade-Correlation for structures.

5.1 Description of the Classification Problems

We have summarized the characteristics of each problem in Table 1. The first column of the table reports the name of the problem, the second the set of symbols (with associated arity) compounding the terms, the third column shows the rule(s) used to generate the positive examples of the problem [11], the fourth column reports the number of terms in the training and test set respectively, the fifth column the number of subterms in the training and test set, and the last column the maximum depth [12] of terms. For each problem, about the same number of positive and negative examples is given. Both positive and negative examples have been generated randomly. Training and test sets are disjoint and have been generated by the same algorithm.

[11] Note that the terms are all ground.

[12] We define the depth of a term as the maximum number of edges between the root and the leaf nodes in the term's LDAG representation.
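For concreteness, the target functions used in the classification problems of Table 1 are simple syntactic tests on ground terms; the following sketch (the term encoding and function names are ours) shows the tests behind inst1, inst4, and termocc1:

def is_instance_of_fXX(term):
    """True iff the ground term is an instance of f(X, X): top symbol f/2 and
    the two arguments are identical subterms (X occurs twice)."""
    return term[0] == "f" and len(term) == 3 and term[1] == term[2]

def is_instance_of_fXfaY(term):
    """True iff the ground term is an instance of f(X, f(a, Y))."""
    return (term[0] == "f" and len(term) == 3 and term[2][0] == "f"
            and len(term[2]) == 3 and term[2][1] == ("a",))

def contains(term, target):
    """True iff target occurs as a subterm of term (terms as nested tuples)."""
    return term == target or any(contains(sub, target) for sub in term[1:])

# examples (terms as nested tuples): f(f(a,b), f(a,b)) and f(i(a), c)
t1 = ("f", ("f", ("a",), ("b",)), ("f", ("a",), ("b",)))
t2 = ("f", ("i", ("a",)), ("c",))
print(is_instance_of_fXX(t1), contains(t2, ("i", ("a",))))   # True True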


Learning Algorithm

Let $\{(I_k, l_k)\}$ be the training set, where $I_k$ is a real-valued feature vector and $l_k$ the label of the associated class, $k = 1, \ldots, P$. Let $T(t)$ be the current training set for node $t$, and let $D_t$ be the discriminator associated with $t$ if it is a nonterminal node, or the label of a class if it is a leaf.

1. Set $t = root$ and $T(root) = \{(I_k, l_k)\}$;

2. If $T(t) = \emptyset$ then stop;

3. Train the discriminator $D_t$ to split $T(t)$ into two sets: $T_l(t)$, where the output of $D_t$ is 0, and $T_r(t)$, where the output of $D_t$ is 1;

4. (a) Add to $t$ a left node $t_l$; if $T_l(t)$ contains only patterns with label $l_j$ for some $j$, then $t_l$ is a leaf with training set $T(t_l) = \emptyset$, otherwise it is an internal node with training set $T(t_l) = T_l(t)$;

   (b) Add to $t$ a right node $t_r$; if $T_r(t)$ contains only patterns with label $l_j$ for some $j$, then $t_r$ is a leaf with training set $T(t_r) = \emptyset$, otherwise it is an internal node with training set $T(t_r) = T_r(t)$;

5. Repeat the same algorithm for $t_l$ and $t_r$, starting from step 2.

The discriminators can be implemented by a simple perceptron or by a feed-forward network.

Classification Algorithm

The class of a pattern $I_k$ can be established by the following algorithm:

1. Set $t = root$;

2. If the node $t$ is a leaf, classify $I_k$ with the associated label and stop; otherwise:

   (a) if $D_t(I_k) = 0$ then set $t = t_l$;

   (b) otherwise set $t = t_r$;

3. Go to step 2.

Figure 15: The learning and classification algorithms for binary neural trees.
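The following Python sketch (ours; it uses a plain perceptron discriminator on unstructured feature vectors and a two-class problem) is a minimal instance of the learning and classification algorithms of Figure 15; for structured inputs, the perceptron at each node would be replaced by a complex recursive discriminator trained with one of the algorithms of Section 4:

import numpy as np

class NeuralTreeNode:
    """A node of a binary neural tree: a leaf (class label) or an internal node
    holding a perceptron discriminator."""
    def __init__(self):
        self.label = None                 # set for leaves
        self.w = None                     # perceptron weights (with bias) for internal nodes
        self.left = self.right = None

def train_perceptron(X, y, epochs=100, lr=0.1):
    """y in {0, 1}; returns the weights of a threshold unit with bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            w += lr * (yi - float(xi @ w > 0)) * xi
    return w

def grow(X, y):
    """Grow a binary neural tree for a two-class problem (labels 0/1)."""
    node = NeuralTreeNode()
    if len(set(y)) == 1:                  # only one class left: leaf
        node.label = int(y[0])
        return node
    node.w = train_perceptron(X, np.array(y))
    out = (np.hstack([X, np.ones((len(X), 1))]) @ node.w > 0).astype(int)
    left, right = out == 0, out == 1
    if left.all() or right.all():         # discriminator failed to split: majority leaf
        node.label = int(round(float(np.mean(y))))
        return node
    node.left = grow(X[left], list(np.array(y)[left]))
    node.right = grow(X[right], list(np.array(y)[right]))
    return node

def classify(node, x):
    while node.label is None:             # route the pattern to a leaf
        node = node.right if float(np.append(x, 1.0) @ node.w > 0) else node.left
    return node.label

# toy usage: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.3, (20, 2)), rng.normal(+1, 0.3, (20, 2))])
y = [0] * 20 + [1] * 20
tree = grow(X, y)
print(classify(tree, np.array([-1.0, -1.0])), classify(tree, np.array([1.0, 1.0])))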


Problem              Symbols                               Positive examples                 #terms (tr., test)  #subterms (tr., test)  max. depth
lblocc1 long         f/2 i/1 a/0 b/0 c/0                   no occurrence of the label c      (259, 141)          (444, 301)             (5, 5)
termocc1             f/2 i/1 a/0 b/0 c/0                   the (sub)terms i(a) or f(b,c)     (173, 70)           (179, 79)              (2, 2)
                                                           occur somewhere
termocc1 very long   f/2 i/1 a/0 b/0 c/0                   the (sub)terms i(a) or f(b,c)     (280, 120)          (559, 291)             (6, 6)
                                                           occur somewhere
inst1                f/2 a/0 b/0 c/0                       instances of f(X,X)               (200, 83)           (235, 118)             (3, 2)
inst1 long           f/2 a/0 b/0 c/0                       instances of f(X,X)               (202, 98)           (403, 204)             (6, 6)
inst4                f/2 a/0 b/0 c/0                       instances of f(X,f(a,Y))          (175, 80)           (179, 97)              (3, 2)
inst4 long           f/2 a/0 b/0 c/0                       instances of f(X,f(a,Y))          (290, 110)          (499, 245)             (7, 6)
inst7                t/3 f/2 g/2 i/1 j/1 a/0 b/0 c/0 d/0   instances of t(i(X),g(X,b),b)     (191, 109)          (1001, 555)            (6, 6)

Table 1: Description of a set of classification problems involving logic terms.

It must be noted that the set of proposed problems ranges from the detection of a particular atom (label) in a term to the satisfaction of a specific unification pattern. Specifically, in the unification patterns for the problems inst1 and inst1 long, the variable X occurs twice, making these problems much more difficult than inst4 long, because any classifier for these problems would have to compare arbitrary subterms corresponding to X.

5.2 LRAAM-based Networks

Best results for an LRAAM-based network:

Problem       #L-unit  #H-unit  Learning par. (LRAAM lr, classifier lr, LRAAM momentum)  % Dec.-Enc.  % Tr.   % Ts.   #epochs
lblocc1 long  8        35       0.2, 0.001, 0.5                                          4.25         100     98.58   11951
termocc1      8        25       0.1, 0.1, 0.2                                            100          98.84   94.29   27796
inst1         6        35       0.2, 0.06, 0.5                                           100          97      93.98   10452
inst1 long    6        45       0.2, 0.005, 0.5                                          36.14        94.55   90.82   80000
inst4         6        35       0.2, 0.005, 0.5                                          98.86        100     100     1759
inst4 long    6        35       0.2, 0.005, 0.5                                          8.97         100     100     6993
inst7         13       40       0.1, 0.01, 0.2                                           1.05         100     100     6158

Table 2: The best results obtained for almost all the classification problems by an LRAAM-based network.


In Table 2 we report the best result obtained for almost all the problems described in Table 1, over 4 different network settings (both in the number of hidden units for the LRAAM and in the learning parameters), for the LRAAM-based network with a single unit as classifier. The simulations were stopped after 30,000 epochs (except for problem inst1 long, for which we used a bound of 80,000 epochs), or when the classification problem over the training set was completely solved. We made no extended effort to optimize the size of the network and the learning parameters, so it should be possible to improve on the reported results. The first column of the table shows the name of the problem, the second the number of units used to represent the labels, the third the number of hidden units, the fourth the learning parameters (the learning rate for the LRAAM, the learning rate for the classifier, and the momentum for the LRAAM), the fifth the percentage of terms in the training set which the LRAAM was able to properly encode and decode, the sixth the percentage of terms in the training set correctly classified, the seventh the percentage of terms in the test set correctly classified, and the eighth the number of epochs the network employed to reach the reported performance.

From the results, it can be noted that some problems get a very satisfactory solution even if the LRAAM performs poorly. Moreover, this behavior does not seem to be related to the complexity of the classification problem, since both problems involving the simple detection of an atom (label) in the terms (lblocc1 long) and problems involving the satisfaction of a specific unification rule (inst4 long, inst7) can be solved without the need of a fully developed LRAAM. Thus, it is clear that the classification of the terms is exclusively based on the encoding power of the LRAAM's encoder, which is shaped both by the LRAAM error and by the classification error. However, even if the LRAAM's decoder is not directly involved in the classification task, it helps the classification process, since it forces the network to generate different representations for terms in different classes [13].

In order to give a feeling of how learning proceeds, the performance of the networks during training is shown for the problems termocc1, inst4, and inst4 long in Figures 16-18, where the encoding-decoding performance curve of the LRAAM on the training set is reported, together with the classification curves on the training and test sets.

[13] Actually, the decoder error forces the LRAAM network to develop a different representation for each term; however, when the error coming from the classifier is very strong, it can happen that terms in the same class get almost identical representations.


Figure 16: Performance curves for termocc1 (success (%) versus epochs, for the encoding-decoding, training-set, and test-set curves).


Figure 17: Performance curves for inst4 (success (%) versus epochs, for the encoding-decoding, training-set, and test-set curves).

5.2.1 Reduced Representations for Classification

In this section, we briefly discuss the representational differences between a basic LRAAM (without classifier) and the architecture we used. The basic LRAAM organizes the representational space in such a way that similar structures get similar reduced representations (see [Spe94b] for more details). This happens because, even if the LRAAM is trained in supervised mode both over the output of the network and over the relationships among components (i.e., the information about the pointers), the network is auto-associative and thus it decides by itself the representations for the pointers. Consequently, the learning mode for the LRAAM is mainly unsupervised. When a classifier is introduced (as in our system), the training of the LRAAM is no longer mainly unsupervised, since the error of the classifier constrains the learning. The resulting learning regime is somewhere between an unsupervised and a supervised mode.


Figure 18: Performance curves for inst4 long (success (%) versus epochs, for the encoding-decoding, training-set, and test-set curves).


Figure 19: The first, second, and third principal component of the reduced representations, devised by a basic LRAAM on the training and test sets of the inst1 problem, yield a nice 3D view of the term representations (positive and negative examples of both sets are plotted).


Figure 20: Results of the principal components analysis (first, second, and third principal component) of the reduced representations developed by the proposed network (LRAAM + Classifier) for the inst1 problem.


In order to understand the differences between the representations devised by a basic LRAAM and the ones devised in the present paper, we trained a basic LRAAM (of the same size as the LRAAM used by our network) over the training set of inst1; we then computed the first, second, and third principal components of the reduced representations obtained both for the training and the test set [14]. These principal components are plotted in Figure 19. It can be noted that the obtained representations mainly cluster at specific points of the space. Terms of the same depth constitute a single cluster, and terms of different depth are in different clusters. The same plot for the reduced representations devised by our network (as from Table 2, row 3) is presented in Figure 20. The overall difference with respect to the basic LRAAM plot consists in a concentration of more than half (57%) of the positive examples of the training and test sets in a well-defined cluster, while the remaining representations are spread within two main subspaces. The well-defined cluster can be understood as the set of representations for which there was no strong interference between the decoder of the LRAAM and the classifier (this allowed the formation of the cluster), while the remaining representations do not preserve the cluster structure, since they have to satisfy competing constraints coming from the classifier and the decoder. Specifically, the classifier tends to cluster the representations into two well-defined clusters (one for each class), while the LRAAM decoder tends to develop well-distinct reduced representations, since they must be decoded into different terms.

The above considerations on the final representations for the terms are valid only if the LRAAM reaches a good encoding-decoding performance on the training set. However, as reported in Table 2, some classification problems can be solved even if the LRAAM performs poorly. In this case, the reduced representations contain almost exclusively information about the classification task. In Figure 21 and Figure 22 we report the results of a principal components analysis of the representations developed for the problems inst4 long and inst7, respectively. In the former, the first and second principal components suffice for a correct solution of the classification problem. In the latter, the second principal component alone gives enough information for the solution of the problem. Moreover, notice how the representations developed for inst7 are clustered with smaller variance than the representations developed for inst4 long, and how this is in accordance with the better encoding-decoding performance of the latter compared to the former. Of course, this does not constitute enough evidence to conclude that the relationship between the variance of the clusters and the performance of the LRAAM is demonstrated; however, it seems to be enough to call for a more accurate study of this issue.

[14] We considered only the reduced representations for the terms. No reduced representation for subterms was included in the analysis.


[Figure 21 plot: inst4_long; axes: first and second principal components; series: train-pos, train-neg, test-pos, test-neg.]

Figure 21: Results of the principal components analysis (first and second principal components) of the reduced representations developed by the proposed network (LRAAM + Classifier) for the inst4_long problem. The resulting representations are clearly linearly separable.

5.3 Cascade-Correlation for Structures

The results obtained by Cascade-Correlation for structures, shown in Table 3, were obtained for a subset of the problems using a pool of 8 units. The networks used have both triangular and diagonal recursive connection matrices and no connections between hidden units.
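To make the distinction concrete, the sketch below shows one plausible reading of "diagonal" versus "triangular" recursive connection matrices for a pool of candidate recursive units: with a diagonal matrix each unit receives, from the children of the current node, only its own output, while with a lower triangular matrix unit i also receives the outputs of units j < i. This is only a sketch under our own assumptions (a single recursive weight matrix shared across child positions, arbitrary sizes and initialization), not the code used in the experiments.

    import numpy as np

    POOL = 8  # pool size used for all networks in Table 3

    def recursive_mask(n, kind):
        # "diagonal": each unit is fed back only its own output at the children;
        # "triangular": unit i is also fed the children outputs of units j < i.
        return np.eye(n) if kind == "diagonal" else np.tril(np.ones((n, n)))

    rng = np.random.default_rng(0)
    W_label = rng.normal(scale=0.1, size=(POOL, 4))    # 4 hypothetical label inputs
    W_rec = rng.normal(scale=0.1, size=(POOL, POOL))   # recursive weights

    def pool_outputs(label, child_outputs, kind):
        # child_outputs: one POOL-dimensional vector per child of the node;
        # for brevity one recursive weight matrix is shared across child positions.
        rec = sum((W_rec * recursive_mask(POOL, kind)) @ c for c in child_outputs)
        return np.tanh(W_label @ label + rec)

    # A leaf has no children (empty recursive input); a parent reuses the
    # outputs already computed for its children.
    leaf = pool_outputs(np.ones(4), [], kind="triangular")
    parent = pool_outputs(np.ones(4), [leaf, leaf], kind="triangular")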


[Figure 22 plot: inst7; axes: first and second principal components; series: train-pos, train-neg, test-pos, test-neg.]

Figure 22: Results of the principal components analysis (first and second principal components) of the reduced representations developed by the proposed network (LRAAM + Classifier) for the inst7 problem. The resulting representations can be separated using only the second principal component.


Table 3: Results obtained on the test sets for each classification problem, using networks with both triangular and diagonal recursive connection matrices and no connections between hidden units. The same number of units (8) in the pool was used for all networks. For the problems inst1, inst1_long, inst4_long, and termocc1_very_long, the table reports the number of label units, the number of trials, the number of hidden units (mean, min, and max), and the test performance (mean, min, and max).

Networks with a diagonal recursive connection matrix:
# Label Units: 8, 6, 6, 6. # Trials: 3, 5, 3, 3.
# Hidden Units, Mean (Min - Max): 11 (8 - 16), 12.66 (9 - 15), 12.4 (7 - 21), 18.66 (17 - 21).
% Test, Mean (Min - Max): 82.99 (75.51 - 91.84), 98.48 (97.27 - 100), 96.94 (95 - 99.17), 91.65 (88.87 - 94.89).

Networks with a triangular recursive connection matrix:
# Label Units: 8, 6, 6, 6. # Trials: 4, 5, 16, 9.
# Hidden Units, Mean (Min - Max): 7.81 (6 - 11), 13.25 (8 - 19), 7.2 (4 - 12), 11.33 (7 - 19).
% Test, Mean (Min - Max): 97.70 (95 - 100), 99.64 (99.09 - 100), 90.6 (87.95 - 92.77), 90.09 (86.75 - 92.77).

We decided to remove the connections between hidden units to reduce the probability of overfitting. We made no extended effort to optimize the learning parameters or the number of units in the pool, so it should be possible to improve significantly on the reported results.
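For readers unfamiliar with the candidate pool, the sketch below shows the standard Cascade-Correlation scoring step of [FL90]: every candidate in the pool is scored by the magnitude of the (unnormalized) covariance between its output and the residual error of the current network, and the best candidate is then frozen into the network. The structure-specific forward pass of the candidates is omitted, and all names, shapes, and random data are illustrative assumptions of ours.

    import numpy as np

    def candidate_scores(V, E):
        # V: (n_candidates, n_patterns) candidate outputs over the training set.
        # E: (n_patterns, n_outputs) residual errors of the current network.
        Vc = V - V.mean(axis=1, keepdims=True)
        Ec = E - E.mean(axis=0, keepdims=True)
        # Sum over output units of the |unnormalized covariance| per candidate.
        return np.abs(Vc @ Ec).sum(axis=1)

    # Hypothetical pool of 8 candidates evaluated on 100 patterns, one output unit.
    rng = np.random.default_rng(0)
    V = rng.normal(size=(8, 100))
    E = rng.normal(size=(100, 1))
    best = int(np.argmax(candidate_scores(V, E)))   # candidate to install next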


We believe that other learning procedures, which are not covered by this report, can be adapted as well.

The proposed approach to learning in structured domains can be adopted for automatic inference in syntactic and structural pattern recognition. Specifically, in this report we demonstrated the possibility of performing classification tasks involving logic terms. It must be noted that automatic inference can also be obtained by using Inductive Logic Programming [MR94]. The proposed approach, however, has its own specific peculiarity, since it can approximate functions from a structured domain (possibly with real-valued vectors as labels) to the reals. In particular, we believe that the proposed approach can fruitfully be applied to molecular biology and chemistry (classification of chemical structures, quantitative structure-property relationships (QSPR), quantitative structure-activity relationships (QSAR)), where it can be used for the automatic determination of topological indexes [Rou90], which are usually designed through a very expensive trial-and-error approach.

In conclusion, the proposed architectures extend the processing capabilities of neural networks, allowing the processing of structured patterns of variable size and complexity. However, it must be pointed out that some of the proposed architectures have computational limitations. For example, Cascade-Correlation for structures is limited by the fact that frozen hidden units cannot receive input from hidden units introduced after their insertion into the network. These limitations, in the context of standard Recurrent Cascade-Correlation (RCC), have been discussed in [GCS+95], where it is demonstrated that certain finite state automata cannot be implemented by networks built up by RCC using monotone activation functions. Since our algorithm reduces to standard RCC when considering sequences, it follows that it has these limitations as well.

Acknowledgment

We would like to thank Christoph Goller for the generation of the training and test sets used in this paper.


References

[Aa92] L. Atlas et al. A performance comparison of trained multilayer perceptrons and trained classification trees. Proceedings of the IEEE, 78:1614-1619, 1992.

[Alm87] L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In M. Caudill and C. Butler, editors, Proceedings of the IEEE First Annual International Conference on Neural Networks, pages 609-618. CA: IEEE, 1987.

[Ati88] A. Atiya. Learning on a general network. In D. Z. Anderson, editor, Neural Information Processing Systems, pages 22-30. New York: AIP, 1988.

[BFOS84] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.

[CG83] M. A. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. on Systems, Man, and Cybernetics, 13:815-826, 1983.

[Elm90] J. L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.

[Fah91] S. E. Fahlman. The recurrent cascade-correlation architecture. Technical Report CMU-CS-91-100, Carnegie Mellon, 1991.

[FL90] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 524-532. San Mateo, CA: Morgan Kaufmann, 1990.

[GCS+95] C. L. Giles, D. Chen, G. Z. Sun, H. H. Chen, Y. C. Lee, and M. W. Goudreau. Constructive learning of recurrent neural networks: Limitations of recurrent cascade correlation and a simple solution. IEEE Transactions on Neural Networks, 1995. To appear.


[GK95] C. Goller and A. Küchler. Learning Task-Dependent Distributed Structure-Representations by Backpropagation Through Structure. AR-report AR-95-02, Institut für Informatik, Technische Universität München, 1995.

[Hop84] J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. In Proc. Natl. Acad. Sci., pages 3088-3092, 1984.

[LFJ92] T. Li, L. Fang, and A. Jennings. Structurally adaptive self-organizing neural trees. In International Joint Conference on Neural Networks, pages 329-334, 1992.

[MR94] S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19-20:629-679, 1994.

[Per92] M. P. Perrone. A soft-competitive splitting rule for adaptive tree-structured neural networks. In International Joint Conference on Neural Networks, pages 689-693, 1992.

[PI92] M. P. Perrone and N. Intrator. Unsupervised splitting rules for neural tree classifiers. In International Joint Conference on Neural Networks, pages 820-825, 1992.

[Pin88] F. J. Pineda. Dynamics and architecture for neural computation. Journal of Complexity, 4:216-245, 1988.

[Pol90] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46(1-2):77-106, 1990.

[Rou90] D. H. Rouvray. Computational Chemical Graph Theory, page 9. Nova Science Publishers: New York, 1990.

[Set90] I. K. Sethi. Entropy nets: From decision trees to neural networks. Proceedings of the IEEE, 78:1605-1613, 1990.

[SM91] A. Sankar and R. Mammone. Neural Tree Networks, pages 281-302. Neural Networks: Theory and Applications. Academic Press, 1991.


[SN90] J. A. Sirat and J.-P. Nadal. Neural trees: a new tool for classification. Network, 1:423-438, 1990.

[Spe94a] A. Sperduti. Encoding of Labeled Graphs by Labeling RAAM. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 1125-1132. San Mateo, CA: Morgan Kaufmann, 1994.

[Spe94b] A. Sperduti. Labeling RAAM. Connection Science, 6(4):429-459, 1994.

[Spe95] A. Sperduti. Stability properties of labeling recursive auto-associative memory. IEEE Transactions on Neural Networks, 6(6):1452-1460, 1995.

[SSG94] A. Sperduti, A. Starita, and C. Goller. Fixed length representation of terms in hybrid reasoning systems, report I: Classification of ground terms. Technical Report TR-19/94, Dipartimento di Informatica, Università di Pisa, 1994.

[SSG95] A. Sperduti, A. Starita, and C. Goller. Learning distributed representations for the classification of terms. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 509-515, 1995.

[WZ89] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270-280, 1989.