

Pergamon Neural Networks, Vol. 7, No. 1, pp. 129-140, 1994

Copyright © 1994 Elsevier Science Ltd. Printed in the USA. All rights reserved

0893-6080/94 $6.00 + .00

CONTRIBUTED ARTICLE

Dynamic Node Architecture Learning: An Information Theoretic Approach

ERIC B. BARTLETT

Iowa State University

(Received 13 August 1992; revised and accepted 17 May 1993)

Abstract--Typically, artificial neural network (ANN) training schemes require network size to be set before learning is initiated. The learning speed and generalization characteristics of ANNs are, however, dependent on this pretraining selection of the network architecture. The training and generalization viability of a specific network can, therefore, only be evaluated posttraining. This work presents an information theoretic method that alleviates this predicament by building the appropriate network architecture dynamically during the training process. The method, called dynamic node architecture learning (DNAL), eliminates the need to select network size before training. Examples illustrate the use and advantages of the information theoretic DNAL approach over static architecture learning (SAL).

Keywords--Connectionism, Dynamic node allocation, Neural network computing, Network architectures, Optimization, Supervised learning.

1. INTRODUCTION

Many researchers have recognized the potential of artificial neural networks (ANNs) for pattern recognition, system modeling, and other uses. Although promising, ANN applications may be restricted by limitations such as slow learning, inefficient scaling to large problems, and uncertainty about their recall generalization results (Gallant, 1990; Judd, 1990; McInerney, Haines, Biafore, & Hecht-Nielsen, 1989; Rumelhart, McClelland, & PDP Research Group, 1986; Werbos, 1989; Wolpert, 1992). One of the causes of these symptoms is the inability to predetermine appropriate network sizes or architectures for given problems before training is attempted. A typical question might be, for example: How many hidden nodes are needed to minimize training time or maximize performance in some way? One simple approach is to train many networks, each with a different number of hidden nodes, and then employ the network with the best posttraining characteristics. This approach significantly increases training time because

Acknowledgments: This work was made possible by the generous support of the United States Department of Energy under Special Research Grant No. DE-FG02-92ER75700, entitled "Neural Network Recognition of Nuclear Power Plant Transients." This support does not constitute an endorsement by DOE of the views expressed in this article.

Requests for reprints should be sent to Prof. E. B. Bartlett, 104 Nuclear Engineering Lab., Dept. of Mechanical Engineering, Iowa State University, Ames, IA 50011.


many ANNs must be trained on the same data set. Furthermore, a suboptimal architecture will most likely be obtained because the correct architecture may not be one of the initial selections.

Other somewhat more sophisticated approaches have been used that rely on empirical relations to determine network architectures prior to training (Cotter, 1990; Hecht-Nielsen, 1987, 1989, 1990; Lippmann, 1987; Upadhyaya & Eryurek, 1992; Widrow & Lehr, 1990). These relationships are derived from theoretical considerations of the minimum network size needed to memorize a given training set. Hecht-Nielsen uses Kolmogorov's theorem to imply that any continuous function can be approximated with as few as 2·J(1) + 1 hidden nodes, where J(1) is the number of network inputs (Kolmogorov, 1957). Upadhyaya and Eryurek (1992) assert that the minimum quantity of binary coded bits necessary to give a unique code to each pattern in the training set is log2(N), where N is the number of exemplars in the training set. Furthermore, Upadhyaya and Eryurek contend that if there is more than one input node, this number is multiplied by the number of input nodes. Regardless of the elegance of these theories, experience has shown that these relationships may not always provide appropriate ANN architectures.
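The two sizing heuristics are easy to state in code. A minimal sketch (the function names, and the decision to round log2(N) up to a whole node count, are my own conventions, not from the paper):

```python
import math

def hecht_nielsen_hidden(n_inputs):
    # Kolmogorov-theorem-based bound: 2*J(1) + 1 hidden nodes.
    return 2 * n_inputs + 1

def upadhyaya_eryurek_hidden(n_inputs, n_exemplars):
    # log2(N) bits to uniquely code each training pattern,
    # multiplied by the number of input nodes.
    return n_inputs * math.ceil(math.log2(n_exemplars))
```

For the exclusive-nor problem of Section 4 (two inputs, four exemplars) these give 5 and 4 hidden nodes, matching the SAL architectures compared there.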

The objective of this paper is to present a systematic architecture construction method, called dynamic node architecture learning (DNAL), that eliminates the need to preselect network size. Others have performed research in related areas; however, their approaches tend


to emphasize network pruning, weight elimination, or other network reduction techniques; are rule based in nature; or rely on genetic algorithms (Ash, 1989; Bornholt & Graudenz, 1992; Hirose, Yamashita, & Hijiya, 1991; Karnin, 1990; Sietsma & Dow, 1988; Vaario & Ohsuga, 1991; Weigend, Rumelhart, & Huberman, 1991; Won & Pimmel, 1991). Not only does the present work emphasize the ideas of dynamic network construction rather than pruning or genetic trial and error, but it also introduces a formal definition of nodal importance based on information theory (Watanabe, 1969). By starting the network with a small number of hidden units and building an ANN architecture, DNAL allows the network to learn the gross features of the desired mapping when the structure of the network is small and the finer details of the mapping as the number of hidden units is increased (Kirkpatrick, Gelatt, & Vecchi, 1983). In addition, the use of the nodal importance function allows the network to determine precisely which nodes are contributing to the network output and which are not. The results of this work show that DNAL requires little computational overhead, provides viable trained ANN architectures, reduces noise sensitivity, and actually reduces learning time in many cases.

The next section of this paper describes the network paradigm used to demonstrate the DNAL approach. Section 3 provides the theoretical bases for DNAL. Section 4 shows results from the DNAL method and compares them to similar results obtained using static


architecture learning (SAL) networks. Section 5 con- tains the concluding remarks.

2. NETWORK AND NODAL ARCHITECTURES

The networks used for the demonstration of DNAL utilize layered continuous perceptrons and a self-optimizing stochastic learning algorithm (SOSLA) described and applied elsewhere (Bartlett, 1990, 1991, 1992; Bartlett & Basu, 1991; Bartlett & Uhrig, 1991a,b, 1992a,b). A brief review of the SOSLA approach is, however, given below.

A mapping M, which may be continuous or discrete, such that

X_{I+1,n} = M(X_{1,n})    (1)

is modeled by a network of layered nodes as shown in Figure 1, where

X_{1,n} = (x_{1,1,n}, x_{1,2,n}, ..., x_{1,J(1),n})    (2)

is the input vector,

X_{I+1,n} = (x_{I+1,1,n}, x_{I+1,2,n}, ..., x_{I+1,J(I+1),n})    (3)

is the output vector, which corresponds to the output of the Ith layer of active (hidden or output) nodes, and J(1) and J(I+1) are the dimensions of the input and output vectors, respectively. Note that the input nodes are inactive in that their input is equal to their output; therefore, we have I layers of active nodes. Also note

[Figure 1 here: a layered network with input nodes x_{1,1,n} ... x_{1,J(1),n}, hidden nodes fed by weights w_{2,j,k}, and output nodes x_{3,1,n} ... x_{3,J(I+1),n} fed by weights w_{3,j,k}.]

FIGURE 1. An example network showing input, hidden, and output nodes, as well as the indexing notation for the nodes, activations, and weights. Note that I = 2 in this example.


that n is the training set exemplar (input-output pattern) number. Each active node has the following input-output relation:

x_{i,j,n} = (1/π) · arctan( Σ_{k=1}^{J(i−1)} w_{i,j,k} · x_{i−1,k,n} ) + 1/2.    (4)

The trainable parameter set is {w_{i,j,k}}. The artificial neurons (nodes, units) used are very similar to those used in the typical backpropagation paradigm (Hecht-Nielsen, 1989). These nodes do, however, use the arctangent rather than the usual exponential sigmoid function.
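The active-node relation of eqn (4) is a one-liner. A direct transcription, here with flat weight and activation lists rather than the paper's triple indices:

```python
import math

def node_output(weights, prev_activations):
    # Eqn (4): arctangent sigmoid of the weighted sum, squashed into (0, 1).
    s = sum(w * x for w, x in zip(weights, prev_activations))
    return math.atan(s) / math.pi + 0.5
```

A zero net input gives an activation of exactly 0.5, and large positive or negative sums saturate toward 1 or 0, respectively.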

The error (cost) function to be minimized has the form

c(W) = { [1/(N·J(I+1))] · Σ_{n=1}^{N} Σ_{j=1}^{J(I+1)} (x_{I+1,j,n} − xD_{I+1,j,n})² }^{1/2}    (5)

where N is the number of training exemplars in the training set, {Ω_1, Ω_{I+1}}. This cost function is the root mean square (RMS) error of the network output over the training set. Note that {Ω_1} is a subset of all possible inputs {X_1}, and {Ω_{I+1}} is a subset of all correct or desired outputs {XD_{I+1}} associated with {X_1}. The problem is to reconstruct or approximate the unknown desired mapping Z, such that

XD_{I+1} = Z(X_1)    (6)

from {Ω_1, Ω_{I+1}}. There are, however, many solutions M that satisfy the training set

Ω_{I+1} = M(Ω_1),    (7)

none of which are necessarily the desired solution

Ω_{I+1} = Z(Ω_1).    (8)
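The RMS cost of eqn (5) can be sketched for any candidate mapping M; a hypothetical `model` callable stands in for the network here:

```python
import math

def rms_cost(model, inputs, desired):
    # Eqn (5): RMS error of the model outputs over the training set.
    n, j_out = len(inputs), len(desired[0])
    total = sum((y - d) ** 2
                for x, dvec in zip(inputs, desired)
                for y, d in zip(model(x), dvec))
    return math.sqrt(total / (n * j_out))
```

A model that reproduces the training set exactly scores zero; errors are averaged over both exemplars and output nodes before the square root is taken.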

An outline of SOSLA with DNAL is as follows:

1. Make random guesses to pick values for each member of the weight set and evaluate c(W)^0. This is the starting point, t = 0.
2. Make a small change to each weight in the set and reevaluate c(W). This yields c(W)^{t+1}. If c(W)^{t+1} < c(W)^t, continue to Step 3; if not, increment t and repeat Step 2.
3. Store the best weight set; discard the weight set with the larger c(W).
4. Change the parameter selection criteria based on information gained during Step 2.
5. If the network learning is slow, expand the network by adding a node to the hidden layer.
6. If the total cost is acceptable, such that c(W)^t < ε for some desired ε, or any hidden node has very low importance or is redundant, then reduce the network size by deleting the node.
7. If the network structure oscillates about some fixed architecture, stop; otherwise, go to Step 2.
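The outline above can be sketched for a toy two-input, one-output network. This is not the author's SOSLA: the Monte Carlo selection-criteria adaptation of Step 4 and the importance-based pruning of Steps 6-7 are omitted, and plain Gaussian perturbations stand in for the adapted weight changes. It only illustrates Steps 1, 2, 3, and 5, with node growth triggered by a stalled error; all parameter values are illustrative.

```python
import math, random

XNOR = [((0, 0), 1), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def act(s):
    # Arctangent sigmoid of eqn (4), range (0, 1).
    return math.atan(s) / math.pi + 0.5

def forward(hid_w, out_w, x):
    # hid_w: one [w1, w2, bias] triple per hidden node; out_w: weights + bias.
    h = [act(wj[0] * x[0] + wj[1] * x[1] + wj[2]) for wj in hid_w]
    return act(sum(w * hj for w, hj in zip(out_w[:-1], h)) + out_w[-1])

def cost(hid_w, out_w, data=XNOR):
    # RMS error over the training set, eqn (5).
    return math.sqrt(sum((forward(hid_w, out_w, x) - d) ** 2
                         for x, d in data) / len(data))

def train_dnal(hidden=1, iters=3000, eps=0.05, plateau=300, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial weight set.
    hid_w = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    out_w = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    best, stuck = cost(hid_w, out_w), 0
    for _ in range(iters):
        # Steps 2-3: perturb every weight a little; keep the change only if it helps.
        trial_h = [[w + rng.gauss(0, 0.3) for w in wj] for wj in hid_w]
        trial_o = [w + rng.gauss(0, 0.3) for w in out_w]
        c = cost(trial_h, trial_o)
        if c < best:
            hid_w, out_w, best, stuck = trial_h, trial_o, c, 0
        else:
            stuck += 1
        # Step 5: on a learning plateau, grow the hidden layer with small random weights.
        if stuck > plateau and best > eps:
            hid_w.append([rng.uniform(-0.01, 0.01) for _ in range(3)])
            out_w.insert(-1, rng.uniform(-0.01, 0.01))
            stuck = 0
    return len(hid_w), best
```

Because only improving weight sets are kept, the returned error never exceeds the initial random-guess error, mirroring Step 3 of the outline.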


The SOSLA approach used for SAL is identical to the repetition of the first four steps of this algorithm. The key to SAL is Step 4, where the challenge is to determine the best method to adapt the selection criteria so that the result is an increased probability of successful future weight selections. Step 4 is accomplished by utilizing the theory of Monte Carlo trials. Once the training problem is posed in integral form (Bartlett, 1990) and the stochastic evaluation of integrals procedure (Ripley, 1987) is used, the theory of Monte Carlo importance function biasing (Stevens, 1984) is applied to determine the optimal probability density function (OPDF) from which to select the weight changes. Learning is adapted by the algorithm through changes to the parameters that define the OPDF. The learning dynamics result from continually updating the system estimate of the OPDF supplied by the theory. The estimate of these parameters is updated during code execution only when better information is gained about their appropriate values.

During the DNAL phase of the learning cycle (Steps 5, 6, and 7), the algorithm stops to find the importance of each node and may execute an architecture change only after the ANN is stuck on a learning plateau. In this way the required computer time for the DNAL phase is kept to a minimum.

3. DYNAMIC NODE ARCHITECTURE THEORY

ANNs must be able to generalize information gained through training (Wolpert, 1990). Without the ability to generalize, neural networks would be of little interest. Therefore, network paradigms must be evaluated on not only their speed and depth of convergence but also their generalization capabilities, which may be difficult to evaluate. Minimization of the recall set error is the ultimate goal; however, the recall set is rarely known a priori, and if it were, an ANN would be unnecessary. Minimization over an unknown recall set implies an inductive generalization of things learned from the training set. More than a simple optimization over an arbitrary training set is required if good generalization characteristics are to be obtained. Correct methods for choosing the training set, the training control parameters, and the network size are also required.

Generalization is the ability to quantitatively estimate the characteristics of a phenomenon never encountered before on the basis of its similarities with things already known (Li, 1985; Stone, 1977; Tishby & Levin, 1989; Wolpert, 1990). This ability implies that the generalizer can separate out the similar characteristics of two phenomena and also has the ability to distinguish between specifics and generalities. However, if an event is not learned in detail, these details will not be confused with the general characteristics of the phenomena. The objective, therefore, is to teach


TABLE 1
Four Pattern Exclusive-Nor Training Data Set

Pattern Number   Input 1   Input 2   Desired Output
1                0.0000    0.0000    1.0000
2                0.0000    1.0000    0.0000
3                1.0000    0.0000    0.0000
4                1.0000    1.0000    1.0000

the ANN only the important characteristics of the desired task, those that set apart the classes of interest. The mathematical analogy to this is to reduce the number of ways to interpret details. In network implementation this means reducing the number of weights and nodes (Ash, 1989; Ishikawa, 1989; Kruschke, 1989). However, because the appropriate number of weights and nodes is unknown at the beginning of the training process and may even vary as the learning task is accomplished, a variable node architecture scheme should be used during training. The network to be trained can be initialized with small numbers of nodes; the network size can then be increased or decreased on the basis of the importance of each individual node in the network until the network can successfully identify all classes in the training set with the minimum number of nodes and interconnections. The final network should be a better generalizer because it has the fewest ways to distinguish the classes learned.

Information concerning the contribution each node makes toward the desired network mapping is needed if the number of nodes is to be controlled during training. One definition for the nodal contribution or importance is based on the partial derivatives of the network outputs with respect to each hidden nodal output (Bartlett & Basu, 1991); however, this definition is based on the learned network mapping. Information theory (Hyvarinen, 1970; Kullback, 1959; Shannon & Weaver, 1971) is the basis of the more sophisticated approach presented in this work. We define the importance of a hidden node as a function of the nonredundant interdependency of the output activation of the hidden node with respect to the network output. If there is a functional relationship between the output of a hidden node and the desired output of an output node, provided the relationship is not duplicated by some other node (Hirose, Yamashita, & Hijiya, 1991), that node is important to the function mapped by the network.

A few words about information theory and interdependency analysis (Press, Flannery, Teukolsky, & Vetterling, 1986; Watanabe, 1969) are in order before information theoretic DNAL can be discussed in detail. Let us, for the moment, assume discrete binary nodal activations for each node in the network of interest. Further, let us assume that the activation values are treated as a stochastic process as each node responds to the input patterns in the training set. Then the output of each node (i, j) is the set {x_{i,j,n}} of activations, where N is the number of training exemplars, n = 1, 2, 3, ..., N. Because the output of the node can take on only one value per exemplar, the individual output activations are disjoint. These outputs are also exhaustive because some activation must occur. Now, define the probability of occurrence in the training set of any particular activation, in this case either zero or one or in general x_{i,j,g}, to be p_{i,j,g}. An estimate of this probability for each node (i, j) is simply the number of occurrences of each particular x_{i,j,g} in the training set divided by the total number of patterns in the training set. The entropy, or information, in the activation of any node (i, j) is then

H(x_{i,j}) = − Σ_{g=0}^{1} p_{i,j,g} · log2(p_{i,j,g}).    (9)
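For binary activations, eqn (9) with frequency-count probability estimates is a few lines (using log base 2, so entropy is measured in bits):

```python
import math

def node_entropy(activations):
    # Eqn (9): entropy of one node's activations over the training set,
    # with probabilities estimated by counting occurrences.
    n = len(activations)
    return -sum((activations.count(v) / n) * math.log2(activations.count(v) / n)
                for v in set(activations))
```

A node that outputs 0 and 1 equally often carries one full bit of information; a constant node carries none.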

If we are interested in the amount of information displayed by two nodes, say x_{i,j} and x_{k,m}, the equation is modified to

H(x_{i,j}, x_{k,m}) = − Σ_{g=0}^{1} Σ_{h=0}^{1} p_{i,j,k,m,g,h} · log2(p_{i,j,k,m,g,h})    (10)

TABLE 2
Exclusive-Nor Training Dynamic Node Architecture History for DNAL Network Initiated as 2 × 1 × 1

                      Network Architecture
RMS Training Error    Input   Hidden   Output
0.597704              2       1        1
0.409520              2       1        1
0.429996              2       2        1
0.015243              2       2        1
0.019032              2       3        1
0.009998              2       3        1
0.420151              2       2        1
0.102109              2       2        1
0.107212              2       3        1
0.100095              2       3        1
0.105712              2       4        1
0.057716              2       4        1
0.408556              2       3        1
0.004912              2       3        1
0.363971              2       2        1
0.008031              2       2        1
0.408278              2       1        1
0.408270              2       1        1
0.428683              2       2        1
0.408259              2       2        1
0.429722              2       3        1
0.009979              2       3        1
0.381533              2       2        1
0.006383              2       2        1
0.699138              2       1        1
0.408263              2       1        1
0.428676              2       2        1
0.408258              2       2        1
0.429720              2       3        1
0.005394              2       3        1
0.415976              2       2        1
0.003867              2       2        1


TABLE 3
Exclusive-Nor Training Dynamic Node Architecture History for DNAL Network Initiated as 2 × 10 × 1

                      Network Architecture
RMS Training Error    Input   Hidden   Output
0.592550              2       10       1
0.118489              2       10       1
0.008947              2       9        1
0.004950              2       9        1
0.008973              2       8        1
0.005205              2       8        1
0.004093              2       7        1
0.003426              2       7        1
0.007887              2       6        1
0.004611              2       6        1
0.005000              2       5        1
0.004261              2       5        1
0.004388              2       4        1
0.003745              2       4        1
0.004395              2       3        1
0.004235              2       3        1
0.004187              2       2        1
0.004074              2       2        1
0.700369              2       1        1
0.408279              2       1        1
0.428693              2       2        1
0.408258              2       2        1
0.429721              2       3        1
0.007981              2       3        1
0.696395              2       2        1
0.014591              2       2        1
0.015320              2       3        1
0.014572              2       3        1
0.016348              2       4        1
0.009877              2       4        1
0.014603              2       3        1
0.014555              2       3        1
0.015282              2       4        1
0.014539              2       4        1
0.016312              2       5        1
0.009731              2       5        1
0.014557              2       4        1
0.014510              2       4        1
0.015236              2       5        1
0.014499              2       5        1
0.016299              2       6        1
0.009926              2       6        1
0.009650              2       5        1
0.009379              2       5        1
0.009979              2       4        1
0.009977              2       4        1
0.009929              2       3        1
0.009927              2       3        1
0.009997              2       2        1
0.009986              2       2        1

where p_{i,j,k,m,g,h} is the joint probability that both activation x_{i,j,g} and x_{k,m,h} occur. If x_{i,j} and x_{k,m} are independent in the training set, then

p_{i,j,k,m,g,h} = p_{i,j,g} · p_{k,m,h}    (11)

and therefore

H(x_{i,j}, x_{k,m}) = H(x_{i,j}) + H(x_{k,m}).    (12)


However, if x_{i,j} and x_{k,m} are completely interdependent,

H(x_{i,j}, x_{k,m}) = H(x_{i,j}) = H(x_{k,m}).    (13)

In order to determine the important information exhibited by a hidden node, we seek to know which bits of information presented by the hidden node's output are related to which bits of information present in the output vector of the training set. For this we invoke interdependency analysis as a measure of association based on entropy. If the hidden node output is x_{I,j} and the desired network output of interest is Ω_{I+1,k}, then the symmetric interdependency between x_{I,j} and Ω_{I+1,k} is

U(x_{I,j}, Ω_{I+1,k}) = 2 · [H(x_{I,j}) + H(Ω_{I+1,k}) − H(x_{I,j}, Ω_{I+1,k})] / [H(x_{I,j}) + H(Ω_{I+1,k})].    (14)

This relation can be seen to be at least a reasonable measure of dependency since if x_{I,j} and Ω_{I+1,k} are independent then U(x_{I,j}, Ω_{I+1,k}) is zero. If, on the other hand, x_{I,j} and Ω_{I+1,k} are completely interdependent, then U(x_{I,j}, Ω_{I+1,k}) is unity.
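Eqn (14) can be checked numerically. The sketch below estimates all entropies by counting over the exemplar set; it returns 0 for independent columns and 1 for identical ones, as the text requires (the zero-entropy guard for two constant columns is my own convention):

```python
import math

def entropy_of(columns):
    # Joint entropy of one or more activation columns, estimated by counting.
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((rows.count(r) / n) * math.log2(rows.count(r) / n)
                for r in set(rows))

def symmetric_u(x, omega):
    # Eqn (14): symmetric interdependency between a hidden node's
    # activations x and a desired output column omega.
    hx, ho = entropy_of([x]), entropy_of([omega])
    if hx + ho == 0.0:
        return 0.0  # two constant columns share no information
    return 2.0 * (hx + ho - entropy_of([x, omega])) / (hx + ho)
```

Identical columns give a joint entropy equal to either marginal, so the ratio is 1; independent columns give a joint entropy equal to the sum of the marginals, so the numerator vanishes.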

The redundancy of node (i, j) with respect to node (i, k) can be defined in a similar fashion as

R(x_{i,j}, x_{i,k}) = 2 · [H(x_{i,j}) + H(x_{i,k}) − H(x_{i,j}, x_{i,k})] / [H(x_{i,j}) + H(x_{i,k})].    (15)

This relation is only applied to nodes in the same hidden layer because all nodes in upper layers have obvious functional relationships to nodes below them.

The above discussion can be applied to continuous activation networks provided a large number of discrete bins are used to approximate the probabilities. In this case the probability of occurrence in the training set is modified such that if the activation of node (i, j) falls within some finite range or bin, then an event has taken place in that bin. Thus we can estimate the probability of an occurrence in bin g as the number of times that x_{i,j} falls between the minimum and maximum of bin g; thus

p_{i,j,g} = (# of occurrences of B_{g−1} < x_{i,j} ≤ B_g) / (total # of patterns in the training set).    (16)

TABLE 4
Comparison of Typical Recall Performance Results for Four ANNs Trained on Exclusive-Nor Problem

Network Architecture      Learning Mode   Training Set RMS Error   Recall Set RMS Error
2 × 1 × 1 → 2 × 2 × 1     DNAL            0.00803                  0.00846
2 × 10 × 1 → 2 × 2 × 1    DNAL            0.00407                  0.00408
2 × 2 × 1                 SAL             0.00959                  0.00981
2 × 4 × 1                 SAL             0.00998                  0.01033
2 × 5 × 1                 SAL             0.00774                  0.00775

The recall set contains the training set plus 196 exemplars with ±10% added uniform noise to the inputs.


TABLE 5
Eight Pattern Binary One-of-Eight Decoder Training Data Set

                 Inputs    Desired Outputs
Pattern Number   1 2 3     1 2 3 4 5 6 7 8
1                0 0 0     1 0 0 0 0 0 0 0
2                0 0 1     0 1 0 0 0 0 0 0
3                0 1 0     0 0 1 0 0 0 0 0
4                0 1 1     0 0 0 1 0 0 0 0
5                1 0 0     0 0 0 0 1 0 0 0
6                1 0 1     0 0 0 0 0 1 0 0
7                1 1 0     0 0 0 0 0 0 1 0
8                1 1 1     0 0 0 0 0 0 0 1

then, of course, the sums in eqs. (9) and (10) are correspondingly changed to include all of the bins. In effect, the probability density functions (pdfs) are approximated as stairstep functions. If we set the total number of information bins to 100, for example, then any variation in nodal activation above 0.01 (assuming a normalization on [0, 1]) will be scrutinized as possibly important information when the determination of interdependency, and therefore nodal importance, is made. However, it is crucial not to use too many information bins because random fluctuations, such as measurement or thermal noise in the training set, may be considered as important functional variations in the data itself.
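The stairstep probability estimate of eqn (16) amounts to a histogram. A sketch, assuming activations already normalized to [0, 1] (clamping an activation of exactly 1.0 into the top bin is my convention):

```python
def binned_probs(activations, n_bins=100):
    # Eqn (16): fraction of training patterns whose activation lands in bin g.
    counts = [0] * n_bins
    for a in activations:
        g = min(int(a * n_bins), n_bins - 1)  # put a == 1.0 in the last bin
        counts[g] += 1
    n = len(activations)
    return [c / n for c in counts]
```

As the text warns, `n_bins` trades resolution against noise sensitivity: too few bins hide real functional variation, too many treat noise as information.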

The importance of node (i, j) with respect to output (I + 1, k) can therefore be defined as

I(Ω_{I+1,k}, x_{i,j}) = U(Ω_{I+1,k}, x_{i,j}) − Σ_{m=1, m≠j}^{J(i)} R(x_{i,j}, x_{i,m}).    (17)

An estimate of the total importance of any hidden node can be obtained by summing the interdependence of that node with respect to all the desired output nodes minus the redundancy summed over all the other nodes in layer i. Thus, the total importance of node (i, j) is

I(x_{i,j}) = Σ_{k=1}^{J(I+1)} I(Ω_{I+1,k}, x_{i,j}) − Σ_{m=1, m≠j}^{J(i)} R(x_{i,j}, x_{i,m}).    (18)

The importance of a network layer can be similarly defined as the sum of the importance of each node in the layer of interest:

I(X_i) = Σ_{j=1}^{J(i)} I(x_{i,j}).    (19)
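Putting eqns (14), (15), and (18) together gives a testable importance score. The sketch below follows the prose reading of eqn (18) (interdependency summed over the outputs, minus the sibling redundancy counted once); the helper names are mine:

```python
import math

def _H(*cols):
    # Joint entropy of the given activation columns, by frequency counts.
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((rows.count(r) / n) * math.log2(rows.count(r) / n)
                for r in set(rows))

def _U(a, b):
    # Symmetric interdependency / redundancy, eqns (14)-(15).
    ha, hb = _H(a), _H(b)
    return 0.0 if ha + hb == 0.0 else 2.0 * (ha + hb - _H(a, b)) / (ha + hb)

def node_importance(j, hidden_acts, desired_outs):
    # Interdependency of hidden node j with every desired output, minus
    # redundancy with every sibling node in the same hidden layer.
    xj = hidden_acts[j]
    interdependency = sum(_U(xj, d) for d in desired_outs)
    redundancy = sum(_U(xj, hidden_acts[m])
                     for m in range(len(hidden_acts)) if m != j)
    return interdependency - redundancy
```

A node whose activations track a desired output scores high; a constant or duplicated node scores near zero or below, flagging it for deletion in Step 6.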

Information theoretic DNAL can be summarized as follows. Start the network with a few nodes in the hidden layer. Train this network until it reaches a learning plateau. A learning plateau is reached if the network performance error does not decrease appreciably with time. At this point the network has learned as much as is possible with only a few hidden nodes in the hidden layer, and it is necessary to add other hidden nodes if better performance is desired. Next, add a node to the hidden layer. The new node will have low random weights and therefore low importance. As the network continues to learn, the new node's weights will tend to increase from their very small initial values, and the new node will gain importance as it begins to affect the network output. Again, at some point the network will reach a learning plateau. And again, a node will be added to the hidden layer. This process of learning to a plateau and adding a node is repeated until the network learns the training set to the desired accuracy. At any one of these plateaus, the network is pruned of low-importance nodes, and thus learning time is reduced by eliminating unneeded nodes. Once the network has learned the mapping to the desired accuracy, the network is forced to eliminate a node in an attempt to learn and perform the mapping with fewer hidden nodes. However, eliminating one of these nodes may increase the network error. Therefore, if the performance degradation is large, we retrain this smaller network and repeat the process. The smallest network that performs acceptably is then employed.
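The plateau test that drives this grow/prune cycle can be made concrete. A sketch (the window size and tolerance are illustrative choices, not values from the paper):

```python
def on_plateau(error_history, window=50, tol=1e-4):
    # Learning has plateaued when the last `window` error checks produced
    # no appreciable improvement over the best error seen before them.
    if len(error_history) <= window:
        return False
    prior_best = min(error_history[:-window])
    return min(error_history[-window:]) > prior_best - tol
```

Feeding this the running RMS error gives the trigger for Step 5 (grow when stuck above the target error) and, together with the importance scores, for Step 6 (prune once the target is met).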

4. COMPUTER SIMULATION RESULTS

In this section SOSLA networks with DNAL are compared to SAL SOSLA networks. In these examples, a network error of 0.01 was arbitrarily used as the stop rule for training. Because SAL requires the predetermination of the network's size, a variety of static architectures are employed, including those defined by the relations of Upadhyaya and Eryurek (1992) and Hecht-Nielsen (1987, 1990).

4.1. Exclusive-Nor

A straightforward example for network learning is the exclusive-nor problem. The training data is shown in

TABLE 6
Eight Pattern Binary One-of-Eight Decoder Dynamic Node Architecture History for DNAL Network Initiated as 3 × 1 × 8 Network

                      Network Architecture
RMS Training Error    Input   Hidden   Output
0.320637              3       1        8
0.257262              3       1        8
0.257317              3       2        8
0.128103              3       2        8
0.128245              3       3        8
0.126879              3       3        8
0.126887              3       4        8
0.126867              3       4        8
0.126871              3       3        8
0.126856              3       3        8
0.127052              3       4        8
0.009995              3       4        8
0.042737              3       3        8
0.009999              3       3        8


TABLE 7
Eight Pattern Binary One-of-Eight Decoder Dynamic Node Architecture History for DNAL Network Initiated as 3 × 9 × 8 Network

                      Network Architecture
RMS Training Error    Input   Hidden   Output
0.312996              3       9        8
0.008587              3       9        8
0.007959              3       8        8
0.126211              3       7        8
0.009230              3       7        8
0.008197              3       6        8
0.007327              3       5        8
0.190040              3       4        8
0.009674              3       4        8
0.137967              3       3        8
0.010000              3       3        8

Table 1. Direct analysis shows that only two hidden nodes are required to provide the desired mapping. We find the recommended number of hidden units for the SAL networks for this problem by using the relation of Upadhyaya and Eryurek (1992), 2·log2(4) = 4, and Hecht-Nielsen (1987, 1990), 2·2 + 1 = 5. Therefore, SAL networks with two, four, or five hidden nodes, two input nodes, and one output node (2 × 2 × 1, 2 × 4 × 1, or 2 × 5 × 1) are used for comparisons to the DNAL network. Two DNAL networks were initialized with 2 × 1 × 1 and 2 × 10 × 1 architectures. Although the 2 × 5 × 1 and 2 × 4 × 1 SAL networks had no problems learning the desired mapping, they are not the smallest architecture capable of doing so. On the other hand, the 2 × 2 × 1 network required significantly more computing time to converge than any of the other networks, including the DNAL networks. Tables 2 and 3 show the network architecture training history for the DNAL ANNs. These tables show the RMS training error obtained by the network for the architectures listed. The target cost is 0.01; therefore, if a DNAL network learns the training set to an error of this value or less, it is assumed to have learned the desired task, and a node is eliminated from the hidden layer.
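The claim that two hidden nodes suffice can be verified directly with hand-picked weights. The particular weights below are mine, chosen to saturate the arctangent nodes; the weights a trained DNAL network finds would differ:

```python
import math

def act(s):
    # The paper's arctangent sigmoid, eqn (4).
    return math.atan(s) / math.pi + 0.5

def xnor_net(x1, x2):
    # Hidden node 1 approximates AND, hidden node 2 approximates NOR;
    # the output node ORs them, yielding exclusive-nor.
    h1 = act(20 * x1 + 20 * x2 - 30)
    h2 = act(-20 * x1 - 20 * x2 + 10)
    return act(30 * h1 + 30 * h2 - 15)
```

Rounding the outputs over the four patterns of Table 1 reproduces the desired responses 1, 0, 0, 1.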

Several interesting points are illustrated in Table 2. The first DNAL network, which was initiated as a 2 × 1 × 1 network, first requires three hidden nodes to learn the desired input-output relation. A node is then eliminated, and the desired relation is not learned with the smaller, two-hidden-node architecture before a learning plateau is reached. A node is then added to yield a 2 × 3 × 1 network, and, curiously, this configuration is also unable to learn the desired mapping before a learning plateau is encountered. Again, a node is added to the hidden layer, yielding a 2 × 4 × 1 network. This configuration learns to a lower error plateau but does not attain the desired accuracy. At this plateau, the newly added fourth node is eliminated due to its extremely low importance, and the remaining three-hidden-node network then successfully learns the desired mapping. Again, a node is eliminated, yielding a 2 × 2 × 1 network, and this network is able to perform the desired relation to the desired accuracy. Thus, the DNAL approach finds the theoretical minimum number of hidden nodes required to perform the desired mapping. Next, the network eliminates one of the remaining two hidden nodes and is unable to learn the training set, so another node is added. Again, two hidden nodes are unable to perform the mapping before a plateau is reached, and a third hidden node is added. A third node is then repeatedly added and taken away as the architecture oscillates between two and three hidden nodes. Thus, the network exhibits the typical oscillatory behavior of the DNAL approach. This oscillation only occurs once a low-error plateau has been established, and it is a good indication that a reasonable ANN solution has been obtained. At this point the training process can be stopped, and the smallest architecture with the lowest error is selected as the recall network. Note that the increase in error when adding a node is due to the small but finite importance of the new node, which has small random weights when added.
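The grow-and-prune cycle described above can be summarized as an outer loop around any trainer. The sketch below is a reconstruction from this description, not the paper's implementation: train_to_plateau and importance are hypothetical stand-ins for the plateau detection and the information-theoretic node importance defined earlier, and the threshold values are illustrative.

```python
def dnal(train_to_plateau, importance, hidden=1, target=0.01,
         min_importance=1e-3, max_cycles=20):
    """Sketch of the DNAL outer loop: prune on success, grow on failure.

    train_to_plateau(hidden) -> RMS error reached at the learning plateau.
    importance(hidden) -> per-node importance values for that network.
    Returns the smallest hidden-layer size that learned the task.
    """
    best = None
    for _ in range(max_cycles):
        error = train_to_plateau(hidden)
        if error <= target:
            # Task learned: record this size, then try one node fewer.
            if best is None or hidden < best:
                best = hidden
            hidden -= 1
        elif importance(hidden) and min(importance(hidden)) < min_importance:
            # Plateau with a nearly useless node present: prune it.
            hidden -= 1
        else:
            # Plateau without learning: add a hidden node and retrain.
            hidden += 1
        if hidden == 0:  # cannot go below one hidden node
            break
    return best
```

With a task that needs at least two hidden nodes, the loop settles into the two-or-three-node oscillation described in the text and returns 2.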

Exploring the DNAL network's size reduction, or pruning, abilities is also of interest. Table 3 gives the learning history of a DNAL network that was initiated with 10 hidden nodes. The network quickly eliminates one of the hidden nodes without reaching the desired RMS error because this node has very low importance. With nine hidden units, the network learns the mapping to below the target of 0.01 RMS error, and a low-importance node is eliminated. After the elimination of this node, the network is still able to recall the training set to better than the desired accuracy, but the algorithm only checks the output error periodically; it continues to learn in this configuration. Once the network discovers that it has an output error below the desired threshold, another node is eliminated. This is repeated until the network has only one hidden node and cannot learn the training set.

TABLE 8
Comparison of Typical Recall Performance Results for Four ANNs Trained on the Eight-Pattern Binary One-of-Eight Decoder Problem

Network Architecture     Training Method   Training Error (RMS)   Recall Error (RMS)
3 × 1 × 8 → 3 × 3 × 8    DNAL              0.01000                0.07348
3 × 9 × 8 → 3 × 3 × 8    DNAL              0.01000                0.12317
3 × 7 × 8                SAL               0.00923                0.19761
3 × 9 × 8                SAL               0.00970                0.20671

The recall set contains the training set plus 192 exemplars with ±10% added uniform noise to the inputs.


[Figure: normalized population versus time step (0 to 250).]
FIGURE 2. Verhulst logistic function model of an animal population dynamic as a function of time step. The growth parameter is 2.7 and the initial condition is 0.0001.

Table 4 lists the exclusive-NOR training results for the DNAL and SAL learning approaches. The recall set used includes the training set and 49 noisy repetitions of the training set. Each repetition of the original set includes uniform pseudorandom noise, at plus or minus 10% of full scale, added to the inputs. The desired outputs remain the same. Thus, a total of 200 patterns is used in the recall set. The noise tolerance of each of the ANNs is quite good regardless of architecture.
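A recall set of this form is straightforward to construct. The sketch below is an illustration, assuming inputs normalized to a full scale of 1.0; the function name and seed are mine.

```python
import random

def make_noisy_recall_set(train_in, train_out, reps=49, noise=0.10,
                          full_scale=1.0, seed=0):
    """Recall set = training set plus `reps` noisy copies of it.

    Uniform pseudorandom noise of +/- noise * full_scale is added
    to the inputs only; the desired outputs are left unchanged.
    """
    rng = random.Random(seed)
    recall_in = [list(x) for x in train_in]
    recall_out = [list(y) for y in train_out]
    for _ in range(reps):
        for x, y in zip(train_in, train_out):
            recall_in.append([v + rng.uniform(-noise, noise) * full_scale
                              for v in x])
            recall_out.append(list(y))
    return recall_in, recall_out

# Exclusive-NOR: 4 patterns -> 4 * (1 + 49) = 200 recall patterns.
xnor_in = [[0, 0], [0, 1], [1, 0], [1, 1]]
xnor_out = [[1], [0], [0], [1]]
recall_in, recall_out = make_noisy_recall_set(xnor_in, xnor_out)
```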

4.2. Binary One-of-Eight Decoder

This is a simple problem used in discussions of backpropagation learning (Eaton & Olivier, 1992). The training set is shown in Table 5. Two DNAL and two SAL ANNs are trained in this section for comparison and discussion. The DNAL networks are initiated as 3 × 1 × 8 and 3 × 9 × 8 networks. The SAL networks use 3 · log2(8) = 9 and 2 · 3 + 1 = 7 nodes in their hidden layers. Tables 6 and 7 show the dynamic node architecture training history for the DNAL ANNs. Note that the DNAL approach obtains a hidden layer size of three nodes for both of the initial conditions. The resultant network is significantly smaller than the size predicted by either Upadhyaya and Eryurek (1992) or Hecht-Nielsen (1987, 1990). Note that Eaton and Olivier (1992) use seven hidden nodes without explanation in their work.
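The decoder training set itself is easy to regenerate: each of the eight 3-bit inputs selects one of eight one-hot outputs. The sketch below is a reconstruction; the bit ordering within the input is an assumption, since Table 5 is not reproduced here.

```python
def one_of_eight_training_set():
    """Eight patterns: 3-bit binary input -> one-of-eight (one-hot) output."""
    patterns = []
    for k in range(8):
        bits = [(k >> 2) & 1, (k >> 1) & 1, k & 1]        # 3-bit input
        one_hot = [1 if j == k else 0 for j in range(8)]  # decoder output
        patterns.append((bits, one_hot))
    return patterns
```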

Typical recall performance for these networks is given in Table 8. Results include the recall of a noisy data set. This noisy data set contains 25 repetitions of the training set with added uniform input noise at plus or minus 10%. Although generalization and noise tolerance depend on many factors, such as the number of hidden nodes, the magnitudes of the individual weights, and the training scheme, the smaller networks tend to perform better when recalled on the noisy data set. In any case, the smaller networks provide a faster and more compact model.

4.3. Chaotic Population Dynamics

Many phenomena, including certain animal populations, exhibit chaotic behavior. In spite of the

TABLE 9
Chaotic Time Series Dynamic Node Architecture History for DNAL Network Initiated as 10 × 1 × 1 Network

RMS Training Error   Input   Hidden   Output
0.794111             10      1        1
0.057121             10      2        1
0.072739             10      3        1
0.010336             10      3        1
0.012919             10      4        1
0.010000             10      4        1


[Figure: normalized population versus time step (0 to 250); predicted and actual curves.]
FIGURE 3. Results of the 10-point chaotic time series prediction. The first 100 time steps are the training set.

[Figure: normalized population versus time step (0 to 250); predicted and actual curves.]
FIGURE 4. Results of the 10-point chaotic time series prediction with 1% added uniform iterated noise. None of these patterns were included in the training set.


TABLE 10
Comparison of Recall Performance Results for Chaotic Time Series Prediction DNAL ANN

Network Architecture      Training Method   Training Error (RMS)   Recall Error (RMS)   Noisy Recall Error (RMS)
10 × 1 × 1 → 10 × 4 × 1   DNAL              0.01000                0.02840              0.04709

Noise data includes ±1% uniform pseudorandom iterated noise. The recall and noisy recall data sets contain 250 patterns.

difficulty inherent in modeling systems of this kind, understanding or predicting their behavior is often necessary for analysis or strategic planning. Many models can be used to predict animal population dynamics (Scudo & Ziegler, 1978). One of the more widely known is the simple Verhulst logistic function (Peitgen & Richter, 1986),

y_{t+1} = f(y_t) = y_t + r(1 - y_t)y_t    (20)

where y_t is the normalized population of a group of animals at time step t, r is the growth parameter such that the growth rate is r(1 - y_t), and 1 is the normalized stable population. Equation (20) exhibits chaotic dynamics for values of the growth parameter above 2.570. The process therefore exhibits sensitive dependence on initial conditions and is aperiodic (Devaney, 1987).
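Iterating eq. (20) directly reproduces the kind of series used in this example. A minimal sketch, using r = 2.7 and y_0 = 0.0001 as in the experiment described below:

```python
def verhulst_series(r=2.7, y0=0.0001, steps=250):
    """Iterate eq. (20): y_{t+1} = y_t + r * (1 - y_t) * y_t."""
    series = [y0]
    for _ in range(steps):
        y = series[-1]
        series.append(y + r * (1.0 - y) * y)
    return series

series = verhulst_series()
```

For r = 2.7 the iterates remain bounded (the map never exceeds (1 + r)^2 / 4r, about 1.27) while wandering aperiodically, which is the behavior plotted in Figure 2.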

Training ANNs to predict the logistic function involves the following. As input, the ANN is given the system's population at the 10 previous time steps. The ANN is then trained to output the value of the population at the next time step (Bartlett, 1992). Thus,

x_t = (y_{t-1}, y_{t-2}, y_{t-3}, ..., y_{t-10})    (21)

is the input vector, and

x_{t+1} = y_t    (22)

is the output. The mapping desired is

y_t = M(y_{t-1}, y_{t-2}, y_{t-3}, ..., y_{t-10})    (23)

for all t. The training set consisted of 100 10-input, one-output patterns derived from eq. (20), with growth parameter r = 2.7 and initial condition y_0 = 0.0001. Figure 2 shows the time series that results from this choice of initial condition and growth parameter. The training set is chosen in sequence starting at t = 11, such that in each case the network is, in effect, trained to predict the value of the population at time t given only the tapped, time-delayed values of the population at times t - 1, t - 2, t - 3, t - 4, ..., t - 10.
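The construction of the tapped-delay training patterns can be sketched as follows; the ordering of the taps within the input vector (most recent value first, matching eq. (21)) is an assumption.

```python
def tapped_delay_patterns(series, delays=10, n_patterns=100):
    """Build (input, target) pairs: predict y_t from y_{t-1} ... y_{t-10}.

    With 0-based indexing the first target is series[delays], which
    corresponds to starting at t = 11 in the paper's 1-based notation.
    """
    patterns = []
    for t in range(delays, delays + n_patterns):
        window = series[t - delays:t]               # y_{t-10} ... y_{t-1}
        patterns.append((window[::-1], series[t]))  # most recent tap first
    return patterns
```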

A DNAL ANN was initiated as a 10 × 1 × 1 network for training on the data shown in Figure 2. For comparison, SAL networks would require either 10 · log2(100) ≈ 67 or 2 · 10 + 1 = 21 hidden nodes (Upadhyaya & Eryurek, 1992; Hecht-Nielsen, 1990). Neither of these architectures was attempted. Table 9 contains the dynamic node architecture history for the DNAL network. Figure 3 shows the recall performance of this network with no added noise. The first 100 of these patterns constitute the training set. Figure 4 shows the effect of 1% added iterated uniform noise on the time series, as well as the performance of the 10 × 4 × 1 DNAL ANN. Notice that because this series is chaotic, 1% iterated noise significantly changes the character of the series. Table 10 summarizes the ANN performance.
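This sensitivity is easy to demonstrate. The sketch below perturbs only the initial condition by 1%, rather than injecting noise at every step as in Figure 4, which keeps the demonstration deterministic; the two trajectories nevertheless diverge to order-one separation well within 250 steps.

```python
def iterate(y, r=2.7):
    # One step of eq. (20): y_{t+1} = y_t + r * (1 - y_t) * y_t
    return y + r * (1.0 - y) * y

def trajectory(y0, steps=250, r=2.7):
    series = [y0]
    for _ in range(steps):
        series.append(iterate(series[-1], r))
    return series

clean = trajectory(0.0001)
perturbed = trajectory(0.0001 * 1.01)  # initial condition changed by 1%
divergence = [abs(a - b) for a, b in zip(clean, perturbed)]
```

The initial separation of 10^-6 is amplified by the chaotic dynamics until the two series bear no pointwise resemblance, which is why an ANN can only be judged on short-horizon prediction for this task.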

5. CONCLUSIONS

A dynamic node architecture scheme is presented that varies the number of hidden nodes in a layered feedforward neural network during training. This method obtains a relative minimum number of nodes and interconnections in the network consistent with the learning objective. Results show that the method can obtain an appropriate network architecture for a given training task in a systematic way. The method therefore eliminates the need to use empirical rules of thumb to select network architectures before training, or pruning techniques to reduce network size post-training. Factors that affect the final network size include the depth of training desired, the length of time required to determine that a learning plateau has been reached, and the minimum allowed importance that a node must have before it is eliminated, all of which are set by the user. It should also be stated that this method is not guaranteed to give a better generalizing network, but it does give a usable minimal architecture that generalizes quite well and, in most cases, performs better than empirically sized networks.

REFERENCES

Ash, T. (1989). Dynamic node creation in backpropagation networks. IJCNN International Conference on Neural Networks, 2, 623. New York: IEEE.

Bartlett, E. B. (1990). Nuclear power plant status diagnostics using simulated condensation: An auto-adaptive computer learning technique. Ph.D. thesis, University of Tennessee, Knoxville, TN.

Bartlett, E. B. (1991). Chaotic time series prediction using artificial neural networks. Proceedings of the 2nd Government Neural Network Applications Workshop, Huntsville, AL.

Bartlett, E. B. (1992). Analysis of chaotic population dynamics using artificial neural networks. Chaos, Solitons and Fractals: Applications in Science and Engineering, 2(5), 413-421.

Bartlett, E. B., & Basu, A. (1991). A dynamic node architecture scheme for backpropagation neural networks. In C. H. Dagli et al. (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks (pp. 101-106). New York: ASME Press.

Bartlett, E. B., & Uhrig, R. E. (1991a). Nuclear power plant status diagnostics using artificial neural networks. Proceedings of the American Nuclear Society Meeting on Frontiers in Innovative Computing for the Nuclear Industry, 644-653, Jackson Lake, WY.

Bartlett, E. B., & Uhrig, R. E. (1991b). A self-optimizing stochastic dynamic node learning algorithm for layered neural networks. IJCNN International Joint Conference on Neural Networks, 2, A-947. New York: IEEE.

Bartlett, E. B., & Uhrig, R. E. (1992a). Nuclear power plant status diagnostics using an artificial neural network. Nuclear Technology, 97, 272-281.

Bartlett, E. B., & Uhrig, R. E. (1992b). A self-optimizing stochastic dynamic node learning algorithm for layered neural networks. Proceedings of WNN-AIND-92, 79-84. Auburn, AL: Auburn University.

Bornholdt, S., & Graudenz, D. (1992). General asymmetric neural networks and structure design by genetic algorithms. Neural Networks, 5(2), 327-334.

Cotter, N. E. (1990). The Stone-Weierstrass theorem and its application to neural networks. IEEE Transactions on Neural Networks, 1(4), 290-295.

Devaney, R. L. (1987). An Introduction to Chaotic Dynamical Systems. Redwood City, CA: Addison-Wesley.

Eaton, H. A. C., & Olivier, T. L. (1992). Learning coefficient dependence on training set size. Neural Networks, 5(2), 283-288.

Gallant, S. I. (1990). A connectionist learning algorithm with provable generalization and scaling bounds. Neural Networks, 3(2), 191-201.

Hecht-Nielsen, R. (1987). Kolmogorov's mapping neural network existence theorem. IJCNN International Conference on Neural Networks, 2, 11-14. New York: IEEE.

Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. IJCNN International Conference on Neural Networks, 1, 593-605. New York: IEEE.

Hecht-Nielsen, R. (1990). Neurocomputing. Reading, MA: Addison-Wesley.

Hirose, Y., Yamashita, K., & Hijiya, S. (1991). Back-propagation algorithm which varies the number of hidden units. Neural Networks, 4, 61-66.

Hyvarinen, L. P. (1970). Information Theory for Systems Engineers. New York: Springer-Verlag.

Ishikawa, M. (1989). A structural learning algorithm with forgetting of link weights. IJCNN International Conference on Neural Networks, 2, 626. New York: IEEE.

Judd, J. S. (1990). Neural Network Design and the Complexity of Learning. Cambridge, MA: MIT Press.

Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2), 239-242.

Kirkpatrick, S., Gelatt, C. D., Jr., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671-680.

Kolmogorov, A. N. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition (in Russian). Dokl. Akad. Nauk SSSR, 114, 953-956.

Kruschke, J. K. (1989). Improving generalization in back-propagation networks with distributed bottlenecks. IJCNN International Conference on Neural Networks, 1, 443-447. New York: IEEE.

Kullback, S. (1959). Information Theory and Statistics. New York: John Wiley & Sons.

Li, K. C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. The Annals of Statistics, 13(4), 1352-1377.

Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE Acoustics, Speech and Signal Processing Magazine, 4-22.

McInerney, J. M., Haines, K. G., Biafore, S., & Hecht-Nielsen, R. (1989). Back propagation error surfaces can have local minima. IJCNN International Conference on Neural Networks, 2, 627. New York: IEEE.

Peitgen, H. O., & Richter, P. H. (1986). The Beauty of Fractals: Images of Complex Dynamical Systems. New York: Springer-Verlag.

Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical Recipes: The Art of Scientific Computing. New York: Cambridge University Press.

Ripley, B. D. (1987). Stochastic Simulation. New York: John Wiley & Sons.

Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. 1 and 2. Cambridge, MA: MIT Press.

Scudo, F. M., & Ziegler, J. R. (1978). The Golden Age of Theoretical Ecology 1923-1940: A Collection of Works by V. Volterra, V. A. Kostitzin, A. J. Lotka, and A. N. Kolmogoroff. Vol. 22 of Lecture Notes in Biomathematics. New York: Springer-Verlag.

Shannon, C. E., & Weaver, W. (1971). The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press.

Sietsma, J., & Dow, R. J. F. (1988). Neural net pruning: Why and how. IEEE International Conference on Neural Networks, 1, 325-332, San Diego, CA.

Stevens, P. N. (1984). Monte Carlo Analysis. University of Tennessee at Knoxville. Seminar notebook for the Nineteenth Annual Tennessee Industries Week.

Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika, 64(1), 29-35.

Tishby, N., & Levin, E. (1989). Consistent inference of probabilities in layered networks: Predictions and generalization. IJCNN International Conference on Neural Networks, 2, 403-409. New York: IEEE.

Upadhyaya, B. R., & Eryurek, E. (1992). Application of neural networks for sensor validation and plant monitoring. Nuclear Technology, 97, 170-176.

Vaario, J., & Ohsuga, S. (1991). Adaptive neural architectures through growth control. In C. H. Dagli et al. (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks (pp. 11-16). New York: ASME Press.

Watanabe, S. (1969). Knowing and Guessing: A Quantitative Study of Inference and Information. New York: John Wiley & Sons.

Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight elimination applied to currency exchange rate prediction. IJCNN International Conference on Neural Networks, 1, 837-841. New York: IEEE.

Werbos, P. J. (1989). Backpropagation and neural control: A review and prospectus. IJCNN International Conference on Neural Networks, 1, 209-216. New York: IEEE.

Widrow, B., & Lehr, M. (1990). 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78, 1415-1441.

Wolpert, D. H. (1990). A mathematical theory of generalization: Parts 1 and 2. Complex Systems, 4, 151-249.

Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241-259.

Won, Y., & Pimmel, R. L. (1991). A comparison of connection pruning algorithms with backpropagation. In C. H. Dagli et al. (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks (pp. 113-119). New York: ASME Press.

NOMENCLATURE

B_g                      maximum value for information bin g
C(W)                     network cost function
H(x_{i,j})               information exhibited by node (i, j)
H(x_{i,j}, x_{k,m})      information exhibited by both node (i, j) and node (k, m)
I(λ_i)                   importance of layer i
I(λ_{i,j})               importance of node (i, j)
J(i)                     number of nodes in layer i
k                        index of nodes in layer i - 1
M                        network input-output mapping
N                        number of exemplars in the training set
p_{i,j,g}                probability of a particular x_{i,j} being in bin g
p_{i,j,k,m,g,h}          joint probability of x_{i,j} being in bin g and x_{k,m} being in bin h
R(x_{i,j}, x_{i,k})      redundancy measure between node (i, j) and node (i, k)
U(x_{i,j}, β_{i+1,k})    interdependency measure between node (i, j) and output k
x_i                      vector of nodal outputs in layer i
x_{i,n}                  vector of nodal outputs in layer i from exemplar n
x_{i,j}                  output of node (i, j)
x_{i,j,n}                output of node (i, j) from exemplar n
x_i^D                    vector of desired nodal outputs in layer i
y_t                      chaotic population at time t
Z_i                      vector of desired network mappings to nodes in layer i
(·)_t                    (·) at time t
log2(·)                  base 2 logarithm
π                        the value pi
Σ                        summation operator
Ξ                        vector of training set inputs
Ω                        vector of training set outputs
ω_j                      training set outputs for each output node j