
Neural Networks, Vol. 6, pp. 79-97, 1993 0893-6080/93 $6.00 + .00 Printed in the USA. All rights reserved. Copyright © 1993 Pergamon Press Ltd.

ORIGINAL CONTRIBUTION

An Analytical Framework for Optimizing Neural Networks

ANDREW H. GEE, SREERAM V. B. AIYER, AND RICHARD W. PRAGER

Cambridge University Engineering Department

(Received 13 September 1991; accepted 2 June 1992)

Abstract--There has recently been much research interest in the use of feedback neural networks to solve combinatorial optimization problems. Although initial results were disappointing, it has since been demonstrated how modified network dynamics and better problem mapping can greatly improve the solution quality. The aim of this paper is to build on this progress by presenting a new analytical framework in which problem mappings can be evaluated without recourse to purely experimental means. A linearized analysis of the Hopfield network's dynamics forms the main theory of the paper, followed by a series of experiments in which some problem mappings are investigated in the context of these dynamics. The experimental results are seen to be compatible with the linearized theory, and observed weaknesses in the mappings are fully explained within the framework. What emerges is a largely analytical technique for evaluating candidate problem mappings, without recourse to the more usual trial and error.

Keywords--Hopfield network, Neurodynamics, Eigenvector analysis, Combinatorial optimization, Problem mappings, Valid subspace, Hamilton paths, Point matching.

1. INTRODUCTION

Ever since the Hopfield network was first proposed as a means of approximately solving NP-complete combinatorial optimization problems (Hopfield & Tank, 1985), there has been a considerable research effort to improve the network's performance, which at first was fairly disappointing (Kamgar-Parsi & Kamgar-Parsi, 1990; Wilson & Pawley, 1988). Most of this effort has been directed towards improving the reliability with which the network finds valid solutions, while at the same time developing various annealing schedules to facilitate local minimum escape and force eventual convergence to a hypercube corner. Perhaps the most successful modification has been the mean field annealing (MFA) algorithm with neuron normalization (Peterson & Söderberg, 1989; Van den Bout & Miller III, 1989), which has proved to be very successful with the classic benchmark traveling salesman and graph partitioning problems (Peterson & Anderson, 1988; Peterson & Söderberg, 1989; Van den Bout & Miller III, 1990), as well as with some more practical scheduling problems (Gislén, Peterson, & Söderberg, 1989). The considerable improvement in performance is at the expense of analogue hardware implementability: Unlike the original Hopfield network (Hopfield, 1984), the MFA algorithm relies on a recursive update rule as well as the neuron normalization division operation, ruling out any simple analogue circuit implementation. However, the MFA algorithm is still highly parallel and could greatly benefit from implementation in parallel digital hardware; even when run on conventional serial machines, it remains highly competitive with the other contending optimization techniques (Peterson & Söderberg, 1989; Van den Bout & Miller III, 1990).

S. V. B. Aiyer is currently with the Laboratoire de Recherche en Informatique, Université de Paris Sud, 91405 Orsay, France.

A. H. Gee and S. V. B. Aiyer were funded by the Science and Engineering Research Council of Great Britain while preparing this study.

Requests for reprints should be sent to A. H. Gee, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England.

In addition, some researchers have persevered with the original network dynamics, demonstrating that better problem mapping can greatly improve the overall quality of solutions. One key step towards this improvement lay in the realization that, for a broad class of problems, all the points in the network's output space corresponding to valid solutions lie on a particular affine subspace (Aiyer, Niranjan, & Fallside, 1990). Using this fact, the validity constraints can be grouped together into a single penalty term which no longer frustrates the optimization objective while the network state vector lies on this "valid subspace." The new mapping has been applied with considerable success to the traveling salesman problem (Aiyer et al., 1990; Aiyer, 1991), a point matching problem in pattern recognition (Gee, Aiyer, & Prager, 1991b) and a novel implementation of the Viterbi algorithm for Hidden Markov Models (Aiyer & Fallside, 1991). More recently, the mapping technique has been extended to cover all quadratic 0-1 programming problems with linear equality and inequality constraints (Gee & Prager, 1992).

In this paper, we use the structure of the valid subspace to shed new light on the relationship between the network's dynamics and the problem mapping. In Section 2 we introduce the matrix notation relevant to the rest of the paper. In particular, we make extensive use of Kronecker (or tensor) products (Graham, 1981), which simplify the mapping mathematics considerably. In Section 3 we review the mathematical background of the subject in hand, introducing the form of the valid subspace along the way. Section 4 contains a linearized analysis of the network's dynamics in the context of the valid subspace. We identify a matrix A, the eigenvectors of which are most influential in the convergence of the network's state vector towards a hypercube corner. We also identify a vector b which is solely responsible for the initial direction of the state vector. In Sections 5 and 6 we test this theory against two different classes of optimization problem, namely the Euclidean Hamilton path problem and the point matching problem. For each case we associate b with a linear optimization problem, which we call the "auxiliary linear problem": the relationship between the auxiliary linear problem and the parent optimization problem has considerable bearing on the eventual quality of solutions obtained. In both cases, the network's behavior is found to be as predicted by the theory, and observed shortcomings of the problem mappings are fully explained within the theoretical framework. Section 6 also describes an illustrative experiment in invariant pattern recognition which demonstrates the success of the Hopfield network in solving the point matching problem. Finally, Section 7 presents a brief discussion of the results of the paper, as well as some general conclusions. The appendices contain ancillary information and detailed derivations of relevance to the central material of the paper.

2. NOTATION AND DEFINITIONS

2.1. Kronecker Product Notation and Matrix Identities

Let A^T denote the transpose of A. Let [A]_{ij} refer to the element in the i-th row and j-th column of the matrix A. Similarly, let [a]_i refer to the i-th element of the vector a. Sometimes, where the above notation would appear clumsy, and there is no danger of ambiguity, the same elements will be alternatively denoted A_{ij} and a_i. Let the modulus of an n × m matrix A be defined as follows:

‖A‖² = Σ_{i=1}^{n} Σ_{j=1}^{m} A_{ij}² = trace(A^T A).   (1)

Let A ⊗ B denote the Kronecker product of two matrices. If A is an n × n matrix, and B is an m × m matrix, then A ⊗ B is an nm × nm matrix given by:

A ⊗ B = [ A_{11}B   A_{12}B   ...   A_{1n}B ]
        [ A_{21}B   A_{22}B   ...   A_{2n}B ]
        [   ...       ...             ...   ]
        [ A_{n1}B   A_{n2}B   ...   A_{nn}B ]   (2)

Let vec(A) be the function which maps the n × m matrix A onto the nm-element vector a. This function is defined by:

a = vec(A) = [A_{11}, A_{21}, ..., A_{n1}, A_{12}, A_{22}, ..., A_{n2}, ..., A_{1m}, ..., A_{nm}]^T.   (3)

Throughout this paper we make use of the following identities (see Graham (1981) for proofs):

trace( AB ) = trace(BA) (4)

(A ⊗ B)^T = A^T ⊗ B^T   (5)

(A ⊗ B)(X ⊗ Y) = (AX ⊗ BY)   (6)

trace(AB) = [vec(A^T)]^T vec(B)   (for A and B both n × n)   (7)

vec(AYB) = (B^T ⊗ A) vec(Y).   (8)

If w and x are n-element column vectors, and h and g are m-element column vectors, then

(w ⊗ h)^T (x ⊗ g) = (w^T x)(h^T g).   (9)

If the n × n matrix A has eigenvectors {w_1, ..., w_n} with corresponding eigenvalues {λ_1, ..., λ_n}, and the m × m matrix B has eigenvectors {h_1, ..., h_m} with corresponding eigenvalues {μ_1, ..., μ_m}, then the nm × nm matrix A ⊗ B has eigenvectors {w_i ⊗ h_j} with corresponding eigenvalues {λ_i μ_j} (i ∈ {1, ..., n}, j ∈ {1, ..., m}).
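As a concrete illustration of these identities (our addition, not part of the original paper), the following NumPy sketch checks identity (8) and the Kronecker eigenvalue property for small random matrices. Note that vec here stacks columns, as in eqn (3), which corresponds to column-major ('F') ordering in NumPy.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 4, 3
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((m, m))
    Y = rng.standard_normal((n, m))

    def vec(M):
        # Stack the columns of M into a single vector, as in eqn (3).
        return M.flatten(order='F')

    # Identity (8): vec(A Y B) = (B^T kron A) vec(Y)
    assert np.allclose(vec(A @ Y @ B), np.kron(B.T, A) @ vec(Y))

    # Eigenvalue property: the eigenvalues of A kron B are the products lambda_i * mu_j
    products = np.sort_complex(np.outer(np.linalg.eigvals(A), np.linalg.eigvals(B)).ravel())
    assert np.allclose(products, np.sort_complex(np.linalg.eigvals(np.kron(A, B))))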

2.2. Other Notation and Definitions

Let I^n be the n × n identity matrix. Let o^n be the n-element column vector of ones:

o_i^n = 1,   i ∈ {1, ..., n}.   (10)

Let O^n be the n × n matrix of ones:

O_{ij}^n = 1,   i, j ∈ {1, ..., n}.   (11)

Let δ_{ij} be the Kronecker impulse function:

δ_{ij} = 1 if i = j, and 0 if i ≠ j.   (12)

Let R^n be the n × n matrix given by

R^n = I^n − (1/n) O^n.   (13)

Multiplication by R^n has the effect of setting the column sums of a matrix to zero:


[R"a]~ = a - O " a = o "r a - - O " a i = l i = 1 i ?1

= o . r a _ 1 ( o . r o % . r ) a = o . r a _ o . r a = 0. n

(14)

Another way of considering R" is as a projection matrix which removes the o" component from any vector it premultiplies, since

( i ) , R % " = I . . . . O ~ o " = o ~ - - n o ' = 0 . ( 1 5 )

l1 t l
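As a quick numerical check of these two properties (our addition), the sketch below builds R^n for a small n and verifies eqns (14) and (15):

    import numpy as np

    n = 6
    R = np.eye(n) - np.ones((n, n)) / n      # R^n = I^n - (1/n) O^n, eqn (13)
    o = np.ones(n)                           # o^n
    a = np.random.default_rng(1).standard_normal(n)

    print(np.sum(R @ a))                     # ~0: the elements of R^n a sum to zero, eqn (14)
    print(R @ o)                             # ~0 vector: the o^n component is removed, eqn (15)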

3. BACKGROUND AND INTRODUCTION TO THE VALID SUBSPACE

A schematic diagram of the continuous Hopfield network (Hopfield, 1984) is shown in Figure 1. Neuron i has input u_i, output v_i, and is connected to neuron j with a weight T_{ij}. Associated with each neuron is also an input bias term i_i^b. The dynamics of the network are governed by the following equations:

u̇ = −(1/τ) u + T v + i^b   (16)

v_i = g(u_i).   (17)

v_i is a continuous variable in the interval 0 to 1, and g(u_i) is a monotonically increasing function which constrains v_i to this interval, usually a hyperbolic tangent of the form

g(u_i) = 1 / (1 + exp(−βu_i)).   (18)

The network has a Liapunov function (Hopfield, 1984)

E = −(1/2) v^T T v − v^T i^b + (1/τ) Σ_i ∫_0^{v_i} g^{-1}(v) dv.   (19)

The network was subsequently proposed as a means of solving combinatorial optimization problems which can somehow be expressed as the constrained minimization of

E^{op} = −(1/2) v^T T^{op} v − v^T i^{op},   v_i ∈ {0, 1}.   (20)

The idea is that the network's Liapunov function, invariably with τ → ∞, is associated with the cost function to be minimized in the combinatorial optimization problem. The network is then run and allowed to converge to a hypercube corner, which is subsequently interpreted as the solution of the problem. In Hopfield and Tank (1985) it is demonstrated how the network output can be used to represent a solution to the traveling salesman problem, and how the interconnection weights and input biases can be programmed appropriately for that problem: This process has subsequently been termed "mapping" the problem onto the network. The same authors later showed how a variety of other problems can be mapped onto the same network, and reported on the results of using analogue hardware implementations to solve the problems (Tank & Hopfield, 1986).

The difficulties with mapping problems onto the Hopfield network lie with the satisfaction of hard constraints. The network acts to minimize a single Liapunov function, and yet the typical combinatorial optimization problem requires the minimization of a function subject to a number of constraints: If any of these constraints are violated then the solution is termed "invalid." The early mapping techniques coded the validity constraints as terms in the Liapunov function which were minimized when the constraints were satisfied:

E = E^{op} + c_1 E_1^{cns} + c_2 E_2^{cns} + ...   (21)

Unfortunately, the multiple terms in the Liapunov function tend to frustrate one another, and the success of the network is highly sensitive to the relative values of the c_i parameters: It is not surprising, therefore, that the network frequently found invalid solutions, let alone high quality ones (Kamgar-Parsi & Kamgar-Parsi, 1990; Wilson & Pawley, 1988).

In Aiyer, Niranjan, and Fallside (1990) an eigenvector and subspace analysis of the network's behavior revealed how the E^{op} and E^{cns} terms in eqn (21) can

FIGURE 1. Schematic diagram of the continuous Hopfield network. (a) Nonlinear threshold functions constraining v to the unit hypercube. (b) The change in u is specified by the differential equation (16).


be effectively decoupled into different subspaces so that they no longer frustrate one another. For a wide variety of problems, it was realized that all the hypercube corners corresponding to valid solutions lie on a particular affine subspace with equation

v = T^{val} v + s   (22)

where T^{val} is a projection matrix (i.e., T^{val} T^{val} = T^{val}) and T^{val} s = 0. This subspace was termed the "valid subspace." The constrained optimization can now be re-expressed using only a single constraint term as follows (Aiyer, 1991):

E = E^{op} + (1/2) c_0 ‖v − (T^{val} v + s)‖².   (23)

Expanding eqn (23) we obtain, to within a constant,

E = E^{op} − c_0 ((1/2) v^T (T^{val} − I) v + s^T v)   (24)

from which we see that the Hopfield network parameters must be set as follows:

T = T^{op} + c_0 (T^{val} − I)   (25)

i^b = i^{op} + c_0 s.   (26)

In the limit of large c_0, v will be pinned to the valid subspace throughout convergence, and the network dynamics will minimize E = E^{op} as required.

The resulting system can be simulated with increased efficiency on a serial machine using an efficient algorithm (Aiyer, 1991) which directly enforces the validity constraints (see Figure 2). Each iteration has two distinct stages. First, as depicted in box (3), v is updated over a finite time step Δt using the gradient of the objective term E^{op} alone. However, updating v in this manner will typically take v off the valid subspace and possibly out of the unit hypercube. Hence, after every update, v is directly projected back onto the valid subspace and confined within the unit hypercube. This is an iterative process, requiring several passes through boxes (1) and (2), in which v is first orthogonally projected onto the valid subspace, and then thresholded so that its elements lie in the range 0 to 1. For more details of the algorithm the reader is referred to Aiyer (1991).
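The following sketch (ours) illustrates the two-stage iteration just described. It is not the published algorithm of Aiyer (1991): the step size dt, the number of outer iterations, and the fixed count of inner projection/threshold passes are all illustrative assumptions rather than the original schedule or stopping criteria.

    import numpy as np

    def simulate(T_op, i_op, T_val, s, n_iter=500, dt=0.01, inner=30, seed=0):
        # Sketch of the projection-based simulation of Figure 2.
        v = s + 1e-4 * np.random.default_rng(seed).standard_normal(len(s))  # start near s
        for _ in range(n_iter):
            v = v + dt * (T_op @ v + i_op)       # box (3): gradient step on E^op alone
            for _ in range(inner):               # boxes (1) and (2), repeated
                v = T_val @ v + s                # orthogonal projection onto the valid subspace
                v = np.clip(v, 0.0, 1.0)         # threshold into the unit hypercube
        return v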

4. LINEARIZED ANALYSIS OF THE NETWORK DYNAMICS

In this section we carefully analyze the operation of the network in the early stages of convergence. We shall find it useful to decompose v into two components, one lying in the valid subspace and the other orthogonal to it. We achieve this by writing v = v^{val} + s, where v^{val} = T^{val} v. For all the problems considered in this paper, it transpires that s is a uniform vector, i.e., some multiple of o^n. If we are not to bias the behavior of the network, it would seem sensible to initialize v near this point; we shall therefore subsequently assume that at time t = 0, the values of v and v^{val} are v_0 ≈ s and v_0^{val} ≈ 0. If we now consider small changes of u and v around their initial positions, then we can linearize the dynamics by writing v̇ = η u̇, where η is the gradient of the transfer functions (18) at the initial position. This gives a single dynamic equation (for τ → ∞)

v̇ = η(T v + i^b).   (27)

Since v is constantly confined to the valid subspace, it follows that v̇ = v̇^{val}. We may derive a useful expression for v̇^{val} as follows:

v̇^{val} = η T^{val}(T v + i^b)

= η T^{val}[(T^{op} + c_0(T^{val} − I))(T^{val} v + s) + i^{op} + c_0 s]

= η[T^{val} T^{op} T^{val} v + T^{val}(T^{op} s + i^{op})].   (28)

We see that v̇^{val} is made up of two parts, a constant term and a term which depends on v. Let us simplify these expressions by writing

T^{val} T^{op} T^{val} = A   (29)

T^{val}(T^{op} s + i^{op}) = b   (30)

which gives

v̇^{val} = η(A v + b) = η(A v^{val} + b).   (31)

In Section 5.4 we will see that with v confined to the valid subspace, E^{op} may be expressed (to within a constant) as

E^{op} = −(1/2) v^T A v − b^T v.   (32)

FIGURE 2. Schematic diagram of the efficient simulation algorithm. (1) Projection of v onto the valid subspace. (2) Threshold functions constraining v to the unit hypercube. (3) Change in v given by the gradient of the objective term E^{op} alone.


It is apparent, therefore, that the dynamics v̇ = v̇^{val} = η(Av + b) simply perform steepest descent on E^{op} within the valid subspace, which is consistent with our goal of finding a valid solution which minimizes E^{op}.

Suppose A is an N × N matrix, with eigenvalues λ_1, λ_2, ..., λ_N and associated normalized eigenvectors u_1, u_2, ..., u_N. Let v^{val}, v_0^{val} and b be decomposed along A's eigenvectors as follows:

v^{val} = Σ_{i=1}^{N} α_i u_i,   v_0^{val} = Σ_{i=1}^{N} α_i^0 u_i,   b = Σ_{i=1}^{N} β_i u_i.   (33)

It is possible to solve the dynamic eqn (31), expressing v^{val}(t) relative to the eigenvector basis of A. The detailed derivation may be found in Appendix A, in which we conclude that

v^{val}(t) = Σ_{i=1}^{N} [α_i^0 e^{ηλ_i t} + (β_i/λ_i)(e^{ηλ_i t} − 1)] u_i.   (34)

It will be revealing to examine the expression for v^{val}(t) in the limits of small and large t. For small t, we make the approximation e^{ηλ_i t} ≈ 1 + ηλ_i t, which gives

v^{val}(t) ≈ Σ_{i=1}^{N} [α_i^0 (1 + ηλ_i t) + ηβ_i t] u_i.

Further noting that for a small, random v_0^{val} the α_i^0 are often small in comparison with the β_i, we obtain

v^{val}(t) ≈ ηt Σ_{i=1}^{N} β_i u_i = ηt b.   (35)

Therefore, it is apparent that v^{val} initially takes off in the direction of the vector b. In the limit of large t, eqn (34) indicates that v^{val} will tend towards the eigenvector of A with the largest positive eigenvalue (let us call this eigenvector u_max, the corresponding eigenvalue λ_max, and the various components in this direction α_max, α_max^0 and β_max). It should be noted at this point that the dominance of u_max is enhanced by the action of common annealing procedures (see Appendix B). Of course, before the large t limit is approached, v would have strayed far from its initial position, and the linear analysis above would no longer be valid. However, it is reasonable to assume that the vectors b and u_max are the main influences on the initial evolution of v^{val}, an assumption which will be justified experimentally in later sections. We would also expect that when v has successfully converged to a valid hypercube corner, the associated v^{val} will invariably contain a significant component in the direction of u_max; this expectation will also be experimentally verified in later sections of this paper.
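The quantities that drive this analysis are straightforward to compute for any candidate mapping. The sketch below (our addition; T_op, i_op, T_val and s are whatever a particular mapping supplies) returns A, b, the spectrum of A, the coefficients β_i of eqn (33), and the dominant eigenvector u_max:

    import numpy as np

    def linearized_quantities(T_op, i_op, T_val, s):
        # A and b as defined in eqns (29) and (30).
        A = T_val @ T_op @ T_val
        b = T_val @ (T_op @ s + i_op)
        # For the mappings in this paper A is symmetric, so eigh gives a real spectrum.
        lam, U = np.linalg.eigh((A + A.T) / 2)
        beta = U.T @ b                 # coefficients beta_i of eqn (33)
        u_max = U[:, np.argmax(lam)]   # dominant eigenvector u_max
        return A, b, lam, U, beta, u_max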

5. THE EUCLIDEAN HAMILTON PATH PROBLEM

5.1. Problem Formulation

The Euclidean Hamilton path problem can be expressed as follows: Given n cities, all of which have to be visited once only on a single journey, find the best order in which to visit them such that the total journey length is minimized. As such it is very similar to the Euclidean traveling salesman problem, except that in the latter, a closed tour is specified. It is therefore not surprising that the mathematical problem formulation and the mapping onto the Hopfield network are also very similar (Aiyer, 1991).

The solution to the Hamilton path problem may be expressed by means of an n-element vector p: Suppose the cities are originally presented as an ordered list of Cartesian coordinates, C say. C is an n × 2 matrix, in which C_{j1} is the x-coordinate of city j, and C_{j2} is the y-coordinate of the same city. Let p_i = j if the city initially in list position j is to be placed in position i of the journey. We must also enforce

p_i ≠ p_j,   i, j ∈ {1, ..., n},   i ≠ j   (36)

if all the cities in the original list are to be present in the final journey. The reordering may be achieved directly by means of an n × n permutation matrix V(p), which is constructed as follows:

[V(p)]_{ij} = 1 if p_i = j, and 0 otherwise.   (37)

With the above definition of V(p), the n × 2 matrix C' = V(p)C contains the same city coordinates as C, but reordered in the manner defined by p. Note that the elements of V(p) are all either 1's or 0's, and each row and column contains only a single 1. Expressed mathematically, this means:

[V(p)]_{ij} ∈ {1, 0}   (38)

Σ_{i=1}^{n} [V(p)]_{ij} = Σ_{j=1}^{n} [V(p)]_{ij} = 1,   i, j ∈ {1, ..., n}.   (39)

By making use of the property of the matrix R^n given in eqn (14), these conditions may be more neatly expressed as (Aiyer, 1991):

[V(p)]_{ij} ∈ {1, 0}   (40)

V(p) = R^n V(p) R^n + S   (41)

where S = (1/n) O^n = (1/n) o^n o^{nT}.   (42)

We must now find an expression involving V(p) for the total path length of a journey specified by ordering p. Let P be the n × n matrix of negated intercity distances given by


P_{ij} = −(distance between cities i and j).   (43)

Let Q be the n × n matrix given by

Q_{ij} = δ_{j,i−1} + δ_{j,i+1},   i, j ∈ {1, ..., n}.   (44)

Following this specification, for n = 5, Q would be

Q = [ 0 1 0 0 0 ]
    [ 1 0 1 0 0 ]
    [ 0 1 0 1 0 ]
    [ 0 0 1 0 1 ]
    [ 0 0 0 1 0 ]

Then the path length corresponding to ordering p is given by

d(p) = −(1/2) trace[V(p) P V(p)^T Q]   (45)

and the optimization problem may be concisely expressed as min_p d(p).
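As a small numerical check of eqn (45) (our addition, with arbitrary test data), the sketch below compares the trace expression with a direct sum of consecutive intercity distances:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    C = rng.random((n, 2))                                        # city coordinates
    D = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)     # intercity distances
    P = -D                                                        # negated distances, eqn (43)
    Q = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # eqn (44)

    p = rng.permutation(n)                                        # a journey ordering
    V = np.zeros((n, n))
    V[np.arange(n), p] = 1                                        # permutation matrix, eqn (37)

    d_trace = -0.5 * np.trace(V @ P @ V.T @ Q)                    # eqn (45)
    d_direct = sum(D[p[i], p[i + 1]] for i in range(n - 1))       # summed leg lengths
    print(d_trace, d_direct)                                      # the two values agree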

5.2. Mapping the Problem onto the Hopfield Network

Having established the mathematical problem formulation, it is now necessary to show how the problem may be mapped onto the Hopfield network for solution. The converged state vector of this network, call it v(p), is used to represent the desired permutation matrix, V(p), by way of the vec function:

v(p) = vec(V(p)).   (46)

Hence, if there are n cities in the problem, a network with N = n² units is required to find the permutation p. We must also define a matrix V which is the matrix equivalent of the unconverged state vector v:

v = vec(V).   (47)

Consider what happens if we confine V so that

V = R^n V R^n + S.   (48)

If we subsequently force v into a hypercube corner, i.e., if we enforce V_{ij} ∈ {1, 0}, then it is clear that V will satisfy the conditions for V(p) in eqns (40) and (41), and will therefore represent a valid solution to the combinatorial optimization problem.

Equation (48) corresponds to a valid subspace of the form given in eqn (22). The parameters T^{val} and s are derived as follows:

V = R^n V R^n + S
⟹ vec(V) = vec(R^n V R^n) + vec(S)
⟹ vec(V) = (R^n ⊗ R^n) vec(V) + vec(S)    using (8)
⟹ v = (R^n ⊗ R^n) v + vec(S)    using (47).

Hence, by comparison with eqn (22), we have

T^{val} = R^n ⊗ R^n   (49)

s = vec(S) = (1/n)(o^n ⊗ o^n).   (50)

Since we are attempting to minimize the path length d(p), if we set E^{op} = d(p), then the Hopfield network will carry out the desired optimization. We are now in a position to derive expressions for T^{op} and i^{op}.

E^{op} = d(p)
= −(1/2) trace[V(p)^T Q V(p) P]    using (4)
= −(1/2) [vec(V(p))]^T vec(Q V(p) P)    using (7)
= −(1/2) [vec(V(p))]^T (P ⊗ Q) vec(V(p))    using (8)
= −(1/2) v(p)^T (P ⊗ Q) v(p)    using (46).

By comparison with eqn (20), we see that

T^{op} = (P ⊗ Q)   (51)

i^{op} = 0.   (52)
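Putting eqns (49)-(52) together, the full set of network parameters for an n-city instance can be assembled in a few lines. The sketch below (ours) reuses the P and Q matrices from the previous sketch; its outputs can be fed directly into the linearized analysis of Section 4.

    import numpy as np

    def hamilton_path_mapping(P, Q):
        # Build T^op, i^op, T^val and s (eqns 49-52) for an n-city instance.
        n = P.shape[0]
        R = np.eye(n) - np.ones((n, n)) / n      # R^n, eqn (13)
        o = np.ones(n)                           # o^n
        T_op = np.kron(P, Q)                     # eqn (51)
        i_op = np.zeros(n * n)                   # eqn (52)
        T_val = np.kron(R, R)                    # eqn (49)
        s = np.kron(o, o) / n                    # eqn (50)
        return T_op, i_op, T_val, s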

5.3. Applying the Linearized Analysis of the Network Dynamics

If we are now to analyze the initial evolution of the vector v, then we need to derive expressions for the vectors b and u_max, which we have identified as the main influences on the initial direction of v. Now,

b = T^{val} T^{op} s + T^{val} i^{op}
= (1/n)(R^n ⊗ R^n)(P ⊗ Q)(o^n ⊗ o^n)
= (1/n)(R^n P o^n ⊗ R^n Q o^n)    using (6)   (53)

A = T^{val} T^{op} T^{val}
= (R^n ⊗ R^n)(P ⊗ Q)(R^n ⊗ R^n)
= (R^n P R^n ⊗ R^n Q R^n)    using (6).   (54)

As we saw in Section 2, we can relate the eigenvectors of A to those of R^n P R^n and R^n Q R^n. We shall find it useful to start by looking at the eigenvectors of Q, which can be easily verified to be

[h_k]_i = sin(kπi/(n + 1)),   i, k ∈ {1, ..., n}   (55)

with corresponding eigenvalues

μ_k = 2 cos(kπ/(n + 1)),   k ∈ {1, ..., n}.   (56)

Now, those h_k with even k (i.e., with k = 2j for some integer j) exhibit the property Σ_{i=1}^{n} [h_k]_i = 0, and so

h_{2j}^T o^n = 0   (57)

and

R^n h_{2j} = h_{2j}.   (58)

It follows that h_{2j} is also an eigenvector of R^n Q R^n, with the same eigenvalue μ_{2j}, since

R^n Q R^n h_{2j} = R^n Q h_{2j}    using (58)
= μ_{2j} R^n h_{2j}
= μ_{2j} h_{2j}    using (58).
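These eigenvector properties are easy to confirm numerically. The short check below (our addition) verifies eqns (55)-(58) and the fact that the even-k eigenvectors of Q are also eigenvectors of R^n Q R^n:

    import numpy as np

    n = 7
    Q = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)   # eqn (44)
    R = np.eye(n) - np.ones((n, n)) / n                            # eqn (13)
    i = np.arange(1, n + 1)

    for k in range(1, n + 1):
        h = np.sin(k * np.pi * i / (n + 1))                        # eqn (55)
        mu = 2 * np.cos(k * np.pi / (n + 1))                       # eqn (56)
        assert np.allclose(Q @ h, mu * h)                          # h_k is an eigenvector of Q
        if k % 2 == 0:
            assert abs(h @ np.ones(n)) < 1e-9                      # eqn (57)
            assert np.allclose(R @ Q @ R @ h, mu * h)              # eigenvector of R^n Q R^n too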


This accounts for about half of the eigenvectors of R^n Q R^n. The others seem to be more difficult to derive analytically, though experiments reveal that the eigenvector of R^n Q R^n with the largest positive eigenvalue is invariably one of those which is also an eigenvector of Q: Let us call this eigenvector h_max and its corresponding eigenvalue μ_max. Another interesting fact brought to light by the experiments is that all the eigenvalues of R^n P R^n are non-negative; this seems to hold for all negated distance matrices P under a Euclidean distance metric. We can now use the properties of Kronecker products to state that

u_max = φ(w_max ⊗ h_max)   (59)

where w_max is the eigenvector of R^n P R^n with the largest positive eigenvalue, and φ is a normalizing constant. It transpires that, for this problem, b and u_max are mutually orthogonal, since

u_max^T b = (φ/n)(w_max ⊗ h_max)^T (R^n P o^n ⊗ R^n Q o^n)
= (φ/n)(w_max^T R^n P o^n)(h_max^T R^n Q o^n)    using (9)
= (φ/n)(w_max^T R^n P o^n)(h_max^T Q o^n)    using (58)
= (φ/n) μ_max (w_max^T R^n P o^n)(h_max^T o^n)
= 0    using (57).

Referring back to eqn (34), this means that β_max = 0, and so the evolution of v^{val} in the direction of u_max is governed solely by α_max^0, which has random sign. We have already established that v^{val} evolves with a very significant component in the direction of u_max. Now we see that the sign of this component depends wholly on the random α_max^0, and so we conclude that v^{val} can follow one of two very different paths, depending on the random starting vector v_0^{val}. It is hoped that the two paths will lead to the two optimal solutions, corresponding to traversing the shortest path in opposite directions. To verify this we examine computer simulations of the network being used to solve a specific 10-city problem.

The 10 cities were placed randomly within the unit square, and then the shortest Hamilton path was found by an exhaustive search. The cities were then rearranged in the order of the shortest path before being passed to the Hopfield network. In this way we know that the two optimal solutions V(p) are the identity and reversed identity matrix. Figure 3 shows the 10 cities in their optimal ordering, as well as the shortest Hamilton path through them. The Hopfield network was initialized with a small random vector v_0^{val}, and then allowed to converge to a hypercube corner. Figure 4 shows the network output at several stages of convergence. The plots confirm what the linearized theory predicted: As early as iteration 10 it is clear that v^{val} has taken off in the direction of b, by iteration 100 v^{val} contains a dominant component in the direction of −u_max, and when fully converged there are significant components along both b and u_max. We note in passing that the network has converged to the identity matrix, and has therefore found the optimal solution. Figure 5 shows the same problem being solved again, only this time initializing the network with −v_0^{val}. As predicted, this time the component along u_max comes in with positive sign, leading to the other optimal solution.¹

FIGURE 3. The 10 city positions with their shortest Hamilton path.

5.4. The Auxiliary Linear Problem

Having seen that the linearized analysis of Section 4 does indeed predict the initial behavior of the network, we are now in a position to investigate the likely extent of the network's success on the Hamilton path problem. Clearly, the network may fail to find the optimal solution if either b or u_max takes v into a completely different region of space from the optimal hypercube corner. In order to gain a better understanding of what b does, we can associate b with an auxiliary linear problem. First we rewrite the problem's objective function (20) for v lying on the valid subspace (22):

E^{op} = −(1/2)(T^{val} v + s)^T T^{op} (T^{val} v + s) − (T^{val} v + s)^T i^{op}.   (60)

Simplifying eqn (60), we obtain to within a constant

E^{op} = −(1/2) v^T A v − b^T v.   (61)

So we can associate the action of b with the minimization of an auxiliary linear problem with objective function

E^{alp} = −b^T v.   (62)

¹ Any of the problems studied in this paper may be expressed in many different ways: For example, an n-city Hamilton path problem may be expressed in n! ways, each one corresponding to a different ordering of the cities in the matrix P. For the results of this paper to have any significance, it is necessary to prove that the decomposition of a particular solution's v^{val} along b and the eigenvectors of A is independent of problem instance. While this is indeed the case, the proof is rather long and tedious: Interested readers should refer to Gee, Aiyer, and Prager (1991a).

FIGURE 4. Ten city Hamilton path problem at various stages of convergence (mesh plot of V and decomposition of v^{val} at iteration 10, iteration 100, iteration 200, and fully converged). In the decomposition of v^{val}, the leftmost 100 bars represent the α_i of eqn (33). The α_i are ordered such that those with associated negative eigenvalues are to the left and those with positive eigenvalues to the right. Hence, the component along the dominant eigenvector u_max appears in position 100. The bar to the right of this, at position 110, indicates the component of v^{val} along b, i.e., it measures (b^T v^{val})/‖b‖.

FIGURE 5. Ten city Hamilton path problem with negated starting vector (mesh plot of V and decomposition of v^{val} at iteration 10, iteration 100, iteration 200, and fully converged). See legend to Figure 4 for an explanation of the decomposition plots.


We have already established that the initial dynamics of the network are well characterized by v̇ ∝ b, and so we conclude that the initial behavior of the network is concerned largely with minimizing E^{alp}. For the Hamilton path problem, substituting for b from eqn (53) gives

E^{alp} = −(1/n)(o^{nT} P R^n ⊗ o^{nT} Q R^n) vec(V)    using (47)
= −(1/n) vec(o^{nT} Q R^n V R^n P o^n)    using (8)
= −(1/n) trace(o^{nT} Q R^n V R^n P o^n)    since the argument is a scalar
= −trace(R^n V R^n P S Q)    using (4), (42).

Adding a constant term, we obtain

E^{alp} = −trace((R^n V R^n + S) P S Q)
= −trace(V P S Q)    since V satisfies (48)
= −(1/n) trace[V P o^n o^{nT} Q]
= (1/n)[(Q o^n)^T V (−P o^n)]    using (4).   (63)

Now, element i of the vector −P o^n gives the summed distances of all the other cities from city i, since [−P o^n]_i = Σ_{j=1}^{n} −P_{ij}. Similarly, the vector Q o^n is given by

[Q o^n]_i = 1 if i ∈ {1, n}, and 2 if 2 ≤ i ≤ (n − 1).   (64)

As an example, for n = 5 the vector (Q o^n)^T would be [1 2 2 2 1]. In order to minimize E^{alp}, V must reorder the elements of −P o^n so that the largest elements appear in the first and last rows of V(P o^n). This is equivalent to ensuring that the two cities with the largest summed distances from all the other cities are placed first and last in the journey. If this is done, then the contribution of these large distances to the cost of E^{alp} is halved compared with what would have been the case had the cities not been placed first and last. In this way, we can associate with the auxiliary linear problem a certain strategy, that being to place the two most remote cities first and last in the tour; the network follows this strategy in the early stages of convergence.
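In code, the strategy associated with the auxiliary linear problem is nothing more than a sort on summed distances. The sketch below (ours, with the hypothetical helper name remote_endpoints) picks out the two cities that the early, b-dominated dynamics will tend to place at the ends of the path:

    import numpy as np

    def remote_endpoints(P):
        # P is the negated distance matrix of eqn (43), so -P o^n holds each city's
        # summed distance to all the others.
        remoteness = -P @ np.ones(P.shape[0])
        first, last = np.argsort(remoteness)[-2:]    # indices of the two most remote cities
        return first, last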

Armed with this new insight, we can now address the question of how well the network can tackle the general Euclidean Hamilton path problem. The auxiliary linear problem suggests that the network may fail for problems in which the optimal tour deviates significantly from starting and finishing at the most remote cities. Such a problem is illustrated in Figure 6. Here we see 16 cities arranged in a spiral pattern and numbered so that the optimal solutions V are the identity and reverse identity matrices. The optimal path length is 54.0 units.

FIGURE 6. The 16 city positions with their shortest Hamilton path.

Figure 7 shows that the network behaves as predicted by the linear analysis, putting in large b and u_max components from the start, only this time a suboptimal solution is reached (path length 56.2 units), with the journey starting and ending at outlying cities (see Figure 8). Clearly, this is a result of the auxiliary linear problem being poorly related to the parent Hamilton path problem in this case. Indeed, Figure 9 shows that the optimal solution's v^{val} contains a small negative component in the direction of b. While the network is sometimes capable of completely overturning an initial bias towards b, this capability is highly dependent on the random starting vector v_0^{val}, and more often than not, poor solutions like the one in Figure 8 are obtained.

6. THE POINT MATCHING PROBLEM

6.1. Problem Formulation and Mapping

The point matching problem as presented in this section forms part of an invariant pattern recognition system. The aim is to compare a set of unknown points to a set of template points and compute some measure of similarity between the two sets. However, before the similarity can be computed, each of the unknown points has to be matched with a particular template point. Suppose the template and unknown point sets contain the same number of points n: Let the template set be called 𝒬 and the unknown point set be called 𝒫. Let us also define distance matrices for these two sets, P and Q, where P_{ij} is the distance between unknown points i and j, and Q_{kl} is the distance between template points k and l. Then we can match unknown points with template points using a permutation p, where p_i = j if unknown point j is to be matched with template point i. A distance measure which is invariant to translation and rotation of the point sets is then

d(p) = ‖V(p) P V(p)^T − Q‖²   (65)

where V(p) is the permutation matrix associated with p. The objective of the point matching problem is to find the permutation p which minimizes the distance

FIGURE 7. Sixteen city Hamilton path problem at various stages of convergence (mesh plot of V and decomposition of v^{val} at iteration 10, iteration 100, iteration 200, and fully converged). See legend to Figure 4 for an explanation of the decomposition plots.

FIGURE 8. The 16 city positions with the solution found by the Hopfield network.

d(p) between the two sets, i.e., min_p d(p). This problem is similar to the optimization problems found in the graph matching approaches to invariant pattern recognition (Bienenstock & Doursat, 1989; Li & Nasrabadi, 1989), with the added condition that the two sets contain the same number of points. Note that invariance to scaling can be added by normalizing the moduli of the two distance matrices being compared, such that ‖P‖ = ‖Q‖.

We now show how this problem may be mapped onto the Hopfield network. The valid subspace is exactly the same as for the Hamilton path problem, since in both cases we are trying to find a permutation matrix V(p), so T^{val} and s are as given in eqns (49) and (50), respectively. It is now necessary to derive expressions for T^{op} and i^{op}. We start by expanding the expression for d(p):

d(p) = ‖V(p) P V(p)^T − Q‖²

= trace[(V(p) P V(p)^T − Q)^T (V(p) P V(p)^T − Q)]

= ‖P‖² + ‖Q‖² − 2 trace[V(p) P V(p)^T Q].   (66)

Eliminating the constants from the above expression, we can set E^{op} as

E^{op} = −2 trace[V(p) P V(p)^T Q].   (67)

This is of exactly the same form as the expression for E^{op} in the Hamilton path problem, and so we conclude that

T^{op} = (P ⊗ Q)   (68)

i^{op} = 0.   (69)
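For completeness, a small sketch (ours, with made-up point sets) of the point matching objective: it evaluates d(p) of eqn (65) for a candidate permutation, after the normalization ‖P‖ = ‖Q‖ mentioned above.

    import numpy as np

    def match_distance(P, Q, p):
        # d(p) = ||V(p) P V(p)^T - Q||^2, eqn (65), for permutation p.
        n = len(p)
        V = np.zeros((n, n))
        V[np.arange(n), p] = 1
        return np.linalg.norm(V @ P @ V.T - Q) ** 2

    rng = np.random.default_rng(3)
    template = rng.random((8, 2))
    unknown = template[::-1] + 0.3                        # same shape, translated and relabeled
    Q = np.linalg.norm(template[:, None] - template[None, :], axis=2)
    P = np.linalg.norm(unknown[:, None] - unknown[None, :], axis=2)
    P *= np.linalg.norm(Q) / np.linalg.norm(P)            # scale invariance: ||P|| = ||Q||

    print(match_distance(P, Q, np.arange(8)[::-1]))       # correct matching: ~0
    print(match_distance(P, Q, np.arange(8)))             # wrong matching: larger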

6.2. The Auxiliary Linear Problem

Since both P and Q are problem dependent, it is impossible to say anything definite about their eigenvectors, and hence we cannot derive any interesting relationship between u_max and b as we did for the Hamilton path problem. However, we are in a position to examine the auxiliary linear problem. Making use of the congruence between this mapping and that of the Hamilton path problem, we can jump straight to eqn (63):

E^{alp} = −(1/n)[(Q o^n)^T V (P o^n)].

Now, the vectors Q o^n and P o^n simply represent the summed distances of all the other points from each point in the template or unknown, since

[Q o^n]_i = Σ_{j=1}^{n} Q_{ij},   [P o^n]_i = Σ_{j=1}^{n} P_{ij}.

In order to minimize E^{alp}, V must reorder the elements of P o^n so that large elements appear in the same rows as the large elements of Q o^n, while small elements appear in the same rows as the small elements of Q o^n. In other words, the auxiliary linear problem matches outlying points in 𝒫 with outlying points in 𝒬, and central points in 𝒫 with central points in 𝒬.
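A minimal sketch (ours) of this matching-by-remoteness strategy: it sorts each point set by its summed distances and pairs points rank for rank. This is only the starting tendency the auxiliary linear problem encourages, not the network's final answer.

    import numpy as np

    def remoteness_matching(P, Q):
        # Pair unknown and template points by rank of summed distance (row sums).
        # Returns p with p[i] = j meaning unknown point j is matched with template point i.
        unknown_rank = np.argsort(P @ np.ones(P.shape[0]))    # unknown points, central -> remote
        template_rank = np.argsort(Q @ np.ones(Q.shape[0]))   # template points, central -> remote
        p = np.empty(P.shape[0], dtype=int)
        p[template_rank] = unknown_rank                       # match equal ranks
        return p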

As with the Hamilton path problem, we must now address whether the auxiliary linear problem is in sympathy with the parent point matching problem. To do this, we study several typical 20-point matching problems for which good solutions have been found (see Figure 10). For each problem, Figure 11 shows the decomposition of the solution's v^{val} along b and the eigenvectors of A; also shown is the decomposition of

FIGURE 9. Optimal solution to the 16 city Hamilton path problem (mesh plot of V and decomposition of v^{val}). See legend to Figure 4 for an explanation of the decomposition plot.

FIGURE 10. Template and unknown patterns with labeling for a good match (panels show the template and unknown point sets for the '2', '4', '5', and '9' problems).

FIGURE 11. Decompositions for the point matching problems (the solution's v^{val} and the vector b, for the '2', '4', '5', and '9' problems). See legend to Figure 4 for an explanation of the plots. For clarity, only those components corresponding to the largest positive eigenvalues are displayed, as well as the component along b.


b along the same vectors. It seems that all the good solutions contain significant components along u_max, a strong prerequisite for a successful mapping. It is also apparent that the solutions contain significant components along b, indicating a good degree of sympathy between the auxiliary linear problem and the parent problem. However, the figure also shows that for some problems the vectors b and u_max are nearly mutually orthogonal. This means, as in the case of the Hamilton path problem, that the evolution of v^{val} in the direction of u_max is governed solely by α_max^0, which has random sign. However, unlike the Hamilton path problem, in which each of the two possible routes could lead to an optimal solution, the point matching problem has only a single optimal solution, and so one of the routes is bound to lead to a suboptimal hypercube corner. This hypothesis is confirmed in Figure 12, which shows the result of solving the "4" problem of Figure 10 twice on a Hopfield network, starting once with v_0^{val} and then again with −v_0^{val}. As expected, the component of v^{val} along u_max develops different signs in the two runs, leading to two different solutions, one with d(p) = 0.157, the other with d(p) = 0.346. The unknown point labellings corresponding to these solutions are shown in Figure 13; these should be compared with the template in Figure 10.

It is reassuring to observe in Figure 11 that in none of the problems does the vector b contain a significant component along u_max with the opposite sign to the good solution's component in that direction. If this were the case, then the action of b would invariably introduce the u_max component into v^{val} with the wrong sign, leading most probably to poor solutions. So, unlike the Hamilton path problem, we cannot say here that the auxiliary linear problem is antagonizing the parent problem. Instead, the auxiliary linear problem simply doesn't contain enough information to steer v^{val} in the right direction from the outset. This is not such a serious shortcoming as in the Hamilton path problem in that this time we can be sure of obtaining one good solution if we run the network twice, negating the random v_0^{val} between runs. This may not be a particularly elegant solution, but it is certainly very effective.

6.3. Simulation Results

As a small illustrative experiment, the network was used in a pattern recognition system to recognize instances of the 10 handwritten digits. The template and unknown digits used in the experiment are shown in Figure 14. A point positioning algorithm was used to place 20 points around the outlines of the digits.

FIGURE 12. Mesh plots and decompositions for the two solutions to the "4" problem: the good solution (d(p) = 0.157) and the poor solution (d(p) = 0.346). See legend to Figure 4 for an explanation of the decomposition plots.


FIGURE 13. Point labelings for good and poor solutions to the "4" problem.

Each of the 10 unknowns was then compared with each of the 10 templates, using the Hopfield network to solve the point matching problems. The network was run twice on each comparison to ensure a good solution was found at least once. The results are displayed in Table 1. The only misclassification was that of the "6", which was mistaken for a "9"; this is clearly a result of the rotational invariance of the recognition scheme.

7. DISCUSSION AND C O N C L U S I O N S

There have been many papers in the past reporting the discovery of neat mappings of combinatorial optimization problems onto the Hopfield network, but it is often the case that the solutions obtained from the network are not as good as might be expected given the integrity of the mapping. In this paper we have demonstrated how it is possible to estimate the likely efficacy of a particular mapping by way of an eigenvector analysis of the network's dynamics and an investigation of the structure of "good" solutions relative to this eigenvector basis.

Although not observed in experiments reported in this paper, it seems likely that a major cause of failure would be that a good solution's v^{val} contains an insignificant component along the dominant eigenvector u_max. The other principal mode of failure arises through inappropriate action of the vector b. In the course of our argument, we have introduced the concept of an "auxiliary linear problem," which we have found useful in gaining a greater understanding of the "strategy" associated with b. It remains to be pointed out that

not all problems have an auxiliary linear problem: For example, the classic mappings of the traveling salesman and graph partitioning problems (Aiyer, 1991) both exhibit the property b = 0. The absence of an auxiliary linear problem is to be associated with a local maximum of the objective function at the center of the valid subspace, and so the initial direction of v from this point is entirely random. For the aforementioned two problems, this is consistent with the distribution of degenerate optimal solutions around the unit hypercube. For problems which lack this degeneracy of solution, it is indeed essential that the mapping yields an auxiliary linear problem or else the network output is most likely to head away from the optimal hypercube corner from the outset.

However, the presence of an auxiliary linear problem where it is required is not enough to guarantee the success of a mapping. It is also necessary that the auxiliary linear problem be in sympathy with the corresponding parent problem. By "in sympathy," we mean that the v^{val} corresponding to good solutions must not contain negative components along b. Furthermore, for problems with a unique optimal solution (e.g., the point matching problem), it is necessary that the auxiliary linear problem contain correct information as to the direction of a good solution's component along u_max. If this information is incorrect, or lacking, then the u_max component may be introduced with the wrong sign, most probably leading to a poor quality solution. For cases in which the u_max information is lacking, as with some point matching problems, a good solution can be found by running the network twice, starting

FIGURE 14. Template (top) and unknown (bottom) digits used in the experiment.


TABLE 1
Distance Measures d(p) for all Templates against all Unknowns; Shortest Distances are Shown in Bold Type

Unknown                         Template Digit
Digit      0     1     2     3     4     5     6     7     8     9
0        0.98  3.82  2.13  2.16  2.76  2.30  2.12  3.08  2.25  2.81
1        4.49  0.55  2.67  2.31  3.33  1.97  3.00  2.51  2.30  3.06
2        2.07  3.20  1.44  2.46  2.28  2.16  2.56  1.99  2.96  2.87
3        3.92  1.95  1.85  1.17  2.61  1.46  2.22  1.89  2.18  2.51
4        2.88  3.11  2.30  2.30  1.57  2.25  2.35  2.63  2.17  2.86
5        2.44  2.95  2.16  2.14  1.95  1.43  2.32  2.24  1.93  2.33
6        3.51  2.50  3.06  2.42  2.84  1.97  1.56  2.24  2.16  1.39
7        4.36  1.63  2.60  2.24  2.67  1.75  2.55  1.23  2.43  2.30
8        3.19  2.18  2.33  1.67  2.87  1.81  2.34  2.73  1.33  2.54
9        3.13  2.74  2.92  1.99  2.68  2.19  1.33  2.02  2.07  1.20

once with v^{val} = v_0^{val} and then again with v^{val} = −v_0^{val}.

To conclude, the results of this paper demonstrate how it is possible to predict the likely success of a problem mapping without having to resort to trial and error. Weaknesses in the mapping are highlighted by investigating the decomposition of good solutions along the eigenvectors of the matrix A and the auxiliary linear problem vector b. This eigenvector analysis sheds new light on the problem mapping by placing it in the context of the network's dynamics. Where there is some flexibility in the way a particular problem is mapped onto the network, the analysis presented in this paper offers a possible method of choosing between the alternatives.

REFERENCES

Aiyer, S. V. B. (1991). Solving combinatorial optimization problems using neural networks (Tech. Rep. CUED/F-INFENG/TR 89). Cambridge, England: Cambridge University Engineering Department.

Aiyer, S. V. B., & Fallside, F. (1991). A Hopfield network implementation of the Viterbi algorithm for Hidden Markov Models. Proceedings of the International Joint Conference on Neural Networks, Seattle, WA, 827-832.

Aiyer, S. V. B., Niranjan, M., & Fallside, F. (1990). A theoretical investigation into the performance of the Hopfield model. IEEE Transactions on Neural Networks, 1(2), 204-215.

Bienenstock, E., & Doursat, R. (1989). Elastic matching and pattern recognition in neural networks. In L. Personnaz & G. Dreyfus (Eds.), Neural networks: From models to applications. Paris: IDSET.

Eberhart, S. P., Daud, D., Kerns, D. A., Brown, T. X., & Thakoor, A. P. (1991). Competitive neural architecture for hardware solution to the assignment problem. Neural Networks, 4, 431-442.

Gee, A. H., Aiyer, S. V. B., & Prager, R. W. (1991a). Neural networks and combinatorial optimization problems--the key to a successful mapping (Tech. Rep. CUED/F-INFENG/TR 77). Cambridge, England: Cambridge University Engineering Department.

Gee, A. H., Aiyer, S. V. B., & Prager, R. W. (1991b). A subspace approach to invariant pattern recognition using Hopfield networks. Proceedings of the International Joint Conference on Neural Networks, Singapore, 795-800.

Gee, A. H., & Prager, R. W. (1992). Polyhedral combinatorics and neural networks (Tech. Rep. CUED/F-INFENG/TR 100). Cambridge, England: Cambridge University Engineering Department.

Gislén, L., Peterson, C., & Söderberg, B. (1989). "Teachers and Classes" with neural networks. International Journal of Neural Systems, 1(2), 167-176.

Graham, A. (1981). Kronecker products and matrix calculus, with applications. Chichester, England: Ellis Horwood Ltd.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA, 81, 3088-3092.

Hopfield, J. J., & Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52, 141-152.

Kamgar-Parsi, B., & Kamgar-Parsi, B. (1990). On problem solving with Hopfield neural networks. Biological Cybernetics, 62, 415-423.

Li, W., & Nasrabadi, N. M. (1989). Object recognition based on graph matching implemented by a Hopfield-style neural network. Proceedings of the International Joint Conference on Neural Networks, Washington DC, II, 287-290.

Ogier, R. G., & Beyer, D. A. (1990). Neural network solution to the link scheduling problem using convex relaxation. Proceedings of the IEEE Global Telecommunications Conference, San Diego, CA, 1371-1376.

Peterson, C., & Anderson, J. R. (1988). Neural networks and NP-complete optimization problems: A performance study on the graph bisection problem. Complex Systems, 2(1), 59-89.

Peterson, C., & Söderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1(1), 3-22.

Rosenbrock, H. H., & Storey, C. (1970). Mathematics of dynamical systems. London: Thomas Nelson and Sons.

Tank, D. W., & Hopfield, J. J. (1986). Simple "neural" optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Transactions on Circuits and Systems, 33(5), 533-541.

Van den Bout, D. E., & Miller III, T. K. (1989). Improving the performance of the Hopfield-Tank neural network through normalization and annealing. Biological Cybernetics, 62, 129-139.

Van den Bout, D. E., & Miller III, T. K. (1990). Graph partitioning using annealed neural networks. IEEE Transactions on Neural Networks, 1(2), 192-203.

Wilson, V., & Pawley, G. S. (1988). On the stability of the TSP problem algorithm of Hopfield and Tank. Biological Cybernetics, 58, 63-70.


APPENDIX A

Here we derive the linearized approximation to the network dynamics. The general solution of the dynamic eqn (31) can be expressed by means of the matrix exponential (Rosenbrock & Storey, 1970):

v"~(t) = e "^ ' v~ + ~ e~^~'-'~b dr. (70)

Rewriting the matrix exponentials as power series gives

~ , i,k ~' tr / ) Akv val t ~: i t - - 7") k v ' ~ ( t ) = ~o-- -~- " o + f0 ~o ~ r ~ k ` ' A ~ b d r

= ~ ~ A % ~ " ~ + ~ Jo ( t - r ) ~dr k=0

= S' ~r/ ~ Akv~ ~ ()It)k+) A%

Making use of the eigenvector decompositions ( 33 ), we have

N N

Akv~ ' ' = Z a~X~ut, A~b = Z Bth~u,. t~l t - I

We will need to distinguish between the zero and nonzero eigenvalues of A, so define the set Z such that ;~, = 0 for i ~ Z and hi ~ 0 for i ~ Z. We now obtain

: ' ( ' ) = "'° u' 4" ,-0 c -7 ,., t~,x,* u,

=,o, ,oo-Zr-, +, Zz ,oo Z ;,

,~z ~-o (k + 1)!

= Z e"~,°u, + Y~ k-----T- - 1 I=l t ~ Z At

+ Y. o~,u t t fEZ

N

= E : ~ - ° u , + E ~'"'le °~''- I)+ E ,7~,u,t. :=l i~Z At icZ

(71)

Equation ( 71 ) is completely general in that it holds for arbitrary A, b and vg ~l. However, we can simplify the expression if we define A and b as in eqns (29) and (30). In this case, the eigenvectors of A with corresponding zero eigenvalues are orthogonal to the valid sub- space, ~ while b lies wholly within the valid subspaee, so/3t = 0 for i ~ Z. Equation ( 71 ) subsequently becomes

[:,(+,,) ,_1o v v a l ( / ) _ a~' ~ ;~,j ,. (72) i=l
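Eqn (72) can be checked numerically. The sketch below (Python/NumPy; not part of the original paper) assumes a small symmetric matrix A with no zero eigenvalues, so that the set Z is empty, and compares the closed form of eqn (72) with a crude forward-Euler integration of the linearized dynamics dv^val/dt = A v^val + b:

    import numpy as np

    # Minimal check of eqn (72), assuming A is symmetric with no zero
    # eigenvalues (Z is empty, so beta_i / lambda_i is always defined).
    rng = np.random.default_rng(0)
    N = 6
    M = rng.standard_normal((N, N))
    A = (M + M.T) / 2                    # Hessian on the valid subspace
    b = rng.standard_normal(N)           # auxiliary linear problem vector
    v0 = rng.standard_normal(N)          # v^val at time t = 0
    t = 0.3

    # Closed form: sum_i [(alpha_i^0 + beta_i/lambda_i) e^{lambda_i t}
    #                     - beta_i/lambda_i] u_i
    lam, U = np.linalg.eigh(A)
    alpha0 = U.T @ v0                    # components of v0 along the u_i
    beta = U.T @ b                       # components of b along the u_i
    coeff = (alpha0 + beta / lam) * np.exp(lam * t) - beta / lam
    v_closed = U @ coeff

    # Reference: integrate dv/dt = A v + b with small forward-Euler steps.
    v = v0.copy()
    dt = 1e-5
    for _ in range(int(t / dt)):
        v = v + dt * (A @ v + b)

    print(np.max(np.abs(v - v_closed)))  # small (discretization error only)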

APPENDIX B

It is common to employ some form of annealing procedure to facilitate local minimum escape and to force v towards a hypercube corner. A typical annealing procedure is hysteretic annealing (Eberhart, Daud, Kerns, Brown, & Thakoor, 1991), though close variants include matrix graduated nonconvexity (Aiyer, 1991) and convex relaxation (Ogier & Beyer, 1990). In this appendix we show that the results of this paper are not affected by the use of such annealing procedures.

² Unless T^op v^val = 0 for some v^val in the valid subspace, which does not happen for any of the T^op considered in this paper.

With hysteretic annealing, a variable amount of self-feedback is incorporated into each neuron, so that

$$\mathbf{T}^{\mathrm{op}} \rightarrow \mathbf{T}^{\mathrm{op}} + \theta\mathbf{I}. \qquad (73)$$

The network is started running with a negative value of θ, and then θ is gradually increased to positive values until convergence to a hypercube corner is achieved. First, let us see what effect θ has on the vector b:

$$\mathbf{b} \rightarrow \mathbf{T}^{\mathrm{val}}\left(\mathbf{T}^{\mathrm{op}} + \theta\mathbf{I}\right)\mathbf{s} = \mathbf{T}^{\mathrm{val}}\mathbf{T}^{\mathrm{op}}\mathbf{s} + \theta\,\mathbf{T}^{\mathrm{val}}\mathbf{s} = \mathbf{T}^{\mathrm{val}}\mathbf{T}^{\mathrm{op}}\mathbf{s}.$$

Hence, b is unchanged by θ. The eigenvectors u_i of A are also unchanged by θ, though the eigenvalues are affected, since

$$A\mathbf{u}_i \rightarrow \mathbf{T}^{\mathrm{val}}\left(\mathbf{T}^{\mathrm{op}} + \theta\mathbf{I}\right)\mathbf{T}^{\mathrm{val}}\mathbf{u}_i = \mathbf{T}^{\mathrm{val}}\mathbf{T}^{\mathrm{op}}\mathbf{T}^{\mathrm{val}}\mathbf{u}_i + \theta\,\mathbf{T}^{\mathrm{val}}\mathbf{u}_i = \begin{cases}\mathbf{0} & \text{for } \mathbf{u}_i \text{ orthogonal to the valid subspace,}\\ (\lambda_i + \theta)\,\mathbf{u}_i & \text{for } \mathbf{u}_i \text{ in the valid subspace.}\end{cases}$$

So we see that the eigenvalues in the valid subspace are all shifted by an amount θ. However, the relative values of the valid subspace eigenvalues are unchanged, and so the result that the dominant eigenvector u_max should be a very significant influence on the evolution of v^val still holds. In fact, hysteretic annealing enhances the dominance of u_max in the early stages of convergence, since θ is usually started sufficiently negative to ensure that only λ_max, out of all the eigenvalues, is greater than zero. Referring back to eqn (34), this means that, neglecting the action of b, only the component of v^val along u_max is developed, the others being suppressed. In the simulations reported in this paper, we have used hysteretic annealing to ensure that the best possible performance is extracted from the network.
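The two facts used above, that b is unaffected by θ and that every valid-subspace eigenvalue of A is shifted by exactly θ, are easy to verify numerically. The following sketch (Python/NumPy; not part of the original paper, using small made-up matrices) constructs a projection T^val, an offset s with T^val s = 0, and a symmetric T^op, and checks both claims:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 6

    # Projection T^val onto a 3-dimensional "valid subspace".
    B = np.linalg.qr(rng.standard_normal((n, 3)))[0]   # orthonormal basis
    T_val = B @ B.T

    # Offset s chosen orthogonal to the valid subspace, so T^val s = 0.
    s = rng.standard_normal(n)
    s = s - T_val @ s

    M = rng.standard_normal((n, n))
    T_op = (M + M.T) / 2                               # symmetric Hessian
    theta = 0.7

    def A_and_b(T_op):
        # A = T^val T^op T^val and b = T^val T^op s, as in the appendix.
        return T_val @ T_op @ T_val, T_val @ T_op @ s

    A0, b0 = A_and_b(T_op)
    A1, b1 = A_and_b(T_op + theta * np.eye(n))         # hysteretic annealing, eqn (73)

    print(np.allclose(b0, b1))                         # True: b unchanged by theta

    # Valid-subspace eigenvalues before and after, expressed in the basis B.
    ev0 = np.sort(np.linalg.eigvalsh(B.T @ A0 @ B))
    ev1 = np.sort(np.linalg.eigvalsh(B.T @ A1 @ B))
    print(np.allclose(ev1, ev0 + theta))               # True: shifted by exactly theta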

NOMENCLATURE

Matrix notation
I_n      n × n identity matrix
o_n      n-element column vector of ones
O_n      n × n matrix of ones
R^n      n × n projection matrix which removes the o_n component of a vector
δ_ij     Kronecker impulse function

Hopfield network parameters
u        internal vector of Hopfield network
v        state vector of Hopfield network
V        matrix equivalent of state vector v
T        connection matrix of Hopfield network
i^b      input bias vector of Hopfield network
E        Liapunov function of Hopfield network
g( )     transfer functions of Hopfield network
τ        decay parameter of Hopfield network
K        steepness of Hopfield network transfer functions
θ        annealing parameter

Optimization parameters
E^op     optimization objective
E^cns    constraint penalty function
E^aux    auxiliary linear problem objective
c_i      weighting given to penalty function
T^op     Hessian of optimization objective


i^op     linear coefficient vector of optimization objective
T^val    valid subspace projection matrix
s        valid subspace offset vector
S        matrix equivalent of vector s

Valid subspace/eigenvector analysis notation
v^val    component of v in valid subspace
v_0^val  value of v^val at time t = 0
A        Hessian of optimization objective for v on the valid subspace
b        linear coefficient vector of optimization objective for v on the valid subspace
λ_i      eigenvalue of A
u_i      eigenvector of A
Z        set of indices of zero eigenvalues of A
α_i      component of v^val along u_i
α_i^0    component of v_0^val along u_i
β_i      component of b along u_i
λ_max    largest positive eigenvalue of A
u_max    eigenvector of A with eigenvalue λ_max
α_max    component of v^val along u_max
α_max^0  component of v_0^val along u_max
β_max    component of b along u_max
p_k      eigenvector of Q
         eigenvalue of Q
ν_max    largest positive eigenvalue of R^n Q R^n
h_max    eigenvector of R^n Q R^n with eigenvalue ν_max
w_max    eigenvector of R^n P R^n with largest positive eigenvalue

Problem mapping notation
p        permutation vector
V(p)     permutation matrix
v(p)     vector equivalent of permutation matrix V(p)
P        first Kronecker product submatrix of T^op
Q        second Kronecker product submatrix of T^op
d(p)     permutation-dependent distance measure
C        n × 2 matrix of planar coordinates
         unknown point set
𝒬        template point set