
    A NEURAL NETWORK LEARNING ALGORITHM FOR

    ADAPTIVE PRINCIPAL COMPONENT EXTRACTION (APEX)

    S. Y. Kung and K. I. Diamantaras, Department of Electrical Engineering, Princeton University, Princeton NJ 08544

    ABSTRACT

    This paper addresses the problem of the recursive computation of the principal components of a vector stochastic process. Applications of this problem arise in the modeling of control systems, high-resolution spectrum analysis, image data compression, motion estimation, etc. We propose a new algorithm called APEX which can recursively compute the principal components using a linear neural network. The algorithm is recursive and adaptive, namely given the first m - 1 principal components it can produce the m-th component iteratively. The paper also provides the numerical theoretical basis for the fast convergence of the APEX algorithm and demonstrates its computational advantages over previously proposed methods. Extension to extracting constrained principal components via APEX is also discussed.

    1 INTRODUCTION

    The problems of data compression and pattern classification have attracted a lot of attention from researchers in various fields like pattern recognition, artificial intelligence, signal processing, etc. Both problems rely on finding an efficient representation of the input data. This representation should extract the essential information out of the original sequence of patterns. For data compression this process is a mapping from a higher dimensional (input) space to a lower dimensional (representation) space. The data compression problem is also important in modeling multilayer perceptrons, where the hidden neurons may be regarded as a layer of nodes corresponding to the most effective representation of the input patterns. Their activation patterns can be viewed as the target patterns of the representation transformation, which may facilitate the classification problem tackled by the following layer(s).

    A powerful mathematical tool for extracting such representations is Principal Component Analysis (PCA), which derives an optimal linear transformation y = Px for a given target space dimension. The optimality criterion is based on the mean square error of the reconstructed input data x̂ from the actual components y. Define R as the correlation matrix of the input: R = E{xx^T}. The rows of the optimal matrix P are the eigenvectors of R corresponding to its largest eigenvalues. This result stems from the Karhunen-Loeve Theorem, which has been extensively studied in the past.
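    For illustration only, here is a minimal NumPy sketch of the batch Karhunen-Loeve computation just described (estimating R from samples and keeping the top eigenvectors as the rows of P); the function name batch_pca and all variable names are our own choices, not part of the paper.

        import numpy as np

        def batch_pca(X, k):
            # X: M x n matrix whose rows are the input patterns x.
            # Returns the k x n matrix P whose rows are the eigenvectors of
            # R = E{x x^T} with the k largest eigenvalues.
            R = X.T @ X / X.shape[0]               # sample correlation matrix
            eigvals, eigvecs = np.linalg.eigh(R)   # ascending eigenvalues, orthonormal columns
            top = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
            return eigvecs[:, top].T               # optimal linear transformation y = P x

        # usage: P = batch_pca(X, 2); Y = X @ P.T projects the patterns onto the top-2 subspace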

    Recently, new techniques have been reported for the adaptive calculation of this transformation for a given set of random patterns [1]-[3].

    This research was supported in part by the National Science Foundation under Grant MIP-87-14689, by the Air Force Office of Scientific Research, and by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization, administered through the Office of Naval Research.

    In [1] a linear neural network is proposed (Figure 1) where only one output neuron y and n inputs x_1 ... x_n are used for the most dominant component. The activation of y is just a linear combination of the inputs with the weights q_i,

    y = Σ_i q_i x_i

    or more compactly y = q^T x, where q and x are the weight and input vectors respectively. The updating rule is

    Δq_i = β(y x_i - y² q_i).

    Oja proves that the algorithm converges and extracts the first Principal Component (PC) of the input sequence, namely in the steady state q = the normalized eigenvector of R corresponding to the largest eigenvalue. To extend Oja's method to extract more than one principal component using multiple output nodes, Sanger [2] proposed a modified method based on the following updating rule

    ΔQ = β[y x^T - LT(y y^T) Q]    (1)

    where y = Qx, LT(·) denotes the lower triangular part of a matrix (including the diagonal), and x, y are vectors with y having smaller or equal dimension to x. It is claimed [2] that the above algorithm converges to the optimal linear PCA transformation. One disadvantage of this approach is the fact that non-local information is used for the training of every neuron, resulting in a lot of redundant computations. (A computational comparison will be given in Section 4.) To avoid this problem, Foldiak [3] proposed another method that combines the Hebbian learning embedded in Oja's training rule with the competitive learning that helps the neurons extract different eigenvectors. The drawbacks of this approach are that (1) the entire set of weights has to be retrained when one additional component is needed; (2) the method does not produce the exact principal eigenvectors of R but rather a set of vectors that span the same space as the principal eigenvectors.
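    For concreteness, here is a minimal NumPy sketch of the two adaptive rules just discussed, Oja's single-neuron rule and Sanger's rule of eq.(1); the function names and the single-pattern step interface are our own illustrative choices, not the authors' code.

        import numpy as np

        def oja_step(q, x, beta):
            # Oja's single-neuron rule: dq_i = beta*(y*x_i - y^2*q_i), with y = q^T x
            y = q @ x
            return q + beta * (y * x - y * y * q)

        def sanger_step(Q, x, beta):
            # Sanger's rule (eq.(1)): dQ = beta*(y x^T - LT(y y^T) Q), with y = Q x
            y = Q @ x
            return Q + beta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ Q)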

    All the previous methods cannot effectively support a recursive approach for the calculation of the m-th principal component given the first m - 1 components. The motivation behind such an approach is the need to extract the principal components of a random vector sequence when the number of required PCs is not known a priori. It is also useful in environments where R might be slowly changing with time (e.g. in motion estimation applications). Then the new PC may be added, trying to compensate for that change without affecting the previously computed PCs. (This is similar to the idea of lattice filtering used extensively in signal processing, where for every increase in filter order one new lattice section is added to the original structure but all the old sections remain completely intact.) This idea leads to a new neural network called APEX, an acronym standing for Adaptive Principal-component Extractor, proposed in the next section.


    2 A NEW NEURAL NETWORK - APEX

    The APEX neuron model is depicted in Figure 2. There are n inputs {x_1 ... x_n} connected to m outputs {y_1 ... y_m} through the weights {p_ij}. Additionally, there are anti-Hebbian weights w_j forming a row vector w that feeds information to output neuron m from all its previous ones. We assume that the input is a stationary stochastic process whose autocorrelation matrix has n distinct positive eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n. We also assume that the first m - 1 output neurons correspond to the first m - 1 normalized principal components of the input sequence. The most important feature of APEX hinges upon the fact that the m-th neuron is able to extract the largest component which is orthogonal to the first m - 1 components represented by the already trained m - 1 neurons. This will be referred to as the orthogonal learning rule. The activation of each neuron is a linear function of its inputs

    y = Px (2)

    y_m = px + wy    (3)

    where x = [x_1 ... x_n]^T, y = [y_1 ... y_{m-1}]^T, P is the matrix of the p_ij weights for the first m - 1 neurons, and p is the row vector of the p_mj weights of the m-th neuron.

    The algorithm for the m-th neuron is

    Δp = β(y_m x^T - y_m² p)    (4)

    Δw = -γ(y_m y^T + y_m² w)    (5)

    where β and γ are two positive (equal or different) learning rate parameters. If we expand the above equations for each individual weight we get the following equalities

    Δp_mj = β(y_m x_j - y_m² p_mj),  j = 1 ... n

    Figure 1. Oja's simplified neuron model.

    Figure 2. The linear multi-output model. The solid lines denote the weights p_mj, w_j which are trained at the m-th stage. (Note that {w_j} asymptotically approach zero as the network converges.)

    Δw_j = -γ(y_m y_j + y_m² w_j),  j = 1 ... m - 1

    Notice that the first equation is the same as Oja's adaptive rule, which is the Hebbian part of the algorithm. We shall show that it has the effect of driving the neuron towards more dominant components. The second equation represents what we call the orthogonal learning rule. It is basically a reverse Oja rule, i.e. it has a similar form except for the opposite signs of the terms. The w-weights play the role of subtracting the first m - 1 components from the m-th neuron. Thus, the m-th output neuron tends to become orthogonal to (rather than correlated with) all the previous components. Hence the orthogonal learning rule constitutes an anti-Hebbian rule. It is hoped that the combination of the two rules produces the m-th principal component. This will be proved by the numerical theoretical analysis in Section 3, and demonstrated by simulations in Section 4. The orthogonal learning rule also has a very important application to the problem of extracting constrained principal components, as briefly discussed in Section 5 and elaborated in a future paper [4].
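    A minimal NumPy sketch of eqs.(2)-(5) for one input pattern, assuming the first m - 1 neurons are already trained and frozen; the function name apex_step and its vectorized form are our own rendering of the rules above, not code from the paper.

        import numpy as np

        def apex_step(x, P, p, w, beta, gamma):
            # One training step of the m-th APEX neuron for a single pattern x.
            #   P : (m-1) x n matrix of frozen weights of the already-trained neurons
            #   p : length-n weight vector of the m-th neuron
            #   w : length-(m-1) vector of anti-Hebbian lateral weights
            y  = P @ x                                 # eq.(2): outputs of the first m-1 neurons
            ym = p @ x + w @ y                         # eq.(3): output of the m-th neuron
            p  = p + beta  * (ym * x - ym * ym * p)    # eq.(4): Hebbian (Oja-like) part
            w  = w - gamma * (ym * y + ym * ym * w)    # eq.(5): orthogonal (anti-Hebbian) part
            return p, w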

    3 NUMERICAL ANALYSIS PROOF OF THE ALGORITHM

    Assume that neurons 1 through m - 1 have converged to the first m - 1 principal components, i.e. P = [e_1 ... e_{m-1}]^T where e_1, ..., e_{m-1} are the first m - 1 normalized eigenvectors of R, and let p(t) = Σ_i θ_i(t) e_i^T, where t is the number of sweeps. (One sweep means one round of the training process involving all the given sample input patterns.) Let M be the number of all the input patterns. We shall divide the analysis into two parts: Part I for the analysis of the old first (m - 1) principal modes; Part II for the remaining new (i.e. m-th, ..., n-th) modes.
    Part I: By averaging eq.(4) over one sweep (sweep t) and assuming p(t) approximately constant in this period of time, we can derive the following formula

    p(t + 1) = p(t) + β̄[(p(t) + w(t)P)R - σ(t)p(t)]    (6)

    where σ(t) = E{y_m²(t)},  β̄ = Mβ    (7)

    For simplicity, in the following we will drop the index t from σ(t). By focusing on the eigenmodes we can derive the updating rule for θ_i:

    θ_i(t + 1) = [1 + β̄(λ_i - σ)]θ_i(t) + β̄λ_i w_i(t)    (8)

    By the same token eq.(5) becomes

    w_i(t + 1) = -γ̄λ_i θ_i(t) + [1 - γ̄(λ_i + σ)]w_i(t)    (9)

    where γ̄ = Mγ.

    Refer to [4] for a more detailed derivation of eqs.(8) and (9). We rewrite the above dynamic equations in matrix form:

    [ θ_i(t+1) ]   [ 1 + β̄(λ_i - σ)       β̄λ_i        ] [ θ_i(t) ]
    [ w_i(t+1) ] = [     -γ̄λ_i        1 - γ̄(λ_i + σ)  ] [ w_i(t) ]    (10)

    When β = γ (β̄ = γ̄) the system matrix has a double eigenvalue at

    ρ_i(t) = 1 - β̄σ(t)    (11)

    which is less than 1 as long as β is a small positive number. Hence all θ_i and w_i tend asymptotically to 0 at the same speed (since this eigenvalue is the same for all the modes).
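    The double-eigenvalue statement can be checked in one line from eq.(10): setting γ̄ = β̄ and expanding the characteristic polynomial of the system matrix gives (our own verification, added for completeness)

        \det\begin{pmatrix} 1+\bar\beta(\lambda_i-\sigma)-\mu & \bar\beta\lambda_i \\ -\bar\beta\lambda_i & 1-\bar\beta(\lambda_i+\sigma)-\mu \end{pmatrix}
        = \bigl[(1-\bar\beta\sigma-\mu)+\bar\beta\lambda_i\bigr]\bigl[(1-\bar\beta\sigma-\mu)-\bar\beta\lambda_i\bigr]+\bar\beta^2\lambda_i^2
        = (1-\bar\beta\sigma-\mu)^2 ,

    so μ = 1 - β̄σ with multiplicity two, independently of λ_i, which is eq.(11).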

    Very importantly, the above relationship between β and γ may be exploited to select a proper learning rate β which guarantees a very fast convergence speed. Note that

    β = γ = 1/(Mσ)    (12)

    is an optimal value (see eq.(11)). One way to estimate σ is to take the average of y_m² over every sweep.

    If β ≠ γ then there is extra flexibility in choosing the speed of the decay of every mode i (i = 1, ..., m - 1). The control over the decay speed becomes even stronger if we introduce different parameters γ_i for the different w_i (i.e. for the different modes), thus selectively suppressing some modes more rapidly (or slowly) than others.
    Part II: Here we consider only modes i, for i ≥ m. We follow the same steps as above, but now w is removed from eq.(8) since there is no influence from w on those modes. (This is because the old nodes {y_1, y_2, ..., y_{m-1}} contain only the first m - 1 components.) Hence, we have a very simple equation for every mode i ≥ m:

    θ_i(t + 1) = [1 + β̄(λ_i - σ)]θ_i(t)    (13)

    According to Part I, θ_i and w_i will eventually converge to 0 (for i = 1, ..., m - 1), and in that case we will have

    σ(t) = Σ_{i=m}^{n} λ_i θ_i(t)²    (14)

    Therefore eq.(13) cannot diverge, since whenever θ_i becomes so large that σ > λ_i, then 1 + β̄(λ_i - σ) < 1 and θ_i will decrease in magnitude. Assume that θ_m(0) ≠ 0 (with probability 1), and for i = m+1, ..., n consider the ratios r_i(t) = θ_i(t)/θ_m(t); then by eq.(13)

    r_i(t + 1) = [(1 + β̄(λ_i - σ))/(1 + β̄(λ_m - σ))] r_i(t)

    For the convenience of the proof, let us assume that the eigenvalues exhibit a strict inequality relationship, i.e. λ_1 > λ_2 > ... > λ_n. (The general case can be proved along exactly the same lines.) In this case λ_m > λ_i for i > m, so the ratio factor is smaller than 1 and

    r_i(t) → 0 as t → ∞    (15)

    Since θ_m will remain bounded according to eqs.(13) and (14), eq.(15) further implies that θ_i(t) → 0 when t → ∞, for i = m+1, ..., n. Then according to eq.(14), σ becomes λ_m θ_m² and eq.(13) for i = m becomes

    θ_m(t + 1) = [1 + β̄λ_m(1 - θ_m(t)²)]θ_m(t)    (16)

    therefore,

    θ_m(t) → 1 as t → ∞    (17)

    Hence the m-th normalized component will be extracted. (For more details refer to [4].) Note that eqs.(15) and (17) together imply that

    σ(t) → λ_m as t → ∞    (18)
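    The limits in eqs.(15)-(18) are easy to observe numerically by iterating the scalar recursions (13)-(14) directly; the following self-contained NumPy check uses eigenvalues and an effective step size of our own choosing, purely for illustration.

        import numpy as np

        # Iterate eqs.(13)-(14) for m = 1 with illustrative eigenvalues (our own choice of numbers).
        lam   = np.array([3.0, 2.0, 1.0, 0.5])   # distinct eigenvalues, lambda_1 the largest
        theta = np.full(4, 0.3)                  # arbitrary start with theta_m(0) != 0
        beta  = 0.1                              # plays the role of the per-sweep rate (beta-bar)
        for t in range(300):
            sigma = np.sum(lam * theta**2)                # eq.(14)
            theta = (1.0 + beta * (lam - sigma)) * theta  # eq.(13)
        print(theta)  # ~ [1, 0, 0, 0]: theta_m -> 1 (eq.(17)), the other modes -> 0 (eq.(15))
        print(sigma)  # ~ 3.0 = lambda_m, i.e. sigma -> lambda_m (eq.(18))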

    4 COMPUTATIONAL EFFICIENCY AND SIMULATION RESULTS

    Based on the above theoretical analysis, the APEX algorithm is formally presented below:

    4.1 APEX Algorithm

    For every neuron m = 1 to N (N ≤ n):

    1. Initialize p and w to some small random values.

    2. Choose β, γ as in eq.(12) (see Section 4.2).

    3. Update p and w according to eqs.(4), (5), until Δp and Δw are below a certain threshold. (A code sketch of this procedure is given below.)
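    A minimal NumPy sketch of this recursive procedure, reusing the apex_step update sketched in Section 2; the convergence test, the sweep budget, and the crude scaling used for the very first neuron are our own illustrative choices, not prescriptions from the paper. The learning rate follows the 1/(Mλ_{m-1}) estimate discussed in Section 4.2.

        import numpy as np

        def apex_train(X, N, sweeps=500, tol=1e-6):
            # X: M x n matrix of input patterns; extracts N <= n principal components.
            M, n = X.shape
            P = np.zeros((0, n))                          # rows: already-extracted components
            lam_prev = np.mean(np.sum(X**2, axis=1)) / n  # crude scale for the first neuron (our choice)
            for m in range(1, N + 1):
                p = 0.01 * np.random.randn(n)             # step 1: small random initial weights
                w = 0.01 * np.random.randn(m - 1)
                beta = gamma = 1.0 / (M * lam_prev)       # step 2: beta = gamma = 1/(M*lambda_{m-1})
                for _ in range(sweeps):                   # step 3: sweep until the updates are small
                    p_old = p.copy()
                    for x in X:
                        p, w = apex_step(x, P, p, w, beta, gamma)
                    if np.linalg.norm(p - p_old) < tol:
                        break
                lam_prev = np.mean((X @ p)**2)            # lambda_m estimate: average of y_m^2 over a sweep
                P = np.vstack([P, p])                     # freeze the m-th neuron (w has decayed toward 0)
            return P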

    4.2 Comparisons in Computational Efficiency

    The APEX algorithm compares very favorably with all the previous ones and amounts to several orders of magnitude of computational saving. The claim can be supported by three different aspects:

    1. Efficiency in Recursively Computing New PCs

    For the recursive computation of each additional PC, the APEX method requires O(n) multiplications per iteration. In contrast, Foldiak's method would require computation of all the PCs (including the previously computed higher components), so it requires O(mn) operations.

    2. Efficiency by Using a Local (i.e. Lateral-Node) Algorithm

    The APEX algorithm also enjoys one order of magnitude of computational advantage over Sanger's method. More precisely, for the recursive computation of each additional PC, Sanger's method (cf. eq.(1)) requires (m + 1)n multiplications per iteration for the m-th neuron, as opposed to 2(m + n - 1) multiplications per iteration in APEX. This significant computational saving stems from the fact that APEX uses only local (i.e. lateral) y nodes, which summarize the useful information for our orthogonal (reverse-Oja) training, thus avoiding the (redundant) repeated multiplications with the synaptic weights {p_ij} that arise when nonlocal (i.e. x) nodes are used, as is the case in Sanger's method.

    3. Reduction of Iteration Steps by Adopting an Optimal Learning Rate β

    More importantly, our analysis provides a very powerful tool for estimating the optimal value for β, γ, as given in eq.(12). As another attractive alternative, instead of eq.(12) we propose to set

    β = γ = 1/(Mλ_{m-1}).

    This is an underestimated value of β as suggested by eq.(12), since λ_{m-1} > λ_m and lim_{t→∞} σ = λ_m. A fairly precise estimate of λ_{m-1} can be achieved by averaging the (squared) output of the previous neuron over one sweep. This calculation needs to be done only once for every neuron. (In the case of training the very first neuron, a common scaling scheme may be adopted.)

    4.3 Simulation Results

    In our simulations, the number of random input patterns is M = 20 and the input dimension is 5. For the case of β = γ we use the value 1/(Mλ_{m-1}) as discussed in the previous paragraph. Figure 3(a),(b) shows the absolute error between Average(y_m²) and the actual λ_m, and the square distance between the calculated vectors and the actual component vectors, as functions of the sweep number. The convergence is extremely fast, as expected from the theoretical analysis. The results are very close to the actual components even for small eigenvalues, and are almost perfectly normalized and orthogonal to each other.

    Using the above values for β and γ the algorithm converges with a relatively large error. In Table 1 we show the speed of convergence and the final square distance between the calculated principal component vectors and the actual components. An obvious solution to this problem is to adopt more conservative values for the learning parameters in the fine-tuning phase (after the PC has already reached a certain very close neighborhood). Table 1 summarizes the results when the values 1/(Mλ_{m-1}f) are used, with f = 1, 5, 10. The convergence is slower but towards a more accurate solution as f gets larger.

    We can achieve both favorable convergence speed and accuracy by adopting the following compromise: (a) in the initial phase, the algorithm starts with f = 1 for fast (but rough) convergence, and (b) in the fine-tuning phase, f will be increased to achieve higher precision until a desired accuracy threshold is reached.

    factor f | Average sweep number where the square distance | Average final square
             | is within 5% of the final value                | distance (x 10^-3)
        1    |  21                                            |  1.23
        5    | 118                                            |  0.34
       10    | 194                                            |  0.32

    Table 1. The performance for different learning rates.

    5 EXTENDING APEX TO EXTRACT CONSTRAINED PRINCIPAL COMPONENTS


    In the above we have presented a new neural network (APEX) for the recursive calculation of the principal components of a stochastic process. The new approach offers a considerable amount of computational advantage over the previous approaches. Some typical application examples include image data compression, and SVD applications for modeling, spectral estimation or high-resolution direction finding [5].

    APEX also uniquely introduces a new application domain (which cannot be handled by any of the previously known algorithms) in its ability to extract constrained principal components. The problem is to extract the most significant components within a constrained space. (Such a problem arises in certain anti-jamming sensor array applications and image feature extraction applications.) In this case, the old nodes {y_1, y_2, ..., y_{m-1}} are not necessarily principal components of the input data. More generally, they may represent a set of arbitrary constraining vectors to which the search space is orthogonal. (For example, in the anti-jamming application, they represent the directions of the jamming signals.)

    Here we shall show that in the steady state y_m will be orthogonal to all y_i, i = 1, ..., m - 1. Suppose that y is a linear combination of x: y = Ax, where the rows of A are not necessarily principal components, and AA^T = I. Assuming the algorithm converges, the vector corresponding to y_m (i.e. (p + wA)) is orthogonal to the rows of A. This can be readily seen by setting the expectations of eqs.(4) and (5) to 0. Multiplying eq.(4) from the right by A^T and adding it to eq.(5), we obtain the result sought:

    (p + wA)A^T = 0

    This justifies the name orthogonal learning rule. The mathematical derivation of the convergence proof follows very closely that used in Section 3, and the reader is referred to [4] for more details.
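    Spelled out, the steady-state argument reads as follows (our own expansion of the one-line derivation above, with σ = E{y_m²}, y = Ax and AA^T = I):

        \begin{aligned}
        0 = E\{\Delta p\} &= \beta\bigl(E\{y_m x^T\} - \sigma p\bigr) &&\Rightarrow\; E\{y_m x^T\}A^T = \sigma\,p A^T = E\{y_m y^T\},\\
        0 = E\{\Delta w\} &= -\gamma\bigl(E\{y_m y^T\} + \sigma w\bigr) &&\Rightarrow\; E\{y_m y^T\} = -\sigma w ,
        \end{aligned}

    so σ p A^T = -σ w; dividing by σ > 0 gives pA^T + w = 0, and hence (p + wA)A^T = pA^T + wAA^T = pA^T + w = 0.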


    REFERENCES

    [1] E. Oja, "A Simplified Neuron Model as a Principal Component Analyzer," J. Math. Biology, vol. 15, pp. 267-273, 1982.

    [2] T. D. Sanger, "An Optimality Principle for Unsupervised Learning," in Advances in Neural Information Processing Systems, vol. 1, pp. 11-19 (D. S. Touretzky, editor).

    [3] P. Foldiak, "Adaptive Network for Optimal Linear Feature Extraction," IJCNN, pp. I-401-I-406, Washington DC, 1989.

    [4] S. Y. Kung, "Adaptive Principal Component Analysis via an Orthogonal Learning Network," Proceedings, Int. Symp. on Circuits and Systems, New Orleans, May 1990.

    [5] S. Y. Kung, D. V. Bhaskar Rao, and K. S. Arun, "Spectral Estimation: From Conventional Methods to High Resolution Modeling Methods," in VLSI and Modern Signal Processing, pp. (S. Y. Kung, et al., editors).

    Figure 3. For β = γ = 1/(Mλ_{m-1}) the convergence speed is very fast for both (a) the computation of the eigenvalues and (b) the computation of the eigenvectors (curves shown for neurons 1-4).

