
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-19, NO. 4, APRIL 1970

Application of the Karhunen-Loève Expansion to Feature Selection and Ordering

    KEINOSUKE FUKUNAGA, MEMBER, IEEE, AND WARREN L. G. KOONTZ, MEMBER, IEEE

Abstract—The Karhunen-Loève expansion has been used previously to extract important features for representing samples taken from a given distribution. A method is developed herein to use the Karhunen-Loève expansion to extract features relevant to classification of a sample taken from one of two pattern classes. Numerical examples are presented to illustrate the technique.

Also, it is shown that the properties of the proposed technique can be applied to unsupervised clustering, where given samples are classified into two classes without a priori knowledge of the class.

Index Terms—Clustering, feature extraction, feature selection, Karhunen-Loève expansion, pattern recognition, unsupervised learning.

    INTRODUCTION

THE Karhunen-Loève expansion is a well-known technique for representing a sample function of a random process [1]-[6]. It has been shown to be optimal in that the mean-square error committed by approximating the infinite series with a finite number of terms is minimized [1], [2]. Thus the Karhunen-Loève expansion extracts a set of features that is optimal with respect to representing a pattern class whose observable is a random process.

Watanabe [1], [2] discusses application of the Karhunen-Loève expansion to the representation of a pattern class. Chien and Fu [3] derive a necessary condition under which the random processes of more than one class can be represented by a single expansion, assuming zero means for all processes. They present further discussion and a character recognition example in [4].

In pattern recognition, however, we wish to extract features which represent the difference between one pattern class and another. These features do not necessarily coincide with the important features for representing the pattern classes. In particular, two classes may have principal features which are similar. Moreover, different sets of features for different classes add further difficulty.

Several approaches to extracting features for recognition appear in the literature. Kadota and Shepp [7] and Kullback [8] give results for discriminating between two zero-mean Gaussian random processes. Their approach is to find the linear transformation which maximizes divergence. Fukunaga and Krile [9] find the transformation which simultaneously diagonalizes the covariance matrices in a two-class problem. Tou and Heyden proposed an iteration procedure to find the transformation which maximizes the divergence in the reduced dimension [10].

It is not our purpose here to derive optimal features for recognition via the Karhunen-Loève expansion. We will show, however, that if we transform our coordinate system before applying the expansion, we may extract "good" features for recognition. Thus we will in one way relate the pattern representation problem to the pattern recognition problem.

An important byproduct of this investigation is the development of an algorithm for "unsupervised" clustering. Here, a set of samples is dichotomized using only the a priori probabilities of the two classes assumed to exist.

KARHUNEN-LOÈVE EXPANSION FOR TWO-CLASS PROBLEMS

Let us, first of all, briefly introduce the Karhunen-Loève expansion. A set of time functions f_i(t) (i = 1, 2, ..., N) can be expanded as a linear combination of basis functions φ_j(t) (j = 1, 2, ...) as follows:

f_i(t) = \sum_{j=1}^{\infty} a_{ij}\,\phi_j(t).    (1)

The basis functions are obtained by solving the integral equation

\lambda_j \phi_j(t) = \int_{-\infty}^{\infty} R(t, \tau)\,\phi_j(\tau)\,d\tau    (2)

where R(t, τ) is the autocorrelation function of the f_i(t)'s and is given by

R(t, \tau) = E[f_i(t)\,f_i(\tau)]    (3)

where E[·] denotes expectation over the N samples.

Suppose we measure and use only n discrete sampling points to express each time function, instead of the original continuous function. Then the functions may be represented as column vectors

F_i = [f_i(t_1)\;\; f_i(t_2)\;\cdots\; f_i(t_n)]^T    (i = 1, 2, \ldots, N).    (4)

Manuscript received March 10, 1969; revised August 12, 1969. This work was supported in part by the Joint Services Program under ARPA Contract N00014-67-A-0226 Modified AE, and the Naval Ship Systems Command under Contract N000-68-C-1193. The authors are with the School of Electrical Engineering, Purdue University, Lafayette, Ind. 47907.


Equation (1) becomes a finite sum

F_i = \sum_{j=1}^{n} a_{ij}\,\Phi_j    (5)

where Φ_j is also a vector representation of the jth basis function,

\Phi_j = [\phi_j(t_1)\;\; \phi_j(t_2)\;\cdots\; \phi_j(t_n)]^T    (j = 1, 2, \ldots, n).    (6)

Also, (2) and (3) are replaced by

\lambda_j \Phi_j = S\,\Phi_j    (7)

and

S = E[F_i F_i^T]    (8)

where F_i^T is the transposed vector of F_i. Equations (7) and (8) show that S is an autocorrelation matrix, and that λ_j and Φ_j are the jth eigenvalue and eigenvector of S. The number of sampling points, n, should be determined independently so as to retain enough information to reconstruct the original time functions.

Thus, when two sets of time functions, F_i^(1) and F_i^(2), are expanded by two sets of basis functions, Φ_j^(1) and Φ_j^(2), the F_i^(1) and F_i^(2) are expressed by

F_i^{(k)} = \sum_{j=1}^{n} a_{ij}^{(k)}\,\Phi_j^{(k)}    (k = 1, 2;\; i = 1, 2, \ldots, N_k)    (9)

where it is assumed that the number of sampling points is n for both class 1 and class 2. We will use ω_1 and ω_2 in this paper to represent class 1 and class 2. Since the basis functions are the eigenvectors of the autocorrelation matrix, as shown in (7), they are mutually orthonormal, i.e.,

\Phi_i^{(k)T}\,\Phi_j^{(k)} = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i \neq j. \end{cases}    (10)

Thus, the coefficients of the expansion a_{ij}^(k) can be obtained by

a_{ij}^{(k)} = \Phi_j^{(k)T} F_i^{(k)}    (11)

or, in terms of matrix notation,

A_i^{(k)} = B^T F_i^{(k)}    (12)

where

A_i^{(k)} = [a_{i1}^{(k)}\;\; a_{i2}^{(k)}\;\cdots\; a_{in}^{(k)}]^T \quad \text{and} \quad B = [\Phi_1\;\;\Phi_2\;\cdots\;\Phi_n].    (13)
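A short numerical sketch (in Python with NumPy) may make the discrete expansion concrete; the sample values, the sizes n and N, and the variable names below are ours and serve only as an illustration of (5)-(13):

    # Minimal sketch of the discrete Karhunen-Loeve expansion (5)-(13) with
    # made-up data: estimate S, obtain the basis B, and recover each sample
    # from its coefficients.
    import numpy as np

    rng = np.random.default_rng(0)
    n, N = 8, 500                        # sampling points per function, number of samples
    F = rng.standard_normal((n, N))      # stand-in sample vectors F_1, ..., F_N as columns

    S = (F @ F.T) / N                    # autocorrelation matrix S = E[F_i F_i^T], (8)
    lam, B = np.linalg.eigh(S)           # eigenvalues lambda_j and eigenvectors Phi_j, (7)
    order = np.argsort(lam)[::-1]        # arrange the eigenvalues in descending order
    lam, B = lam[order], B[:, order]     # B = [Phi_1 Phi_2 ... Phi_n], (13)

    A = B.T @ F                          # coefficients A_i = B^T F_i for every sample, (12)
    F_reconstructed = B @ A              # the full n-term expansion (5) reproduces F_i
    assert np.allclose(F_reconstructed, F)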

When n basis functions are used, the F_i^(k)'s are expanded without error. However, if we select fewer, say n_k, basis functions, the expansion becomes an approximation of the F_i^(k)'s. The mean-square error is given by

\bar{\epsilon}^2 = \sum_{k=1}^{2} E\Big[\Big(F_i^{(k)} - \sum_{j \in J_k} a_{ij}^{(k)}\Phi_j^{(k)}\Big)^T \Big(F_i^{(k)} - \sum_{j \in J_k} a_{ij}^{(k)}\Phi_j^{(k)}\Big)\Big]\,P(\omega_k)    (14)

where P(ω_k) is the a priori probability of ω_k, and J_k is the set of n_k j's such that Φ_j^(k) is used in the expansion. Using (10) and (11), we see that (14) becomes

\bar{\epsilon}^2 = \sum_{k=1}^{2} \sum_{j \notin J_k} \Phi_j^{(k)T} S_k \Phi_j^{(k)}    (15)

where S_k is the autocorrelation matrix multiplied by the a priori probability, i.e.,

S_k = P(\omega_k)\,E[F_i^{(k)} F_i^{(k)T}]    (k = 1, 2).    (16)

The mean-square error given by (15) can be minimized when the Φ_j^(1)'s and Φ_j^(2)'s are the eigenvectors of S_1 and S_2, respectively, and the minimized ε̄² can be expressed as the sum of the eigenvalues whose eigenvectors are not selected to approximate the F_i^(k)'s. Hence

\bar{\epsilon}^2 = \sum_{j \notin J_1} \lambda_j^{(1)} + \sum_{j \notin J_2} \lambda_j^{(2)}.    (17)

The λ_j^(1)'s and λ_j^(2)'s are the eigenvalues of S_1 and S_2, respectively.

Thus, the basis functions can be selected in order of eigenvalues until ε̄² of (17) becomes smaller than some specified value.

From the above discussion we see that the Karhunen-Loève expansion is a powerful technique to represent or estimate a set of time functions. In particular, the mechanism of extracting features from the set of time functions suggests that we can apply the technique to pattern recognition. However, one of the big disadvantages of the technique, as far as pattern recognition is concerned, is the fact that it does not necessarily select important features for separating pattern classes. When two pattern classes share similar important features, the corresponding eigenvalues are "large" and dominant. But these eigenvectors or basis functions are not important for separating the pattern classes. Therefore, a modification of the Karhunen-Loève expansion is needed in order to apply it to pattern recognition; this problem is discussed in the next section.
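The error expression (14)-(17) can also be checked numerically. In the sketch below, the two classes, their priors, and the number of retained basis functions are made-up illustrative choices of ours; the point is only that the mean-square error of each truncated expansion equals the sum of the omitted eigenvalues of S_k:

    # Numerical check of (14)-(17): for each class, the mean-square error of the
    # truncated expansion equals the sum of the eigenvalues of S_k whose
    # eigenvectors were omitted.
    import numpy as np

    rng = np.random.default_rng(1)
    n, N1, N2 = 6, 400, 400
    P1, P2 = 0.5, 0.5                                   # a priori probabilities
    F1 = rng.standard_normal((n, N1)) * np.linspace(2.0, 0.5, n)[:, None]
    F2 = rng.standard_normal((n, N2)) * np.linspace(0.5, 2.0, n)[:, None]

    total_error = 0.0
    for F, P, n_k in ((F1, P1, 3), (F2, P2, 3)):        # keep n_k basis functions per class
        S_k = P * (F @ F.T) / F.shape[1]                # S_k of (16)
        lam, Phi = np.linalg.eigh(S_k)
        lam, Phi = lam[::-1], Phi[:, ::-1]              # descending eigenvalues
        A = Phi[:, :n_k].T @ F                          # coefficients of the retained terms
        residual = F - Phi[:, :n_k] @ A                 # approximation error of (14)
        mse = P * np.mean(np.sum(residual ** 2, axis=0))
        assert np.isclose(mse, lam[n_k:].sum())         # class-k term of (17)
        total_error += mse
    print("mean-square error:", total_error)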

NORMALIZATION PROCESS

In this section, a kind of normalization process is proposed in order to extract the important features for separating two pattern classes. The notation used in what follows is in accordance with Table I.



TABLE I

Sample vectors (i = 1, ..., N; k = 1, 2):
    original coordinate system  X_i^(k);
    normalized coordinate system (transform matrix P)  F_i^(k) = P X_i^(k), with P of (19);
    coefficients of basis vectors (transform matrix B)  A_i^(k) = B^T F_i^(k), with B of (13).
Mean vectors (k = 1, 2):
    M_k;    D_k = P M_k;    L_k = B^T D_k.
Autocorrelation matrices (k = 0, 1, 2):
    R_k = P(ω_k)(Σ_k + M_k M_k^T);    S_k = P R_k P^T = P(ω_k)(K_k + D_k D_k^T), as in (16);    Q_k = B^T S_k B = P(ω_k)(Θ_k + L_k L_k^T).
Covariance matrices (k = 1, 2):
    Σ_k;    K_k = P Σ_k P^T;    Θ_k = B^T K_k B.

Let us assume that the autocorrelation matrices of the two pattern classes, premultiplied by the a priori probabilities as defined in (16), are R_1 and R_2, and that the autocorrelation matrix of the mixture of both classes, R_0, is

R_0 = R_1 + R_2.    (18)

Let us find a transformation matrix P which converts R_0 into the identity matrix, i.e.,

S_0 = P R_0 P^T = I.    (19)

Since the same transformation carries R_1 and R_2 into S_1 = P R_1 P^T and S_2 = P R_2 P^T (Table I), (18) and (19) give

S_1 + S_2 = I.    (20)

Let λ_j^(1) and Φ_j^(1) be the jth eigenvalue and eigenvector of S_1, i.e.,

S_1 \Phi_j^{(1)} = \lambda_j^{(1)} \Phi_j^{(1)}    (j = 1, 2, \ldots, n).    (21)

Assume that the eigenvalues are in descending order, i.e.,

\lambda_1^{(1)} \ge \lambda_2^{(1)} \ge \cdots \ge \lambda_n^{(1)}.    (22)

Then, the eigenvalues and eigenvectors for the second class are given by

S_2 \Phi_j^{(2)} = (I - S_1)\,\Phi_j^{(2)} = \lambda_j^{(2)} \Phi_j^{(2)}    (23)

or

S_1 \Phi_j^{(2)} = (1 - \lambda_j^{(2)})\,\Phi_j^{(2)}.    (24)

From (21) and (24),

\Phi_j^{(2)} = \Phi_j^{(1)}    (25)

and

\lambda_j^{(1)} = 1 - \lambda_j^{(2)}.    (26)

Equations (25) and (26) show that both pattern classes have the same set of eigenvectors, and that the corresponding eigenvalues are reversely ordered. That is, the eigenvector with the largest eigenvalue for class 1 has the smallest eigenvalue for class 2, and so on. On the other hand, from (17), the physical meaning of an eigenvalue is the mean-square error committed when the corresponding eigenvector is eliminated. Therefore, all eigenvalues for both classes should be positive. This fact yields the following relationship for the eigenvalues:

1 \ge \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0 \quad \text{for } \omega_1    (27)

0 \le 1 - \lambda_1 \le 1 - \lambda_2 \le \cdots \le 1 - \lambda_n \le 1 \quad \text{for } \omega_2.    (28)

Here the superscripts which were used to indicate the class are dropped to yield a simpler expression.

Thus, after the normalizing transformation, the important features or effective basis functions for class 1 are the least important features for class 2, and the important features for class 2 are the least important for class 1. Both classes cannot share common important features. Therefore, the selection of the basis functions can be performed by taking Φ_1^(1), Φ_2^(1), ... for class 1 and Φ_n^(1), Φ_{n-1}^(1), ... for class 2.
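The following sketch illustrates (18)-(26) numerically. The text only requires that P satisfy P R_0 P^T = I; taking P as the inverse symmetric square root of R_0 is one admissible choice and is our own, as are the made-up class samples:

    # Sketch of the normalization (18)-(26): whiten the mixture autocorrelation
    # matrix, then check that S1 and S2 share eigenvectors and that their
    # eigenvalues satisfy lambda^(2) = 1 - lambda^(1).
    import numpy as np

    rng = np.random.default_rng(2)
    n, N = 5, 1000
    X1 = rng.standard_normal((n, N)) + 1.0              # made-up class 1 samples
    X2 = 0.5 * rng.standard_normal((n, N)) - 1.0        # made-up class 2 samples

    R1 = 0.5 * (X1 @ X1.T) / N                          # a priori probabilities included, (16)
    R2 = 0.5 * (X2 @ X2.T) / N
    R0 = R1 + R2                                        # mixture autocorrelation matrix, (18)

    w, V = np.linalg.eigh(R0)
    P = V @ np.diag(w ** -0.5) @ V.T                    # one choice of P with P R0 P^T = I, (19)
    S1, S2 = P @ R1 @ P.T, P @ R2 @ P.T
    assert np.allclose(S1 + S2, np.eye(n))              # (20)

    lam1, Phi = np.linalg.eigh(S1)                      # eigenvalues and eigenvectors of S1
    assert np.allclose(S2 @ Phi, Phi * (1.0 - lam1))    # same eigenvectors for S2, (23)-(25)
    lam2 = np.linalg.eigvalsh(S2)
    assert np.allclose(np.sort(lam2), np.sort(1.0 - lam1))   # lambda^(2) = 1 - lambda^(1), (26)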

Unfortunately this technique does not extend to the general multiclass problem. However, the multiclass pattern classifier may be realized as a sequence of pairwise comparisons of likelihood functions. In this case each pair may be examined using the expansion described above.

Equal Covariance Cases

In order to show some properties of these eigenvalues and eigenvectors, let us take a special case where both pattern classes have the same covariance matrices and the same a priori probabilities.

As is shown in Table I, after normalization the autocorrelation matrices S_1 and S_2 can be expressed in terms of the covariance matrices, K_1 and K_2, and the mean vectors, D_1 and D_2, as follows:

S_i = P(\omega_i)(K_i + D_i D_i^T)    (i = 1, 2).    (29)

Therefore, (20) becomes

P(\omega_1)(K_1 + D_1 D_1^T) + P(\omega_2)(K_2 + D_2 D_2^T) = I.    (30)

When K_2 = K_1 and P(\omega_1) = P(\omega_2) = 0.5,

S_1 = \tfrac{1}{2}\{I - \tfrac{1}{2}(D_2 D_2^T - D_1 D_1^T)\}    (31)

or

S_1 = \tfrac{1}{2}\{I - (D_0 D^T + D D_0^T)\}    (32)


where D_0 and D are, respectively, the mean vector of the mixture and the vector from D_0 to the mean vector of class 2, i.e.,

D_1 = D_0 - D    (33)
D_2 = D_0 + D    (34)

or

D_0 = \tfrac{1}{2}(D_2 + D_1)    (35)
D = \tfrac{1}{2}(D_2 - D_1).    (36)

The eigenvalues and eigenvectors of S_1 follow from

S_1 \Phi_j = \tfrac{1}{2}\{I - (D_0 D^T + D D_0^T)\}\,\Phi_j = \lambda_j \Phi_j    (37)

or

(D_0 D^T + D D_0^T)\,\Phi_j = (1 - 2\lambda_j)\,\Phi_j.    (38)

Therefore, (D_0 D^T + D D_0^T) has the same eigenvectors as S_1, and its eigenvalues μ_j are related to the λ_j as follows:

\lambda_j = 0.5 - 0.5\,\mu_j.    (39)

Since the rank of the matrix (D_0 D^T + D D_0^T) is 2, n − 2 of its eigenvalues are zero, and the other eigenvalues can easily be obtained by solving the following equations:

\mu_2 = \mu_3 = \cdots = \mu_{n-1} = 0    (40)

\mu_1 + \mu_n = \mathrm{tr}\,[D_0 D^T + D D_0^T] = 2 d_0 d \cos\theta    (41)

\mu_1^2 + \mu_n^2 = \mathrm{tr}\,\{[D_0 D^T + D D_0^T]^2\} = 2 d_0^2 d^2 (1 + \cos^2\theta)    (42)

where

d_0 = \sqrt{D_0^T D_0} \quad (\text{length of the vector } D_0)    (43)
d = \sqrt{D^T D} \quad (\text{length of the vector } D)    (44)

and

\cos\theta = D^T D_0 / (d_0 d) \quad (\text{angle between } D_0 \text{ and } D).    (45)

The solution of (41) and (42) is

\mu_1 = d_0 d (\cos\theta - 1) \le 0    (46)
\mu_n = d_0 d (\cos\theta + 1) \ge 0.    (47)

Thus, when K_1 = K_2 and P(\omega_1) = P(\omega_2) = 0.5, (D_0 D^T + D D_0^T) has n − 2 zero eigenvalues, μ_2, ..., μ_{n−1}, and one negative and one positive eigenvalue, μ_1 and μ_n. From (39), S_1 has the following eigenvalues:

\lambda_1 = 0.5 + 0.5\,d_0 d\,(1 - \cos\theta)    (48)
\lambda_2 = \lambda_3 = \cdots = \lambda_{n-1} = 0.5    (49)
\lambda_n = 0.5 - 0.5\,d_0 d\,(1 + \cos\theta).    (50)

Equations (48), (49), and (50) suggest that we select Φ_1 for class 1, whose eigenvalue lies between 0.5 and 1, and Φ_n for class 2, whose eigenvalue lies between 0 and 0.5. Since λ_2, λ_3, ..., λ_{n-1} are not zero but 0.5, Φ_2, ..., Φ_{n-1} contribute information for representing or estimating both class 1 and class 2. However, the corresponding eigenvectors do not contribute any information for separating the two pattern classes. This fact will be discussed in the next section.
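A direct numerical check of (32) and (48)-(50) is easy to carry out; the vectors D_0 and D below are arbitrary illustrative values of ours:

    # With equal covariances and equal priors, the eigenvalues of
    # S1 = (1/2){I - (D0 D^T + D D0^T)} of (32) follow (48)-(50):
    # one above 0.5, one below 0.5, and n - 2 equal to 0.5.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 6
    D0 = 0.3 * rng.standard_normal(n)                   # mean vector of the mixture
    D = 0.3 * rng.standard_normal(n)                    # half the difference of the class means

    S1 = 0.5 * (np.eye(n) - (np.outer(D0, D) + np.outer(D, D0)))   # (32)
    lam = np.sort(np.linalg.eigvalsh(S1))[::-1]

    d0, d = np.linalg.norm(D0), np.linalg.norm(D)       # (43), (44)
    cos_t = (D @ D0) / (d0 * d)                         # (45)
    lam_theory = np.concatenate((
        [0.5 + 0.5 * d0 * d * (1 - cos_t)],             # (48)
        np.full(n - 2, 0.5),                            # (49)
        [0.5 - 0.5 * d0 * d * (1 + cos_t)]))            # (50)
    assert np.allclose(lam, lam_theory)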

CLASSIFICATION

In the previous section, it was shown that after transforming the coordinate system we can select a common set of eigenvectors for both class 1 and class 2. We now select a subset of the eigenvectors to represent the sample and study the effectiveness of these eigenvectors in classification. We select those eigenvectors whose eigenvalues are closer to 0 or 1.

Intuitively, the above selection of eigenvectors is appealing, as was discussed in the preceding section. However, in order to give a more rigorous justification, we have to set up an index of performance which measures the separability of two pattern classes, and to observe the contribution of each eigenvector to that index. Although there are several indices of performance, let us take the divergence in this paper [11]. The divergence for two Gaussian distributions is given by

\mathrm{Div} = \tfrac{1}{2}(L_1 - L_2)^T(\Theta_1^{-1} + \Theta_2^{-1})(L_1 - L_2) + \tfrac{1}{2}\,\mathrm{tr}\,[(\Theta_1 - \Theta_2)(\Theta_2^{-1} - \Theta_1^{-1})]    (51)

where, as seen in Table I, L_k and Θ_k are the mean vector and covariance matrix of the coefficients of the Karhunen-Loève expansion. Equations (11) and (12) show that these coefficients, the A's, can be considered as a new set of variables obtained from the F's by the eigenmatrix B of (12) and (13). Since the divergence of (51) is invariant under the linear transformations of Table I, if we select all n eigenvectors, the distributions of the A_i^(1)'s and A_i^(2)'s have the same divergence as the original distributions of the X_i^(1)'s and X_i^(2)'s.
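The divergence (51) and its invariance under a nonsingular linear transformation can be written out directly; the test means, covariances, and the transformation below are arbitrary values of ours:

    # Divergence (51) between two Gaussian distributions, and a check that it is
    # unchanged when the means and covariances are carried through a nonsingular
    # linear transformation, as used in the argument above.
    import numpy as np

    def divergence(L1, L2, T1, T2):
        dL = L1 - L2
        T1i, T2i = np.linalg.inv(T1), np.linalg.inv(T2)
        return 0.5 * dL @ (T1i + T2i) @ dL + 0.5 * np.trace((T1 - T2) @ (T2i - T1i))

    rng = np.random.default_rng(4)
    n = 4
    L1, L2 = rng.standard_normal(n), rng.standard_normal(n)
    A1, A2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    T1, T2 = A1 @ A1.T + np.eye(n), A2 @ A2.T + np.eye(n)    # positive definite covariances

    B = rng.standard_normal((n, n))                          # any nonsingular transformation
    print(divergence(L1, L2, T1, T2))
    print(divergence(B @ L1, B @ L2, B @ T1 @ B.T, B @ T2 @ B.T))   # same value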

Thus, the evaluation of eigenvectors for classification purposes becomes the evaluation of variables in the distributions of the A_i^(k)'s.

For calculating the divergence for Gaussian distributions, L_k, Q_k, and Θ_k of Table I can be obtained as follows:

L_k = E[A_i^{(k)}] = B^T D_k    (52)

Q_k = P(\omega_k)\,E[A_i^{(k)} A_i^{(k)T}] = B^T P(\omega_k) E[F_i^{(k)} F_i^{(k)T}] B = B^T S_k B = \begin{cases} \Lambda & \text{for } k = 1 \\ I - \Lambda & \text{for } k = 2 \end{cases}    (53)

\Theta_k = Q_k / P(\omega_k) - L_k L_k^T    (54)

where Λ is a diagonal matrix with λ_1, ..., λ_n as its components. From (53),

E[a_{ij}^{(k)2}] = \begin{cases} \lambda_j / P(\omega_1) & \text{for } k = 1 \\ (1 - \lambda_j) / P(\omega_2) & \text{for } k = 2. \end{cases}    (55)

Equation (55) shows that selecting eigenvectors by λ_j means selecting the eigenvectors whose coefficients have the larger differences between ω_1 and ω_2 in their mean-square values. In the following discussion, we will show theoretically that this selection is the best in terms of the divergence when the two covariance matrices are equal. For unequal covariance cases we will show by two examples that we obtain a reasonably good selection of eigenvectors, although the theoretical justification is difficult.

Equal Covariance Cases

Let us assume K_1 = K_2 and P(ω_1) = P(ω_2) = 0.5. Since the Φ_j's are selected as the eigenvectors of (D_0 D^T + D D_0^T), as shown in (38), and since μ_2 = ... = μ_{n−1} = 0 from (40), Φ_2, ..., Φ_{n−1} should satisfy

(D_0 D^T + D D_0^T)\,\Phi_j = 0    (j = 2, \ldots, n-1)    (56)

where 0 is a column vector with all zero components. Or,

D_0\,(D^T \Phi_j) = -D\,(D_0^T \Phi_j).    (57)

It is easily shown that (57) holds for finite and different vectors D_0 and D if and only if

D^T \Phi_j = \Phi_j^T D = 0 \quad \text{and} \quad D_0^T \Phi_j = \Phi_j^T D_0 = 0    (j = 2, \ldots, n-1).    (58)

TABLE II
EIGENVALUES AND DIVERGENCE OF INDIVIDUAL FEATURES OF MARILL AND GREEN'S DATA FOR THE CHARACTERS A AND B

Selection Order (j)    Eigenvalue (λ_j)    |λ_j − 0.5|    Div_j
        1                  0.9521            0.4521       9.824
        2                  0.0858            0.4142       4.373
        3                  0.1113            0.3887       4.638
        4                  0.8525            0.3525       1.978
        5                  0.7475            0.2475       0.6491
        6                  0.2684            0.2316       0.5463
        7                  0.3018            0.1982       0.3751
        8                  0.6375            0.1375       0.2772

That is, in this case n − 2 eigenvectors are orthogonal to the plane of D and D_0. From (33), (34), (52), and (58), the components l_j^(k) of L_k are

l_j^{(k)} = \Phi_j^T D_k = \Phi_j^T (D_0 \mp D) = 0    (j = 2, \ldots, n-1).    (59)

Thus, the divergence for the equal covariance case reduces to

\mathrm{Div} = \tfrac{1}{2}(L_1 - L_2)^T \Theta^{-1} (L_1 - L_2) = \frac{\theta_{nn}\,(l_1^{(2)} - l_1^{(1)})^2 + \theta_{11}\,(l_n^{(2)} - l_n^{(1)})^2 - 2\theta_{1n}\,(l_1^{(2)} - l_1^{(1)})(l_n^{(2)} - l_n^{(1)})}{2\,(\theta_{11}\theta_{nn} - \theta_{1n}^2)}    (63)

so that only Φ_1 and Φ_n contribute to the divergence.

Experiment 1: The eigenvalues λ_j, the quantities |λ_j − 0.5|, and the individual divergences Div_j computed from Marill and Green's data [11] for the characters A and B are listed in Table II.


TABLE III
EIGENVALUES AND DIVERGENCE OF INDIVIDUAL FEATURES OF FABRIC DATA SUPPLIED BY DUPONT CORPORATION, WILMINGTON, DELAWARE

Selection Order (j)    Eigenvalue (λ_j)    |λ_j − 0.5|    Div_j
        1                  0.9748            0.4748      18.48
        2                  0.0443            0.4670      10.13
        3                  0.0759            0.4241       5.146
        4                  0.9129            0.4129       4.337
        5                  0.8244            0.3244       1.453
        6                  0.1815            0.3185       1.365
        7                  0.8131            0.3131       1.290
        8                  0.1871            0.3129       1.288
        9                  0.2076            0.2824       1.039
       10                  0.7683            0.2683       0.8130
       11                  0.2401            0.2599       0.7405
       12                  0.2683            0.2317       0.5469
       13                  0.7294            0.2294       0.5332
       14                  0.2922            0.2078       0.4175
       15                  0.6981            0.1981       0.3724
       16                  0.6632            0.1632       0.2385
       17                  0.6619            0.1619       0.2342
       18                  0.3569            0.1431       0.1786
       19                  0.6024            0.1024       0.0876
       20                  0.4051            0.0949       0.0748
       21                  0.4263            0.0737       0.0444
       22                  0.5657            0.0657       0.0352
       23                  0.4356            0.0644       0.0337
       24                  0.5154            0.0154       0.0020
       25                  0.4849            0.0151       0.0019
       26                  0.5040            0.0040       0.0002

For unequal covariance matrices, the divergence involves cross terms between the coefficients in addition to the individual divergences. However, the individual divergence still gives a measure of the effectiveness of each eigenvector.

Experiment 2: In this example, the mean vectors and the covariance matrices are calculated from 26-dimensional data supplied by the Dupont Corporation, Wilmington, Delaware. These data are measurements made on certain fabrics.

Table III lists λ_j, |λ_j − 0.5|, and Div_j. In this example, the ordering by |λ_j − 0.5| is exactly the same as the ordering by the individual divergences.
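The orderings reported in Tables II and III can be reproduced in outline as follows. We read the individual divergence Div_j as the divergence (51) applied to the jth coefficient alone, with its class moments taken from (52)-(54); that reading, and the made-up class statistics below, are our own assumptions rather than a description of the data used in the experiments:

    # Rank the normalized features by |lambda_j - 0.5| and list, for comparison,
    # the divergence of each coefficient taken by itself.
    import numpy as np

    rng = np.random.default_rng(5)
    n = 8
    M1, M2 = rng.standard_normal(n), rng.standard_normal(n)
    A1, A2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    Sig1, Sig2 = A1 @ A1.T + np.eye(n), A2 @ A2.T + np.eye(n)

    R1 = 0.5 * (Sig1 + np.outer(M1, M1))                 # a priori probabilities of 0.5
    R2 = 0.5 * (Sig2 + np.outer(M2, M2))
    w, V = np.linalg.eigh(R1 + R2)
    P = V @ np.diag(w ** -0.5) @ V.T                     # normalizing transformation, (19)

    lam, B = np.linalg.eigh(P @ R1 @ P.T)                # eigenvalues of S1 and eigenmatrix B
    L = [B.T @ P @ M for M in (M1, M2)]                  # mean coefficient vectors, (52)
    Th = [B.T @ P @ Sig @ P.T @ B for Sig in (Sig1, Sig2)]   # coefficient covariances

    def div_single(j):                                   # scalar form of (51) for feature j
        dm = L[0][j] - L[1][j]
        t1, t2 = Th[0][j, j], Th[1][j, j]
        return 0.5 * dm * dm * (1 / t1 + 1 / t2) + 0.5 * (t1 - t2) * (1 / t2 - 1 / t1)

    for j in np.argsort(-np.abs(lam - 0.5)):             # selection order of Tables II and III
        print(f"lambda={lam[j]:.4f}  |lambda-0.5|={abs(lam[j]-0.5):.4f}  Div_j={div_single(j):.4f}")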

UNSUPERVISED CLUSTERING

We now show how the properties discussed in the preceding sections can be used to develop an algorithm for classifying samples given very limited a priori knowledge.

Suppose we have N samples which we know to come from two pattern classes. Further, assume that the a priori probabilities P(ω_1) and P(ω_2) are known. Transform coordinates so that the mixture autocorrelation matrix is I. Then, for all dichotomies such that NP(ω_1) samples are assigned to ω_1 and NP(ω_2) to ω_2, (20) is satisfied. Thus, for all of these dichotomies, both classes share common eigenvectors and have reversely ordered eigenvalues in the range 0 to 1.

Now our aim is to choose the best dichotomy according to some criterion. Given the criterion, C, a simple algorithm for clustering is as follows.

1) Assign NP(ω_1) samples to ω_1 and NP(ω_2) to ω_2 arbitrarily.

2) Exchange equal numbers of samples between ω_1 and ω_2 so as to increase C.

3) Repeat until C is maximized.

In what follows we assume P(ω_1) = P(ω_2) = 0.5.

A Criterion C

One property of a "good" dichotomy is that each feature should be effective for classification. According to what we have shown, this means that the eigenvalue of each feature should be near 0 or 1. Thus, our criterion function should increase as each eigenvalue approaches 0 or 1. A proposed criterion function is

C = \frac{1}{n}\sum_{j=1}^{n}(\lambda_j - 0.5)^2 + \frac{1}{n}\sum_{j=1}^{n}\{(1 - \lambda_j) - 0.5\}^2 = \frac{2}{n}\Big(\sum_{j=1}^{n}\lambda_j^2 - \sum_{j=1}^{n}\lambda_j\Big) + 0.5.    (64)

That is, eigenvalues which are far from 0.5 are favored.

Equation (64) can be rewritten in terms of the elements s_ij of S_1, rather than the eigenvalues, as follows:

C = \frac{2}{n}\big(\mathrm{tr}\,S_1^2 - \mathrm{tr}\,S_1\big) + 0.5 = \frac{2}{n}\Big(\sum_{i=1}^{n}\sum_{j=1}^{n} s_{ij}^2 - \sum_{i=1}^{n} s_{ii}\Big) + 0.5.    (65)
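A quick check that the two forms of the criterion agree is given below; the matrix S_1 is synthesized with a known spectrum, which is our own device for illustration:

    # The eigenvalue form (64) and the trace form (65) of the criterion C agree.
    import numpy as np

    rng = np.random.default_rng(6)
    n = 5
    lam = rng.uniform(0.0, 1.0, n)                       # a spectrum between 0 and 1
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))     # a random orthonormal basis
    S1 = V @ np.diag(lam) @ V.T                          # an S1 with that spectrum

    C_eig = (np.sum((lam - 0.5) ** 2) + np.sum(((1 - lam) - 0.5) ** 2)) / n      # (64)
    C_trace = (2.0 / n) * (np.trace(S1 @ S1) - np.trace(S1)) + 0.5               # (65)
    assert np.isclose(C_eig, C_trace)
    print(C_eig)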

If a sample F_l in class 1 is exchanged for a sample F_k in class 2, then the change in C, ΔC, is given by

\Delta C = \frac{2}{n}\Big[\sum_{i=1}^{n}\sum_{j=1}^{n}\big(2\,s_{ij}\,\Delta s_{ij} + \Delta s_{ij}^2\big) - \sum_{i=1}^{n}\Delta s_{ii}\Big]    (66)

where

\Delta s_{ij} = \frac{1}{N}\big\{f_k(t_i)\,f_k(t_j) - f_l(t_i)\,f_l(t_j)\big\}    (67)

and where f_k(t_i) is, from (4), the ith component of F_k. When N is large, the Δs_ij² term of (66) can be neglected, so that (66) becomes

\Delta C = \Delta C_k + \Delta C_l    (68)

where

\Delta C_k = \frac{2}{nN}\Big[\sum_{i=1}^{n}\sum_{j=1}^{n} 2\,s_{ij}\,f_k(t_i)\,f_k(t_j) - \sum_{i=1}^{n} f_k^2(t_i)\Big] = \frac{2}{nN}\big[2\,F_k^T S_1 F_k - F_k^T F_k\big]    (69)

and

\Delta C_l = -\frac{2}{nN}\big[2\,F_l^T S_1 F_l - F_l^T F_l\big].    (70)

It should be noted that neither (66) nor (68) requires eigenvalue calculations or autocorrelation calculations for the sample exchanges.
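A numerical sketch of the exchange update may be useful; the data and the particular pair exchanged are arbitrary choices of ours, and the exact change in C is compared with (66) and with the large-N approximation (69)-(70):

    # Exchange F_l (leaving class 1) for F_k (entering class 1) and compare the
    # exact change in C from (65)-(66) with the approximation (68)-(70).
    import numpy as np

    rng = np.random.default_rng(7)
    n, N = 6, 200                                        # N samples in the mixture
    F = rng.standard_normal((n, N))
    class1 = np.arange(N // 2)                           # current class 1 assignment

    def S1_of(index):                                    # S1 with P(w1) = 0.5 built in
        G = F[:, index]
        return (G @ G.T) / N

    def C_of(S1):                                        # criterion (65)
        return (2.0 / n) * (np.trace(S1 @ S1) - np.trace(S1)) + 0.5

    S1 = S1_of(class1)
    l, k = class1[0], N - 1                              # F_l from class 1, F_k from class 2
    dS = (np.outer(F[:, k], F[:, k]) - np.outer(F[:, l], F[:, l])) / N           # (67)
    dC_exact = (2.0 / n) * (np.sum(2 * S1 * dS + dS * dS) - np.trace(dS))        # (66)
    assert np.isclose(C_of(S1_of(np.append(class1[1:], k))) - C_of(S1), dC_exact)

    dC_k = (2.0 / (n * N)) * (2 * F[:, k] @ S1 @ F[:, k] - F[:, k] @ F[:, k])    # (69)
    dC_l = -(2.0 / (n * N)) * (2 * F[:, l] @ S1 @ F[:, l] - F[:, l] @ F[:, l])   # (70)
    print(dC_exact, dC_k + dC_l)                         # nearly equal when N is large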


TABLE IV
CONVERGENCE DATA FOR SAMPLE EXCHANGE ALGORITHM

Iteration    Number of Exchanges    Criterion Value C After Exchanges
(initial)             —                         0.009
    1                87                         0.113
    2                30                         0.168
    3                23                         0.206
    4                12                         0.225
    5                 4                         0.227
    6                 2                         0.228

Fig. 1. Flow chart of the sample exchange algorithm.

A Sample Exchange Algorithm

A sample exchange algorithm proposed in this paper is based on the assumption that we have a large enough number of samples to justify (68). Also, P(ω_1) = P(ω_2) = 0.5 is assumed. The flow chart of the actual algorithm is shown in Fig. 1. As is seen in Fig. 1, for a given class assignment, S_1 is calculated, and the ΔC_k's for all F_k's of class 2 and the ΔC_l's for all F_l's of class 1 are calculated and ordered as

\Delta C_{k_1} \ge \Delta C_{k_2} \ge \cdots \ge \Delta C_{k_{N/2}}    (71)

\Delta C_{l_1} \ge \Delta C_{l_2} \ge \cdots \ge \Delta C_{l_{N/2}}.    (72)

Then

\Delta C_1 \ge \Delta C_2 \ge \cdots \ge \Delta C_{N/2}    (73)

where

\Delta C_i = \Delta C_{k_i} + \Delta C_{l_i}.    (74)

Exchanges are made for all pairs F_{k_i} and F_{l_i} which give ΔC_i > 0, since these exchanges increase C. Then S_1 is recalculated for the new class assignment after the above exchanges, and the same process is repeated until we obtain a class assignment which gives

\Delta C_1 = \Delta C_{k_1} + \Delta C_{l_1} \le 0.    (75)

Equation (75) guarantees that this final class assignment gives the maximum value of C. Let us select m samples F_{l_i} (i ∈ I = {i_1, i_2, ..., i_m}) from class 1 and m samples F_{k_j} (j ∈ J = {j_1, j_2, ..., j_m}) from class 2. When (75) holds with (71) and (72), the exchange of these 2m samples gives
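The complete exchange procedure of Fig. 1 can be sketched as follows. The normalization step, the data generation, and the iteration cap are our own choices, and the exchange rule follows (69)-(75) under the large-N approximation; with the made-up clusters below the final assignment typically recovers the two groups up to a relabeling:

    # Sketch of the sample exchange algorithm of Fig. 1 with P(w1) = P(w2) = 0.5.
    import numpy as np

    def exchange_cluster(F, max_iter=100, seed=0):
        """Dichotomize the columns of F (n x N, N even) by increasing C of (64)-(65)."""
        n, N = F.shape
        rng = np.random.default_rng(seed)
        in_class1 = np.zeros(N, dtype=bool)
        in_class1[rng.permutation(N)[: N // 2]] = True   # arbitrary initial assignment
        for _ in range(max_iter):
            G1 = F[:, in_class1]
            S1 = (G1 @ G1.T) / N                         # class 1 autocorrelation times its prior
            # g(x) = (2/nN)(2 x^T S1 x - x^T x); Delta C_k = g(F_k), Delta C_l = -g(F_l)
            g = (2.0 / (n * N)) * (2 * np.sum(F * (S1 @ F), axis=0) - np.sum(F * F, axis=0))
            idx1 = np.flatnonzero(in_class1)             # candidates to leave class 1
            idx2 = np.flatnonzero(~in_class1)            # candidates to enter class 1
            idx1 = idx1[np.argsort(g[idx1])]             # order by Delta C_l, descending, (72)
            idx2 = idx2[np.argsort(-g[idx2])]            # order by Delta C_k, descending, (71)
            dC = g[idx2] - g[idx1]                       # Delta C_i of (73)-(74)
            m = int(np.sum(dC > 0))                      # exchange every pair with Delta C_i > 0
            if m == 0:                                   # (75): no improving exchange remains
                break
            in_class1[idx1[:m]] = False
            in_class1[idx2[:m]] = True
        return in_class1

    # Two made-up Gaussian groups whose mixture mean is not zero.
    rng = np.random.default_rng(8)
    F = np.hstack((rng.normal(1.0, 1.0, (3, 100)), rng.normal(4.0, 1.0, (3, 100))))
    w, V = np.linalg.eigh((F @ F.T) / F.shape[1])
    F = V @ np.diag(w ** -0.5) @ V.T @ F                 # normalize: mixture autocorrelation = I
    labels = exchange_cluster(F)
    print(int(labels[:100].sum()), int(labels[100:].sum()))   # composition of class 1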


Fig. 2. Distribution and assignment of samples projected onto a two-dimensional subspace.

\Delta C = \sum_{i \in I} \Delta C_{l_i} + \sum_{j \in J} \Delta C_{k_j} \le m\,(\Delta C_{k_1} + \Delta C_{l_1}) \le 0.    (76)

Since (76) holds for all possible m, I, and J, (75) guarantees the maximum C.

Example 1: Again, Marill and Green's data are used [11]. According to the mean vectors and the covariance matrices in an eight-dimensional space, 200 samples for each class are generated so as to have a Gaussian distribution. The 400 samples from classes 1 and 2 are normalized and mixed, and an initial class assignment is given randomly. The procedure of this paper reclassifies these samples into two classes. The classification error is 11 samples out of 200 (5.5 percent). Since Marill and Green gave 2.75 percent error experimentally and Fukunaga and Krile gave 1.85 percent error theoretically for the Bayes quadratic optimum classifier, 5.5 percent error is a reasonable result [9]. Table IV shows the number of exchanged sample pairs (the number of cycles in loop 2 of Fig. 1) and C versus the number of class reassignments (the number of cycles in loop 1 of Fig. 1). Also, Fig. 2 shows the distribution of 50 of the 200 samples in the most significant two-dimensional subspace of the original eight-dimensional space.


Fig. 3. Illustration of the features selected by the algorithm.

The misclassified samples appear in the region between the two distributions. There is an apparent overlap between correctly and incorrectly classified samples because we show only two dimensions. These results justify the criterion and the sample exchange algorithm of this paper.

Example 2: Several other examples have been studied. As a result, one significant property of our unsupervised clustering algorithm has been observed, which is shown in the most simplified examples as follows.

Fig. 3 shows typical simplified examples in which four samples are distributed symmetrically in a two-dimensional space. If we assign a class to each sample, the procedure of this paper picks up the most important eigenvector for classification purposes. However, if we compare the C's among all possible combinations of class assignments under the assumption P(ω_1) = P(ω_2) = 0.5, the circled class assignments in Fig. 3 give the largest C. Therefore, the unsupervised clustering algorithm which maximizes C separates these samples as shown in Fig. 3, although a human being cannot tell how to separate these samples. The above class assignments are reasonable if we look into their physical meaning, as follows.

1) In Fig. 3(a), X_3 and X_4 are the negative images of X_1 and X_2, respectively. For example, if X_1 is a black and white image of a character on a mesh, X_3 is the same image with black and white reversed. Or, if X_1 represents a time function x_1(t), then x_3(t) = −x_1(t). Therefore, X_1 and X_3 should have a common feature which is different from the common feature of X_2 and X_4.

2) In Fig. 3(b), X_3 and X_4 are approximately expressed by aX_1 and bX_2, where a and b are constants. That is, the difference between X_1 and X_3, or between X_2 and X_4, is found in the amplitude of the signals. On the other hand, the difference between X_1 and X_2, or between X_3 and X_4, is found in the shape of the time function. Thus, the unsupervised clustering algorithm in this paper favors signal shape rather than signal level as its feature. When the signals are normalized in amplitude, that is, when X_i/\sqrt{X_i^T X_i} is used as the signal representation, it becomes obvious to a human being that X_1 and X_3 should be separated from X_2 and X_4.

CONCLUSIONS

This paper discusses the application of the Karhunen-Loève expansion to pattern recognition. The major results obtained are as follows.

1) Under a transformation which converts the autocorrelation matrix of the mixture to the unit matrix, the two classes never share the important basis vectors for classification purposes.

2) Thus, the selection of the eigenvectors can be carried out on the basis of the eigenvalues.

3) This selection picks up two eigenvectors, keeping the same divergence as for n eigenvectors, when the covariance matrices are equal. For unequal covariance cases, two examples are shown to justify the procedure.

4) Using the above properties, an unsupervised clustering algorithm is proposed. Examples show how samples are separated into classes using only limited a priori knowledge.

REFERENCES

[1] S. Watanabe, "Karhunen-Loève expansion and factor analysis," Trans. 4th Prague Conf. on Information Theory, 1965.
[2] S. Watanabe et al., "Evaluation and selection of variables in pattern recognition," in Computer and Information Sciences-II, J. T. Tou, Ed. New York: Academic Press, 1967, pp. 91-122.
[3] Y. T. Chien and K. S. Fu, "On the generalized Karhunen-Loève expansion," IEEE Trans. Information Theory, vol. IT-13, pp. 518-520, July 1967.
[4] Y. T. Chien and K. S. Fu, "Selection and ordering of feature observations in a pattern recognition system," Information and Control, vol. 12, pp. 395-414, 1968.
[5] H. L. Van Trees, Detection, Estimation and Modulation Theory. New York: Wiley, 1968.
[6] I. Selin, Detection Theory. Princeton, N.J.: Princeton University Press, 1965.
[7] T. T. Kadota and L. A. Shepp, "On the best set of linear observables for discriminating two Gaussian signals," IEEE Trans. Information Theory, vol. IT-13, pp. 278-284, April 1967.
[8] S. Kullback, Information Theory and Statistics. New York: Wiley, 1959, pp. 197-200.
[9] K. Fukunaga and T. F. Krile, "Calculation of Bayes recognition error for two multivariate Gaussian distributions," IEEE Trans. Computers, vol. C-18, pp. 220-229, March 1969.
[10] J. T. Tou and R. P. Heyden, "Some approaches to optimum feature selection," in Computer and Information Sciences-II, J. T. Tou, Ed. New York: Academic Press, 1967, pp. 57-89.
[11] T. Marill and D. M. Green, "On the effectiveness of receptors in recognition systems," IEEE Trans. Information Theory, vol. IT-9, pp. 11-27, January 1963.
