

Journal of Classification 7:167-195 (1990)

A Sequential Fitting Procedure for Linear Data Analysis Models

Boris G. Mirkin

Central Economics-Mathematics Institute

Abstract: A particular factor analysis model with parameter constraints is generalized to include classification problems definable within a framework of fitting linear models. The sequential fitting (SEFIT) approach of principal component analysis is extended to include several nonstandard data analysis and classification tasks. SEFIT methods attempt to explain the variability in the initial data (commonly defined by a sum of squares) through an additive decomposition attributable to the various terms in the model. New methods are developed for both traditional and fuzzy clustering that have useful theoretic and computational properties (principal cluster analysis, additive clustering, and so on). Connections to several known classification strategies are also stated.

Keywords: Cluster analysis; Fuzzy clustering; (bi)linear model; Principal clusters; Additive clusters; Association measures for cross-classifications; Additive types.

1. Introduction

This paper summarizes the author's work on including cluster analysis methods within a particular class of data analysis models. The basic technique used is a sequential fitting (SEFIT) method, which can be interpreted as generalizing a common computational strategy in constructing principal components.

The author is grateful to P. Arabie and L. J. Hubert for editorial assistance and reviewing going well beyond traditional levels.

Author's Address: Boris G. Mirkin, Central Economics-Mathematics Institute, Krasikova st. 32, Moscow W-418, U.S.S.R. 117418.


A strategy of sequential projection onto linear spaces is developed that operationalizes the SEFIT method and provides an additive decomposition of the dispersion or "scatter" in the original data using the components of the solution, which is generally useful in interpretation. Three special cases of the general fitting problem using a least squares criterion are analyzed in detail: factor analysis (principal components), fuzzy cluster analysis (additive fuzzy types), and cluster analysis (principal clusters). In addition, some theoretical and practical advantages of this approach are demonstrated based on:

(i) the relation between approximation criteria and associated computational methods, and the use of preliminary data transformation and the selection of measures of proximities between units and/or variables;

(ii) the possibility of an automatic determination of the values needed to control the use of the algorithm; and

(iii) the availability of various solution characteristics as an aid in interpretation.

2. Linear Model of Data Reduction

The data we consider can be represented by a two-mode I × K matrix Y = (y_{ik}) containing the values y_{ik} for variables k ∈ <K> and entities i ∈ <I>. The basic model that is assumed for the data involves "latent" factors m ∈ <M> represented by the pairs (f_m, a_m) of I-dimensional column vectors f_m = (f_{im}) and K-dimensional row vectors a_m^T = (a_{mk}), where m = 1, ..., M. These factors are to be interpreted as "explaining" (with some degree of error, e_{ik}) the observed data Y by means of the following structural equation:

   y_{ik} = \sum_{m \in <M>} f_{im} a_{mk} + e_{ik} ,   (1)

or in matrix notation,

   Y = FA + E ,   (2)

where Y = (y_{ik}), F = (f_{im}), A = (a_{mk}), and E = (e_{ik}). The vector f_m gives the scores f_{im} of the factor m ∈ <M> for the entities i ∈ <I>; the vector a_m^T = (a_{mk}) represents the factor using weights in the K-dimensional variable space. According to (1), each entity i is represented by the i-th row of Y, which is a linear combination of the vectors a_m^T, m ∈ <M> (except for the residuals e_{ik}).


Depending on the constraints placed on the values for factor m, we can consider several more specific models. Here, we will confine ourselves to the three following cases:

Factor analysis. There are no a priori constraints: f_{im} and a_{mk} are allowed to be arbitrary.

Fuzzy clustering. The values a_{mk} are arbitrary but f_{im} must satisfy the constraints:

   0 ≤ f_{im} ≤ 1 ,  and  \sum_m f_{im} = 1  for each i ∈ <I> .   (3)

In (3), the factors are interpreted as fuzzy clusters associated with the vectors a_m; the values f_{im} are considered probabilities of each i belonging to the fuzzy cluster m (see Mirkin and Satarov 1990).

Cluster analysis. Constraints for a_{mk} are not given, but the values f_{im} are 0 or 1. In this case, the vector f_m corresponds to the nonfuzzy cluster S_m = {i : f_{im} = 1}, and the vector a_m characterizes the cluster. For nonoverlapping clusters, the model in (1) requires the values y_{ik} to coincide with a_{mk} for all i ∈ S_m (up to an approximation error e_{ik}). (See, for example, Jambu and Lebeaux 1983, or Mirkin 1987b.)

3. Diversity of Solutions and the Method of Sequential Fitting

To fit the model in (1), it is obviously necessary to estimate the unknown f_{im} and a_{mk} using the given data y_{ik}; here, this will be approached as a problem of minimizing the e_{ik}'s. Typically, this optimization task is operationalized through some measure of the size of the error terms, e(E), that has the following form:

   e(E) = \sum_{i,k} e_{ik}^2 .   (4)

Usually, however, minimizing e(E) under a given set of constraints is insufficient to obtain a unique solution. For example, in the problem of factor analysis, it is obvious that for each optimal A and F, the error matrix E and the value of the criterion e(E) are unchanged if A* = CA and F* = FC^{-1} (where C is an arbitrary nonsingular M × M matrix) are substituted in (2). Analogously, in the fuzzy clustering problem, each row vector of Y is approximated by a convex combination of the vectors a_m^T. Thus, in particular, each


set of vectors a_m^T having a convex closure that includes all row vectors y_i^T = (y_{ik}) yields a solution to the problem of fitting model (1) with zero errors (residuals) e_{ik}. Thus, equation (1) and the criterion e(E) are insufficient to determine a unique solution, and further restrictions on the model are necessary.

In the factor analysis problem, additional restrictions can be made using the "simple structure principle" that requires a basis for the factor space with weights a_{mk} that are close to 0 or ±1 (Harman 1960). This allows us to associate a "cluster" of the variables (for which a_{mk} = ±1) with the factor m ∈ <M>. Another set of restrictions would require rotating the factor solution to approximate a set of a priori values a^0_{mk}. In both cases, we may wish to attempt satisfaction of the constraints at the same time the factor space is found. Explicitly, we would force the values a_{mk} to be ±1 or 0 only (in the first case), or a_{mk} = a^0_{mk}, where the a^0_{mk} are specific values (in the second case). One interesting attempt to analyze this simultaneous problem was considered by Braverman and Mouchnick (1983) as the "method of extremal grouping of the variables."

We propose another principle of "simplicity" for the problem of obtaining factor solutions. The principle proceeds from the structure in the data, which is assumed to be sufficiently simple as to be revealed by the use of sequential optimization.

3.1 The Sequential Fitting (SEFIT) Method

Solutions are sought that can be constructed sequentially so that the factor (f_m, a_m) is the best approximation to the residual data matrix Y_m = (y_{ik}^m), where

   y_{ik}^m = y_{ik} - \sum_{n < m} f_{in} a_{nk} .

The method requires, in fact, a sequential strategy for fitting model (1). This is similar to a standard computational technique in principal component analysis, where it leads to the optimal solution being sought. Although the method may not be appropriate for fitting all data structures that might be of interest, and the optimality properties present in the context of principal component analysis do not generally transfer, in a number of situations it does appear to offer a rather reliable and informative data analysis technique (see Sections 6-11 below).


4. Iterative Projection Method

Consider the set y, x_1, ..., x_M of vectors in Euclidean space R^l. As is well known, the linear combination ŷ = \sum_m c_m x_m of the vectors x_m that is closest to y in R^l is defined by the orthogonal projection operator P_X = X (X^T X)^{-1} X^T, where X is an l × M matrix with the vectors x_m as columns, so that ŷ = P_X y and c = (X^T X)^{-1} X^T y. This rule gives the solution for fitting the model y = \sum_m c_m x_m + e (with unknown c_m, m ∈ <M>, and e) by the least squares criterion. In general, however, this solution is inappropriate for the data analysis problem of interest here. First, the set {x_m} may not be given a priori, and second, the equality

   (y, y) = \sum_{m \in <M>} c_m^2 (x_m, x_m) + (e, e)   (5)

(where, as usual, (y, y) = \sum_i y_i^2) holds only when the x_m's are pairwise orthogonal. The equality in (5) is crucial in interpretation, because it decomposes the sum of squares (y, y) according to the contributions c_m^2 (x_m, x_m) of the vectors x_m and the "unexplained" residual (e, e).

Because the set {x_m} is typically not given a priori, we consider fitting the model

   y = \sum_{m \in <M>} c_m x_m + e ,

by arbitrary c_m and x_m chosen from given subsets D_m of R^l (m ∈ <M>). To obtain a decomposition as in (5), we propose a sequential projection method, SEFIT. Each step m of the SEFIT method consists of constructing the single m-th factor as an approximation to the residual vector

   y_m = y_{m-1} - c_{m-1} x_{m-1} = y - \sum_{n < m} c_n x_n ,

with the subsequent construction of a new residual vector

   y_{m+1} = y_m - c_m x_m .

Explicitly, the SEFIT method consists of the following steps:

1. Let m = 1, y_m = y.

2. Solve the problem of minimizing

   |y_m - cx|^2 = (y_m - cx, y_m - cx)   (6)

   over all x ∈ D_m and real c. The possibly suboptimal solution defines c_m and x_m.


3. Set y_{m+1} = y_m - c_m x_m.

4. If a stopping criterion (defined below) is met, the process is completed, and

   y = \sum_{n=1}^{m} c_n x_n + e ,

   where e = y_{m+1}; otherwise, increase m by 1 and go to Step 2.
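The loop structure of these four steps can be summarized in a short Python sketch. The function names sefit and fit_one are illustrative assumptions: fit_one stands for whatever (possibly suboptimal) solver of (6) over the sets D_m is available, and a trivial solver satisfying Condition A (introduced below) is supplied as an example:

```python
import numpy as np

def sefit(y, fit_one, max_factors=10, tol=1e-6):
    """Generic SEFIT loop: repeatedly fit a single term c*x to the residual.

    fit_one(y_m) returns a (possibly suboptimal) pair (c, x) minimizing
    |y_m - c x|^2 over x in D_m and real c, as in Step 2.
    """
    total = y @ y
    residual = y.astype(float).copy()
    terms = []
    for _ in range(max_factors):
        c, x = fit_one(residual)
        contribution = c**2 * (x @ x)          # c_m^2 (x_m, x_m), as in (5)
        if contribution < tol * total:         # stopping rule (b) below
            break
        residual = residual - c * x            # Step 3: new residual
        terms.append((c, x, contribution))
    return terms, residual                     # residual plays the role of e

# A trivial fit_one: the best single-coordinate (standard basis) solution.
def best_coordinate(y_m):
    k = int(np.argmax(np.abs(y_m)))
    x = np.zeros_like(y_m)
    x[k] = 1.0
    return y_m[k], x                           # c = (y_m, u_k) / (u_k, u_k)

terms, e = sefit(np.array([3.0, -1.0, 2.0]), best_coordinate)
print(len(terms), e)                           # -> 3 [0. 0. 0.]
```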

Theorem 1. If the coefficient c_m gives the global minimum of (6) for fixed x_m ∈ D_m, then independently of the selection of the set {x_m}, the decomposition in (5) holds.

Proof. The vector y_{m+1} = y_m - c_m x_m is orthogonal to x_m when (6) is minimized over c_m. Thus, by Pythagoras's Theorem,

   (y_m, y_m) = (c_m x_m, c_m x_m) + (y_{m+1}, y_{m+1}) .

Summing over m = 1, ..., M, and noting that y_1 = y and y_{M+1} = e, we obtain (5).

Note 1. The SEFIT method may also be considered for metrics other than the Euclidean, particularly for L_p spaces with p > 1, which have the norm

   |x| = ( \sum_i |x_i|^p )^{1/p}

defined for each vector x. Only the problem of minimizing (6) is changed. The proof of Theorem 1 remains the same for L_p, with a corresponding substitution of the exponent p instead of 2 and the use of the decomposition

   |y|^p = \sum_m c_m^p |x_m|^p + |e|^p .   (5')

Note 2. By Pythagoras's Theorem and for fixed y_m, the problem of minimizing |y_{m+1}|^2 in (6) is equivalent to maximizing c_m^2 |x_m|^2 over c_m, which is obviously attained at

   c_m = (y_m, x_m) / (x_m, x_m) .   (7)

Thus, for fixed y_m we may maximize


   g(x) = c^2 (x, x) = (y_m, x)^2 / (x, x) ,   (8)

over all admissible x ∈ D_m, with c_m defined by (7).

The solution {x_m} found with the SEFIT method will generally give values for the residuals e_{ik} that exceed those that might be found through a simultaneous construction of a solution. Thus, the obvious question arises: are the residuals found by the SEFIT method reduced as the number of steps m increases? The answer depends both on the diversity of the sets D_m and the method chosen to solve the minimization problem in (6). One of the simplest sufficient conditions, appropriate in a variety of situations, can be stated as follows (Mirkin 1987b):

Condition A. Every unit basis vector u_k^T = (0, ..., 0, 1, 0, ..., 0), having the single 1 at the k-th position in the space R^l, belongs to D_m, and the vector x_m chosen in Step 2 of the SEFIT method is no worse than each possible u_k by criterion (8):

   g(x_m) ≥ g(u_k) = (y_{mk})^2 ,

where y_{mk} denotes the k-th entry in the vector y_m.

Theorem 2. If Condition A holds, then y_m converges to 0 as m increases.

Proof. Let y_{mk} be the component of y_m with maximal absolute value. Then, (y_m, y_m) / l ≤ (y_{mk})^2. Thus, by Condition A, g(x_m) ≥ (y_m, y_m) / l, which implies:

   |y_{m+1}|^2 = (y_m, y_m) - g(x_m) ≤ (y_m, y_m) d ,

where d = 1 - (1 / l) < 1. Thus, (y_m, y_m) ≤ (y, y) d^{m-1}, where d^{m-1} converges to 0 for increasing m, and in turn, (y_m, y_m) → 0.

On the basis of Theorems 1 and 2, we propose the following possible stopping rules for Step 4 of the SEFIT method: (a) the number of factors equals a prespecified fixed value; (b) the absolute or relative contribution of the factors to the data sum of squares becomes negligible; or (c) the sum of the factor contributions becomes sufficiently large.

To improve upon the adequacy of our approximation, we could consider different modifications of the SEFIT method, and in particular, a version that involves the recalculation of the coefficients (Trophimov 1981). If we denote by P_m the orthogonal projection operator onto the space of the first m vectors x_1, ..., x_m, on each m-th step the residual vector is taken to be y - P_m y rather than y_m - c_m x_m. The decomposition in (5) is now lost,


but the process does converge in a finite number of steps, as stated below in Theorem 3:

Theorem 3. If Condition A holds, the vector x_m found on the m-th step of the SEFIT method with coefficient recalculation is linearly independent of the vectors x_1, ..., x_{m-1}; thus, y_m = 0 for some m ≤ l.

Proof. Assume that for some m = 1, 2, ... the vector x_m found by the method depends linearly on the preceding vectors; that is, x_m = \sum_{n < m} b_n x_n for some coefficients b_n, n < m, that are not all zero. The residual y_m = y - P_{m-1} y and x_m are orthogonal because (y_m, x_m) = \sum_{n < m} b_n (y_m, x_n) = 0, since y_m is orthogonal to x_1, ..., x_{m-1} by construction. This fact and Condition A imply

   0 = (y_m, x_m)^2 / (x_m, x_m) ≥ (y_{mk})^2 ,

for each k = 1, ..., l; that is, y_m = 0. The contradiction shows that the vector x_m does not depend linearly on the preceding vectors.

5. The SEFIT Method for the Data Analysis Model in (1)

Applied to the model in (1) with initial data matrix Y, the SEFIT method can be stated as follows, where the dimensionality l of the space now equals l = I × K.

1. Define m = 1, Y_m = Y.

2. Minimize the criterion

   \sum_{i,k} ( y_{ik}^m - f_i a_k )^2 ,   (9)

where the y_{ik}^m are the entries of the residual matrix Y_m at step m, over all admissible a_k and f_i. When f_i (and/or a_k) are arbitrary, we may find the optimal solution in the usual way by setting the partial derivatives of (9) to zero:

   Y_m a = f |a|^2 ,   (10)

or

   f^T Y_m = a^T |f|^2 .   (11)

Substituting (10) or (11) into (9), we find


   \sum_{i,k} ( y_{ik}^m - f_i a_k )^2 = \sum_{i,k} ( y_{ik}^m )^2 - \lambda_m ,   (12)

where

   \lambda_m = |f|^2 |a|^2   (13)

represents the contribution of the m-th factor solution to the data sum of squares, and which equals

   \lambda_m = a^T Y_m^T Y_m a / |a|^2   (14)

or

   \lambda_m = f^T Y_m Y_m^T f / |f|^2 ,   (15)

depending on whether the basic equation used is (10) or (11). Formula (12) implies that on the m-th step, we maximize the criterion (13) in the form of (14) or (15).

Denote the solution by a_m^T = (a_{mk}) and f_m = (f_{im}).

3. A new residual matrix Y_{m+1} = Y_m - f_m a_m^T with entries

   y_{ik}^{m+1} = y_{ik}^m - f_{im} a_{mk}   (16)

is calculated. The previous analysis implies (see (5))

   \sum_{i,k} y_{ik}^2 = \sum_{n=1}^{m} \lambda_n + \sum_{i,k} ( y_{ik}^{m+1} )^2 .   (17)

Thus, we may use the conditions:

   \sum_{n=1}^{m} \lambda_n / \sum_{i,k} y_{ik}^2 > 1 - d' ,  or  \lambda_m / \sum_{i,k} y_{ik}^2 < d'' ,

with prespecified values for d' and d'' defining negligible proportions of the sum of squares, in constructing reasonable stopping rules. If neither condition is met, we increment m by one and return to Step 2 to find the next pair (f_m, a_m); otherwise, we terminate the procedure.
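In code, the alternation of (10) and (11) within each SEFIT step, together with the stopping condition on relative contributions, might look as follows. This is a NumPy sketch for the unconstrained case; the function names and the starting vector are illustrative assumptions:

```python
import numpy as np

def fit_rank_one(Y_m, n_iter=100):
    """Alternate (10) and (11): f = Y_m a / |a|^2, a = Y_m^T f / |f|^2.
    With unconstrained f and a this is power iteration for the leading
    singular pair of the residual matrix Y_m."""
    a = Y_m[np.argmax((Y_m**2).sum(axis=1))].copy()   # start: longest row
    for _ in range(n_iter):
        f = Y_m @ a / (a @ a)                         # (10)
        a = Y_m.T @ f / (f @ f)                       # (11)
    return f, a

def sefit_matrix(Y, d2=1e-3, max_factors=10):
    """SEFIT for model (1): peel off factors until the relative
    contribution lambda_m of (13) becomes negligible."""
    total = (Y**2).sum()
    residual, factors = Y.astype(float).copy(), []
    for _ in range(max_factors):
        f, a = fit_rank_one(residual)
        lam = (f @ f) * (a @ a)                       # contribution (13)
        if lam < d2 * total:                          # stopping rule from (17)
            break
        residual -= np.outer(f, a)                    # residual update (16)
        factors.append((f, a, lam))
    return factors, residual
```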

For the problems of factor analysis, fuzzy, or traditional clustering, Condition A obviously holds, so Theorems 1 and 2 on the convergence of the process are applicable. Indeed, the I × K matrices F_{ik}, with all zero entries except for the single (i, k)-th element that equals 1, form the standard basis of the (I × K)-dimensional Euclidean space. Of course, F_{ik} = u_i u_k^T, where u_i, u_k are the I-dimensional and K-dimensional standard basis vectors with 1 as the i-th or k-th entry, respectively. Thus, the matrix F_{ik} is


appropriate for all three problems, and the solutions from Step 2 of the SEFIT method found with the algorithms discussed below are always better than the trivial solutions definable using the matrices F_{ik}.

6. The SEFIT Method in Problems of Factor Analysis

We noted above that for the factor analysis problem, both values a_k and f_i in (9) may be arbitrary. In this case, equations (10)-(15) hold. Substituting (10) into (11) and (11) into (10), we find an equivalence to the usual spectral decomposition of a Gramian matrix:

   Y_m^T Y_m a = \lambda_m a ,  and  Y_m Y_m^T f = \lambda_m f ,   (18)

where \lambda_m is defined by (13)-(15). Thus, the solution to minimizing (9) is given by the eigenvectors of the square matrices Y_m^T Y_m and Y_m Y_m^T with corresponding maximal eigenvalue \lambda_m. To obtain a normalized solution, we fix the norm of f (for example, |f| = 1); then, (13) implies |a|^2 = \lambda_m. Several well-known properties of the method of principal component analysis are summarized by the following theorem:

Theorem 4. For factor analysis, the SEFIT method leads to the following version of principal component analysis. The contribution of the m-th factor, \lambda_m, is the m-th eigenvalue of the matrices Y^T Y and Y Y^T (in decreasing order), and a_m (or f_m) is the corresponding eigenvector of the matrix Y^T Y (respectively, Y Y^T) with the norm |f_m| = 1 (respectively, |a_m|^2 = \lambda_m). The decomposition in (13) for \lambda_m reflects the contributions of the elements i ∈ <I>, k ∈ <K>.
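A quick numerical check of Theorem 4 (a self-contained NumPy sketch with randomly generated data; the alternation of (10) and (11) is run to convergence for a single factor):

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((30, 5))
Y -= Y.mean(axis=0)                    # column-centering

# One unconstrained SEFIT step: alternate (10) and (11).
a = Y[0].copy()
for _ in range(500):
    f = Y @ a / (a @ a)                # (10)
    a = Y.T @ f / (f @ f)              # (11)
lam = (f @ f) * (a @ a)                # contribution (13)

# Theorem 4: lam equals the largest eigenvalue of Y^T Y.
print(np.isclose(lam, np.linalg.eigvalsh(Y.T @ Y).max()))   # -> True
```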

Considering briefly the application of the SEFIT method to factor analysis with simple structure, the latter may be formalized using constraints on the sets of admissible values for a_k. For example, suppose we require a_k to be three-valued: 0 if the variable k is not associated with the factor m, 1 for a positive association, and -1 for a negative association. In this case, (10) is applicable:

   f = \sum_k y_k^m a_k / \sum_k a_k^2 ,

where y_k^m is the k-th column of Y_m. Denoting by A_1 (or A_2) the set of the variables with a_k = 1 (or -1, respectively), the optimal f for a fixed vector a = (a_k) containing values of 0 or ±1 is


   f = ( \sum_{k \in A_1} y_k^m - \sum_{k \in A_2} y_k^m ) / |A_1 ∪ A_2| ,   (19)

where |A_1 ∪ A_2| denotes the cardinality of the set A_1 ∪ A_2. Thus, the factor f here equals the mean of the "residual" variables y_k^m (whose signs may have been reversed), and is analogous to the well-known centroid method of factor analysis (Harman 1960) except for two differences. First, (19) does not include all variables, but only A_1 ∪ A_2. Second, the centroid method uses the loadings a_k determined by f in (19) without constraints, both for computation of the residuals by (16) and for purposes of interpretation (Harman 1960).

According to (14), the problem of determining a = (a_k) requires maximizing the value a^T Y_m^T Y_m a / |a|^2, which is quite similar to "principal cluster analysis" discussed below in Sections 9 and 10 (although in slightly different terms); therefore, we omit its discussion at this point.

7. The SEFIT Method for Fuzzy Cluster Analysis

This section is based on the article by Mirkin and Satarov (1990). For an application to fuzzy cluster analysis, the SEFIT method is operationalized by the sequential minimization of the criterion in (9) using arbitrary a_k but with f_i satisfying the constraints

   0 ≤ f_i ≤ 1 - g_i^m ,   (20)

where g_i^m = \sum_{n=1}^{m-1} f_{in} is the accumulated probability of entity i ∈ <I> belonging to the preceding fuzzy clusters, n < m. For each step of the process, the following inequalities obviously hold for each i ∈ <I>:

   0 ≤ f_{im} ≤ 1 ;  \sum_m f_{im} ≤ 1 .

To satisfy the relations in (3) after stopping the sequential process, it is necessary to define a last fuzzy cluster (the so-called "joker") with membership function f_{i0} (Bezdek 1981):

   f_{i0} = 1 - \sum_m f_{im} .

The optimal a for a fixed f is defined by (11):


   a = Y_m^T f / |f|^2 ,   (21)

and the optimal f for a fixed a is determined by (10). So, taking into account the constraints in (20), we have:

   f_i = 0             for c_i < 0 ;
   f_i = c_i           for 0 ≤ c_i ≤ 1 - g_i^m ;   (22)
   f_i = 1 - g_i^m     for c_i > 1 - g_i^m ,

where c_i = \sum_k y_{ik}^m a_k / \sum_k a_k^2; that is, c = Y_m a / |a|^2.

These formulae form the basis of the iterative algorithm for constructing the m-th fuzzy cluster. The process begins with some initial f that satisfies the constraints in (20); the vector a is calculated by (21), and then f by (22). The whole process is repeated until the vector f found at some step differs minimally from the preceding one. The vector f with all zero entries except f_i = 1, for that i corresponding to the most distant (from the origin, 0) row vector y_i^T, may be used on the initial step.
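A compact sketch of this alternating procedure for a single fuzzy cluster (NumPy; the function name and the convergence test are illustrative assumptions):

```python
import numpy as np

def fuzzy_cluster_step(Y_m, g, n_iter=200):
    """One SEFIT step for fuzzy clustering: alternate (21) and (22).
    g[i] = g_i^m is the membership already accumulated by entity i
    over the preceding fuzzy clusters."""
    f = np.zeros(Y_m.shape[0])
    f[np.argmax((Y_m**2).sum(axis=1))] = 1.0     # initial f: most distant row
    for _ in range(n_iter):
        a = Y_m.T @ f / (f @ f)                  # (21)
        c = Y_m @ a / (a @ a)                    # c = Y_m a / |a|^2
        f_new = np.clip(c, 0.0, 1.0 - g)         # (22): truncate to [0, 1 - g_i^m]
        if np.allclose(f_new, f):
            break
        f = f_new
    return f, a
```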

Theorem 5. The pair (f, a) is a locally optimal solution for the problem of minimizing criterion (9) with the constraints in (20) iff the vectors a and f satisfy equations (21) and (22), where

   c_i ≤ 1 - g_i^m ,   (23)

for each i ∈ <I>.

Proof. Let (a, f) be a locally optimal solution. Then obviously, (21) and (22) hold as the first-order optimality conditions. Denote:

   <I>_1 = {i : c_i < 0} ,  <I>_2 = {i : 0 ≤ c_i ≤ 1 - g_i^m} ,  and  <I>_3 = {i : c_i > 1 - g_i^m} ,

and prove <I>_3 = ∅. Assuming the opposite, that <I>_3 ≠ ∅, consider the vector a(A) = A a for A > 1. By (22), the corresponding c(A) for a(A) equals c / A. For this c(A), the analogous sets <I>_n(A) satisfy the obvious conditions:

   <I>_1(A) = <I>_1 ,  <I>_2 ⊆ <I>_2(A) ,  <I>_3(A) ⊆ <I>_3 .

By the definition of <I>_3, there exists an ε > 0 such that <I>_3(A) = <I>_3 for all A satisfying the inequality 1 < A < 1 + ε. Thus, for that A,


<I>_2 = <I>_2(A), and obviously, c_i / A < 0 for each i ∈ <I>_1, 0 ≤ c_i / A ≤ 1 - g_i^m for each i ∈ <I>_2, and c_i / A > 1 - g_i^m for each i ∈ <I>_3.

Continuing, the value of the criterion (9) for f(A), determined from c / A with the formulae in (22), equals:

   \Delta(A) = \sum_{k \in <K>} [ \sum_{i \in <I>_1} (y_{ik} - 0)^2 + \sum_{i \in <I>_2} ( y_{ik} - (c_i / A) A a_k )^2 + \sum_{i \in <I>_3} ( y_{ik} - (1 - g_i^m) A a_k )^2 ] ,

and it is not difficult to demonstrate that

   \sum_k \sum_{i \in <I>_3} ( y_{ik} - (1 - g_i^m) A a_k )^2 = Constant - \sum_k \sum_{i \in <I>_3} a_k^2 A (1 - g_i^m) { 2 c_i - A (1 - g_i^m) } .

The maximal value of the function p(x) = x (D - x) is achieved when x = D / 2. In our case, x = A (1 - g_i^m) and D = 2 c_i, so x = c_i gives the maximum. Also, p(1 - g_i^m) < p(A (1 - g_i^m)), because 1 - g_i^m < A (1 - g_i^m) < c_i. This implies \Delta(A) < \Delta(1), contradicting the local optimality of the pair (f, a), and proving <I>_3 = ∅.

Considering the converse implication of the theorem, suppose conditions (21), (22), and (23) are satisfied by some pair (f, a), and denote

   <I>(a) = {i : c_i = \sum_k y_{ik}^m a_k / \sum_k a_k^2 < 0} ,

and by Y(a) the matrix that results from excluding all rows y_i^T with i ∈ <I>(a) from Y_m. Analogously, the vector f(a) is extracted from f by excluding each f_i for i ∈ <I>(a), so by (23), the entries of f(a) are just the c_i for i ∉ <I>(a). Thus, the pair (a, f(a)) is the first principal component for the matrix Y(a); i.e., a and f(a) are the eigenvectors of the matrices Y(a)^T Y(a) and Y(a) Y(a)^T, respectively, corresponding to the maximal eigenvalues. The set of vectors a* for which <I>(a*) = <I>(a) is open; so the matrix Y(a) is invariant in some neighborhood of a. In that neighborhood, a maximizes (14) or, equivalently, minimizes (9). This result proves the theorem.


Note that if (f, a) is a locally optimal solution of problem (9), then (f / γ, γ a) gives the same value of (9) and also satisfies condition (20) for each γ > 1; that is, the pair (f / γ, γ a) is locally optimal as well. But for γ < 1, the pair may not satisfy the constraints in (20).

8. The SEFIT Method for Cluster Analysis

In cluster analysis, the values a_k are arbitrary and f_i is restricted to be 0 or 1. The SEFIT strategy in this case has been called the principal cluster analysis method (Mirkin 1987b). If the cluster corresponding to the vector f = (f_i) is denoted by S = {i : f_i = 1}, (11) has the form

   a_k = \sum_{i \in S} y_{ik} / |S| ,   (24)

where |S| denotes the cardinality of S. The vector a = (a_k) for cluster S is the centroid defined over the K variables but only for those entities in S. (The vector a is referred to as a "real type.") Formula (15) implies that the problem of minimizing (9) is equivalent to maximizing the contribution of cluster S to the total sum of squares; the contribution of S has the form

   g(S) = \sum_{i,j \in S} b_{ij} / |S| = |S| b(S) ,   (25)

where b_{ij} is the (i, j)-th entry in the proximity matrix B = Y Y^T, and b(S) is the average proximity in S which, in turn, equals the inner product of the vector a given in (24):

   b(S) = \sum_{i,j \in S} b_{ij} / |S|^2 = a^T a .   (26)

Given these relationships, we have proved the following theorem:

Theorem 6. The general m-th step of the principal cluster analysis method constructs the set of units S_m that maximizes the contribution (25) of S_m to the total sum of squares; the vector a_m from iteration m is the centroid for S_m. The contribution of variable k to g(S_m) equals a_{mk}^2 |S_m| (according to (25) and (26)).

The problem of maximizing (25) over the set of all possible subsets S ⊆ <I> is NP-complete. Although this problem was mentioned in Mirkin (1987b), that particular article emphasized the so-called principal clusters algorithm, which is a simple method for finding a possibly suboptimal S by


sequentially selecting entities i ∈ <I> to form S, starting with S = ∅. Specifically, we add to S that i maximizing the increment

   \Delta(S, i) = g(S ∪ i) - g(S) = ( b_{ii} + 2 |S| ( b(i, S) - b(S) / 2 ) ) / ( |S| + 1 ) ,   (27)

as long as that increment is positive. Here, b(i, S) is the average proximity of entity i to cluster S, which is equal to:

   b(i, S) = \sum_{j \in S} b_{ij} / |S|   (i ∈ <I>).   (28)

Ignoring constant terms in (27), the general step of the sequential procedure for the construction of S consists of finding that i maximizing

   w_i = b_{ii} + 2 |S| b(i, S) ,   (29)

and including i in S. A brief description of the algorithm is as follows.

Algorithm ADDY-SQ

Starting with S = ∅, define w_i = b_{ii} (i ∈ <I>), according to (29); the row vector y_i^m most distant from 0 provides the first entity in S because b_{ii} = (y_i^m, y_i^m). If

   w_i > |S| b(S) = g(S) ,   (30)

then the increment in (27) is positive and that i is added to S; this general step is repeated using recalculated values |S|, b(S), and b(i, S), with S ∪ i substituted for S, using the following recurrence relations:

   |S ∪ i| = |S| + 1 ,

   b(j, S ∪ i) = ( |S| b(j, S) + b_{ij} ) / |S ∪ i| ,  j ∈ <I> ,

   b(S ∪ i) = ( |S|^2 b(S) + w_i ) / |S ∪ i|^2 .

Entity i is not added to S iff (30) does not hold. In this case, the set S becomes the principal cluster S_m, and is characterized by the parameters a_m in (24), g(S_m), and a_{mk}^2 |S_m| / g(S_m), k ∈ <K> (see Theorem 6).

The next, (m + 1)-st principal cluster is found by using the residuals


   y_{ik}^{m+1} = y_{ik}^m - f_{im} a_{mk} ,

and with the same algorithm, according to the SEFIT method.
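A sketch of ADDY-SQ in NumPy follows (the function name is an assumption; the recurrence relations above are used verbatim, and the sketch is written for the case where the diagonal of B = Y_m Y_m^T is present):

```python
import numpy as np

def addy_sq(Y_m):
    """Principal cluster step (ADDY-SQ) on the residual matrix Y_m,
    using proximities B = Y_m Y_m^T and the increment test (30)."""
    B = (Y_m @ Y_m.T).astype(float)
    I = B.shape[0]
    S = [int(np.argmax(np.diag(B)))]               # most distant row vector
    in_S = np.zeros(I, dtype=bool)
    in_S[S[0]] = True
    b_S = B[S[0], S[0]]                            # b(S): average within-proximity
    b_iS = B[:, S[0]].copy()                       # b(i, S) for every i
    while True:
        w = np.diag(B) + 2 * len(S) * b_iS         # (29)
        w[in_S] = -np.inf
        i = int(np.argmax(w))
        if w[i] <= len(S) * b_S:                   # (30) fails: stop
            break
        # recurrence relations for S u {i}
        b_S = (len(S)**2 * b_S + w[i]) / (len(S) + 1)**2
        b_iS = (len(S) * b_iS + B[:, i]) / (len(S) + 1)
        S.append(i)
        in_S[i] = True
    a = Y_m[in_S].mean(axis=0)                     # centroid (24)
    return in_S, a
```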

Note 1. In general, the clusters S_m may overlap, although the procedure may be modified to find nonoverlapping clusters. Explicitly, we may add to S_m only those entities i not included in any preceding cluster S_n (for n < m); that is, we have to maximize w_i in (29) using only previously nonincluded i. This modification requires the use of a different stopping rule; namely, termination when the set of nonincluded entities is empty. A similar modification may also be implemented when additional a priori requirements for the interrelations between individual entities may exist.

Note 2. In our preceding discussion, we used the average proximities b(i, S) and b(S) to describe the process of the principal clusters' formation. Alternatively, it may be performed directly using the vector a, because b(i, S) = (y_i, a) and b(S) = a^T a. We leave it to the reader to formulate a version of the method using only the vector a (with formulae for the recalculation of a after adding some entity to the cluster).

Definition: A cluster S is strict (according to a given proximity matrix B) iff for each i ∉ S:

   b(i, S) ≤ b(S) / 2 .

Cluster S is a strict cluster in the variable space (according to its centroid a) iff for each i ∉ S:

   ( y_i - a/2 , a ) ≤ 0 ;

that is, the projection of y_i onto the segment 0a is closer to 0 than the midpoint of the segment.

Theorem 7. Each principal cluster is a strict cluster (according to B = Y Y^T).

Proof. According to the stopping rule in (30), the principal cluster S satisfies the inequality

   w_i = b_{ii} + 2 |S| b(i, S) ≤ |S| b(S) ,

for each i ∉ S. Then, 2 |S| b(i, S) ≤ |S| b(S), and so, b(i, S) ≤ b(S) / 2.


The principal cluster analysis suboptimal algorithm is similar to some well-known heuristic clustering algorithms: the B-coefficient method of Holzinger and Harman (1941), the algorithm "Specter" of Braverman and Doropheiuk (see Doropheiuk 1966, and also Braverman and Mouchnick 1983), the average linkage method of Sokal and Michener (1958), and so on. Here, we shall consider in detail only equivalent forms of the principal cluster analysis criterion. This discussion involves two rather useful points that are concerned with interpretational issues and recommendations for preliminary data transformation (especially for qualitative data).

9. Equivalent Criteria for Principal Cluster Analysis

The initial least squares approximation criterion in (4) presumes all entries y_{ik} of the data matrix are comparable, and each is represented with constant weight in the criterion by the use of its own squared residual e_{ik}. Thus, some preliminary normalizing transformation of the data may be necessary. For example, to impose a common scale, the equalities

   \sum_k y_{ik}^2 = K ,  \sum_i y_{ik}^2 = I ,

may be imposed using common iterative proportional fitting strategies. At this point, we will not consider the issue of preliminary data normalization, but will assume that the data are comparable, and also column-centered; that is, the mean of the entries in each column of the raw data matrix has been subtracted from each of the elements, setting each column sum to zero. This approach is quite natural because the model in (1) does not include constant terms. Further, we shall discuss only the problem of nonoverlapping clustering. In this case, criterion (4) is equivalent (by Theorem 6) to constructing the partition P = {S_1, ..., S_M} with maximal total contribution (over all clusters) to the sum of squares for the initial data; that is, with a maximal value of the criterion:

   g(P) = \sum_m g(S_m) = \sum_m |S_m| b(S_m) .

Also, from (5) we have:

   \sum_{i,k} y_{ik}^2 = g(P) + \sum_{i,k} e_{ik}^2 .   (31)

The criterion in (31) is known to be equivalent to the following three:


1. minimization of the weighted variance

   \sigma^2(P) = \sum_m |S_m| \sigma^2(S_m) ,   (32)

   where \sigma^2(S_m) = \sum_k \sum_{i \in S_m} ( y_{ik} - a_{mk} )^2 / |S_m| is the total variance in S_m;

2. minimization of the sum of distances to the centroids within clusters

   D(a_1, ..., a_M) = \sum_m \sum_{i \in S_m} \sum_k ( y_{ik} - a_{mk} )^2 ;   (33)

3. maximization of the sum of squared correlation ratios (for normalized and centered variables):

   \sum_k \eta^2(P, k) ,   (34)

   where

   \eta^2(P, k) = ( \sigma_k^2 - \sum_m |S_m| \sigma_k^2(S_m) / I ) / \sigma_k^2

   is the squared correlation ratio for the partition P = {S_1, ..., S_M} and variable k with variance \sigma_k^2 (here \sigma_k^2(S_m) denotes the variance of variable k within S_m).

For proofs of these equivalences, see Braverman and Mouchnick (1983) for (31) and (32), Spaeth (1985) for (32) and (33), and Mirkin (1985) for (32) and (34).

Formulae (31)-(34) all represent the same criterion as in (4) but suggest rather different interpretations for the cluster analysis problem: (a) to find a partition with classes that consist of highly interconnected units, by (31); (b) to find clusters with homogeneous values of the variables, by (32); (c) to find clusters of data points around some centroids, by (33); and (d) to find a partition associated most strongly with the given set of the variables, by (34). The last form of the criterion may also be used in grouping algorithms applicable to two-way contingency tables (Mirkin 1985).

A case of special interest is when all variables are qualitative. A qualitative grade s is represented by a Boolean column in the initial data matrix having entry i equal to 1 or 0 depending, respectively, on whether or not entity i possesses grade s. Note that for dichotomous variables (defined by two grades), two ways of coding are possible: (a) the variable is represented


by a single column, with 1 corresponding to one grade and 0 to the other (quantitative representation); or (b) the variable is represented by two Boolean columns, each corresponding to one of the two grades (qualitative case).

If we denote by p_s the proportion of ones corresponding to the grade s in its column, the column is centered by subtracting p_s from each of its entries. There are various ways of normalizing the columns:

(i) No normalization.

It is straightforward to demonstrate using (24) that the loading of the cluster S_m for grade s is equal to

   a_{ms} = p_{ms} / p_m - p_s ,

where p_{ms} is the proportion of entities that belong to S_m and have grade s, and p_m is the proportion of the total number of entities that belong to S_m, both proportions being taken with respect to all entities. The value a_{ms} can be interpreted as the difference between the conditional probability of grade s occurring within S_m, p_{ms} / p_m, and the unconditional probability of grade s occurring, p_s. Thus, it may be considered a natural index of association between S_m and grade s; see Mirkin (1985).

The proportion that grade s has in the contribution of S_m to the total sum of squares is determined (by Theorem 6) as:

   I p_m a_{ms}^2 = I ( p_{ms} - p_m p_s )^2 / p_m .   (35)

Thus, the total contribution to the sum of squares for the nominal variable X with the grades s, from all clusters of the partition P = {S_m}, equals:

   \Delta(X | P) = I \sum_{m,s} ( p_{ms} - p_m p_s )^2 / p_m = I [ \sum_{m,s} p_{ms}^2 / p_m - \sum_s p_s^2 ] .   (36)

This latter value is well known in data analysis as the proportional reduction in the average number of errors in the prediction of grade s when the cluster containing a given entity is known (see, for example, Goodman and Kruskal 1954). The relative value of this measure is known as the Goodman-Kruskal tau coefficient of association for the cross-classifications P and X:


   \tau(X | P) = \Delta(X | P) / ( 1 - \sum_s p_s^2 ) .   (37)

Thus, when qualitative data are not normalized, the principal cluster analysis criterion is equivalent to the maximization of the total association coefficient in (36); the separate terms of \Delta(X | P) are the contributions given in (35) of the grade s to the cluster S_m; the vector a_m is determined by the differences between the conditional probabilities of the grades s occurring within the clusters and the unconditional probabilities of grade occurrence.
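As a small worked illustration of (36) and (37), the following sketch computes \Delta(X | P) / I and \tau from a table of joint proportions p_{ms} (the numbers are invented for illustration):

```python
import numpy as np

# Joint proportions p_ms: rows are clusters of P, columns are grades of X.
p = np.array([[0.30, 0.05],
              [0.10, 0.25],
              [0.05, 0.25]])
p_m, p_s = p.sum(axis=1), p.sum(axis=0)

delta = ((p - np.outer(p_m, p_s))**2 / p_m[:, None]).sum()  # (36), up to the factor I
tau = delta / (1 - (p_s**2).sum())                          # Goodman-Kruskal tau (37)
print(round(float(delta), 4), round(float(tau), 4))
```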

(ii) Normalizing the column s by ( 1 - \sum_s p_s^2 )^{1/2}.

Here, we conclude by analogy to the preceding case that the criterion in (4) is equivalent to maximization of the Goodman-Kruskal coefficient of association \tau, computed between the obtained partition P and the given variable X.

(iii) Normalizing the Boolean column s by p_s^{1/2}, with subsequent (or, equivalently in this case, preliminary) centering of the columns.

In this case, the loadings are

   a_{ms} = p_s^{1/2} ( p_{ms} / p_m - p_s ) / p_s = p_s^{1/2} [ p_{ms} / ( p_m p_s ) - 1 ] = p_s^{1/2} \Phi(m, s) ,

where

   \Phi(m, s) = p_{ms} / ( p_m p_s ) - 1   (38)

is the relative increment in the probability of finding s under condition m, compared to the unconditional probability p_s. The values \Phi(m, s) are rather adequate measures of association between m and s for situations of small p_s (Mirkin 1985).

The contribution of grade s to S_m is

   I p_m a_{ms}^2 = I ( p_{ms} - p_m p_s )^2 / ( p_m p_s ) = I \Phi^2(m, s) p_m p_s ,

and so the total contribution of the variable X with grades s to the partition P = {S_m} is, in fact, Pearson's coefficient:

   X^2(X, P) = I \sum_{m,s} ( p_{ms} - p_m p_s )^2 / ( p_m p_s ) .   (39)


It is of interest to note the formula from Mirkin (1985):

   X^2(X, P) = I \sum_{m,s} \Phi^2(m, s) p_m p_s ,

which demonstrates the dependence of the coefficient on the relative changes \Phi(m, s) of the probabilities p_s when the conditions m are known. Thus, in this case, the principal cluster analysis criterion is equivalent to the maximization of Pearson's coefficients of association in (39) between the initial qualitative variables X and the constructed partition P.

We have shown the equivalence between the usual cluster constructions and "statistical" criteria for assessing "concordant" partitions. These questions have also been discussed (but without use of the model in (1)) in previous work of the author (Mirkin 1985) and in Saporta (1988).

The above results allow the formulation of several recommendations for the clustering of mixed data. It is sometimes appropriate to perform a preliminary normalization of the entries in columns k for quantitative variables by dividing by the standard deviation \sigma_k, and for qualitative grades s by dividing by p_s^{1/2}. After this transformation, all quantitative variables have variance I. For variables gauging qualitative grade s, the variance equals I (1 - p_s). This result implies concordance between the qualitative and quantitative representations of the dichotomous variables: the variance I of the quantitative form equals the sum of the variances I (1 - p_s) + I p_s for the two columns of the qualitative representation.

10. Other Criteria for Problems of Cluster Analysis

As is known, the least squares criterion in (4) is heavily influenced by the presence of outliers or perturbed data. However, the author is unaware of any attempts to fit the model in (1) with a criterion that is not based on least squares, either in the cluster analysis context or for traditional principal component analysis. We consider here the possibility of the SEFIT clustering method when Step 2 uses the city-block metric as a criterion, i.e.,

   \Phi_{cb} = \sum_{i,k} | y_{ik}^m - f_i a_k | .   (40)

Theorem 8. Cluster S is optimal according to criterion (40) iff it maximizes the value

   g_1(S) = \sum_{i \in S} ( 2 b(i, a) - |a| ) ,   (41)


where

   b(i, a) = \sum_{k \in k(i, a)} \min( |y_{ik}^m|, |a_k| ) ,   (42)

   k(i, a) = {k : y_{ik}^m a_k > 0} ,

and

   |a| = \sum_k |a_k| .

The components a_k of the vector a are the medians of the variables k in cluster S.

Proof. The necessary optimality condition for (40) has the following form:

   \sum_i f_i sgn( y_{ik}^m - f_i a_k ) = \sum_{i \in S} sgn( y_{ik}^m - a_k ) = 0 ,

where S = {i : f_i = 1}; that is, a is the vector of medians. It is obvious that

   \Phi_{cb} = \sum_{i \notin S} \sum_k |y_{ik}^m| + \sum_{i \in S} \sum_k |y_{ik}^m - a_k|
            = \sum_{i,k} |y_{ik}^m| - \sum_{i \in S} \sum_k ( |y_{ik}^m| - |y_{ik}^m - a_k| )
            = \sum_{i,k} |y_{ik}^m| - g_1(S) ,

since for each u, v

   |u - v| = |u| + |v| - |sgn u + sgn v| \min( |u|, |v| ) ,

and |sgn u + sgn v| equals 0 for uv < 0, or 2 for uv > 0.

The proof implies that the value g_1(S) is the contribution of S to \sum_{i,k} |y_{ik}^m|, and the decompositions in (41) and (42) allow a characterization of the contributions to S both for the separate entities i (which are equal to 2 b(i, a) - |a|), and for the separate variables k, which are equal to:

   \sum_{i \in S} ( |sgn y_{ik}^m + sgn a_k| \min( |y_{ik}^m|, |a_k| ) - |a_k| ) .


Theorem 8 also yields some algorithms to construct a suboptimal S (beginning with S = ∅) by adding entities sequentially to S, one of which is considered below.

Algorithm ADDY-CB

We begin from the entity i_0 most distant in the city-block metric, that is, maximizing |y_i^m| = \sum_k |y_{ik}^m|, and define a = y_{i_0}^m, S = {i_0}. Then, units j are selected that are closest to a according to measure (42) and added to S if g_1(S) increases. The vector a is then recalculated as the median for the updated S, and the process is repeated until g_1(S) no longer increases. This algorithm constructs a suboptimal cluster S, and (41) implies that for each i ∉ S:

   b(i, a) ≤ b(a, a) / 2 ,

which is analogous to the strict cluster concept in Euclidean space (see Section 8).
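A sketch of ADDY-CB in NumPy (the function name and the convention that ties are broken by argmax are assumptions):

```python
import numpy as np

def addy_cb(Y_m):
    """City-block principal cluster (ADDY-CB): grow S while the contribution
    g1(S) in (41) increases; a is recalculated as the componentwise median."""
    def b_to_a(a):
        # b(i, a) of (42) for every i: overlaps min(|y|, |a|) summed over
        # coordinates where y_ik and a_k share a sign
        return np.where(np.sign(Y_m) == np.sign(a),
                        np.minimum(np.abs(Y_m), np.abs(a)), 0.0).sum(axis=1)

    def g1(S, a):
        return (2 * b_to_a(a)[S] - np.abs(a).sum()).sum()   # (41)

    S = [int(np.argmax(np.abs(Y_m).sum(axis=1)))]           # most distant row (L1)
    a = Y_m[S[0]].copy()
    while len(S) < Y_m.shape[0]:
        b = b_to_a(a)
        b[S] = -np.inf
        i = int(np.argmax(b))                               # closest unit by (42)
        a_trial = np.median(Y_m[S + [i]], axis=0)
        if g1(S + [i], a_trial) <= g1(S, a):                # g1 must increase
            break
        S.append(i)
        a = a_trial
    return S, a
```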

As another criterion for Step 2 of the SEFIT method, consider

   \Phi_u = \max_{i,k} | y_{ik}^m - f_i a_k | ,   (43)

which involves minimizing the maximal residual e_{ik} in model (1). It is obvious that minimizing (43) is equivalent to the following mathematical programming problem:

   min \lambda  subject to
   -\lambda ≤ y_{ik}^m - a_k f_i ≤ \lambda ,  f_i ∈ {1, 0} ,

because the optimal \lambda equals the minimal value of the maximal deviation (43). Representing the vector f by the cluster S = {i : f_i = 1}, an equivalent form of the problem is obtained as

   min \lambda
   -\lambda ≤ y_{ik}^m ≤ \lambda   (i ∉ S) ,
   a_k - \lambda ≤ y_{ik}^m ≤ a_k + \lambda   (i ∈ S) ,
   \lambda ≥ 0 .   (44)

It is obvious from (44) that the optimal a is the center of the cube {y : a - \lambda ≤ y ≤ a + \lambda} with edge lengths equal to 2\lambda, where \lambda is that minimal


value for which all points of S are contained in the cube. For fixed S, the value a_k is determined as the midpoint of the interval between the maximal and minimal values of y_{ik}^m for i ∈ S.

The formulation in (44) shows that the cluster S may be found by distributing the set of all the points y_i^m between two congruent cubes with edge lengths of 2\lambda and centers at a (the set S) and 0 (the rest of the points). It is not difficult to modify the linear reduction model in (1), or at least the first stage of the SEFIT method, by adding an unknown constant term to the model. This would move the center of the "non-S" cube to this unknown point.

Let us consider a suboptimal agglomerative algorithm for the criterion.

Algorithm ADDI-CUBE

Begin with \lambda_1 = \max_{i,k} |y_{ik}^m| = |y_{i_0 k_0}^m|, and place the entity i_0 into S. As a general step, we consider

   \lambda_s = \max_{i \notin S} \max_k |y_{ik}^m| = |y_{i_s k_s}^m| ,

and put i_s into S, each time recalculating a as the vector of the midpoint values between the maximal and minimal y_{ik}^m for i ∈ S. Computation terminates as soon as the maximal deviation

   \max_k [ \max_{i \in S} y_{ik}^m - \min_{i \in S} y_{ik}^m ]

is greater than 2\lambda_s.

The algorithm ADDI-CUBE in the SEFIT method is clearly appropriate for a special "comet-like" structure in the data, when a cloud of points in the variable space may break into a "nucleus" and a "tail." The tail forms a cluster first; then the nucleus spawns a new tail, and so on. In the general case, we may have to consider the problem of simultaneously fitting all clusters in the model (1) by the minimax criterion. The points will then be distributed among M + 1 cubes of the same volume (M cubes for the clusters and the (M + 1)-st cube with the points around 0).
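A sketch of ADDI-CUBE (NumPy; whether the spread test is applied before or after adding the candidate is not fully specified above, so the sketch checks the trial set before committing):

```python
import numpy as np

def addi_cube(Y_m):
    """Chebyshev (minmax) cluster step: starting from the point farthest
    from 0, keep adding the farthest remaining point while S still fits
    in a cube of edge 2*lambda_s."""
    I = Y_m.shape[0]
    in_S = np.zeros(I, dtype=bool)
    dist0 = np.abs(Y_m).max(axis=1)            # Chebyshev distance from 0
    in_S[int(np.argmax(dist0))] = True
    while not in_S.all():
        rest = np.where(~in_S)[0]
        i = rest[int(np.argmax(dist0[rest]))]  # farthest remaining point
        lam_s = dist0[i]
        trial = in_S.copy()
        trial[i] = True
        spread = (Y_m[trial].max(axis=0) - Y_m[trial].min(axis=0)).max()
        if spread > 2 * lam_s:                 # S u {i} no longer fits the cube
            break
        in_S[i] = True
    a = (Y_m[in_S].max(axis=0) + Y_m[in_S].min(axis=0)) / 2   # cube center
    return in_S, a
```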

Note 1. Using the criteria of this section requires a special preliminary data normalization. After centering, all elements of column k have to be divided by \sum_i | y_{ik} - \bar{y}_k | for criterion (40), or by the range \max_i y_{ik} - \min_i y_{ik} for criterion (43).

11. Analysis of Proximity Matrices

At times, data are available in the form of a symmetric proximity matrix B = (b_{ij}), i, j ∈ <I>. Assuming that the proximities b_{ij} are generated according to some underlying factors f_m with corresponding weights \lambda_m, so


that the equation

   b_{ij} = \sum_{m \in <M>} \lambda_m f_{im} f_{jm} + e_{ij}   (45)

holds for all i, j ∈ <I>, where \lambda_m, f_{im} are unknown real numbers, and the f_{im} are (a) arbitrary in the case of the factor analysis problem, (b) discrete probability distributions in the case of fuzzy clustering, or (c) ones or zeros in the case of traditional clustering.

The SEFIT method is applicable here as well. The dimensionality l of the space R^l now equals l = I × I, the matrix B may be considered as a vector y, and the set D of admissible vectors is the set of square I × I matrices of rank 1 having the form f f^T, where f is an arbitrary admissible I-dimensional vector. Only the least squares criterion will be considered here:

   \sum_{i,j} e_{ij}^2 .   (46)

The SEFIT method applied to the model (45)-(46) consists of the following steps.

1. Define m = 1, B_m = B.

2. Minimize, at least locally, the criterion

   \sum_{i,j} ( b_{ij}^m - \lambda f_i f_j )^2   (47)

   by arbitrary (or by only positive) real \lambda and admissible f_i. Then, define f_{im} = f_i and \lambda_m = \lambda.

3. Examine the stopping rule for the construction of the factor f_m, using the values \lambda_m^2 (f_m, f_m)^2 that define the contribution of the m-th factor to the sum of squares for the data. If the stopping rule, which may be the same as in Step 4 of the SEFIT procedure, is satisfied, the process stops. Otherwise, let

   B_{m+1} = B_m - \lambda_m f_m f_m^T ,   (48)

increase m by 1, and return to Step 2. By Theorem 1, the procedure finds values \lambda_m, f_{im} satisfying the equality

   \sum_{i,j} b_{ij}^2 = \sum_m \lambda_m^2 |f_m|^4 + \sum_{i,j} e_{ij}^2 .   (49)


The factor analysis case is characterized entirely by the following statement.

Theorem 9. If no constraints are placed on f_{im}, the first m steps of the SEFIT method applied to matrix B construct eigenvectors f_1, ..., f_m corresponding to its nonzero (positive) eigenvalues \lambda_1, ..., \lambda_m (in decreasing order). When B = Y Y^T, where Y is a two-mode entities-by-variables matrix, the vectors f_1, ..., f_m are the principal components of Y.

We omit the proof because it does not involve any new results.

Model (45) for the cluster analysis case is known as the additive clustering model (Shepard and Arabie 1979). Three versions of the SEFIT method for the model were discussed in detail in the author's paper (Mirkin 1987a). Unfortunately, that paper contains some minor mistakes: (a) the case when the diagonal elements of the proximity matrix are absent is mixed up with the opposite case, and (b) the last column in Table 3 is based on inappropriate computation and conclusions. We consider these topics briefly.

The optimal value \lambda for fixed f equals, obviously, the average proximity b_{ij}^m for i, j ∈ S = {i : f_i = 1}. Substituting that value into (47), we find:

   \sum_{i,j} ( b_{ij}^m - \lambda f_i f_j )^2 = \sum_{i,j} ( b_{ij}^m )^2 - h(S) ,

where

   h(S) = ( \sum_{i,j \in S} b_{ij}^m / |S| )^2 = g^2(S) ,

if the diagonal elements b_{ii} are present. The maximum of h(S) corresponds to the maximum of g(S), defined in (25); thus, in this case, we may use the same suboptimization algorithms. When the b_{ii} are absent,

   h(S) = ( \sum_{i \ne j \in S} b_{ij}^m )^2 / ( |S| ( |S| - 1 ) ) .   (50)

This modified index changes, after entity i is added to S, by the value:

   \Delta(S, i) = ( 4 |S| / ( |S| + 1 ) ) [ ( |S| - 1 ) b^2(S) { b(i, S) / b(S) - 1/2 } + b^2(i, S) ] ,   (51)

where b(S) is the average proximity in S, and b(i, S) is the average proximity of entity i to members of cluster S, defined in (28). Obviously, i maximizes \Delta(S, i) iff it maximizes b(i, S). Thus, the suboptimal procedure of sequentially accruing entities to S is the same as in the preceding cases. Analogously, \Delta(S, i) < 0 for all i ∉ S iff


   b(i, S) / b(S) - 1/2 < 0  for i ∉ S ;

that is, each suboptimal S is a strict cluster, as in the case of traditional cluster analysis applied to two-way matrices.
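A sketch of the corresponding suboptimal step for the additive clustering model with the diagonal absent (NumPy; the starting pair and the assumption of positive average proximities are illustrative choices, and (51) is rearranged to avoid dividing by b(S)):

```python
import numpy as np

def additive_cluster_step(B_m):
    """Grow S from the closest pair by the entity with maximal b(i, S),
    accepting it while the increment (51) stays positive; diagonal absent."""
    I = B_m.shape[0]
    masked = B_m.astype(float)                  # copy with diagonal masked out
    np.fill_diagonal(masked, -np.inf)
    i0, j0 = np.unravel_index(int(np.argmax(masked)), masked.shape)
    S = [int(i0), int(j0)]                      # start from the closest pair
    while len(S) < I:
        n = len(S)
        sub = B_m[np.ix_(S, S)]
        b_S = (sub.sum() - np.trace(sub)) / (n * (n - 1))   # b(S), no diagonal
        b_iS = B_m[:, S].mean(axis=1)                       # b(i, S) as in (28)
        i = max((j for j in range(I) if j not in S), key=lambda j: b_iS[j])
        # increment (51), written as 4n/(n+1) [(n-1)(b_S b_iS - b_S^2/2) + b_iS^2]
        gain = (4 * n / (n + 1)) * ((n - 1) * (b_S * b_iS[i] - b_S**2 / 2)
                                    + b_iS[i]**2)
        if gain <= 0:
            break
        S.append(i)
    sub = B_m[np.ix_(S, S)]
    lam = (sub.sum() - np.trace(sub)) / (len(S) * (len(S) - 1))  # optimal weight
    return S, lam
```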

The decomposition of the sum of squares for the present context is as follows:

   \sum_{i,j} ( b_{ij} )^2 = \sum_m \lambda_m^2 r(S_m) + \sum_{i,j} e_{ij}^2 ,   (52)

where r(S) = |S|^2 or |S| ( |S| - 1 ), depending on whether the diagonal elements b_{ii} are present or absent, respectively. The method of finding the clusters S_m does not influence the form of (52) if the \lambda_m are optimal (for fixed S_m). But we may analyze the contributions of the clusters S_m to the variance accounted for (not the usual sum of squares) of the proximities b_{ij} only in the case when B is double-centered. The last column in Table 3 of Mirkin (1987a), containing the values of variance accounted for, is incorrect because the clusters in Table 3 were fitted to the initial proximity matrix B (without a preliminary centering of its entries).

We now consider the model of additive clusters for a three-way case, when several proximity matrices B_k (k ∈ <K>) on the set <I> (without the main diagonal) are available (Carroll and Arabie 1983; Arabie, Carroll, and DeSarbo 1987):

   b_{ij,k} = \sum_m \lambda_{mk} f_{im} f_{jm} + e_{ij,k} ,

with the least squares criterion \sum_{i,j,k} e_{ij,k}^2 for fitting the model. That criterion presumes that all i, j, k are equally weighted; thus, we may wish to conduct a preliminary transformation of the data.

The general step of the SEFIT method requires minimizing the criterion

   \sum_{i,j,k} ( b_{ij,k}^m - \lambda_k f_i f_j )^2   (53)

over all possible \lambda_k and Boolean f_i. It is not difficult to prove that the problem is equivalent to maximizing the contribution of cluster S to the sum of squares \sum_{i,j,k} ( b_{ij,k} )^2, which is equal to:

   h(S) = \sum_k h_k(S) ,   (54)

where

   h_k(S) = ( \sum_{i \ne j \in S} b_{ij,k}^m )^2 / ( |S| ( |S| - 1 ) ) .

So the change \Delta h(S, i) = h(S ∪ i) - h(S) may be computed according to the formula \Delta h(S, i) = \sum_k \Delta_k(S, i), using \Delta_k(S, i) defined by (51) and the entries of the matrix B_k (k ∈ <K>). Obviously, if S is a strict cluster for each matrix B_k, then \Delta_k(S, i) < 0 for all k and i, and S is suboptimal by criterion (54). The opposite implication is not true: \Delta h(S, i) < 0 for all i yields \Delta_k(S, i) < 0 for some k, but not necessarily for all k ∈ <K>. The case of three-way three-mode (e.g., sources of data by entities by variables) matrices may also be considered, but the question merits a separate discussion.

12. Conclusion

This paper has proposed an approximation approach to the cluster analysis problem, based on a data analysis model and a sequential fitting of the component terms of the model. This approach encompasses the ostensibly diverse problems of principal component analysis and clustering (fuzzy or traditional). This type of generalization is important for theoretical as well as for practical concerns, since software might be developed in a unified manner that would include all the special cases as options.

The sequential fitting strategy has two main advantages. First, the contribution of each element of the solution to the initial data sum of squares may be evaluated. This feature allows an estimation of the importance of the various components of a model (for example, we may evaluate the importance weight for each variable in each cluster). Second, several natural and nontrivial models for data (principal components, additive ideal types, principal strict clusters, and so on) can be included within the approach.

The approximation approach leads to algorithms close to several known heuristic clustering procedures that generally give good results in applications. The approximation approach may also allow the resolution of several very important questions in the context of practical data analysis applications:

(i) how the use of preliminary transformation of the data, and the measures of proximity between units and of correlation between variables, are determined by the criterion used in the data approximation (see, for example, Section 11);

(ii) how the choice of the basic parameters of the computation process (number of clusters, initial "points," stopping rule) is determined by the approximation criterion and the rules for obtaining subsets;


(iii) how interpretation characteristics (the standard points; the contribution values for the clusters, for the variables in the clusters, or for the units in the clusters) are defined by the model and the SEFIT procedures themselves.

References

ARABIE, P., CARROLL, J. D., and DESARBO, W. S. (1987), Three-way Scaling and Clustering, Newbury Park, CA: Sage.

BEZDEK, J. C. (1981), Pattern Recognition with Objective Function Algorithms, New York: Plenum.

BRAVERMAN, E. M., and MOUCHNICK, I. B. (1983), Structural Methods of Empirical Data Analysis, Moscow: Nauka Publishers (in Russian).

CARROLL, J. D., and ARABIE, P. (1983), "INDCLUS: An Individual Differences Generalization of the ADCLUS Model and the MAPCLUS Algorithm," Psychometrika, 48, 157-169.

DOROPHEIUK, A. A. (1966), "Algorithms for Pattern Recognition without Teachers, Based on the Potential Functions Method," Automation and Remote Control, 10, 78-87 (in Russian).

GOODMAN, L. A., and KRUSKAL, W. H. (1954), "Measures of Association for Cross Classifications," Journal of the American Statistical Association, 49, 723-764.

HARMAN, H. (1960), Modern Factor Analysis, Chicago: University of Chicago Press.

HOLZINGER, K. J., and HARMAN, H. H. (1941), Factor Analysis, Chicago: University of Chicago Press.

JAMBU, M., and LEBEAUX, M.-O. (1983), Cluster Analysis and Data Analysis, Amsterdam: North-Holland.

MIRKIN, B. G. (1980), Analysis of Qualitative Variables and Structures, Moscow: Statistika Publishers (in Russian).

MIRKIN, B. G. (1985), Groupings in Socio-Economic Research, Moscow: Finansy i Statistika Publishers (in Russian).

MIRKIN, B. G. (1987a), "Additive Clustering and Qualitative Factor Analysis Methods for Similarity Matrices," Journal of Classification, 4, 7-31; Erratum, 6, 271-272.

MIRKIN, B. G. (1987b), "Method of Principal Cluster Analysis," Automation and Remote Control, 10, 131-142 (in Russian).

MIRKIN, B. G., and SATAROV, G. A. (1990), "Method of Fuzzy Additive Types in Multidimensional Data Analysis," Automation and Remote Control (to appear, in Russian).

SAPORTA, G. (1988), "About Maximal Association Criteria in Linear Analysis and in Cluster Analysis," in Classification and Related Methods of Data Analysis, ed. H. H. Bock, Amsterdam: Elsevier, 541-550.

SHEPARD, R. N., and ARABIE, P. (1979), "Additive Clustering: Representation of Similarities as Combinations of Overlapping Properties," Psychological Review, 86, 87-123.

SOKAL, R. R., and MICHENER, C. D. (1958), "A Statistical Method for Evaluating Systematic Relationships," University of Kansas Science Bulletin, 38, 1409-1438.

SPAETH, H. (1985), Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples, trans. J. Goldschmidt, Chichester: Ellis Horwood. (Original work published 1983.)

TROPHIMOV, V. A. (1981), "A Finite Method of Qualitative Factor Analysis," in Methods of Multidimensional Economics Data Analysis, ed. B. G. Mirkin, Novosibirsk: Nauka, 12-29 (in Russian).