
Computational Statistics & Data Analysis 27 (1998) 371-388


Parameter estimation in latent profile models

A.P. Dunmur, D.M. Titterington*

Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland, UK

Received 1 February 1997; accepted 1 October 1997

Abstract

Latent profile analysis is a version of latent structure analysis in which the observed variables are continuous and the latent variables are discrete. The latent structure can be enriched if the latent variables are multivariate, but computational difficulties can arise in the implementation of the appropriate version of the EM algorithm. These difficulties can be eased by incorporating mean-field approximations in the E-step of the EM algorithm. Simple examples, treated in detail, show the effectiveness of these methods, which were first proposed in the engineering and neural-computing literatures. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: EM algorithm; Incomplete data; Latent profile analysis; Mean-field approximations

1. Introduction

Bartholomew (1984) identifies the basic formulation of factor analysis models as being given by the relationship

\[ f(x) = \int g(x \mid y)\, dH(y), \tag{1} \]

in which f(x) is the joint probability function of p observed variables, x, H(y) is the joint probability measure for s unobserved, latent variables, y, and g(x|y) is the conditional probability function for x, given y. This representation covers a variety of models, according to the nature of the variables in x and y. In this paper we look at the case of latent profile analysis, in which y is discrete and x is continuous.

* Corresponding author.


In fact, (1) represents a wide class of missing data problems, in which (x, y) would represent a complete observation within which in practice x is observed and y is not. The main feature that characterizes latent structure models is that the unobserved variables are not real, physical variables that happen to be missing. Instead, they represent underlying factors without precise physical definition, but which often turn out to have meaningful physical interpretation.

We shall impose conditional independence on g(x|y), so that

\[ g(x \mid y) = \prod_{i=1}^{p} g_i(x_i \mid y), \]

where x_i is the ith component of x. Bartholomew (1984) comments that this condition is fundamental in ensuring that it is the dependence of the x_i on y that accounts fully for the mutual dependence of the x_i.

The layout of the paper is as follows: Section 2 introduces the idea of latent profile models with multiple latent variables, Section 3 acknowledges relevant literature in neural computing, and Section 4 constructs an appropriate EM algorithm for computing maximum likelihood estimates. In Section 5, mean-field approximations are introduced to obviate the computational difficulties in the E-step, and the relevant details for the latent profile example are described in Section 6. Section 7 is devoted to numerical illustrations and Section 8 contains a brief discussion.

2. Models with multiple discrete latent variables

In standard latent profile analysis (Gibson, 1959), the unobserved variable, y, is a d-category categorical variable, representing d latent classes. As a result, the probability model for x is that of a finite mixture distribution with d components in which the class-conditional densities correspond to independence models. We extend this by proposing that y = (y_1, ..., y_s), where each y_j is a categorical variable. In principle, the number of categories, d_j, in the y_j could vary over j, but for simplicity we shall consider only the case d_1 = ... = d_s = d.

So far as the marginal probability function of the latent variables, h(y), is concerned, this structuring of y opens up a wealth of multivariate categorical models, and, in particular, the log-linear class. The overall sample space of y now consists of d^s latent classes, and one might suggest any log-linear model for y, between the extremes of an independence model, involving s(d - 1) parameters, and the general d^s-cell multinomial, with d^s - 1 parameters. The advantage of this structuring of the latent variables is that interesting qualitative information may result from the nature of the variables.

Example 2.1. The simplest latent profile model.

In this example we assume an independence model for h(y). Thus

\[ h(y) = \prod_{j=1}^{s} h(y_j). \tag{2} \]


It is convenient to represent the variable y_j by an indicator vector y_j = (y_{j1}, ..., y_{jd}), and to define the probability vector h_j = (h_{j1}, ..., h_{jd}), where

\[ h_{jk} = P(y_{jk} = 1), \]

for all j and k. In (2), therefore,

\[ h(y) = \prod_{j=1}^{s} \prod_{k=1}^{d} h_{jk}^{\,y_{jk}}. \tag{3} \]

Turning now to g(x|y), which we are assuming to represent an independence model, arguably the simplest case uses normal linear models, so that

\[ E(X_i \mid y) = \sum_{j=1}^{s} \sum_{k=1}^{d} W_{jik}\, y_{jk}, \qquad \mathrm{var}(X_i \mid y) = \sigma_i^2, \]

and X_i|y is normal, i = 1, ..., p. Thus

\[ \log g(x \mid y) = \text{const.} - \tfrac{1}{2}\sum_{i=1}^{p}\log(\sigma_i^2) - \tfrac{1}{2}\Bigl(x - \sum_{j=1}^{s} W_j y_j\Bigr)^{T}\Omega^{-1}\Bigl(x - \sum_{j=1}^{s} W_j y_j\Bigr), \]

where W_j is the (p × d) matrix with elements W_{jik} and Ω = diag(σ_1², ..., σ_p²). In some cases one might also assume σ_1² = ... = σ_p² = σ². Example 2.1 corresponds to a version of the standard factor-analysis model in which the latent variables are categorical rather than Gaussian.
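To make the structure of Example 2.1 concrete, the following minimal sketch generates data from the model: independent categorical latent vectors y_j drawn according to (2)-(3), and x drawn from the normal linear model with mean Σ_j W_j y_j and diagonal covariance Ω. The dimensions and parameter values (s, d, p, W, σ_i²) are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: s latent variables with d categories each, p observed variables.
s, d, p, N = 2, 3, 4, 500

h = rng.dirichlet(np.ones(d), size=s)   # h[j, k] = P(y_{jk} = 1), independence model (2)-(3)
W = rng.normal(size=(s, p, d))          # loading matrices W_j, each p x d
sigma2 = np.full(p, 0.5)                # noise variances sigma_i^2

def sample(n_obs):
    """Draw n_obs observations (x, y) from the latent profile model of Example 2.1."""
    X = np.empty((n_obs, p))
    Y = np.zeros((n_obs, s, d))
    for n in range(n_obs):
        for j in range(s):
            Y[n, j, rng.choice(d, p=h[j])] = 1.0      # indicator vector y_j
        mean = sum(W[j] @ Y[n, j] for j in range(s))  # E(X | y) = sum_j W_j y_j
        X[n] = mean + rng.normal(scale=np.sqrt(sigma2))
    return X, Y

X, Y = sample(N)
```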

3. A link between the simple latent profile model and neural computing

A structure similar to that in Example 2.1 has been discussed in the neural computing literature, under the terminology of cooperative vector quantization (CVQ). We acknowledge in particular Ghahramani (1995), who in turn refers back to an unpublished Ph.D. thesis by R. Zemel and to Hinton and Zemel (1994). In its simplest form, vector quantization associates a data vector x with one of a finite set of categories or classes. As such, the activity is similar to cluster analysis, with each cluster corresponding to a class, and many familiar cluster-analysis algorithms have appeared in the vector-quantization literature; see, for instance, Cheng and Titterington (1994). In particular, the idea of using a statistical mixture model has been proposed, based on a prescribed number of clusters/mixture components, as well as the special versions that correspond to latent structure analysis. The extra feature of CVQ is the structuring of the latent classes in a multivariate way, as we have set out in Section 2. There are assumed to be d^s latent classes, one of which obtains for each x, but they are structured as s sets of d subclasses, and one of each set of subclasses obtains. Ghahramani (1995) effectively assumes the model of Example 2.1, but with σ_1² = ... = σ_p² = 1. The version presented here is more flexible and, therefore, potentially more versatile. Ghahramani (1995) then investigates, as we shall do,


the problem of estimating parameters from a set of observed data, and applies the resulting algorithms to a problem in image analysis.

4. Parameter estimation by maximum likelihood

As with virtually all incomplete-data problems, explicit maximum likelihood estimates do not exist for the problems considered here, but there is a version of the EM algorithm of Dempster et al. (1977) that can be used to maximize the likelihood iteratively. Suppose

\[ h_u(y) = P(\{y_{ju_j} = 1:\ j = 1, \ldots, s\}) \]

and

\[ g_u(x \mid y) = g(x \mid \{y_{ju_j} = 1:\ j = 1, \ldots, s\}), \]

where u = (u_1, ..., u_s) denotes a realization of the latent variables y. Given N independent observations, x^{(1)}, ..., x^{(N)}, of x, the complete-data log-likelihood is

\[ \ell_c(\theta; \{(x^{(n)}, y^{(n)})\}) = \sum_{n=1}^{N} \sum_{u} y^{(n)}_{1u_1} \cdots y^{(n)}_{su_s}\, \{\log g_u(x^{(n)} \mid y^{(n)}) + \log h_u(y^{(n)})\}, \]

where θ denotes all unknown parameters and the second summation covers all d^s possible realizations of the set of latent variables. In the EM algorithm a sequence of parameter estimates {θ(m)} is generated from an initial θ(0), using the following double step:

• E-step: Given θ(m - 1), evaluate

\[ Q(\theta, \theta(m-1)) = E(\ell_c(\theta; \{(x^{(n)}, y^{(n)})\}) \mid \{x^{(n)}\}, \theta(m-1)). \]

• M-step: Choose θ = θ(m) to maximize Q(θ, θ(m - 1)).

Here, the E-step amounts to the computation of

\[ f_u^{(n)}(m) = E(y^{(n)}_{1u_1} \cdots y^{(n)}_{su_s} \mid x^{(n)}, \theta(m-1)) = P(y^{(n)}_{1u_1} = 1, \ldots, y^{(n)}_{su_s} = 1 \mid x^{(n)}, \theta(m-1)) = \frac{g_u(x^{(n)} \mid y)\, h_u(y)}{\sum_{v} g_v(x^{(n)} \mid y)\, h_v(y)}, \tag{4} \]

in which all parameter values on the numerator and denominator are based on θ(m - 1). Since Σ_u f_u^{(n)}(m) = 1, for all n, the {f_u^{(n)}(m)} have the interpretation of weights by which the nth observation is assigned to the latent class represented by u at time step m.
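As a sketch of this exact E-step, the routine below enumerates all d^s latent configurations u for the Gaussian model of Example 2.1 and normalizes the resulting weights as in (4). It assumes the same array layout as the earlier sampling sketch (W of shape (s, p, d), h of shape (s, d)) and uses scipy only for the Gaussian density; the function name is illustrative.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def exact_e_step(X, W, sigma2, h):
    """Exact E-step weights f_u^{(n)}(m) of Eq. (4), obtained by enumerating
    all d^s latent configurations u; the cost is O(N d^s)."""
    s, p, d = W.shape
    configs = list(itertools.product(range(d), repeat=s))   # all u = (u_1, ..., u_s)
    F = np.empty((X.shape[0], len(configs)))
    for c, u in enumerate(configs):
        mean = sum(W[j][:, u[j]] for j in range(s))          # sum_j (column u_j of W_j)
        g_u = multivariate_normal.pdf(X, mean=mean, cov=np.diag(sigma2))
        h_u = np.prod([h[j, u[j]] for j in range(s)])        # independence model (3)
        F[:, c] = g_u * h_u
    F /= F.sum(axis=1, keepdims=True)                        # normalize over u, Eq. (4)
    return configs, F
```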

In the M-step, we have to maximize

\[ Q(\theta, \theta(m-1)) = \sum_{n} \sum_{u} f_u^{(n)}(m)\, \{\log g_u(x^{(n)} \mid y^{(n)}) + \log h_u(y^{(n)})\} \]

with respect to the parameters θ within g_u and h_u. So far as the first term is concerned, this is easy in the linear-Gaussian structure of Example 2.1; see later


for details. It would also be easy in many other exponential-family models; cf. Bartholomew (1984). So far as the second term is concerned, recall that it is natural to choose a model for h(y) from the class of log-linear models, and explicit maximizers are available if we use a model for which direct estimation is feasible; see for instance Section 3.4 of Bishop et al. (1975). Otherwise, this part of the M-step requires an iterative calculation, for instance using the iterative proportional fitting procedure (IPFP). In fact, even just one complete cycle of the IPFP leads to an increase in the log-likelihood of interest, and could form the basis of a so-called generalized EM (GEM) algorithm (Dempster et al., 1977), which is potentially much more economical than EM and yet should still reach the maximum likelihood estimate. Two cases for which there is no difficulty in the M-step are the full multinomial model, corresponding to d^s unstructured latent classes, and, at the other extreme, the independence model of Example 2.1.

For the unstructured case, the updated (mth) estimate for the probability associated with the latent class corresponding to u is simply

\[ h_u(m) = N^{-1} \sum_{n} f_u^{(n)}(m). \tag{5} \]

For the independence model, the estimates of the parameters in (3) are obtained by summing appropriate values in (5). For instance, the new estimate of h_{jv} is

\[ h_{jv}(m) = \sum_{u:\, u_j = v} h_u(m). \]
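A sketch of the corresponding M-step for the latent-class probabilities follows, continuing the notation of the E-step sketch above (configs and F as returned by exact_e_step, both of which are hypothetical names); it implements (5) and the marginalization that gives h_{jv} under the independence model.

```python
import numpy as np

def update_h(configs, F, s, d):
    """M-step for the latent-class probabilities: Eq. (5) for the unstructured
    multinomial model, then h_{jv}(m) obtained by summing h_u(m) over all
    configurations u with u_j = v (independence model)."""
    h_u = F.sum(axis=0) / F.shape[0]          # Eq. (5)
    h_marg = np.zeros((s, d))
    for c, u in enumerate(configs):
        for j in range(s):
            h_marg[j, u[j]] += h_u[c]         # sum over {u : u_j = v}
    return h_u, h_marg
```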

We now return to the details of the M-step for g(x|y) for the case of Example 2.1. Here, the crucial part of the complete-data log-likelihood is

\[ \sum_{n=1}^{N} \log g(x^{(n)} \mid y^{(n)}) = \sum_{n=1}^{N} \sum_{i=1}^{p} \log g_i(x_i^{(n)} \mid y^{(n)}). \]

Example 4.1 (Example 2.1 continued). If Ω = diag(σ_1², ..., σ_p²),

\[ \sum_{n} \log g(x^{(n)} \mid y^{(n)}) = \text{const.} - \tfrac{N}{2} \sum_{i=1}^{p} \log(\sigma_i^2) - \tfrac{1}{2} \sum_{n=1}^{N} \Bigl(x^{(n)} - \sum_{j=1}^{s} W_j y_j^{(n)}\Bigr)^{T} \Omega^{-1} \Bigl(x^{(n)} - \sum_{j=1}^{s} W_j y_j^{(n)}\Bigr). \]

Write w_{ji}^T for the ith row of W_j and write x_i^{(n)} for the ith element of x^{(n)}. Then

\[ \sum_{n} \log g(x^{(n)} \mid y^{(n)}) = \text{const.} - \tfrac{N}{2} \sum_{i=1}^{p} \log(\sigma_i^2) - \tfrac{1}{2} \sum_{i=1}^{p} \sigma_i^{-2} \sum_{n=1}^{N} \Bigl(x_i^{(n)} - \sum_{j=1}^{s} w_{ji}^{T} y_j^{(n)}\Bigr)^2. \tag{6} \]


Now write w_i^T = (w_{1i}^T, ..., w_{si}^T) and y^{(n)T} = (y_1^{(n)T}, ..., y_s^{(n)T}). Then the ith sum-of-squares term in (6) is

\[ \sum_{n=1}^{N} \bigl\{(x_i^{(n)})^2 - 2 x_i^{(n)} w_i^{T} y^{(n)} + w_i^{T} y^{(n)} y^{(n)T} w_i\bigr\} = \sum_{n=1}^{N} (x_i^{(n)})^2 - 2 w_i^{T} \sum_{n=1}^{N} x_i^{(n)} y^{(n)} + w_i^{T} \Bigl(\sum_{n=1}^{N} y^{(n)} y^{(n)T}\Bigr) w_i. \tag{7} \]

Were the {y^{(n)}} known, therefore, w_i would be estimated by the standard least-squares formula

\[ \hat{w}_i = (Y^{T} Y)^{-1} Y^{T} x_i, \tag{8} \]

where (x_i)_n = x_i^{(n)} and the nth column of Y^T is y^{(n)}. Consequently, the maximum likelihood estimate of σ_i² would be

\[ \hat{\sigma}_i^2 = N^{-1}(x_i^{T} x_i - x_i^{T} Y \hat{w}_i). \tag{9} \]

If a common σ² is assumed, then σ̂² = p^{-1} Σ_i σ̂_i². In reality, of course, the {y^{(n)}} are unknown and, when the M-step is performed, y^{(n)} and y^{(n)} y^{(n)T} in (7) are replaced by their expected values, given the data {x^{(n)}} and given current sets of estimates {ŵ_i} and {σ̂_i²}, say. Suppose we denote these expectations by {ŷ^{(n)}} and {\widehat{y^{(n)} y^{(n)T}}}, resulting in the replacement of Y^T Y by \widehat{Y^T Y} and Y by Ŷ, say. Thus, within ŷ^{(n)},

\[ \hat{y}_{jk}^{(n)} = \hat{E}(y_{jk}^{(n)}) = \hat{P}(y_{jk}^{(n)} = 1), \]

where Ê and P̂ denote the expectation and probability associated with the conditional distribution under the current parameter estimates. Also, within \widehat{y^{(n)} y^{(n)T}} there are elements such as

\[ \widehat{y^{(n)}_{j_1 k_1} y^{(n)}_{j_2 k_2}} = \hat{E}(y^{(n)}_{j_1 k_1} y^{(n)}_{j_2 k_2}) = \hat{P}(y^{(n)}_{j_1 k_1} = 1,\ y^{(n)}_{j_2 k_2} = 1). \]

The EM iterative stage then amounts to the following modifications of (8) and (9):

\[ \hat{w}_i = (\widehat{Y^{T} Y})^{-1} \hat{Y}^{T} x_i, \tag{10} \]

\[ \hat{\sigma}_i^2 = N^{-1}(x_i^{T} x_i - x_i^{T} \hat{Y} \hat{w}_i), \tag{11} \]

and these are evaluated for i = 1, ..., p. The corresponding material in Ghahramani (1995) differs from this, in that (10) is replaced by

\[ \hat{w}_i = \Bigl(\sum_{n=1}^{N} \hat{y}^{(n)} \hat{y}^{(n)T}\Bigr)^{-1} \sum_{n=1}^{N} \hat{y}^{(n)} x_i^{(n)}. \tag{12} \]

Also, σ_i² = 1 is assumed for all i, so that no analogue of (11) appears. Ghahramani and Jordan (1996) extend the methodology to the case where the hidden indicators are not independent but follow a Markov chain. Here also they fix the variances to be 1 and re-estimate w_i using the corresponding version of (12), rather than the more natural (10).
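A sketch of the M-step updates (10) and (11) follows. It takes as input the conditional moments produced by an E-step: Y_hat, an N × sd matrix whose nth row stacks the s expected indicator vectors ŷ^{(n)}, and YtY_hat, the sd × sd matrix replacing Y^T Y. The argument names are illustrative, not the paper's notation.

```python
import numpy as np

def m_step_regression(X, Y_hat, YtY_hat):
    """M-step updates (10)-(11).

    X        : (N, p) data matrix
    Y_hat    : (N, s*d) expected stacked indicators, replacing Y
    YtY_hat  : (s*d, s*d) expected Y^T Y
    """
    N = X.shape[0]
    W_hat = np.linalg.solve(YtY_hat, Y_hat.T @ X)   # column i is w_i-hat, Eq. (10)
    sigma2_hat = np.array([
        (X[:, i] @ X[:, i] - X[:, i] @ Y_hat @ W_hat[:, i]) / N   # Eq. (11)
        for i in range(X.shape[1])
    ])
    return W_hat, sigma2_hat
```

The variant (12) would correspond to passing Y_hat.T @ Y_hat, the sum of outer products of the conditional means, in place of YtY_hat.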


5. Computational aspects of the E-step and mean-field approximations

Ghahramani (1995) remarks that, even though the E-step in Example 2.1 requires only univariate and bivariate moments of the missing indicators, conditional on the observed data and the current estimates of the parameters, these moments can be obtained only by first computing the s-variate joint probabilities {f_u^{(n)}(m)}, as defined in (4), and then summing to give the required marginal values. Computation of the complete set of {f_u^{(n)}(m)} has complexity of O(N d^s), which may be impractical in some contexts. He discusses the possibility of approximating them by Gibbs sampling, and he also develops a so-called mean-field approximation.

In this section we briefly review the mean-field approximation, in particular in the context of its application in the E-step of the EM Algorithm. For a more extensive discussion see Dunmur and Titterington (1998).

In the E-step, we have to evaluate expectations corresponding to a certain joint probability distribution, p̃ say, for the latent quantities, conditional on the observables and given current (ˆ) estimates of the parameters. However, with a multidimensional discrete sample space the number of quantities to be evaluated is very high, since the quantities involved are multiple indicators.

The essential feature of mean-field approximations is to replace a complicated joint distribution by an approximating product of simpler marginal distributions, and Dunmur and Titterington (1998) describe a number of rationales by which the approximations can be derived. Here we shall just state one of the rationales. We suppose that p* is the proposed approximator of p̃ and we define the product form

\[ p^*(y) = \prod_{l=1}^{s} p_l^*(y_l \mid m_l), \tag{13} \]

where m_l is the mean-field parameter. In particular, when each y_l is a vector of indicators, we have

\[ p^*(y) = \prod_{l} \prod_{r} m_{lr}^{\,y_{lr}}, \tag{14} \]

within which Σ_r m_{lr} = 1, for all l. Equivalently,

\[ \log p^*(y) = \sum_{l} \sum_{r} y_{lr} \log m_{lr}. \]

It will turn out that p̃ allows us to write, for any particular (l, r),

\[ \log \tilde{p}(y) = \sum_{r} y_{lr}\, h_{lr}(y_{\partial l}) + H_l(y), \]

where y_{\partial l} are a small subset of the {y_j} that are neighbours of y_l, and H_l(y) contains all remaining terms, none of which involves y_l.

The rationale (Zhang, 1992, 1993) underlying the mean-field approximation is first to replace the joint distribution p̃ by the product of the associated full conditional


distributions, namely,

\[ \prod_{l} \tilde{p}(y_l \mid \{y_j:\, j \neq l\}). \tag{15} \]

Then we envisage substituting means on the right-hand side of the conditioning sign, evaluate the marginal mean of y_l corresponding to the resulting independence model, and denote all means in the subsequent expression by m. These means then satisfy

\[ m_l = E_{y_l}(y_l \mid y_j = m_j,\ j \neq l). \tag{16} \]

Note that (15) is what Besag (1975) calls the pseudo-likelihood, which thereby relates the mean-field approximation to another familiar statistical concept.

Archer and Titterington (1997) compare various methods, including maximum-likelihood methods, in the analysis of hidden Markov chains. They issue a note of caution about the worrying imposition of determinism in substituting means for variables in the mean-field approach, and they give trivial examples that show that the approximation can be disastrous, but they also describe simulations that suggest, in support of the results reported in the papers of Zhang, Ghahramani and Jordan, that the use of mean-field approximations within the EM algorithm can be quite effective. This is an aspect that we re-visit in the present paper.

6. The mean-field approximations for the example

In this section we apply the mean-field approximation to the E-step for our example.

Example 6.1 (Example 2.1 continued). Recall that we must devote attention to p̃(y_l | {y_j: j ≠ l}). One way to identify this function, which, in more detail, is P(y_l | {y_j: j ≠ l}, x, θ), is to write down p(x, y|θ), identify the factors within it that involve y_l, and divide them by the appropriate normalizing constant. Since the sample space for each individual y_l is small, compared with that of y, computation of the normalizing constant is not difficult. Equivalently, one must identify the component of log p(x, y|θ) that involves y_l. (To reduce the notation we omit the ˆ for the time being.) For this example, we obtain

\[ \log P(y_l \mid \{y_j:\, j \neq l\}, x, \theta) = \text{const.} + y_l^{T} W_l^{T} \Omega^{-1}\Bigl(x - \sum_{j \neq l} W_j y_j\Bigr) - \tfrac{1}{2}\, y_l^{T} W_l^{T} \Omega^{-1} W_l y_l + y_l^{T} \log h_l, \]

where log h_l denotes the vector with components {log h_{lk}, k = 1, ..., d}. The third term on the right-hand side is equivalent to

\[ -\tfrac{1}{2}\, \mathrm{tr}\{W_l^{T} \Omega^{-1} W_l\, y_l y_l^{T}\} = -\tfrac{1}{2}\, \mathrm{tr}\{W_l^{T} \Omega^{-1} W_l\, \mathrm{diag}(y_{l1}, \ldots, y_{ld})\}, \]


since y_{lk} y_{lk'} = δ_{kk'} y_{lk}, where δ_{kk'} is the Kronecker delta. Thus

\[ -\tfrac{1}{2}\, \mathrm{tr}\{W_l^{T} \Omega^{-1} W_l\, y_l y_l^{T}\} = -\tfrac{1}{2} \sum_{k} (W_l^{T} \Omega^{-1} W_l)_{kk}\, y_{lk}, \]

and

\[ \log P(y_{lk} = 1 \mid \{y_j,\, j \neq l\}, x, \theta) = \text{const.} + v_{lk}, \]

where

\[ v_{lk} = \Bigl\{W_l^{T} \Omega^{-1}\Bigl(x - \sum_{j \neq l} W_j y_j\Bigr)\Bigr\}_{k} - \tfrac{1}{2}\,(W_l^{T} \Omega^{-1} W_l)_{kk} + \log h_{lk}. \]

The next stage is to substitute the mean-field parameters, m_j, for the {y_j}, giving the approximation

\[ P(y_{lk} = 1 \mid \{y_j = m_j,\, j \neq l\}, x, \theta) \propto \exp\Bigl[\Bigl\{W_l^{T} \Omega^{-1}\Bigl(x - \sum_{j \neq l} W_j m_j\Bigr)\Bigr\}_{k} - \tfrac{1}{2}\,(W_l^{T} \Omega^{-1} W_l)_{kk} + \log h_{lk}\Bigr]. \]

Now recall that, when applied in the E-step of the EM algorithm, the parameter values are the ˆ values, and that this calculation is done for x = x^{(n)}, for each n = 1, ..., N. Thus, for each n, l and k,

\[ P(y_{lk}^{(n)} = 1 \mid \{y_j^{(n)} = \hat{m}_j^{(n)},\ j \neq l\}, x^{(n)}, \hat{\theta}) \propto \exp\Bigl[\bigl\{\hat{W}_l^{T} \hat{\Omega}^{-1}\bigl(x^{(n)} - \hat{x}_l^{(n)}\bigr)\bigr\}_{k} - \tfrac{1}{2}\,(\hat{W}_l^{T} \hat{\Omega}^{-1} \hat{W}_l)_{kk} + \log \hat{h}_{lk}\Bigr], \]

where \hat{x}_l^{(n)} = \sum_{j=1;\ j \neq l}^{s} \hat{W}_j \hat{m}_j^{(n)}. If we introduce the 'softmax' vector, defined on vectors v by

\[ \{\mathrm{softmax}(v)\}_k = e^{v_k} \Big/ \sum_{r} e^{v_r}, \]

and we carry out the final stage of the mean-field approach, namely to define m_{lk}^{(n)} as the expected value of the binary variable y_{lk}^{(n)} associated with the above distribution, then

\[ m_l^{(n)} = \mathrm{softmax}\Bigl\{\hat{W}_l^{T} \hat{\Omega}^{-1}\bigl(x^{(n)} - \hat{x}_l^{(n)}\bigr) - \tfrac{1}{2}\,\hat{\Delta}_l + \log \hat{h}_l\Bigr\}, \tag{17} \]

for all n and l, where Δ̂_l is a vector containing the diagonal elements of Ŵ_l^T Ω̂^{-1} Ŵ_l. Formula (17) matches the corresponding expressions in Ghahramani (1995) and Ghahramani and Jordan (1996), although there is a slight lack of clarity in their notation. The Ns sets of nonlinear equations for the {m_l^{(n)}} have to be solved, in principle, in each E-step of the EM algorithm. Establishing formal conditions for a unique solution to Eq. (17) is a hard task due to the number and nature of the simultaneous equations.
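The fixed-point equations (17) can be iterated directly. The following sketch does so for a single observation, sweeping over the latent variables in turn (the sequential scheme discussed in Section 7) with the parameters held at their current estimates; the array shapes follow the earlier sketches and the uniform initialization and number of sweeps are arbitrary choices.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())          # subtract max for numerical stability
    return e / e.sum()

def mean_field_e_step(x, W, sigma2, h, n_sweeps=20):
    """Mean-field E-step for one observation x: iterate Eq. (17),
    m_l = softmax{ W_l^T Omega^{-1} (x - x_l-hat) - 0.5 Delta_l + log h_l },
    where x_l-hat = sum_{j != l} W_j m_j and Delta_l = diag(W_l^T Omega^{-1} W_l)."""
    s, p, d = W.shape
    om_inv = 1.0 / sigma2                        # diagonal of Omega^{-1}
    m = np.full((s, d), 1.0 / d)                 # neutral starting values
    for _ in range(n_sweeps):
        for l in range(s):
            x_rest = sum(W[j] @ m[j] for j in range(s) if j != l)
            lin = W[l].T @ (om_inv * (x - x_rest))
            delta = np.diag(W[l].T @ (om_inv[:, None] * W[l]))
            m[l] = softmax(lin - 0.5 * delta + np.log(h[l]))
    return m
```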


7. Simple illustrative examples

In this section we examine very simple examples with a view to elucidating the comparative behaviour of standard EM and MF, as we shall call the EM algorithm with the mean-field version of the E-step. The two algorithms are different, in general. The convergence properties of EM to the maximum likelihood estimates are in general very satisfactory (Wu, 1983), whereas the approximations and deterministic aspects of MF cast suspicion on its likely properties. On the other hand, the empirical results of Zhang (1992, 1993), Ghahramani (1995) and Ghahramani and Jordan (1996) are very encouraging. The examples we use here are deliberately simple, in order to help us pin down the main issues. They are both trivial versions of Example 2.1, and in fact concern simple Gaussian mixtures.

Example 7.1. In this example we take

\[ f(x) = \tfrac{1}{2}\,\phi(x - w_1) + \tfrac{1}{2}\,\phi(x - w_2), \]

where φ(·) denotes the standard Gaussian density function. The model is therefore that of an equally weighted mixture of two unit-variance Gaussians. In terms of Example 2.1, we have s = 1, d = 2, σ_1² = σ² = 1 (treated as known), and h^T = (1/2, 1/2), treated as known. Thus, if y^T = (y_1, y_2) denotes the latent vector,

\[ p(x^{(n)} \mid y^{(n)}) \propto \exp\bigl\{-\tfrac{1}{2}\bigl((x^{(n)})^2 + w_1 y_1^{(n)}(w_1 - 2x^{(n)}) + w_2 y_2^{(n)}(w_2 - 2x^{(n)})\bigr)\bigr\}, \]

since y_1^{(n)} y_2^{(n)} = 0, and (y_r^{(n)})^2 = y_r^{(n)}, r = 1, 2. The mean-field equations are therefore

\[ m_1^{(n)} \propto \exp\bigl\{w_1 x^{(n)} - \tfrac{1}{2} w_1^2\bigr\}, \qquad m_2^{(n)} \propto \exp\bigl\{w_2 x^{(n)} - \tfrac{1}{2} w_2^2\bigr\}, \]

and m_1^{(n)} + m_2^{(n)} = 1. The mean-field parameters are then used in the E-step instead of the expectations.

The true E-step considers the complete-data log-likelihood but with y_r^{(n)} replaced by \hat{p}_r^{(n)}, where

\[ \hat{p}_r^{(n)} = \exp\bigl(w_r x^{(n)} - \tfrac{1}{2} w_r^2\bigr)\Big/\sum_{j}\exp\bigl(w_j x^{(n)} - \tfrac{1}{2} w_j^2\bigr), \]

for r = 1, 2. Hence the mean-field equations (unsurprisingly) are identical to the expectations \hat{p}_r^{(n)}, and so both methods are equivalent.

Example 7.2. In this example we assume that

\[ X \mid y \sim N(w_1^{T} y_1 + w_2^{T} y_2,\ 1), \]


where y_1^T = (y_{11}, y_{12}) = (1, 0) with probability 1/2 and otherwise is (0, 1). The vector y_2^T = (y_{21}, y_{22}) has the same distribution, and is independent of y_1. Thus, the marginal distribution of X is that of an equally weighted Gaussian mixture, with means (w_{11} + w_{21}), (w_{11} + w_{22}), (w_{12} + w_{21}) and (w_{12} + w_{22}), and with unit variances. As in Example 7.1, we shall assume that the mixing weights and the common variance are known, and we shall create a one-parameter problem by expressing the mean parameters in terms of a single quantity. Specifically, we assume that, for some w, w_{11} = 3w, w_{12} = w, w_{21} = 3w and w_{22} = 2w. Thus, the four component means are 6w, 5w, 4w and 3w.

For a given observation, x, let m_1 and m_2 denote the mean-field parameters associated with y_1 and y_2, respectively. Then, for instance,

\[ m_{11} \propto \exp\bigl\{3w(x - 3w m_{21} - 2w m_{22}) - \tfrac{1}{2}(3w)^2\bigr\}, \qquad m_{12} \propto \exp\bigl\{w(x - 3w m_{21} - 2w m_{22}) - \tfrac{1}{2}w^2\bigr\}, \]

and m_{12} = 1 - m_{11}. After using m_{22} = 1 - m_{21}, we obtain

\[ m_{11} = \bigl[1 + \exp\{-2w(x - w m_{21}) + 8w^2\}\bigr]^{-1}. \tag{18} \]

By a similar argument,

\[ m_{21} = \bigl[1 + \exp\{-w(x - 2w m_{11}) + \tfrac{7}{2} w^2\}\bigr]^{-1}. \tag{19} \]

Given x and w, Eqs. (18) and (19) can be solved numerically for m_{11} and m_{21}. Clearly, substitution of (19) into (18), and of (18) into (19), gives equations in m_{11} alone and m_{21} alone, respectively. In the context of the E-step of the EM algorithm, there will be N such sets of equations, one set for each x^{(n)}, and w will be the current estimate ŵ.
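A minimal sketch of such a numerical solution is given below: for a given x and w, Eqs. (18) and (19) are iterated as a fixed point, updating m_{11} and m_{21} in turn; the starting values of 0.5 and the number of iterations are arbitrary choices.

```python
import numpy as np

def solve_mean_field(x, w, n_iter=100):
    """Solve Eqs. (18) and (19) for m11 and m21 by fixed-point iteration;
    m12 = 1 - m11 and m22 = 1 - m21."""
    m11, m21 = 0.5, 0.5
    for _ in range(n_iter):
        m11 = 1.0 / (1.0 + np.exp(-2.0 * w * (x - w * m21) + 8.0 * w ** 2))    # Eq. (18)
        m21 = 1.0 / (1.0 + np.exp(-w * (x - 2.0 * w * m11) + 3.5 * w ** 2))    # Eq. (19)
    return m11, m21
```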

At this point we look at the mechanics of the iterative stage of the EM algorithm, for the time being considering the complete-data likelihood for a single observation, x. Since the mixing weights are assumed known, the quantity of interest is just

\[ \log p(x \mid y) = \text{const.} - \tfrac{1}{2}(x - 3w y_{11} - w y_{12} - 3w y_{21} - 2w y_{22})^2 = \text{const.} - \tfrac{1}{2} a w^2 + x b w, \]

where a = 9y_{11} + y_{12} + 9y_{21} + 4y_{22} + 18y_{11}y_{21} + 12y_{11}y_{22} + 6y_{12}y_{21} + 4y_{12}y_{22} and b = 3y_{11} + y_{12} + 3y_{21} + 2y_{22}. In computing a and b we have used the facts that y_{ij}^2 = y_{ij}, for all i and j, and y_{11}y_{12} = y_{21}y_{22} = 0. In the E-step, each y_{ij} is replaced by some ŷ_{ij} and each y_{ij}y_{kl} by some \widehat{y_{ij}y_{kl}}, giving -½ â w² + x b̂ w, say. Furthermore, this is done individually for each x^{(n)}, so that the E-step results in the function

\[ \sum_{n} \bigl\{-\tfrac{1}{2}\, \hat{a}^{(n)} w^2 + x^{(n)} \hat{b}^{(n)} w\bigr\}. \]


Table 1. Results for N = 500, averaged over 50 simulations, w_init = 0.1. The figures in parentheses give the standard deviation of the estimates, in units according to the final decimal place

Method   w_true   w_est       RMS
EM       0.1      0.101 (8)   0.0080
MFs      0.1      0.101 (8)   0.0079
MFB      0.1      0.101 (8)   0.0079
EM       1.0      1.00 (1)    0.0143
MFs      1.0      0.99 (1)    0.0198
MFB      1.0      0.99 (1)    0.0198
EM       2.0      2.00 (2)    0.0189
MFs      2.0      2.00 (2)    0.0219
MFB      2.0      1.99 (2)    0.0164
EM       5.0      5.001 (9)   0.0092
MFs      5.0      5.001 (9)   0.0094
MFB      5.0      4.51 (2)    0.4903

The M-step then gives the new estimate as ŵ = Σ_n x^{(n)} b̂^{(n)} / Σ_n â^{(n)}. So far as the E-step is concerned, the details are as follows.

(i) Standard EM: Here, \widehat{y_{ij} y_{kl}} = P(y_{ij} = y_{kl} = 1 | x, ŵ), for all relevant subscripts and, for instance, ŷ_{11} = \widehat{y_{11} y_{21}} + \widehat{y_{11} y_{22}}. For example, \widehat{y_{11} y_{21}} ∝ exp{-½(x - 6ŵ)²}.

(ii) MF: Here, \widehat{y_{ij} y_{kl}} = m̂_{ij} m̂_{kl}. There are a number of possible ways of implementing the MF algorithm: MFs, where the new estimates of the mean-field parameters are used immediately in subsequent mean-field calculations (sequential), and MFB, where the new estimates are stored until all the mean-field parameters have been estimated using the old estimates (batch). Thus, MFs corresponds to MFB in the same way that, in solving linear equations, the Gauss-Seidel method corresponds to the Jacobi method.
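For a single observation the two E-steps can be sketched as follows: standard EM computes the four joint posterior probabilities directly and sums them to obtain the marginals, whereas MF would replace each joint \widehat{y_{ij} y_{kl}} by the product m̂_{ij} m̂_{kl} of the mean-field marginals obtained from (18) and (19). The function below covers the standard EM quantities only; equal mixing weights are assumed and the function name is illustrative.

```python
import numpy as np

def em_e_step_quantities(x, w):
    """Standard EM E-step for one observation in Example 7.2: exact joint
    posteriors over the four latent configurations (y11,y21), (y11,y22),
    (y12,y21), (y12,y22), and marginals by summation, e.g.
    y11-hat = (y11 y21)-hat + (y11 y22)-hat."""
    means = np.array([6.0, 5.0, 4.0, 3.0]) * w
    joint = np.exp(-0.5 * (x - means) ** 2)      # proportional to P(config | x, w)
    joint /= joint.sum()
    y11_hat = joint[0] + joint[1]
    y21_hat = joint[0] + joint[2]
    return joint, y11_hat, y21_hat
```

In the MF versions, MFs would use each newly computed mean-field value immediately within a sweep, while MFB would compute all updates from the previous sweep's values before overwriting them.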

A numerical study was carried out based on various sample sizes of N = 200, 500, 1000, averaged over 50 simulations. The true parameter w_true and the initial estimate w_init were chosen from 0.1, 0.5, 1.0, 2.0, 5.0.

Table 1 presents a typical set of results for the estimation of the parameter w from a set of 500 examples averaged over 50 simulations; w_est denotes the average of the estimates and RMS denotes the root mean-squared error. The effect of the initial estimate was not significant in the EM and MFs runs. However, in the MFB runs with larger values of w_init, the algorithm had trouble converging to a single solution; it became stuck in limit cycles. This means that the results of the simulation using MFB are not trustworthy for larger parameter values. The lack of convergence is due to the batch update and the lack of identifiability caused by estimating two degrees of freedom (the mean-field estimates) from a single observation. There is very little difference between the EM and MFs methods, suggesting that the mean-field approximation is reasonable.


Table 2. Results for N = 500, averaged over 50 simulations, data generated using unequal mixing weights, ε = 0.1, w_init = 0.1

Method   w_true   w_est       RMS
EM       0.1      0.101 (8)   0.0080
MFs      0.1      0.101 (8)   0.0080
MFB      0.1      0.101 (8)   0.0080
EM       1.0      1.02 (2)    0.0242
MFs      1.0      1.01 (1)    0.0158
MFB      1.0      1.01 (1)    0.0158
EM       2.0      2.03 (2)    0.0339
MFs      2.0      2.03 (2)    0.0335
MFB      2.0      2.02 (1)    0.0261
EM       5.0      5.002 (9)   0.0091
MFs      5.0      5.002 (9)   0.0092
MFB      5.0      4.85 (7)    0.1632


An important point to note is that, as proved in the appendix, the first correction term to the mean-field estimates is given by

\[ \tfrac{1}{2}\,(\Delta e^2)\, m(1 - m)(1 - 2m), \tag{20} \]

where (Δe²) is the variance of the argument of the softmax function that appears in the mean-field equation (17) and m is the mean-field estimate of the individual expectations. Thus, whenever m is close to 0, 0.5 or 1 the corrections are small. In the case of four equally weighted mixtures, the true expectations of the latent variables are all 0.5 and the joint expectations for the different mixtures are 0.25. Hence, we would expect the mean-field algorithm to be capable of estimating the individual, and hence the joint, expectations reasonably accurately. In fact, it is possible to incorporate these correction terms in the mean-field approximation using the method of cavity fields; see Dunmur and Titterington (1997).
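As a small numerical check of (20), the correction can be evaluated on a grid of values of m; it vanishes at m = 0, 0.5 and 1 and is largest in between, for any given variance of the softmax argument (set arbitrarily to 1 here).

```python
def mf_correction(m, var_e=1.0):
    """Leading correction term (20) to the mean-field estimate m."""
    return 0.5 * var_e * m * (1.0 - m) * (1.0 - 2.0 * m)

for m in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(m, mf_correction(m))      # zero at m = 0, 0.5, 1; nonzero otherwise
```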

It is of interest to see how well the mean-field method performed when the data are generated using mixture distributions with unequal mixing weights; that is, the joint expectations of the latent variables are no longer equal to 0.25. This was studied using data generated from mixing weights given by p_{11} = p_{22} = 0.5 - ε, p_{12} = p_{21} = ε, where p_{ij} is the probability that the first latent variable is in state i and the second is in state j, and ε can be varied from 0 to 0.5. The value ε = 0.25 is equivalent to the case studied above. The marginal probabilities of the latent variables are still 0.5.

Results for ε = 0.1 are presented in Table 2. Again the initial estimate did not affect the results. The results show that the bias is generally greater than that for data generated from equally weighted mixture distributions, as would be expected. For small values of w_true the separation between the different mixtures is small and so we would not expect there to be much difference between the EM and MF algorithms, as confirmed in the table. Interestingly, there is little difference between EM


Table 3. Results for N = 500, averaged over 50 simulations, ε = 0.1, w_true = 0.1, method is EM. MFs and MFB produced almost identical results

w_init   w_est       RMS      p11            p12            p21            p22
0.1      0.101 (8)   0.0079   0.250 (2)      0.250 (5)      0.250 (3)      0.250 (7)
0.5      0.122 (9)   0.024    0.059 (6)      0.129 (3)      0.269 (6)      0.543 (6)
1.0      0.15 (1)    0.048    0.0002 (1)     0.0042 (8)     0.066 (3)      0.930 (4)
2.0      0.15 (1)    0.052    0.000000 (0)   0.000000 (0)   0.000015 (4)   0.999985 (4)

Table 4. Results for N = 500, averaged over 50 simulations, ε = 0.1, w_true = 1.0. MFs and MFB produced almost identical results

Method   w_init   w_est      RMS     p11        p12        p21        p22
EM       0.1      1.00 (2)   0.017   0.42 (4)   0.08 (3)   0.11 (4)   0.40 (4)
MFs      0.1      0.98 (2)   0.028   0.39 (3)   0.06 (3)   0.37 (7)   0.18 (6)
EM       1.0      1.00 (2)   0.018   0.40 (5)   0.11 (3)   0.09 (3)   0.41 (3)
MFs      1.0      1.08 (6)   0.101   0.1 (1)    0.4 (2)    0.1 (1)    0.4 (1)
EM       2.0      1.12 (4)   0.126   0.07 (9)   0.42 (8)   0.006 (5)  0.50 (3)
MFs      2.0      1.1 (1)    0.168   0.2 (1)    0.1 (1)    0.3 (1)    0.39 (9)
EM       10.0     1.50 (2)   0.502   0.000 (0)  0.000 (0)  0.000 (0)  1.000 (0)
MFs      10.0     1.50 (2)   0.502   0.000 (0)  0.000 (0)  0.000 (0)  1.000 (0)

and MFs for the larger values of w_true, again suggesting the MF approximation is reasonable.

The algorithms were then altered to estimate the joint distributions. A typical set of results for w_true = 0.1 is presented in Table 3. Both MFs and MFB performed in an almost identical manner to EM on these data. The results show that, for the smaller initial value of w_init = 0.1, the algorithm assigns the data equally to all four mixtures, contrary to the model from which the data were generated. However, this does not affect the performance unduly since at this smaller value of w_true the mixtures are all close together. For the larger values of w_init, because the initial estimate suggests mixtures that are more separated, the data become assigned to a single mixture component, with a corresponding deterioration in performance.

The results for w_true = 1.0 are presented in Table 4. Here, for the smaller w_init the algorithms are able to estimate the true mixture probabilities correctly, with the EM algorithm outperforming MFs slightly. Again, for larger initial estimates of w the algorithms all perform less well, treating the data as coming from a single distribution.

A common feature of many of the simulations was the difficulty the algorithms had in converging. The MFB algorithm was particularly bad. When the priors were included in the algorithm, MFs also had trouble converging to a solution.


Table 5. Results for N = 500, averaged over 50 simulations. Method was EM; MFs and MFB produced identical results. w_init = 0.1; ε = 0.25 and 0.1


w_true   ε      w_est       RMS      ε     w_est       RMS
0.1      0.25   0.100 (7)   0.0073   0.1   0.100 (7)   0.0072
1.0      0.25   1.00 (1)    0.0102   0.1   1.00 (1)    0.0098
2.0      0.25   2.001 (8)   0.0079   0.1   2.000 (7)   0.0074
5.0      0.25   5.000 (7)   0.0074   0.1   5.000 (7)   0.0069

Table 6. Results for N = 500, averaged over 50 simulations, ε = 0.1, w_true = 1.0. MFs and MFB produced almost identical results

Method   w_init   w_est      RMS     p11        p12        p21        p22
EM       0.1      1.00 (1)   0.012   0.41 (4)   0.10 (2)   0.10 (2)   0.40 (3)
MFs      0.1      1.00 (1)   0.014   0.30 (4)   0.20 (2)   0.19 (2)   0.32 (3)
EM       1.0      1.00 (1)   0.012   0.41 (4)   0.10 (2)   0.10 (2)   0.40 (3)
MFs      1.0      0.99 (1)   0.014   0.30 (4)   0.20 (2)   0.19 (2)   0.32 (3)
EM       2.0      1.00 (1)   0.012   0.41 (4)   0.10 (2)   0.10 (2)   0.40 (3)
MFs      2.0      1.00 (1)   0.014   0.30 (4)   0.20 (2)   0.19 (2)   0.31 (3)
EM       5.0      1.20 (1)   0.196   0.71 (2)   0.00 (0)   0.30 (2)   0.00 (0)
MFs      5.0      1.24 (1)   0.243   1.00 (0)   0.00 (0)   0.00 (0)   0.00 (0)

Example 7.3. This is identical to the previous example, except that the observable consists of a three-component vector x^T = (x_1, x_2, x_3), and the observations given the latent variables are distributed as

\[ X \mid y \sim N(W_1 y_1 + W_2 y_2,\ I). \]

In this case the weights connecting the latent variables to the outputs were still related to a single parameter w: each element of the (3 × 2) matrices W_1 and W_2 was a specified multiple of w.

Initially, data were generated using equal mixing weights. As in the case of a single output, the initial conditions had no effect on the solution to the inference problem. The results for a typical set of data are presented in Table 5. The three algorithms gave almost identical results. (They differ at 10^{-6}.) Unequal mixing weights were considered, as in the previous section; again the results are comparable among all three methods. The RMS errors are comparable to the case with equal mixing weights even for the larger values of w_true.

The model was extended to include estimation of the priors on the mixture distributions and results are presented in Table 6. There is an interesting difference between the EM algorithm and the MF algorithms for small initial w; EM is able


to estimate the priors correctly, but MF is not able to get as close. This is because EM estimates the joint probabilities directly, whereas MF computes them by multiplying the individual expectations.

8. Discussion

The main computational conclusion to be drawn from the examples concerns the effectiveness of the mean-field approximations within the EM algorithm. Of course, the examples we have chosen have been very simple and there is no problem in applying the standard EM algorithm in these cases. The practical advantages of the approximations come to the fore in much more complex problems in which the required latent structure is much less simple, that is, when d^s is large. Under those circumstances, the computational demands of solving the equations that determine the mean-field parameters are outweighed by the difficulties of implementing the E-step in the standard EM algorithm.

One feature of the simulations was that for larger initial parameter values the algorithms had trouble converging to a reasonable solution. This is because for these larger parameter values the 'softmax' function becomes saturated, i.e. gives values close to 0 or 1, and therefore the algorithm may find it harder to explore the whole of the solution space, leading to lack of convergence. This suggests that, in selecting initial estimates for the parameters, it is better to start from smaller values rather than to try to reduce from larger values. Obviously, if the true value is itself large then there may be problems whatever the initial estimates are, unless they are close to the true value.

The general ideas can also be applied beyond the case of latent profile analysis. For instance, Dunmur and Titterington (1998) report and illustrate the corresponding analysis for the latent class model (Hagenaars, 1990). Further developments will cover the possibilities of missing values in x and of a mixture of discrete and continuous variables within x, as well as Bayesian versions of the methodology.

We have concentrated on maximum likelihood estimation. Molenaar and von Eye (1994) look at moment estimation of latent profile models and relate the models to factor analysis models in terms of their second-order moment structure.

Acknowledgements

The work in this paper was carried out with the support of a grant from the UK Engineering and Physical Sciences Research Council. The authors would like to thank the referees for their helpful comments.

Appendix. Calculating the corrections to the mean-field equations

The expectations of the latent variables are given exactly by

\[ E(y_{lk}) = E_{\{y_j:\, j \neq l\}}\bigl\{E_{y_l}(y_{lk} \mid \{y_j:\, j \neq l\})\bigr\}. \]


For the categorical latent variables under study, the expectation over a single latent variable is given by (17), and hence

\[ E(y_{lk}) = E_{\{y_j:\, j \neq l\}}\{\mathrm{softmax}(e_{lk})\}, \]

where e_{lk} is given by (17). This equation can then be subjected to a Taylor expansion about E(e_{lk}) to give

\[ E(y_{lk}) = \mathrm{softmax}\{E(e_{lk})\} + \tfrac{1}{2} \sum_{r,t} E(\Delta e_{lr}\, \Delta e_{lt})\, \frac{\partial^2}{\partial e_{lr}\, \partial e_{lt}}\, \mathrm{softmax}\{E(e_{lk})\} + O(\Delta e^3), \]

where Δe_{lk} = e_{lk} - E(e_{lk}). The naive mean-field approximation is to set m_{lk} = softmax{E(e_{lk})}. Hence, after some algebra,

\[ E(y_{lk}) = m_{lk} + \tfrac{1}{2} \sum_{r,t} E(\Delta e_{lr}\, \Delta e_{lt})\, m_{lk}(\delta_{kr} - m_{lr})(\delta_{kt} - 2 m_{lt}) + O(\Delta e^3), \]

which implies (20).
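In the two-category case the softmax reduces to the logistic function, and (20) can be checked directly from its derivatives; the following short derivation is offered as a consistency check for that special case, not as part of the appendix argument above.

```latex
% Two-category case: the softmax reduces to the logistic function
% \sigma(\varepsilon) = (1 + e^{-\varepsilon})^{-1}, and the naive mean field is m = \sigma\{E(\varepsilon)\}.
\[
  \frac{d\sigma}{d\varepsilon} = \sigma(1 - \sigma), \qquad
  \frac{d^{2}\sigma}{d\varepsilon^{2}} = \sigma(1 - \sigma)(1 - 2\sigma),
\]
% so the second-order Taylor term gives
\[
  E(y) \approx \sigma\{E(\varepsilon)\}
        + \tfrac{1}{2}\,\mathrm{var}(\varepsilon)\,
          \sigma(1 - \sigma)(1 - 2\sigma)\Big|_{\sigma = m}
      = m + \tfrac{1}{2}\,(\Delta e^{2})\, m(1 - m)(1 - 2m),
\]
% which is the correction (20).
```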

References

Archer, G.E.B., Titterington, D.M., 1997. Parameter estimation for hidden Markov chains. Statist. Comput. (invited revision), submitted.

Bartholomew, D.J., 1984. The foundations of factor analysis. Biometrika 71, 221-232.

Besag, J., 1975. Statistical analysis of non-lattice data. The Statistician 24, 179-195.

Bishop, Y.M.M., Fienberg, S.E., Holland, P.W., 1975. Discrete Multivariate Analysis. MIT Press, Cambridge, MA.

Cheng, B., Titterington, D.M., 1994. Neural networks: a review from a statistical perspective (with discussion). Statist. Sci. 9, 2-54.

Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39, 1-38.

Dunmur, A.P., Titterington, D.M., 1997. On a modification to the mean-field EM algorithm in factorial learning. In: Mozer, M.C., Jordan, M.I., Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9. MIT Press, Cambridge, MA, pp. 431-437.

Dunmur, A.P., Titterington, D.M., 1998. Analysis of latent structure models with multidimensional latent variables. In: Kay, J.W., Titterington, D.M. (Eds.), Statistics and Neural Networks: Recent Advances at the Interface. Oxford University Press, Oxford, to appear.

Ghahramani, Z., 1995. Factorial learning and the EM algorithm. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (Eds.), Advances in Neural Information Processing Systems 7. MIT Press, Cambridge, MA.

Ghahramani, Z., Jordan, M.I., 1996. Factorial hidden Markov models. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, pp. 472-478.

Gibson, W.A., 1959. Three multivariate models: factor analysis, latent structure analysis and latent profile analysis. Psychometrika 24, 229-252.

Hagenaars, J.A., 1990. Categorical Longitudinal Data. Sage, London.

Hinton, G.E., Zemel, R.S., 1994. Autoencoders, minimum description length and Helmholtz free energy. In: Cowan, J.D., Tesauro, G., Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. Morgan Kaufmann, San Mateo, CA, pp. 3-10.

Molenaar, P.C.M., von Eye, A., 1994. On the arbitrary nature of latent variables. In: von Eye, A., Clogg, C.C. (Eds.), Latent Variables Analysis: Applications for Developmental Research. Sage, Thousand Oaks, CA, pp. 225-242.

Wu, C.F.J., 1983. On the convergence properties of the EM algorithm. Ann. Statist. 11, 95-103.

Zhang, J., 1992. The mean-field theory in EM procedures for Markov random fields. IEEE Trans. Signal Process. 40, 2570-2583.

Zhang, J., 1993. The mean-field theory in EM procedures for blind Markov random field image restoration. IEEE Trans. Image Process. 2, 27-40.