
The Canadian Journal of Statistics / La Revue Canadienne de Statistique, Vol. 20, No. 4, 1992, Pages 451-461

A Bayesian method for classification and discrimination*

Michael LAVINE and Mike WEST

Institute of Statistics and Decision Sciences, Duke University

Key words and phrases: Bayesian computations, classification, discrimination, mixtures of normals, posterior sampling

AMS 1985 subject classifications: Primary 62H30; secondary 62F15, 65U05.

ABSTRACT

We discuss Bayesian analyses of traditional normal-mixture models for classification and discrimination. The development involves application of an iterative resampling approach to Monte Carlo inference, commonly called Gibbs sampling, and demonstrates routine application. We stress the benefits of exact analyses over traditional classification and discrimination techniques, including the ease with which such analyses may be performed in a quite general setting, with possibly several normal-mixture components having different covariance matrices, the computation of exact posterior classification probabilities for observed data and for future cases to be classified, and posterior distributions for these probabilities that allow for assessment of second-level uncertainties in classification.

RÉSUMÉ

The authors consider Bayesian analysis of traditional classification and discrimination models involving mixtures of normal distributions. Their work focuses in particular on the application of an iterative resampling procedure to a Monte Carlo inferential method, commonly called Gibbs sampling. They point out the advantages of exact analyses over traditional classification and discrimination techniques. Among other things, they illustrate how easily such analyses can be carried out, even when the covariance matrices of the normal components of the mixture differ. They also show how to compute exactly the posterior classification probabilities of past and future observations, together with the posterior distributions associated with these probabilities, which allows the second-order uncertainty attached to the classification process to be assessed.

1. INTRODUCTION

Binder (1978) describes a general class of normal-mixture models, and discusses some ingredients of Bayesian approaches to classification, clustering, and discrimination using such models. Hartigan (1975, Chapter 5), McLachlan and Basford (1988), and Titterington, Smith, and Makov (1985) discuss approaches to inference, Bayesian and non-Bayesian, in similar models. These analyses, and most others to date, depend on various forms of analytic or numerical approximation to inference, due to the mathematical difficulties induced by the complexity of likelihood functions for model parameters. Recent developments in Monte Carlo analysis using iterative resampling schemes, as in Gelfand and Smith (1990), for example, now provide for the relevant calculations to be performed. This is demonstrated and illustrated here.

*Both authors were supported in part by NSF grant DMS-8903842. The second author was also supported by NSF grant DMS-9024793.

2. NORMAL-MIXTURE MODELS

Suppose data y_j (j = 1, 2, ...; y_j ∈ R^p) are modelled as exchangeably distributed according to a discrete mixture of a known number k of multivariate normal components. Specifically, (y_j | π) ~ N(μ_i; Σ_i) with probability θ_i, for each i = 1, ..., k, for some mean vectors μ = (μ_i; i = 1, ..., k), variance matrices Σ = (Σ_i; i = 1, ..., k), and classification probabilities in the k-vector θ = (θ_1, ..., θ_k); here π represents all parameters, π = (μ, Σ, θ). Introduce classification variables z_j, where z_j = i implies that y_j is drawn from component i of the mixture, or classified into group i. Thus knowledge of z_j revises p(y_j | π) to the single normal component (y_j | z_j = i, π) ~ N(μ_i; Σ_i). Additionally, the (z_j | θ) are conditionally independent with P(z_j = i | θ) = θ_i. We are concerned with problems of inference about the model parameters π, the classification quantities z_j, and the classification of future cases. Inferences will be based on observing a sample of the y_j with, typically, only a fraction of the corresponding classification quantities z_j observed. Here we specify a class of prior distributions and detail some structure of the resulting posteriors.
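Before turning to priors, a small simulation sketch may help fix the notation; it is written in Python purely for illustration, and the parameter values below are arbitrary rather than taken from the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    # Illustrative k = 2, p = 2 mixture: arbitrary weights, means, and variance matrices.
    theta = np.array([0.4, 0.6])
    mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    Sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

    def simulate(n):
        """Draw n cases: z_j = i with probability theta_i, then y_j | z_j = i ~ N(mu_i; Sigma_i)."""
        z = rng.choice(len(theta), size=n, p=theta)
        y = np.array([rng.multivariate_normal(mu[i], Sigma[i]) for i in z])
        return y, z

    y, z = simulate(200)
    print(y.shape, np.bincount(z))   # (200, 2) and the group counts

In an application only y, and perhaps some of the z_j, would be observed; the rest of this section treats (μ, Σ, θ) and the unobserved z_j as unknowns.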

We generalize Binder (1978, Section 3.3) in defining a conditionally conjugate prior for π. We assume the (μ_i, Σ_i) to be mutually independent over groups i = 1, ..., k, with normal-inverse Wishart priors. The notation and structure for such priors follow West and Harrison (1989, Section 15.4.3) and are briefly detailed in an appendix here. We assume (μ_i | Σ_i) ~ N(m_{i,0}; Σ_i/h_{i,0}), for some means m_{i,0} and precision parameters h_{i,0} > 0, and take the margin for Σ_i as the inverse Wishart distribution with v_{i,0} > 0 degrees of freedom and scale matrix V_{i,0}, denoted by Σ_i ~ W^{-1}(v_{i,0}, V_{i,0}), as in the appendix. Finally, we assume θ to be independent of (μ, Σ), with a Dirichlet prior, θ ~ D(α_0), where α_0 = (α_{1,0}, ..., α_{k,0}); the prior mean vector is E(θ) = α_0/A_0, where A_0 = α_{1,0} + ... + α_{k,0}.

Consider now a set of observations y = (y_j; j = 1, ..., n) for some integer n, writing z = (z_j; j = 1, ..., n). Under the specified model, the joint distribution of (y, z, π) has the following component conditional distributions.

(1) p(μ, Σ | y, z, θ). Fixing z implies the data are classified as k independent normal samples, and the analysis is standard, as in DeGroot (1970, Section 9.10). Prior independence leads to posterior independence of the (μ_i, Σ_i) over groups i, with normal-inverse Wishart posteriors defined as follows. Let G_i = {j : z_j = i}, the index set for observations in group i, and g_i = #G_i, so that n = g_1 + ... + g_k. For each group, the sufficient statistics are the mean vectors ȳ_i = g_i^{-1} Σ_{j∈G_i} y_j and the matrices of sums of squares and cross-products S_i = Σ_{j∈G_i} (y_j - ȳ_i)(y_j - ȳ_i)^T. Then p(μ_i, Σ_i | y, z, θ) = p(μ_i, Σ_i | y, z) has components (μ_i | Σ_i, y, z) ~ N(m_i; Σ_i/h_i) and (Σ_i | y, z) ~ W^{-1}(v_i, V_i) with

h_i = h_{i,0} + g_i,   m_i = (h_{i,0} m_{i,0} + g_i ȳ_i)/h_i,   v_i = v_{i,0} + g_i,   V_i = V_{i,0} + S_i + (ȳ_i - m_{i,0})(ȳ_i - m_{i,0})^T g_i h_{i,0}/h_i.

We note a minor modification of these results to apply when the reference prior p(μ_i, Σ_i) ∝ |Σ_i|^{-(p+1)/2} is used in place of the normal-inverse Wishart priors above. Then, assuming g_i ≥ p + 1, the posteriors p(μ_i, Σ_i | y, z) are as above, but now with h_i = g_i, m_i = ȳ_i, v_i = g_i - p, and V_i = S_i.
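The updates in (1) are immediate to code. The sketch below (Python, with illustrative names and prior values only) computes (h_i, m_i, v_i, V_i) for a single group from the observations currently classified into it, assuming g_i ≥ 1.

    import numpy as np

    def niw_update(y_i, h0, m0, v0, V0):
        """Normal-inverse Wishart update for one group: y_i is the g_i x p array of
        observations classified into the group; (h0, m0, v0, V0) are the prior values."""
        g = y_i.shape[0]                      # assumes g >= 1
        ybar = y_i.mean(axis=0)
        resid = y_i - ybar
        S = resid.T @ resid                   # sums of squares and cross-products
        h = h0 + g
        m = (h0 * m0 + g * ybar) / h
        v = v0 + g
        V = V0 + S + np.outer(ybar - m0, ybar - m0) * g * h0 / h
        return h, m, v, V

    # Example with an arbitrary prior and simulated p = 2 data.
    rng = np.random.default_rng(0)
    print(niw_update(rng.normal(size=(10, 2)), h0=1.0, m0=np.zeros(2), v0=3.0, V0=np.eye(2)))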

(2) p(θ | y, z, μ, Σ). Given z, θ is conditionally independent of (y, μ, Σ), and has the Dirichlet posterior (θ | z) ~ D(α), where α = (α_1, ..., α_k) with α_i = α_{i,0} + g_i; the posterior means are E(θ_i | z) = α_i/A, where A = α_1 + ... + α_k.


(3) p(z | y, π). Given y and π = (μ, Σ, θ), the z_j are conditionally independent. For each j = 1, ..., n,

P(z_j = i | y, π) ∝ θ_i p(y_j | μ_i, Σ_i, z_j = i)   (i = 1, ..., k),

normalized to sum to unity over i = 1, ..., k. Here p(y_j | μ_i, Σ_i, z_j = i) is just the normal density function for group i, with mean vector μ_i and variance matrix Σ_i, evaluated at the point y_j.
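Points (2) and (3) are equally direct. The following sketch (again Python, with illustrative names) draws θ from its conditional Dirichlet posterior and then redraws each z_j from the discrete distribution just described; it assumes every entry of α_0 plus the corresponding group count is positive.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(2)

    def draw_theta(alpha0, z, k):
        """theta | z ~ Dirichlet(alpha0 + group counts), as in point (2)."""
        counts = np.bincount(z, minlength=k)
        return rng.dirichlet(alpha0 + counts)

    def draw_z(y, theta, mu, Sigma):
        """Independent draws of z_j with P(z_j = i | y, pi) proportional to
        theta_i times the N(mu_i; Sigma_i) density at y_j, as in point (3)."""
        k = len(theta)
        dens = np.column_stack([multivariate_normal.pdf(y, mean=mu[i], cov=Sigma[i])
                                for i in range(k)])
        w = dens * theta
        w /= w.sum(axis=1, keepdims=True)    # normalize to sum to unity over i
        return np.array([rng.choice(k, p=w_j) for w_j in w])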

(4) As a corollary to (1) and (2), we may easily obtain the marginal posteriors for (μ_i | y, z) and the predictive distributions for new cases drawn from any specified group. Using results and notation from the appendix, the margin for μ_i is (μ_i | y, z) ~ T_{v_i}(m_i; V_i/(h_i v_i)), with density given in Equation (A1) of the appendix. In predicting a future observation drawn from group i, say (y_f | μ_i, Σ_i, z_f = i) ~ N(μ_i; Σ_i), the predictive distribution is (y_f | y, z, z_f = i) ~ T_{v_i}(m_i; Q_i), where z_f is the classification indicator for y_f and Q_i = V_i(1 + h_i)/(h_i v_i); the density function is given in Equation (A2) of the appendix. If z_f is unknown, the unconditional predictive distribution is just the mixture

p(y_f | y, z) = Σ_{i=1}^{k} p(y_f | y, z, z_f = i) α_i/A.

A general framework supposes that we may observe a training sample of some t perfectly classified cases, and a further u unclassified cases. Thus we assume we are to observe data y_(T) = (y_1, ..., y_t) together with classification indicators z_(T) = (z_1, ..., z_t), and then y_(U) = (y_{t+1}, ..., y_{t+u}), and we will process these two datasets sequentially, (y_(T), z_(T)) followed by y_(U). We will then proceed to inference about the model parameters π and the classification quantities z_(U) = (z_j; j = t + 1, ..., t + u), and also to predictive classification of future cases.

Consider first processing the training sample. The prior is conjugate and the analysis is standard, since the data are perfectly classified into normal components, the quantities z_(T) = (z_1, ..., z_t) being known. The components of analysis are just as described under (1) and (2) above, with n = t classified observations. Following the analysis, we are left with independent normal-inverse Wishart posteriors p(μ_i, Σ_i | y_(T), z_(T)), and the Dirichlet posterior p(θ | z_(T)). The structure of the joint posterior for π = (μ, Σ, θ) given (y_(T), z_(T)) is just that of the prior, with the defining parameters appropriately updated.

Consider now the unclassified sample y_(U). We know that (y_(U), z_(U)) is conditionally independent of (y_(T), z_(T)) given the parameters π, and so points (1)-(3) above apply to determine various components of the posterior p(μ, Σ, θ, z_(U) | y_(T), z_(T), y_(U)). This involves simply replacing the prior for π throughout by the similarly structured distribution p(π | y_(T), z_(T)), just obtained, that summarizes the revised state of information about the parameters based on the training sample. Now marginal posteriors for (μ, Σ), for example, are difficult to compute, since z_(U) is uncertain. It is at this point that Monte Carlo analysis using iterative resampling from the conditional posteriors defined under points (1)-(3) is useful.

3. SAMPLING THE POSTERIOR

We now identify the posterior p(π | y_(T), z_(T)) as the prior in points (1)-(2) above, and the unclassified sample (y_(U), z_(U)) as the data (y, z) to be processed, with the sample size u replacing n. For notational convenience, write D for the known data information, D = (y_(T), z_(T), y_(U)). Note that the following sampling exercise is computationally straightforward:

(a) Given z_(U), we may draw a sample from the posterior p(μ, Σ | D, z_(U)) = Π_{i=1}^{k} p(μ_i, Σ_i | D, z_(U)), the product over i = 1, ..., k of independent normal-inverse Wishart


components defined as in (1). Convenient and efficient algorithms for simulating inverse Wishart distributions are given in Anderson (1984, pp. 247, 254-255).

(b) Also given z_(U), it is trivial to sample θ from the conditional Dirichlet posterior p(θ | D, z_(U)) defined as in (2).

(c) Given π = (μ, Σ, θ), it is similarly trivial to sample from the posterior p(z_(U) | D, π) defined as in (3).

Based on these observations, an iterative resampling technique, as in Gelfand and Smith (1990), for example, provides for an approximate draw from the joint posterior p(π, z_(U) | D) to be obtained as follows. Start with an assigned value for the initial classification vector z_(U). Proceed through (a) and (b) to sample π from the conditional posterior based on this value of z_(U). At (c), use this sampled value of π to determine p(z_(U) | D, π) and sample from this distribution to get a new value for z_(U). Return to (a) and repeat, iterating through this cycle repeatedly to update the values of π and z_(U). With sufficient iteration, this process leads to "final" values (π, z_(U)) that form an approximate draw from the joint posterior p(π, z_(U) | D). Replicating the process provides for an approximate random sample to be drawn from the posterior, forming the basis of a Monte Carlo analysis.
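The paper's own implementation used C and FORTRAN routines (Section 4); purely as an illustration, the following Python sketch strings together one such resampling cycle under the assumptions above, reusing the niw_update, draw_theta, and draw_z sketches given earlier and using scipy's inverse Wishart sampler in place of the Anderson (1984) algorithms. Note that scipy parametrizes the inverse Wishart by degrees of freedom v + p - 1 and scale V, so that its mean matches V/(v - 2); the function and argument names are illustrative only.

    import numpy as np
    from scipy.stats import invwishart

    rng = np.random.default_rng(3)

    def gibbs_cycle(y_u, z_u, prior, n_iter=50):
        """Iterate steps (a)-(c) n_iter times.  prior holds, for each group i, the
        normal-inverse Wishart parameters summarizing p(mu_i, Sigma_i | y_T, z_T)
        (prior["niw"][i] = (h, m, v, V)) and the Dirichlet parameters prior["alpha"]."""
        k = len(prior["alpha"])
        p = y_u.shape[1]
        for _ in range(n_iter):
            mu, Sigma = [], []
            for i in range(k):
                y_i = y_u[z_u == i]
                # Empty groups keep their training-sample posterior unchanged.
                h, m, v, V = niw_update(y_i, *prior["niw"][i]) if len(y_i) else prior["niw"][i]
                Sig_i = invwishart.rvs(df=v + p - 1, scale=V)   # (a): Sigma_i | D, z_u
                mu_i = rng.multivariate_normal(m, Sig_i / h)    # (a): mu_i | Sigma_i, D, z_u
                mu.append(mu_i)
                Sigma.append(Sig_i)
            theta = draw_theta(prior["alpha"], z_u, k)          # (b): theta | D, z_u
            z_u = draw_z(y_u, theta, mu, Sigma)                 # (c): z_u | D, pi
        return mu, Sigma, theta, z_u

Replicating gibbs_cycle N times, each run started from the initial classification described next, yields the approximate posterior sample (π^(r), z_(U)^(r); r = 1, ..., N) used below.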

A suitable starting value for the vector z_(U) is given by initially classifying the data y_(U) into groups i = 1, ..., k according to their individual preposterior classification probabilities P(z_j = i | y_(T), z_(T), y_j), for each j = t + 1, ..., t + u. These are easily computed via P(z_j = i | y_(T), z_(T), y_j) ∝ P(z_j = i | y_(T), z_(T)) p(y_j | y_(T), z_(T), z_j = i). The first term here is just E(θ_i | y_(T), z_(T)), the ith element of the mean vector of the Dirichlet posterior p(θ | y_(T), z_(T)), from point (2) above. The second term is just the value at y_j of the density of the multivariate T-distribution for predicting new cases in group i, given under point (4) above. A referee suggests an alternative for determining starting values: select z_{t+1} as before, and for j = t + 2, ..., t + u select z_j based on the probabilities P(z_j = i | y_(T), z_(T), y_{t+1}, ..., y_{j-1}, z_{t+1}, ..., z_{j-1}, y_j). We don't know which alternative is better.
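A minimal sketch of the first of these initializations, assigning each unclassified case to its modal preposterior group; it assumes the training-sample posterior is stored as Dirichlet parameters post["alpha"] and per-group normal-inverse Wishart parameters post["niw"], and uses scipy's multivariate t density for the predictive terms (illustrative names throughout).

    import numpy as np
    from scipy.stats import multivariate_t

    def initial_z(y_u, post):
        """Classify each row of y_u by P(z_j = i | y_T, z_T, y_j), proportional to
        E(theta_i | y_T, z_T) times the predictive T-density of point (4)."""
        alpha = np.asarray(post["alpha"])
        dens = []
        for (h, m, v, V) in post["niw"]:          # posterior (h_i, m_i, v_i, V_i) per group
            Q = V * (1.0 + h) / (h * v)           # predictive scale matrix, point (4)
            dens.append(multivariate_t.pdf(y_u, loc=m, shape=Q, df=v))
        probs = np.column_stack(dens) * (alpha / alpha.sum())
        return probs.argmax(axis=1)               # modal preposterior classification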

Suppose this procedure is followed to produce a sample of size N from the posterior, denoted by (π^(r), z_(U)^(r); r = 1, ..., N), say. Monte Carlo inference about the elements of π and z_(U) may be based directly on the sampled values, or more efficiently on refined approximations to the marginal posteriors determined as follows. Simply recall that, were z_(U) known, inference about elements of π would be based on standard normal theory. Also, were π known, then inference about z_(U) would be simple too, based on the conditional probabilities P(z_j = i | y, π) defined in item (3) of Section 2. Then the Monte Carlo approximations to p(π | D) and p(z_(U) | D) are simply the mixtures

p(π | D) ≈ N^{-1} Σ_{r=1}^{N} p(π | D, z_(U)^(r)),
p(z_(U) | D) ≈ N^{-1} Σ_{r=1}^{N} p(z_(U) | D, π^(r)).        (1)

(i) The first equation in (1) has a margin for μ_i (i = 1, ..., k) that is a mixture of conditional T-posteriors, easily evaluated and summarized. Similarly, inference about Σ_i will be based on a mixture of inverse Wisharts.

(ii) Of particular interest in discrimination and classification are the posterior probabilities P(z_j = i | D) for each j = t + 1, ..., t + u. The second equation in (1) directly gives Monte Carlo estimates of these quantities.
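For point (ii), the estimate is just the average over the N draws of the conditional probabilities from item (3) of Section 2. A minimal sketch, assuming the sampled parameter vectors are stored as (mu, Sigma, theta) triples:

    import numpy as np
    from scipy.stats import multivariate_normal

    def classification_probs(y_u, draws):
        """Monte Carlo estimate of P(z_j = i | D): average P(z_j = i | y, pi^(r))
        over the sampled pi^(r) = (mu, Sigma, theta), as in the second equation of (1)."""
        n, k = y_u.shape[0], len(draws[0][2])
        total = np.zeros((n, k))
        for mu, Sigma, theta in draws:
            dens = np.column_stack([multivariate_normal.pdf(y_u, mean=mu[i], cov=Sigma[i])
                                    for i in range(k)])
            w = dens * theta
            total += w / w.sum(axis=1, keepdims=True)
        return total / len(draws)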


(iii) Consider prediction of further observations. Suppose that such an observation y_f is known to come from component i of the mixture; thus, if z_f is the classification indicator for y_f, we require the density function p(y_f | D, z_f = i). Now (1) applies to give a mixture of T-distributions, each component p(y_f | D, z_f = i, z_(U)^(r)) identified as described in point (4) of Section 2.

(iv) If z_f is unknown, the density p(y_f | D) ≈ N^{-1} Σ_{r=1}^{N} p(y_f | D, z_(U)^(r)) forms the basis for prediction of y_f. The mixture components here are easily given by

p(y_f | D, z_(U)) = Σ_{i=1}^{k} P(z_f = i | D, z_(U)) p(y_f | D, z_(U), z_f = i);

the probability forming the first term of the summand here is evaluated as P(z_f = i | D, z_(U)) = E(θ_i | D, z_(U)), and the second term is just the density in (iii).

(v) In attempting to classify y_f when z_f is unknown, we are interested in the posterior probabilities

P(z_f = i | D, y_f) ∝ P(z_f = i | D) p(y_f | D, z_f = i).        (2)

The first term here is simply approximated, using (1), as P(z_f = i | D) = E(θ_i | D) ≈ N^{-1} Σ_{r=1}^{N} E(θ_i | D, z_(U)^(r)), the sum over r = 1, ..., N, of course. The second term is evaluated as in (iii).
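Points (iii)-(v) combine into a short routine for a future case: average the component predictive T-densities and the posterior means of θ over the draws, then normalize as in (2). A minimal sketch, assuming each stored draw also carries the updated per-group (h_i, m_i, v_i, V_i) and Dirichlet parameters implied by its z_(U)^(r):

    import numpy as np
    from scipy.stats import multivariate_t

    def classify_future(y_f, draws):
        """Approximate P(z_f = i | D, y_f), equation (2).  Each element of draws is a
        dict with per-group "niw" = [(h, m, v, V), ...] and Dirichlet parameters "alpha"."""
        k = len(draws[0]["alpha"])
        dens = np.zeros(k)     # accumulates p(y_f | D, z_f = i), point (iii)
        wts = np.zeros(k)      # accumulates E(theta_i | D, z_u^(r)), point (v)
        for d in draws:
            alpha = np.asarray(d["alpha"])
            wts += alpha / alpha.sum()
            for i, (h, m, v, V) in enumerate(d["niw"]):
                Q = V * (1.0 + h) / (h * v)       # predictive scale matrix, point (4)
                dens[i] += multivariate_t.pdf(y_f, loc=m, shape=Q, df=v)
        post = wts * dens      # proportional to P(z_f = i | D) p(y_f | D, z_f = i)
        return post / post.sum()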

In connection with classification and discrimination in points (ii) and (v), note that the classification probabilities are dependent and that the sampling-based calculations allow for assessment of the dependence. Neighbouring points will, with high probability, belong to the same group. To focus discussion, consider the simple example of one-dimensional data coming from a mixture of just two normal distributions with known variances of unity. Suppose also that, based on training data, the posteriors for the group means are μ_i ~ N((-1)^i 2; 1), and the posterior Dirichlet for θ has α = (25, 25), with E(θ_i) = 0.5 for i = 1, 2. Suppose two unclassified cases are observed at zero. Easy calculations show that P(z_j = i | D) = 0.5 for each i and j. However, it can also be shown that P(z_1 = z_2 | D) = 0.7. This feature will arise, quite generally, in considering an observation or a y_f that has neighbouring points that are uncertainly classified, and particularly when those points are influential in updating the posterior distributions of moments of any component normal with which they are identified in conditioning.
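The 0.7 figure can be checked directly, since everything in this toy example integrates in closed form: marginalizing the uncertain group means gives bivariate normal likelihoods for the two cases under "same group" and "different groups", and the Dirichlet supplies the prior pairing probabilities E(θ_i θ_j). A small verification sketch using only the quantities stated above:

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    # mu_i ~ N((-1)^i 2; 1), unit data variance, theta ~ Dirichlet(25, 25),
    # and two unclassified cases observed at zero.
    means = {1: -2.0, 2: 2.0}
    y = np.array([0.0, 0.0])
    a = np.array([25.0, 25.0]); A = a.sum()

    def prior_pair(i, j):
        """P(z_1 = i, z_2 = j) = E[theta_i theta_j] under the Dirichlet."""
        return a[i - 1] * (a[j - 1] + (i == j)) / (A * (A + 1))

    def lik(i, j):
        """p(y_1, y_2 | z_1 = i, z_2 = j): a shared uncertain mean induces covariance 1
        between the two cases; otherwise they are independent, each with variance 2."""
        c = 1.0 if i == j else 0.0
        return mvn.pdf(y, mean=[means[i], means[j]], cov=[[2.0, c], [c, 2.0]])

    joint = {(i, j): prior_pair(i, j) * lik(i, j) for i in (1, 2) for j in (1, 2)}
    p_same = (joint[(1, 1)] + joint[(2, 2)]) / sum(joint.values())
    print(round(p_same, 3))   # approximately 0.70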

4. ILLUSTRATION

The illustration is based on a two-dimensional, three-component version of a waveform recognition problem developed in Breiman, Friedman, Olshen, and Stone (1984, Section 2.6.2). Bivariate observations are generated from a k = 3, equally weighted mixture of nonnormal distributions, as follows. Define 2×2 matrices C_1, C_2, and C_3, written row by row as C_1 = (5 1; 3 5), C_2 = (· 1; 1 5), and C_3 = (5 ·; 3 1), where "·" marks an entry that is not legible in this copy.

An observation from component i of the mixture is generated according to

y_j = (y_{j1}, y_{j2})^T = w_j c_{i1} + (1 - w_j) c_{i2} + (ε_{j1}, ε_{j2})^T,

where c_{i1} and c_{i2} denote the first and second columns of C_i, w_j is uniform over the unit interval, and the ε_{jr} are independent zero-mean normal quantities. The w_j and ε_{jr} are also independent over observations j.
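A sketch of this generator follows. The two illegible matrix entries are set to 1 purely for illustration, and the unstated noise standard deviation is taken as 1; both are assumptions, not values from the paper.

    import numpy as np

    rng = np.random.default_rng(4)

    # Columns of C_1, C_2, C_3; entries flagged "assumed" are illegible in this copy.
    C = [np.array([[5.0, 1.0], [3.0, 5.0]]),   # C_1
         np.array([[1.0, 1.0], [1.0, 5.0]]),   # C_2 (top-left entry assumed)
         np.array([[5.0, 1.0], [3.0, 1.0]])]   # C_3 (top-right entry assumed)

    def waveform(n, noise_sd=1.0):
        """Equally weighted k = 3 mixture: y_j = w_j c_{i1} + (1 - w_j) c_{i2} + eps_j,
        with w_j uniform on (0, 1) and independent zero-mean normal errors."""
        comp = rng.integers(0, 3, size=n)
        w = rng.uniform(size=n)
        y = np.array([w[j] * C[i][:, 0] + (1 - w[j]) * C[i][:, 1]
                      for j, i in enumerate(comp)])
        return y + rng.normal(scale=noise_sd, size=(n, 2)), comp

    y_u, true_comp = waveform(300)   # 300 cases, analogous in size to Figure 1(b)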


(a) (b)

? .. . : :,.

5? 5

4- 4 tr ** I g

3~ - 2:2

11 ?

3 2 3 3 ,

3 3 " :--r." ? 1 2 1 . 31 3

???.? ,?

I~.' !C !. i I !

0 2 4 6 0 2 4 6

yl yl

FIGURE 1: Data yf = (y l, y2): (a) 15 classified cases marked as 1, 2, and 3 according to group, and (b) 300 unclassified cases.

An initial training sample of size t = 15 drawn from this model appears in Figure 1(a); there are just three cases from component 1, five from component 2, and seven from component 3. This forms the basis of initial analysis using independent reference priors for the group moments (μ_i, Σ_i), and a reference prior for θ with Dirichlet parameter α_0 = (0, 0, 0). Predictive distributions based on the training data appear in Figure 2. In these graphs, and in Figure 4, the density contours plotted determine approximate 25, 50, and 75% regions. Figure 2 displays these contours for each of the three bivariate T-densities p(y_f | y_(T), z_(T), z_f = i), and also the unconditional predictive density p(y_f | y_(T), z_(T)), just the mixture of the three T-densities as noted in point (4) of Section 2. The paucity of training data from component 1, in particular, leads to a rather diffuse distribution for that component, with v_1 = g_1 - p = 3 - 2 = 1 degree of freedom; this is reflected in the mixture, which has a mode corresponding to each of components 2 and 3, but not 1. It is also straightforward to compute the posterior classification probabilities for a future case y_f,

P(z_f = i | y_(T), z_(T), y_f) ∝ E(θ_i | y_(T), z_(T)) p(y_f | y_(T), z_(T), z_f = i)   (i = 1, 2, 3).

Figure 3(a) displays a discrimination function based on these probabilities. In the dark region P(z_f = 3 | y_(T), z_(T), y_f) is the largest of the three probabilities; the shaded and white regions correspond to components 2 and 1 respectively. Figure 3(b), (c), and (d) display contours of the classification probabilities. In Figure 3(d), for example, the four regions from dark to white are where P(z_f = 3 | y_(T), z_(T), y_f) > 0.9, 0.5 < P(z_f = 3 | y_(T), z_(T), y_f) < 0.9, 0.05 < P(z_f = 3 | y_(T), z_(T), y_f) < 0.5, and P(z_f = 3 | y_(T), z_(T), y_f) < 0.05, respectively. The probability on component i naturally decreases as we move away from the region of the mode of p(y_f | y_(T), z_(T), z_f = i). One point of interest, however, which is quite general, is that this probability eventually increases again in some directions. For example, P(z_f = 1 | y_(T), z_(T), y_f) decreases as y_f moves away from the mode near (2, 4)^T, but eventually increases again as y_f moves either northwest or southeast. Generally, once we are removed from the central region where the component densities vary, as displayed, the component T-density that is most diffuse in any direction will eventually dominate.

So far the computations are standard, analytically derived. Now consider further, unclassified data. Figure 1(b) displays a further set y_(U) of u = 300 unclassified cases.


FIGURE 2: Contours of densities p(y_f | y_(T), z_(T)) and p(y_f | y_(T), z_(T), z_f = i) for components i = 1, 2, and 3. The contours represent 25%, 50%, and 75% probability regions, the last being the outermost contour in each case. [Four panels: mixture density, component 1, component 2, component 3; axes y1 and y2.]

The iterative resampling analysis of Section 3 is performed to give inferences partially summarized in Figures 4 and 5. The Monte Carlo analysis has sample size N = 500, each sampling exercise based on 50 resampling iterations. Figure 4 provides the predictive densities p(y_f | D, z_f = i) (i = 1, 2, 3) and p(y_f | D), as described under points (iii) and (iv) of Section 3, neatly summarizing the data analysis in predictive terms. Posterior classification probabilities for future cases may be evaluated as in point (v) of Section 3, and again plotted as functions of y_f; the discriminant function and contours of these probabilities appear in Figure 5. The superiority of such plots over the usual linear or quadratic discrimination rules is clear. However, if desired, analogues of such rules may be deduced if a classification loss function is imposed, since Monte Carlo approximations to the posterior expectations required to evaluate expected losses may be easily computed. Further inferences may be easily derived from the results of this


FIGURE 3: Case t = 15: classification summary. (a) Modal probabilities: the black, grey, and white regions are where components 3, 2, and 1 are favoured, respectively. (b), (c), (d) Contours of P(z_f = i | y_(T), z_(T), y_f) for components i = 1, 2, and 3. The black, dark grey, light grey, and white regions are where P(z_f = i | y_(T), z_(T), y_f) > 0.90, 0.50 < P(z_f = i | y_(T), z_(T), y_f) < 0.90, 0.05 < P(z_f = i | y_(T), z_(T), y_f) < 0.50, and P(z_f = i | y_(T), z_(T), y_f) < 0.05, respectively. [Four panels; axes y1 and y2.]

analysis. Posterior inference for the parameters of the component normals is commonly of interest, and the ingredients for such additional computations are available, though they are not pursued further here.

The computations were performed using C and FORTRAN routines running on DECstations under Ultrix 4.0. On a DECstation 2100, the resampling computations reported in this example were timed as follows. For the complete 500 samples, with 50 iterations each, the complete analysis of 300 unclassified cases to produce all the required outputs took around 75 minutes of CPU time. This code was not optimized in any way. The time will increase roughly in proportion to the number of unclassified cases. Of course, additional effort is needed to produce contour plots of predictive densities and classification


FIGURE 4: Case t = 15 and u = 300. Contours of the predictive densities p(y_f | D) and p(y_f | D, z_f = i) for components i = 1, 2, and 3. The contours represent 25%, 50%, and 75% probability regions, the last being the outermost contour in each case. [Four panels: mixture density, component 1, component 2, component 3; axes y1 and y2.]

probabilities, and for further posterior analysis.

APPENDIX

The notation and density functions for the normal-inverse Wishart distribution are as follows. For any p-vector μ and p×p variance matrix Σ we have (μ | Σ) ~ N(m; Σ/h) for some precision parameter h > 0. Also, Σ ~ W^{-1}(v, V), the inverse Wishart distribution with density function

p(Σ) = c(p, v) |V|^{(p+v-1)/2} |Σ|^{-(p+v/2)} exp{-0.5 trace(Σ^{-1} V)}

for some constant c(p, v). Here v > 0 is the number of degrees of freedom, and V is a variance matrix such that E(Σ) = V/(v - 2) for v > 2. The marginal multivariate T-distribution for μ has v degrees of freedom and scale matrix M = V/(hv).


FIGURE 5: Case t = 15 and u = 300: classification summary. (a) Modal probabilities: the black, grey, and white regions are where components 3, 2, and 1 are favoured, respectively. (b), (c), (d) Contours of P(z_f = i | y_(T), z_(T), y_(U), y_f) for components i = 1, 2, and 3. The black, dark grey, light grey, and white regions are where P(z_f = i | y_(T), z_(T), y_(U), y_f) > 0.90, 0.50 < P(z_f = i | y_(T), z_(T), y_(U), y_f) < 0.90, 0.05 < P(z_f = i | y_(T), z_(T), y_(U), y_f) < 0.50, and P(z_f = i | y_(T), z_(T), y_(U), y_f) < 0.05, respectively. [Four panels; axes y1 and y2.]

Its density function is

p(μ) = C(p, v) |M|^{-1/2} {1 + (μ - m)^T M^{-1} (μ - m)/v}^{-(p+v)/2},        (A1)

where C(p, v) = Γ((p + v)/2)(vπ)^{-p/2}/Γ(v/2). By way of notation, we write μ ~ T_v(m; M), the dimension p being implicit.

If (y | μ, Σ) ~ N(μ; Σ), then the predictive distribution for y is a similar T-distribution but with increased spread, namely y ~ T_v(m; Q) with Q = V(1 + h)/(hv). The density


function is simply

p(y) = C(p, v) |Q|^{-1/2} {1 + (y - m)^T Q^{-1} (y - m)/v}^{-(p+v)/2}.        (A2)
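For completeness, a note on how (A1), (A2), and W^{-1}(v, V) map onto one common library parametrization (scipy is used here purely as an illustration, not as anything employed in the paper): its multivariate t takes exactly the (m, M, v) of (A1), while its inverse Wishart uses degrees of freedom v + p - 1 and scale V, so that the mean is V/(v - 2) as above.

    import numpy as np
    from scipy.stats import multivariate_t, invwishart

    p, v, h = 2, 5.0, 3.0
    m = np.zeros(p)
    V = np.array([[2.0, 0.3], [0.3, 1.0]])

    M = V / (h * v)                     # scale matrix of the marginal T for mu, (A1)
    Q = V * (1.0 + h) / (h * v)         # scale matrix of the predictive T for y, (A2)

    print(multivariate_t.pdf(m, loc=m, shape=M, df=v))    # (A1) evaluated at mu = m
    print(multivariate_t.pdf(m, loc=m, shape=Q, df=v))    # (A2) evaluated at y = m
    Sigma = invwishart.rvs(df=v + p - 1, scale=V, random_state=0)
    print(Sigma)                        # one draw with E(Sigma) = V/(v - 2)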

REFERENCES

Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis. Second Edition. Wiley, New York.

Binder, D.A. (1978). Bayesian cluster analysis. Biometrika, 65, 31-38.

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Belmont, Calif.

DeGroot, M.H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.

Gelfand, A.E., and Smith, A.F.M. (1990). Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc., 85, 398-409.

Hartigan, J.A. (1975). Clustering Algorithms. Wiley, New York.

McLachlan, G.J., and Basford, K.E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.

Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester.

West, M., and Harrison, P.J. (1989). Bayesian Forecasting and Dynamic Models. Springer-Verlag, New York.

Received 16 April 1991 Revised 25 October 1991 Accepted 19 December 1991

ISDS Duke University

Box 90251 Durham, N.C. 27708-0251

U.S.A.