
Neural Comput & Applic (1999) 8: 235–245. © 1999 Springer-Verlag London Limited

Gaussian Mixture Models and Probabilistic Decision-Based Neural Networks for Pattern Classification: A Comparative Study*

K.K. Yiu, M.W. Mak and C.K. Li
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong

Probabilistic Decision-Based Neural Networks (PDBNNs) can be considered as a special form of Gaussian Mixture Models (GMMs) with trainable decision thresholds. This paper provides detailed illustrations to compare the recognition accuracy and decision boundaries of PDBNNs with those of GMMs through two pattern recognition tasks, namely the noisy XOR problem and the classification of two-dimensional vowel data. The paper highlights the strengths of PDBNNs by demonstrating that their thresholding mechanism is very effective in detecting data not belonging to any known classes. The original PDBNNs use elliptical basis functions with diagonal covariance matrices, which may be inappropriate for modelling feature vectors with correlated components. This paper overcomes this limitation by using full covariance matrices, and shows that the matrices are effective in characterising non-spherical clusters.

Keywords: EM algorithm; Gaussian mixture models; Pattern classification; Probabilistic decision-based neural networks

1. Introduction

The goal of pattern classification is to partition a feature space into a number of decision regions. One would like

*This project was supported by The Hong Kong Polytechnic University Grant No. G-V557.

Correspondence and offprint requests to: M.W. Mak, Department of Electronic & Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong. E-mail: enmwmak@polyu.edu.hk

to have perfect partitioning so that none of the decisions is wrong. If there are overlaps between classes, it becomes necessary to minimise the probability of misclassification errors or the average cost of errors. One approach to minimising the errors is to apply the Bayes' decision rule [1]. The Bayesian approach, however, requires the class-conditional probability density to be estimated accurately.

In recent years, the application of semi-parametric methods [2,3] to estimating probability density functions has attracted a great deal of attention. For example, Tråvén [3] proposed a method, called Gaussian clustering, to estimate the parameters of a Gaussian mixture distribution. The method uses a stochastic gradient descent procedure to find the maximum likelihood estimates of a finite mixture distribution. Ćwik and Koronacki [4] extended Tråvén's work so that no constraints on the covariance structure of the mixture components were imposed. Due to the capability of Gaussian mixtures to model arbitrary densities, Gaussian Mixture Models (GMMs) have been used in various problem domains, such as pattern classification [5] and cluster analysis [6].

There have been several neural network approaches to statistical pattern classification (for a review, see Bishop [2]). One reason for their popularity is that the outputs of multi-layer neural networks are found to be estimates of the Bayesian a posteriori probabilities [7,8]. Research has also shown that neural networks are closely related to Bayesian classifiers. For example, Specht [9] proposed a Probabilistic Neural Network (PNN) that approaches the Bayes optimal decision surface asymptotically. In Streit and Luginbuhl [10], a four-layer feedforward architecture incorporated with


Gaussian kernels, or Parzen windows, was shown to be able to approximate a Bayesian classifier. In Lin et al. [11], a face recognition system based on Probabilistic Decision-Based Neural Networks (PDBNNs) was proposed. A common property of these neural network models is that they use Gaussian densities as their basis functions.

Although Gaussian mixture models are commonly applied to pattern classification, they have limitations, as patterns not belonging to any known classes will be wrongly classified to one of them. To resolve this problem, Roberts and Tarassenko [12] proposed a robust method for novelty detection. The method uses a Gaussian mixture model together with a decision threshold to reject unknown data. Other proposals include the PDBNNs suggested by Lin et al. [11], where a decision threshold for each class is used to reject patterns not belonging to any known classes.

PDBNNs were used to implement a hierarchical face recognition system in Lin et al. [11]. While excellent performance (97.75% recognition, 2.25% false rejection, 0% misclassification and 0% false acceptance) has been achieved, the characteristics of their decision boundaries have not been studied in detail. This paper highlights the strengths of PDBNNs by means of empirical comparisons between the recognition accuracy and decision boundaries of PDBNNs and GMMs. The original PDBNNs employ elliptical basis functions with diagonal covariance matrices, which may not be appropriate for modelling feature vectors with correlated components. This paper attempts to use PDBNNs with full covariance matrices to resolve this problem.

In the next two sections, the structural properties and learning rules of GMMs and PDBNNs are explained. Experimental procedures and results are provided in Section 4, where two problem sets, namely the noisy XOR problem and the classification of two-dimensional (2D) vowel data, are used to evaluate the performance of PDBNNs and GMMs. Finally, concluding remarks are given in Section 5.

2. Gaussian Mixture Models

Gaussian Mixture Models (GMMs) are one of the semi-parametric techniques for estimating probability density functions (pdfs). The output of a Gaussian mixture model is the weighted sum of $R$ component densities, as shown in Fig. 1. Given a set of $N$ independent and identically distributed patterns $\chi_i = \{\mathbf{x}(t);\ t = 1,2,\ldots,N\}$ associated with class $\omega_i$, we assume that the class likelihood function $p(\mathbf{x}(t)|\omega_i)$

Fig. 1. Architecture of a GMM.

for class $\omega_i$ is a mixture of Gaussian distributions, i.e.

$$
p(\mathbf{x}(t)|\omega_i) = \sum_{r=1}^{R} P(\Theta_{r|i}|\omega_i)\, p(\mathbf{x}(t)|\omega_i, \Theta_{r|i}) \qquad (1)
$$

where $\Theta_{r|i}$ represents the parameters of the $r$th mixture component, $R$ is the total number of mixture components, $p(\mathbf{x}(t)|\omega_i, \Theta_{r|i}) \equiv \mathcal{N}(\boldsymbol{\mu}_{r|i}, \boldsymbol{\Sigma}_{r|i})$ is the probability density function of the $r$th component and $P(\Theta_{r|i}|\omega_i)$ is the prior probability (also called the mixture coefficient) of the $r$th component. Typically, $\mathcal{N}(\boldsymbol{\mu}_{r|i}, \boldsymbol{\Sigma}_{r|i})$ is a Gaussian distribution with mean $\boldsymbol{\mu}_{r|i}$ and covariance $\boldsymbol{\Sigma}_{r|i}$.

The training of GMMs can be formulated as a maximum likelihood problem, where the mean vectors $\{\boldsymbol{\mu}_{r|i}\}$, covariance matrices $\{\boldsymbol{\Sigma}_{r|i}\}$ and mixture coefficients $\{P(\Theta_{r|i}|\omega_i)\}$ are typically estimated by the EM algorithm [13]. More specifically, the parameters of a GMM are estimated iteratively by¹

$$
\boldsymbol{\mu}_{r|i}^{(j+1)} = \frac{\sum_{t=1}^{N} P^{(j)}(\Theta_{r|i}|\mathbf{x}(t))\, \mathbf{x}(t)}{\sum_{t=1}^{N} P^{(j)}(\Theta_{r|i}|\mathbf{x}(t))},
$$

$$
\boldsymbol{\Sigma}_{r|i}^{(j+1)} = \frac{\sum_{t=1}^{N} P^{(j)}(\Theta_{r|i}|\mathbf{x}(t))\, [\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}][\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}]^{T}}{\sum_{t=1}^{N} P^{(j)}(\Theta_{r|i}|\mathbf{x}(t))},
$$

and

$$
P^{(j+1)}(\Theta_{r|i}) = \frac{\sum_{t=1}^{N} P^{(j)}(\Theta_{r|i}|\mathbf{x}(t))}{N} \qquad (2)
$$

where $j$ denotes the iteration index and $P^{(j)}(\Theta_{r|i}|\mathbf{x}(t))$ is the posterior probability for the $r$th mixture component ($r = 1,\ldots,R$). The latter can be obtained by Bayes' theorem, yielding

$$
P^{(j)}(\Theta_{r|i}|\mathbf{x}(t)) = \frac{P^{(j)}(\Theta_{r|i})\, p^{(j)}(\mathbf{x}(t)|\Theta_{r|i})}{\sum_{k} P^{(j)}(\Theta_{k|i})\, p^{(j)}(\mathbf{x}(t)|\Theta_{k|i})} \qquad (3)
$$

¹ To simplify the notation, we have dropped $\omega_i$ in Eqs (2) to (4).


where

$$
p^{(j)}(\mathbf{x}(t)|\Theta_{r|i}) = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}_{r|i}^{(j)}|^{1/2}} \exp\!\left\{ -\tfrac{1}{2}\, (\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)})^{T} (\boldsymbol{\Sigma}_{r|i}^{(j)})^{-1} (\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}) \right\} \qquad (4)
$$

and $D$ is the dimensionality of $\mathbf{x}(t)$.
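
To make the E- and M-steps above concrete, the following is a minimal NumPy sketch of the EM updates in Eqs (2)–(4) for the GMM of a single class with full covariance matrices. It is an illustrative implementation under our own naming, not the authors' code; the crude initialisation and the small regularisation term added to the covariances are our assumptions.

```python
import numpy as np

def em_gmm(X, R, n_iter=100, seed=0):
    """Minimal EM for a single-class GMM with full covariance matrices (Eqs 2-4).

    X : (N, D) array of training patterns x(t) for one class.
    R : number of mixture components.
    Returns the mixture coefficients, mean vectors and covariance matrices.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Crude initialisation (our assumption): random patterns as centres,
    # the overall sample covariance for every component, equal priors.
    mu = X[rng.choice(N, R, replace=False)].copy()            # (R, D)
    Sigma = np.tile(np.cov(X.T) + 1e-6 * np.eye(D), (R, 1, 1))
    prior = np.full(R, 1.0 / R)

    for _ in range(n_iter):
        # E-step: posterior P(Theta_r | x(t)) via Bayes' theorem (Eq. 3),
        # using the Gaussian density of Eq. (4).
        resp = np.empty((N, R))
        for r in range(R):
            diff = X - mu[r]
            inv = np.linalg.inv(Sigma[r])
            mahal = np.einsum('nd,de,ne->n', diff, inv, diff)
            norm = np.sqrt(((2 * np.pi) ** D) * np.linalg.det(Sigma[r]))
            resp[:, r] = prior[r] * np.exp(-0.5 * mahal) / norm
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step (Eq. 2): covariances use the current means, as in the paper,
        # then the means and priors are re-estimated.
        Nr = resp.sum(axis=0)
        for r in range(R):
            diff = X - mu[r]
            Sigma[r] = (resp[:, r, None] * diff).T @ diff / Nr[r] + 1e-6 * np.eye(D)
        mu = (resp.T @ X) / Nr[:, None]
        prior = Nr / N

    return prior, mu, Sigma
```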

3. Probabilistic Decision-Based Neural Networks

3.1. Decision-Based Neural Networks

Decision-Based Neural Networks (DBNNs) were originally proposed by Kung and Taur [14] for robust pattern classification. One unique feature of DBNNs is that they adopt a modular network structure. In other words, a DBNN is composed of a number of small sub-networks, with each sub-network representing one class. Learning in DBNNs is based on a decision-based learning rule, where the teacher only tells the correctness of the classification for each training pattern. The weights of the network will be updated whenever misclassification occurs. Reinforced learning is applied to the subnet corresponding to the correct class so that the weight vector is updated in the direction of the gradient of the discriminant function, whereas anti-reinforced learning is applied to the (unduly) winning subnet to move the weight vector along the opposite direction. This has the effect of increasing the chance of classifying the same pattern correctly in the future.

3.2. Probabilistic Decision-Based Neural Networks

PDBNNs [11] are a probabilistic variant of their predecessor, DBNNs [15]. Like DBNNs, PDBNNs employ a modular network structure, as shown in Fig. 2. However, unlike DBNNs, they follow a probabilistic constraint. The subnet discriminant functions of a PDBNN are designed to model some log-likelihood functions of the form

$$
f(\mathbf{x}(t), \mathbf{w}_i) = \log p(\mathbf{x}(t)|\omega_i) = \log\!\left[ \sum_{r=1}^{R} P(\Theta_{r|i}|\omega_i)\, p(\mathbf{x}(t)|\omega_i, \Theta_{r|i}) \right] \qquad (5)
$$

where $\mathbf{w}_i \equiv \{\boldsymbol{\mu}_{r|i}, \boldsymbol{\Sigma}_{r|i}, P(\Theta_{r|i}|\omega_i), T_i\}$ and $T_i$ is the decision threshold of the subnet.
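
As a quick illustration of Eq. (5), a subnet's discriminant can be evaluated with a log-sum-exp over its components for numerical stability. This is a sketch with our own function and variable names, assuming the subnet parameters have already been estimated:

```python
import numpy as np

def subnet_discriminant(x, prior, mu, Sigma):
    """Log-likelihood discriminant f(x, w_i) of one PDBNN subnet (Eq. 5).

    prior : (R,) mixture coefficients P(Theta_{r|i} | omega_i)
    mu    : (R, D) component mean vectors
    Sigma : (R, D, D) full covariance matrices
    """
    R, D = mu.shape
    log_terms = np.empty(R)
    for r in range(R):
        diff = x - mu[r]
        inv = np.linalg.inv(Sigma[r])
        _, logdet = np.linalg.slogdet(Sigma[r])
        # log of the Gaussian component density of Eq. (4)
        log_gauss = -0.5 * (D * np.log(2 * np.pi) + logdet + diff @ inv @ diff)
        log_terms[r] = np.log(prior[r] + 1e-300) + log_gauss
    # log-sum-exp of the weighted component densities
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())
```

The resulting score is compared against the subnet's threshold $T_i$ and against the other subnets' scores when a classification decision is made (see Section 4).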

Learning in PDBNNs is divided into two phases: Locally Unsupervised (LU) and Globally Supervised (GS). In the LU learning phase, PDBNNs adopt

Fig. 2. Structure of a PDBNN. Each class is modelled by a subnet. The subnet discriminant functions are designed to model the log-likelihood functions, as given by Eq. (5).

the expectation-maximisation (EM) algorithm [16] to maximise the likelihood function

$$
l(\mathbf{w}_i; \chi_i) = \sum_{t=1}^{N} \log p(\mathbf{x}(t)|\omega_i) = \sum_{t=1}^{N} \log\!\left[ \sum_{r=1}^{R} P(\Theta_{r|i}|\omega_i)\, p(\mathbf{x}(t)|\omega_i, \Theta_{r|i}) \right] \qquad (6)
$$

with respect to the parameters $\boldsymbol{\mu}_{r|i}$, $\boldsymbol{\Sigma}_{r|i}$ and $P(\Theta_{r|i}|\omega_i)$, where $\chi_i = \{\mathbf{x}(t);\ t = 1,2,\ldots,N\}$ denotes the set of $N$ independent and identically distributed training patterns. The EM algorithm achieves this goal via two steps: expectation (E) and maximisation (M). The former formulates a so-called complete-data likelihood by introducing a set of missing variables, and the latter maximises the function in each iteration. Specifically, at iteration $j$ of the E-step, the posterior probability $P^{(j)}(\Theta_{r|i}|\mathbf{x}(t), \omega_i)$ for cluster $r$ ($r = 1,\ldots,R$) is computed:

$$
P^{(j)}(\Theta_{r|i}|\mathbf{x}(t), \omega_i) = \frac{P^{(j)}(\Theta_{r|i}|\omega_i)\, p^{(j)}(\mathbf{x}(t)|\omega_i, \Theta_{r|i})}{\sum_{k} P^{(j)}(\Theta_{k|i}|\omega_i)\, p^{(j)}(\mathbf{x}(t)|\omega_i, \Theta_{k|i})} \equiv h_{r|i}^{(j)}(t) \qquad (7)
$$

Then, in the M-step, the complete-data likelihood function [11] is maximised, resulting in

$$
P^{(j+1)}(\Theta_{r|i}|\omega_i) = \frac{1}{N} \sum_{t=1}^{N} h_{r|i}^{(j)}(t),
$$

$$
\boldsymbol{\mu}_{r|i}^{(j+1)} = \left( \sum_{t=1}^{N} h_{r|i}^{(j)}(t) \right)^{-1} \sum_{t=1}^{N} h_{r|i}^{(j)}(t)\, \mathbf{x}(t), \quad \text{and}
$$

$$
\boldsymbol{\Sigma}_{r|i}^{(j+1)} = \left( \sum_{t=1}^{N} h_{r|i}^{(j)}(t) \right)^{-1} \sum_{t=1}^{N} h_{r|i}^{(j)}(t)\, [\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}][\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}]^{T}. \qquad (8)
$$

Note that Eq. (8) is the same as Eq. (2), the learning algorithm of GMMs.

In the Globally Supervised (GS) training phase, target values are utilised to fine-tune the decision boundaries. Specifically, when a training pattern is misclassified to the $i$th class, reinforced and/or anti-reinforced learning are applied to update the mean vectors and covariance matrices of subnet $i$. Thus, we have

$$
\boldsymbol{\mu}_{r|i}^{(j+1)} = \boldsymbol{\mu}_{r|i}^{(j)} + \eta_{m} \sum_{t,\, \mathbf{x}(t) \in \mathcal{D}_i^{(2)}} h_{r|i}^{(j)}(t)\, \boldsymbol{\Sigma}_{r|i}^{-1(j)} [\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}] - \eta_{m} \sum_{t,\, \mathbf{x}(t) \in \mathcal{D}_i^{(3)}} h_{r|i}^{(j)}(t)\, \boldsymbol{\Sigma}_{r|i}^{-1(j)} [\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}]
$$

$$
\boldsymbol{\Sigma}_{r|i}^{(j+1)} = \boldsymbol{\Sigma}_{r|i}^{(j)} + \tfrac{1}{2}\eta_{s} \sum_{t,\, \mathbf{x}(t) \in \mathcal{D}_i^{(2)}} h_{r|i}^{(j)}(t) \left( \mathbf{H}_{r|i}^{(j)}(t) - \boldsymbol{\Sigma}_{r|i}^{-1(j)} \right) - \tfrac{1}{2}\eta_{s} \sum_{t,\, \mathbf{x}(t) \in \mathcal{D}_i^{(3)}} h_{r|i}^{(j)}(t) \left( \mathbf{H}_{r|i}^{(j)}(t) - \boldsymbol{\Sigma}_{r|i}^{-1(j)} \right) \qquad (9)
$$

where $\mathbf{H}_{r|i}^{(j)}(t) = \boldsymbol{\Sigma}_{r|i}^{-1(j)} [\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}][\mathbf{x}(t) - \boldsymbol{\mu}_{r|i}^{(j)}]^{T} \boldsymbol{\Sigma}_{r|i}^{-1(j)}$, and $\eta_{m}$ and $\eta_{s}$ are user-assigned (positive) learning rates. The false rejection set $\mathcal{D}_i^{(2)}$ and the false acceptance set $\mathcal{D}_i^{(3)}$ are defined as follows:

• $\mathcal{D}_i^{(2)} = \{\mathbf{x}(t);\ \mathbf{x}(t) \in \omega_i,\ \mathbf{x}(t) \text{ is misclassified to another class } \omega_j\}$.
• $\mathcal{D}_i^{(3)} = \{\mathbf{x}(t);\ \mathbf{x}(t) \notin \omega_i,\ \mathbf{x}(t) \text{ is classified to } \omega_i\}$.

The intermediate parameter $h_{r|i}^{(j)}$ is computed as in Eq. (7). At the end of the $j$th epoch, the conditional prior probability $P^{(j+1)}(\Theta_{r|i}|\omega_i)$ is updated according to

$$
P^{(j+1)}(\Theta_{r|i}|\omega_i) = \frac{1}{N} \sum_{t=1}^{N} h_{r|i}^{(j)}(t) \qquad (10)
$$

An adaptive learning rule is employed to train the threshold $T_i$ of subnet $i$. Specifically, the threshold $T_i$ at iteration $j$ is updated according to

$$
T_i^{(j+1)} =
\begin{cases}
T_i^{(j)} - \eta_t\, l'\!\left(T_i^{(j)} - f(\mathbf{x}(t), \mathbf{w}_i)\right) & \text{if } \mathbf{x}(t) \in \omega_i \ \text{(reinforced learning)} \\[4pt]
T_i^{(j)} + \eta_t\, l'\!\left(f(\mathbf{x}(t), \mathbf{w}_i) - T_i^{(j)}\right) & \text{if } \mathbf{x}(t) \notin \omega_i \ \text{(anti-reinforced learning)}
\end{cases} \qquad (11)
$$

where $\eta_t$ is a positive learning parameter, $l(d) = \frac{1}{1 + e^{-d}}$ is a penalty function, and $l'(d)$ is the derivative of the penalty function.
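
For illustration, the following sketch applies one GS-phase update to a single component and to the subnet threshold, following Eqs (9) and (11). The function names, the batch form of the update and the default learning rates are our assumptions rather than the authors' implementation:

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

def l_prime(d):
    """Derivative of the penalty function l(d) = 1 / (1 + exp(-d))."""
    s = sigmoid(d)
    return s * (1.0 - s)

def gs_update_mean(mu_r, Sigma_r, X_fr, h_fr, X_fa, h_fa, eta_m=0.001):
    """Reinforced/anti-reinforced mean update of Eq. (9) for component r of subnet i.

    X_fr, h_fr : patterns in the false-rejection set D_i^(2) and their posteriors h_{r|i}(t)
    X_fa, h_fa : patterns in the false-acceptance set D_i^(3) and their posteriors
    """
    inv = np.linalg.inv(Sigma_r)
    grad = np.zeros_like(mu_r)
    if len(X_fr):                                   # reinforced term: pull the mean towards missed patterns of its own class
        grad += (h_fr[:, None] * (X_fr - mu_r) @ inv).sum(axis=0)
    if len(X_fa):                                   # anti-reinforced term: push the mean away from falsely accepted patterns
        grad -= (h_fa[:, None] * (X_fa - mu_r) @ inv).sum(axis=0)
    return mu_r + eta_m * grad

def gs_update_threshold(T_i, f_val, in_class, eta_t=0.0001):
    """Adaptive threshold rule of Eq. (11) for one training pattern.

    f_val    : subnet discriminant output f(x(t), w_i)
    in_class : True if x(t) belongs to class omega_i
    """
    if in_class:                                    # reinforced learning: lower the threshold
        return T_i - eta_t * l_prime(T_i - f_val)
    return T_i + eta_t * l_prime(f_val - T_i)       # anti-reinforced learning: raise the threshold
```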

4. Experiments and Results

We have used two problem sets, namely the noisy XOR problem and the classification of two-dimensional (2D) vowel data, to evaluate the performance of PDBNNs and GMMs. The K-means algorithm was used to determine the initial positions of the function centres for both PDBNNs and GMMs. The covariance matrices in the noisy XOR problem were then initialised by the sample covariance [1] in the case of one centre per cluster, or by the K-nearest neighbours algorithm (K = 2) in the case of two and four centres per cluster. For the classification of 2D vowel data, the covariance matrices were initialised by the sample covariance. The EM algorithm was subsequently used to estimate the mean vectors, covariance matrices and prior probabilities. In the Globally Supervised (GS) training phase of the PDBNNs, reinforced and anti-reinforced learning were applied to fine-tune the decision boundaries and the decision thresholds.
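
As a rough sketch of this initialisation procedure, the snippet below runs a few Lloyd iterations of K-means for one class and then uses per-centre sample covariances; the K-nearest-neighbour variant used for two and four centres per cluster is omitted, and all names are ours:

```python
import numpy as np

def init_subnet(X, R, n_iter=20, seed=0):
    """K-means positions for R function centres of one class, plus sample-covariance
    estimates and equal priors (a simplified sketch of the initialisation used here)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    centres = X[rng.choice(N, R, replace=False)].copy()
    for _ in range(n_iter):
        # assign every pattern to its nearest centre, then move each centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for r in range(R):
            if np.any(labels == r):
                centres[r] = X[labels == r].mean(axis=0)
    # sample covariance of each centre's members (fall back to the global one if too few)
    Sigma = np.empty((R, D, D))
    for r in range(R):
        members = X[labels == r]
        cov = np.cov(members.T) if len(members) > D else np.cov(X.T)
        Sigma[r] = cov + 1e-6 * np.eye(D)
    priors = np.full(R, 1.0 / R)
    return centres, Sigma, priors
```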

To avoid over-training during the GS learning phase, the network parameters and the classification accuracy on the training set were recorded after every epoch. When the classification accuracy ceased to increase for $T_m$ epochs, training was terminated and the network parameters producing the maximum classification accuracy were retained. $T_m$ is set such that the maximum classification accuracy can be reliably determined. A large value of $T_m$ can improve the reliability of the maximum classification accuracy, but it could also prolong the training time excessively. A small value of $T_m$ could reduce the computation time, but small fluctuations in classification accuracy during GS training could then cause premature termination of the training. We found empirically that $T_m = 1000$ gives a good compromise between these two constraints. Figure 3 illustrates the variation of the classification accuracy during the GS learning phase in the 2D vowel problem.

Fig. 3. Variation of classification accuracy in the course of globally supervised learning. This curve is based on the 2D vowel problem, with two centres per class.

Page 5: Gaussian Mixture Models and Probabilistic Decision

239Gaussian Mixture Models and Probabilistic NNs

Note that the accuracy continues to increase up to a maximum and then starts to decrease gradually, suggesting that setting $T_m = 1000$ is appropriate.
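
The early-stopping rule described above can be summarised by the following sketch; update_fn and eval_fn are hypothetical hooks standing for one GS epoch and for the training-set accuracy, respectively:

```python
def train_with_early_stopping(update_fn, eval_fn, T_m=1000, max_epochs=100000):
    """Stop GS training once the training accuracy has not improved for T_m epochs,
    and keep the parameters that achieved the best accuracy so far."""
    best_acc, best_params, stall = -1.0, None, 0
    for _ in range(max_epochs):
        params = update_fn()          # one epoch of globally supervised learning (hypothetical hook)
        acc = eval_fn(params)         # classification accuracy on the training set (hypothetical hook)
        if acc > best_acc:
            best_acc, best_params, stall = acc, params, 0
        else:
            stall += 1
            if stall >= T_m:
                break
    return best_params, best_acc
```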

The performance of the GMMs and PDBNNs was then evaluated. The scores of the discriminant functions $f(\mathbf{x}(t), \mathbf{w}_i)$ corresponding to the individual classes were compared, and the sub-network with the maximum score was claimed to be the winner. A recognition decision is regarded as 'correct' when the winner is associated with the correct class and its score is greater than its decision threshold. When the pattern belongs to a known class but the winning sub-network is associated with a different class and its score is larger than its threshold, the pattern is regarded as 'incorrectly recognised'. A false acceptance is encountered when the pattern does not belong to any known class but the winner's score is greater than its threshold. When the scores of all sub-networks are smaller than their corresponding thresholds, the input pattern is regarded as 'unclassifiable'. More formally, a pattern $\vec{x}$ was considered as being Correctly Recognised (CR), Incorrectly Recognised (IR), Unclassifiable (UC) or Falsely Accepted (FA) according to the following criteria:

• Correctly Recognised (CR): $\vec{x} \in C_k$ s.t. $f_k(\vec{x}) > T_k$ and $f_k(\vec{x}) > f_j(\vec{x})\ \forall j \neq k$
• Incorrectly Recognised (IR): $\vec{x} \in C_k$, $\exists j \neq k$ s.t. $f_j(\vec{x}) > T_j$ and $f_j(\vec{x}) > f_k(\vec{x})$
• Unclassifiable (UC): $\vec{x} \in C_k$ but $f_k(\vec{x}) < T_k\ \forall k$
• Falsely Accepted (FA): $\vec{x} \notin$ any known class but $f_k(\vec{x}) > T_k$ and $f_k(\vec{x}) > f_j(\vec{x})\ \forall j \neq k$

In the above criteria, $T_k$ and $f_k(\vec{x})\ (= f(\vec{x}, \mathbf{w}_k))$ denote, respectively, the threshold and the discriminant function output corresponding to class $C_k$.

For the GMMs, classification decisions were based solely on their outputs, with the largest one being selected as the recognised class. This is equivalent to assigning a zero decision threshold to all GMMs.
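
The four categories can be implemented as a small decision routine; the sketch below follows the criteria above, with ties and edge cases not covered by the paper's definitions defaulting to 'UC':

```python
import numpy as np

def classify(scores, thresholds, true_class=None):
    """Assign one pattern to CR, IR, UC or FA from its K subnet scores f_k(x).

    scores     : (K,) discriminant outputs f_k(x)
    thresholds : (K,) decision thresholds T_k
    true_class : index of the correct class, or None for a pattern from an unknown class
    """
    scores = np.asarray(scores, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)
    winner = int(np.argmax(scores))
    if np.all(scores < thresholds):
        return 'UC'                                  # no subnet accepts the pattern
    if scores[winner] > thresholds[winner]:
        if true_class is None:
            return 'FA'                              # unknown pattern accepted by the winning subnet
        return 'CR' if winner == true_class else 'IR'
    return 'UC'
```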

4.1. Noisy XOR Problem

In the neural computing literature, much effort has been spent on solving the exclusive-OR (XOR) problem. An interesting property of the XOR problem is that it is the simplest logic function which is not linearly separable. As a result, single-layer networks are not able to solve this problem. The XOR problem with bipolar inputs has four training patterns: {(−1, −1)^T; 0}, {(−1, 1)^T; 1}, {(1, −1)^T; 1} and {(1, 1)^T; 0}, where T denotes vector transpose. The patterns whose two inputs are identical (e.g. (1, 1)^T) are classified to one class (class 0), while those whose two inputs differ (e.g. (−1, 1)^T) are classified to another class (class 1). This problem can be extended to n dimensions, resulting in an n-bit parity problem.

The noisy exclusive-OR problem can be considered as an extension of the conventional XOR problem. Instead of using four training patterns, the noisy XOR problem investigated in this work utilises 2000 training patterns generated from four Gaussian distributions with the mean vectors and covariance matrices shown in Table 1. Each distribution is responsible for generating 500 random samples. The networks were trained to classify the patterns into two classes. The patterns generated by the distributions with mean vectors (1, 1)^T and (−1, −1)^T were classified to one class (class 0), whereas those generated by the distributions with mean vectors (1, −1)^T and (−1, 1)^T were classified to another class (class 1). Therefore, the noisy XOR problem can be considered as an extension of the XOR problem, with the inputs being corrupted by interdependent noise.

The objective of this experiment is to compare the decision boundaries formed by the PDBNNs and the GMMs through the noisy XOR problem. The training set consists of four Gaussian clusters (two clusters per class). The number of centres per cluster was set to one, two and four. In the experiments, we set the PDBNNs' learning rates as follows: $\eta_m = 0.001$, $\eta_s = 0.00001$ and $\eta_t = 0.0001$. As these parameters interacted with each other nonlinearly, their values were found empirically.

Figures 4 and 5 show the test data, decision boundaries, function centres and contours of constant basis function outputs formed by the PDBNNs and GMMs with various numbers of function centres. Table 2 summarises the performance of the PDBNNs and the GMMs based on 2000 test vectors drawn from the same population as the training set.

Table 1. Mean vectors and covariance matrices of the four clusters in the noisy XOR problem. Each cluster contains 500 samples.

                     Class 1                                 Class 2
  Mean vector        Covariance matrix      Mean vector      Covariance matrix
  (−1, 1)^T          [0.9 0.2; 0.2 0.6]     (−1, −1)^T       [0.5 −0.2; −0.2 1.0]
  (1, −1)^T          [0.7 0.0; 0.0 0.7]     (1, 1)^T         [0.7 0.1; 0.1 0.5]
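
For reference, the training set of Table 1 can be regenerated as follows (a sketch; the random seed and the class labels 1 and 2, taken from the table's column headings, are our choices):

```python
import numpy as np

def make_noisy_xor(n_per_cluster=500, seed=0):
    """Generate the noisy XOR data of Table 1: four Gaussian clusters, two per class."""
    rng = np.random.default_rng(seed)
    clusters = [
        # (mean, covariance, class label) as listed in Table 1
        ([-1.0,  1.0], [[0.9,  0.2], [ 0.2, 0.6]], 1),
        ([ 1.0, -1.0], [[0.7,  0.0], [ 0.0, 0.7]], 1),
        ([-1.0, -1.0], [[0.5, -0.2], [-0.2, 1.0]], 2),
        ([ 1.0,  1.0], [[0.7,  0.1], [ 0.1, 0.5]], 2),
    ]
    X, y = [], []
    for mean, cov, label in clusters:
        X.append(rng.multivariate_normal(mean, cov, size=n_per_cluster))
        y.append(np.full(n_per_cluster, label))
    return np.vstack(X), np.concatenate(y)
```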


Fig. 4. Decision boundaries, function centres and contours of constant basis function outputs (thin ellipses) produced by PDBNNs with (a) 1 centre per cluster, (b) 2 centres per cluster and (c) 4 centres per cluster. Centre 00, centre 01, centre 10 and centre 11 represent the four groups of centres corresponding to the four clusters.

It is evident from Fig. 5 that some of the decision boundaries of the GMMs extend to infinity in the input space. On the other hand, the decision boundaries of the PDBNNs are confined to the regions containing a large amount of test data. This is because the decision boundaries formed by the GMMs depend purely on the outputs of the mixture models, i.e. they are formed by the points in the input space where the two mixture outputs are equal. No matter how far away from the origin, there are input vectors producing equal outputs in the two mixture models. As a result, some of the decision boundaries of the GMMs extend to infinity. The PDBNNs, however, use decision thresholds to reject data that produce low


Fig. 5. Decision boundaries, function centres and contours of constant basis function outputs (thin ellipses) produced by Gaussian mixture models with (a) 1 centre per cluster, (b) 2 centres per cluster and (c) 4 centres per cluster. Centre 00, centre 01, centre 10 and centre 11 represent the four groups of centres corresponding to the four clusters.

outputs in all subnets. Consequently, the PDBNNs' decision boundaries enclose most of the test data. Of particular interest is that the GMMs are only able to classify the data as either Class 1 or Class 2, while the PDBNNs are able to create decision regions where data are classified as neither Class 1 nor Class 2.

Table 2 shows that the recognition accuracy of the PDBNNs is lower than that of the GMMs; however, their chance of incorrectly recognising a pattern is also lower. Basically, a PDBNN can be considered as a GMM with trainable decision thresholds. The thresholds provide a trade-off between the recognition accuracy and the ability to reject data not belonging to any known classes.


Table 2. Performance of PDBNNs and GMMs with 1, 2 and 4 centres per cluster on the noisy XOR problem (CR – Correctly Recognised; IR – Incorrectly Recognised; UC – Unclassifiable).

            Number of centres per cluster (PDBNNs)           Number of centres per cluster (GMMs)
            1               2               4                1               2               4
            Train   Test    Train   Test    Train   Test     Train   Test    Train   Test    Train   Test
  CR %      89.72   89.97   89.89   89.64   88.93   88.64    92.08   92.36   92.34   92.15   92.37   91.85
  IR %       7.73    7.48    7.52    7.43    7.18    7.22     7.92    7.64    7.66    7.85    7.63    8.15
  UC %       2.55    2.55    2.59    2.93    3.89    4.14     N/A     N/A     N/A     N/A     N/A     N/A

PDBNNs are, therefore, appropriate for classification problems where false acceptance must be minimised.

Table 3 compares the prior probabilities (mixture coefficients) of the PDBNNs and GMMs. It shows that the GS learning of PDBNNs is able to set the prior probabilities $P(\Theta_{r|i}|\omega_i)$ of some components to zero. This is equivalent to removing the redundant function centres without affecting the decision boundaries. Evidence can also be found in Figs 4(b) and 4(c), where the shape of the decision boundaries depends mainly upon some of the Gaussian components, suggesting that the other components have an insignificant contribution to the mixture distribution. On the contrary, all mixture coefficients in the GMMs are non-zero, as shown in Table 3. Therefore, we cannot remove any mixture components from the GMMs without affecting the decision boundaries.

4.2. Two-Dimensional Vowel Data

In this experiment, the performance of the PDBNNs and GMMs was compared on a set of two-dimensional (2D) vowel data [17–19]. The data were obtained by computing the first and second formants (F1 and F2) of ten vowels spoken by 67 speakers.

Table 3. The mixture coefficients $P(\Theta_{r|i}|\omega_i)$ of the PDBNNs and GMMs in the noisy XOR problem with eight centres per class.

  Component r    PDBNNs                               GMMs
                 P(Θ_{r|1}|ω_1)   P(Θ_{r|2}|ω_2)      P(Θ_{r|1}|ω_1)   P(Θ_{r|2}|ω_2)
  1              0.3692           0.0000              0.1075           0.2203
  2              0.0000           0.0015              0.0461           0.0797
  3              0.1248           0.0003              0.1625           0.1594
  4              0.0001           0.4945              0.1855           0.0398
  5              0.0000           0.5036              0.0976           0.0919
  6              0.0000           0.0000              0.1163           0.1684
  7              0.0000           0.0000              0.1390           0.0954
  8              0.5058           0.0000              0.1455           0.1452

The resulting feature vectors were divided into a training set with 338 vectors and a test set with 333 vectors. These data sets are particularly suitable for assessing the performance of pattern classifiers, as they contain overlapping and non-spherical clusters.

In the original PDBNNs, the learning rates for optimising the threshold values are identical for both reinforced and anti-reinforced learning. In this experiment, however, the learning rate $\eta_t$ for reinforced learning was set to 0.05, while that for anti-reinforced learning was set to 0.0. As a result, the thresholds were optimised by reinforced learning only. This arrangement was found to yield lower threshold values and higher recognition accuracy. The learning rates $\eta_m$ and $\eta_s$ were set to 0.5 and 150, respectively; they are identical for both reinforced and anti-reinforced learning.

The ten vowel classes in the data set were divided into two subsets: an unknown set and a target set. The vowel /u/ in the word 'who'd' was chosen arbitrarily as the unknown set. The remaining vowels were chosen as the target set.² This arrangement enables us to evaluate the robustness of the PDBNNs and GMMs in detecting data not belonging to any known classes. Each of the nine vowels in the target set was modelled by a GMM and a PDBNN.

Figures 6 and 7 show the training data, decision boundaries, contours of constant basis function outputs and function centres formed by the PDBNNs and GMMs with various numbers of function centres. Table 4 summarises their performance.

It can be seen from Fig. 7 that some of the GMMs' decision regions are unbounded, while those formed by the PDBNNs are bounded. This is because the thresholding mechanism in PDBNNs rejects data in regions where the network outputs are lower than the decision thresholds.

² As a result, 305 out of 338 vectors in the training set were used for training.


Fig. 6. Decision boundaries, function centres and contours of constant basis function outputs (thin ellipses) produced by PDBNNs with (a) 1 centre per class and (b) 2 centres per class.

Fig. 7. Decision boundaries, function centres and contours of constant basis function outputs (thin ellipses) produced by Gaussian mixture models with (a) 1 centre per class and (b) 2 centres per class.

Table 4 shows that the recognition accuracy of the PDBNNs is comparable to that of the GMMs. Although the GMMs are able to classify most of the data, they fail to identify the data in the unknown class (*) and blindly classify them to one of the known classes, resulting in a 100% false acceptance rate. On the other hand, Fig. 6 clearly shows that the PDBNNs are able to create decision regions where unknown data are rejected, resulting in a lower false acceptance rate. Figure 7(b) also shows that there is an artifact in the GMMs' decision boundaries: the triangular region (around coordinate (100, 1000)) is incorrectly classified to the vowel class /ʌ/ (in the word 'hud'), whose function centres are located at (632, 1108) and (830, 1381).


Table 4. Performance of PDBNNs and GMMs with 1 and 2 centres per cluster on the 2D vowel data set (CR – Correctly Recognised; IR – Incorrectly Recognised; UC – Unclassifiable; FA – Falsely Accepted).

            Number of centres per cluster (PDBNNs)       Number of centres per cluster (GMMs)
            1                2                           1                2
            Train    Test    Train    Test               Train    Test    Train    Test
  CR %      75.68    77.97   80.34    78.14              77.77    78.39   79.40    79.02
  IR %      23.42    19.60   19.36    19.63              22.23    21.61   20.60    20.98
  UC %       0.90     2.43    0.30     2.24              N/A      N/A     N/A      N/A
  FA %      66.67    63.64   45.15    54.24              100.00   100.00  100.00   100.00

Karayiannis and Mi [19] used the same set of data to evaluate the performance of what they called Growing Radial Basis Neural Networks (GRBNNs). As they used 10 classes instead of 9, it may be difficult to compare the recognition accuracy obtained in their experiments with that of ours. However, it is interesting to compare the decision boundaries formed by the PDBNNs with those formed by the GRBNNs. We can see from Figs 4 and 5 of Karayiannis and Mi [19] that the vowel /i/ in the word 'heed' is not enclosed by the decision boundaries. This means that the decision region corresponding to /i/ extends to infinity in the F1–F2 space. On the other hand, this situation has been prevented by the thresholding mechanism of PDBNNs, as shown in Fig. 6(b).

5. Conclusion

Based on two pattern recognition tasks, the performance of PDBNNs and GMMs has been compared and their characteristics have been highlighted. The locally unsupervised learning of PDBNNs and GMMs is essentially the same. However, PDBNNs have a globally supervised learning phase through which the decision boundaries and thresholds are adjusted whenever misclassification occurs. Therefore, PDBNNs can be considered as GMMs with trainable decision thresholds. They employ a modular network structure with each class represented by one subnet. The discriminant function of each subnet is compared with those of the others and with its decision threshold for making classification decisions.

In the noisy XOR problem, it was found that the globally supervised learning rule of PDBNNs is able to remove redundant function centres by gradually decreasing the corresponding prior probabilities to zero. This leads to a smaller network. In the case of GMMs, all components contribute to the outputs; therefore, no redundant centres can be removed without affecting the decision boundaries.

With the thresholding mechanism, the performance of PDBNNs can be divided into recognition accuracy, incorrectly recognised rate, unclassifiable rate and false acceptance rate. It was found that large decision thresholds result in low recognition accuracy, and vice versa for small thresholds. The thresholds also produce bounded, localised decision regions, which effectively minimises the chance of falsely accepting unknown patterns. Hence, PDBNNs are suitable for classification applications where minimising the false acceptance rate is an important issue.

References

1. Duda RO, Hart PE. Pattern Classification and Scene Analysis. Wiley, 1973

2. Bishop CM. Neural Networks for Pattern Recognition. Oxford University Press, 1995

3. Tråvén HGC. A neural network approach to statistical pattern classification by semiparametric estimation of probability density functions. IEEE Trans Neural Networks 1991; 2(3): 366–377

4. Ćwik J, Koronacki J. Probability density estimation using a Gaussian clustering algorithm. Neural Computing & Applications 1996; 4: 149–160

5. Reynolds DA, Rose RC. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans Speech and Audio Processing 1995; 3(1): 72–83

6. McLachlan GJ, Basford KE. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, 1988

7. Ruck DW, Rogers SK, Kabrisky M, Oxley ME, Suter BW. The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans Neural Networks 1990; 1(4): 296–298

8. Richard MD, Lippmann RP. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation 1991; 3: 461–483

9. Specht DF. Probabilistic neural networks. Neural Networks 1990; 3: 109–118

10. Streit RL, Luginbuhl TE. Maximum likelihood training of probabilistic neural networks. IEEE Trans Neural Networks 1994; 5(5): 764–783

11. Lin SH, Kung SY, Lin LJ. Face recognition/detection by probabilistic decision-based neural network. IEEE Trans Neural Networks: Special Issue on Biometric Identification 1997; 8(1): 114–132

12. Roberts S, Tarassenko L. A probabilistic resource allocating network for novelty detection. Neural Computation 1994; 6: 270–284

13. Mak MW, Li CK, Li X. Maximum likelihood estimation of elliptical basis function parameters with application to speaker verification. Proc Int Conf Signal Processing 1998; 1287–1290

14. Kung SY, Taur JS. Decision-based neural networks with signal/image classification applications. IEEE Trans Neural Networks 1995; 6: 170–181

15. Kung SY. Digital Neural Networks. Prentice Hall, 1993

16. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J Roy Statistical Soc B 1977; 39(1): 1–38

17. Lippmann RP. Pattern classification using neural networks. IEEE Commun Mag 1989; 27: 47–54

18. Ng K, Lippmann RP. Practical characteristics of neural network and conventional pattern classifiers. In: Lippmann RP et al., eds, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, 1991; 970–976

19. Karayiannis NB, Mi GW. Growing radial basis neural networks: Merging supervised and unsupervised learning with network growth techniques. IEEE Trans Neural Networks 1997; 8(6): 1492–1506