
Acoustic Modeling for Speech Recognition Based on a Generalized Laplacian Mixture Distribution

Atsushi Nakamura

ATR Interpreting Telecommunications Research Laboratories, Kyoto, 619-0288 Japan

SUMMARY

In acoustic modeling for speech recognition, the Gaussian distribution or the Gaussian mixture distribution is widely used. The general reason for preference of the Gaussian distribution in the parametric modeling of an unknown ensemble is the central limit theorem. The Gaussian distribution has many properties that are theoretically clear. For the particular problem, however, in which the time series of an acoustic feature is to be modeled on the basis of a limited number of training samples for speech recognition, there is no guarantee that the method based on the Gaussian distribution is always optimal. Consequently, this paper proposes an acoustic modeling approach based on the generalized Laplacian distribution, which can represent a wider range of distribution shapes, including the Laplacian and Gaussian distributions. The formulation of the generalized Laplacian distribution and the method of estimation of the distribution parameters are described. The acoustic model with the generalized Laplacian mixture output distribution is constructed by retraining of the hidden Markov model with the Gaussian mixture output distribution. It is shown by a continuous speech recognition experiment using naturally uttered speech that the recognition performance is improved compared to recognition based on the Gaussian mixture distribution. © 2002 Wiley Periodicals, Inc. Electron Comm Jpn Pt 2, 85(11): 32–42, 2002; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjb.10093

Key words: speech recognition; acoustic model; distribution shape; kurtosis; generalized Laplacian distribution.

1. Introduction

In acoustic modeling for speech recognition, the Gaussian distribution and the Gaussian mixture distribution are widely used. The typical reason for the general preference of the Gaussian distribution in the parametric modeling of an unknown ensemble is the central limit theorem. Various properties of the Gaussian distribution are theoretically clear. In the particular problem, however, in which the time series of an acoustic feature parameter is to be modeled for speech recognition using a limited number of training samples, there is no guarantee that the method based on the Gaussian distribution is always optimal.

The Laplacian distribution [1] is another continuous distribution that is used in acoustic modeling. Compared to the Gaussian distribution, the Laplacian distribution has a sharper peak and a wider spread, as a result of which a higher likelihood is assigned to samples far from the mean than in the Gaussian distribution. Because of this property, in the recognition of speech including periods that locally deviate from the model, the correct hypothesis should be less likely to be rejected due to decreased likelihood.

On the other hand, in the discrimination of categories with similar feature parameter distributions, there arises the problem that misdiscrimination may easily occur due to excessive overlap of the distributions. It is impossible to decide beforehand what kind of distribution is suited to modeling a specified kind of acoustic feature parameter with particular training samples. Consequently, there is a limit to the modeling performance so long as a distribution of fixed form, such as the Gaussian distribution or the Laplacian distribution, is used.

© 2002 Wiley Periodicals, Inc. Electronics and Communications in Japan, Part 2, Vol. 85, No. 11, 2002. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J83-D-II, No. 11, November 2000, pp. 2118–2127.


A mixture distribution whose components are fixed-shape distributions (such as the Gaussian mixture distribution) is a useful tool that can represent various distribution shapes. It is empirically known, however, that even if the number of mixture components is increased in order to improve the representation power of the distribution, the improvement in speech recognition performance is limited by the small number of training samples. In such a case, the recognition performance improves more if each individual component distribution has greater representation power. Alternatively, the same recognition performance can be realized with fewer component distributions than with the Gaussian mixture distribution.

This paper proposes an acoustic modeling approach based on the generalized Laplacian distribution, which can represent a wider range of distribution shapes, including the Laplacian and Gaussian distributions. The formulation of the generalized Laplacian distribution and the method of estimation of the distribution parameters are presented. An acoustic model using the generalized Laplacian mixture output distribution is constructed by retraining of the hidden Markov model (HMM) with the Gaussian mixture output distribution. It is shown by a speech recognition experiment that the recognition performance is improved compared to the model based on the Gaussian mixture distribution.

In Ref. 2, the author has already demonstrated the essential effectiveness of acoustic modeling that takes the distribution shape into consideration, based on the generalized Laplacian distribution. Independently, Basu and colleagues performed a similar study and demonstrated the effectiveness of the method in Refs. 3 and 4. This paper discusses in detail the method and the effectiveness of acoustic modeling based on the generalized Laplacian distribution, extending the content of Ref. 2.

This paper is organized as follows. Section 2 formulates the generalized Laplacian distribution. Section 3 presents the method of estimation of the distribution parameters. Section 4 presents the training procedure for the hidden Markov model with the generalized Laplacian mixture output distribution, the construction of an actual acoustic model, and the results of a continuous speech recognition experiment using the model. Section 5 discusses the effectiveness of the proposed method based on the experimental results.

2. The Generalized Laplacian Distribution

2.1. Modeling mismatch at the distribution shape level

The density functions and the parameters of the Laplacian and Gaussian distributions are summarized in Table 1. The Laplacian and Gaussian distributions are both included in the family of exponential distributions. They are specified by the location and the scale parameters. As indicated by the parameter names, these distributions provide flexible representations in terms of the location and the scale on the axis of the random variable. But the range of shapes of the probability density curve produced by changes of the scale parameter is strongly restricted by the expression used for the probability density function. In other words, these distributions have insufficient representation power with respect to the distribution shape.

Table 1. Laplacian and Gaussian distributions
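The body of Table 1 did not survive extraction. For reference, the standard one-dimensional densities it presumably summarized, written with the location parameter µ and scale parameter β used throughout this paper, are:

$$ \text{Laplacian: } f(x) = \frac{1}{2\beta} \exp\left( -\frac{|x-\mu|}{\beta} \right) $$

$$ \text{Gaussian: } f(x) = \frac{1}{\sqrt{2\pi\beta}} \exp\left( -\frac{(x-\mu)^{2}}{2\beta} \right), \qquad \beta = \sigma^{2} $$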

Consider, for example, an ensemble completely following the Laplacian distribution (a Laplacian ensemble). When the ensemble is modeled by maximum likelihood estimation assuming the Gaussian distribution, there inevitably arises a mismatch with the Laplacian ensemble distribution, as shown in Fig. 1. The extent of the mismatch becomes more marked in the tail of the distribution if the logarithmic likelihood is used.

Fig. 1. Mismatch in distribution shape between Laplacian population distribution and Gaussian distribution.
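This tail mismatch is easy to reproduce numerically. The following sketch (not from the paper; sample sizes and values are illustrative) fits a Gaussian by maximum likelihood to samples drawn from a Laplacian population and compares the log-densities near the mean and in the tail:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a Laplacian "population" (location 0, scale 1).
x = rng.laplace(loc=0.0, scale=1.0, size=100_000)

# Maximum likelihood Gaussian fit: sample mean and standard deviation.
mu, sigma = x.mean(), x.std()

def log_gauss(t):
    # Log-density of the fitted Gaussian.
    return -0.5 * np.log(2 * np.pi * sigma**2) - (t - mu) ** 2 / (2 * sigma**2)

def log_laplace(t):
    # Log-density of the true population distribution.
    return -np.log(2.0) - np.abs(t)

# The fitted Gaussian tracks the population near the mean but assigns far
# lower log-likelihood in the tail, which is the mismatch shown in Fig. 1.
for t in (0.0, 2.0, 6.0):
    print(f"x = {t:3.1f}  laplace: {log_laplace(t):6.2f}  gauss: {log_gauss(t):6.2f}")
```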

This problem arises not only between the Laplacian ensemble and the Gaussian distribution, but is inevitable whenever an unknown ensemble is modeled by assuming an a priori distribution shape. This may degrade the accuracy of score comparison between hypotheses by the acoustic model in speech recognition. In the following, we consider a method of reducing the above mismatch by modeling in which a distribution shape parameter is introduced.

Thus, the major points of this paper are as follows:

• Derivation of the distribution (more precisely, the density function) with a parameter representing the shape

• Proposal of a parameter estimation method for the derived distribution


• Verification of the improvement in speech recognition performance achieved by the distribution shape information extracted from a limited number of data samples.

2.2. Derivation of the generalized Laplacian density function

The Laplacian and Gaussian distributions are combined, and the generalized Laplacian distribution is formulated with the distribution shape parameter. The following properties are derived from Table 1 as the common properties of the Laplacian and Gaussian distributions.

(Property 1) The distributions belong to the family of exponential distributions. A simple N-th order function is obtained by forming the logarithm of the density function (N is 1 or 2).

(Property 2) The maximum likelihood estimate of the scale parameter is given by the mean of the N-th power of the absolute error between the maximum likelihood estimate of the location parameter and the data sample values (N is 1 or 2).

It is desirable in the formulation that the derived generalized Laplacian distribution retain these properties as far as possible, so that the model parameter estimation algorithm can be inherited in the implementation.

First, consider the following function, which formally includes the Laplacian and Gaussian density functions:

$$ f(x) = \frac{1}{\lambda} \exp\left( -\frac{|x-\mu|^{\rho}}{\rho\beta} \right) \tag{1} $$

where ρ is a positive real number. The logarithm of f(x) is a ρ-th order function, and it is seen that the distribution with f(x) as its density function has Property 1, with N extended to a positive real number ρ.

In order for f(x) to be a density function, its two-sided infinite integral must be 1:

$$ \int_{-\infty}^{\infty} f(x)\, dx = 1 \tag{2} $$

In other words,

$$ f(x) = \frac{ \exp\left( -\frac{|x-\mu|^{\rho}}{\rho\beta} \right) }{ \int_{-\infty}^{\infty} \exp\left( -\frac{|x-\mu|^{\rho}}{\rho\beta} \right) dx } \tag{3} $$

Consider the definition of Euler's gamma function (or the Euler integral of the second kind),

$$ \Gamma(s) = \int_{0}^{\infty} t^{s-1} e^{-t}\, dt \tag{4} $$

Calculating the denominator of the right-hand side of Eq. (3),

$$ \int_{-\infty}^{\infty} \exp\left( -\frac{|x-\mu|^{\rho}}{\rho\beta} \right) dx = 2 (\rho\beta)^{1/\rho}\, \Gamma\!\left( 1+\frac{1}{\rho} \right) \tag{5} $$

Consequently,

$$ \lambda = 2 (\rho\beta)^{1/\rho}\, \Gamma\!\left( 1+\frac{1}{\rho} \right) \tag{6} $$

Finally, the generalized Laplacian density function is

$$ f(x) = \frac{1}{2 (\rho\beta)^{1/\rho}\, \Gamma\!\left( 1+\frac{1}{\rho} \right)} \exp\left( -\frac{|x-\mu|^{\rho}}{\rho\beta} \right) \tag{7} $$

The generalized Laplacian distribution represented by the above density function has the parameter ρ representing the distribution shape, together with the location parameter µ and the scale parameter β. The generalized Laplacian distribution coincides with the Laplacian and Gaussian distributions when ρ = 1 and ρ = 2, respectively. When ρ takes a value other than these, the shape differs from both the Laplacian and the Gaussian distribution (Fig. 2). In other words, by setting ρ appropriately, modeling with a distribution shape better matched to the ensemble distribution can be expected.

Fig. 2. Several shapes of generalized Laplacian distribution.
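As a concrete check of the reconstructed density of Eq. (7) (the exact parametrization is inferred here from Properties 1 and 2, so treat it as an assumption), a direct implementation reduces to the Laplacian at ρ = 1 and to the Gaussian at ρ = 2:

```python
import math

def gen_laplace_pdf(x, mu=0.0, beta=1.0, rho=2.0):
    """Reconstructed Eq. (7); the normalizer is 2*(rho*beta)**(1/rho)*Gamma(1+1/rho)."""
    z = 2.0 * (rho * beta) ** (1.0 / rho) * math.gamma(1.0 + 1.0 / rho)
    return math.exp(-abs(x - mu) ** rho / (rho * beta)) / z

# rho = 1: Laplacian with scale beta; rho = 2: Gaussian with variance beta.
print(gen_laplace_pdf(0.5, rho=1.0), 0.5 * math.exp(-0.5))
print(gen_laplace_pdf(0.5, rho=2.0), math.exp(-0.125) / math.sqrt(2 * math.pi))
```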

In the study by Basu's group [3, 4], a density function obtained via a generalization process different from that leading to Eq. (7) is used. The density function in Ref. 3, for example, corresponds to the result when the parameter transformation of Eq. (8) [not recovered from the source] is applied to the scale parameter in Eq. (7). The two density functions are equivalent in terms of shape representation power as one-dimensional distributions, although the implications of the scale parameter are different.

3. Parameter Estimation for the Generalized Laplacian Distribution

3.1. Estimation of location and scale parameters

Maximum likelihood estimation based on analytic solution of the likelihood equation is widely used in the estimation of distribution parameters in acoustic modeling. It is not easy, however, to solve the likelihood equation for the location parameter of the generalized Laplacian distribution in general. In other words, it is difficult to apply maximum likelihood estimation strictly by analytic means.

Consequently, we determine the location parameter value that locally maximizes the logarithmic likelihood function by the gradient method. For the sample set {x₀, . . . , x_{M−1}}, we set

$$ \hat{\mu} \,\leftarrow\, \hat{\mu} + \varepsilon \frac{\partial}{\partial \mu} \sum_{i=0}^{M-1} \log f(x_i) \;=\; \hat{\mu} + \varepsilon \sum_{i=0}^{M-1} \frac{\operatorname{sgn}(x_i - \hat{\mu})\, |x_i - \hat{\mu}|^{\rho-1}}{\beta} \tag{9} $$

where ε is the learning factor in the gradient method. µ is updated until the logarithmic likelihood function converges, and the finally obtained value is the estimate of the location parameter.

Next, the likelihood equation for the scale parameter is solved on the basis of the estimate of the location parameter. In other words, consider

$$ \frac{\partial}{\partial \beta} \sum_{i=0}^{M-1} \log f(x_i) = 0 \tag{10} $$

Simplifying the expression,

$$ -\frac{M}{\rho\beta} + \frac{1}{\rho\beta^{2}} \sum_{i=0}^{M-1} |x_i - \hat{\mu}|^{\rho} = 0 \tag{11} $$

Then,

$$ \hat{\beta} = \frac{1}{M} \sum_{i=0}^{M-1} |x_i - \hat{\mu}|^{\rho} \tag{12} $$

It should be noted that the above value is the mean of the ρ-th power of the absolute error between the maximum likelihood estimate of the location parameter and the data samples. It is an extension of Property 2 in Section 2.2, with N extended to a positive real number ρ. This is also convenient from the viewpoint of implementation, in the sense that the generalized Laplacian distribution can be treated as an extension of the Gaussian and other distributions.
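A minimal sketch of this single-distribution scheme, assuming the reconstructed updates of Eqs. (9) and (12) and a fixed illustrative learning factor (the paper's per-distribution step-size rule appears later as Eq. (23)):

```python
import numpy as np

def estimate_location_scale(x, rho, eps=0.05, n_iter=500):
    """Alternate a gradient step for mu (Eq. (9), here averaged over samples)
    with the closed-form scale estimate (Eq. (12)); rho is held fixed."""
    mu, beta = float(np.mean(x)), float(np.var(x)) + 1e-12  # Gaussian-style init
    for _ in range(n_iter):
        d = x - mu
        grad = np.mean(np.sign(d) * np.abs(d) ** (rho - 1.0)) / beta
        mu += eps * grad                       # gradient ascent on log-likelihood
        beta = np.mean(np.abs(x - mu) ** rho)  # ML scale given the current mu
    return mu, beta

rng = np.random.default_rng(1)
x = rng.laplace(0.0, 1.0, 5000)
print(estimate_location_scale(x, rho=1.0))     # beta should approach E|x - mu| = 1
```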

3.2. Estimation of the distribution shape parameter

The distribution shape parameter, which is the distinguishing feature of the generalized Laplacian distribution, has the role of reflecting in the estimated distribution the shape of the ensemble distribution extracted from the training data. At the same time, from the viewpoint of speech recognition based on likelihood comparison between hypotheses, it is desirable that the likelihood be further improved by removing the constraints of the Gaussian distribution.

These two goals cannot always be achieved simultaneously, and the whole training process should be controlled so as to realize estimation with a compromise between the two. In maximum likelihood determination of the distribution shape parameter estimate, it is difficult to apply an analytical procedure, as in the case of the location parameter. Furthermore, the results of preliminary experiments indicate that the convergence of the gradient method is not always good, and that the method is not suited to the training of the hidden Markov model, in which a large number of iterative computations are required.

Consequently, the distribution shape parameter is estimated considering only the match with the distribution shape. Improvement of the likelihood is performed in the later estimation of the location and the scale parameters.

More precisely, the distribution shape parameter is estimated from the kurtosis, which is a statistic related to the shape of a probability distribution. The kurtosis is given by normalizing the fourth moment of the probability distribution by the square of the second moment.


For the generalized Laplacian distribution, the kurtosis is calculated as

$$ \kappa = \frac{\Gamma(5/\rho)\, \Gamma(1/\rho)}{\Gamma(3/\rho)^{2}} \tag{13} $$

The value is determined only by the distribution shape parameter. By equating Eq. (13) with the kurtosis obtained from the training data samples, an estimate of the distribution shape parameter giving a better shape match should be obtained.

The sample kurtosis for {x₀, . . . , x_{M−1}} is

$$ \hat{K} = \frac{ \frac{1}{M} \sum_{i=0}^{M-1} (x_i - \hat{\mu})^{4} }{ \left[ \frac{1}{M} \sum_{i=0}^{M-1} (x_i - \hat{\mu})^{2} \right]^{2} } \tag{14} $$

The result of Eq. (13) is considered as a function of ρ. Let

$$ \kappa(\rho) = \frac{\Gamma(5/\rho)\, \Gamma(1/\rho)}{\Gamma(3/\rho)^{2}} \tag{15} $$

Using the inverse function of κ(ρ), the value of ρ that equates Eq. (13) and the sample kurtosis is obtained as follows:

$$ \hat{\rho} = \kappa^{-1}(\hat{K}) \tag{16} $$
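A sketch of this moment-matching step follows. Since κ(ρ) is monotone decreasing, its inverse can be taken by bisection (the paper uses a precomputed table; bisection here is a simple substitute):

```python
import math

def kappa(rho):
    """Kurtosis of the generalized Laplacian as a function of rho (Eq. (15)),
    computed in log space to stay stable for small rho."""
    lg = math.lgamma
    return math.exp(lg(5.0 / rho) + lg(1.0 / rho) - 2.0 * lg(3.0 / rho))

def estimate_rho(sample_kurtosis, lo=0.2, hi=10.0, n_iter=60):
    """Eq. (16): invert the monotone-decreasing kappa by bisection."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if kappa(mid) > sample_kurtosis:
            lo = mid                 # kurtosis still too large: increase rho
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(kappa(2.0), kappa(1.0))        # 3.0 (Gaussian) and 6.0 (Laplacian)
print(estimate_rho(6.0))             # close to 1.0
```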

3.3. Extension to the generalized Laplacian mixture distribution

The above parameter estimation is extended to the parameter estimation of the generalized Laplacian mixture distribution

$$ p(x) = \sum_{l} w_l\, f_l(x) \tag{17} $$

where w_l is the mixture weight and f_l is the generalized Laplacian density of component l with parameters µ_l, β_l, and ρ_l. As regards the estimation of the location parameter, the existence probability p(i, l) of each sample x_i in the component distribution l is first considered. Equation (9) is rewritten as

$$ \hat{\mu}_l \,\leftarrow\, \hat{\mu}_l + \varepsilon \sum_{i=0}^{M-1} p(i, l)\, \frac{\operatorname{sgn}(x_i - \hat{\mu}_l)\, |x_i - \hat{\mu}_l|^{\rho_l - 1}}{\beta_l} \tag{18} $$

where

$$ \sum_{l} p(i, l) = 1 \tag{19} $$

for all i. As regards the scale parameter and the distribution shape parameter, Eqs. (12) and (16) are rewritten as follows, using the existence probability p(i, l):

$$ \hat{\beta}_l = \frac{ \sum_{i=0}^{M-1} p(i, l)\, |x_i - \hat{\mu}_l|^{\rho_l} }{ \sum_{i=0}^{M-1} p(i, l) } \tag{20} $$

$$ \hat{\rho}_l = \kappa^{-1}\!\left( \frac{ \frac{1}{P_l} \sum_{i=0}^{M-1} p(i, l)\, (x_i - \hat{\mu}_l)^{4} }{ \left[ \frac{1}{P_l} \sum_{i=0}^{M-1} p(i, l)\, (x_i - \hat{\mu}_l)^{2} \right]^{2} } \right), \qquad P_l = \sum_{i=0}^{M-1} p(i, l) \tag{21} $$
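A sketch of one posterior-weighted pass over Eqs. (18), (20), and (14)/(21), under the same reconstructed parametrization; `p` is a hypothetical array holding the existence probabilities p(i, l) for a single component l:

```python
import numpy as np

def weighted_reestimates(x, p, mu, beta, rho, eps=0.05):
    """One posterior-weighted pass for component l; p[i] stands for p(i, l)."""
    w = p.sum()                                           # expected sample count
    d = x - mu
    grad = np.sum(p * np.sign(d) * np.abs(d) ** (rho - 1.0)) / (w * beta)
    mu_new = mu + eps * grad                              # weighted Eq. (18)
    beta_new = np.sum(p * np.abs(x - mu_new) ** rho) / w  # weighted Eq. (20)
    m2 = np.sum(p * (x - mu_new) ** 2) / w                # weighted moments for
    m4 = np.sum(p * (x - mu_new) ** 4) / w                # the sample kurtosis
    return mu_new, beta_new, m4 / m2**2                   # kurtosis feeds Eq. (21)

rng = np.random.default_rng(2)
x = rng.laplace(size=1000)
p = np.full(1000, 0.5)                                    # dummy posteriors
print(weighted_reestimates(x, p, mu=0.0, beta=1.0, rho=1.0))
```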

4. Experiment

4.1. Training of the generalized Laplacian mixture output distribution HMM

An HMM with the Gaussian mixture output distribution was used as the initial model, and the acoustic model with the generalized Laplacian mixture output distribution was trained from it.

(Conditions of experiment and implementation) Naturally uttered speech of 99 males and 131 females (total utterance time approximately 200 minutes) from the ATR travel arrangement corpus [5, 6] was used. Tables 2 and 3 show the specifications of the speaker-independent acoustic model used as the initial model. It is a state-shared, phoneme-environment-dependent HMM (an HMnet with 800 states) based on maximum likelihood successive state splitting [7].

The structural conditions are as follows. The covariance matrix is always diagonal, so the estimation of the parameter vector in the generalized Laplacian mixture distribution can be performed independently for each element. The component distributions in the model after training therefore have different shapes depending on the dimension. In Ref. 3, the experiment is performed with the shape parameter set in common to all dimensions for each component distribution. The experiment in this paper thus considers shape estimation of a more flexible multidimensional distribution than that in Ref. 3.

In the calculation of Euler's gamma function in the generalized Laplacian density function, the Lanczos formula [8, 9] is used:

$$ \Gamma(z+1) = \left( z + \gamma + \tfrac{1}{2} \right)^{z + \frac{1}{2}} e^{-\left( z + \gamma + \frac{1}{2} \right)} \sqrt{2\pi} \left[ c_0 + \sum_{n=1}^{N} \frac{c_n}{z+n} + \epsilon \right] \tag{22} $$

where we set γ = 5, N = 6. The coefficients c_n (n = 0, 1, 2, 3, 4, 5, 6) are set exactly the same as in Ref. 8. In the calculation of the inverse function of κ(ρ) in Eq. (21), it is noted that κ(ρ) in Eq. (15) is monotone decreasing, and the value is calculated using a table prepared beforehand.
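For reference, a sketch of the γ = 5, N = 6 Lanczos approximation with the published Numerical Recipes coefficients [8]; it can be checked against a library gamma function:

```python
import math

# Numerical Recipes "gammln" coefficients (gamma = 5, N = 6) [8].
COF = [76.18009172947146, -86.50532032941677, 24.01409824083091,
       -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5]

def lanczos_lgamma(x):
    """ln Gamma(x) for x > 0 via the Lanczos series of Eq. (22)."""
    tmp = x + 5.5
    tmp -= (x + 0.5) * math.log(tmp)
    ser = 1.000000000190015
    for j, c in enumerate(COF):
        ser += c / (x + 1.0 + j)
    return -tmp + math.log(2.5066282746310005 * ser / x)

for x in (0.5, 1.0, 3.7):
    print(x, lanczos_lgamma(x), math.lgamma(x))   # agree to ~1e-10
```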

(Training procedure) We wish to improve the match with the distribution shape, and to improve the likelihood further, by removing the constraints of the Gaussian distribution forming the basis of the initial model and by realizing shape-matched estimation. The detailed training procedure is as follows.

1. Forward calculation is applied to the initial model and the training data. The total likelihood thus obtained is stored.

2. Backward calculation is applied to the initial model and the training data. Combined with the result of the forward calculation, the existence probability of each sample in each distribution is calculated.

3. The distribution shape parameter is reestimated using Eq. (21).

4. The following steps 5 to 8 are iterated until the total likelihood converges.

5. The location parameter is reestimated by iterative evaluation of Eq. (18).

6. The scale parameter is reestimated by Eq. (20).

7. The mixture weights and the state transition probabilities are reestimated.

8. Forward and backward calculations are performed. The total likelihood and the existence probability of each sample in each distribution are calculated.

9. If the total likelihood does not exceed the stored value, the procedure ends, with the model before the reestimation of the distribution shape parameter taken as the final result. If the total likelihood exceeds the stored value, the new value is stored in place of the old one and the procedure returns to step 3.
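The control flow of steps 1 to 9 can be summarized as follows. This is a structural sketch only: `forward_backward`, `reestimate_shape`, and the other helpers are hypothetical stand-ins for the usual Baum-Welch machinery, not routines from the paper.

```python
import copy
import math

def retrain_glm_hmm(model, data, tol=1e-4):
    """Structural sketch of steps 1-9 with hypothetical helper routines."""
    best_ll, post = forward_backward(model, data)        # steps 1-2
    while True:
        snapshot = copy.deepcopy(model)                  # model before step 3
        reestimate_shape(model, post)                    # step 3: Eq. (21)
        prev_ll = -math.inf
        while True:                                      # step 4: iterate 5-8
            reestimate_location(model, post)             # step 5: Eq. (18)
            reestimate_scale(model, post)                # step 6: Eq. (20)
            reestimate_weights_transitions(model, post)  # step 7
            ll, post = forward_backward(model, data)     # step 8
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        if ll <= best_ll:                                # step 9: no overall gain,
            return snapshot                              # keep the previous model
        best_ll = ll                                     # otherwise store and repeat
```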

The learning factor ε in Eq. (18), used in updating the location parameter in step 5, is specified separately for each multidimensional component distribution. It is set as follows for the k-th update of the location parameter vector µ_l of component distribution l:

(23) [equation not recovered from the source]

Here, ∇f_l is the gradient vector of the multidimensional density function f_l with respect to the location parameter vector µ_l, and |·| denotes the norm of the vector.

The distribution shape parameter is updated only when the likelihood improvement achieved by iterative calculation of the location and the scale parameters in steps 5 to 8 has converged. The reason is that improvement of the likelihood is not always guaranteed in the shape-match estimation, and the procedure may have an adverse effect on the maximization of the likelihood. In fact, in a preliminary experiment performed before the main experiment, the likelihood was degraded in most distributions when the estimation of the shape parameter was included in the iterative procedure.

Table 2. Conditions of acoustic analysis (1)

Table 3. Topological conditions of initial model

Since the estimation of the distribution shape parameter uses higher-order statistics, the performance may be degraded by this adverse effect unless a large number of samples can be used to maintain high reliability. Consequently, a threshold is set beforehand on the number of samples for a distribution, and the estimation is performed only when the sum of the existence probabilities of the samples for each distribution (i.e., the expected number of samples to be used in the estimation) exceeds the threshold.
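The gating rule amounts to a one-line test on the expected sample count; a sketch with hypothetical names:

```python
import numpy as np

def should_reestimate_shape(posteriors, threshold=150.0):
    """Re-estimate rho for a distribution only when the expected number of
    samples assigned to it (the sum of existence probabilities) is large
    enough for the higher-order statistics to be reliable."""
    return float(np.sum(posteriors)) >= threshold
```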

Using the above conditions and procedure, a phoneme-environment-dependent HMM with the generalized Laplacian mixture output distribution was constructed using the same data as in the training of the initial model.

4.2. Continuous speech recognition experiment

Using the constructed acoustic model, a continuous speech recognition experiment was performed. Table 4 shows the language model and the speech recognition engine used in the experiment. As test data, naturally uttered speech of 17 males and 25 females (total number of utterances 551, total utterance time approximately 40 minutes) from the ATR travel arrangement corpus, different from the training data, was used. The performance of the initial model on this test data was 72.6% in terms of word accuracy (accuracy for the top-ranked word).

Figure 3 shows the speech recognition performance after training as a function of the threshold on the number of samples. When the threshold is too low, there is an adverse effect due to estimation with low reliability, and the recognition performance is degraded below that of the initial model. When the threshold is too high, the number of generalized Laplacian distributions in the model decreases, and the recognition performance approaches that of the initial model. The recognition performance is maximized when the threshold is approximately 150. Compared to the initial model, the word accuracy is improved by approximately 1.5% (an error reduction rate of 5.5%). For this case, the number of component distributions estimated as generalized Laplacian distributions is 2023 (52,598 as one-dimensional distributions), which corresponds to approximately 25% of the whole model.

Table 4. Specifications of language model and search engine

Fig. 3. Speech recognition performance with varying threshold numbers of training samples.

Fig. 4. Improvement of performance for each speaker.

Figure 4 shows the improvement (or degradation) of the word accuracy for each speaker at peak performance. For most of the speakers, the performance is improved, indicating that the proposed method is effective uniformly over the whole body of test data, regardless of the speaker. Figure 5 shows the change of performance in comparison with the conventional approach when the number of mixtures in the Gaussian mixture output distribution is increased. It is evident that, even if the number of mixtures is increased to 30, the word accuracy is improved by only about 0.5%, which is inferior to the performance of the generalized Laplacian mixture output distribution with 10 mixtures.

Fig. 5. Comparison with mixture-inflated initial models.

4.3. Consistency of effect for various acoustic features

In modeling the distribution shape, the effectiveness may depend on the nature of the samples considered as objects of modeling. Consequently, similar training was performed under acoustic analysis conditions based on linear prediction (LPC) cepstrum coefficients, which differ from the conditions shown in Table 2, and the effectiveness of the model was examined. Table 5 shows the conditions of acoustic analysis in this case. The structural conditions for the initial model were the same as in Table 3.

Figure 6 shows the peak performance in the continuous speech recognition experiment described in Section 4.2, with a sample number threshold of 150. Almost the same improvement as in Section 4.2 is obtained, indicating that the performance is improved not only for MFCC, but also for the LPC cepstrum.

Table 5. Conditions of acoustic analysis (2)

Fig. 6. Improvement in performance under condition of LPC-cepstrum-based acoustic analysis.

5. Discussion

5.1. Improvement of shape match by training and performance improvement

In the training procedure shown in Section 4.1, after matching to the distribution shape by estimating the distribution shape parameter, the sample existence probability is iteratively recalculated for the maximum likelihood estimation of the other parameters. Thus it may happen that the modeled distribution shape does not follow the distribution shape provided by the samples.

Consequently, we investigated the extent to which matching to the distribution shape is achieved in the course of training, and how the match contributes to the improvement of speech recognition performance. The index of shape matching is defined as the absolute value of the difference between the kurtosis of each distribution obtained by modeling and the kurtosis of the samples at the time when the immediately following maximum likelihood estimation converges. Figure 7 shows the change of the index at each convergence of the iterative calculation in maximum likelihood estimation, for a sample number threshold of 150.
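A sketch of this shape-match index for a single distribution, assuming posterior-weighted sample moments (the weighting is an assumption; the paper does not spell it out):

```python
import numpy as np

def shape_match_index(model_kurtosis, x, p):
    """|model kurtosis - sample kurtosis| for one distribution; smaller values
    indicate a better match of distribution shape (Section 5.1)."""
    w = p.sum()
    mu = np.sum(p * x) / w
    m2 = np.sum(p * (x - mu) ** 2) / w
    m4 = np.sum(p * (x - mu) ** 4) / w
    return abs(model_kurtosis - m4 / m2**2)
```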

It is evident that the mismatch of the distribution shape decreases as a whole, and that the variance (standard deviation) of the index also decreases. The finally obtained logarithmic likelihood (−4.61 × 10⁶) is less than the value (−4.41 × 10⁶) for the case of Section 4.2 in which the number of mixtures in the Gaussian mixture distribution is increased to 30. Thus, it is observed that the proposed training procedure improves performance not only by improving the likelihood, but also by improving the match to the distribution shape.

Fig. 7. A process of reduction in distribution-shape mismatch.

5.2. Spread of the distribution shape parameter after training

The actual values of the shape parameters of the vector elements (one-dimensional distributions) were examined for the component distributions obtained by training as generalized Laplacian distributions. Figure 8 shows a histogram, with a bin width of 0.1, of the 52,598 one-dimensional distributions obtained by training as generalized Laplacian distributions, for a sample number threshold of 150.

Fig. 8. A distribution of values of shape parameter.

The distribution shape parameter ρ has a peak near 2.0, which is the value in the initial model, and spreads mostly toward lower values. When training is performed with ρ fixed to a common value for all one-dimensional distributions, the recognition performance is better than that of the initial model (word accuracy 72.9%) when the value is somewhat less than 2.0. The result in Fig. 8 thus agrees with the result of this preliminary experiment.

In Ref. 3, on the other hand, the peak of the value corresponding to ρ is obtained near 0.8, which is less than 1.0. But in the experiment reported in this paper, the recognition performance when ρ is fixed near 1.0 is inferior to that of the initial model (word accuracy 71.2% for ρ = 1.0). Thus, even if the estimation result were that the peak of ρ is near 1.0, the recognition performance would not be better in the present experiment. These results suggest that the value of ρ needed for better performance depends on the speech data being modeled. In other words, it is necessary to set the value of ρ automatically according to the speech data.

As regards the spread of the distribution shape parameter, it is interesting to examine the dependence on the kind of acoustic feature parameter and on the elements within each kind of acoustic parameter. In this experiment, however, no remarkable difference was observed. The reason seems to be as follows. In constructing a distribution shape matched to the acoustic parameters for the mixture distribution as a whole, each generalized Laplacian distribution is used as a component with only an indirect contribution. Thus, the properties of the acoustic feature parameter have no particular direct effect.

5.3. Computation time for training and recognition

In using the generalized Laplacian mixture output distribution, a practical problem is the computation time needed for training and recognition. In the following, the computation time at the present stage, as well as its reduction, are discussed.

(Computation time for training) The computation time needed for retraining from the initial model under the experimental conditions of Section 4.1 is approximately 70 hours on an Alpha workstation with a CPU clock of 500 MHz. This is approximately 10 times the computation time for retraining when the number of mixtures is simply increased in the Gaussian mixture distribution. The major reasons for the long computation time are the use of the gradient method in the estimation of the location parameter, and the new operations that must be implemented for calculations related to the generalized Laplacian density function.

In particular, for the former, it is necessary to accelerate the convergence of the location parameter estimation by some means. Essentially, it seems necessary to introduce an analytic estimation procedure, even if approximate. As a reference, when the sample mean is used in the estimation of the location parameter regardless of the distribution shape, the performance is still slightly improved. In other words, there is a possibility that an equivalent improvement can be achieved even if a rigorous solution is not used.

(Computation time for recognition) The computation time needed for recognition is approximately 3 times that in the case of the Gaussian mixture output distribution with the same number of mixtures. The computation time is also 3 times as long when the distribution shape parameter is set to 2.0 within the framework of the generalized Laplacian distribution. In other words, the long computation time is not due to an effect of the distribution shape on the search process, but is almost entirely due to the calculation of the generalized Laplacian density function.

In dealing with this problem, it is possible to improve the speed by table look-up for the operations to be implemented. In this case, for x^r, for example, a two-dimensional table must be prepared by quantizing both x and r. Thus, there is a need to investigate the trade-off between reduction of the table size and limitation of the quantization error.
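A sketch of such a two-dimensional lookup table for |x|^r, with illustrative grid sizes; the quantization granularity directly controls the error-speed trade-off discussed above:

```python
import numpy as np

class PowTable:
    """2-D lookup table for |x|**r with both x and r quantized,
    trading quantization error for speed (Section 5.3)."""
    def __init__(self, x_max=8.0, nx=4096, r_min=0.2, r_max=4.0, nr=64):
        self.xs = np.linspace(0.0, x_max, nx)
        self.rs = np.linspace(r_min, r_max, nr)
        self.table = self.xs[None, :] ** self.rs[:, None]   # shape (nr, nx)

    def pow(self, x, r):
        # Nearest-grid lookup; accuracy is limited by the grid spacing.
        i = np.clip(np.searchsorted(self.rs, r), 0, len(self.rs) - 1)
        j = np.clip(np.searchsorted(self.xs, abs(x)), 0, len(self.xs) - 1)
        return self.table[i, j]

t = PowTable()
print(t.pow(1.5, 2.0), 1.5 ** 2.0)   # close, up to quantization error
```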

6. Conclusions

This paper has proposed an acoustic modeling approach based on the generalized Laplacian distribution, which can represent a wider range of distribution shapes, including the Laplacian and Gaussian distributions. The generalized Laplacian distribution is formulated and a method of estimation of the distribution parameters is presented. An acoustic model with the generalized Laplacian mixture output distribution is constructed by retraining the hidden Markov model (HMM) with the Gaussian mixture output distribution. A speech recognition experiment shows that the recognition performance is better than when the Gaussian mixture distribution is used.

The purpose of this paper has been to refine the model with a balance between improvement of the likelihood and improvement of shape matching. Better recognition performance is obtained than with an acoustic model retrained by increasing the number of mixtures. This indicates the effectiveness of the approach in which the distribution shape is introduced, together with the likelihood, as an evaluation criterion for refining the acoustic model. It is noted that the effect was consistent, appearing for almost all speakers in the test data set.

Although the improvement of the recognition performance obtained in this study is not very large, the following observations may be made. The recognition performance has its maximum at a training sample number threshold near 150. Consequently, a greater performance improvement can be expected by using training data at a scale at which several hundred samples can be obtained for each distribution. It is also likely that additional factors affect the quality of the distribution shape parameter estimate, and the key to development of the proposed method is appropriate selection of the distributions subject to estimation based on these factors.

It is also necessary to attempt a reduction of the computation times for training and recognition. In particular, it is crucial to find a method of reducing the computation time for training when large-scale training data are used.

Acknowledgments. The author is grateful for suggestions by President S. Yamamoto and Mr. Y. Kousaka, head of the Second Laboratory, and for discussions with members of ATR.

REFERENCES

1. Beyerlein P, Ullrich M, Wilcox P. Modelling and decoding of context dependent phones in the Philips large vocabulary continuous speech recognition system. Proc EUROSPEECH-97, p 1163–1166, Rhodes, 1997.

2. Nakamura A. Acoustic model based on generalized Laplacian distribution. Proc Autumn Meet Acoust Soc Jpn, p 123–124, 1998.

3. Basu S, Micchelli CA, Olsen P. Maximum likelihood estimates for exponential type density families. Proc ICASSP-99, p 2066–2069, Phoenix, 1999.

4. Padmanabhan M, Saon G, Basu S, Huang J, Zweig G. Recent improvements in voicemail transcription. Proc EUROSPEECH-99, p 503–506, Budapest, 1999.

5. Morimoto T, Uratani N, Takezawa T, Furuse O, Sobashima Y, Iida H, Nakamura A, Sagisaka Y, Higuchi N, Yamazaki Y. A speech and language database for speech translation research. Proc ICSLP-94, p 1791–1794, Yokohama, 1994.

6. Nakamura A, Matsunaga S, Shimizu T, Tonomura M, Sagisaka Y. Japanese speech databases for robust speech recognition. Proc ICSLP-96, p 2199–2202, Philadelphia, 1996.

7. Ostendorf M, Singer H. HMM topology design using maximum likelihood successive state splitting. Computer Speech and Language 1997;11:17–41.

8. Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C. Cambridge University Press; 1988.

9. Lanczos C. A precision approximation of the gamma function. J Soc Ind Appl Math Ser B Numer Anal 1964;1:86–96.

10. Masataki H, Sagisaka Y. Variable order N-gram generation by word class splitting and consecutive word grouping. Proc ICASSP-96, p 188–191, Atlanta, 1996.

11. Shimizu T, Yamamoto H, Kousaka Y. Reduction of word hypotheses for large vocabulary continuous speech recognition. Trans IEICE 1996;J79-D-II:2117–2124.

AUTHOR

Atsushi Nakamura (member) received his B.S. degree from the Department of Informatics, Faculty of Engineering, Kyushu University, in 1985. He completed the M.E. program in 1987 and joined NTT, where he was principally engaged in research on communication service control. From 1994 to 1999 he was involved in research on natural language speech recognition at the ATR Interpreting Telephony Research Laboratory. Since 2000, he has been affiliated with NTT Communication Science Basic Research Laboratories. He is a member of the Acoustical Society of Japan.
