Smoothing Hidden Markov Models by Using an
Adaptive Signal Limiter for Noisy Speech Recognition
Wei-Wen Hung
Department of Electrical Engineering Ming Chi Institute of Technology
Taishan, 243, Taiwan, Republic of China E-mail : [email protected]
FAX : 886-02-2903-6852; Tel. : 886-02-2906-0379
and
Hsiao-Chuan Wang
Department of Electrical Engineering National Tsing Hua University
Hsinchu, 30043, Taiwan, Republic of China E-mail : [email protected]
FAX : 886-03-571-5971; Tel. : 886-03-574-2587
Paper No. : 1033. (second review)
Corresponding Author : Hsiao-Chuan Wang
Key Words : hidden Markov model (HMM), hard limiter, adaptive signal limiter
(ASL), autocorrelation function, arcsin transformation.
Smoothing hidden Markov models by using an adaptive signal limiter for noisy speech recognition
Wei-Wen Hung and Hsiao-Chuan Wang
Department of Electrical Engineering, National Tsing Hua University
Hsinchu, 30043, Taiwan, Republic of China
Abstract. When a speech recognition system is deployed in the real world, environmental interference makes noisy speech signals mismatched with the reference models and causes serious degradation in recognition accuracy. To deal with the effect of environmental mismatch, a family of signal limiters has been successfully applied to a template-based DTW recognizer to reduce the variability of speech features in noisy conditions.
Though simulation results indicate that heavy smoothing can effectively reduce the variability of speech features at low signal-to-noise ratio (SNR), it also causes a loss of information in the speech features.
Therefore, we suggest that the smoothing factor of a signal limiter should be related to SNR and adapted on a
frame by frame basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous
and dynamic spectral features of the reference models and the test speech. In addition, the smoothed covariance matrices of the reference models are obtained by means of maximum likelihood (ML)
estimation. A speech recognition task for multispeaker isolated Mandarin digits has been conducted to
evaluate the effectiveness and robustness of the proposed method. Experimental results indicate that the
adaptive signal limiter can achieve significant improvement in noisy conditions and is more robust than the
hard limiter over a wider range of SNR values.
Key words. hidden Markov model (HMM), hard limiter, adaptive signal limiter (ASL), autocorrelation
function, arcsin transformation.
This research has been partially sponsored by the National Science Council, Taiwan, ROC, under contract
number NSC-88-2614-E-007-002.
LIST OF FIGURES AND TABLES
Fig. 1 Block diagram for implementing a speech recognizer with adaptive signal limiter.
Fig. 2 The various LPC log magnitude spectra of utterance '1' in clean condition.
(a) LPC log magnitude spectra without signal limiter.
(b) LPC log magnitude spectra with hard limiter.
(c) LPC log magnitude spectra with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 3 The various LPC log magnitude spectra of utterance '1' distorted by 20 dB white noise.
(a) LPC log magnitude spectra without signal limiter.
(b) LPC log magnitude spectra with hard limiter.
(c) LPC log magnitude spectra with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 4 The various LPC log magnitude spectra of utterance '1' distorted by 20 dB factory noise.
(a) LPC log magnitude spectra without signal limiter.
(b) LPC log magnitude spectra with hard limiter.
(c) LPC log magnitude spectra with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Fig. 5 The average log likelihoods of utterance '1' evaluated on various word models in white noise.
(a) Comparison of average log likelihoods without signal limiter.
(b) Comparison of average log likelihoods with hard limiter.
(c) Comparison of average log likelihoods with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Fig. 6 The average log likelihoods of utterance '1' evaluated on various word models in factory noise.
(a) Comparison of average log likelihoods without signal limiter.
(b) Comparison of average log likelihoods with hard limiter.
(c) Comparison of average log likelihoods with adaptive signal limiter.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Table 1. Comparison of digit recognition rates (%) for white noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)
Table 2. Comparison of digit recognition rates (%) for factory noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)
Table 3. Comparison of digit recognition rates (%) for F16 noise.
(δ_min = 0.0, δ_max = 1.0, SNR_LB = 15 dB, SNR_UB = 35 dB.)
Table 4. Comparison of computation costs based on a Pentium II-266 MHz personal computer.
1. Introduction
When a speech recognition system trained in a well-defined environment is used in real-world applications, the acoustic mismatch between the training and testing environments severely degrades its recognition accuracy. This acoustic mismatch is mainly caused by a wide variety of distortion sources, such as ambient additive noise, channel effects and the speaker's Lombard effect. During the past several decades, researchers have focused their attention on this mismatch problem and tried to narrow the mismatch gap, and many algorithms have been proposed and successfully applied to robust speech recognition. Generally speaking, the methods for handling noisy speech recognition can be roughly classified into the following approaches (Sankar and Lee, 1996). The first approach tries to
minimize the distance measures between reference models and testing signals by adaptively adjusting
speech signals in feature space. For example, Mansour and Juang (Mansour and Juang, 1989) found that
the norm of a cepstral vector is shrunk under noise contamination. Therefore, they used a first-order
equalization method to adapt the cepstral means of reference models so that the shrinkage of speech
features can be adequately compensated. Likewise, Carlson and Clement (Carlson and Clement, 1994)
also proposed a weighted projection measure (WPM) for recognition of noisy speech in the framework
of continuous density hidden Markov model (CDHMM). In addition, the norm shrinkage of cepstral
means will also lead to the reduction of HMM covariance matrices. Thus, Chien et al., (Chien, 1997a;
Chien et al., 1997b) proposed a variance adapted and mean compensated likelihood measure
(VA-MCLM) to adapt the mean vector and covariance matrix simultaneously.
The second approach estimates a transformation function in model space for transforming reference
models into testing environment and thus the environmental mismatch gap can be effectively reduced. In
the literature, there were a number of techniques compensating ambient noise effect in model space.
Among them, one of the most promising techniques is the so-called parallel model combination (PMC).
In the PMC algorithm, Varga and Moore (Varga and Moore, 1992a) adapted the statistics of the reference models to the testing conditions by optimally combining the reference models and a noise model in the linear spectral domain. In later years, several related works were reported that improve the performance of the PMC method. Flores and Young (Flores and Young, 1992) integrated the spectral subtraction (SS) and PMC methods to seek further improvement in recognition accuracy. In
addition, Gales and Young (Gales and Young, 1995) extended PMC scheme to include the effect of
convolutional noise.
In the third approach, a more robust feature representation is developed in signal space so that the
speech feature is invariant or less susceptible to environmental variations. In this approach, Lee and Lin
(Lee and Lin, 1993) developed a family of signal limiters as a preprocessor to smooth speech signals.
When a speech signal is passed through a signal limiter with zero smoothing factor (i.e., a hard limiter),
the hard limiting operation preserves the sign of an input speech signal and ignores its magnitude. Thus,
the hard-limited speech signal is only affected by ambient noises when the signal-to-noise ratio (SNR) is
relatively low. This smoothing process for feature vectors has been shown to be effective for reducing the
variability of feature vectors in a noisy environment and make them less affected by ambient noises over a
wide range of SNR values. Experimental results for recognition of 39-word alpha-digit vocabulary also
demonstrate that an equivalent gain of 5-7 dB in SNR can be achieved for a template-based DTW
recognizer.
However, from the experimental results reported by Lee and Lin (Lee and Lin, 1993), we can also observe that the recognition accuracy with a hard limiter becomes worse for clean speech. This phenomenon may be explained as follows. In an utterance, the amplitudes of unvoiced segments are generally much lower than those of voiced segments. Heavy smoothing can reduce the feature variability of the speech segments with low SNR, but it also causes the loss of some important information embedded in the clean segments and the segments with high SNR. Therefore, a signal limiter with a fixed smoothing factor might not work well for all segments of a speech utterance. We suggest
that the smoothing factor of a signal limiter should be related to SNR value and adapted on a frame by
frame basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous and
dynamic spectral features of the hidden Markov models (HMM) and the testing speech signals. In addition, in order to properly reflect the variation of the model covariance caused by applying the signal limiting operation to the state statistics of the word models, the covariance matrix is also adapted in the maximum likelihood (ML) sense.
The layout of this paper is as follows. In the subsequent section, we describe the detailed formulation
of the proposed adaptive signal limiter and its extension to the framework of a continuous density hidden
Markov model. In Section 3, we investigate the behavior of LPC spectra of a speech utterance and its
signal-limited version under the influence of various ambient noises. In addition, a series of experiments
were conducted to compare the discriminability of different signal limiters in various noisy conditions.
In Section 4, experiments on recognition of multispeaker isolated Mandarin digits were performed to evaluate the effectiveness and robustness of the proposed method in the presence of ambient noises.
Finally, a conclusion is drawn in Section 5.
2. Smoothing hidden Markov models by using an adaptive signal limiter
In this section, we describe the detailed formulation of the proposed adaptive signal limiter (ASL) and
its extension to the framework of an HMM-based speech recognizer.
2.1 Representation of the underlying hidden Markov models
Conventionally, for a continuous density hidden Markov model (CDHMM), the output likelihood of the $t$-th frame of the testing utterance $Y = \{\, y_t = [c_t, d_t],\ 1 \le t \le T_y \,\}$, evaluated on the statistics of the $i$-th state of word model $\Lambda(w) = \{\, \Lambda_{w,i} = (\mu_{w,i}, \Sigma_{w,i}),\ 1 \le i \le S_w \,\}$, can be characterized by a multivariate Gaussian probability density function (pdf) and formulated as

$$p(y_t \mid \Lambda_{w,i}) = (2\pi)^{-p}\, |\Sigma_{w,i}|^{-1/2} \exp\!\left[ -\tfrac{1}{2}\, (y_t - \mu_{w,i})^T\, \Sigma_{w,i}^{-1}\, (y_t - \mu_{w,i}) \right], \quad (1)$$

where $\mu_{w,i} = [c_{w,i}, d_{w,i}]$ denotes the mean vector of the $i$-th state of word model $\Lambda(w)$ and consists of the $p$-order cepstral vector $c_{w,i}$ and the $p$-order delta cepstral vector $d_{w,i}$. $\Sigma_{w,i}$ denotes the covariance matrix of the $i$-th state of word model $\Lambda(w)$ and is simplified to a diagonal matrix, i.e., $\Sigma_{w,i} = \mathrm{diag}[\,\sigma^2_{w,i}(1),\, \sigma^2_{w,i}(2),\, \ldots,\, \sigma^2_{w,i}(2p)\,]$. However, in order to adequately reflect the variation of the dynamic spectral features caused by applying a signal limiting operation to the instantaneous spectral features, the representation of the state statistics in a conventional hidden Markov model is modified slightly. In our approach, the mean vector $\mu_{w,i} = [c_{w,i}, d_{w,i}]$ of the $i$-th state of word model $\Lambda(w)$ is indirectly represented by the normalized autocorrelation vectors of a five-frame context window (Lee and Wang, 1995), that is, $[\, r_{w,i,-2},\, r_{w,i,-1},\, r_{w,i,0},\, r_{w,i,1},\, r_{w,i,2} \,]$, where $r_{w,i,j} = [\, r_{w,i,j}(1), \ldots, r_{w,i,j}(p) \,]^T$; $j = 0$ denotes the instantaneous frame, $j = -1, -2$ the left context frames, and $j = 1, 2$ the right context frames. The estimation of these normalized autocorrelation vectors proceeds as follows. First, a conventional hidden Markov model is trained for each word by means of the segmental k-means algorithm. Then, based upon the obtained word models, each frame in the training utterances is labeled with its decoded state identity by using the Viterbi decoding algorithm. The instantaneous, left-context and right-context autocorrelation vectors corresponding to the same state identity are collected and averaged to obtain the indirect representation of the underlying hidden Markov models. For example, the normalized autocorrelation vectors of the $i$-th state of word model $\Lambda(w)$ can be formulated as

$$[\, r_{w,i,-2},\, r_{w,i,-1},\, r_{w,i,0},\, r_{w,i,1},\, r_{w,i,2} \,] = \frac{1}{N_s} \sum_{u,t} [\, r^u_{w,t-2},\, r^u_{w,t-1},\, r^u_{w,t},\, r^u_{w,t+1},\, r^u_{w,t+2} \,], \quad (2)$$

where $r^u_{w,t}$ represents the normalized autocorrelation vector of the $t$-th frame of the $u$-th training utterance of word $w$. The above summation runs over all $N_s$ frames that are labeled with state identity $i$ of word model $\Lambda(w)$.
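The state-level averaging of Eq. (2) can be sketched in Python as follows. This is a hypothetical helper written for illustration only (the function name and the clamping of context indices at the utterance boundaries are our own choices, not specified in the original system):

```python
from collections import defaultdict

def average_state_autocorrelations(frames, state_labels, context=2):
    """Average the context-window autocorrelation vectors of all frames
    decoded into the same state, as in Eq. (2).

    frames: list of normalized autocorrelation vectors, one per frame.
    state_labels: Viterbi-decoded state identity for each frame.
    Returns {state: [r_-2, r_-1, r_0, r_1, r_2]}, each row averaged
    over the N_s frames labeled with that state.
    """
    p = len(frames[0])
    sums = defaultdict(lambda: [[0.0] * p for _ in range(2 * context + 1)])
    counts = defaultdict(int)
    for t, state in enumerate(state_labels):
        for j in range(-context, context + 1):
            # Clamp the context index at the utterance boundaries (an
            # assumed boundary policy; the paper does not specify one).
            tj = min(max(t + j, 0), len(frames) - 1)
            row = sums[state][j + context]
            for k in range(p):
                row[k] += frames[tj][k]
        counts[state] += 1
    return {s: [[v / counts[s] for v in row] for row in rows]
            for s, rows in sums.items()}
```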
Based upon this indirect representation, the analysis equations of the linear predictive coding (LPC) model can be expressed in matrix form as

$$R_{w,i,j} \cdot a_{w,i,j} = r_{w,i,j}, \quad \text{for } j = -2, \ldots, 2, \quad (3)$$

where $R_{w,i,j}$ is an autocorrelation matrix of the form

$$R_{w,i,j} = \begin{bmatrix} r_{w,i,j}(0) & r_{w,i,j}(1) & \cdots & r_{w,i,j}(p-1) \\ r_{w,i,j}(1) & r_{w,i,j}(0) & \cdots & r_{w,i,j}(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r_{w,i,j}(p-1) & r_{w,i,j}(p-2) & \cdots & r_{w,i,j}(0) \end{bmatrix}. \quad (4)$$

Since the autocorrelation matrix is Toeplitz, symmetric and positive definite, the LPC coefficient vector $a_{w,i,j} = [\, a_{w,i,j}(1),\, a_{w,i,j}(2),\, \ldots,\, a_{w,i,j}(p) \,]^T$ can be solved efficiently by the Levinson-Durbin recursion (Rabiner and Juang, 1993). Once the LPC coefficient vector of Eq. (3) is obtained, the corresponding cepstral vector $c_{w,i,j}$ can be calculated recursively by the LPC-to-cepstrum conversion formula

$$c_{w,i,j}(m) = a_{w,i,j}(m) + \sum_{k=1}^{m-1} \frac{k}{m}\, c_{w,i,j}(k)\, a_{w,i,j}(m-k), \quad 1 \le m \le p. \quad (5)$$

Finally, the cepstral vector of the instantaneous frame, i.e., $c_{w,i,j}$ for $j = 0$, is used as the mean vector $c_{w,i}$ of the $i$-th state of word model $\Lambda(w)$. In addition, the corresponding delta cepstral vector $d_{w,i}$ can be calculated by the following equation:

$$d_{w,i} = \frac{\displaystyle\sum_{j=-2}^{2} j \cdot c_{w,i,j}}{\displaystyle\sum_{j=-2}^{2} j^2}. \quad (6)$$
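The Levinson-Durbin solution of Eq. (3) and the cepstrum recursion of Eq. (5) can be sketched in Python as follows. This is a minimal illustration under the assumption that the normalized autocorrelations $r(0), \ldots, r(p)$ are given; the function names are our own:

```python
def levinson_durbin(r):
    """Solve the Toeplitz normal equations of Eq. (3) for the LPC
    coefficients a(1..p), given autocorrelations r = [r(0), ..., r(p)]."""
    p = len(r) - 1
    a = [0.0] * (p + 1)   # a[0] unused; a[m] holds a(m)
    e = r[0]              # prediction error energy
    for m in range(1, p + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / e     # reflection coefficient
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        e *= (1.0 - k_m * k_m)
    return a[1:]

def lpc_to_cepstrum(a):
    """LPC-to-cepstrum recursion of Eq. (5):
    c(m) = a(m) + sum_{k=1}^{m-1} (k/m) c(k) a(m-k)."""
    p = len(a)
    c = [0.0] * (p + 1)   # c[0] unused; c[m] holds c(m)
    for m in range(1, p + 1):
        c[m] = a[m - 1] + sum((k / m) * c[k] * a[m - 1 - k]
                              for k in range(1, m))
    return c[1:]
```

For example, for $r = [1.0, 0.5, 0.25]$ the recursion yields $a = [0.5, 0.0]$ and $c = [0.5, 0.125]$.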
2.2 Formulation of the adaptive signal limiter
For recognition of noisy speech, it has been observed that employing a signal limiter to smooth a speech signal in the time domain leads to significant performance improvement. The basic theory of a signal limiter can be roughly described as follows (Lee and Lin, 1993). When a signal $x$ is passed through a signal limiter, the signal limiting operation is equivalent to performing a nonlinear transformation on the input signal, so that the corresponding output signal $y$ can be characterized by an error function of the form

$$y = s(x) = \frac{2K}{\sqrt{2\pi}\,\sigma} \int_0^x \exp\!\left( -\frac{t^2}{2\sigma^2} \right) dt, \quad (7)$$

where $K$ is a scaling constant and $\sigma^2$ is a tunable factor for adjusting the smoothing degree of the signal limiting operation.

In light of the above smoothing property, a signal limiter can be readily extended to the processing of speech signals in a noisy environment. Consider an input speech signal $x$, approximated by a zero-mean, stationary Gaussian process with variance $\sigma_x^2$ and density function

$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left( -\frac{x^2}{2\sigma_x^2} \right). \quad (8)$$

Then the output $y$ of the signal limiter has the density function (see Appendix A)

$$h(y) = h(s(x)) = \frac{\sqrt{\delta}}{2K} \exp\!\left( -\frac{(\delta - 1)\, x^2}{2\,\delta\,\sigma_x^2} \right), \quad (9)$$

where $x = s^{-1}(y)$ and $\delta$ denotes the smoothing factor of the signal limiter, defined as $\delta = \sigma^2 / \sigma_x^2$. The larger the value of $\delta$, the smaller the value of the output signal $y$. When the smoothing factor $\delta$ approaches 0, the signal limiter reduces to a hard limiter of the form

$$y = f(x) = \begin{cases} \;\;K/2 & \text{if } x > 0 \\ \;\;0 & \text{if } x = 0 \\ -K/2 & \text{if } x < 0 \end{cases}. \quad (10)$$
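Since the integrand in Eq. (7) is Gaussian, the limiter has a closed form in terms of the error function. A small Python sketch follows; the normalization is an assumption on our part (chosen so that the $\sigma \to 0$ limit reproduces the $\pm K/2$ hard limiter of Eq. (10)), and in any case the absolute scale cancels once autocorrelations are normalized:

```python
import math

def signal_limiter(x, sigma, K=1.0):
    """Soft signal limiter in the spirit of Eq. (7), evaluated via the
    error function. Normalized (an assumed convention) so that the
    sigma -> 0 limit matches the +/- K/2 hard limiter of Eq. (10)."""
    return (K / 2.0) * math.erf(x / (math.sqrt(2.0) * sigma))

def hard_limiter(x, K=1.0):
    """Hard limiter of Eq. (10): only the sign of the input survives."""
    if x > 0:
        return K / 2.0
    if x < 0:
        return -K / 2.0
    return 0.0
```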
A signal limiting operation can also be interpreted as an arcsin transformation in the autocorrelation domain. Assume that the autocorrelation functions of the input speech signal $x$ and its signal-limited output $y$ are denoted by $r_x(\tau)$ and $r_y(\tau)$, respectively. Then the normalized autocorrelation function of the signal-limited output $y$ can be formulated as (Lee and Lin, 1993)

$$\bar{r}_y(\tau) \equiv \frac{r_y(\tau)}{r_y(0)} = \frac{\sin^{-1}\!\big[\, \bar{r}_x(\tau) / (1+\delta) \,\big]}{\sin^{-1}\!\big[\, 1 / (1+\delta) \,\big]}, \quad (11)$$

where $\bar{r}_x(\tau) \equiv r_x(\tau)/r_x(0)$ is the normalized autocorrelation function of the input speech signal $x$. By properly adjusting the smoothing factor $\delta$, various degrees of smoothing can be obtained. When $\delta$ approaches infinity, the normalized autocorrelation function of the input speech signal, $\bar{r}_x(\tau)$, is almost equal to that of the corresponding signal-limited output, $\bar{r}_y(\tau)$. Furthermore, in the case of $\delta = 0$, the normalized autocorrelation function of the signal-limited output reduces to (see Appendix B)

$$\bar{r}_y(\tau) = \frac{2}{\pi} \sin^{-1}\!\big[\, \bar{r}_x(\tau) \,\big]. \quad (12)$$
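Eq. (11) applies directly to a vector of normalized autocorrelation coefficients. A minimal Python sketch (the function name is our own):

```python
import math

def arcsin_smooth(r_norm, delta):
    """Arcsin transformation of Eq. (11): map the normalized
    autocorrelation coefficients of the input to those of the
    signal-limited output, for smoothing factor delta."""
    denom = math.asin(1.0 / (1.0 + delta))
    return [math.asin(r / (1.0 + delta)) / denom for r in r_norm]
```

For delta = 0 this reduces to the (2/pi)·arcsin form of Eq. (12), while for very large delta the coefficients pass through essentially unchanged.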
Lee and Lin (Lee and Lin, 1993) used a hard limiter as a pre-processor to reduce the variability of feature vectors in noisy conditions; that is, a single pre-determined smoothing factor is used throughout a speech signal. However, it is known that the segments of clean speech with less energy are influenced most by ambient noises and thus require heavy smoothing. As for the clean segments and the segments with high SNR, excessive smoothing not only destroys their distinctive features but also reduces the discriminability of speech features in a noisy environment. Therefore, we propose an adaptive signal limiter (ASL) in which the smoothing factor $\delta$ is related to the SNR and adapted on a frame-by-frame basis. In the proposed adaptive signal limiter, the smoothing factor $\delta$ is empirically formulated as

$$\delta(SNR) = \begin{cases} \delta_{\min} & \text{if } SNR < SNR_{LB} \\[4pt] \dfrac{\delta_{\max} - \delta_{\min}}{SNR_{UB} - SNR_{LB}} \cdot (SNR - SNR_{LB}) + \delta_{\min} & \text{if } SNR_{LB} \le SNR \le SNR_{UB} \\[4pt] \delta_{\max} & \text{if } SNR > SNR_{UB} \end{cases} \quad (13)$$

and

$$SNR \equiv 10 \log_{10}\!\left( \frac{E_s}{E_n} \right), \quad (14)$$

where $\delta_{\min}$, $\delta_{\max}$, $SNR_{LB}$ and $SNR_{UB}$ are tuning constants, $E_s$ is the frame energy of the clean speech signal, and $E_n$ is the noise energy. In the subsequent experiments, the arcsin transformation of Eqs. (11)-(14) is used to compute the normalized autocorrelation of a signal-limited signal, rather than directly applying the nonlinear operation of Eq. (7) to the input signal. This is because the underlying hidden Markov models are indirectly represented by LPC-based spectral features, and the LPC spectral features can be efficiently calculated from the autocorrelation function by means of Eq. (5). Moreover, compared with the signal limiting operation of Eq. (7), the arcsin transformation requires less computation.
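Eq. (13) is a simple piecewise-linear map from frame SNR to smoothing factor. In Python (a sketch; the defaults are the white-noise settings quoted with Table 1, $\delta_{\min} = 0.0$, $\delta_{\max} = 1.0$, $SNR_{LB} = 20$ dB, $SNR_{UB} = 30$ dB):

```python
def smoothing_factor(snr_db, delta_min=0.0, delta_max=1.0,
                     snr_lb=20.0, snr_ub=30.0):
    """Frame-dependent smoothing factor of Eq. (13): delta_min below
    SNR_LB, delta_max above SNR_UB, linear interpolation in between."""
    if snr_db < snr_lb:
        return delta_min
    if snr_db > snr_ub:
        return delta_max
    return ((delta_max - delta_min) / (snr_ub - snr_lb)
            * (snr_db - snr_lb) + delta_min)
```

Low-SNR frames thus get a small delta (heavy smoothing, approaching the hard limiter), while clean frames get delta near 1.0 and are left almost untouched.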
2.3 Adaptations of dynamic spectral feature and covariance matrix
When a signal limiting operation is performed on the autocorrelation function of a speech signal, it not only smooths the instantaneous spectral vectors but also reduces the corresponding dynamic spectral features and model covariance matrices. Therefore, in order to achieve higher consistency, the model's dynamic spectral features and covariance matrices must be adapted as well. The adaptation procedure proceeds as follows. When the $t$-th frame $y_t$ of a testing utterance $Y$ is evaluated on the state $\Lambda_{w,i}$, the cepstral vectors $c_{t,j}$ of its context frames $y_{t,j}$, for $-2 \le j \le 2$, are first transformed to give the corresponding normalized autocorrelation vectors $r_{t,j}$. Then these normalized autocorrelation vectors $r_{t,j} = [\, r_{t,j}(1), \ldots, r_{t,j}(p) \,]^T$ are processed by the arcsin transformation

$$\tilde{r}_{t,j}(\tau) = \frac{\sin^{-1}\!\big[\, r_{t,j}(\tau) / (1 + \delta(SNR_{t,j})) \,\big]}{\sin^{-1}\!\big[\, 1 / (1 + \delta(SNR_{t,j})) \,\big]}, \quad \text{for } -2 \le j \le 2 \text{ and } 1 \le \tau \le p. \quad (15)$$
In the above equation, the variable $SNR_{t,j}$ is determined by

$$SNR_{t,j} = 10 \log_{10}\!\left( \frac{E_{t+j} - E_n}{E_n} \right), \quad (16)$$

where $E_{t+j}$ is the energy of the $(t+j)$-th frame of the testing utterance $Y$, and $E_n$ is the noise energy, which can be roughly estimated by selecting the lowest frame energy in the testing utterance $Y$, i.e., $E_n = \min\{\, E_1, E_2, \ldots, E_{T_y} \,\}$. Once the smoothed autocorrelation vectors $\tilde{r}_{t,j}$, for $-2 \le j \le 2$, are obtained, the smoothed testing cepstral vector $\tilde{c}_{t,j}$ of $\tilde{y}_{t,j}$ can be calculated by means of the LPC-to-cepstrum conversion formula. Moreover, the corresponding smoothed testing delta cepstral vector $\tilde{d}_t$ can be solved by using the following equation:

$$\tilde{d}_t = \frac{\displaystyle\sum_{j=-2}^{2} j \cdot \tilde{c}_{t,j}}{\displaystyle\sum_{j=-2}^{2} j^2}, \quad (17)$$
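The noise-energy estimate feeding Eq. (16) and the regression delta of Eq. (17) can be sketched as follows. These are hypothetical helpers; in particular, the small floor inside the logarithm is our own guard for the frame that attains the minimum energy, which Eq. (16) leaves undefined:

```python
import math

def estimate_frame_snrs(frame_energies):
    """Per-frame SNR of Eq. (16), with the noise energy E_n taken as the
    minimum frame energy in the utterance. Returns one SNR (dB) per frame."""
    e_n = min(frame_energies)
    snrs = []
    for e in frame_energies:
        # Floor the numerator so the minimum-energy frame maps to a very
        # low (but finite) SNR instead of log(0).
        ratio = max(e - e_n, 1e-12) / e_n
        snrs.append(10.0 * math.log10(ratio))
    return snrs

def delta_cepstrum(context_ceps):
    """Regression delta of Eq. (17) over a five-frame window
    [c_{t-2}, ..., c_{t+2}]: sum_j j*c_{t+j} / sum_j j^2."""
    norm = sum(j * j for j in range(-2, 3))   # = 10
    p = len(context_ceps[0])
    return [sum(j * context_ceps[j + 2][k] for j in range(-2, 3)) / norm
            for k in range(p)]
```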
and thus the smoothed testing feature vector can be taken as $\tilde{y}_t = [\tilde{c}_t, \tilde{d}_t] = [\tilde{c}_{t,0}, \tilde{d}_t]$.

Similarly, in order to avoid introducing a mismatch between the testing speech signals and the reference models, the mean vector of state $\Lambda_{w,i}$ should also be smoothed by using Eq. (11) with the same smoothing factor, giving its smoothed version $\tilde{\mu}_{w,i} = [\tilde{c}_{w,i}, \tilde{d}_{w,i}]$. On the other hand, by substituting $\tilde{\mu}_{w,i} = [\tilde{c}_{w,i}, \tilde{d}_{w,i}]$ and $\tilde{y}_t = [\tilde{c}_t, \tilde{d}_t]$ into Eq. (1), we obtain

$$\tilde{p}(\tilde{y}_t \mid (\tilde{\mu}_{w,i}, \Sigma_{w,i})) = (2\pi)^{-p}\, |\Sigma_{w,i}|^{-1/2} \exp\!\left[ -\tfrac{1}{2}\, (\tilde{y}_t - \tilde{\mu}_{w,i})^T\, \Sigma_{w,i}^{-1}\, (\tilde{y}_t - \tilde{\mu}_{w,i}) \right]. \quad (18)$$

By differentiating the logarithm of Eq. (18) with respect to $\Sigma_{w,i}$ and setting the result to zero, we can obtain the optimal smoothed covariance matrix $\tilde{\Sigma}_{w,i}$ which maximizes the likelihood function of Eq. (18), that is (see Appendix C),

$$\tilde{\Sigma}_{w,i} = \left[ \sum_{m=1}^{p} \frac{(\tilde{c}_t(m) - \tilde{c}_{w,i}(m))^2}{\sigma^2_{w,i}(m)} + \sum_{m=1}^{p} \frac{(\tilde{d}_t(m) - \tilde{d}_{w,i}(m))^2}{\sigma^2_{w,i}(p+m)} \right] \cdot \frac{\Sigma_{w,i}}{2p}. \quad (19)$$

Finally, the resulting smoothed output likelihood measure can be rewritten as

$$\tilde{p}(\tilde{y}_t \mid \tilde{\Lambda}_{w,i}) = (2\pi)^{-p}\, |\tilde{\Sigma}_{w,i}|^{-1/2} \exp\!\left[ -\tfrac{1}{2}\, (\tilde{y}_t - \tilde{\mu}_{w,i})^T\, \tilde{\Sigma}_{w,i}^{-1}\, (\tilde{y}_t - \tilde{\mu}_{w,i}) \right]. \quad (20)$$
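The ML update of Eq. (19) amounts to rescaling the original diagonal covariance by the average normalized squared deviation of the smoothed observation from the smoothed mean. A minimal Python sketch of this scaling, operating on the diagonal entries only (a hypothetical helper with our own naming):

```python
def smoothed_covariance_scale(y_smooth, mu_smooth, variances):
    """ML rescaling of the diagonal covariance in the spirit of Eq. (19).

    y_smooth:  smoothed feature vector (cepstral + delta entries).
    mu_smooth: smoothed state mean, same layout.
    variances: the 2p diagonal entries sigma^2_{w,i}(m) of the model.
    The factor maximizing Eq. (18) is the Mahalanobis-type quadratic
    form divided by the dimension; the diagonal is scaled by it.
    """
    d = len(variances)
    q = sum((y_smooth[m] - mu_smooth[m]) ** 2 / variances[m]
            for m in range(d))
    scale = q / d
    return [scale * v for v in variances]
```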
2.4 Implementation of a speech recognizer with adaptive signal limiter
To be more specific, the overall system diagram for implementing an HMM-based speech recognizer with the adaptive signal limiter is depicted in Fig. 1. In the training phase, we first train a set of word models by using the segmental k-means algorithm and the Viterbi decoding method (Juang and Rabiner, 1990), and the state statistics of each word model are indirectly represented by the normalized autocorrelation vectors of a five-frame context window. When a testing utterance $Y$ is to be recognized, we first use Eq. (15) and Eq. (16) to estimate the frame-dependent smoothing factor and perform the arcsin transformation on the normalized autocorrelation vectors $r_{t,j}$. Once the arcsin-transformed vectors $\tilde{r}_{t,j}$ are obtained, the smoothed cepstral vector $\tilde{c}_{t,j}$ and its delta cepstral vector are computed via the LPC-to-cepstrum conversion formula and Eq. (17). The same smoothing factor is also used to smooth the state statistics of the word models. Once the smoothed autocorrelation vectors $\tilde{r}_{w,i,j}$ are obtained, the smoothed cepstral vectors $\tilde{c}_{w,i,j}$ can likewise be calculated by the LPC-to-cepstrum conversion formula, and the corresponding smoothed delta cepstral vector $\tilde{d}_{w,i}$ and covariance matrix $\tilde{\Sigma}_{w,i}$ can be solved by Eq. (6) and Eq. (19). Finally, by substituting $\tilde{y}_t$, $\tilde{\mu}_{w,i}$ and $\tilde{\Sigma}_{w,i}$ into Eq. (20), we obtain the smoothed output likelihoods.
(Figure 1 is about here.)
3. Effectiveness and robustness of the adaptive signal limiter
3.1 Database and experimental conditions
A multispeaker (50 male and 50 female speakers) isolated Mandarin digit recognition task (Lee and Wang, 1994) was conducted to demonstrate the effectiveness and robustness of the proposed adaptive signal limiter. The digit database contains three sessions of data collection; in each session, every speaker uttered a set of 10 Mandarin digits. Speech signals are sampled at 8 kHz. Each frame contains 256 samples, overlapped by 128 samples, and is multiplied by a 256-point Hamming window. Endpoints are not detected, so each utterance still contains about 0.1~0.5 seconds of pre-silence and post-silence. Each digit is modeled as a left-to-right HMM without jumps, in which the output of each state is a 2-mixture Gaussian distribution of feature vectors. Each word model contains seven to nine states, including pre-silence and post-silence states. The feature vector is indirectly represented by the 12-order normalized autocorrelation vectors of a five-frame context window; this representation can then be transformed into a 12-order cepstral vector and a 12-order delta cepstral vector. Moreover, the NOISEX-92 noise database (Varga et al., 1992b) was used for generating noisy speech. The subsequent experiments were conducted to examine the following problems: (1) the influence of signal limiters on the LPC spectra of clean speech, (2) the influence of signal limiters on the LPC spectra of noisy speech, and (3) the effects of signal limiters on speech discriminability in a noisy environment.
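The frame analysis described above (256-sample frames, 128-sample shift, 256-point Hamming window) can be sketched in Python. The window formula below is the standard Hamming definition, which is an assumption on our part since the paper does not spell it out:

```python
import math

def frame_signal(samples, frame_len=256, shift=128):
    """Split a signal into overlapping frames and apply a Hamming window,
    matching the Section 3.1 setup (256-sample frames, 128-sample shift)."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + n] * window[n]
                       for n in range(frame_len)])
    return frames
```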
3.2 Influence of signal limiters on LPC spectra of clean speech
A sample utterance of Mandarin digit '1' uttered by a male speaker is used to demonstrate the influence of signal limiters on the LPC spectra of clean speech. A 12-order LPC spectrum analysis is performed on a 32 msec window with a 16 msec frame shift. To observe the spectral variation in the frequency domain, we plotted the LPC spectra of 15 consecutive frames extracted from the middle portion of the sample utterance. Figure 2 shows the log LPC spectra of the sample utterance '1' in the cases of no signal limiter, the hard limiter and the adaptive signal limiter. From this figure, we can observe that the formants of utterance '1' occur at about 200 Hz, 1950 Hz, 3100 Hz and 3350 Hz. After applying a signal limiter, parts of the original spectra become smoother and their formant peaks are broadened. In particular, in the case of the hard limiter, the second, third and fourth formants are severely suppressed. Since the location and spacing of formant frequencies are highly correlated with the shape of the vocal tract, this suppression reduces the discriminability of speech utterances and leads to misrecognition. On the other hand, the spectral shape in the case of the adaptive signal limiter is almost unaffected. This is mainly because the adaptive signal limiter employs a larger smoothing factor in the clean condition, which keeps the arcsin-transformed autocorrelation function almost unchanged.
(Figure 2 is about here.)
3.3 Influence of signal limiters on LPC spectra of noisy speech
In this subsection, we explore the influence of signal limiters on the LPC spectra of noisy speech. This is shown in Fig. 3 and Fig. 4, where we plot the LPC spectra of the same utterance as in Fig. 2, distorted by 20 dB additive white Gaussian noise and factory noise, respectively. When white noise is added to clean speech, an abnormal formant peak gradually appears in the LPC spectra of the distorted utterance '1' at about 1125 Hz ~ 1625 Hz, as shown in Fig. 3(a). The same phenomenon occurs when factory noise is added to clean speech; in that case, the abnormal formant peak appears at about 1000 Hz ~ 1375 Hz. However, compared with the baseline case, the spectral distortion in the LPC spectra with a signal limiter is less pronounced. This property verifies the robustness of signal limiters in a noisy environment. In addition, a comparison of Fig. 3, Fig. 4 and Fig. 2 shows that excessive smoothing of the autocorrelation function suppresses parts of the formant peaks and loses some important information about the shape of the vocal tract. Instead of using a fixed smoothing factor, an adaptive signal limiter that adaptively adjusts the degree of smoothing can not only effectively reduce the variability of speech features, but also preserve more of the useful spectral information embedded in a speech signal.
(Figure 3 and figure 4 are about here.)
3.4 Effects of signal limiters on speech discriminability in a noisy environment
In this subsection, we evaluate the robustness of signal limiters in noisy conditions. First, the first two sessions of the database were used to train a set of word models by using the segmental k-means algorithm. To generate noisy speech, white Gaussian noise and factory noise were separately added to the 100 utterances of Mandarin digit '1' in the third session. These distorted utterances were then evaluated on the 10 word models to obtain maximum log likelihoods. For each word model, the average log likelihood is found by averaging the accumulated log likelihoods corresponding to that word model. In Fig. 5 and Fig. 6, we plot the average log likelihoods of utterance '1' as a function of the SNR value in the cases of white Gaussian noise and factory noise, respectively. When the underlying environment becomes noisier, i.e., below an SNR threshold, utterance '1' is easily misrecognized as utterance '7'. For white noise, the SNR thresholds occur at about 20 dB, 15 dB and 7 dB for the cases of no signal limiter, the hard limiter and the adaptive signal limiter, respectively. Similarly, for factory noise, the SNR thresholds occur at about 15 dB, 10 dB and 3 dB, respectively. These experimental results reveal that, for recognition of utterance '1' in noisy conditions, the adaptive signal limiter achieves an equivalent gain of about 12~13 dB in SNR over the baseline and about 7~8 dB over the hard limiter.
(Figure 5 and figure 6 are about here.)
4. Experimental results and discussion
In this section, a multispeaker (50 male and 50 female speakers) recognition task of isolated Mandarin digits (Lee and Wang, 1994) was conducted to demonstrate the merits of the proposed method. The experimental setup and underlying database have been described in subsection 3.1. In the experiments, a conventional hidden Markov model without any signal limiter is referred to as the baseline system. The ambient noises, including white Gaussian noise, F16 noise and factory noise, were separately added to clean speech at predetermined SNRs of 20, 15, 10, 5 and 0 dB to generate various noisy speech signals. Moreover, the parameters used in the proposed adaptive signal limiter under different noisy conditions were determined empirically as follows. First, the smoothing factor $\delta$ is initially set to 0 and increased in increments of $\Delta\delta = 0.1$ while $SNR_{LB}$ and $SNR_{UB}$ are kept constant. It is observed that once the smoothing factor $\delta$ exceeds 1, the smoothing operation has little effect on the digit recognition rates; this also holds for different settings of $SNR_{LB}$ and $SNR_{UB}$. Therefore, the maximum smoothing factor is set to $\delta_{\max} = 1.0$ and employed throughout all experiments. Similarly, we chose the SNR lower bound from the interval 0~30 dB and the SNR upper bound from the interval 20~50 dB, in increments of 5 dB, to test which set of SNR parameters achieves better digit recognition accuracy.
In Table 1, we assess the recognition accuracy of the baseline, parallel model combination (PMC),
baseline with hard limiter and baseline with adaptive signal limiter for recognition of noisy speech
corrupted by white noise. From the experimental results, we find that the baseline with hard limiter
improves the recognition accuracy at low SNR but performs worse at high SNR and in the clean
condition. This is mainly because oversmoothing the autocorrelation function severely distorts important
spectral information embedded in the original speech signals. In contrast, the improvement achieved by
the proposed adaptive signal limiter is remarkable, owing to the adaptive adjustment of the smoothing
factor, and it further outperforms the hard limiter. This indicates that using larger smoothing factors for
the clean condition and high SNRs is as important as using smaller smoothing factors for low SNRs.
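One minimal realization of this SNR-dependent adjustment is a clamped linear interpolation between δ_min
and δ_max over the interval [SNR_LB, SNR_UB]; this is a sketch of the behavior described in the text, and the
exact mapping used in the paper may differ.

```python
def smoothing_factor(snr_db, snr_lb=20.0, snr_ub=30.0,
                     delta_min=0.0, delta_max=1.0):
    """Map a frame's SNR (dB) to a smoothing factor: small delta (heavy
    smoothing) at low SNR, large delta (light smoothing) at high SNR."""
    if snr_db <= snr_lb:
        return delta_min          # low-SNR frame: smooth heavily
    if snr_db >= snr_ub:
        return delta_max          # high-SNR frame: leave nearly untouched
    # linear interpolation inside the transition band [snr_lb, snr_ub]
    return delta_min + (delta_max - delta_min) * (snr_db - snr_lb) / (snr_ub - snr_lb)
```

Under this mapping, the frame-level smoothing factor varies monotonically with SNR, which matches the
observation that large δ should be used at high SNR and small δ at low SNR.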
(Table 1 is about here.)
Moreover, we also find that the PMC method is superior to the proposed adaptive signal limiter in
recognition accuracy. This superiority arises mainly because the PMC method decomposes the
concurrent processes of speech and background noise, so that the environmental mismatch can be
effectively reduced by optimally combining those two processes in the linear spectral domain. In
contrast, the environmental mismatch is not compensated during the signal limiting operation. The
proposed adaptive signal limiter can be regarded as a weighting function that neglects speech segments
with low SNR by heavily smoothing their features in the autocorrelation domain. This smoothing
operation not only reduces feature variability in noisy conditions but also inevitably degrades some
characteristics of the speech features. Therefore, it is intuitive that the PMC method attains better
recognition accuracy than the proposed method. However, these comparison results do not mean that
the proposed method is of no use for noisy speech recognition. For segments with low SNR (e.g., a
distorted unvoiced segment), the adaptive signal limiter appears to be more effective than the PMC
method in some noisy conditions. This implies that model adaptation is useful at high and medium SNRs,
while feature smoothing is more feasible at low SNR. As described by Lee and Lin (1993), a signal
limiter can be combined with other noise-robust speech recognition techniques to obtain additional
performance improvement. Therefore, it is expected that by properly integrating the adaptive signal
limiter with other noise-robust techniques, such as the WPM and PMC methods, further improvement in
recognition accuracy could be obtained.
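As an illustration of the kind of autocorrelation-domain smoothing discussed above, the sketch below
interpolates between the hard-limiter output autocorrelation, which by Van Vleck's arcsin law equals
(2/π)·arcsin(ρ) for a clipped Gaussian signal, and the unsmoothed autocorrelation. The interpolation itself is
purely illustrative and is not claimed to be the exact limiter family used in the paper.

```python
import math

def smoothed_autocorr(rho, delta):
    """Blend the hard-limiter autocorrelation (delta = 0, heaviest
    smoothing) with the original autocorrelation (delta = 1, no smoothing).

    rho   : sequence of normalized autocorrelation coefficients in [-1, 1]
    delta : smoothing factor in [0, 1]
    """
    # Van Vleck's arcsin law: hard-clipping a Gaussian signal maps its
    # normalized autocorrelation rho(k) to (2/pi) * arcsin(rho(k)).
    hard = [(2.0 / math.pi) * math.asin(max(-1.0, min(1.0, r))) for r in rho]
    return [(1.0 - delta) * h + delta * r for h, r in zip(hard, rho)]
```

Small δ pulls every coefficient toward the flattened hard-limiter shape, which is exactly the heavy smoothing the
text applies to low-SNR segments, while δ = 1 leaves the autocorrelation, and hence the LPC spectrum, unchanged.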
Likewise, comparisons of the different methods in the presence of factory noise and F16 noise are
illustrated in Table 2 and Table 3, respectively. We observe that the proposed method consistently
achieves remarkable improvement in recognition accuracy. This result verifies the effectiveness and
robustness of the adaptive signal limiter for speech recognition in white noise as well as in colored noises.
As far as computation time is concerned, the adaptive signal limiter requires less computation time than
the PMC method; the reduction in CPU time is about 25%. Detailed CPU times for the different methods
are listed in Table 4.
(Table 2, Table 3, and Table 4 are about here.)
5. Conclusion
In this paper, we explore the influence of a hard limiter on the LPC spectra of clean and noisy speech. It
is found that excessive smoothing in the autocorrelation domain of a speech signal suppresses some
formant peaks and reduces the discriminability of speech features in noisy conditions. To overcome this
weakness of the hard limiter, an adaptive signal limiter is proposed to improve its robustness. In our
approach, the smoothing degree of the signal limiter is related to the SNR and adaptively determined on
a frame-by-frame basis; that is, the smaller the SNR of a speech frame, the smaller the smoothing factor
of the signal limiter. Experimental results verify that the adaptive signal limiter outperforms the hard
limiter at various SNRs. This improvement arises mainly because the adaptive signal limiter not only
reduces feature variability at low SNR but also preserves important information borne in speech
segments with high SNR.
Acknowledgement
The authors would like to thank Dr. Lee-Min Lee of Mingchi Institute of Technology, Taipei, Taiwan,
for generously sharing his experience in implementing the new representation of the hidden Markov
model with a five-frame context window.
References
Carlson, B. A., Clements, M. A., 1994. A projection-based likelihood measure for speech recognition in
noise. IEEE Trans. on Speech and Audio Processing, Vol. 2, pp. 97-102.
Chien, J. T., 1997a. Speech recognition under telephone environments. Ph.D. Thesis. Department of
Electrical Engineering, National Tsing Hua University, Taiwan, R.O.C.
Chien, J. T., Lee, L. M., Wang, H. C., 1997b. Extended studies on projection-based likelihood measure
for noisy speech recognition. revised in IEEE Trans. on Speech and Audio Processing.
Flores, J. A. N., Young, S. J., 1992. Continuous speech recognition in noise using spectral subtraction
and HMM adaptation. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). San
Francisco. Vol. 1, pp. 409-412.
Gales, M. J. F., Young, S. J., 1995. Robust speech recognition in additive and convolutional noise using
parallel model combination. Computer Speech and Language, Vol. 4, pp. 352-359.
Juang, B. H., Rabiner, L. R., 1990. The segmental k-means algorithm for estimating parameters of
hidden Markov models. IEEE Trans. on Acoustics, Speech, Signal Proc., 38(9) : 1639-1641,
September.
Lee, C. H., Lin, C. H., 1993. On the use of a family of signal limiters for recognition of noisy speech.
Speech Communication, Vol. 12, pp. 383-392.
Lee, L. M., Wang, H. C., 1994. A study on adaptation of cepstral and delta cepstral coefficients for
noisy speech recognition. Proc. of Int. Conf. Spoken Language Processing (ICSLP). Yokohama,
Japan. pp. 1011-1014.
Lee, L. M., Wang, H. C., 1995. Representation of hidden Markov model for noise adaptive speech
recognition. Electronics Letters, Vol. 31, No. 8, pp. 616-617.
Mansour, D., Juang, B. H., 1989. A family of distortion measures based upon projection operation for
robust speech recognition. IEEE Trans. on Acoustics, Speech, Signal Processing, Vol. 37, pp.
1659-1671.
Rabiner, L., and Juang, B. H., 1993. Fundamentals of Speech Recognition, Englewood Cliffs, New
Jersey, Prentice-Hall, pp. 112-117.
Sankar, A., Lee, C. H., 1996. A maximum-likelihood approach to stochastic matching for robust speech
recognition. IEEE Trans. on Speech and Audio Processing, Vol. 4, pp. 190-202.
Varga, A. P., Moore, R. K., 1992a. Hidden Markov model decomposition of speech and noise. IEEE
Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), San Francisco. pp. 845-848.
Varga, A. P., Steeneken, H.J.M., Tomlinson, M., Jones, D., 1992b. The NOISEX-92 study on the
effect of additive noise on automatic speech recognition, Technical Report, DRA Speech Research
Unit, Malvern, England.
Table 1. Comparison of digit recognition rates (%) for white noise. (δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB.)

Methods \ SNRs      clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline            98.9    80.2    65.7    48.8    25.6    10.6
PMC                 98.7    92.2    84.6    72.7    59.3    47.1
hard limiter        90.6    76.8    68.5    55.8    35.9    21.4
adaptive limiter    95.2    85.1    76.4    68.1    58.5    49.7
Table 2. Comparison of digit recognition rates (%) for factory noise. (δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB.)

Methods \ SNRs      clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline            98.9    91.2    81.4    65.9    46.9    25.4
PMC                 98.7    95.0    91.8    82.3    73.2    52.5
hard limiter        90.6    86.3    80.2    71.3    57.5    30.0
adaptive limiter    94.9    91.9    87.8    77.7    69.2    53.3
Table 3. Comparison of digit recognition rates (%) for F16 noise. (δ_min = 0.0, δ_max = 1.0, SNR_LB = 15 dB, SNR_UB = 35 dB.)

Methods \ SNRs      clean   20 dB   15 dB   10 dB   5 dB    0 dB
baseline            98.9    91.1    78.9    65.2    43.9    21.0
PMC                 98.7    95.9    92.5    87.4    68.1    44.5
hard limiter        90.6    84.9    77.1    67.8    54.6    29.4
adaptive limiter    95.1    91.4    85.3    78.7    61.9    42.2
Table 4. Comparison of computation costs on a Pentium II 266 MHz personal computer.

Methods                       baseline   PMC     hard limiter   adaptive limiter
recognition CPU time (sec)    0.203      4.038   0.291          2.981
Fig. 1. Block diagram for implementing a speech recognizer with an adaptive signal limiter. [Figure not
reproduced. Blocks include: training utterances; autocorrelation vectors of context windows; arcsin
transform; estimation of the smoothing factor δ; computation of the smoothed delta cepstrum and
covariance matrix; autocorr. → LPC → cepstrum conversion; segmental k-means; word models (model
adaptation); testing utterances; speech recognizer; recognition results.]

Fig. 2. The various LPC log magnitude spectra of utterance ‘1’ in the clean condition. [Three panels, each
plotting magnitude (dB) against frequency (Hz) and frame index: (a) without signal limiter (baseline);
(b) with hard limiter; (c) with adaptive signal limiter (δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB,
SNR_UB = 30 dB).]
Fig. 3. The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB white noise. [Three
panels, each plotting magnitude (dB) against frequency (Hz) and frame index: (a) without signal limiter
(baseline); (b) with hard limiter; (c) with adaptive signal limiter (δ_min = 0.0, δ_max = 1.0,
SNR_LB = 20 dB, SNR_UB = 30 dB).]
Fig. 4. The various LPC log magnitude spectra of utterance ‘1’ distorted by 20 dB factory noise. [Three
panels, each plotting magnitude (dB) against frequency (Hz) and frame index: (a) without signal limiter
(baseline); (b) with hard limiter; (c) with adaptive signal limiter (δ_min = 0.0, δ_max = 1.0,
SNR_LB = 10 dB, SNR_UB = 40 dB).]
Fig. 5. The average log likelihoods of utterance ‘1’ evaluated on the various word models (models 0~9)
in white noise. [Three panels, each plotting average log likelihood against SNR (0 dB, 5 dB, 10 dB,
15 dB, 20 dB, clean): (a) baseline system without signal limiter; (b) with hard limiter; (c) with adaptive
signal limiter (δ_min = 0.0, δ_max = 1.0, SNR_LB = 20 dB, SNR_UB = 30 dB).]
Fig. 6. The average log likelihoods of utterance ‘1’ evaluated on the various word models (models 0~9)
in factory noise. [Three panels, each plotting average log likelihood against SNR (0 dB, 5 dB, 10 dB,
15 dB, 20 dB, clean): (a) baseline system without signal limiter; (b) with hard limiter; (c) with adaptive
signal limiter (δ_min = 0.0, δ_max = 1.0, SNR_LB = 10 dB, SNR_UB = 40 dB).]