CHAPTER 4
SPEECH ENHANCEMENT USING MULTI-BAND WIENER FILTER
4.1 Introduction
In real environmental conditions the speech signal may be
superimposed by the environmental interference. In general, the
spectrum of this noise varies with frequency. Since environmental
noise is colored, original speech spectrum will not be affected
uniformly by this noise. Recently researchers focussed on this issue
and some findings are given in [64-66]. Sub-band single channel
speech enhancement systems are developed by dividing the whole
noisy speech spectrum into frequency sub-bands [65], [67-68] and
this division is based on frequency response characteristics of human
ear. From the literature, Bark scale is well suited for sub-banding. The
power spectral density of colored noise is not constant, where as white
noise has constant value. Thereby doing speech enhancement by
multiplying the noisy speech spectrum with a same weighting factor
will distort the speech signal. Hence, it is necessary to perform speech
enhancement using different weighting factors for different sub-bands.
Thereby, speech distortion will be reduced while most of the musical
noise is eliminated.
This chapter deals with multi-band noise suppression techniques
in which a different weighting factor is used in each sub-band. To
resemble the frequency response characteristic of the human ear, the
noisy speech spectrum is divided into sub-bands based on the
nonlinear Bark scale.
4.2 Multi-band Speech Enhancement
In the earlier proposed spectral subtraction methods [8], [39],
the authors assumed that the noise and speech signals are
uncorrelated, so the cross-correlation terms between the clean speech
signal and the noise signal were neglected. If the speech with
background noise y(n) is considered as the original speech signal
x(n) additively corrupted by the background noise d(n), then the
noisy signal is given by

y(n) = x(n) + d(n)                                          (4.2.1)

This assumption holds while the speech signal is stationary, but this
is not the case in practice. By applying the FFT to (4.2.1), at the
mth frame and kth frequency bin, y(n) can be represented as
Y(m,k) = X(m,k) + D(m,k)                                    (4.2.2)

where Y(m,k), X(m,k) and D(m,k) are the DFT coefficients of the
speech with background noise, the clean speech and the noise signal
respectively. The power spectrum of the noise-corrupted speech signal
is given by

|Y(m,k)|² = |X(m,k)|² + |D(m,k)|² + X(m,k)D*(m,k) + X*(m,k)D(m,k)   (4.2.3)

where D*(m,k) and X*(m,k) denote the complex conjugates. The quantity
|X(m,k)|² is the power spectrum of the original speech signal.

In spectral subtraction, |X(m,k)|² is estimated from (4.2.3). The
terms |D(m,k)|², X(m,k)D*(m,k) and X*(m,k)D(m,k) in (4.2.3) cannot be
obtained directly, so their expected values are used instead of their
true values. E[|D(m,k)|²] is obtained using the method proposed in
chapter 3. In power spectrum subtraction, d(n) is assumed to be zero
mean and independent of x(n), so the expected values of the terms
X(m,k)D*(m,k) and X*(m,k)D(m,k) are zero. Based on these assumptions,
the original speech power spectrum can be estimated as

|X̂(m,k)|² = |Y(m,k)|² − E[|D(m,k)|²]                        (4.2.4)
4.2.1 Cross-correlation Terms in Spectral Subtraction
The assumption made by earlier researchers that the statistical
characteristics of speech and noise are uncorrelated limits the
performance of the spectral subtraction algorithm; it is not a valid
assumption in real-world environments. The autocorrelation sequences
of one frame of clean speech, together with the background noise and
the noisy version of the same speech signal, are shown in Fig. 4.1.
From this illustration it is observed that the autocorrelation
sequence of the speech signal with background noise is not the same
as that obtained by summing the autocorrelation sequences of the
noise and the original speech signal. Fig. 4.1 thus shows the
existence of cross-correlation between the clean speech signal and
the noise signal [69-70].
Therefore, the terms X(m,k)D*(m,k) and X*(m,k)D(m,k) cannot be
neglected. By including these terms, the musical noise in the
processed speech signal can be reduced. Since the discussion is on
single channel noise suppression techniques, samples of the original
speech are not available. In order to approximate and include the
cross-correlation terms, the spectrum of the noisy speech is used:
Y(m,k)D*(m,k) and Y*(m,k)D(m,k) are estimated instead of
X(m,k)D*(m,k) and X*(m,k)D(m,k).

Fig. 4.1 Verification of the existence of cross-correlation between
the original speech and noise signals.

This cross-correlation between the spectra of the speech signal with
background noise and the noise signal is quantified by the
cross-correlation coefficient ρ, given by

ρ_yd = ψ_yd / (ψ_y · ψ_d),   0 ≤ ρ_yd ≤ 1                   (4.2.5)
where

ψ_yd = (1/M) Σ_{i=0}^{N−1} Y(i) D̂(i)
ψ_y  = (1/M) Σ_{i=0}^{N−1} Y(i)
ψ_d  = (1/M) Σ_{i=0}^{N−1} D̂(i)
ψ_yy = (1/M) Σ_{i=0}^{N−1} Y²(i)
ψ_dd = (1/M) Σ_{i=0}^{N−1} D̂²(i)

and M is the length of the FFT frame. The value of ρ is proportional
to the degree of cross-correlation between the clean speech and the
noise.
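As an illustrative Python sketch (the simulations in this thesis were done in MATLAB), a cross-correlation coefficient between one frame's noisy-speech and estimated-noise magnitude spectra can be computed as follows. The normalization shown (a Cauchy-Schwarz style normalization, which guarantees 0 ≤ ρ ≤ 1 for non-negative magnitudes) is an assumption; the exact combination of the sums in (4.2.5) may differ, and all names here are illustrative:

```python
import numpy as np

def cross_corr_coeff(Y_mag, D_mag):
    # Normalized cross-correlation between the noisy-speech magnitude
    # spectrum Y_mag and the estimated-noise magnitude spectrum D_mag
    # of one frame. For non-negative magnitudes the Cauchy-Schwarz
    # inequality keeps the result in [0, 1]; the exact normalization
    # used in (4.2.5) is an assumption here.
    num = np.sum(Y_mag * D_mag)
    den = np.sqrt(np.sum(Y_mag ** 2) * np.sum(D_mag ** 2))
    return num / den

# identical spectra -> maximum correlation
rho_same = cross_corr_coeff(np.array([1.0, 2.0, 3.0]),
                            np.array([1.0, 2.0, 3.0]))
```

A frame where the two spectra are proportional yields ρ = 1, while disjoint spectral supports yield ρ = 0, which matches the intended behaviour of the coefficient.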
4.2.2 Mel-scale Spectral Subtraction
In most cases, the power spectral density of environmental noise
changes with frequency, unlike that of white noise. Such noise is
known as colored noise, and it affects the speech spectrum differently
in different frequency bands. Machine noise, for example, concentrates
most of its energy at low frequencies. A multi-band spectral
subtraction was proposed by S. Kamath et al. [65], in which the noisy
speech spectrum is divided into sub-bands using linearly spaced
frequencies. However, they did not consider any cross-correlation
terms.

In the present approach, the frequency spectrum of the noisy signal is
divided using non-linear spacing between sub-bands for spectral
subtraction, and the cross-correlation terms are also included in the
subtraction process. In this work the mel scale is used to divide the
human voice frequency range into 31 sub-bands. Table 4.1 shows the
frequency ranges of the sub-bands.
Using (4.2.6), the frequency components of the input noisy speech
signal are converted to the mel scale:

m_f = 2595 log₁₀(1 + f/700)                                 (4.2.6)

The sub-band noisy speech spectrum is obtained using

Y_i(m) = Σ_{k∈Ω(i)} |Y(m,k)|²,   i = 1, ..., K              (4.2.7)

where i is the sub-band number, K = 31 is the total number of
sub-bands and Ω(i) is the set of frequency indices between the lower
and upper boundaries of sub-band i, which depend on the lower and
upper frequency boundaries of critical band i.
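The conversion in (4.2.6) and the bin grouping of (4.2.7) can be sketched in Python as below. Note the assumption: sub-band edges are placed uniformly on the mel scale and then mapped back to Hz, whereas the thesis uses the specific edges of Table 4.1; the FFT size and helper names are illustrative:

```python
import numpy as np

def hz_to_mel(f):
    # (4.2.6): mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_band_edges(fs=8000, n_fft=256, n_bands=31):
    # Assumption: edges uniformly spaced on the mel scale (the thesis
    # tabulates its own edges in Table 4.1). Each FFT bin in 0..fs/2
    # is assigned to the sub-band whose Hz range contains it.
    mel_max = hz_to_mel(fs / 2.0)
    edges_mel = np.linspace(0.0, mel_max, n_bands + 1)
    edges_hz = 700.0 * (10.0 ** (edges_mel / 2595.0) - 1.0)
    bin_freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    band_of_bin = np.clip(
        np.searchsorted(edges_hz, bin_freqs, side='right') - 1,
        0, n_bands - 1)
    return edges_hz, band_of_bin

def subband_power(Y_mag2, band_of_bin, n_bands=31):
    # (4.2.7): sum |Y(m,k)|^2 over the bins belonging to each sub-band
    return np.array([Y_mag2[band_of_bin == i].sum()
                     for i in range(n_bands)])
```

Since every bin is assigned to exactly one sub-band, the sub-band powers always sum to the full-spectrum power, which is a convenient sanity check.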
Table 4.1 Sub-bands on the mel scale

Sub-band   Number of Bins   Frequency (Hz)
0          1                0-31
1          1                31-62
...        ...              ...
12         1                375-406
13         2                406-469
14         2                469-531
15         2                531-594
16         2                594-656
17         2                656-719
18         2                719-781
19         2                781-844
20         3                844-938
21         3                938-1031
22         4                1031-1156
23         6                1156-1344
24         6                1344-1531
25         8                1531-1781
26         9                1781-2063
27         10               2063-2375
28         12               2375-2750
29         14               2750-3188
30         18               3188-3750
31         9                3750-4000
According to the mel scale, the speech spectrum is divided into K
sub-bands and enhancement is done by subtracting the estimated noise
in each sub-band. The original speech power spectrum in the ith
sub-band is obtained using

|X̂_i(m,k)|² = |Y_i(m,k)|² − α_i |D̂_i(m,k)|²,  if |Y_i(m,k)|² − α_i |D̂_i(m,k)|² > 0
              β |D̂_i(m,k)|²,                  otherwise              (4.2.8)

for k_i ≤ k ≤ k_{i+1}, where k_i and k_{i+1} are the starting and
ending frequency indices of the ith sub-band and α_i is the
over-subtraction factor of the ith sub-band. The choice of the value
of α_i dictates the amount of noise eliminated; to provide the best
trade-off between audible distortion and residual noise peaks, this
subtraction factor should be selected appropriately. If α_i > 1, the
noisy speech spectrum is over-subtracted and may be over-attenuated.
To avoid this over-attenuation, the flooring parameter β is used: it
sets a minimum value for the gain of the subtraction filter, at the
cost of some masking by the residual noise.
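The subtraction rule of (4.2.8) can be sketched as a few lines of Python; the function and parameter names, and the default floor value, are illustrative rather than from the text:

```python
import numpy as np

def subband_spectral_subtract(Y2, D2_est, alpha, beta=0.01):
    # Per-sub-band power spectral subtraction following (4.2.8).
    # Y2, D2_est : noisy-speech and estimated-noise power spectra of
    #              one sub-band (arrays over the sub-band's bins)
    # alpha      : over-subtraction factor for this sub-band
    # beta       : spectral floor (illustrative default)
    X2 = Y2 - alpha * D2_est
    # Flooring: where over-subtraction drives the estimate negative,
    # keep a small fraction of the noise power instead of zero.
    return np.where(X2 > 0, X2, beta * D2_est)
```

The floor `beta * D2_est` is what prevents the isolated negative-going bins that would otherwise turn into audible musical noise after reconstruction.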
Equation (4.2.8) represents the estimated enhanced speech spectrum,
which should converge to the original speech spectrum. However, the
enhanced speech signal of this method does not converge to the
original speech signal because of the non-stationary nature of speech.
This spectral subtraction can also be represented as a filter with
gain G(m,k), whose value lies in the range 0 to 1. The enhanced
speech spectrum is then obtained by passing the noisy speech signal
through this filter:

X̂(m,k) = G(m,k) · Y(m,k),   0 ≤ G(m,k) ≤ 1                  (4.2.9)

Comparing (4.2.8) and (4.2.9), G(m,k) is obtained as

G(m,k) = 1 − |D̂_i(m,k)|² / |Y_i(m,k)|² = 1 − 1/SNR_post      (4.2.10)

where

SNR_post = Σ_{k=w_i}^{w_{i+1}} |Y(m,k)|² / Σ_{k=w_i}^{w_{i+1}} |D̂(m,k)|²

is the a posteriori SNR of sub-band i. According to (4.2.10), the
filter gain is a function of the a posteriori SNR. In regions where
the estimated noise power is higher than the noisy speech power
spectrum, G(m,k) is set to zero. Hence there is a trade-off between
the a posteriori SNR and the attenuation of the noisy speech: as the
a posteriori SNR increases, the attenuation of the noisy speech
decreases. Implementation of this approach is simple because the a
posteriori SNR can be obtained easily.
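The gain rule of (4.2.10), including the clamping to [0, 1] described above, is a one-liner; this small sketch (illustrative names) makes the trade-off explicit:

```python
def subtraction_gain(snr_post):
    # Spectral-subtraction filter gain of (4.2.10):
    # G = 1 - 1/SNR_post, clamped to [0, 1]. When the estimated noise
    # power exceeds the noisy speech power (SNR_post < 1) the gain is
    # forced to zero, as the text requires.
    g = 1.0 - 1.0 / snr_post
    return max(0.0, min(1.0, g))
```

For SNR_post = 2 the gain is 0.5, and for any SNR_post below 1 the sub-band is fully attenuated.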
4.2.3 Multi-band Wiener Filter
Based on the frequency response characteristics of the human ear, a
multi-band approach using non-linear sub-bands is proposed here. A
spectral gain for enhancement based on the psychoacoustics of the
human ear was proposed in [68]. Here the Bark scale is used to divide
the human voice frequency range into 24 sub-bands, which is suitable
for representing the band-pass filtering nature of the human ear. The
relationship between the input frequency f (in kHz) and the Bark
scale is given by

b(f) = 13 arctan(0.76 f) + 3.5 arctan((f/7.5)²)             (4.2.11)

and is shown graphically in Fig. 4.2.

Fig. 4.2 (a) Critical band rate and (b) frequency.

Out of the 24 sub-bands, 18 sub-bands are sufficient to represent a
speech signal sampled at 8 kHz. The noisy speech power spectrum on
the Bark scale is given by

Y_i(b) = Σ_{k∈Ω(i)} |Y(k)|²,   i = 1, 2, ..., K             (4.2.12)

where i is the sub-band index, K = 18 is the total number of
sub-bands and Ω(i) is the set of frequency indices of sub-band i.
Conventional Wiener filtering is now used in each sub-band for noise
elimination: the noisy speech is multiplied by the filter gain in
each sub-band as

X̂_i(k) = G_i Y_i(k)                                         (4.2.13)

The Wiener filter is designed by applying the MMSE criterion between
the clean speech signal and the processed sub-band signal. To derive
the Wiener filter gain, a cost function is formulated in each
sub-band:

ε_i = E[ (X̂_i(k) − X_i(k))² ]                               (4.2.14)

where X̂_i(k) and X_i(k) denote the estimated and ideal sub-band
speech signals in the ith sub-band respectively. Substituting
(4.2.13) in (4.2.14) and simplifying, ε_i becomes

ε_i = (1 − G_i)² E[|X_i(k)|²] + G_i² E[|D_i(k)|²] − 2 G_i (1 − G_i) E[X_i(k) D_i(k)]   (4.2.15)

where D_i(k), the estimated noise spectrum, is zero mean and assumed
uncorrelated with X_i(k) in each sub-band. Under this assumption
(4.2.15) simplifies to

ε_i = (1 − G_i)² E[|X_i(k)|²] + G_i² E[|D_i(k)|²]            (4.2.16)

Differentiating (4.2.16) with respect to the weighting factor G_i and
setting the result to zero, the weighting factor is found to be

G_i = E[|X_i(k)|²] / (E[|X_i(k)|²] + E[|D_i(k)|²]) = S_i² / (S_i² + D_i²) = S_i² / Y_i²   (4.2.17)

where S_i², D_i² and Y_i² denote the estimated powers of the original
speech, the background noise and the speech with background noise in
the ith sub-band respectively.
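The closed form of (4.2.17) translates directly into code; a minimal sketch with illustrative names:

```python
def wiener_gain(S2, D2):
    # Sub-band Wiener gain of (4.2.17): G_i = S_i^2 / (S_i^2 + D_i^2),
    # where S2 and D2 are the estimated speech and noise powers in the
    # sub-band. The gain lies in (0, 1] for positive powers.
    return S2 / (S2 + D2)
```

For example, a sub-band whose estimated speech power is three times its noise power receives a gain of 0.75, so stronger-speech sub-bands are attenuated less, exactly the per-band weighting motivated in the introduction.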
Considering the cross-correlation between X_i(k) and D_i(k) and
differentiating (4.2.15) with respect to G_i, we have

G_i = (E[|X_i(k)|²] + E[X_i(k) D_i(k)]) / (E[|X_i(k)|²] + E[|D_i(k)|²] + 2 E[X_i(k) D_i(k)])   (4.2.18)

However, the cross-correlation term cannot be estimated directly:
in the single channel case no reference signal for the original
speech is available, only the noisy speech signal Y_i(k). Hence, the
cross-correlation between X_i(k) and D_i(k) is estimated using the
available Y_i(k) and the estimated D_i(k):

E[Y_i(k) D_i(k)] = E[(X_i(k) + D_i(k)) · D_i(k)]
E[X_i(k) D_i(k)] = E[Y_i(k) D_i(k)] − E[|D_i(k)|²]           (4.2.19)

The cross-correlation between Y_i(k) and D_i(k) is modelled as

E[Y_i(k) D_i(k)] = ρ |Y_i(k)| |D_i(k)|                       (4.2.20)

where ρ is the cross-correlation coefficient relating the
noise-corrupted speech and noise signals in a sub-band, given in
(4.2.5). Substituting (4.2.20) and (4.2.19) in (4.2.18), and noting
that the denominator of (4.2.18) equals E[|Y_i(k)|²], the filter gain
G_i can be obtained as

G_i = (E[|X_i(k)|²] + ρ |Y_i(k)| |D_i(k)| − E[|D_i(k)|²]) / E[|Y_i(k)|²]

    = (ξ_i(k) + ρ |Y_i(k)| |D_i(k)| / E[|D_i(k)|²] − 1) / γ_i(k)     (4.2.21)
where ξ_i(k) = E[|X_i(k)|²] / E[|D_i(k)|²] is the a priori SNR in the
ith sub-band and, similarly, γ_i(k) = E[|Y_i(k)|²] / E[|D_i(k)|²] is
the a posteriori SNR. The a priori SNR cannot be calculated directly;
it is therefore estimated using the decision-directed approach given
in [34], [71]:

ξ̂_i(k) = α |X̂_i(k)|² / E[|D̂_i(k)|²] + (1 − α) P[γ_i(k) − 1]   (4.2.22)

where X̂_i(k) is the estimate of the original speech signal in the
ith sub-band, E[|D̂_i(k)|²] is the estimate of the noise power in the
ith sub-band, P[·] performs half-wave rectification and α denotes the
smoothing factor. The initial a priori SNR is obtained using the ML
estimate, ξ_i(k) = γ_i(k) − 1.
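The decision-directed recursion of (4.2.22) is easy to express as a per-frame update; this sketch uses illustrative names and a default α in the 0.95-0.99 range discussed in the next section:

```python
def decision_directed_xi(X2_prev, D2_est, gamma, alpha=0.98):
    # Decision-directed a priori SNR of (4.2.22):
    #   xi_hat = alpha * X2_prev / D2_est + (1 - alpha) * P[gamma - 1]
    # X2_prev : enhanced-speech power of the previous frame
    # D2_est  : estimated noise power in the sub-band
    # gamma   : a posteriori SNR of the current frame
    # P[.] is half-wave rectification, i.e. max(., 0).
    return alpha * X2_prev / D2_est + (1.0 - alpha) * max(gamma - 1.0, 0.0)
```

With α close to 1 the estimate leans on the previous frame's enhanced speech (smooth but slow to track), while smaller α lets the instantaneous term γ − 1 dominate, which motivates the adaptive smoothing factor developed in section 4.3.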
4.3 Modified A priori SNR
This section deals with a modified a priori SNR. The estimation and
updating of ξ̂_i(k) are controlled by the smoothing factor α. The
approach given in [67] analyzed the a priori SNR estimate and the
resulting noise reduction as a function of the smoothing factor. When
α is close to 1, less residual noise remains in the enhanced speech
signal but the transient distortion increases. Balancing these
effects, researchers [8], [72] set the value of α in the range
0.95-0.99. If a constant value of α is chosen, ξ̂_i(k) fails to
follow sudden increases in the speech signal magnitude and follows
the a posteriori SNR only after a certain time delay. Similarly,
ξ̂_i(k) follows the dips in the a posteriori SNR, which correspond to
decreases in the speech signal magnitude, only after a delay. It is
therefore convenient to use smaller values of α in these transitional
regions. To obtain better performance of the a priori SNR ξ̂_i(k),
an adaptation method for the smoothing parameter was proposed in
[72], in which the authors assumed that the statistical
characteristics of the background noise are stationary and do not
change from frame to frame. According to [72], α is defined in terms
of the frame energy E_m = Σ_k |Y(m,k)|² as

α = 1 − |E_m − E_{m−1}| / (E_m + E_{m−1})                   (4.3.1)
In this thesis, a modified smoothing factor is proposed, defined
using the MMSE criterion. This factor adaptively changes the a priori
SNR in response to changes in the speech signal magnitude. The
modified a priori SNR is obtained as

ξ̂_i(m,k) = α_i(m,k) ξ̃_i(m−1,k) + (1 − α_i(m,k)) P[γ_i(m,k) − 1]   (4.3.2)

where ξ̃_i(m−1,k) = |X̂_i(m−1,k)|² / E[|D_i(m−1,k)|²], which is
similar to the a priori SNR ξ_i(m,k). A cost function J is formulated
based on the MMSE criterion to obtain the modified smoothing factor:

J = E[ (ξ̂_i(m,k) − ξ_i(m,k))² ]                             (4.3.3)

Substituting (4.3.2) into (4.3.3), an expression for J is obtained as

J = E[ (α_i(m,k) ξ̃_i(m−1,k) + (1 − α_i(m,k)) (γ_i(m,k) − 1) − ξ_i(m,k))² ]   (4.3.4)

Differentiating J with respect to α_i(m,k) and setting the derivative
to zero, the modified smoothing factor becomes

α_i^opt(m,k) = 1 / ( 1 + [ (ξ_i(m,k) − ξ̃_i(m−1,k)) / (ξ_i(m,k) + 1) ]² )   (4.3.5)

In this derivation the following terms are substituted:

E[ (γ_i(m,k) − 1)² ] = 2 ξ_i²(m,k) + 2 ξ_i(m,k) + 1

and E[|X_i(m,k)|⁴] / (E[|D_i(m,k)|²])² = 2 ξ_i²(m,k), which is the
fourth-order moment; a discussion of this moment is given in [69].
Initially ξ_i(m,k) is not available, so ξ_i(m,k) ≈ P[γ_i(m,k) − 1] is
substituted to obtain an approximate value of α_i(m,k). The modified
smoothing factor takes smaller values during sudden changes in the
speech signal magnitude, i.e. in the a posteriori SNR. When the
speech signal magnitude varies smoothly, the value of the smoothing
parameter is close to 1. Fig. 4.3 compares the variations of the a
posteriori SNR with the obtained smoothing parameter and with the
parameter given in [72]. The major difference is that the modified
factor changes for every frame and frequency bin, unlike the
parameter given in [72].
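The behaviour of (4.3.5) can be verified with a short sketch (illustrative names; ξ_i(m,k) would in practice be replaced by its approximation P[γ_i(m,k) − 1] as described above):

```python
def adaptive_alpha(xi_cur, xi_tilde_prev):
    # Modified smoothing factor of (4.3.5):
    #   alpha_opt = 1 / (1 + ((xi - xi_tilde_prev) / (xi + 1))**2)
    # xi_cur        : current a priori SNR (or its approximation)
    # xi_tilde_prev : xi-tilde from the previous frame
    # Close to 1 when the SNR is steady; drops on abrupt SNR changes.
    r = (xi_cur - xi_tilde_prev) / (xi_cur + 1.0)
    return 1.0 / (1.0 + r * r)
```

When the current and previous SNR estimates agree the factor equals 1 (maximum smoothing), and the larger the mismatch, the smaller the factor, which is exactly the transient-tracking behaviour described above.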
4.4 Simulation Results
The evaluation of the multi-band single channel speech enhancement
methods is discussed in this chapter. For this evaluation,
simulations are carried out in MATLAB with the NOIZEUS database [61],
which provides noisy speech samples for different environments at
different SNRs. For the performance evaluation, airport, car, babble,
train, restaurant, railway station and street environmental noises
are considered at 0 dB, 5 dB, 10 dB and 15 dB SNR levels.
Fig. 4.3 Variation of the smoothing parameter α: (a) noisy speech;
(b) proposed α(m,k) (dotted line) and a posteriori SNR γ(k) (solid
line); (c) α_m of [65] (dotted line) and a posteriori SNR γ(k) (solid
line); (d) a priori SNR ξ(k) estimated using the proposed α(m,k) for
k = 37 (solid line) and the a priori SNR estimated using α_m.
The following performance measures are used in the evaluation
procedure: segmental SNR and noise reduction (NR). Higher values of
these measures indicate a significant reduction of the background
noise [54]. Details of these performance measures were discussed in
section 2.6. The performance of the mel-scale spectral subtraction
and the multi-band Wiener filter is compared with power spectral
subtraction and Wiener filtering.
Table 4.2 shows the output average segmental SNR values of the
enhanced speech signals obtained with the proposed method and with
the other methods. The multi-band Wiener filter almost always
outperforms power spectral subtraction, the Wiener filter and
mel-scale spectral subtraction. Fig. 4.4 gives a graphical comparison
of the output average segmental SNR values for the different noise
environments.

Table 4.3 compares the noise reduction values. From this table it is
clear that the background noise reduction achieved by the proposed
method is better than that of the other methods; a similar comparison
is given in Fig. 4.5. The timing waveforms of the enhanced speech are
given in Fig. 4.6, where the original speech signal is corrupted with
airport noise at 0 dB SNR. A similar comparison in terms of
spectrograms is given in Fig. 4.7. From these results and listening
tests it is clear that the multi-band Wiener filter removes the
background noise more efficiently than the other methods.
Table 4.2 Output average segmental SNR values (dB). PSS: power
spectral subtraction; WF: Wiener filter; Mel-scale SS: mel-scale
spectral subtraction; MWF: multi-band Wiener filter.

Noise-SNR (dB)    PSS     WF      Mel-scale SS    MWF
Airport-0 -5.55 -4.06 -4.37 -2.39
Airport-5 -2.61 -2.23 -2.57 0.67
Airport-10 -0.39 -0.68 -0.06 0.43
Airport-15 0.18 0.77 1.88 3.13
Babble-0 -5.87 -4.40 -4.59 -1.91
Babble-5 -1.96 -1.80 -1.39 0.05
Babble-10 -0.55 -0.16 0.03 2.36
Babble-15 0.27 0.76 2.71 3.06
Car-0 -4.55 -4.48 -3.93 -1.02
Car-5 -1.48 -1.90 -1.65 1.69
Car-10 0.23 -0.08 0.68 2.40
Car-15 0.48 0.75 2.31 2.71
Street-0 -4.21 -3.16 -2.88 -1.97
Street-5 -1.82 -2.28 -2.13 -0.29
Street-10 -0.74 -0.31 1.20 2.42
Street-15 0.45 0.58 2.25 2.48
Train-0 -3.69 -3.45 -3.45 -2.13
Train-5 -0.98 -1.75 -0.86 0.93
Train-10 -1.95 -0.81 -0.39 1.69
Train-15 0.46 0.49 2.62 2.57
Restaurant-0 -17.73 -5.11 -5.49 -3.44
Restaurant-5 -12.96 -2.72 -3.61 -0.15
Restaurant-10 -8.40 -0.66 -0.49 1.28
Restaurant-15 -3.43 0.49 1.80 2.47
Station-0 -4.21 -3.72 -3.62 0.51
Station-5 -1.82 -1.83 -1.93 1.18
Station-10 -0.74 -0.36 0.95 2.39
Station-15 0.45 0.76 2.72 2.86
Fig. 4.4 Comparison of output average segmental SNR values for
different noises: (a) airport, (b) babble, (c) car, (d) restaurant,
(e) station and (f) street.
Table 4.3 Noise reduction values (dB). PSS: power spectral
subtraction; WF: Wiener filter; Mel-scale SS: mel-scale spectral
subtraction; MWF: multi-band Wiener filter.

Noise-SNR (dB)    PSS     WF      Mel-scale SS    MWF
Airport-0 18.74 13.15 25.37 25.00
Airport-5 19.98 16.95 24.68 25.98
Airport-10 20.29 21.79 23.21 24.01
Airport-15 22.78 22.91 23.04 26.22
Babble-0 18.12 13.21 24.73 25.9
Babble-5 20.19 16.50 24.70 25.85
Babble-10 21.54 19.74 23.91 26.60
Babble-15 21.55 23.29 22.99 26.12
Car-0 18.45 13.38 25.63 26.30
Car-5 21.08 16.17 25.11 27.69
Car-10 22.68 19.48 24.20 26.89
Car-15 23.70 22.96 23.07 25.81
Street-0 19.08 13.97 25.06 25.34
Street-5 19.38 16.53 25.43 25.36
Street-10 22.64 19.64 24.14 25.86
Street-15 23.21 23.54 23.54 25.24
Train-0 17.45 13.59 25.87 25.63
Train-5 20.06 16.61 24.89 26.64
Train-10 21.12 19.87 24.20 25.96
Train-15 23.72 23.68 22.86 25.44
Restaurant-0 19.16 13.92 25.41 23.50
Restaurant-5 20.19 16.69 25.13 25.42
Restaurant-10 21.43 19.95 24.15 25.61
Restaurant-15 22.53 23.32 23.10 25.42
Station-0 19.06 13.10 25.44 28.37
Station-5 20.91 16.12 24.81 27.11
Station-10 22.47 19.48 23.63 26.76
Station-15 23.86 22.91 22.92 25.87
Fig. 4.5 Comparison of noise reduction values of various enhancement
techniques and the multi-band Wiener filter for different noises:
(a) airport, (b) babble, (c) car, (d) street, (e) train and
(f) restaurant.
Fig. 4.6 Timing waveforms of (a) the original speech, (b) the speech
with background noise, and the enhanced speech using (c) power
spectral subtraction, (d) the Wiener filter, (e) mel-scale spectral
subtraction and (f) the multi-band Wiener filter.
Fig. 4.7 Comparison of spectrograms of (a) the original speech,
(b) the speech with background noise, and the enhanced speech using
(c) power spectral subtraction, (d) the Wiener filter, (e) mel-scale
spectral subtraction and (f) the multi-band Wiener filter.
4.5 Conclusion
This chapter described the development of a new multi-band speech
enhancement system that considers the effect of colored noise on the
speech signal spectrum. In particular, mel-scale spectral subtraction
and multi-band Wiener filter methods were examined. The performance
of these methods was compared with power spectral subtraction and
Wiener filtering. From the simulation results it is observed that the
multi-band Wiener filter method gives better performance than the
other methods.

Chapter 5 describes the reduction of the harmonic distortion caused
during multi-band Wiener filter speech enhancement.