


www.elsevier.com/locate/specom

Speech Communication 49 (2007) 123–133

Speech signal enhancement through adaptive wavelet thresholding

Michael T. Johnson a,*, Xiaolong Yuan b, Yao Ren a,1

a Department of Electrical and Computer Engineering, 1515 W. Wisconsin Avenue, Milwaukee, WI 53233, United States
b Motorola Electronics Ltd, No. 108 Jain Guo Road, Chao Yang District, Beijing 100022, PR China

Received 14 March 2006; received in revised form 29 September 2006; accepted 11 December 2006

Abstract

This paper demonstrates the application of the Bionic Wavelet Transform (BWT), an adaptive wavelet transform derived from a non-linear auditory model of the cochlea, to the task of speech signal enhancement. Results, measured objectively by Signal-to-Noise Ratio (SNR) and Segmental SNR (SSNR) and subjectively by Mean Opinion Score (MOS), are given for additive white Gaussian noise as well as four different types of realistic noise environments. Enhancement is accomplished through the use of thresholding on the adapted BWT coefficients, and the results are compared to a variety of speech enhancement techniques, including Ephraim Malah filtering, iterative Wiener filtering, and spectral subtraction, as well as to wavelet denoising based on a perceptually scaled wavelet packet transform decomposition. Overall results indicate that SNR and SSNR improvements for the proposed approach are comparable to those of the Ephraim Malah filter, with BWT enhancement giving the best results of all methods for the noisiest (−10 dB and −5 dB input SNR) conditions. Subjective measurements using MOS surveys across a variety of 0 dB SNR noise conditions indicate enhancement quality competitive with but still lower than results for Ephraim Malah filtering and iterative Wiener filtering, but higher than the perceptually scaled wavelet method.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Adaptive wavelets; Bionic Wavelet Transform; Speech enhancement; Denoising

1. Introduction

Speech enhancement is an important problem within the field of speech and signal processing, with impact on many computer-based speech recognition, coding and communication applications. The underlying goal of speech enhancement is to improve the quality and intelligibility of the signal, as perceived by human listeners. Existing approaches to this task include traditional methods such as spectral subtraction (Boll, 1979; Deller et al., 2000), Wiener filtering (Deller et al., 2000; Haykin, 1996), and Ephraim Malah filtering (Ephraim and Malah, 1984). Wavelet-based techniques using coefficient thresholding

0167-6393/$ - see front matter © 2007 Elsevier B.V. All rights reserved.

doi:10.1016/j.specom.2006.12.002

* Corresponding author. Tel.: +1 414 288 0631; fax: +1 414 288 5579. E-mail addresses: [email protected] (M.T. Johnson), [email protected] (X. Yuan), [email protected] (Y. Ren).
1 Tel.: +1 414 288 7451.

approaches have also been applied to speech enhancement (Donoho, 1995; Guo et al., 2000), and more recently a number of attempts have been made to use perceptually motivated wavelet decompositions coupled with various thresholding and estimation methods (Bahoura and Rouat, 2001; Chen et al., 2004; Cohen, 2001; Fu and Wan, 2003; Hu and Loizou, 2004; Lu and Wang, 2003).

Recently, the Bionic Wavelet Transform (BWT) (Yao and Zhang, 2001, 2002) has been proposed as a method for wavelet decomposition of speech signals. The BWT was originally designed for applications in speech coding, with particular emphasis on the possibility of using it for encoding of cochlear implant signals. The BWT model, which will be described in more detail in the next section, is based on an auditory model of the human cochlea, capturing the non-linearities present in the basilar membrane and translating those into adaptive time-scale transformations of the underlying mother wavelet. Motivated by the communicative connection between the speech production


system and the auditory system, the work presented here uses the BWT in combination with existing wavelet denoising techniques to construct a new adaptive wavelet thresholding method for speech enhancement, with the idea that the improved representational capability of the BWT on speech signals could lead to better separation of signal and noise components within the coefficients and therefore better enhancement results.

In Section 2, we give a detailed overview of wavelet decompositions, wavelet thresholding techniques, and the BWT, as well as a brief discussion of the baseline enhancement methods being used for comparison. Section 3 introduces the new approach and outlines the experimental method, including the experiments, the data set, the noise models, and the evaluation metrics. Results of these experiments are presented and discussed in Section 4, followed by overall conclusions in Section 5.

2. Background

2.1. Wavelet analysis (Debnath, 2002; Jaffard et al., 2001; Walnut, 2002)

The Continuous Wavelet Transform (CWT) of a signal x(t) is given by

$$X_{\mathrm{CWT}}(a,s) = \left\langle x(t),\varphi_{a,s}(t)\right\rangle = \frac{1}{\sqrt{a}}\int x(t)\,\varphi^{*}\!\left(\frac{t-s}{a}\right)dt, \qquad (1)$$

where s and a represent the time shift and scale variables, respectively, and φ(·) is the mother wavelet chosen for the transform. Given that this mother wavelet satisfies a basic admissibility criterion (Daubechies, 1992), the inverse transform also exists. The idea of the wavelet originated with the Gabor Transform (Gabor, 1946), a windowed Fourier Transform designed such that the duration of the time localization window varied with frequency. A wavelet representation offers advantages over traditional Fourier analysis in that the time support of the wavelet used to perform the correlation in Eq. (1) varies as a function of scale, so that the analysis window length matches the frequency of interest, trading off time and frequency resolution.
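As a concrete illustration, Eq. (1) can be discretized as a correlation of the sampled signal with a sampled, scaled wavelet. The sketch below uses a Morlet-style wavelet (Gaussian envelope times a complex exponential); the function name and parameter values are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def cwt_one_scale(x, a, fs, w0=2 * np.pi * 50, t0=0.01):
    """Discretized Eq. (1) at a single scale a: correlate x with the
    conjugate wavelet and scale by 1/sqrt(a).
    w0: wavelet center frequency (rad/s); t0: envelope time-support (s)."""
    half = int(4 * t0 * a * fs)                       # envelope support in samples
    t = np.arange(-half, half + 1) / fs
    psi = np.exp(-(t / (t0 * a)) ** 2) * np.exp(1j * w0 * t / a)
    # correlation = convolution with the time-reversed conjugate wavelet
    return np.convolve(x, np.conj(psi)[::-1], mode="same") / np.sqrt(a)
```

At scale a the wavelet is centered at frequency w0/a, so a tone near that frequency yields much larger coefficient magnitudes than an out-of-band tone.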

Fig. 1. A Discrete Wavelet Transform (left) and full Wavelet Packet Transform (right), represented as filterbank decompositions, with the left and right branches at each node representing a matched pair of low-pass and high-pass wavelet filters followed by downsampling.

The time s and scale a variables of the CWT can be discretized and in many cases still provide for complete representation of the underlying signal, provided that the mother wavelet meets certain requirements. This may be viewed as a type of multiresolution analysis, where at each scale the signal is represented at a different level of detail. In the case where the discretization of the time and scale variables is dyadic in nature, so that a = 2^m and s = n2^m, an efficient implementation may be obtained through the use of a Quadrature Mirror Filter (QMF) decomposition at each level, where matching low-pass and high-pass filterbank coefficients characterize the convolution with the mother wavelet, and downsampling by 2 at each level equates to the doubling of the time interval according to scale. This multiresolution filter bank implementation is referred to as a Discrete Wavelet Transform (DWT), and exists provided that the family of wavelets generated by dyadic scaling and translation forms an orthonormal basis set. The BWT used in the current work is based on the Morlet mother wavelet, for which a direct DWT representation is not possible, so a CWT coupled with fast numerical integration techniques is used instead to generate a set of discretized wavelet coefficients.

A further generalization of the DWT is the Wavelet Packet Transform (WPT), also based on a filter bank decomposition approach. In this case the filtering process is iterated on both high and low frequency components, rather than continuing only on low frequency terms as with a standard dyadic DWT. A comparison between a DWT and a WPT is shown in Fig. 1.

The depth of the Wavelet Packet Tree shown in Fig. 1 can be varied over the available frequency range, resulting in a configurable filterbank decomposition. This idea has been used to create customized Wavelet Packet Transforms where the filterbanks match a perceptual auditory scale, such as the Bark scale, for use in speech representation, coding, and enhancement (Bahoura and Rouat, 2001; Chen et al., 2004; Cohen, 2001; Fu and Wan, 2003; Hu and Loizou, 2004; Lu and Wang, 2003). The use of Bark-scale WPT for enhancement has so far indicated a small but significant gain in overall enhancement quality due to this


perceptual specialization. This perceptual WPT, using auditory critical band scaling following Cohen's work (Cohen, 2001) as shown in Fig. 2, is implemented in this work as a reference method for comparison to the new technique.

The degree to which a particular set of wavelet coefficients forms a useful or compact representation of a signal is a function of how well the mother wavelet matches with the underlying signal characteristics, as well as the times and scales selected. For application to signal enhancement, often referred to in the literature as wavelet denoising, the coefficient magnitudes are reduced after comparison to a threshold, as described in more detail in the following section. With a good choice of representation, this thresholding will remove noise while maintaining signal properties.

To address the fact that many types of signals have substantial non-stationarity and may not be well-represented by a single fixed set of parameters, it is possible to make the wavelet transform adaptive, such that characteristics of the transform change over time as a function of the underlying signal characteristics. There are several possible approaches to adaptive wavelet enhancement, including adaptation of the wavelet basis, adaptation of the wavelet packet configuration, direct adaptation of the time and scale variables, or adaptation of the thresholds or thresholding algorithms used. Of these, the most common approach is to use a time-varying threshold or gain function based on an a priori energy or SNR measure (Bahoura and Rouat, 2001; Chen et al., 2004; Cohen, 2001; Fu and Wan, 2003; Hu and Loizou, 2004; Lu and Wang, 2003).

The BWT decomposition used here is both perceptually scaled and adaptive. The initial perceptual aspect of the transform comes from the logarithmic spacing of the baseline scale variables, which are designed to match basilar-membrane spacing. Two adaptation factors then control the time-support used at each scale, based on a non-linear perceptual model of the auditory system, as described in detail in the following section.

2.2. Bionic Wavelet Transform

The BWT was introduced in (Yao, 2001; Yao and Zhang, 2001) as an adaptive wavelet transform designed specifically to model the human auditory system. The basis for this transform is the Giguere–Woodland non-linear transmission line model of the auditory system (Giguere, 1993; Giguere and Woodland, 1994), an active-feedback electro-acoustic model incorporating the auditory canal, middle ear, and cochlea. The model yields estimates of the time-varying acoustic compliance and resistance along the displaced basilar membrane, as a function of the physiological acoustic mass, cochlear frequency-position mapping, and feedback factors representing the active mechanisms of the outer hair cells. The net result can be viewed as a method for estimating the time-varying quality factor Qeq of the cochlear filter banks as a function of the input sound waveform. See Giguere and Woodland (1994), Zheng et al. (1999) and Yao and Zhang (2001) for complete details on the elements of this model.

Fig. 2. Perceptually scaled Wavelet Packet Transformation, with leaf-node center frequencies following an approximately critical-band scaling.

The adaptive nature of the BWT is captured by a time-varying linear factor T(a,s) that represents the scaling of the cochlear filter bank quality factor Qeq at each scale over time. Incorporating this directly into the scale factor of a Morlet wavelet, we have:

$$X_{\mathrm{BWT}}(a,s) = \frac{1}{T(a,s)\sqrt{a}}\int x(t)\,\tilde{\varphi}^{*}\!\left(\frac{t-s}{T(a,s)\,a}\right)e^{-j\omega_0\left(\frac{t-s}{a}\right)}\,dt, \qquad (2)$$

where

$$\tilde{\varphi}(t) = e^{-\left(t/T_0\right)^2} \qquad (3)$$

is the amplitude envelope of the Morlet wavelet, T0 is the initial time-support, and ω0 is the base fundamental frequency of the unscaled mother wavelet, here taken as ω0 = 15,165.4 Hz for the human auditory system, per Yao and Zhang's original work (Yao and Zhang, 2001). The discretization of the scale variable a is accomplished using pre-determined logarithmic spacing across the desired frequency range, so that the center frequency at each scale is given by the formula ωm = ω0/(1.1623)^m, m = 0, 1, 2, . . . For this implementation, based on Yao and Zhang's original work for cochlear implant coding (Yao and Zhang, 2002), coefficients at 22 scales, m = 7, . . . , 28, are calculated using numerical integration of the continuous wavelet transform. These 22 scales correspond to center frequencies logarithmically spaced from 225 Hz to 5300 Hz. (Although the scales used here match those from Yao and Zhang's original work, empirical variation of the number of scales and frequency placement showed minimal effect on the overall enhancement results.)

Fig. 3. Threshold mapping functions (hard, soft, and non-linear threshold curves of new value versus original value).

The BWT adaptation factor T(a,s) for each scale and time is computed using the update equation:

$$T(a,s+\Delta s) = \left[\left(1 - G_1\frac{C_s}{C_s + \left|X_{\mathrm{BWT}}(a,s)\right|}\right)\left(1 + G_2\left|\frac{\partial}{\partial s}X_{\mathrm{BWT}}(a,s)\right|\right)\right]^{-1}, \qquad (4)$$

where G1 is the active gain factor representing the outer hair cell active resistance function, G2 is the active gain factor representing the time-varying compliance of the basilar membrane, and Cs = 0.8 is a constant representing non-linear saturation effects in the cochlear model (Yao and Zhang, 2001). In practice, the partial derivative of Eq. (4) is approximated using the first difference of the previous points of the BWT at that scale.

From Eq. (2), it can be seen that the adaptation factor T(a,s) affects the duration of the amplitude envelope of the wavelet, but does not affect the frequency of the associated complex exponential. Thus, one useful way to think of the BWT is as a mechanism for adapting the time support of the underlying wavelet according to the quality factor Qeq of the corresponding cochlear filter model at each scale. The key parameters T0, G1, and G2 will be discussed in detail in Sections 4.1 and 4.2.
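A single step of the Eq. (4) update at one scale can be sketched as follows. The G1 and G2 values below are placeholders (the paper defers their selection to Sections 4.1 and 4.2), and the derivative is the first-difference approximation described above.

```python
def update_T(X_prev, X_prev2, G1=0.87, G2=45.0, Cs=0.8, ds=1.0):
    """One update of the BWT adaptation factor T(a, s + ds) (Eq. (4)).
    X_prev, X_prev2: magnitudes of the two most recent BWT coefficients
    at this scale; G1, G2 are illustrative placeholder gains."""
    deriv = abs(X_prev - X_prev2) / ds              # first-difference d/ds
    saturation = 1.0 - G1 * Cs / (Cs + abs(X_prev))
    return 1.0 / (saturation * (1.0 + G2 * deriv))
```

Small coefficients (quiet input) drive T up, widening the wavelet's time support, while large or rapidly changing coefficients drive T back toward smaller values.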

It can be shown (Yao and Zhang, 2002) that the resulting BWT coefficients XBWT(a,s) can be calculated as a product of the original WT coefficients XWT(a,s) and a multiplying constant K(a,s) which is a function of the adaptation factor T(a,s). For the Morlet wavelet, this adaptive multiplying factor can be expressed as

$$X_{\mathrm{BWT}}(a,s) = K(a,s)\,X_{\mathrm{WT}}(a,s), \qquad K(a,s) = \frac{\sqrt{\pi}\,C\,T_0}{\sqrt{1 + T^2(a,s)}}, \qquad (5)$$

where C is a normalizing constant computed from the integral of the squared mother wavelet. This representation yields an efficient computational method for computing BWT coefficients directly from the original WT coefficients without needing to perform the numerical integration of Eq. (2) at each time and scale.
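Given per-scale adaptation factors T, the Eq. (5) conversion is a simple elementwise multiply. In this sketch C and T0 are placeholder constants; the paper derives them from the mother wavelet.

```python
import numpy as np

def bwt_from_wt(X_wt, T, C=1.0, T0=1.0):
    """Eq. (5): X_BWT = K * X_WT with K = sqrt(pi)*C*T0 / sqrt(1 + T^2).
    X_wt and T may be arrays over (scale, time); C, T0 are placeholders."""
    K = np.sqrt(np.pi) * C * T0 / np.sqrt(1.0 + T ** 2)
    return K * X_wt, K
```

Note that larger T (wider time support) yields a smaller multiplying factor K.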

There are several key differences between the discretized CWT using the Morlet wavelet, used for the BWT, and a filterbank-based WPT using an orthonormal wavelet such as the Daubechies family, as used for the comparative baseline method. One is that the WPT is perfectly reconstructable, whereas the discretized CWT is an approximation whose exactness depends on the number and placement of frequency bands selected. Another difference, related to this idea, is that the Morlet mother wavelet consists of a single frequency with an exponentially decaying time support, whereas the frequency support of the orthonormal wavelet families used for DWTs and WPTs covers a broader bandwidth. The Morlet wavelet is thus more "frequency focused" along each scale, which is what permits the direct adaptation of the time support with minimal impact on the frequency support, the central mechanism of the BWT adaptation.

2.3. Wavelet denoising

If an observed signal includes measurement error or ambient noise, the result is an additive signal model given by

$$y = x + n, \qquad (6)$$

where y is the noisy signal, x is the original clean signal, and n is the additive noise component. It can easily be seen from Eq. (1) above that the wavelet coefficients are also additive, so that Y = X + N, where the matrix notation represents the set of coefficients across the selected times and scales. Given a well-matched wavelet representation, noise characteristics will tend to be characterized across time and scale by smaller coefficients, while signal energy will be concentrated in larger coefficients. This offers the possibility of using thresholding to separate the signal from the noise. There are a wide variety of basic thresholding approaches (Antoniadis and Oppenheim, 1995; Donoho, 1995), including:

• Hard thresholding, where all coefficients below a pre-defined threshold value are set to zero.

• Soft thresholding, where in addition the remaining coefficients are linearly reduced in value.

• Non-linear thresholding, where a smooth function is used to map the original coefficients to a new set, avoiding abrupt value changes.
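The three mappings above can be sketched directly. For the smooth non-linear rule we use a garrote-style shrinkage as one concrete example; the text does not fix a particular non-linear function, so that choice (and the function names) are ours.

```python
import numpy as np

def hard_threshold(w, T):
    # zero out coefficients at or below the threshold, keep the rest
    return np.where(np.abs(w) > T, w, 0.0)

def soft_threshold(w, T):
    # zero out small coefficients and shrink survivors toward zero by T
    return np.sign(w) * np.maximum(np.abs(w) - T, 0.0)

def garrote_threshold(w, T):
    # smooth non-linear shrinkage: w - T^2/w above the threshold, zero below
    w = np.asarray(w, dtype=float)
    safe = np.where(w == 0.0, 1.0, w)               # avoid division by zero
    return np.where(np.abs(w) > T, w - T ** 2 / safe, 0.0)
```

For a surviving coefficient, the garrote output lies between the soft and hard results, approaching the identity map for large coefficients.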

Illustrations of hard, soft, and non-linear thresholding operations are shown in Fig. 3. For any of these approaches, the threshold parameter may be either a fixed value or a level-dependent value that is a function of the wavelet decomposition level. One of the key elements for successful wavelet denoising is the selection of the threshold. Optimization approaches for determining this value have been well-studied. Two of the most common methods are universal thresholding and Stein's unbiased risk estimator (SURE) (Donoho, 1995), typically implemented with a soft-thresholding function.

Universal thresholding uses the threshold value

$$T = \sigma\sqrt{2\log N}, \qquad (7)$$

where σ is an estimate of the noise standard deviation and N is the signal length. The estimate σ is typically obtained using the median absolute deviation (MAD) measure with respect to time, σ = MAD(X_DWT(n,m))/0.6745, evaluated at a specific scale level m.
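A sketch of Eq. (7) with the MAD-based noise estimate; the 0.6745 factor makes the MAD of zero-mean Gaussian data a consistent estimate of its standard deviation.

```python
import numpy as np

def universal_threshold(coeffs):
    """Eq. (7): T = sigma * sqrt(2 log N), with sigma estimated from the
    median absolute deviation of the (typically finest-scale) coefficients."""
    sigma = np.median(np.abs(coeffs)) / 0.6745
    return sigma * np.sqrt(2.0 * np.log(len(coeffs)))
```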

The SURE method (Donoho, 1995; Donoho and Johnstone, 1995; Johnstone and Silverman, 1997) uses the value

$$T = \arg\min_{0 \le T \le \sigma\sqrt{2\log N}} \left\{ \sigma^2 N + \sum_{n=1}^{N}\left[\min\!\left(x^2[n],\,T^2\right) - 2\sigma^2 I\!\left(|x[n]| \le T\right)\right]\right\}, \qquad (8)$$

where x[n] is the time-domain input signal and I is an indicator function. Note that the range over which the SURE threshold is considered is based on a maximum value equal to the universal threshold, so that the SURE threshold is always less than or equal to the universal threshold.
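Eq. (8) can be minimized by direct search: the risk is piecewise constant between the sorted coefficient magnitudes, so it suffices to evaluate it at those candidates. This sketch assumes the coefficients have been pre-scaled to unit noise variance, as the paper's wavelet methods do.

```python
import numpy as np

def sure_threshold(x):
    """Direct search for the SURE threshold of Eq. (8), with sigma = 1 and
    candidates capped at the universal threshold."""
    n = len(x)
    t_max = np.sqrt(2.0 * np.log(n))
    ax2 = np.sort(np.minimum(np.abs(x), t_max)) ** 2   # sorted squared magnitudes
    risks = []
    for i, t2 in enumerate(ax2):
        # sum of min(x^2, T^2): keep the i+1 values at or below T, clip the rest
        s = ax2[: i + 1].sum() + t2 * (n - i - 1)
        risks.append(n + s - 2.0 * (i + 1))            # Eq. (8) with sigma^2 = 1
    return float(np.sqrt(ax2[int(np.argmin(risks))]))
```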

Thresholding can be done universally across all wavelet decomposition levels, referred to as level-independent thresholding, or else the threshold can be varied at each level, referred to as level-dependent thresholding. In this case, the above formulas still apply, but with a level-dependent threshold Tm calculated at each level using a scale-dependent estimate σm = MAD(X_DWT(n,m))/0.6745.

Other thresholding approaches include minimax thresholding (Donoho, 1995), which sets a universal threshold independent of any signal information (and is therefore good primarily for completely unknown conditions), and heuristic SURE (Donoho, 1995), which uses a significance test to decide between the universal threshold and the SURE threshold.

The speech enhancement experiments presented here use the SURE method, with a level-independent soft threshold. Empirical evaluations across the different threshold selection methods for this task showed only minor variation in results, with a slight performance benefit for the SURE approach on speech enhancement tasks compared to the universal and heuristic SURE methods.

2.4. Comparative baseline methods

To evaluate the effectiveness of using this new BWT-based method for enhancement of speech signals, we compare it to other standard approaches on this task. This includes comparing to the spectral subtraction, Wiener filtering, and Ephraim Malah filtering methods, all common approaches within this area. In addition, we also compare to a perceptually scaled WPT denoising implementation. In this section some background on these existing enhancement methods is provided.

Spectral subtraction (Boll, 1979; Deller et al., 2000) is a straightforward technique based on literal subtraction of the Fourier Transform (FT) magnitude components of the estimated noise spectrum from the signal spectrum, on a frame by frame basis. Noise components are typically estimated from regions of the signal where there is no speech present, and phase characteristics are taken directly from the noisy FT. The resulting enhancement equation is given by

$$\hat{x} = \mathrm{IFFT}\left\{\sqrt{|Y(\omega)|^2 - |N(\omega)|^2}\;e^{j\angle Y(\omega)}\right\}. \qquad (9)$$
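A per-frame sketch of Eq. (9); the half-wave rectification of negative magnitude differences is a common practical safeguard rather than something the equation itself specifies.

```python
import numpy as np

def spectral_subtract_frame(y_frame, noise_mag2):
    """Eq. (9) on one frame: subtract the noise magnitude spectrum
    |N(w)|^2 (noise_mag2, estimated from a speech-free segment) and
    keep the noisy phase."""
    Y = np.fft.rfft(y_frame)
    mag2 = np.abs(Y) ** 2 - noise_mag2
    mag = np.sqrt(np.maximum(mag2, 0.0))      # half-wave rectify negatives
    X = mag * np.exp(1j * np.angle(Y))
    return np.fft.irfft(X, n=len(y_frame))
```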

Wiener filtering (Deller et al., 2000; Haykin, 1996) is accomplished by using the signal and noise spectral characteristics to estimate the optimal noise reduction filter, given by

$$H = \frac{S_x(\omega)}{S_x(\omega) + S_n(\omega)}, \qquad (10)$$

where Sx(ω) and Sn(ω) are the true power spectral densities of the clean signal and noise. As with spectral subtraction, the noise spectrum is typically estimated from regions of the signal where there is no speech present. However, since there is no direct way to get an estimate of the speech spectrum, this is usually accomplished via an iterative procedure where Sx(ω) is initialized using the noisy signal spectrum. In each iteration H is estimated via Eq. (10), after which the filter is applied and an improved estimate of the signal is used to determine a better clean signal spectrum estimate Sx(ω). This process is repeated until convergence, usually just a few iterations.
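The iterative procedure can be sketched in a few lines; the spectral-subtraction initialization of Sx and the small floor constant are our own practical choices.

```python
import numpy as np

def iterative_wiener(y, noise_psd, n_iter=3):
    """Iterative Wiener filtering via Eq. (10): initialize the clean-speech
    PSD from the noisy periodogram, filter, and re-estimate."""
    Y = np.fft.rfft(y)
    Sx = np.maximum(np.abs(Y) ** 2 - noise_psd, 1e-12)  # initial PSD estimate
    for _ in range(n_iter):
        H = Sx / (Sx + noise_psd)                        # Eq. (10)
        X = H * Y
        Sx = np.abs(X) ** 2                              # refined PSD estimate
    return np.fft.irfft(X, n=len(y))
```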

The Ephraim Malah filter approach (Ephraim and Malah, 1984) is based on deriving a minimum mean square error (MMSE) estimator for the clean speech spectral amplitudes X_k = A_k e^{jθ_k} (Ephraim and Malah, 1984) or log spectral amplitudes (LSA) (Ephraim and Malah, 1985), given a complex Gaussian random variable model for the Fourier Transform coefficients of both the clean speech and the noise, assuming independence across frequency bins. The LSA estimator gives somewhat better enhancement results, using the derived estimation formula for the clean signal Fourier transform coefficient in each frequency bin given by

$$\hat{A}_k = \frac{\xi_k}{1+\xi_k}\exp\!\left(\frac{1}{2}\int_{\nu_k}^{\infty}\frac{e^{-t}}{t}\,dt\right)R_k, \qquad (11)$$

$$\xi_k = \frac{\lambda_x(k)}{\lambda_n(k)}, \qquad \nu_k = \frac{\xi_k}{1+\xi_k}\,\gamma_k, \qquad \gamma_k = \frac{R_k^2}{\lambda_n(k)}, \qquad (12)$$

where Rk is the noisy speech Fourier transform magnitude in the kth frequency bin, and λn(k) and λx(k) are the average noise and signal powers in each bin. Noise power λn(k) is typically estimated from initial silence regions in the waveform, while λx(k) is a moving average of spectrally subtracted noisy spectra (R_k^2 − λn(k)). Improved results are obtained when the a priori SNR ξk is estimated directly and includes smoothing to implicitly adjust for speech presence probability, such as via the well-known "decision-directed" method. The overall result is an algorithm which adaptively tracks and adjusts estimates of both noise and signal amplitudes, and uses these estimates to adjust the degree of enhancement, which has significant impact on reducing artifacts often present in spectral subtraction and Wiener filtering. In this work, we use the standard Ephraim Malah implementation based on the LSA algorithm and decision-directed a priori SNR estimation (Ephraim and Malah, 1985).

Fig. 4. Block diagram of the new BWT enhancement algorithm: noisy speech y[n] → CWT (YWT[a,n]) → adapt to BWT (YBWT[a,n] = K[a,n] YWT[a,n]) → threshold (Y′BWT[a,n] = Thresh{YBWT[a,n]}) → revert to CWT (Y′WT[a,n] = (1/K[a,n]) Y′BWT[a,n]) → inverse CWT → enhanced speech y′[n].
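The exponential integral in Eq. (11) is available as `scipy.special.exp1`, so the LSA gain for a frame of frequency bins reduces to a short expression. This sketch computes only the gain; estimating ξ and γ (e.g. by the decision-directed method) is assumed to happen elsewhere.

```python
import numpy as np
from scipy.special import exp1  # E1(v) = integral from v to inf of e^-t / t dt

def lsa_gain(xi, gamma):
    """MMSE-LSA gain of Eqs. (11)-(12): the enhanced magnitude is this
    gain times the noisy magnitude R_k.
    xi: a priori SNR; gamma: a posteriori SNR (scalar or per-bin array)."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))
```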

For comparison to another wavelet-based approach, we have implemented a level 6 perceptually scaled WPT, following the critical-band filter bank arrangement used in (Cohen, 2001). The result is a 21-band decomposition that approximates the spacing of the auditory Bark scale, very similar to the spacing used for our BWT technique. A Daubechies-5 mother wavelet is used for the decomposition, and thresholding is accomplished with the same level-independent SURE approach used for the BWT.

In all four baseline methods, as well as the new approach being tested, an initial segment of the waveforms, which contains no speech, is used to estimate the noise levels. For the spectral subtraction and Wiener filtering methods, this consists of power spectral estimation using the Fourier transform, while for the Ephraim Malah filter this determines initial λn(k) and λx(k) values. For the wavelet methods this segment is used to re-scale the signal so as to provide an estimated unity noise variance, matching the assumption used in the SURE threshold selection algorithm.

3. Experimental method

The enhancement method presented here is based on applying a wavelet thresholding technique to BWT coefficients. A block diagram of the overall approach is shown in Fig. 4.

Continuous wavelet coefficients are computed using discrete convolutions at each of the 22 scales based directly on Eq. (1), with a 16 kHz sampling rate, the same rate as the speech data used for the experiments. Eqs. (4) and (5) are used to calculate the K factor representing the time-support adaptation of the continuous wavelet coefficients. The selection of the T0, G1, and G2 parameters for these equations is discussed in detail in Sections 4.1 and 4.2. To ensure stability of the overall signal variance for comparison to the threshold, the K factor term is shifted to a mean value of unity (variance/range are not adjusted). Thresholding is applied using the SURE thresholding method discussed in Section 2.3. Signal reconstruction is accomplished through another discrete convolution at each scale, followed by a weighted summation across the scales.
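The coefficient-domain part of this chain (adapt, threshold, revert) can be sketched on an array of precomputed CWT coefficients. Soft thresholding with a fixed T stands in here for the SURE selection; the function name and array layout are our own.

```python
import numpy as np

def bwt_denoise_coeffs(Y_wt, K, T):
    """Denoise a (scales x samples) array of CWT coefficients: adapt to
    the BWT domain with K, soft-threshold, then revert to the CWT domain."""
    K = K - K.mean() + 1.0                      # shift K to a unit mean value
    Y_bwt = K * Y_wt                            # adapt to BWT domain
    Y_thr = np.sign(Y_bwt) * np.maximum(np.abs(Y_bwt) - T, 0.0)
    return Y_thr / K                            # revert to CWT domain
```

With T = 0 the chain is the identity; reconstruction to a waveform then proceeds by the inverse CWT described above.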

3.1. Data sets and noise characteristics

Ten utterances taken from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (Garofolo et al., 1993) were used to evaluate the new algorithm. The sampling rate of the data is 16 kHz, and each sentence used has at least 100 ms of silence at the beginning of the utterance that can be used to estimate noise statistics for the comparative methods.

Two sets of additive noise experiments were implemented on this data. In the first, white Gaussian noise was added to the sentences at SNR levels of −10, −5, 0, +5, and +10 dB. In the second, specific noise characteristics, including F-16 cockpit noise, Volvo car interior noise, ambient pink noise, and babble noise (multiple talkers), were added at a 0 dB SNR level to evaluate how well the methods work with non-white and relatively non-stationary noise sources.

Signal-to-noise ratio (SNR) and segmental signal-to-noise ratio (SSNR) (Deller et al., 2000) are used as objective measurement criteria for both sets of experiments. SSNR is computed by calculating the SNR on a frame-by-frame basis over the signal and averaging these values, and has been shown to have a higher correlation with perceived quality than a direct SNR metric. The formula for SSNR is

SSNR = (1/M) Σ_{j=0}^{M−1} 10 log10 [ Σ_{n=Nj+1}^{N(j+1)} x²(n) / Σ_{n=Nj+1}^{N(j+1)} (x(n) − x̂(n))² ],    (13)

where M is the number of frames, each of length N, and x(n) and x̂(n) are the original and enhanced signals, respectively. SNR is computed using the inner term of the above equation, summed across the entire signal.
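Eq. (13), and the global SNR computed from its inner term, translate directly into code; this sketch (helper names ours) truncates any partial final frame.

```python
import numpy as np

def snr_db(x, x_hat):
    # Global SNR: signal energy over error energy, in dB
    return 10 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

def ssnr_db(x, x_hat, frame_len):
    """Segmental SNR per Eq. (13): frame-wise SNR in dB, averaged
    over the M complete frames of length N = frame_len."""
    m = len(x) // frame_len            # number of complete frames
    vals = []
    for j in range(m):
        seg = slice(j * frame_len, (j + 1) * frame_len)
        err = np.sum((x[seg] - x_hat[seg]) ** 2)
        vals.append(10 * np.log10(np.sum(x[seg] ** 2) / err))
    return float(np.mean(vals))

# Example: sine plus noise at a known 20 dB global SNR
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)
x_hat = x + rng.normal(scale=np.sqrt(0.005), size=x.size)
```

Here `snr_db(x, x_hat)` comes out near 20 dB; the segmental version weights each frame equally, so quiet frames count more than they do in the global metric.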

For the non-stationary noise cases, a subjective perceptual measure called Mean Opinion Score (MOS) (Deller et al., 2000) was used to augment the objective SNR and SSNR measures for evaluating the perceived quality of


M.T. Johnson et al. / Speech Communication 49 (2007) 123–133 129

the enhanced waveforms. MOS is computed by having a group of listeners rate the quality of the speech on a five-point scale, then averaging the results. For these tests we used a group of 10 listeners in calculating MOS results. For all measures, results are averaged across the 10 utterances used as examples, giving a single evaluation metric for each method and noise type combination.

For spectral subtraction, Wiener filtering, and Ephraim Malah filtering, the signal is divided into 25 ms windows with 12 ms overlap between frames. Frequency analysis is done using a Hanning window, and noise estimation is accomplished using the first three frames of the signal. Coefficient thresholding for both the perceptually scaled WPT and the BWT is done using a soft level-independent thresholding function based on the SURE technique, as described in Section 2.4. Implementation was done using the Matlab Wavelet toolbox (The MathWorks Inc., 2003).
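The soft, level-independent SURE thresholding applied to the wavelet coefficients can be sketched as follows, assuming the coefficients have already been re-scaled to unit noise variance. This follows the Donoho–Johnstone "rigrsure"-style rule in spirit; it is not a transcription of the Matlab toolbox internals.

```python
import numpy as np

def sure_threshold(coeffs):
    """Choose the soft threshold minimizing Stein's Unbiased Risk
    Estimate for unit-variance noise. Candidate thresholds are the
    sorted coefficient magnitudes."""
    a = np.sort(np.abs(coeffs)) ** 2       # sorted squared magnitudes
    n = a.size
    cum = np.cumsum(a)
    k = np.arange(1, n + 1)
    # SURE(t) = n - 2*#{|c| <= t} + sum_i min(c_i^2, t^2), at t^2 = a[k-1]
    risks = n - 2 * k + cum + (n - k) * a
    return float(np.sqrt(a[np.argmin(risks)]))

def soft(c, t):
    # Soft shrinkage toward zero by t
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

# Example: a few large "signal" coefficients buried in unit noise
rng = np.random.default_rng(2)
c = rng.normal(size=2048)
c[:50] += 8.0
thr = sure_threshold(c)
denoised = soft(c, thr)
```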

4. Results

The implementation of the BWT decomposition depends strongly on three primary parameters: T0, G1, and G2. As discussed in Section 2.2, the T0 parameter controls the base time support of the un-adapted mother wavelet, while G1 and G2 control active gain factors representing outer hair cell resistance and basilar membrane compliance, respectively. These parameters have been investigated in detail, and are discussed in Sections 4.1 and 4.2. Overall results of the enhancement experiments are presented in Section 4.3.

Fig. 5. SNR and SSNR versus T0 for AWGN with initial SNR = 0 dB.

4.1. T0 effects

The expression for a standard real Morlet wavelet is

φ(t) = e^(−t²/2) cos(5t)    (14)

with frequency ω0 = 5, time support T0 = √2, and a net time–frequency product of T0ω0 = 5√2 ≈ 7.07. Scaling this to the base frequency of 15,165.4 Hz used in the BWT, the effective time support to match the standard Morlet would be T0 = (5√2)/(2π(15,165.4)) ≈ 0.00007421. From this it can be seen that the T0 value of 0.0005 from Yao and Zhang's original work (Yao and Zhang, 2001) is about seven times longer than that of a standard Morlet wavelet, resulting in several additional cycles in the initial mother wavelet.
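This arithmetic is easy to check numerically:

```python
import numpy as np

# Time-frequency product of the standard real Morlet wavelet (Eq. (14))
w0 = 5.0
T0_morlet = np.sqrt(2.0)
product = w0 * T0_morlet                    # 5*sqrt(2), about 7.07

# Matching time support at the BWT base frequency of 15,165.4 Hz
f_base = 15165.4
T0_bwt = product / (2 * np.pi * f_base)     # about 0.00007421 s

# The 0.0005 s value from Yao and Zhang (2001) is roughly 6.7x longer
ratio = 0.0005 / T0_bwt
```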

Since the primary adaptation mechanism involves variation of the wavelet time support, the impact of the initial T0 value was investigated. This was done by turning off the adaptation mechanism (i.e. G1 = G2 = 0 so that T(a,s) = 1) and investigating the SNR and SSNR resulting from thresholding the discretized CWT coefficients directly, for the AWGN noise case with input SNR = 0 dB. Results are shown in Fig. 5.

It can be seen from these plots that there is a substantial range across which the overall results are consistent, while either extremely long or extremely short time-support values substantially hurt performance. In accordance with this result, and because the time-support value corresponding to a standard Morlet wavelet is in the middle part of the stable range indicated by these



results, it was decided to use this standard Morlet time-support value rather than one with extended time support, i.e. we selected T0 = 0.00007421 for use in these experiments.

4.2. G1 and G2 effects

G1 and G2 are parameters of a non-linear adaptation model, and thus are sensitive to signal amplitude characteristics. Each of these factors independently controls the amount of adaptation for a specific component of the auditory model, with very small values resulting in almost no adaptation (neither doing harm nor providing assistance to the underlying wavelet decomposition) and very large values resulting in saturation effects that lead to excessive coefficient fluctuation and poor overall results. This was examined by varying the two factors together and plotting the corresponding SNR and SSNR results, again for the AWGN noise case with input SNR = 0 dB. Results are shown in Fig. 6.

Fig. 6. SNR and SSNR versus G1 and G2 for AWGN with initial SNR = 0 dB.

The overall result indicated above is that the SNR and SSNR results of the BWT enhancement algorithm are stable over a fairly wide range of parameter values, indicating that the exact choice of values is not critical. For the final implementation and evaluation, we selected values of G1 = 0.6 and G2 = 75, which are located at an approximate local maximum near the center of the stable region, as indicated on the axes in Fig. 6.
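The parameter study described above amounts to a two-dimensional grid search. The sketch below shows the structure; the objective is a toy stand-in for running the full enhancement and scoring SNR/SSNR, with its peak placed near the values reported in Fig. 6.

```python
import numpy as np

def parameter_sweep(evaluate, g1_grid, g2_grid):
    """Exhaustive sweep over (G1, G2) pairs, keeping the best-scoring
    pair. `evaluate` is a caller-supplied objective function."""
    best = (None, None, -np.inf)
    for g1 in g1_grid:
        for g2 in g2_grid:
            score = evaluate(g1, g2)
            if score > best[2]:
                best = (g1, g2, score)
    return best

# Toy objective: broad plateau peaking near (G1, G2) = (0.6, 75)
toy = lambda g1, g2: -((g1 - 0.6) ** 2 + ((g2 - 75) / 100.0) ** 2)
g1_best, g2_best, _ = parameter_sweep(toy,
                                      np.arange(0.0, 1.01, 0.1),
                                      np.arange(20, 101, 5))
```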

4.3. Overall results compared to baseline methods

4.3.1. AWGN noise condition across range of SNR values

SNR and SSNR results for the white noise experiments are shown in Figs. 7 and 8. Methods compared include Ephraim Malah filtering, iterative Wiener filtering, spectral subtraction, perceptually scaled WPT denoising (PWT), and the proposed BWT denoising (BWT), as well as the original noisy SNR and SSNR values for reference, to indicate the degree of improvement. From these figures, the proposed BWT thresholding method and the Ephraim Malah filter clearly have the best performance for this noise condition. Each gave about 12 dB improvement at the lower SNRs, decreasing to about 8 dB improvement at the higher SNRs. For SSNR, the improvement ranged from 6 to 9 dB at the lower SNRs, decreasing to 4–5 dB at the higher SNRs. The BWT method shows the best SSNR improvement by a substantial margin in the −10 dB noise case, obtaining nearly 3 dB better performance than any of the other methods, and is better at the −5 dB condition as well. Note that the other wavelet-based approach, the PWT method, was a close third behind the Ephraim Malah and BWT results, and in fact did slightly better than the Ephraim Malah method in terms of SSNR for the worst noise case. At the +5 dB and +10 dB SNR noise cases, the Ephraim Malah filter shows stronger performance than the other methods tested, including the proposed BWT method.

Fig. 7. SNR results for the white noise case at −10, −5, 0, +5, and +10 dB SNR levels.

Fig. 8. SSNR results for the white noise case at −10, −5, 0, +5, and +10 dB SNR levels.

Fig. 9. SNR comparisons for varying noise conditions at 0 dB SNR.

Fig. 10. SSNR comparisons for varying noise conditions at 0 dB SNR.

4.3.2. Realistic noise conditions at 0 dB SNR

SNR and SSNR improvement across the varying realistic noise conditions at 0 dB SNR are shown in Figs. 9 and 10. Here the results are given as net improvement, so that relative effectiveness can be seen for all four noise conditions as a function of enhancement method. Ephraim Malah filtering substantially outperforms the other methods in nearly all cases. The BWT thresholding approach outperforms the remaining three, but is competitive with Ephraim Malah only for the F-16 cockpit noise and pink noise conditions. In a single instance, the SSNR result for the pink noise condition, the BWT method outperforms Ephraim Malah filtering. One interesting observation is that the PWT perceptually scaled wavelet denoising approach does not compare as favorably for these realistic noise conditions as it did for the AWGN noise, with results closer to those of iterative Wiener filtering than to the BWT and Ephraim Malah methods, and with particularly poor performance in the car interior condition.

4.3.3. Mean opinion score results across all test conditions

Subjective results using Mean Opinion Scores (MOS) for these same conditions are shown in Fig. 11 and Table 1. Ten subjects were surveyed and asked to rate the sentences for quality, using a standard MOS scale where 5 indicates excellent quality with imperceptible distortion and 1 indicates unsatisfactory quality with annoying and objectionable distortion.

There are several interesting differences between the MOS results and the SSNR results. While the relative MOS for the Ephraim Malah and BWT methods is largely in line with the corresponding SSNR values, both the iterative Wiener filter and the perceptually scaled wavelet denoising received much higher ratings than their SSNR values would have suggested. The net result is that the iterative Wiener filtering method was second to Ephraim Malah in overall opinion score, followed by the BWT and PWT methods. The most likely explanation for the differences

Fig. 11. MOS comparisons for varying noise conditions at 0 dB SNR.

Table 1
Average MOS by enhancement method

Method                            Average MOS
Ephraim Malah                     3.23
Iterative Wiener filter           3.03
Bionic wavelet transform (BWT)    2.80
Perceptual wavelet packet (PWT)   2.41
Spectral subtraction              2.07

between the SSNR and MOS results (aside from the relatively small sample size of the MOS survey) is the presence or absence of short-term artifacts in the enhanced waveforms, which are known to have a significant impact on perception but only a mild influence on SNR and SSNR values.

5. Conclusions

A new method for speech signal enhancement using wavelets has been presented. This method is based on the Bionic Wavelet Transform and its incorporated Giguere–Woodland auditory model, resulting in a transform with perceptually motivated frequency scaling and adaptive time support at each scale, corresponding to a non-linear model of the underlying cochlear system. Enhancement results demonstrate performance that is competitive with some of the best methods in the signal processing field, including Ephraim Malah filtering.

Future work on this approach will include the development of time-varying thresholding techniques based on tracking the a priori SNR, such as those used in the Ephraim Malah comparative method and in some recent perceptual wavelet denoising techniques (Cohen, 2001), as well as continued work in generalizing the BWT itself for enhancement and speech intelligibility improvement. Given the strong initial performance of the Bionic Wavelet Transform representation, continued work in this direction is expected to lead to additional improvements in overall signal enhancement.

References

Antoniadis, A., Oppenheim, G. (Eds.), 1995. Wavelets and Statistics. Springer-Verlag, New York.

Bahoura, M., Rouat, J., 2001. Wavelet speech enhancement based on the Teager energy operator. IEEE Signal Process. Lett. 8 (1), 10–12.

Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27, 113–120.

Chen, S.-H., Chau, S.Y., Wang, J.-F., 2004. Speech enhancement using perceptual wavelet packet decomposition and Teager energy operator. J. VLSI Signal Process. Systems 36 (2–3), 125–139.

Cohen, I., 2001. Enhancement of speech using Bark-scaled wavelet packet decomposition. Paper presented at Eurospeech 2001, Denmark.

Daubechies, I., 1992. Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics, Philadelphia.

Debnath, L., 2002. Wavelet Transforms and their Applications. Birkhauser, Boston.

Deller, J.R., Hansen, J.H.L., Proakis, J.G., 2000. Discrete-Time Processing of Speech Signals, second ed. IEEE Press, New York.

Donoho, D.L., 1995. Denoising by soft thresholding. IEEE Trans. Inform. Theory 41 (3), 613–627.

Donoho, D.L., Johnstone, I.M., 1995. Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90, 1200–1224.

Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (6), 1109–1121.

Ephraim, Y., Malah, D., 1985. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. ASSP-33 (2), 443–445.

Fu, Q., Wan, E.A., 2003. Perceptual wavelet adaptive denoising of speech. Paper presented at Eurospeech, Geneva.


Gabor, D., 1946. Theory of communications. J. Inst. Electr. Eng. Lond. 93, 429–457.

Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallett, D., Dahlgren, N., et al., 1993. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium.

Giguere, C., 1993. Speech processing using a wave digital filter model of the auditory periphery. Ph.D. thesis, University of Cambridge, Cambridge, UK.

Giguere, C., Woodland, P.C., 1994. A computational model of the auditory periphery for speech and hearing research. J. Acoust. Soc. Amer. 95 (1), 331–342.

Guo, D., Zhu, W., Gao, Z., Zhang, J., 2000. A study of wavelet thresholding denoising. Paper presented at the International Conference on Signal Processing, Beijing, PR China.

Haykin, S., 1996. Adaptive Filter Theory, third ed. Prentice Hall, Upper Saddle River, New Jersey.

Hu, Y., Loizou, P.C., 2004. Speech enhancement based on wavelet thresholding the multitaper spectrum. IEEE Trans. Speech Audio Process. 12 (1), 59–67.

Jaffard, S., Meyer, Y., Ryan, R.D., 2001. Wavelets: Tools for Science and Technology. Society for Industrial and Applied Mathematics, Philadelphia.

Johnstone, I.M., Silverman, B.W., 1997. Wavelet threshold estimators for data with correlated noise. J. Roy. Statist. Soc., Ser. B (Gen.) 59, 319–351.

Lu, C.-T., Wang, H.-C., 2003. Enhancement of single channel speech based on masking property and wavelet transform. Speech Commun. 41 (2–3), 409–427.

The MathWorks Inc., 2003. Matlab.

Walnut, D.F., 2002. An Introduction to Wavelet Analysis. Birkhauser, Boston.

Yao, J., 2001. An active model for otoacoustic emissions and its application to time–frequency signal processing. Ph.D. thesis, The Chinese University of Hong Kong, Hong Kong.

Yao, J., Zhang, Y.T., 2001. Bionic wavelet transform: a new time–frequency method based on an auditory model. IEEE Trans. Biomed. Eng. 48 (8), 856–863.

Yao, J., Zhang, Y.T., 2002. The application of bionic wavelet transform to speech signal processing in cochlear implants using neural network simulations. IEEE Trans. Biomed. Eng. 49 (11), 1299–1309.

Zheng, L., Zhang, Y.-T., Yang, F.-S., Ye, D.-T., 1999. Synthesis and decomposition of transient-evoked otoacoustic emissions based on an active auditory model. IEEE Trans. Biomed. Eng. 46 (9), 1098–1106.