
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. X, NO. X, MONTH YEAR

Joint Dereverberation and Residual Echo Suppression of Speech Signals in Noisy Environments

Emanuel A. P. Habets (Member, IEEE), Sharon Gannot (Senior Member, IEEE), Israel Cohen (Senior Member, IEEE), Piet C. W. Sommen

Abstract— Hands-free devices are often used in a noisy and reverberant environment. Therefore, the received microphone signal does not only contain the desired near-end speech signal but also interferences such as room reverberation that is caused by the near-end source, background noise, and a far-end echo signal that results from the acoustic coupling between the loudspeaker and the microphone. These interferences degrade the fidelity and intelligibility of near-end speech. In the last two decades post-filters have been developed that can be used in conjunction with a single microphone acoustic echo canceller to enhance the near-end speech. In previous works spectral enhancement techniques have been used to suppress residual echo and background noise for single microphone acoustic echo cancellers. However, dereverberation of the near-end speech was not addressed in this context. Recently, practically feasible spectral enhancement techniques to suppress reverberation have emerged. In this paper we derive a novel spectral variance estimator for the late reverberation of the near-end speech. The advantage of the developed estimator is that it can be used when the source-microphone distance is smaller than the critical distance. Residual echo will be present at the output of the acoustic echo canceller in case the acoustic echo path cannot be completely modelled by the adaptive filter. A spectral variance estimator for the so-called late residual echo that results from the deficient length of the adaptive filter is derived. Both estimators are based on a statistical reverberation model. The model parameters depend on the reverberation time of the room, which can be obtained using the estimated acoustic echo path. A novel post-filter is developed which suppresses late reverberation of the near-end speech, residual echo, and background noise, and maintains a constant residual background noise level. Experimental results demonstrate the beneficial use of the developed system for reducing reverberation, residual echo and background noise.

Index Terms—Dereverberation, Residual Echo Suppression, Acoustic Echo Cancellation.

I. INTRODUCTION

CONVENTIONAL and mobile telephones are often used in a noisy and reverberant environment. When such a device is used in hands-free mode the distance between the desired speaker (commonly called near-end speaker) and the microphone is usually larger than the distance encountered in handset mode. Therefore, the received microphone signal is degraded by the acoustic echo of the far-end speaker, room reverberation and background noise. This signal degradation may lead to total unintelligibility of the near-end speaker.

Acoustic echo cancellation is the most important and well-known technique to cancel the acoustic echo [1]. This technique enables one to conveniently use a hands-free device while maintaining a high user satisfaction in terms of low speech distortion, high speech intelligibility, and acoustic echo attenuation. The acoustic echo cancellation problem is usually solved by using an adaptive filter in parallel to the acoustic echo path [1,2,3,4]. The adaptive filter is used to generate a signal that is a replica of the acoustic echo signal. An estimate of the near-end speech signal is then obtained by subtracting the estimated acoustic echo signal, i.e., the output of the adaptive filter, from the microphone signal. Sophisticated control mechanisms have been proposed for fast and robust adaptation of the adaptive filter coefficients in realistic acoustic environments [4,5]. In practice there is always residual echo, i.e., echo that is not suppressed by the echo cancellation system. The residual echo results from i) the deficient length of the adaptive filter, ii) the mismatch between the true and the estimated echo path, and iii) non-linear signal components.

Manuscript received MONTH DAY, YEAR; revised MONTH DAY, YEAR. This work was supported by Technology Foundation STW, applied science division of NWO and the technology programme of the Ministry of Economic Affairs, and by the Israel Science Foundation (grant no. 1085/05).

E. A. P. Habets and S. Gannot are with the School of Engineering, Bar-Ilan University, Ramat-Gan, Israel.

I. Cohen is with the Dept. of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel.

P. C. W. Sommen is with the Dept. of Electrical Engineering, Technische Universiteit Eindhoven, Eindhoven, The Netherlands.

It is widely accepted that echo cancellers alone do not provide sufficient echo attenuation [3,4,5,6]. Turbin et al. compared three post-filtering techniques to reduce the residual echo and concluded that the spectral subtraction technique, which is commonly used for noise suppression, was the most efficient [7]. In a reverberant environment there can be a large amount of so-called late residual echo due to the deficient length of the adaptive filter. In [6] Enzner proposed a recursive estimator for the short-term Power Spectral Density (PSD) of the late residual echo signal using an estimate of the reverberation time of the room. The reverberation time was estimated directly from the estimated echo path. The late residual echo was suppressed by a spectral enhancement technique using the estimated short-term PSD of the late residual echo signal.

In some applications, like hands-free terminal devices, noise reduction becomes necessary due to the relatively large distance between the microphone and the speaker. The first attempts to develop a combined echo and noise reduction system can be attributed to Grenier et al. [8,9] and to Yasukawa [10]. Both employ more than one microphone. A survey of these systems can be found in [4,11]. Beaugeant et al. [12] used a single Wiener filter to simultaneously suppress the echo and noise. In addition, psychoacoustic properties were considered in order to improve the quality of the near-end speech signal. They concluded that such an approach is only suitable if the noise power is sufficiently low. In [13] Gustafsson et al. proposed two post-filters for residual echo and noise reduction. The first post-filter was based on the Log Spectral Amplitude estimator [14] and was extended to attenuate multiple interferences. The second post-filter was psychoacoustically motivated.

In case the hands-free device is used in a noisy reverberant environment, the acoustic path becomes longer, and the microphone signal contains reflections of the near-end speech signal as well as noise. Martin and Vary proposed a system for joint acoustic echo cancellation, dereverberation and noise reduction using two microphones [15]. A similar system was developed by Dörbecker and Ernst in [16]. In both papers dereverberation was performed by exploiting the coherence between the two microphones as proposed by Allen et al. in [17]. Bloom [18] found that this dereverberation approach had no statistically significant effect on intelligibility, even though the measured average reverberation time and the perceived reverberation time were considerably reduced by the processing. It should however be noted that most hands-free devices are equipped with a single microphone.


A single microphone approach for dereverberation is the application of complex cepstral filtering of the received signal [19]. Bees et al. [20] demonstrated that this technique is not useful for dereverberating continuous reverberant speech due to so-called segmentation errors. They proposed a novel segmentation and weighting technique to improve the accuracy of the cepstrum. Cepstral averaging then allows identification of the Acoustic Impulse Response (AIR). Yegnanarayana and Murthy [21] proposed another single microphone dereverberation technique in which a time-varying weighting function was applied to the Linear Prediction (LP) residual signal. The weighting function depends on the Signal to Reverberation Ratio (SRR) of the reverberant speech signal and was calculated using the characteristics of the reverberant speech in different SRR regions. Unfortunately, these techniques are not accurate enough in a practical situation and do not fit in the framework of the post-filter, which is commonly formulated in the frequency domain. Recently, practically feasible single microphone speech dereverberation techniques have emerged. Lebart proposed a single microphone dereverberation method based on spectral subtraction of late reverberant energy [22]. The late reverberant energy is estimated using a statistical model of the AIR. This method was extended to multiple microphones by Habets [23]. Recently, Wen et al. presented results obtained from a listening test using the algorithm developed by Habets [24]. These results showed that the algorithm in [23] can significantly increase the subjective speech quality. The methods in [22] and [23] do not require an estimate of the AIR. However, they do require an estimate of the reverberation time of the room, which might be difficult to estimate blindly. Furthermore, both methods do not consider any interferences and implicitly assume that the source-receiver distance is larger than the so-called critical distance, which is the distance at which the direct path energy is equal to the energy of all reflections. In case the source-receiver distance is smaller than the critical distance the contribution of the direct path results in over-estimation of the reverberant energy. Since this is the case in many hands-free applications the latter problems need to be addressed.

In this paper we develop a post-filter which follows the traditional single microphone Acoustic Echo Canceller (AEC). The developed post-filter jointly suppresses reverberation of the near-end speaker, residual echo and background noise. In Section II the problem is formulated. The near-end speech signal is estimated using an Optimally-Modified Log Spectral Amplitude (OM-LSA) estimator which requires an estimate of the spectral variance of each interference. This estimator is briefly discussed in Section III. In addition, we discuss the estimation of the a priori Signal to Interference Ratio (SIR), which is necessary for the OM-LSA estimator. The late residual echo and the late reverberation spectral variance estimators require an estimate of the reverberation time. A major advantage of the hands-free scenario is that, due to the existence of the echo, an estimate of the reverberation time can be obtained from the estimated acoustic echo path. In Section IV we derive a spectral variance estimator for the late residual echo using the same statistical model of the AIR that is used in the derivation of the late reverberant spectral variance estimator. In Section V the estimation of the late reverberant spectral variance in the presence of additional interferences and the direct path is investigated. An outline of the algorithm and discussions are presented in Section VI. Experimental results that demonstrate the beneficial use of the developed post-filter are presented in Section VII.

II. PROBLEM FORMULATION

An AEC with post-filter and a Loudspeaker Enclosure Microphone (LEM) system are depicted in Fig. 1. The microphone signal is denoted by y(n) and consists of a reverberant speech component z(n), an acoustic echo d(n), and a noise component v(n), where n denotes the discrete time index.

Fig. 1. Acoustic echo canceller with post-filter.

The reverberant speech component z(n) results from the convolution of the acoustic impulse response, denoted by a(n), and the anechoic near-end speech signal s(n).

In this work we assume that the coupling between the loudspeaker and the microphone can be described by a linear system that can be modelled by a finite impulse response. The acoustic echo signal d(n) is then given by

d(n) = Σ_{j=0}^{N_h−1} h_j(n) x(n−j),    (1)

where h_j(n) denotes the jth coefficient of the acoustic echo path at time n, N_h is the length of the acoustic echo path, and x(n) denotes the far-end speech signal.

In a reverberant room the length of the acoustic echo path is approximately given by f_s T60, where f_s denotes the sampling frequency in Hz and T60 denotes the reverberation time in seconds [2]. At a sampling frequency of 8 kHz, the length of the acoustic echo path in an office with a reverberation time of 0.5 seconds would be approximately 4000 coefficients. Due to practical reasons, e.g., computational complexity and required convergence time, the length of the adaptive filter, denoted by N_e, is smaller than N_h. The tail part of the acoustic echo path has a very specific structure. In Section IV it is shown that this structure can be exploited to estimate the spectral variance of the late residual echo, which is related to the part of the acoustic echo path that is not modelled by the adaptive filter. As an example we use a standard Normalized Least Mean Square (NLMS) algorithm to estimate part of the acoustic echo path h. The update equation for the NLMS algorithm is given by

ĥ_e(n+1) = ĥ_e(n) + µ x(n) e(n) / (x^T(n) x(n) + δ_NLMS),    (2)

where ĥ_e(n) = [ĥ_{e,0}(n), ĥ_{e,1}(n), ..., ĥ_{e,N_e−1}(n)]^T is the estimated impulse response vector, µ (0 < µ < 2) denotes the step-size, δ_NLMS (δ_NLMS > 0) the regularization factor, and x(n) = [x(n), ..., x(n − N_e + 1)]^T denotes the far-end speech signal state-vector. It should be noted that other, more advanced, algorithms can be used, e.g., Recursive Least Squares (RLS) or Affine Projection (AP), see for example [4] and the references therein. Since ĥ_e(n) is sparse, one might use the Improved Proportionate NLMS (IPNLMS) algorithm proposed by Benesty and Gay [25]. These advanced techniques are beyond the scope of this paper, which focuses on the post-filter.

The estimated echo signal can be calculated using

d̂(n) = Σ_{j=0}^{N_e−1} ĥ_{e,j}(n) x(n−j).    (3)
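For illustration, the NLMS recursion in (2) and the echo estimate in (3) can be written as a short NumPy routine. This is only a sample-by-sample sketch; the default step size and regularization constant are assumed values, not the settings used in the experiments.

```python
import numpy as np

def nlms_step(h_e, x_state, y_n, mu=0.35, delta_nlms=1e-6):
    """One NLMS iteration: echo estimate (3) and filter update (2).

    h_e     : current filter estimate, shape (Ne,)
    x_state : far-end state vector [x(n), ..., x(n - Ne + 1)], shape (Ne,)
    y_n     : microphone sample y(n)
    Returns (updated filter, echo estimate d_hat, error e(n)).
    """
    d_hat = np.dot(h_e, x_state)                   # (3): estimated echo d_hat(n)
    e_n = y_n - d_hat                              # error signal e(n) = y(n) - d_hat(n)
    norm = np.dot(x_state, x_state) + delta_nlms   # regularized input power
    h_e = h_e + mu * e_n * x_state / norm          # (2): NLMS update
    return h_e, d_hat, e_n
```

In a full simulation this step would be run sample by sample while the far-end state vector is shifted by one sample per iteration.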


The residual echo signal can now be defined as

e_r(n) ≜ d(n) − d̂(n).    (4)

In general the residual echo signal e_r(n) is not zero because of the deficient length of the adaptive filter, the system mismatch, and non-linear signal components that cannot be modelled by the linear adaptive filter. While many residual echo suppression methods [5,7] focus on the residual echo that results from the system mismatch, we focus on the late residual echo that results from an adaptive filter of deficient length.

Double-talk occurs during periods when the far-end speaker and the near-end speaker are talking simultaneously, and can seriously affect the convergence and tracking ability of the adaptive filter. Double-talk detectors and optimal step-size control methods have been presented to alleviate this problem [4,5,26,27]. These methods are out of the scope of this article. In this paper we adapt the filter in those periods where only the far-end speech signal is active. These periods have been chosen by using an energy detector that was applied to the near-end speech signal.

The ultimate goal is to obtain an estimate of the anechoic speech signal s(n). While the AEC estimates and subtracts the far-end echo signal, a post-filter is used to suppress the residual echo and background noise. The post-filter is usually designed to estimate the reverberant speech signal z(n) or the noisy reverberant speech signal z(n) + v(n). The reverberant speech signal z(n) can be divided into two components: i) the early speech component z_e(n), which consists of a direct sound and early reverberation that is caused by early reflections, and ii) the late reverberant speech component z_r(n), which consists of late reverberation that is caused by the reflections that arrive after the early reflections, i.e., late reflections. Independent research [24,28,29] has shown that speech quality and intelligibility are most affected by late reverberation. In addition, it has been shown that the first reflections that arrive shortly after the direct path usually contribute to speech intelligibility. Therefore, we focus on the estimation of the early speech component z_e(n).

The observed microphone signal y(n) can be written as

y(n) = z(n) + d(n) + v(n)
     = z_e(n) + z_r(n) + d(n) + v(n).    (5)

Using (4) and (5) the error signal e(n) can be written as

e(n) = y(n) − d̂(n)
     = z_e(n) + z_r(n) + e_r(n) + v(n).    (6)

Using the Short-Time Fourier Transform (STFT), we have in the time-frequency domain

E(l,k) = Z_e(l,k) + Z_r(l,k) + E_r(l,k) + V(l,k),    (7)

where k represents the frequency bin, and l the time frame. In the next section we show how the spectral component Z_e(l,k) can be estimated.

III. GENERALIZED POST-FILTER

In this section the post-filter is developed that is used to jointly suppress late reverberation, residual echo and background noise. Regarding the suppression of residual echo and noise, Gustafsson et al. [30] and Jeannes et al. [11] concluded that the best result is obtained by suppressing both interferences together after the AEC. The main advantage of this approach is that the residual echo and noise suppression does not suffer from the existence of a strong acoustic echo component. Furthermore, the AEC does not suffer from the time-varying noise suppression. A disadvantage is that the input signal of the AEC has a low Signal to Noise Ratio (SNR). To overcome this problem, algorithms have been proposed where, besides the joint suppression, a noise reduced signal is used to adapt the echo canceller [31].

Here, a modified version of the OM-LSA estimator [32] is used to obtain an estimate of the spectral component Z_e(l,k). Given two hypotheses, H_0(l,k) and H_1(l,k), which indicate early speech absence and early speech presence, respectively, we have

H_0(l,k): E(l,k) = Z_r(l,k) + E_r(l,k) + V(l,k),
H_1(l,k): E(l,k) = Z_e(l,k) + Z_r(l,k) + E_r(l,k) + V(l,k).

Let us define the spectral variances of the early speech component, the late reverberant speech component, the residual echo signal and the background noise as λ_ze, λ_zr, λ_er and λ_v, respectively. The a posteriori SIR is then defined as

γ(l,k) = |E(l,k)|² / (λ_zr(l,k) + λ_er(l,k) + λ_v(l,k)),    (8)

and the a priori SIR is defined as

ξ(l,k) = λ_ze(l,k) / (λ_zr(l,k) + λ_er(l,k) + λ_v(l,k)).    (9)

The spectral variance λ_v(l,k) of the background noise v(n) can be estimated directly from the error signal e(n), e.g., by using the method proposed by Martin in [33] or by using the Improved Minima Controlled Recursive Averaging (IMCRA) algorithm proposed by Cohen [34]. The latter method was used in our experimental study. The spectral variance estimators for λ_er(l,k) and λ_zr(l,k) are derived in Sections IV and V, respectively. The a priori SIR cannot be calculated directly since the spectral variance λ_ze(l,k) is unobservable. Different estimators can be used to estimate the a priori SIR, e.g., the Decision-Directed estimator developed by Ephraim and Malah [35] or the recursive causal or non-causal estimators developed by Cohen [36]. In the sequel the Decision-Directed estimator is used for the estimation of the a priori SIR. The Decision-Directed based estimator is given by [35]

ξ̂(l,k) = max{ η |Ẑ_e(l−1,k)|² / λ(l−1,k) + (1 − η) max{ψ(l,k), 0}, ξ_min },    (10)

where ψ(l,k) = γ(l,k) − 1 is the instantaneous SIR,

λ(l,k) ≜ λ_zr(l,k) + λ_er(l,k) + λ_v(l,k),    (11)

and ξ_min is a lower bound on the a priori SIR. The weighting factor η (0 ≤ η ≤ 1) controls the tradeoff between the amount of noise reduction and distortion. A typical weighting factor is η = 0.98. Although (10) can be used to calculate the total a priori SIR, it does not allow different tradeoffs to be made for each interference. One can gain more control over the estimation of the a priori SIR by estimating it separately for each interference. More information regarding this, and regarding combining the separate a priori SIRs, can be found in Appendix A.

When the early speech component z_e(n) is assumed to be active, i.e., H_1(l,k) is assumed to be true, the Log Spectral Amplitude (LSA) gain function is used. Under the assumption that z_e(n) and the interference signals are mutually uncorrelated, the LSA gain function is given by [14]

G_H1(l,k) = (ξ(l,k) / (1 + ξ(l,k))) exp( (1/2) ∫_{ζ(l,k)}^{∞} (e^{−t} / t) dt ),    (12)

where

ζ(l,k) = (ξ(l,k) / (1 + ξ(l,k))) γ(l,k).    (13)
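The exponential integral in (12) is available in SciPy as exp1, so the hypothetical gain G_H1 of (12)-(13) can be evaluated as in the following sketch.

```python
import numpy as np
from scipy.special import exp1  # E1(x) = integral_x^inf e^{-t}/t dt

def lsa_gain(xi, gamma):
    """LSA gain G_H1 of (12)-(13) for arrays of a priori/a posteriori SIRs."""
    zeta = xi / (1.0 + xi) * gamma              # (13)
    zeta = np.maximum(zeta, 1e-10)              # avoid exp1(0) = inf
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(zeta))  # (12)
```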


When the early speech component z_e(n) is assumed to be inactive, i.e., H_0(l,k) is assumed to be true, a lower-bound G_H0(l,k) is applied. In many cases the lower-bound G_H0(l,k) = G_min is used, where G_min specifies the maximum amount of interference reduction. To avoid speech distortions, G_min is usually set between -12 and -18 dB. However, in practice the residual echo and late reverberation need to be reduced by more than 12-18 dB. Due to the constant lower-bound, the residual echo will still be audible in some time-frequency frames [32]. Therefore, G_H0(l,k) should be chosen such that the residual echo and the late reverberation are suppressed down to the residual background noise floor given by G_min V(l,k). In case G_H0(l,k) is applied to those time-frequency frames where hypothesis H_0(l,k) is assumed to be true, we obtain

Ẑ_e(l,k) = G_H0(l,k) (Z_r(l,k) + E_r(l,k) + V(l,k)).    (14)

The desired solution for Ẑ_e(l,k) is

Ẑ_e(l,k) = G_min V(l,k).    (15)

The least squares solution for G_H0(l,k) is obtained by minimizing

E{ |G_H0(l,k) (Z_r(l,k) + E_r(l,k) + V(l,k)) − G_min V(l,k)|² }.

Assuming that all interferences are mutually uncorrelated, we obtain

G_H0(l,k) = G_min λ_v(l,k) / (λ_zr(l,k) + λ_er(l,k) + λ_v(l,k)).    (16)

The results of an informal listening test showed that the obtained residual interference was more pleasant than the residual interference that was obtained using G_H0(l,k) = G_min.

The OM-LSA spectral gain function, which minimizes the mean-square error of the log-spectra, is obtained as a weighted geometric mean of the hypothetical gains associated with the speech presence probability denoted by p(l,k) [37]. Hence, the modified OM-LSA gain function is given by

G_OM-LSA(l,k) = {G_H1(l,k)}^{p(l,k)} {G_H0(l,k)}^{1−p(l,k)}.    (17)

The speech presence probability p(l,k) was efficiently estimated using the method proposed by Cohen in [37].

The spectral component Z_e(l,k) of the early speech component can now be estimated by applying the OM-LSA spectral gain function to each spectral component E(l,k), i.e.,

Ẑ_e(l,k) = G_OM-LSA(l,k) E(l,k).    (18)

The early speech component ẑ_e(n) can then be obtained using the inverse STFT and the weighted overlap-add method [38].
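A sketch of the modified lower bound (16), the OM-LSA combination (17) and the estimate (18) is shown below. The default G_min corresponding to 18 dB of maximum attenuation is an assumed setting, and the speech presence probability p(l,k) and the gain G_H1(l,k) are taken as given inputs.

```python
import numpy as np

def om_lsa(E, p, G_H1, lam_zr, lam_er, lam_v, G_min=10.0 ** (-18.0 / 10.0)):
    """Modified OM-LSA gain (16)-(17) applied to E(l, k), cf. (18)."""
    lam_total = lam_zr + lam_er + lam_v
    G_H0 = G_min * lam_v / np.maximum(lam_total, 1e-12)  # (16): noise-floor bound
    G = (G_H1 ** p) * (G_H0 ** (1.0 - p))                # (17): weighted geometric mean
    return G * E                                         # (18): estimate of Ze(l, k)
```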

IV. LATE RESIDUAL ECHO SPECTRAL VARIANCE ESTIMATION

In Fig. 2 a typical AIR and its Energy Decay Curve (EDC) are depicted. The EDC is obtained by backward integration of the squared AIR [39], and is normalized with respect to the total energy of the AIR. In Fig. 2 we can see that the tail of the AIR exhibits an exponential decay, and the tail of the EDC exhibits a linear decay.
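As an aside, the EDC and a rough reverberation-time estimate can be obtained as in the following sketch; the linear fit between two assumed decay levels is only illustrative and does not reproduce the estimator of Appendix B.

```python
import numpy as np

def energy_decay_curve(h):
    """Normalized Energy Decay Curve in dB via backward integration of h^2 [39]."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]        # backward integral of the squared AIR
    edc = edc / edc[0]                         # normalize to the total energy
    return 10.0 * np.log10(np.maximum(edc, 1e-12))

def t60_from_edc(edc_db, fs, lo_db=-5.0, hi_db=-25.0):
    """Rough T60 from a linear fit to the EDC between two (assumed) levels."""
    idx = np.where((edc_db <= lo_db) & (edc_db >= hi_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # decay in dB per second
    return -60.0 / slope                             # time to decay by 60 dB
```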

Enzner [6] proposed a recursive estimator for the short-term PSD of the late residual echo, which is related to h_r(n) = [h_{N_e}, ..., h_{N_h−1}]^T. The recursive estimator exploits the fact that the exponential decay rate of the AIR is directly related to the reverberation time of the room, which can be estimated using the estimated echo path ĥ_e. Additionally, the recursive estimator requires a second parameter that specifies the initial power of the late residual echo.

Fig. 2. Typical acoustic impulse response and related energy decay curve: (a) typical acoustic impulse response; (b) normalized energy decay curve of (a).

In this section an essentially equivalent recursive estimator is derived, starting in the time domain rather than directly in the frequency domain as in [6]. Enzner applied a direct fit to the log-envelope of the estimated echo path to estimate the required parameters, viz., the reverberation time and the initial power of the late residual echo, which are both assumed to be frequency independent. It should, however, be noted that these parameters are usually frequency dependent [40]. Furthermore, in many applications the distance between the loudspeaker and the microphone is small, which results in a strong direct echo. The presence of a strong direct echo results in an erroneous estimate of both the reverberation time and the initial power (cf. [41]). Therefore, we propose to apply a linear curve fit to part of the EDC, which exhibits a smoother decay ramp. Details regarding the estimation of the reverberation time T60(k) and the initial power can be found in Appendix B and Appendix C, respectively.

Using a statistical reverberation model and the estimated reverberation time, the spectral variance of the late residual echo can be estimated. In the sequel we assume that N_h = ∞. The late residual echo e_r(n) can then be expressed as

e_r(n) = Σ_{j=0}^{∞} h_{r,j}(n) x_r(n−j),    (19)

where x_r(n) = x(n − N_e). The spectral variance of e_r(n) is defined as

λ_er(l,k) ≜ E{ |E_r(l,k)|² }.    (20)

In the STFT domain we can express E_r(l,k) as [42]

E_r(l,k) = Σ_{k'=0}^{N_DFT−1} Σ_{i=0}^{∞} H_{r,i}(l,k,k') X(l − i − N_e/R, k'),    (21)

where R denotes the number of samples between two successive STFT frames, N_DFT denotes the length of the Discrete Fourier Transform (DFT), H_{r,i}(l,k,k') may be interpreted as the response to an impulse δ(l−i, k−k') in the time-frequency domain (note that the impulse response is translation varying in the time- and frequency-axis), and i denotes the coefficient index. Note that N_e should be chosen such that N_e/R is an integer value.

Polack proposed a statistical reverberation model where the AIR is described as one realization of a non-stationary process [43]. The model is given by h(n) = b(n) e^{−ρ n / f_s} ∀ n ≥ 0, where b(n) is a white Gaussian noise with zero mean and ρ denotes the decay rate, which is related to the reverberation time T60 of the room. Using this model it can be shown that

E{ H_{r,i}(l,k,k') H_{r,i+τ}(l,k,k') } = 0   ∀ τ ≠ 0, ∀ l.    (22)
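A synthetic AIR realization according to Polack's model can be drawn as in the sketch below, with the decay rate ρ = 3 ln(10)/T60 as used later in (26); the chosen length and seed are arbitrary.

```python
import numpy as np

def polack_air(t60, fs, length, seed=0):
    """One realization of Polack's model h(n) = b(n) exp(-rho n / fs)."""
    rng = np.random.default_rng(seed)
    rho = 3.0 * np.log(10.0) / t60            # decay rate related to T60
    n = np.arange(length)
    b = rng.standard_normal(length)           # zero-mean white Gaussian noise
    return b * np.exp(-rho * n / fs)
```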


Using statistical room acoustics it can be shown that the correlation between different frequencies drops rapidly with increasing |k − k'| [44]. Therefore, the correlation between the cross-bands k ≠ k' can be neglected, i.e.,

E{ H_{r,i}(l,k,k') H_{r,i}(l,k,k') } = 0   ∀ k ≠ k', ∀ l.    (23)

Using (20)-(23) we can express λ_er(l,k) as

λ_er(l,k) = E{ Σ_{k'=0}^{K} Σ_{i=0}^{∞} |H_{r,i}(l,k,k')|² |X(l − i − N_e/R, k')|² }
          = Σ_{i=0}^{∞} E{ |H_{r,i}(l,k)|² } E{ |X(l − i − N_e/R, k)|² },    (24)

where H_{r,i}(l,k) ≜ H_{r,i}(l,k,k).

Using Polack's statistical reverberation model, the energy envelope of H_{r,i}(l,k) can be expressed as

E{ |H_{r,i}(l,k)|² } = c(l−i, k) α^i(k),    (25)

where c(l−i, k) denotes the initial power of the late residual echo in the kth sub-band at time (l−i)R, α(k) = e^{−2ρ(k) R / f_s} (0 ≤ α(k) < 1), and ρ(k) denotes the frequency dependent decay rate. The decay rate ρ(k) is related to the frequency dependent reverberation time T60(k) through

ρ(k) ≜ 3 ln(10) / T60(k).    (26)

Using (25), and the fact that λ_x(l,k) = E{ |X(l,k)|² }, we can rewrite (24) as

λ_er(l,k) = Σ_{i=0}^{∞} c(l−i, k) α^i(k) λ_x(l − i − N_e/R, k).    (27)

By using i' = l − i and extracting the last term of the summation in (27), we can derive a recursive expression for λ_er(l,k) such that only the spectral variance λ_x(l − N_e/R, k) is required, i.e.,

λ_er(l,k) = Σ_{i'=−∞}^{l} c(i', k) α^{l−i'}(k) λ_x(i' − N_e/R, k)
          = Σ_{i'=−∞}^{l−1} c(i', k) α^{l−i'}(k) λ_x(i' − N_e/R, k) + c(l, k) λ_x(l − N_e/R, k)
          = α(k) λ_er(l−1, k) + c(l, k) λ_x(l − N_e/R, k).    (28)

Given an estimate of the reverberation time T60(k) (see Appendix B), an estimate of the exponential decay rate ρ(k) is obtained using (26). Using the initial power c(l,k) (see Appendix C) we can now estimate λ_er(l,k) using

λ̂_er(l,k) = e^{−2ρ(k) R / f_s} λ̂_er(l−1, k) + c(l, k) λ_x(l − N_e/R, k),    (29)

where λ_x(l,k) can be calculated using

λ_x(l,k) = η_x λ_x(l−1, k) + (1 − η_x) |X(l,k)|²,    (30)

where η_x (0 ≤ η_x < 1) denotes the smoothing parameter. In general, a value η_x = exp(−R / (f_s · 12 ms)) yields good results.

V. LATE REVERBERANT SPECTRAL VARIANCE ESTIMATION

In this section we develop an estimator for the late reverberant spectral variance of the near-end speech signal z(n).

In [22] it was shown that, using Polack's statistical room impulse response model [43], the late reverberant energy can be estimated directly from the short-term PSD of the reverberant signal using

λ_zr(l,k) = α^{N_r/R}(k) λ_z(l − N_r/R, k).    (31)

The parameter N_r (in samples) controls the time instance (measured with respect to the arrival time of the direct sound) at which the late reverberation starts, and is chosen such that N_r/R is an integer value. In general, N_r is chosen between 20 and 60 ms. While 20-35 ms yields good results in case the SRR is larger than 0 dB, a value larger than 35 ms is preferred in case the SRR is smaller than 0 dB.

In [22,23] it was implicitly assumed that the energy of the direct path was small compared to the reverberant energy. However, in many practical situations, the source is close to the microphone, and the contribution of the energy related to the direct path is larger than the energy related to all reflections. In case the contribution of the direct path is ignored, the late reverberant spectral variance will be over-estimated. Since this over-estimation results in a distortion of the early speech component, we need to compensate for the energy related to the direct path.

In Section V-A it is shown how an estimate of the spectral variance of the reverberant spectral component Z(l,k) can be obtained, which is required to calculate (31). In Section V-B a method is developed to compensate for the energy contribution of the direct path.

A. Reverberant Spectral Variance Estimation

The spectral variance of the reverberant spectral component Z(l,k), i.e., λ_z(l,k), is estimated by minimizing

E{ ( |Z(l,k)|² − |Ẑ(l,k)|² )² },    (32)

where Ẑ(l,k) = G_SP(l,k) E(l,k). As shown in [45] this leads to the following spectral gain function

G_SP(l,k) = √( (ξ_SP(l,k) / (1 + ξ_SP(l,k))) ( 1/γ_SP(l,k) + ξ_SP(l,k) / (1 + ξ_SP(l,k)) ) ),    (33)

where

ξ_SP(l,k) = λ_z(l,k) / (λ_er(l,k) + λ_v(l,k))    (34)

and

γ_SP(l,k) = |E(l,k)|² / (λ_er(l,k) + λ_v(l,k))    (35)

denote the a priori and a posteriori SIRs, respectively. The a priori SIR is estimated using the Decision-Directed method. An estimate of the spectral variance of the reverberant speech signal z(n) is then obtained by

λ̂_z(l,k) = η_z λ̂_z(l−1, k) + (1 − η_z) (G_SP(l,k))² |E(l,k)|²,    (36)

where η_z (0 ≤ η_z < 1) denotes the smoothing parameter. In general, a value η_z = exp(−R / (f_s · 80 ms)) yields good results.

B. Direct Path Compensation

The energy envelope of the AIR of the system between s(n) and y(n) can be modelled using the exponential decay of the AIR, and the energy of the direct path and the energy of all reflections in the kth sub-band, denoted by Q_d(k) and Q_r(k), respectively. For the kth sub-band we then obtain in the z-transform domain

A_k(z) = Q_d(k) + Q_r(k) R_k(z),    (37)

where R_k(z) denotes the normalized energy envelope of the reverberant part of the AIR, which starts at l = 1, i.e.,

R_k(z) = ((1 − α(k)) / α(k)) Σ_{l=1}^{∞} (α(k))^l z^{−l}.    (38)

Note that Σ_{l=1}^{∞} (α(k))^l equals α(k) / (1 − α(k)). By expanding the series in (38) we obtain

R_k(z) = ((1 − α(k)) / α(k)) · (α(k) z^{−1}) / (1 − α(k) z^{−1}).    (39)

To eliminate the contribution of the energy of the direct path in λ̂_z(l,k), we apply the following filter to λ̂_z(l,k):

F_k(z) = Q_r(k) R_k(z) / (Q_d(k) + Q_r(k) R_k(z)).    (40)

We now define κ(k), which is inversely proportional to the Direct to Reverberation Ratio (DRR) in the kth sub-band, as

κ(k) ≜ ((1 − α(k)) / α(k)) · (Q_r(k) / Q_d(k)).    (41)

In this paper it is assumed that κ(k) is known a priori. In practice κ(k) could be estimated online, by minimizing E{ (|Z(l,k)|² − λ'_z(l,k))² } during the so-called free decay of the reverberation in the room. An adaptive estimation technique was proposed in [46].

Using the normalized energy envelope R_k(z), as defined in (39), (40) and (41), we obtain

F_k(z) = α(k) κ(k) z^{−1} / (1 − α(k) (1 − κ(k)) z^{−1}).    (42)

Using the difference equation related to the filter in (42), we obtain an estimate of the reverberant spectral variance with compensation of the direct path energy, i.e.,

λ'_z(l,k) = α(k) (1 − κ(k)) λ'_z(l−1, k) + α(k) κ(k) λ̂_z(l−1, k).    (43)

To ensure the stability of the filter, |α(k) (1 − κ(k))| < 1 is required. Furthermore, from a physical point of view it is important that only the source can increase the reverberation energy in the room, i.e., the contribution of λ'_z(l−1,k) to λ'_z(l,k) should always be smaller than, or equal to, α(k). Therefore, we require that 0 < κ(k) ≤ 1.

In case Q_d(k) ≫ Q_r(k), i.e., κ(k) is small, λ'_z(l,k) mainly depends on α(k) λ'_z(l−1,k). In case Q_d(k) ≪ Q_r(k) we reach the upper bound of κ(k), i.e., κ(k) = 1, and λ'_z(l,k) is equal to

λ'_z(l,k) = α(k) λ̂_z(l−1, k).    (44)

The late reverberant spectral variance λ_zr(l,k) with Direct Path Compensation (DPC) can now be obtained by using λ'_z(l,k), i.e.,

λ_zr(l,k) = α^{N_r/R − 1}(k) λ'_z(l − N_r/R + 1, k).    (45)

By substituting (44) in (45) we obtain the estimator (31) proposed in [22].
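The direct path compensation (43) and the delayed scaling (45) can be realized with a short history of compensated estimates, for example as in this sketch; the pre-filled history buffer and an integer N_r/R are assumptions of the sketch.

```python
import numpy as np
from collections import deque

def dpc_step(lam_z_prev, lam_zp_hist, alpha, kappa, Nr_frames):
    """One frame of direct path compensation (43) and late reverberance (45).

    lam_z_prev  : lambda_z(l-1, k) from (36), shape (K,)
    lam_zp_hist : deque of past lambda'_z frames, newest first, length Nr_frames,
                  e.g. deque([np.zeros(K)] * Nr_frames, maxlen=Nr_frames)
    alpha, kappa: per-bin model parameters alpha(k), kappa(k)
    Nr_frames   : Nr / R (integer number of frames)
    """
    lam_zp = (alpha * (1.0 - kappa) * lam_zp_hist[0]
              + alpha * kappa * lam_z_prev)                     # (43)
    lam_zp_hist.appendleft(lam_zp)                              # store lambda'_z(l, k)
    lam_zr = alpha ** (Nr_frames - 1) * lam_zp_hist[Nr_frames - 1]  # (45)
    return lam_zr, lam_zp
```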

Algorithm 1 Summary of the developed algorithm.

1) Acoustic Echo Cancellation: Update the adaptive filter ĥ_e(n) using (2) and calculate d̂(n) using (3).
2) Estimate Reverberation Time: Estimate T60(n) using (58).
3) STFT: Calculate the STFT of e(n) = y(n) − d̂(n) and x(n).
4) Estimate Background Noise: Estimate λ_v(l,k) using [34].
5) Estimate Late Residual Echo Spectral Variance: Calculate c(l,k) using (61) and λ_er(l,k) using (29).
6) Estimate Late Reverberant Spectral Variance: Calculate G_SP(l,k) using (33)-(35). Estimate λ_z(l,k) using (36), and calculate λ_zr(l,k) using (43) and (45).
7) Post-Filter:
   a) Calculate the a posteriori SIR using (8) and the a priori SIR using (51)-(54).
   b) Calculate the speech presence probability p(l,k) [37].
   c) Calculate the gain function G_OM-LSA(l,k) using (16) and (17).
   d) Calculate Ẑ_e(l,k) using (18).
8) Inverse STFT: Calculate the output ẑ_e(n) by applying the inverse STFT to Ẑ_e(l,k).

VI. ALGORITHM OUTLINE AND DISCUSSION

In the previous sections a novel post-filter was developed that is used for the joint suppression of residual echo, late reverberation and background noise. This post-filter is used in conjunction with a standard AEC. The steps of a complete algorithm, which includes the estimation of the echo path, the estimation of the spectral variances of the interferences, and the OM-LSA gain function, are summarized in Alg. 1.

In this paper we used a standard NLMS algorithm to update the adaptive filter. Due to the choice of N_e (N_e < N_h) the length of the adaptive filter is deficient. In case the far-end signal x(n) is not spectrally white, the filter coefficients are biased [47,48]. However, the filter coefficients that are most affected are in the tail region. Accordingly, this problem can be partially solved by slightly increasing the value of N_e and calculating the output using the original N_e coefficients of the filter. Alternatively, one could use a, possibly adaptive, pre-whitening filter [2], or another adaptive algorithm like AP or RLS.

An estimate of the reverberation time is required for the late residual echo spectral variance and late reverberant spectral variance estimation. In some applications, e.g., conference systems, this parameter may be determined using a calibration step. In this paper we proposed a method to estimate the reverberation time online using the estimated filter ĥ_e, assuming that the convergence of the filter ĥ_e is sufficient. Instantaneous divergence of the filter coefficients, e.g., due to false double-talk detection or echo path changes, does not significantly influence the estimation of the reverberation time due to the relatively slow update mechanism of T60. In case the filter coefficients cannot converge, for example due to background noise, the estimated reverberation time will be inaccurate. Over-estimation of the reverberation time results in an over-estimation of the spectral variance of the late residual echo λ_er(l,k) and the late reverberation λ_zr(l,k). During double-talk periods this introduces some distortion of the early speech component. Informal listening tests indicated that estimation errors > 10% resulted in audible distortions of the early speech component. When only the far-end speech signal is active, the over-estimation of λ_er(l,k) does not introduce any problems since the suppression is limited by the residual background noise level. Under-estimation of the reverberation time results in an under-estimation of the spectral variances. Although the under-estimation reduces the performance of the system in terms of late residual echo and reverberation suppression, it does not introduce any distortions of the early speech component.

Fig. 3. Experimental setup.

Post-filters that are capable of handling both the residual echo and background noise are often implemented in the STFT domain. In general, they require two STFTs and one inverse STFT, which is equal to the number of STFTs used in the proposed solution. The computational complexity of the proposed solution is comparable to former solutions since the estimation of the reverberation time and the late reverberant energy only requires a few operations. The computational complexity of the AEC can be reduced by using an efficient implementation of the AEC in the frequency domain (cf. [49]), rather than in the time domain.
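The STFT analysis of step 3 and the synthesis of step 8 in Alg. 1 can be realized, for example, with the following generic analysis/overlap-add pair. This is a simplified stand-in for the weighted overlap-add method of [38], using only the 256-point Hamming window and 75% overlap mentioned in Section VII as assumptions.

```python
import numpy as np

def stft(x, Nw=256, R=64, win=None):
    """Simple STFT with a Hamming analysis window (Nw = 256, 75% overlap)."""
    win = np.hamming(Nw) if win is None else win
    frames = [np.fft.rfft(win * x[m:m + Nw])
              for m in range(0, len(x) - Nw + 1, R)]
    return np.array(frames)

def istft_ola(S, Nw=256, R=64, win=None):
    """Weighted overlap-add synthesis matching the analysis above."""
    win = np.hamming(Nw) if win is None else win
    out = np.zeros(R * (len(S) - 1) + Nw)
    norm = np.zeros_like(out)
    for m, frame in enumerate(S):
        seg = np.fft.irfft(frame, Nw)
        out[m * R:m * R + Nw] += win * seg      # weighted overlap-add
        norm[m * R:m * R + Nw] += win ** 2      # window-power normalization
    return out / np.maximum(norm, 1e-12)
```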

VII. EXPERIMENTAL RESULTS

In this section we present experimental results that demonstrate the beneficial use of the developed spectral variance estimators and post-filter.¹ In the subsequent sub-sections we evaluate the ability of the post-filter to suppress background noise and non-stationary interferences, i.e., late residual echo and late reverberation. First, the performance of the late residual echo spectral variance estimator, and its robustness with respect to changes in the tail of the acoustic echo path, is evaluated. Secondly, the dereverberation performance of the near-end speech is evaluated in the presence of background noise. We compare the dereverberation performance obtained with, and without, the DPC that was developed in Section V-B. Finally, we evaluate the performance of the entire system in case all interferences are present, i.e., during double-talk.

The experimental setup is depicted in Fig. 3. The room dimensions were 5 m x 4 m x 3 m (length x width x height). The distance between the near-end speaker and the microphone (r_s) was 0.5 m, and the distance between the loudspeaker and the microphone (r_l) was 0.25 m. All AIRs were generated using Allen and Berkley's image method [50,51]. The wall absorption coefficients were chosen such that the reverberation time is approximately 500 milliseconds. The microphone signal y(n) was generated using (5). The analysis window w(n) of the STFT was a 256 point Hamming window, i.e., N_w = 256, and the overlap between two successive frames was set to 75%, i.e., R = 0.25 N_w. The remaining parameter settings are shown in Table I. The additive noise v(n) was speech-like noise, taken from the NOISEX-92 database [52].

¹The results are available for listening at the following web page: http://home.tiscali.nl/ehabets/publications/tassp07/tassp07.html

TABLE I
PARAMETERS USED FOR THESE EXPERIMENTS.

f_s = 8000 Hz        N_e = 0.128 f_s    N_r = 0.024 f_s
G_min^dB = 18 dB     β^dB = 9 dB        w = 3
µ = 0.35             η_x = 0.5          η_z = 0.9

A. Residual Echo Suppression

The echo cancellation performance, and more specifically the improvement due to the post-filter, was evaluated using the Echo Return Loss Enhancement (ERLE). This experiment was conducted without noise, and the post-filter was configured such that no reverberation was reduced, i.e., λ_zr(l,k) = 0 ∀ l, ∀ k. The ERLE achieved by the adaptive filter was calculated using

ERLE(l) = 10 log10 ( Σ_{n=lR'}^{lR'+L'−1} d²(n) / Σ_{n=lR'}^{lR'+L'−1} (d(n) − d̂(n))² ) dB,    (46)

where L' = 0.032 f_s is the frame length and R' = L'/4 is the frame rate. To evaluate the total echo suppression, i.e., with post-filter, we calculated the ERLE using (46) and replaced (d(n) − d̂(n)) by the residual echo at the output of the post-filter, which is given by (ẑ_e(n) − z(n)). Note that by subtracting the near-end speech signal z(n) from the output of the post-filter ẑ_e(n), we avoid the bias in the ERLE that is caused by z(n). The final normalized misalignment of the adaptive filter was -24 dB (SNR = 25 dB). It should be noted that the developed post-filter only suppresses the residual echo that results from the deficient length of the adaptive filter. Hence, the residual echo that results from the system mismatch of the adaptive filter cannot be compensated by the developed post-filter. The microphone signal y(n), the error signal e(n), and the ERLE with and without post-filter are shown in Fig. 4. We can see that the ERLE is significantly increased when the post-filter is used. A significant reduction of the residual echo was observed when subjectively comparing the error signal and the processed signal. A small amount of residual echo was still audible in the processed signal. However, in the presence of background noise (as discussed in Section VII-C) the residual echo in the processed signal is masked by the residual noise.
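The ERLE measure (46) translates directly into a frame loop, as sketched below; the frame length L' = 0.032 f_s and frame rate R' = L'/4 follow the text, while the small constant that avoids division by zero is an assumption.

```python
import numpy as np

def erle_curve(d, d_res, fs, frame_len=None, frame_rate=None):
    """Frame-wise ERLE in dB, cf. (46); d is the echo, d_res the residual echo."""
    L = int(0.032 * fs) if frame_len is None else frame_len   # L' = 0.032 fs
    R = L // 4 if frame_rate is None else frame_rate          # R' = L'/4
    erle = []
    for start in range(0, len(d) - L + 1, R):
        num = np.sum(d[start:start + L] ** 2)
        den = np.sum(d_res[start:start + L] ** 2) + 1e-12
        erle.append(10.0 * np.log10(num / den))
    return np.array(erle)
```

For the AEC alone the residual is d(n) − d̂(n); for the total suppression it is ẑ_e(n) − z(n), as described above.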

Fig. 4. Echo suppression performance: (a) microphone signal y(n); (b) error signal e(n) and the estimated signal ẑ_e(n); (c) Echo Return Loss Enhancement of e(n) and ẑ_e(n).

We evaluate the robustness of the developed late residual echo suppressor with respect to changes in the tail of the acoustic echo path when the far-end speech signal was active. Let us assume that the AEC is working perfectly at all times, i.e., ĥ_e(n) = h_e(n) ∀ n. We compared three systems: i) the perfect AEC, ii) the perfect AEC followed by an adaptive filter of length 1024 which compensates for the late residual echo, and iii) the perfect AEC followed by the developed post-filter. It should be noted that the total length of the filter that is used to cancel the echo in system ii is still shorter than the acoustic echo path. The output of system ii is denoted by e'(n). At 4 seconds the acoustic echo path was changed by rotating the loudspeaker over 30 degrees in the x-y plane with respect to the microphone. The time at which the position changes is marked with a dash-dotted line. The microphone signal y(n), the error signal e(n) of the standard AEC, the signals e'(n) and ẑ_e(n), and the ERLEs are shown in Fig. 5. From the results we can see that the ERLEs of e'(n) and ẑ_e(n) are improved compared to the ERLE of e(n). When listening to the output signals, an increase in late residual echo was noticed when using the adaptive filter (system ii); no increase was noticed when using the developed late residual echo estimator and the post-filter (system iii). Since the late residual echo estimator is mainly based on the exponentially decaying envelope of the AIR, which does not change over time, the post-filter does not require any convergence time and it does not suffer from the change in the tail of the acoustic echo path. Furthermore, during double-talk the adaptive filter might not be able to converge due to the low echo to near-end speech plus noise ratio of the microphone signal y(n). In the latter case the developed late residual echo suppressor would still be able to obtain an accurate estimate of the late residual echo.

B. Dereverberation

The dereverberation performance was evaluated using the segmental SIR and the Log Spectral Distance (LSD). The parameter κ(k) was obtained from the AIR of the system relating s(n) and z(n). An estimate of the reverberation time T60(k) was obtained using the procedure described in Appendix B. After convergence of the adaptive filter, T60 was 493 ms. The parameter N_r was set to 0.024 f_s.

The instantaneous SIR of the lth frame is defined as

SIR(l) = 10 log10 ( Σ_{n=lR'}^{lR'+L'−1} z_e²(n) / Σ_{n=lR'}^{lR'+L'−1} (z_e(n) − υ(n))² ) dB,    (47)

where υ ∈ {y, ẑ_e}. The segmental SIR is defined as the average instantaneous SIR over the set of frames where the near-end speech is active.

The LSD between z_e(n) and the dereverberated signal is used as a measure of distortion. The distance in the lth frame is calculated using

LSD(l) = (1/K) Σ_{k=0}^{K−1} | 10 log10( C{|Z_e(l,k)|²} / C{|Υ(l,k)|²} ) | dB,    (48)

where Υ ∈ {Y, Ẑ_e}, K denotes the number of frequency bins, and C{|A(l,k)|²} ≜ max{|A(l,k)|², ε} denotes a clipping operator which confines the log-spectrum dynamic range to about 50 dB, i.e., ε = 10^{−50/10} max_{l,k}{|A(l,k)|²}. Finally, the LSD is defined as the average distance over all frames.
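Both measures can be computed as in the following sketch; the per-frame activity mask used for the segmental SIR is a hypothetical input, and the frame length is assumed to equal the L' of (46).

```python
import numpy as np

def segmental_sir(z_e, upsilon, fs, active):
    """Segmental SIR (47), averaged over frames where near-end speech is active."""
    L = int(0.032 * fs)
    R = L // 4
    sirs = []
    for i, start in enumerate(range(0, len(z_e) - L + 1, R)):
        if not active[i]:                      # hypothetical per-frame speech mask
            continue
        num = np.sum(z_e[start:start + L] ** 2)
        den = np.sum((z_e[start:start + L] - upsilon[start:start + L]) ** 2) + 1e-12
        sirs.append(10.0 * np.log10(num / den))
    return np.mean(sirs)

def log_spectral_distance(Ze, Ups):
    """Mean LSD (48) between two STFT arrays of shape (frames, bins)."""
    eps_z = 10.0 ** (-50.0 / 10.0) * np.max(np.abs(Ze) ** 2)   # ~50 dB clipping
    eps_u = 10.0 ** (-50.0 / 10.0) * np.max(np.abs(Ups) ** 2)
    Pz = np.maximum(np.abs(Ze) ** 2, eps_z)
    Pu = np.maximum(np.abs(Ups) ** 2, eps_u)
    return np.mean(np.abs(10.0 * np.log10(Pz / Pu)))
```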

The dereverberation performance was tested using different segmental SNRs. The segmental SNR value is determined by averaging the instantaneous SNR of those frames where the near-end speech is active. Since the non-stationary interferences, such as the late residual echo and reverberation, are suppressed down to the residual background noise level, the post-filter will always include the noise suppression. To show the improvement related to the dereverberation process we evaluated the segmental SIR and LSD measures for the unprocessed signal, the processed signal (noise suppression (NS) only), the processed signal without DPC (noise and reverberation suppression (NS+RS)), and the processed signal with DPC (NS+RS+DPC). It should be noted that the late reverberant spectral variance estimator without DPC is similar to the method in [22]. The results, presented in Table II, show that compared to the unprocessed signal the segmental SIR and LSD are improved in all cases. It can be seen that the DPC increases the segmental SIR and reduces the LSD, while the reverberation suppression without DPC distorts the signal. When the background noise is suppressed, the late reverberation of the near-end speech becomes more pronounced. The results of an informal listening test indicated that the near-end signal that was processed without DPC sounds unnatural as it contains rapid amplitude variations, while the signal that was processed with DPC sounds natural.

Fig. 5. Echo suppression performance with respect to echo path changes: (a) microphone signal y(n); (b) error signals e(n) and e'(n), and the estimated signal ẑ_e(n); (c) Echo Return Loss Enhancement of e(n), e'(n), and ẑ_e(n).

TABLE II
SEGMENTAL SIR AND LSD FOR DIFFERENT SEGMENTAL SIGNAL TO NOISE RATIOS.

                          segmental SNR = 5 dB     segmental SNR = 10 dB    segmental SNR = 25 dB
                          segSIR      LSD          segSIR      LSD          segSIR      LSD
Unprocessed               -3.28 dB    8.21 dB      0.21 dB     5.77 dB      4.74 dB     2.66 dB
Post-Filter (NS)           2.70 dB    3.54 dB      4.15 dB     2.83 dB      5.31 dB     2.40 dB
Post-Filter (NS+RS)        2.47 dB    4.02 dB      4.48 dB     3.26 dB      6.94 dB     2.45 dB
Post-Filter (NS+RS+DPC)    3.57 dB    3.41 dB      5.38 dB     2.62 dB      7.93 dB     1.71 dB

The instantaneous SIR and LSD results obtained with a segmental SNR of 25 dB, together with the anechoic, reverberant and processed signals, are presented in Fig. 6. Since the SNR is relatively high, the instantaneous SIR mainly relates to the amount of reverberation, such that the SIR improvement is related to the reverberation suppression. The instantaneous SIR and LSD are, respectively, increased and decreased, especially in those areas where the SIR of the unprocessed signal is low. During speech onsets some speech distortion may occur due to using the Decision Directed approach for the a priori SIR estimation [36]. We can also see that the processed signal without DPC introduces some spectral distortions, i.e., for some frames the LSD is higher than the LSD of the unprocessed signal, while the processed signal with DPC does not introduce such distortions. In general, these distortions occur during spectral transitions in the time-frequency domain. While the distortions are often masked by subsequent phonemes, they are clearly audible at the onset and offset of the full-band speech signal. These distortions can best be described as an abrupt increase or decrease of the sound level.

The spectrograms and waveforms of the near-end speech signal z(n), the early speech component ze(n), and the estimated early speech component ẑe(n) are shown in Fig. 7. From these plots it can be seen (for example at 0.5 seconds) that the smearing in time due to the reverberation has been reduced significantly.


Fig. 6. Dereverberation performance of the system during near-end speech period (T60 ≈ 0.5 s): (a) reverberant and anechoic near-end speech signal; (b) reverberant near-end speech signal and estimated early speech component; (c) instantaneous SIR of the unprocessed and processed (with and without Direct Path Compensation) near-end speech signal; (d) LSD of the unprocessed and processed (with and without Direct Path Compensation) near-end speech signal.


In Section V-B we have developed a novel spectral estimator for the late reverberant signal component zr(n). The estimator requires an additional parameter κ(k) which is inversely dependent on the DRR. In the present work it is assumed that κ(k) is a priori known. However, in practice κ(k) needs to be estimated online. In this paragraph we evaluate the robustness with respect to errors in κ(k) by introducing an error of ±10%. The segmental SIR and LSD using the perturbed values of κ(k) are shown in Table III. From this experiment we can see that the performance of the proposed algorithm is not very sensitive to errors in the parameter κ(k). Furthermore, when an estimator for κ(k) is developed it is sufficient to obtain a 'rough' estimate of κ(k).

TABLE III
SEGMENTAL SIR AND LSD, SEGMENTAL SNR = 25 dB, AND κ(k) ∈ {κ(k), 0.9·κ(k), 1.1·κ(k)}.

              Processed
κ(k)          segSIR      LSD
κ(k)          7.93 dB     1.71 dB
0.9·κ(k)      7.87 dB     1.72 dB
1.1·κ(k)      7.92 dB     1.71 dB

C. Joint Suppression Performance

We now evaluate the performance of the entire system during double-talk. The performance is evaluated using the segmental SIR and the LSD at three different segmental SNR values.


Fig. 7. Spectrogram and waveform of (a-b) the reverberant near-end speech signal z(n), (c-d) the early speech component ze(n), and (e-f) the estimated early speech component ẑe(n) (segmental SNR = 25 dB, T60 ≈ 0.5 s).

To be able to show that the suppression of each additional interference results in an improvement of the performance, we also show the intermediate results. Since all non-stationary interferences, i.e., the late residual echo and reverberation, are reduced down to the residual background noise level, the background noise is suppressed first. We evaluated the performance using i) the AEC, ii) the AEC and post-filter (noise suppression), iii) the AEC and post-filter (noise and residual echo suppression), and iv) the AEC and post-filter (noise, residual echo, and reverberation suppression). The results are presented in Table IV. These results show a significant improvement in terms of SIR and LSD. An improvement of the far-end echo to near-end speech ratio is observed when listening to the signal after the AEC (system i). However, reverberant sounding residual echo can clearly be noticed. When the background noise is suppressed (system ii) the residual echo and reverberation of the near-end speech become more pronounced. After suppression of the late residual echo (system iii) almost no echo is observed. When in addition the late reverberation is suppressed (system iv) it sounds as if the near-end speaker has moved closer to the microphone. Informal listening tests using normal hearing subjects showed a significant improvement of the speech quality when comparing the output of system ii and system iv.

The spectrograms of the microphone signal y(n), the early speech component ze(n), and the estimated signal ẑe(n) for segmental SNRs of 25 dB and 5 dB are shown in Figs. 8 and 9, respectively. The spectrograms demonstrate how well the interferences are suppressed during double-talk.

VIII. CONCLUSION

We have developed a novel post-filter for an AEC which is designed to efficiently reduce reverberation of the near-end speech signal, late residual echo and background noise. Spectral variance estimators for the late residual echo and late reverberation have been derived using a statistical model of the AIR that depends on the reverberation time of the room. Because blind estimation of the reverberation time is very difficult, a major advantage of the hands-free scenario is that, due to the existence of the echo, an estimate of the reverberation time can be obtained from the estimated acoustic echo path. Finally, the near-end speech is estimated based on a modified OM-LSA estimator. The modification ensures a stationary residual background noise level at the output.


TABLE IV
SEGMENTAL SIR AND LSD FOR DIFFERENT SEGMENTAL SIGNAL TO NOISE RATIOS DURING DOUBLE-TALK.

                                segmental SNR = 5 dB    segmental SNR = 10 dB   segmental SNR = 25 dB
                                segSIR      LSD         segSIR      LSD         segSIR      LSD
Unprocessed                     -9.36 dB    11.64 dB    -8.75 dB    10.54 dB    -8.01 dB    9.74 dB
AEC                             -3.34 dB     8.57 dB    -0.27 dB     6.08 dB     3.62 dB    2.69 dB
AEC + Post-Filter (NS)           0.98 dB     3.60 dB     2.74 dB     2.72 dB     4.38 dB    2.28 dB
AEC + Post-Filter (NS+RES)       1.26 dB     3.34 dB     2.98 dB     2.36 dB     4.89 dB    1.68 dB
AEC + Post-Filter (NS+RES+RS)    1.82 dB     3.27 dB     3.82 dB     2.23 dB     6.98 dB    1.34 dB

Fig. 8. Spectrograms of (a) the microphone signal y(n), (b) the early speech component ze(n), (c) the reverberant near-end speech signal z(n), and (d) the estimated early speech component ẑe(n), during double-talk (segmental SNR = 25 dB, T60 ≈ 0.5 s).

Experimental results demonstrate the performance of the developed post-filter, and its robustness to small changes in the tail of the acoustic echo path. During single- and double-talk periods a significant amount of interference is suppressed with little speech distortion.

The statistical model of the AIR does not take the energy contribution of the direct-path into account. Hence, a late reverberant spectral variance estimator which is based on this model results in an over-estimated spectral variance. This phenomenon is pronounced when the source-microphone distance is smaller than the critical distance and results in spectral distortions of the desired speech signal. Therefore, we derived an estimator that compensates for the energy contribution of the direct-path. The compensation requires one additional (possibly frequency dependent) parameter that is related to the DRR of the AIR. We demonstrated that the proposed estimator is not very sensitive to estimation errors of this parameter. Future research will focus on the blind estimation of this parameter.

When multiple microphones are available rather than a single microphone, the spatial diversity of the received signals can be used to increase the suppression of reverberation and other interferences. Extending the post-filter to the case where a microphone array is available, rather than a single microphone, is a topic for future research.

APPENDIX A
A PRIORI SIR ESTIMATOR

Rather than using one a priori SIR it is possible to calculate one value for each interference. By doing this, one gains control over i) the trade-off between the interference reduction and the distortion of the desired signal, and ii) the a priori SIR estimation approach of each interference. Note that in some cases it might be desirable to reduce one of the interferences at the cost of larger speech distortion, while other interferences are reduced less to avoid distortion. Gustafsson et al. also used separate a priori SIRs in [13,30] for two interferences, i.e., background noise and residual echo. In this section we show how the Decision Directed approach can be used to estimate the individual a priori SIRs, and we propose a slightly different way of combining them. It should be noted that each a priori SIR could be estimated using a different approach, e.g., the Decision Directed a priori SIR estimator proposed by Ephraim and Malah in [35] or the non-causal a priori SIR estimator proposed by Cohen in [36]. In this work we have used the Decision Directed a priori SIR estimator.


Fig. 9. Spectrograms of (a) the microphone signal y(n), (b) the early speech component ze(n), (c) the reverberant near-end speech signal z(n), and (d) the estimated early speech component ẑe(n), during double-talk (segmental SNR = 5 dB, T60 ≈ 0.5 s).

The a priori SIR in (9) can be written as

\frac{1}{\xi(l,k)} = \frac{1}{\xi_{z_r}(l,k)} + \frac{1}{\xi_{e_r}(l,k)} + \frac{1}{\xi_{v}(l,k)},   (49)

with

\xi_{\vartheta}(l,k) = \frac{\lambda_{z_e}(l,k)}{\lambda_{\vartheta}(l,k)},   (50)

where ϑ ∈ {zr, er, v}.

Let us assume that there always is a certain amount of background noise. In case the energy of the near-end speech is very low, and the energy of the late reverberant and/or residual echo is very low, the a priori SIR ξzr(l, k) and/or ξer(l, k) may be unreliable since λze(l, k) and λzr(l, k) and/or λer(l, k) are close to zero. Due to this the a priori SIR ξ(l, k) may be unreliable. Because the LSA gain function as well as the speech presence probability p(l, k) depend on ξ(l, k), an inaccurate estimate can decrease the performance of the post-filter. We propose to calculate ξ(l, k) using only the most important and reliable a priori SIRs as follows²

\xi(l,k) =
\begin{cases}
\xi_v & \text{if } 10\log_{10}\!\left(\dfrac{\lambda_v}{\lambda_{z_r}+\lambda_{e_r}}\right) > \beta_{\mathrm{dB}}, \\
\xi' & \text{otherwise},
\end{cases}   (51)

and

\xi'(l,k) =
\begin{cases}
\dfrac{\xi_{e_r}\,\xi_v}{\xi_{e_r}+\xi_v} & \text{if } 10\log_{10}\!\left(\dfrac{\lambda_{e_r}}{\lambda_{z_r}}\right) > \beta_{\mathrm{dB}}, \\
\dfrac{\xi_{z_r}\,\xi_v}{\xi_{z_r}+\xi_v} & \text{if } 10\log_{10}\!\left(\dfrac{\lambda_{z_r}}{\lambda_{e_r}}\right) > \beta_{\mathrm{dB}}, \\
\dfrac{\xi_{z_r}\,\xi_{e_r}\,\xi_v}{\xi_v\xi_{e_r}+\xi_{z_r}\xi_{e_r}+\xi_{z_r}\xi_v} & \text{otherwise},
\end{cases}   (52)

where the threshold βdB specifies the level difference in dB. In case the noise level is βdB higher than the level of residual echo and late reverberation (in dB), the total a priori SIR, ξ(l, k), will be equal to ξv(l, k). Otherwise ξ(l, k) will be calculated depending on the level difference between λzr(l, k) and λer(l, k) using (52): In case the level of residual echo is βdB larger than the level of late reverberation, ξ(l, k) will depend on both ξv(l, k) and ξer(l, k). In case the opposite is true, ξ(l, k) will depend on both ξv(l, k) and ξzr(l, k). In any other case ξ(l, k) will be calculated using all a priori SIRs.

²The time and frequency indices at the right-hand side have been omitted.
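As an illustration of (51)-(52), the selection logic for a single time-frequency bin can be sketched as follows; the function name and the example threshold value are ours and not part of the original implementation.

import numpy as np

def combine_a_priori_sirs(xi_zr, xi_er, xi_v, lam_zr, lam_er, lam_v, beta_db=3.0):
    # (51): if the noise level exceeds the combined level of late residual
    # echo and late reverberation by beta_db, use the noise a priori SIR only.
    if 10.0 * np.log10(lam_v / (lam_zr + lam_er)) > beta_db:
        return xi_v
    # (52): otherwise combine depending on which interference dominates.
    if 10.0 * np.log10(lam_er / lam_zr) > beta_db:        # residual echo dominates
        return xi_er * xi_v / (xi_er + xi_v)
    if 10.0 * np.log10(lam_zr / lam_er) > beta_db:        # late reverberation dominates
        return xi_zr * xi_v / (xi_zr + xi_v)
    # No dominant interference: use all three a priori SIRs.
    return xi_zr * xi_er * xi_v / (xi_v * xi_er + xi_zr * xi_er + xi_zr * xi_v)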

To estimate ξϑ(l, k) we use the following expression

\xi_{\vartheta}(l,k) = \max\!\left\{ \eta_{\vartheta}\, \frac{|\hat{Z}_e(l-1,k)|^2}{\lambda_{\vartheta}(l-1,k)} + (1-\eta_{\vartheta})\max\{\psi_{\vartheta}(l,k), 0\},\; \xi_{\min,\vartheta} \right\},   (53)

where

\psi_{\vartheta}(l,k) = \frac{\lambda(l,k)}{\lambda_{\vartheta}(l,k)}\,\psi(l,k) = \frac{|E(l,k)|^2 - \lambda(l,k)}{\lambda_{\vartheta}(l,k)},   (54)

and ξmin,ϑ is the lower-bound on the a priori SIR ξϑ(l, k).
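A per-bin sketch of the Decision Directed estimate (53)-(54) is given below; the smoothing factor and the lower bound are example values, and Ze_prev is assumed to be the enhanced spectral coefficient of the previous frame.

def decision_directed_sir(Ze_prev, lam_theta_prev, E, lam_total, lam_theta,
                          eta=0.98, xi_min=10.0**(-25.0 / 10.0)):
    # psi_theta per (54): instantaneous SIR estimate for interference theta.
    psi = (abs(E)**2 - lam_total) / lam_theta
    # Decision Directed recursion per (53), lower-bounded by xi_min.
    xi = eta * abs(Ze_prev)**2 / lam_theta_prev + (1.0 - eta) * max(psi, 0.0)
    return max(xi, xi_min)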

APPENDIX B
ESTIMATION OF THE REVERBERATION TIME

The reverberation time can be estimated directly from the EDC of he(n). It should be noted that the last EDC values are not useful due to the finite length of he(n) and due to the final misalignment of the adaptive filter coefficients. Therefore, we use only a dynamic range of 20 dB³ to determine the slope of the EDC. Finally, the reverberation time is updated using an adaptive scheme. A detailed description can be found in Alg. 2.


Algorithm 2 Estimation of the reverberation time using he(n).

1) Calculate the Energy Decay Curve of he(n), where n = uR_T60 (u = 1, 2, ...) and R_T60 denotes the estimation rate, using

EDC(u,m) = 20 \log_{10}\!\left( \sum_{j=m}^{N_e-1} \big( h_{e,j}(uR_{T_{60}}) \big)^2 \right) \quad \text{for } 0 \le m \le N_e - 1.

2) A straight line is fitted through a selected part of the EDC values using a least squares approach. The line at time u is described by p(u) + q(u)m, where p(u) and q(u) denote the offset and the regression coefficient of the line, respectively. The regression coefficient q(u) is obtained by minimizing the following cost function:

J(p(u),q(u)) = \sum_{m=m_s}^{m_e} \big( EDC(u,m) - (p(u) + q(u)m) \big)^2,   (55)

where m_s (0 ≤ m_s < N_e − 1) and m_e (m_s < m_e ≤ N_e − 1) denote the start-time and end-time of the EDC values that are used, respectively. A good choice for m_s and m_e is given by

m_s = \arg\min_m \big| EDC(u,m) - EDC(u,0) + 5 \big|   (56)

and

m_e = \arg\min_m \big| EDC(u,m) - EDC(u,0) + 25 \big|,   (57)

respectively.

3) The reverberation time T60(u) can now be calculated using

T_{60}(u) = T_{60}(u-1) + \mu_{T_{60}} \left( \frac{-60}{q(u) f_s} - T_{60}(u-1) \right),   (58)

where \mu_{T_{60}} denotes the adaptation step-size.


In general, the reverberation time T60 is frequency dependent due to frequency dependent reflection coefficients of walls and other objects and the frequency dependent absorption coefficient of air [40]. Instead of applying the above procedure to he(n), we can apply the above procedure to band-pass filtered versions of he(n). We used 1-octave band filters to acquire the reverberation time in different sub-bands. These values are interpolated and extrapolated to obtain an estimate of T60(k) for each frequency bin k.

To reduce the complexity of the estimator we can estimate the reverberation time at regular intervals, i.e., for n ≜ uR_T60 (u = 1, 2, ...), where R_T60 denotes the estimation rate of the reverberation time.
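The steps of Alg. 2 can be sketched in Python as follows; the step size and the use of the −5 dB and −25 dB points (a 20 dB dynamic range) follow the description above, while the function itself is only an illustrative reading of the algorithm.

import numpy as np

def update_t60(h_e, t60_prev, fs, mu=0.2):
    # Step 1: Energy Decay Curve (in dB) of the estimated echo path.
    edc = 20.0 * np.log10(np.cumsum(h_e[::-1]**2)[::-1] + np.finfo(float).tiny)
    # Step 2: least-squares line fit between the -5 dB and -25 dB points.
    rel = edc - edc[0]
    ms = int(np.argmin(np.abs(rel + 5.0)))
    me = int(np.argmin(np.abs(rel + 25.0)))
    m = np.arange(ms, me + 1)
    q = np.polyfit(m, edc[ms:me + 1], 1)[0]   # regression coefficient [dB/sample]
    # Step 3: recursive update of the reverberation time, cf. (58).
    return t60_prev + mu * (-60.0 / (q * fs) - t60_prev)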

APPENDIX C
ESTIMATION OF THE INITIAL POWER

The initial power c(l, k) can be calculated using the following expression

c(l,k) = \left| \sum_{j=0}^{N_w-1} h_{r,j}(lR)\, e^{-\iota \frac{2\pi k}{N_{\mathrm{DFT}}} j} \right|^2 \quad \text{for } k = \{0, \ldots, N_{\mathrm{DFT}}-1\},   (59)

³It might be necessary to decrease the dynamic range when Ne is small or the reverberation time is long.

where ι = √−1 and Nw is the length of the analysis window. Since hr(n) is not available, we use the last Nw coefficients of he(n) and extrapolate the energy using the estimated decay. We then obtain an estimate of c(l, k) by

c(l,k) = \alpha_{R}^{N_w}(k) \left| \sum_{j=0}^{N_w-1} h_{e,N_e-N_w+j}(lR)\, e^{-\iota \frac{2\pi k}{N_{\mathrm{DFT}}} j} \right|^2 \quad \text{for } k = \{0, \ldots, N_{\mathrm{DFT}}-1\}.   (60)

The estimated initial power might contain some spectral zeros, which can easily be removed by smoothing c(l, k) along the frequency axis using

c(l,k) = \sum_{i=-w}^{w} b_i\, c(l, k+i),   (61)

where b is a normalized window function (Σ_{i=−w}^{w} b_i = 1) that determines the frequency smoothing.

In this work we have calculated c(l, k) for every frame l. However, in many cases it can be assumed that the acoustic echo path is slowly time-varying. Therefore, c(l, k) does not have to be calculated for every frame l. By calculating c(l, k) at a lower frame rate the computational complexity of the late residual echo estimator can be reduced.
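A possible implementation of (60)-(61) is sketched below; the decay compensation factor α_R^{Nw}(k) is assumed to be available (it follows from the estimated reverberation time), and the simple moving-average smoothing window is an example choice.

import numpy as np

def estimate_initial_power(h_e_tail, alpha_pow, N_dft, w=2):
    # (60): DFT of the last N_w coefficients of the estimated echo path,
    # compensated by alpha_pow = alpha_R(k)**N_w (length-N_dft array or scalar).
    c = alpha_pow * np.abs(np.fft.fft(h_e_tail, n=N_dft))**2
    # (61): smooth along the frequency axis with a normalized window b
    # to remove spectral zeros (here a simple moving average).
    b = np.ones(2 * w + 1) / (2 * w + 1)
    c_smooth = np.array([np.dot(b, np.take(c, np.arange(k - w, k + w + 1), mode='wrap'))
                         for k in range(N_dft)])
    return c_smooth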

ACKNOWLEDGEMENT

The authors would like to thank the anonymous reviewers for their constructive comments, which helped to improve the presentation of this paper.

REFERENCES

[1] G. Schmidt, “Applications of acoustic echo control: an overview,” in Proc. of the European Signal Processing Conference (EUSIPCO’04), Vienna, Austria, 2004.

[2] C. Breining, P. Dreiseitel, E. Hänsler, A. Mader, B. Nitsch, H. Puder, T. Schertler, G. Schmidt, and J. Tilp, “Acoustic Echo Control - An Application of Very-High-Order Adaptive Filters,” IEEE Signal Processing Mag., vol. 16, no. 4, pp. 42–69, 1999.

[3] E. Hänsler, “The hands-free telephone problem: an annotated bibliography,” Signal Processing, vol. 27, no. 3, pp. 259–271, 1992.

[4] E. Hänsler and G. Schmidt, Acoustic Echo and Noise Control: A Practical Approach. John Wiley & Sons, June 2004.

[5] V. Myllylä, “Residual echo filter for enhanced acoustic echo control,” Signal Processing, vol. 86, no. 6, pp. 1193–1205, June 2006.

[6] G. Enzner, “A model-based optimum filtering approach to acoustic echo control: Theory and practice,” Ph.D. Thesis, RWTH Aachen University, Wissenschaftsverlag Mainz, Aachen, Germany, Apr. 2006, ISBN 3-86130-648-4.

[7] V. Turbin, A. Gilloire, and P. Scalart, “Comparison of three post-filtering algorithms for residual acoustic echo reduction,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1997), vol. 1, 1997, pp. 307–310.

[8] Y. Grenier, M. Xu, J. Prado, and D. Liebenguth, “Real-time implementation of an acoustic antenna for audio-conference,” in Proc. of the International Workshop on Acoustic Echo Control, Berlin, September 1989.

[9] M. Xu and Y. Grenier, “Acoustic echo cancellation by adaptive antenna,” in Proc. of the International Workshop on Acoustic Echo Control, Berlin, September 1989.

[10] H. Yasukawa, “An acoustic echo canceller with sub-band noise cancelling,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E75-A, no. 11, pp. 1516–1523, 1992.

[11] W. Jeannes, P. Scalart, G. Faucon, and C. Beaugeant, “Combined noise and echo reduction in hands-free systems: a survey,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 8, pp. 808–820, 2001.

[12] C. Beaugeant, V. Turbin, P. Scalart, and A. Gilloire, “New optimal filtering approaches for hands-free telecommunication terminals,” Signal Processing, vol. 64, no. 1, pp. 33–47, Jan. 1998.


[13] S. Gustafsson, R. Martin, P. Jax, and P. Vary, “A psychoacoustic approach to combined acoustic echo cancellation and noise reduction,” IEEE Trans. Speech Audio Processing, vol. 10, no. 5, pp. 245–256, 2002.

[14] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443–445, April 1985.

[15] R. Martin and P. Vary, “Combined acoustic echo cancellation, dereverberation and noise reduction: A two microphone approach,” in Proc. of the Annales des Telecommunications, vol. 49, no. 7-8, July-August 1994, pp. 429–438.

[16] M. Dörbecker and S. Ernst, “Combination of two-channel spectral subtraction and adaptive Wiener post-filtering for noise reduction and dereverberation,” in Proc. of the European Signal Processing Conference (EUSIPCO 1996), Trieste, Italy, 1996.

[17] J. B. Allen, D. A. Berkley, and J. Blauert, “Multimicrophone Signal-Processing Technique to Remove Room Reverberation from Speech Signals,” Journal of the Acoustical Society of America, vol. 62, no. 4, pp. 912–915, 1977.

[18] P. Bloom and G. Cain, “Evaluation of Two Input Speech Dereverberation Techniques,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 1982), vol. 1, 1982, pp. 164–167.

[19] A. Oppenheim, R. Schafer, and J. T. Stockham, “Nonlinear filtering of multiplied and convolved signals,” in Proc. of the IEEE, vol. 56, no. 8, 1968, pp. 1264–1291.

[20] D. Bees, M. Blostein, and P. Kabal, “Reverberant speech enhancement using cepstral processing,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’91), vol. 2, 1991, pp. 977–980.

[21] B. Yegnanarayana and P. Murthy, “Enhancement of reverberant speech using LP residual signal,” IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 267–281, 2000.

[22] K. Lebart and J. Boucher, “A New Method Based on Spectral Subtraction for Speech Dereverberation,” Acta Acustica, vol. 87, pp. 359–366, 2001.

[23] E. Habets, “Multi-Channel Speech Dereverberation based on a Statistical Model of Late Reverberation,” in Proc. of the 30th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), Philadelphia, USA, March 2005, pp. 173–176.

[24] J. Wen, N. Gaubitch, E. Habets, T. Myatt, and P. Naylor, “Evaluation of Speech Dereverberation Algorithms using the MARDY Database,” in Proc. of the 10th International Workshop on Acoustic Echo and Noise Control (IWAENC 2006), Paris, France, September 2006, pp. 1–4.

[25] J. Benesty and S. L. Gay, “An improved PNLMS algorithm,” in Proc. of the 27th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), 2002, pp. 1881–1884.

[26] E. Hänsler and G. Schmidt, “Hands-free telephones - joint control of echo cancellation and postfiltering,” Signal Processing, vol. 80, pp. 2295–2305, 2000.

[27] T. Gänsler and J. Benesty, “The fast normalized cross-correlation double-talk detector,” Signal Processing, vol. 86, pp. 1124–1139, June 2006.

[28] F. Aigner and M. Strutt, “On a physiological effect of several sources of sound on the ear and its consequences in architectural acoustics,” Journal of the Acoustical Society of America, vol. 6, no. 3, pp. 155–159, 1935.

[29] J. Allen, “Effects of small room reverberation on subjective preference,” Journal of the Acoustical Society of America, vol. 71, no. S1, p. S5, 1982.

[30] S. Gustafsson, R. Martin, and P. Vary, “Combined acoustic echo control and noise reduction for hands-free telephony,” Signal Processing, vol. 64, no. 1, pp. 21–32, Jan. 1998.

[31] G. Faucon and R. L. B. Jeannes, “Joint system for acoustic echo cancellation and noise reduction,” in EuroSpeech, Madrid, Spain, Sept. 1995, pp. 1525–1528.

[32] E. Habets, I. Cohen, and S. Gannot, “MMSE Log Spectral Amplitude Estimator for Multiple Interferences,” in Proc. of the 10th International Workshop on Acoustic Echo and Noise Control (IWAENC 2006), Paris, France, September 2006, pp. 1–4.

[33] R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” IEEE Trans. Speech Audio Processing, vol. 9, no. 5, pp. 504–512, 2001.

[34] I. Cohen and B. Berdugo, “Noise Estimation by Minima Controlled Recursive Averaging for Robust Speech Enhancement,” IEEE Signal Processing Lett., vol. 9, no. 1, pp. 12–15, January 2002.

[35] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1109–1121, December 1984.

[36] I. Cohen, “Relaxed Statistical Model for Speech Enhancement and A Priori SNR Estimation,” IEEE Trans. Speech Audio Processing, vol. 13, no. 5, pp. 870–881, September 2005.

[37] ——, “Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator,” IEEE Signal Processing Lett., vol. 9, no. 4, pp. 113–116, April 2002.

[38] R. Crochiere and L. Rabiner, Multirate Digital Signal Processing. Englewood Cliffs, New Jersey: Prentice-Hall, 1983.

[39] M. Schroeder, “Integrated-impulse method measuring sound decay without using impulses,” Journal of the Acoustical Society of America, vol. 66, no. 2, pp. 497–500, 1979.

[40] H. Kuttruff, Room Acoustics, 4th ed. London: Spon Press, 2000.

[41] M. Karjalainen, P. Antsalo, A. Mäkivirta, T. Peltonen, and V. Välimäki, “Estimation of modal decay parameters from noisy response measurements,” Journal of the Audio Engineering Society, vol. 11, pp. 867–878, 2002.

[42] Y. Avargel and I. Cohen, “System identification in the short-time Fourier transform domain with crossband filtering,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 15, no. 4, pp. 1305–1319, 2007.

[43] J. Polack, “La transmission de l’énergie sonore dans les salles,” Thèse de Doctorat d’Etat, Université du Maine, Le Mans, 1988.

[44] M. Schroeder, “Frequency correlation functions of frequency responses in rooms,” Journal of the Acoustical Society of America, vol. 34, no. 12, pp. 1819–1823, 1962.

[45] A. Accardi and R. Cox, “A modular approach to speech enhancement with an application to speech coding,” in Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1999), vol. 1, 1999, pp. 201–204.

[46] A. Abramson, E. Habets, S. Gannot, and I. Cohen, “Dual-microphone speech dereverberation using GARCH modeling,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), Las Vegas, USA, 2008.

[47] D. Schobben and P. Sommen, “On the performance of too short adaptive FIR filters,” in Proc. CSSP-97, 8th Annual ProRISC/IEEE Workshop on Circuits, Systems and Signal Processing, J. Veen, Ed. Utrecht, Netherlands: STW, Technology Foundation, 1997, pp. 545–549, ISBN 90-73461-12-X.

[48] K. Mayyas, “Performance analysis of the deficient length LMS adaptive algorithm,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 53, no. 8, pp. 2727–2734, 2005.

[49] J. Shynk, “Frequency-domain and multirate adaptive filtering,” IEEE Signal Processing Mag., vol. 9, no. 1, pp. 14–37, Jan. 1992.

[50] J. B. Allen and D. A. Berkley, “Image Method for Efficiently Simulating Small Room Acoustics,” Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.

[51] P. Peterson, “Simulating the response of multiple microphones to a single acoustic source in a reverberant room,” Journal of the Acoustical Society of America, vol. 80, no. 5, pp. 1527–1529, Nov. 1986.

[52] A. Varga and H. Steeneken, “Assessment for Automatic Speech Recognition: II. NOISEX-92: A Database and an Experiment to Study the Effect of Additive Noise on Speech Recognition Systems,” Speech Communication, vol. 12, pp. 247–251, July 1993.