[doi 10.1109/NCC.2013.6487990] Padaki, Harish; Nathwani, Karan; Hegde, Rajesh M. IEEE 2013 National Conference on Communications (NCC), New Delhi, India, 15-17 February 2013.
Single Channel Speech Dereverberation Using the LP Residual Cepstrum
Harish Padaki, Karan Nathwani and Rajesh M. Hegde
Indian Institute of Technology Kanpur, Kanpur, India 208016
Email: {nathwani,rhegde}@iitk.ac.in
Abstract—Clean speech acquisition from distant microphones is often affected by the phenomenon of reverberation. In this paper, a blind single channel dereverberation method using the linear prediction (LP) residual of the reverberated speech signal is proposed. A relation between the LP residuals of the clean and the reverberated speech signals is also derived using the acoustic room impulse response. In the proposed method, the clean LP residual is computed from its reverberated counterpart by a method of cepstral subtraction. Hence there is no estimation of the acoustic room impulse response (AIR), making the method computationally simple. Experiments on speech dereverberation and distant speech recognition are conducted at various direct to reverberant ratios (DRR). The results are presented using objective measures, subjective measures, and word error rates (WER), and are compared to methods available in the literature to illustrate the significance of this method.
Index Terms—Dereverberation, LP residual, Cepstrum, Cepstral Subtraction.
I. INTRODUCTION
Clean speech acquisition from distant microphones can involve single or multiple microphones. A speech signal captured using a single distant microphone is smeared due to reverberation. Reverberation is a phenomenon in which attenuated and delayed versions of a signal are added to the signal itself. Reverberation degrades the quality of speech, makes it unintelligible, and reduces the accuracy of speech recognition systems. In general, the reverberant speech signal can be modeled as the convolution of the clean speech signal and the acoustic room impulse response (AIR). A dereverberation algorithm processes the observed speech signal so as to form an estimate of the clean speech signal, in most cases by computing the acoustic room impulse response. Hence, speech dereverberation is viewed as a blind deconvolution problem, as neither the clean speech signal nor the AIR is generally available. Various algorithms have been proposed for speech dereverberation in this context [1], [2]. Blind deconvolution methods perform effective dereverberation but are difficult to implement practically, as they have high computational complexity and high sensitivity to noise. In [3], dereverberation is carried out by using the cepstrum to determine the AIR and then using inverse filtering to obtain an estimate of the clean speech. The truncation error present in [3] was removed in [4], but inverse filtering was still required. In this paper, a method which makes use of the complex cepstrum and the linear prediction (LP) residual signal to deconvolve the reverberated speech signal is proposed. This method estimates the clean speech without actually estimating the AIR, and thereby avoids the computational complexity involved in inverse filtering.
The remainder of the paper is organized as follows. The effects of reverberation on the LP coefficients are analyzed in Section II. The relation between the clean and reverberated LP residuals is then derived in Section III. The proposed algorithm is also discussed in Section III. The performance of the dereverberation algorithm is evaluated in Section IV. Section V gives a brief conclusion.
II. LP ANALYSIS OF REVERBERANT SPEECH
The LP coefficients $b$ of the reverberated speech $x(n)$ are obtained using

$$b = R_{xx}^{-1} r_{xx} \qquad (1)$$

where $R_{xx}$ is the autocorrelation matrix of the reverberated speech and $r_{xx}$ is the autocorrelation vector. The $i$th autocorrelation coefficient of $R_{xx}$ is given by

$$r_{xx,i} = E\{x(n)\,x(n-i)\} \qquad (2)$$
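Equations 1 and 2 can be sketched in code: a biased time average stands in for the expectation in Equation 2, and the symmetric Toeplitz normal equations give the LP coefficients. The AR(1) test signal and frame length below are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_coefficients(x, order=12):
    """Solve R_xx b = r_xx (Eq. 1) using autocorrelation estimates (Eq. 2)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # biased estimates of r_{xx,i} = E{x(n) x(n-i)}, i = 0..order
    r = np.array([np.dot(x[:N - i], x[i:]) / N for i in range(order + 1)])
    # R_xx is symmetric Toeplitz with first column/row r[0..order-1]
    return solve_toeplitz((r[:order], r[:order]), r[1:])

# Usage: the predictor of an AR(1) process x(n) = 0.9 x(n-1) + e(n)
rng = np.random.default_rng(0)
s = lfilter([1.0], [1.0, -0.9], rng.standard_normal(8000))
b = lp_coefficients(s, order=12)
print(b[0])   # first coefficient is close to 0.9
```

The same helper is the building block for the residual extraction used later in the paper.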
The above equation can be written equivalently in the frequency domain as

$$r_{xx,i} = \frac{1}{2\pi}\int_{-\pi}^{\pi} |X(e^{j\omega})|^{2}\, e^{j\omega i}\, d\omega \qquad (3)$$

$$r_{xx,i} = \frac{1}{2\pi}\int_{-\pi}^{\pi} |H(e^{j\omega})|^{2}\, |S(e^{j\omega})|^{2}\, e^{j\omega i}\, d\omega \qquad (4)$$
Taking the spatial expectation on both sides of Equation 1, we get

$$E\{b\} = E\{R_{xx}^{-1} r_{xx}\} \qquad (5)$$

By using the zeroth order Taylor series, we can reduce the expectation to

$$E\{b\} \simeq E\{R_{xx}\}^{-1} E\{r_{xx}\} \qquad (6)$$
Now consider the spatial expectation of $r_{xx,i}$:

$$E\{r_{xx,i}\} = \frac{1}{2\pi}\int_{-\pi}^{\pi} E\{|H(e^{j\omega})|^{2}\, |S(e^{j\omega})|^{2}\}\, e^{j\omega i}\, d\omega \qquad (7)$$
The term $|S(e^{j\omega})|^{2}$ is the power spectral density (PSD) of the clean signal $s(n)$. It is taken outside the spatial expectation as it is independent of the source-microphone position. The spatial expectation of the energy density spectrum of the AIR can be split into a direct component and a reverberant component:

$$E\{|H(e^{j\omega})|^{2}\} \simeq |H_{d}(e^{j\omega})|^{2} + E\{|H_{r}(e^{j\omega})|^{2}\} \qquad (8)$$

978-1-4673-5952-8/13/$31.00 © 2013 IEEE
The direct and the reverberant components can be expressed respectively as

$$E\{|H(e^{j\omega})|^{2}\} = \frac{1}{4\pi D^{2}} + \frac{1-\alpha}{A\pi\alpha} \qquad (9)$$
where $D$ is the distance between the microphone and the source, $\alpha$ is the average wall absorption coefficient, and $A$ is the total surface area of the room. The expression for the expected energy density spectrum of the AIR is constant, say $\eta$, and independent of the frequency $\omega$. Therefore, Equation 7 now becomes
$$E\{r_{xx,i}\} = \frac{\eta}{2\pi}\int_{-\pi}^{\pi} |S(e^{j\omega})|^{2}\, e^{j\omega i}\, d\omega \qquad (10)$$

$$E\{r_{xx,i}\} = \eta\, r_{ss,i} \qquad (11)$$
Substituting the above result into Equation 1, we have

$$E\{b\} = R_{ss}^{-1} r_{ss} \qquad (12)$$

$$E\{b\} = a \qquad (13)$$
where $a$ are the LP coefficients (LPC) of the clean speech. The above result shows that if LPC analysis is applied to reverberant speech, the LP coefficients $a$ and $b$ are not necessarily equal at a single observation point in space. However, in terms of spatial expectation, the LPC coefficients from reverberant speech are approximately equal to those from clean speech. Intuitively, the result suggests that rather than using a single observation we can use a microphone array in a manner that approximates the computation of the spatial expectation, which will give a more accurate estimate of the LPC coefficients.
In order to illustrate the robustness of the LP coefficients to reverberation, an average error distribution (AED) is illustrated as a histogram plot in Figure 1 for both the clean-clean and the clean-reverberant case. It is clear from the AED computed over several examples that the LP coefficients are robust to reverberation. The theoretical inference on the robustness of LP to reverberation is also reinforced by comparing the LP spectrograms of clean and reverberated speech, as shown in Figure 2 for one sentence from the TIMIT database at a direct to reverberant ratio of -3 dB.
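The AED comparison can be mimicked numerically. The sketch below uses a synthetic AR(2) source and random exponentially decaying impulse responses as stand-ins for TIMIT sentences and simulated rooms (all parameters are illustrative assumptions, not the paper's setup); averaged over many realizations, the clean-reverberant LPC error concentrates near zero, consistent with Equation 13.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(x, order=12):
    # autocorrelation-method LP coefficients, as in Eq. 1
    N = len(x)
    r = np.array([np.dot(x[:N - i], x[i:]) / N for i in range(order + 1)])
    return solve_toeplitz((r[:order], r[:order]), r[1:])

rng = np.random.default_rng(1)
errors = []
for _ in range(50):                                   # 50 random "rooms"
    s = lfilter([1.0], [1.0, -1.3, 0.8], rng.standard_normal(4000))
    # synthetic AIR: unit direct path plus a decaying random tail
    h = 0.1 * rng.standard_normal(800) * np.exp(-np.arange(800) / 150.0)
    h[0] = 1.0
    x = np.convolve(s, h)[:len(s)]                    # reverberant observation
    errors.append(lpc(x) - lpc(s))                    # clean-reverberant error
errors = np.concatenate(errors)
print(np.mean(errors))   # near zero when averaged over realizations
```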
Fig. 1. Comparison of the average error distribution of (a) clean-clean and (b) clean-reverberated LP coefficients
Fig. 2. Comparison of the spectrograms of clean and reverberated speech. FFT spectrogram (top row) and LP spectrogram (bottom row)
III. SPEECH DEREVERBERATION USING THE LP RESIDUAL CEPSTRUM
Consider the frequency domain formulation of the source-filter model of speech production. The Fourier transform of the speech signal is given by

$$S(e^{j\omega}) = T(e^{j\omega})\, V(e^{j\omega}) \qquad (14)$$

where $T(e^{j\omega})$ is the Fourier transform of the prediction residual and $V(e^{j\omega})$ is the transfer function of the all-pole filter. Assuming an acoustic impulse response $H(e^{j\omega})$, the Fourier transform of the reverberant speech signal can be written as
$$X(e^{j\omega}) = S(e^{j\omega})\, H(e^{j\omega}) \qquad (15)$$
$$X(e^{j\omega}) = T(e^{j\omega})\, V(e^{j\omega})\, H(e^{j\omega}) \qquad (16)$$
With reference to Equation 13, an inverse filter $B(e^{j\omega}) = 1 + \sum_{k=1}^{p} b_{k}\, e^{-j\omega k}$ can be obtained such that $E\{B(e^{j\omega})\} = A(e^{j\omega})$. Inverse filtering the reverberant speech signal results in

$$R(e^{j\omega}) = T(e^{j\omega})\, H(e^{j\omega}) \qquad (17)$$
where $R(e^{j\omega})$ is the Fourier transform of the prediction residual obtained from the reverberant speech. Hence, in the time domain, the prediction residual obtained from reverberant speech is approximately equal to the clean speech residual convolved with the room impulse response. This approximation is a result of the inherent properties of the LP analysis described herein. If the LP coefficients of reverberant speech were identical to those from clean speech, this approximation would be an equality.
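The residual extraction implied by Equation 17 is simply inverse filtering with the estimated LP polynomial. A minimal sketch follows; the sign convention and the AR test signal are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(x, order=12):
    """Return the LP residual of x and the LP coefficients b (Eq. 1).

    Convention: x(n) is predicted as sum_k b_k x(n-k), so the inverse
    filter applied here is 1 - sum_k b_k z^{-k}.
    """
    N = len(x)
    r = np.array([np.dot(x[:N - i], x[i:]) / N for i in range(order + 1)])
    b = solve_toeplitz((r[:order], r[:order]), r[1:])
    residual = lfilter(np.concatenate(([1.0], -b)), [1.0], x)
    return residual, b

# Usage: the residual of an AR process recovers its white excitation
rng = np.random.default_rng(2)
s = lfilter([1.0], [1.0, -1.3, 0.8], rng.standard_normal(8000))
res, b = lp_residual(s, order=12)
print(np.var(res))   # close to the unit excitation variance
```

Applied to reverberant speech, the same routine yields the residual that, per Equation 17, approximates the clean residual convolved with the AIR.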
A. The algorithm for single channel speech dereverberation
The reverberated speech signal is first subjected to LP analysis. After obtaining the linear prediction coefficients, the residual signal is extracted. This residual signal is the convolution of the clean speech residual signal and the AIR. Thus, if the clean speech residual is recovered from the reverberated speech residual, the dereverberated speech signal can be synthesized. The separation of the clean residual from the reverberated residual is performed via deconvolution. The deconvolution is performed using cepstral subtraction. The cepstrum of the reverberated residual is obtained, and the peaks in the higher quefrency region of the cepstrum correspond to the AIR [3]. Hence, peak picking is applied to the cepstrum of the reverberated signal. The peaks obtained correspond to the cepstrum of the AIR. The peaks are then subtracted from the reverberated residual signal so as to perform deconvolution and obtain an estimate of the clean speech residual signal. The cepstral subtraction removes the early part of the reverberation, whereas the late reverberation components still remain. The late reverberation components are eliminated by using modified spectral subtraction [5] as part of the post processing step. The dereverberated signal is then obtained by synthesis from the estimated clean speech residual signal and the LP coefficients of the reverberated signal.
Algorithm 1 Single channel speech dereverberation using the LP residual cepstrum

1: Input: Reverberant and short term windowed speech signal acquired through one distant microphone.
2: Perform twelfth order LPC analysis and compute the LP residual using a single sample shift.
3: Compute the complex cepstrum of the LP residual and perform peak picking in the higher quefrency region beyond 25 ms.
4: Subtract the selected peaks from the cepstrum of the LP residual.
5: Compute the inverse complex cepstrum of the LP residual and the corresponding short term spectrum.
6: Remove late reverberation by using modified spectral subtraction.
7: Use the overlap add (OLA) method to reconstruct the LP residual.
8: Use the LP residual along with the LP coefficients from Step 2 to synthesize the clean speech signal.
9: Output: The dereverberated speech signal.
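Steps 3 through 5 of Algorithm 1 can be sketched for a single windowed residual frame as below. The phase-unwrapped complex cepstrum and the threshold-based peak picker are assumptions (the paper does not specify a threshold rule, and a careful implementation would also remove any linear phase term before taking the cepstrum); this is an illustrative sketch, not the authors' exact implementation.

```python
import numpy as np

def cepstral_subtraction(residual_frame, fs=8000, quef_ms=25.0, k=1.5):
    """Sketch of Steps 3-5: cepstral peak picking beyond `quef_ms` and
    subtraction of the picked (AIR-attributed) peaks, per [3]."""
    n = len(residual_frame)
    X = np.fft.fft(residual_frame)
    # complex cepstrum: IFFT of log magnitude plus unwrapped phase
    log_X = np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))
    c = np.fft.ifft(log_X).real

    # peak picking in the high-quefrency region beyond 25 ms
    q0 = int(quef_ms * 1e-3 * fs)
    tail = c[q0:n // 2]
    # assumed rule: a "peak" is any bin exceeding k times the tail's std
    peaks = np.abs(tail) > k * np.std(tail)
    c_clean = c.copy()
    c_clean[q0:n // 2][peaks] = 0.0   # subtract the picked AIR peaks

    # inverse complex cepstrum back to a time-domain residual estimate
    X_clean = np.exp(np.fft.fft(c_clean))
    return np.fft.ifft(X_clean).real

# Usage: apply to one windowed residual frame (synthetic here)
frame = np.hanning(512) * np.random.default_rng(4).standard_normal(512)
est = cepstral_subtraction(frame)
```

In the full algorithm the cleaned frames would then pass through Steps 6 through 8 (modified spectral subtraction, overlap add, and LP synthesis).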
B. Removing late reverberation using modified spectral subtraction
In this section, the post processing of the clean speech signal using a modified spectral subtraction method is described. This step helps in removing the late reverberation components still present in the dereverberated speech signal. The short-time Fourier transform (STFT) $X(\omega, m)$ of the speech $x(k)$ obtained after cepstral subtraction is a linear combination of the STFT $S(\omega, m)$ of the original speech $s(k)$, that is
$$X(\omega, m) = S(\omega, m) + \sum_{i=1}^{M} \alpha_{i}(\omega)\, S(\omega, m-i) \qquad (18)$$
where the indexes $\omega$ and $m$ refer to the frequency bin and time frame respectively, $\alpha_{i}(\omega)$ is the coefficient of the late reverberation for the previous $i$ frames, and $M$ is the duration of the reverberation. Here, $\alpha_{i}(\omega) \ll 1$ because the cepstral subtraction reduces the early reflection part, which has most of the power of the reverberation. Therefore, the power spectrum of the late reverberation can be approximated by
$$P(\omega, m) \approx \sum_{i=1}^{M} |\alpha_{i}(\omega)|^{2}\, |X(\omega, m-i)|^{2} \qquad (19)$$
If we assume that the reverberation components are approximately uncorrelated between frames, the coefficients of the late reverberation are estimated by
$$\alpha_{i}(\omega) = E\left[\frac{X(\omega, m)\, X^{*}(\omega, m-i)}{|X(\omega, m-i)|^{2}}\right] \qquad (20)$$
Spectral subtraction is now employed to obtain the dereverberated signal:

$$Y(\omega, m) = X(\omega, m)\, G(\omega, m) \qquad (21)$$

where $Y(\omega, m)$ is the STFT of the recovered speech $y(k)$ and

$$G(\omega, m) = \left[\frac{|X(\omega, m)|^{2} - P(\omega, m)}{|X(\omega, m)|^{2}}\right]^{\frac{1}{2}} \qquad (22)$$
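Equations 19 through 22 can be sketched on an STFT matrix as follows. Replacing the expectation in Equation 20 with a time average and adding a spectral floor are assumptions typical of such implementations, not details given in the paper.

```python
import numpy as np

def late_reverb_suppress(X, M=5, floor=0.1):
    """Sketch of Eqs. 19-22 on an STFT matrix X[freq, frame]."""
    F, T = X.shape
    P = np.zeros((F, T))
    for i in range(1, M + 1):
        Xi = np.zeros_like(X)
        Xi[:, i:] = X[:, :-i]                        # X(omega, m - i)
        num = np.mean(X * np.conj(Xi), axis=1)       # E[X(m) X*(m-i)]
        den = np.mean(np.abs(Xi) ** 2, axis=1) + 1e-12
        alpha = num / den                            # Eq. 20, time-averaged
        P += (np.abs(alpha) ** 2)[:, None] * np.abs(Xi) ** 2   # Eq. 19
    # Eq. 22 with an assumed spectral floor to avoid negative power
    G = np.sqrt(np.maximum(np.abs(X) ** 2 - P, floor * np.abs(X) ** 2)
                / (np.abs(X) ** 2 + 1e-12))
    return G * X                                     # Eq. 21

# Usage: on a random complex "STFT" (129 bins x 40 frames, synthetic)
rng = np.random.default_rng(0)
X = rng.standard_normal((129, 40)) + 1j * rng.standard_normal((129, 40))
Y = late_reverb_suppress(X)
```

Since the gain never exceeds one, the processed magnitude is bounded by the observed magnitude in every bin.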
The dereverberated signal $y(k)$ is reconstructed from the estimated STFT $Y(\omega, m)$ through the inverse STFT and overlap add techniques.

Fig. 3. Figure illustrating the spectrograms of the clean (top), reverberant at DRR = -3 dB (middle), and dereverberated (bottom) speech signals for a sentence uttered by a female speaker

The performance of the algorithm is illustrated by considering a sentence uttered by a female speaker sampled at 8 kHz. The AIR is simulated using the image method [6]. The dimension of the room used for the simulation
is 10.4 m x 10.4 m x 4.2 m. Figure 3 illustrates the results of dereverberation using the proposed algorithm. It can be noticed from Figure 3 that there is a reasonable improvement in the dereverberated speech spectrogram.
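For reference, the DRR of a simulated AIR can be computed by splitting the response into direct and reverberant parts. The 2.5 ms direct-path window below is a common convention, and the synthetic AIR is purely illustrative; neither is specified in the paper.

```python
import numpy as np

def drr_db(h, fs=8000, direct_ms=2.5):
    """Direct-to-reverberant ratio of an AIR h, in dB: energy within
    `direct_ms` of the strongest tap versus all later energy."""
    h = np.asarray(h, dtype=float)
    d0 = int(np.argmax(np.abs(h)))
    nd = d0 + int(direct_ms * 1e-3 * fs)
    direct = np.sum(h[:nd + 1] ** 2)
    late = np.sum(h[nd + 1:] ** 2) + 1e-12
    return 10.0 * np.log10(direct / late)

# Usage: a synthetic exponentially decaying AIR (illustrative only)
rng = np.random.default_rng(3)
h = 0.1 * rng.standard_normal(2000) * np.exp(-np.arange(2000) / 300.0)
h[0] = 1.0
print(drr_db(h))
```

Moving the source farther from the microphone weakens the direct tap relative to the tail, which is how the different DRR conditions in the experiments arise.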
IV. PERFORMANCE EVALUATION
The performance of the proposed method is illustrated by conducting three sets of experiments. Experiments on perceptual similarity evaluation, speech dereverberation, and distant speech recognition are conducted at various DRRs. The results are presented using perceptual similarity measures, objective measures, subjective measures, and word error rates respectively. The log spectral distance (LSD) and the signal to reverberation ratio (SRR) are used as objective measures, while the mean opinion score (MOS) is used as a subjective measure to quantify the experimental results. The perceptual similarity measure (PSM) and its instantaneous audio quality (PSMt) scores are used to measure the perceptual audio quality between two signals. Word error rates (WER) are used to illustrate the results of distant speech recognition.
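The two objective measures can be sketched as below. The framing parameters and the global (rather than segmental) SRR definition are assumptions, since the paper defers the precise definitions to [11].

```python
import numpy as np

def lsd(s, y, n_fft=256, hop=128):
    """Log-spectral distance between clean s and processed y:
    dB-domain RMS per frame, averaged over frames (assumed definition)."""
    def spec(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft, hop)]
        return np.abs(np.fft.rfft(frames, axis=1)) + 1e-8
    S, Y = spec(s), spec(y)
    d = 20 * np.log10(S) - 20 * np.log10(Y)
    return float(np.mean(np.sqrt(np.mean(d ** 2, axis=1))))

def srr_db(s, y):
    """Signal-to-reverberation ratio in dB: clean energy over the energy of
    the residual error (global simplification of the measure in [11])."""
    e = y[:len(s)] - s[:len(y)]
    return 10.0 * np.log10(np.sum(s[:len(e)] ** 2) / (np.sum(e ** 2) + 1e-12))

# Usage: identical signals give zero LSD; a 1% error gives 40 dB SRR
rng = np.random.default_rng(5)
s = rng.standard_normal(4000)
print(lsd(s, s), srr_db(s, 0.99 * s))
```

Lower LSD and higher SRR indicate better dereverberation, matching the direction of the results in Table II.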
A. Experimental conditions
The TIMIT [7] and MONC [8] databases are used in the experiments on speech dereverberation and distant speech recognition. Sentences from both databases are reverberated at different DRRs and used in the experiments. The reverberant speech signal, or the AIR, is simulated using a room whose dimensions and setup are illustrated in Figure 4.

Fig. 4. The room dimensions and setup to acquire reverberant speech

For the subjective and objective evaluations, the AIR computed using the aforementioned room setup is used. For the experiments on distant speech recognition, the AIR is simulated with source S1 at five different locations corresponding to five different DRRs.
B. Experimental results on perceptual similarity measures
To predict perceived quality differences between audio signals, PEMO-Q [9] is used to compute internal representations of signal pairs. The PSM result is obtained by calculating the correlation value between the two audio signals. PEMO-Q also estimates the instantaneous audio quality as a function of time by frame-wise correlation (output vector 'PSMt') [10]. The higher the PSM and PSMt scores, the better the method. It can be noted from Table I that the proposed LP residual cepstrum method has better perceptual similarity measures compared to the other methods.
TABLE I
THE PSM (P) AND PSMt (Pt) SCORES FOR THE TIMIT & MONC DATABASES

| Methods | TIMIT -5dB P | TIMIT -5dB Pt | TIMIT -1dB P | TIMIT -1dB Pt | MONC -5dB P | MONC -5dB Pt | MONC -1dB P | MONC -1dB Pt |
|---|---|---|---|---|---|---|---|---|
| IF | 0.57 | -0.19 | 0.66 | -0.12 | 0.45 | -0.35 | 0.56 | -0.24 |
| TA | 0.61 | -0.15 | 0.69 | -0.08 | 0.51 | -0.19 | 0.62 | -0.10 |
| LP | 0.65 | -0.10 | 0.73 | -0.05 | 0.54 | -0.17 | 0.65 | -0.08 |
| SS | 0.55 | -0.20 | 0.63 | -0.12 | 0.48 | -0.29 | 0.59 | -0.19 |
| TS | 0.50 | -0.26 | 0.61 | -0.15 | 0.44 | -0.37 | 0.53 | -0.26 |
| KA | 0.46 | -0.29 | 0.57 | -0.18 | 0.40 | -0.39 | 0.50 | -0.29 |
C. Experimental results on speech dereverberation

The experimental results on speech dereverberation are listed as both objective and subjective measures in Table II. The log spectral distortion is a speech distortion measure well suited for the assessment of dereverberation algorithms [11]. On the other hand, the signal to reverberation ratio [11] is a measure of reverberation which is dependent on the signal. Both the LSD and the SRR are used as objective measures. For the subjective evaluation, the MOS is used. The subjective evaluation was done by 20 listeners in the age group of 21 to 25 years on dereverberated sentences from the TIMIT and MONC databases. The measures for the proposed method (the LP residual cepstrum algorithm) along with inverse filtering (IF) [3], temporal averaging (TA) [4], spectral subtraction (SS) [5], the two stage (TS) algorithm for one microphone [12], and the kurtosis algorithm (KA) [13] are listed in Table II. The experimental results indicate that the proposed algorithm provides reasonable improvements in terms of both objective and subjective evaluations when compared to standard dereverberation methods.
TABLE II
EXPERIMENTAL RESULTS OF SPEECH DEREVERBERATION USING OBJECTIVE MEASURES (LSD, SRR) AND SUBJECTIVE MEASURES (MOS) FOR THE TIMIT & MONC DATABASES

| Methods | TIMIT LSD | TIMIT SRR | TIMIT MOS | MONC LSD | MONC SRR | MONC MOS |
|---|---|---|---|---|---|---|
| IF | 1.79 | 2.01 | 3.3 | 2.10 | 0.85 | 2.3 |
| TA | 2.13 | 2.03 | 3.7 | 1.78 | 1.25 | 3.0 |
| LP | 1.69 | 2.64 | 4.2 | 1.70 | 1.79 | 3.3 |
| SS | 1.77 | 2.01 | 3.4 | 2.04 | 1.09 | 2.9 |
| TS | 1.75 | 2.02 | 3.6 | 1.75 | 1.20 | 3.0 |
| KA | 2.22 | 1.68 | 2.7 | 2.30 | 0.73 | 2.0 |
D. Experimental results on distant speech recognition
Distant speech recognition experiments are conducted on the reverberated sentences of the TIMIT & MONC databases at different DRRs using the experimental room setup described in Figure 4. The reverberant versions of sentences from the TIMIT & MONC databases are first obtained by computing the AIR with source S1 at five different locations in the room
TABLE III
THE WORD ERROR RATES FOR THE TIMIT & MONC DATABASES

| Methods | TIMIT -5dB | TIMIT -4dB | TIMIT -3dB | TIMIT -1dB | MONC -5dB | MONC -4dB | MONC -3dB | MONC -1dB |
|---|---|---|---|---|---|---|---|---|
| IF | 38.2 | 36.6 | 32.4 | 26.8 | 53.4 | 52.0 | 51.6 | 47.4 |
| TA | 34.8 | 32.4 | 30.2 | 29.2 | 43.4 | 39.6 | 38.8 | 37.6 |
| LP | 33.1 | 28.4 | 26.6 | 22.8 | 39.8 | 36.4 | 33.8 | 30.2 |
| SS | 42.0 | 37.6 | 32.8 | 26.0 | 46.8 | 45.4 | 44.8 | 42.8 |
| TS | 36.0 | 34.2 | 30.2 | 26.0 | 52.0 | 51.2 | 47.8 | 44.2 |
| KA | 58.0 | 56.4 | 56.0 | 52.0 | 56.6 | 52.2 | 52.0 | 49.6 |
corresponding to four different DRRs. The proposed method is compared with the other methods in speech recognition experiments, as shown in Table III. It can be noted that the proposed method gives reasonable improvements in speech recognition accuracy when compared to the other methods.
V. CONCLUSIONS
A blind speech dereverberation method based on the linear prediction (LP) residual cepstrum is proposed in this paper. The method explores deconvolution based on the source information of the speech signal in the cepstral domain. The method is computationally less complex than general deconvolution methods, which often rely on the estimation of the acoustic impulse response prior to deconvolution of the clean speech from the reverberant component. The proposed method gives reasonable improvements in terms of computational ease and in the performance evaluation of dereverberation and distant microphone speech recognition. However, there are a few issues that need to be addressed. The peak detection algorithm is prone to errors at the very low direct to reverberant ratios (DRR) often encountered in large rooms, due to spurious peaks present in the regions where late reverberation components are prominent. The theoretical and performance evaluation of the method is currently being investigated for large reverberation times and low signal to noise ratio scenarios.
REFERENCES
[1] S. Subramaniam, A. P. Petropulu, and C. Wendt, "Cepstrum-based deconvolution for speech dereverberation," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 392–396, 1996.
[2] R. A. Kennedy and B. D. Radlovic, "Iterative cepstrum-based approach for speech dereverberation," in Proceedings of the Fifth International Symposium on Signal Processing and Its Applications (ISSPA '99). IEEE, 1999, vol. 1, pp. 55–58.
[3] D. Bees, M. Blostein, and P. Kabal, "Reverberant speech enhancement using cepstral processing," in Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-91). IEEE, 1991, pp. 977–980.
[4] S. Xizhong and M. Guang, "Complex cepstrum based single channel speech dereverberation," in Proceedings of the 4th International Conference on Computer Science & Education (ICCSE '09). IEEE, 2009, pp. 7–11.
[5] K. Furuya, S. Sakauchi, and A. Kataoka, "Speech dereverberation by combining MINT-based blind deconvolution and modified spectral subtraction," in Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006). IEEE, 2006, vol. 1, pp. I–I.
[6] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, 1979.
[7] J. S. Garofolo, TIMIT: Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, 1993.
[8] Avram Levi, Multi Channel Overlapping Numbers Corpus distribution, Linguistic Data Consortium, (http://cslu.cse.ogi.edu/corpora/), 2003.
[9] R. Huber and B. Kollmeier, "PEMO-Q: A new method for objective audio quality assessment using a model of auditory perception," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 1902–1911, 2006.
[10] S. Goetze, E. Albertin, M. Kallinger, A. Mertins, and K. D. Kammeyer, "Quality assessment for listening-room compensation algorithms," in Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2010, pp. 2450–2453.
[11] P. A. Naylor and N. D. Gaubitch, "Acoustic signal processing in noise: It's not getting any quieter," in Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC 2012). VDE, 2012, pp. 1–6.
[12] M. Wu and D. L. Wang, "A two-stage algorithm for one-microphone reverberant speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 774–784, 2006.
[13] V. Tomar, "Blind dereverberation using maximum kurtosis of the speech residual," 2010.