=1=distributed speech processing for microphone networks

=1=Distributed Speech Processing for Microphone Networks
Shmulik Markovich-Golan
Tel-Aviv University, Israel November, 2013
S. Markovich-Golan (BIU) Distributed Speech Proc. for Mic. Networks Tel-Aviv University, Israel 1 / 49
Thanks to my Supervisors
Introduction Motivation
Advantages
Using more microphones improves spatial resolution.
High probability to find microphones close to a relevant sound source.
Improved sound field sampling.
Beamforming algorithms for distributed microphone constellation:
Ad hoc sensor networks. Large volume (and many nodes).
Robustness against randomly deployed microphones:
High fault percentage. Arbitrary deployment of nodes.
Background Room Acoustics
Room Acoustics Essentials
Uncorrelated: Signals at different microphones are uncorrelated.
Diffuse: Sound arrives from all directions [Dal-Degan and Prati, 1988]; [Habets and Gannot, 2007].
Deteriorates intelligibility.
The Room Impulse Response (RIR) [Allen and Berkley, 1979]; simulator: [Habets, 2006]; [Polack, 1993]; [Jot et al., 1997]
[Figure: room impulse response; horizontal axis 0 to 0.3, vertical axis −0.01 to 0.07.]
Reverberation should be taken into consideration while designing the algorithms even if it does not deteriorate speech quality and intelligibility.
Background Array Processing Preliminaries
w: M × 1 beamforming vector of filters (or just gains).
[Figure: polar beampattern; angle ticks every 30°, gain contours from 0 dB down to −40 dB.]
10 microphone uniform linear array.
2 Desired sources in green and 2 interfering sources in red.
Can be obtained by applying the LCMV criterion.
Array Design for Speech Propagating in Acoustic Environments
Beampatterns: Array response as a function of the angle of arrival (AoA).
In reverberant environments (especially for low DRR), sound propagation is more complicated than merely the AoA.
The steering vector (a function of the AoA) generalizes to the acoustic transfer function (ATF).
The ATF summarizes all arrivals of the speech signals.
The vector of received signals is treated as a vector in an abstract linear space.
Linear Algebra methods are utilized to construct beamformers.
AoA becomes less prominent.
Problem Formulation
Short-Time Fourier Transform (STFT) - Multiplicative Transfer Function (MTF) Approximation
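Time domain to STFT domain: $t \rightarrow \{\ell, k\}$ (frame index $\ell$, frequency bin $k$); under the MTF approximation, convolution becomes multiplication:
$$z_m(\ell,k) = \sum_{j=1}^{P_d} s^d_j(\ell,k)\, h^d_{jm}(\ell,k) + \sum_{j=1}^{P_i} s^i_j(\ell,k)\, h^i_{jm}(\ell,k) + \sum_{j=1}^{P_n} s^n_j(\ell,k)\, h^n_{jm}(\ell,k) + v_m(\ell,k)$$
Total number of sources (desired, interfering, and noise): $P = P_d + P_i + P_n \le M$.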
Beamforming in the STFT Domain
Apply filter & sum beamforming independently for each frequency bin.
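As a minimal sketch of per-bin filter-and-sum, the snippet below evaluates $y(\ell,k) = w^H(k)\, z(\ell,k)$ for every frame and bin. The function name `filter_and_sum`, the nested-list layout, and the time-invariant weights are assumptions made for illustration only; they are not taken from the slides.

```python
def filter_and_sum(z, w):
    """Apply y(l, k) = w(k)^H z(l, k) independently for each frequency bin.

    z: list of frames; z[l][k] is a list of M complex microphone STFT coefficients.
    w: list of bins; w[k] is a list of M complex weights (assumed time-invariant).
    Returns y, where y[l][k] is the complex beamformer output.
    """
    return [
        [sum(wm.conjugate() * zm for wm, zm in zip(w[k], frame[k]))
         for k in range(len(w))]
        for frame in z
    ]


# Toy usage (hypothetical numbers): 1 frame, 2 bins, 2 microphones.
z = [[[1 + 1j, 1 - 1j], [2 + 0j, 0 + 2j]]]
w = [[0.5 + 0j, 0.5 + 0j], [1 + 0j, 0 + 1j]]
y = filter_and_sum(z, w)
```

Each bin uses its own weight vector, so the bins can be processed in parallel or in any order.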
Linearly Constrained Minimum Variance Closed-Form Solution
Linearly Constrained Minimum Variance Beamformer [Er and Cantoni, 1983]; [Van Veen and Buckley, 1988]
LCMV Criterion
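Beamformer output: $y(\ell,k) = w^H(\ell,k)\, z(\ell,k)$.
Let $\Phi_{zz} = E\{z z^H\}$ be the $M \times M$ correlation matrix of the microphone signals.
Minimize the noise power $w^H \Phi_{zz} w$ such that a linear constraint set is satisfied: $C^H w = g$.
$C$: $M \times P$ constraints matrix (usually equals $H$).
$g$: $P \times 1$ response vector.
Closed-form Solution
$$w_{\mathrm{LCMV}} = \Phi_{zz}^{-1} C \left( C^H \Phi_{zz}^{-1} C \right)^{-1} g$$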
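The closed-form LCMV solution for a single frequency bin can be sketched with NumPy as follows. The function name `lcmv_weights`, the random constraints matrix, and the toy correlation matrix are assumptions made for illustration; the sketch only checks that the constraint set $C^H w = g$ holds.

```python
import numpy as np


def lcmv_weights(Phi_zz, C, g):
    """Closed-form LCMV: w = Phi^{-1} C (C^H Phi^{-1} C)^{-1} g."""
    Phi_inv_C = np.linalg.solve(Phi_zz, C)            # Phi^{-1} C, size M x P
    return Phi_inv_C @ np.linalg.solve(C.conj().T @ Phi_inv_C, g)


rng = np.random.default_rng(0)
M, P = 6, 2                                           # hypothetical sizes
C = rng.standard_normal((M, P)) + 1j * rng.standard_normal((M, P))
g = np.array([1.0, 0.0], dtype=complex)               # keep source 1, null source 2
# Toy correlation matrix: two unit-power sources plus white sensor noise.
Phi_zz = C @ C.conj().T + 0.1 * np.eye(M)

w = lcmv_weights(Phi_zz, C, g)
constraint_error = np.linalg.norm(C.conj().T @ w - g)
```

Solving linear systems with `np.linalg.solve` avoids forming $\Phi_{zz}^{-1}$ explicitly, which is usually better conditioned numerically.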
LCMV Minimization Graphical Interpretation [Frost III, 1972]
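[Figure: graphical interpretation of the constrained minimization in the ($w_1$, $w_2$) plane, with the solution $w_{\mathrm{LCMV}}$ marked.]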
Linearly Constrained Minimum Variance The GSC Implementation
The Generalized Sidelobe Canceller Implementation For Constrained Minimization [Griffiths and Jim, 1982]
Split the beamformer: $w = q - B f$.
Fixed Beamformer (FBF): $q = C \left( C^H C \right)^{-1} g \in \mathrm{Span}\{C\}$.
$w_n \triangleq B f \in \mathcal{N}\{C\}$.
Blocking matrix (BM): $B$, a basis for $\mathcal{N}\{C\}$ (so that $B^H C = 0$), generates $M - P$ unconstrained signals, the degrees of freedom (DoF).
Adaptive Noise Canceler (ANC): suppresses the residual noise utilizing the DoFs, $f = \left( B^H \Phi_{zz} B \right)^{-1} B^H \Phi_{zz}\, q$.
Can be recursively updated using the LMS algorithm [Widrow et al., 1975]; [Shynk, 1992].
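A minimal NumPy sketch of the GSC split, assuming the blocking matrix is taken from the last $M - P$ left singular vectors of $C$ (one possible basis choice, not necessarily the one used by the author). With the closed-form ANC, the result should coincide with the closed-form LCMV beamformer up to numerical precision. All names and the toy data are hypothetical.

```python
import numpy as np


def gsc_weights(Phi_zz, C, g):
    """Return (w, q, B, f) with w = q - B f, following the GSC split."""
    P = C.shape[1]
    CH = C.conj().T
    q = C @ np.linalg.solve(CH @ C, g)          # fixed beamformer, lies in Span{C}
    U, _, _ = np.linalg.svd(C)                  # full SVD: U is M x M
    B = U[:, P:]                                # M x (M - P), satisfies B^H C = 0
    BH = B.conj().T
    f = np.linalg.solve(BH @ Phi_zz @ B, BH @ Phi_zz @ q)   # closed-form ANC
    return q - B @ f, q, B, f


rng = np.random.default_rng(1)
M, P = 6, 2                                     # hypothetical sizes
C = rng.standard_normal((M, P)) + 1j * rng.standard_normal((M, P))
g = np.array([1.0, 0.0], dtype=complex)
Phi_zz = C @ C.conj().T + 0.1 * np.eye(M)       # toy correlation matrix

w_gsc, q, B, f = gsc_weights(Phi_zz, C, g)

# Closed-form LCMV for comparison.
Phi_inv_C = np.linalg.solve(Phi_zz, C)
w_lcmv = Phi_inv_C @ np.linalg.solve(C.conj().T @ Phi_inv_C, g)
max_diff = np.max(np.abs(w_gsc - w_lcmv))
```

In an adaptive implementation the closed-form `f` would be replaced by a recursive update, so that $\Phi_{zz}$ never has to be inverted.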
Multi-Constraint Beamformer Algorithm
Based on LCMV Beamforming
Relax the dereverberation requirements using RTFs [Gannot et al., 2001]
Estimating H is a cumbersome task.
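In many scenarios reverberation does not compromise intelligibility.
Relative Transfer Function (RTF) of the jth desired source: $\tilde{h}^d_j \triangleq h^d_j / h^d_{j0}$, the ATF vector normalized by its reference component $h^d_{j0}$.
LCMV output:
$$y = \sum_{j=1}^{P_d} h^d_{j0}\, s^d_j + \text{noise components}$$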
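The RTF definition can be illustrated with a short sketch that normalizes an ATF vector (one frequency bin) by its entry at a reference microphone. The function name `relative_transfer_function`, the `ref` index, and the numbers are assumptions for illustration; this shows only the definition, not a blind estimation procedure.

```python
def relative_transfer_function(h, ref=0):
    """Normalize an ATF vector by its reference-microphone component."""
    h_ref = h[ref]
    if h_ref == 0:
        raise ValueError("reference ATF component is zero")
    return [hm / h_ref for hm in h]


# Toy ATF of one source at three microphones (hypothetical values).
h = [0.5 + 0.5j, 1.0 + 0j, -0.25j]
rtf = relative_transfer_function(h, ref=0)
```

The reference entry of the RTF equals one by construction, and a common complex scaling of the ATF leaves the RTF unchanged.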
The Importance of the RTF
[Figure: two panels; horizontal axes 0 to 0.1 and −0.2 to 0.2, vertical axes −0.4 to 1.]
Can be blindly estimated from data.
No need to know microphone position (crucial in ad hoc applications).
A multitude of estimation procedures exist.
Drawback: Non-causal (in severe cases can cause “pre-echo”).
Multi-Constraint Beamformer Example
(a) Noisy at mic. #1 (b) Enhanced signal
Figure: 1 desired source and 3 competing speakers. 8 microphones recorded at BIU acoustic lab set to T60 = 300ms. Approximately 20dB SIR and SNR improvement.
+ Beamformer components estimated from the received signals.
+ High amount of noise and interference reduction.
+ Low speech distortion.
- Number of filter coefficients to be estimated tends to be very large.
- Hence frame length tends to be large as well (can be mitigated at the expense of complexity; see the CTF approximation [Talmon et al., 2009]).
- Limited performance in diffuse noise fields (can be mitigated by using postfiltering [Balan and Rosca, 2002, Zelinski, 1988, Meyer and Simmer, 1997, Marro et al., 1998, McCowan and Bourlard, 2003, Leukimmiatis et al., 2006, Cohen et al., 2003, Gannot and Cohen, 2004]).
Distributed GSC Motivation
$N$ nodes, the nth with $M_n$ microphones: $\sum_{n=1}^{N} M_n = M$.
Stacked microphone signals: $z \triangleq \left[ z_1^T \cdots z_N^T \right]^T$.
The closed-form LCMV necessitates the inversion of $\Phi_{zz}$, a cumbersome task in distributed networks.
Naïve GSC Implementation
Implement a local GSC at each node and sum the local outputs: $y = \sum_{n=1}^{N} y_n$.
$M_n - P$ outputs of the BM at the nth node (might go negative!).
Total number of BM outputs: $\sum_{n=1}^{N} (M_n - P) = M - N P$.
$M - N P < M - P$ (for $N > 1$) ⇒ degrees of freedom (DoF) lost ⇒ incomplete minimization ⇒ performance degradation.
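The DoF bookkeeping above can be checked with a few lines of Python. The function names and the example network (three nodes with four microphones each, three constraints) are hypothetical.

```python
def dof_centralized(mics_per_node, P):
    """DoF of a centralized GSC: M - P, with M the total microphone count."""
    return sum(mics_per_node) - P


def dof_naive_local_gsc(mics_per_node, P):
    """Total BM outputs when every node runs its own GSC: sum_n (M_n - P)."""
    return sum(Mn - P for Mn in mics_per_node)


mics_per_node = [4, 4, 4]   # hypothetical network: N = 3 nodes, M = 12
P = 3                       # hypothetical number of constraints

central = dof_centralized(mics_per_node, P)        # M - P
naive = dof_naive_local_gsc(mics_per_node, P)      # M - N * P
lost = central - naive                             # (N - 1) * P
```

For this example the naïve scheme keeps 3 of the 9 available DoF; the loss grows linearly with the number of nodes.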
Distributed GSC Algorithm
Overview
Introduce P shared signals:
Broadcast by a subset of the nodes. Retrieve degrees of freedom.
Extended inputs at each node:
Local microphones plus shared signals. Purely local FBF, BM, ANC.
DGSC adaptively converges to the centralized solution.
[Figure: block diagram of the nodes, each producing a local BF output; two of them also broadcast a shared signal.]
Sources “Owned” by the nth Node:
A node n that receives the pth source with the highest SNR is declared its “owner”.
The shared signals broadcast by the nth node: $r_n = D_n^H z_n$.
$D_n$: an $M_n \times P_n$ selection matrix.
A shared signal (one component of rn) is responsible for only one source.
Shared signals serve as a reference for RTF estimation in each node.
Extended Inputs at the nth Node
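$P - P_n$ shared signals (excluding self-owned signals): $\bar{r}_n$.
Total number of signals: $\bar{M}_n = M_n + P - P_n$.
Signals: $\bar{z}_n = \left[ z_n^T \;\; \bar{r}_n^T \right]^T$.
An $\bar{M}_n \times 1$ local GSC-BF at the nth node: $\bar{w}_n$.
Outputs of local GSC-BFs: $y_n = \bar{w}_n^H \bar{z}_n$; $\forall n = 1, 2, \ldots, N$.
Global BF: $w \triangleq \left[ \bar{w}_1^T \cdots \bar{w}_N^T \right]^T$, with output $y = \sum_{n=1}^{N} y_n$.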
Fixed Beamformer (Local)
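$H_n$: the RTFs relating the extended inputs and the shared signals.
Build the local FBF $q_n$ using only local RTFs: $q_n \triangleq \frac{1}{N} H_n \left( H_n^H H_n \right)^{-1} g$.
Local BM: $B_n$, a basis for $\mathcal{N}\{H_n\}$, with $M_n - P_n$ outputs at the nth node.
$\sum_{n=1}^{N} (M_n - P_n) = M - P$ ⇒ DoF fully utilized.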
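A short sketch of the DoF count with shared signals, assuming each node's extended input has $M_n + P - P_n$ signals and each local BM removes $P$ constraints. The function names and the example ownership pattern are hypothetical.

```python
def extended_input_sizes(mics_per_node, owned_per_node):
    """Extended input size at each node: M_n + P - P_n, with P = sum_n P_n."""
    P = sum(owned_per_node)
    return [Mn + P - Pn for Mn, Pn in zip(mics_per_node, owned_per_node)]


def dgsc_bm_outputs(mics_per_node, owned_per_node):
    """BM outputs at each node after P constraints: (M_n + P - P_n) - P = M_n - P_n."""
    P = sum(owned_per_node)
    return [size - P for size in extended_input_sizes(mics_per_node, owned_per_node)]


mics_per_node = [4, 4, 4]    # hypothetical network, M = 12
owned_per_node = [1, 1, 1]   # each node owns one source, P = 3

sizes = extended_input_sizes(mics_per_node, owned_per_node)
bm_outputs = dgsc_bm_outputs(mics_per_node, owned_per_node)
total_dof = sum(bm_outputs)  # equals M - P
```

The total matches the centralized count $M - P$, in contrast to a purely local scheme without shared signals.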
Adaptive Noise Canceler (Local with Global Error)
Least Mean Squares: $f_n(\ell+1) = f_n(\ell) + \mu \dfrac{u_n(\ell)\, y^*(\ell)}{P_{u,n}(\ell)}$.
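The normalized LMS update can be sketched for a single node in plain Python. This is a toy setting under stated assumptions: $u$ is taken to be the vector of BM outputs, $P_u$ is assumed to be the instantaneous power $\|u\|^2$ (the slides do not define it), and the FBF output is synthesized so that it is perfectly predictable from $u$. The names `nlms_step` and `f_true` are hypothetical.

```python
import random


def nlms_step(f, u, y, mu, p_u):
    """One update f(l+1) = f(l) + mu * u(l) * conj(y(l)) / P_u(l)."""
    return [fi + mu * ui * y.conjugate() / p_u for fi, ui in zip(f, u)]


random.seed(0)
f_true = [0.5 - 0.2j, -0.3 + 0.1j, 0.1 + 0.4j]   # hypothetical optimal ANC filter
f = [0j] * len(f_true)

for _ in range(500):
    # BM outputs: complex Gaussian noise references (toy data).
    u = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in f]
    # FBF output containing only noise that the ANC can cancel.
    d = sum(ft.conjugate() * ui for ft, ui in zip(f_true, u))
    # Beamformer output y = d - f^H u, used as the error signal.
    y = d - sum(fi.conjugate() * ui for fi, ui in zip(f, u))
    p_u = sum(abs(ui) ** 2 for ui in u) + 1e-12   # assumed power normalization
    f = nlms_step(f, u, y, mu=0.5, p_u=p_u)

error = max(abs(fi - ft) for fi, ft in zip(f, f_true))
```

In the distributed setting each node would run such an update on its own filter while using the global output $y$ as the common error signal.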
Distributed processing for distributed constellation.
It is shown [Markovich-Golan et al., 2013a] that the distributed and centralized LCMV implementations are identical.
The proof is based on the following: the constraint set is a subspace of the M-dimensional linear space, and extending the dimension of the linear space with the shared signals does not alter the subspace.
Local input signals selection is (quasi-) fixed:
Original inputs. Shared signals selected by the system. Hence RTF estimation is valid until the acoustics change.
The DGSC sequentially converges to the centralized solution using local ANC updates.
Important Practical Considerations
Latency in the communication channel might require large buffering in each node.
Owner selection is a cumbersome task if several speakers are concurrently active, since it is not clear how to identify each speaker.
RTF can be very long for remote nodes.
Number of nodes and constraints can dynamically change (see [Markovich-Golan et al., 2012b] for possible cure).
Sampling rate offsets between nodes might degrade performance (see [Markovich-Golan et al., 2012a] for possible cure).
Distributed GSC Example
Desired and competing speaker with the same level.
2 point source Gaussian noises, 13dB lower than the speech signals.
Sensors noise.
(c) Centralized GSC (d) Distributed GSC
WASNs with Random Node Deployment [Markovich-Golan et al., 2011]; [Markovich-Golan et al., 2013b]; general reading [Lo, 1964]
Scenarios
High fault percentage.
Is there an optimal deployment? [Kodrasi et al., 2011]
Statistical Beamformer Problem formulation & definitions
Notations
Signals
Let $s_d(\ell,k)$ be the signal of a desired speaker located at $r_d$.
Microphone signals: $z(\ell,k) \triangleq h_d(\ell,k)\, s_d(\ell,k) + v(\ell,k)$.
Microphone signals PSD: $\Phi_{zz}(\ell,k) \triangleq \sigma_d^2\, h_d h_d^H + \Phi_{vv}$.
Room Constellation
Volume V , surface area A.
$M$ microphones randomly deployed with a uniform distribution at coordinates $r_m \triangleq \left[ r_{mx} \;\; r_{my} \;\; r_{mz} \right]^T$; $m = 1, \ldots, M$.
Criterion
SIR: $\kappa \triangleq \dfrac{\sigma_d^2\, |w^H h_d|^2}{w^H \Phi_{vv} w}$
WNG: $\xi \triangleq \dfrac{|w^H h_d|^2}{w^H w}$
Statistical Beamformer Statistical ATF modelling
Statistical ATF Modelling
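ATF relating a coherent source at $r_d$ and the mth microphone at $r_m$: $h \triangleq \tilde{h} + \bar{h}$
$\tilde{h}$: the direct arrival.
$\bar{h}$: the reverberant component.
Early reflections are ignored.
[Figure: impulse response plot; horizontal axis 0 to 0.3, vertical axis −0.01 to 0.07.]
ATF Statistical Model [Schroeder, 1987], [Kuttruff, 2000]
$\tilde{h} \sim \mathcal{CN}(0, \tilde{\alpha})$; $\bar{h} \sim \mathcal{CN}(0, \bar{\alpha})$
with $\tilde{\alpha}$ the variance of the direct arrival, $\bar{\alpha} \triangleq \dfrac{1-\epsilon}{\pi \epsilon A}$ and $\epsilon \triangleq \dfrac{0.161 V}{A T_{60}}$, the exponential decay rate of the RIR tail.
Reverberant tail is diffuse ⇒ tail coherence of microphones at $r_1$, $r_2$:
$$E\{\bar{h}_1 \bar{h}_2^*\} = \bar{\alpha}\, \frac{\sin\left( 2\pi \|r_1 - r_2\| / \lambda \right)}{2\pi \|r_1 - r_2\| / \lambda}$$
The model holds when:
The signal wavelength $\lambda$ is much smaller than the room dimensions.
The microphones and sources are at least half a wavelength away from the walls.
The signal frequency is above the Schroeder frequency, $f_{\mathrm{Schroeder}} \triangleq 2000 \sqrt{T_{60} / V}$.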
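The room-dependent quantities of the statistical model can be evaluated with a short sketch. It assumes the expressions read $\epsilon = 0.161 V / (A\, T_{60})$, reverberant variance $(1-\epsilon)/(\pi \epsilon A)$, and $f_{\mathrm{Schroeder}} = 2000 \sqrt{T_{60}/V}$, with SI units; the function names are hypothetical. The example room (4 × 4 × 3 m, $T_{60}$ = 0.4 s) is the one quoted in the model verification caption.

```python
import math


def absorption_coefficient(V, A, T60):
    """epsilon = 0.161 V / (A T60) (Sabine's relation solved for epsilon)."""
    return 0.161 * V / (A * T60)


def reverberant_variance(V, A, T60):
    """Variance of the reverberant ATF component: (1 - eps) / (pi * eps * A)."""
    eps = absorption_coefficient(V, A, T60)
    return (1.0 - eps) / (math.pi * eps * A)


def schroeder_frequency(V, T60):
    """Schroeder frequency in Hz: 2000 * sqrt(T60 / V)."""
    return 2000.0 * math.sqrt(T60 / V)


Lx, Ly, Lz, T60 = 4.0, 4.0, 3.0, 0.4
V = Lx * Ly * Lz                              # room volume in m^3
A = 2.0 * (Lx * Ly + Lx * Lz + Ly * Lz)       # wall surface area in m^2

eps = absorption_coefficient(V, A, T60)
alpha_rev = reverberant_variance(V, A, T60)
f_schroeder = schroeder_frequency(V, T60)
```

For this room the Schroeder frequency is roughly 183 Hz, so the model's frequency condition covers most of the speech band.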
Reliability Measures
SIR Reliability
The reliability of an SIR level of $\kappa_0$ is defined as the probability that the output SIR will exceed $\kappa_0$:
$R_\kappa(\kappa_0) \triangleq \Pr(\kappa \ge \kappa_0)$.
White Noise Gain (WNG) Reliability
The reliability of a WNG level of $\xi_0$ is defined as the probability that the WNG will exceed $\xi_0$:
$R_\xi(\xi_0) \triangleq \Pr(\xi \ge \xi_0)$.
P Directional Noise Sources
SINR and WNG follow a scaled $\chi^2$ distribution:
$$E\{\kappa\} = \frac{\sigma_d^2\, \alpha}{\sigma_u^2} (M - P); \qquad R_\xi(\xi_0) = 1 - F_\eta\!\left( \frac{2 \xi_0}{\alpha} \right).$$
$\alpha$: variance of the ATF (sum of the variances of the direct and reverberant components).
$\sigma_u^2$: sensor noise variance; $\sigma_d^2$: desired source variance.
$\eta \sim \chi^2(2(M - P))$: chi-square RV with $2(M - P)$ degrees of freedom.
$F_\eta(\eta_0) \triangleq \Pr(\eta \le \eta_0)$ is the respective CDF.
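A minimal sketch of the WNG reliability and the expected SIR, assuming the expressions read $R_\xi(\xi_0) = 1 - F_\eta(2\xi_0/\alpha)$ and $E\{\kappa\} = (\sigma_d^2 \alpha / \sigma_u^2)(M - P)$. Since the number of degrees of freedom is even, the chi-square CDF has a closed form and only the standard library is needed. Function names and numerical values are hypothetical.

```python
import math


def chi2_cdf_even(x, dof):
    """CDF of a chi-square RV with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    if x <= 0:
        return 0.0
    m = dof // 2
    half = x / 2.0
    return 1.0 - math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(m))


def wng_reliability(xi0, alpha, M, P):
    """R_xi(xi0) = 1 - F_eta(2 xi0 / alpha), eta ~ chi2(2 (M - P))."""
    return 1.0 - chi2_cdf_even(2.0 * xi0 / alpha, 2 * (M - P))


def expected_sir(sigma_d2, sigma_u2, alpha, M, P):
    """E{kappa} = sigma_d^2 * alpha / sigma_u^2 * (M - P)."""
    return sigma_d2 * alpha / sigma_u2 * (M - P)


# Hypothetical example: M = 5 microphones, P = 1, unit ATF variance.
r = wng_reliability(xi0=4.0, alpha=1.0, M=5, P=1)
mean_sir = expected_sir(sigma_d2=1.0, sigma_u2=0.1, alpha=1.0, M=5, P=1)
```

Adding microphones raises both the expected SIR (linearly in $M - P$) and the probability of exceeding a given WNG level.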
Statistical Beamformer Model verification
[Figure: reliability $R$ (vertical axis, 0 to 1), empirical versus theoretical curves. (a) $M = 5$, curves for $P = 1, 2, 3, 4$. (b) $P = 1$, curves for $M = 5, 10, 15, 20, 25$; horizontal axis 2 to 20.]
SINRout/SNRin for coherent noise sources. T60 = 0.4 s, room dimensions 4 × 4 × 3 m. Similar trends for a diffuse noise field.
Bibliography
Allen, J. and Berkley, D. (1979).
Image method for efficiently simulating small-room acoustics. J. Acoustical Society of America, 65(4):943–950.
Balan, R. and Rosca, J. (2002).
Microphone array speech enhancement by Bayesian estimation of spectral amplitude and phase. In IEEE Workshop on Sensor Array and Multichannel Signal Processing, pages 209–213, Rosslyn, Virginia, USA.
Cohen, I., Gannot, S., and Berdugo, B. (2003).
An integrated real-time beamforming and postfiltering system for nonstationary noise environments. EURASIP Journal on Applied Signal Processing, 2003:1064–1073.
Dal-Degan, N. and Prati, C. (1988).
Acoustic noise analysis and speech enhancement techniques for mobile radio application. Signal Processing, 15(4):43–56.
Er, M. and Cantoni, A. (1983).
Derivative constraints for broad-band element space antenna array processors. IEEE Transactions on Acoustics, Speech and Signal Processing, 31(6):1378–1393.
Frost III, O. L. (1972).
An algorithm for linearly constrained adaptive array processing. Proceedings of the IEEE, 60(8):926–935.
Gannot, S., Burshtein, D., and Weinstein, E. (2001).
Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Transactions on Signal Processing, 49(8):1614–1626.
Gannot, S. and Cohen, I. (2004).
Speech enhancement based on the general transfer function GSC and postfiltering. IEEE Transactions on Speech and Audio Processing, 12(6):561–571.
Griffiths, L. J. and Jim, C. W. (1982).
An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. on Antennas and Propagation, 30(1):27–34.
Habets, E. and Gannot, S. (2007).
Generating sensor signals in isotropic noise fields. The Journal of the Acoustical Society of America, 122:3464–3470.
Habets, E. A. P. (2006).
Room impulse response (RIR) generator. http://home.tiscali.nl/ehabets/rir generator.html.
Jot, J.-M., Cerveau, L., and Warusfel, O. (1997).
Analysis and synthesis of room reverberation based on a statistical time-frequency model. In Audio Engineering Society Convention 103. Audio Engineering Society.
Kodrasi, I., Rohdenburg, T., and Doclo, S. (2011).
Microphone position optimization for planar superdirective beamforming. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 109–112, Prague, Czech Republic.
Kuttruff, H. (2000).
Leukimmiatis, S., Dimitriadis, D., and Maragos, P. (2006).
An optimum microphone array post-filter for speech applications. In Proc. Interspeech-ICSLP, pages 2142–2145.
Lo, Y. (1964).
A mathematical theory of antenna arrays with randomly spaced elements. IEEE Transactions on Antennas and Propagation, 12(3):257–268.
Markovich-Golan, S., Gannot, S., and Cohen, I. (2009).
Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 17(6):1071–1086.
Markovich-Golan, S., Gannot, S., and Cohen, I. (2011).
Performance analysis of a randomly spaced wireless microphone array. In The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 121–124, Prague, Czech Republic.
Markovich-Golan, S., Gannot, S., and Cohen, I. (2012a).
Blind sampling rate offset estimation and compensation in wireless acoustic sensor networks with application to beamforming. In The International Workshop on Acoustic Signal Enhancement (IWAENC), Aachen, Germany. Final list for best student paper award.
Markovich-Golan, S., Gannot, S., and Cohen, I. (2012b).
Low-complexity addition or removal of sensors/constraints in LCMV beamformers. IEEE Transactions on Signal Processing, 60(3):1205–1214.
Markovich-Golan, S., Gannot, S., and Cohen, I. (2013a).
Distributed multiple constraints generalized sidelobe canceler for fully connected wireless acoustic sensor networks. IEEE Transactions on Audio, Speech, and Language Processing, 21(2):343–356.
Markovich-Golan, S., Gannot, S., and Cohen, I. (2013b).
Performance of the SDW-MWF with randomly located microphones in a reverberant enclosure. IEEE Trans. Audio, Speech and Language Processing. Accepted for publication.
Marro, C., Mahieux, Y., and Simmer, K. (1998).
Analysis of noise reduction and dereverberation techniques based on microphone arrays with postfiltering. IEEE Trans. on Speech and Audio Proc., 6(3):240–259.
McCowan, I. and Bourlard, H. (2003).
Microphone array post-filter based on noise field coherence. IEEE Trans. Speech and Audio Process., 11(6):709–716.
Meyer, J. and Simmer, K. U. (1997).
Multi-channel speech enhancement in a car environment using Wiener filtering and spectral subtraction. In IEEE Internat. Conf. Acoust. Speech Signal Process (ICASSP), pages 21–24, Munich, Germany.
Polack, J.-D. (1993).
Playing billiards in the concert hall: The mathematical foundations of geometrical room acoustics. Applied Acoustics, 38(2):235–244.
Schroeder, M. R. (1987).
Statistical parameters of the frequency response curves of large rooms. Journal of the Audio Engineering Society, 35(5):299–306.
Shynk, J. (1992).
Talmon, R., Cohen, I., and Gannot, S. (2009).
Convolutive transfer function generalized sidelobe canceler. IEEE Transactions on Audio, Speech, and Language Processing, 17(7):1420–1434.
Van Veen, B. D. and Buckley, K. M. (1988).
Beamforming: A versatile approach to spatial filtering. IEEE Acoustics, Speech and Signal Proc. magazine, pages 4–24.
Widrow, B., Jr., J. G., McCool, J., Kaunitz, J., Williams, C., Hearn, R., Zeider, J., Jr., E. D., and Goodlin, R. (1975).
Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE, 63(12):1692–1716.
Zelinski, R. (1988).
A microphone array with adaptive post-filtering for noise reduction in reverberant rooms. In IEEE Int. Conf. Acoust. Speech and Sig. Proc. (ICASSP), pages 2578–2581, New-York, USA.