Tight integration of spatial and spectral features for BSS with Deep Clustering embeddings
Lukas Drude, Reinhold Haeb-Umbach
Department of Communications Engineering, Paderborn University
Prof. Dr.-Ing. Reinhold Haeb-Umbach
2017-08-20
Table of contents
1 Problem Statement
2 Known Solutions
  • Spectral: Deep Clustering
  • Spatial: Time-Variant cGMM
3 Integrated Model (proposed)
4 Evaluation
5 Conclusion
L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 1 / 13
Problem Statement
Goal:
• Transcribe speech from an observed mixture
Application:
• Meeting transcription
• Home assistants / automation
• (Surveillance)
Focus:
• Robust source separation front-end exploiting spectral and spatial features
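[Figure: source signal $s_{tf1}$ reaching the $D$ microphones via acoustic transfer functions ($h_{1f1}$, ...), giving the observations $y_{1tf}, y_{2tf}, \ldots, y_{Dtf}$]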
Motivation
Spectral / deep learning:
+ Leverage training data
+ Model speech characteristics
– Overfit, possibly poor generalization
Spatial clustering (multi-channel):
+ Training data agnostic
– No concept of human speech
+ Exploit spatial selectivity
Joint formulation desired!
Spectral: Deep Clustering [1]
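[Figure: graphical model with class indicator $z_{tfk}$, mixture weight $\pi_{fk}$, embedding $\mathbf{e}_{tf}$, and parameters $\boldsymbol{\mu}_k$, $\kappa_k$]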
Exploit speaker-specific spectral characteristics for separation.
Concept:
• A BLSTM yields an embedding vector $\mathbf{e}_{tf}$ for each tf-bin.
• Training encourages the embeddings to form clusters in embedding space.
• Cluster the $\mathbf{e}_{tf}$ using k-means; here, a von Mises-Fisher mixture model (vMFMM) takes its place.
Relevance:
• Speaker independent
• Number of speakers not fixed at training time
[1] Hershey et al., Deep clustering: Discriminative embeddings [...], ICASSP 2016
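The clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the embeddings are normalized to unit length and uses hard assignments by cosine similarity (spherical k-means) as a simple stand-in for the von Mises-Fisher mixture model. The function name `cluster_embeddings` and the array layout `(T, F, E)` are hypothetical.

```python
import numpy as np

def cluster_embeddings(e, num_classes, iterations=20, seed=0):
    """Hard-cluster embeddings of shape (T, F, E) by cosine similarity.

    Returns a one-hot mask of shape (T, F, K) and the mean directions (K, E).
    """
    T, F, E = e.shape
    x = e.reshape(T * F, E)
    # Project every embedding onto the unit sphere (assumption of this sketch).
    x = x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-12)

    rng = np.random.default_rng(seed)
    # Initialise the mean directions with randomly chosen embeddings.
    mu = x[rng.choice(T * F, size=num_classes, replace=False)]

    for _ in range(iterations):
        # Assign each tf-bin to the closest mean direction.
        labels = np.argmax(x @ mu.T, axis=-1)
        # Re-estimate each mean direction as the normalised sum of its members.
        for k in range(num_classes):
            members = x[labels == k]
            if len(members) > 0:
                m = members.sum(axis=0)
                mu[k] = m / np.maximum(np.linalg.norm(m), 1e-12)

    labels = np.argmax(x @ mu.T, axis=-1)
    mask = np.eye(num_classes)[labels].reshape(T, F, num_classes)
    return mask, mu
```

A mixture model replaces the hard `argmax` assignment with soft class posteriors, which is what allows the spectral model to be combined with a second observation model later on.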
Spatial: Time-Variant cGMM [2]
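[Figure: graphical model with class indicator $z_{tfk}$, mixture weight $\pi_{fk}$, observation $\mathbf{y}_{tf}$, and parameters $\sigma_{tfk}$, $\mathbf{R}_{fk}$]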
Exploit spatial diversity to separate speakers.
Concept:
• Complex random vectors $\mathbf{y}_{tf}$
• Acoustic transfer function captured by spatial correlation matrix.
$$p(\mathbf{y}_{tf}) = \sum_k \pi_{fk} \cdot \mathcal{N}_{\mathbb{C}}(\mathbf{y}_{tf}; \mathbf{0}, \sigma_{tfk} \cdot \mathbf{R}_{fk})$$
Relevance:
• Competitive: Winning system of CHiME 3 and CHiME 4 challenges
[2] Ito et al., Relaxed disjointness based clustering [...], IWAENC 2014
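To make the mixture density concrete, the sketch below evaluates it for a single tf-bin. It assumes the standard zero-mean circular complex Gaussian density, $\mathcal{N}_{\mathbb{C}}(\mathbf{y}; \mathbf{0}, \boldsymbol{\Sigma}) = \exp(-\mathbf{y}^{\mathsf{H}} \boldsymbol{\Sigma}^{-1} \mathbf{y}) / (\pi^D \det \boldsymbol{\Sigma})$, with $\boldsymbol{\Sigma} = \sigma_{tfk} \mathbf{R}_{fk}$. The function names and argument layout are hypothetical and not taken from the slides.

```python
import numpy as np

def complex_gaussian_log_pdf(y, sigma, R):
    """log N_C(y; 0, sigma * R) for y of shape (D,), scalar sigma > 0,
    and a Hermitian positive definite matrix R of shape (D, D)."""
    D = y.shape[-1]
    # log det(sigma * R) = D * log(sigma) + log det(R)
    _, log_det_R = np.linalg.slogdet(R)
    # Quadratic form y^H R^{-1} y, which is real for Hermitian R.
    quadratic = np.real(np.conj(y) @ np.linalg.solve(R, y))
    return -D * np.log(np.pi) - D * np.log(sigma) - log_det_R - quadratic / sigma

def mixture_log_likelihood(y, pi, sigma, R):
    """log p(y_tf) = log sum_k pi_fk N_C(y_tf; 0, sigma_tfk R_fk) for one tf-bin.

    pi and sigma have shape (K,), R has shape (K, D, D).
    """
    log_terms = np.array([
        np.log(pi[k]) + complex_gaussian_log_pdf(y, sigma[k], R[k])
        for k in range(len(pi))
    ])
    # Log-sum-exp over the classes for numerical stability.
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())
```

Working in the log domain avoids underflow, since the density values become very small as the number of sensors grows.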
Integrated Model
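[Figure: graphical model of the integrated mixture with class indicator $z_{tfk}$ and mixture weight $\pi_{fk}$; spectral observation model ($\mathbf{e}_{tf}$ with $\boldsymbol{\mu}_k$, $\kappa_k$) and spatial observation model ($\mathbf{y}_{tf}$ with $\sigma_{tfk}$, $\mathbf{R}_{fk}$). Block diagram with the elements $\mathbf{y}_{tf}$, Deep Clustering, Integrated EM Algorithm, Beamforming, ASR, words]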
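• Statistical model:
$$p(\mathbf{e}_{tf}, \mathbf{y}_{tf}) = \sum_k \pi_{fk} \cdot \mathrm{vMF}(\mathbf{e}_{tf}; \boldsymbol{\mu}_k, \kappa_k) \cdot \mathcal{N}_{\mathbb{C}}(\mathbf{y}_{tf}; \mathbf{0}, \sigma_{tfk} \cdot \mathbf{R}_{fk})$$
• Observations independent, given class label
• Estimate of all parameters better, when estimated jointly
• Risk / opportunity: Weighting between both models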
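As a sketch of where the product form comes from: marginalizing over the hidden class indicator $z_{tfk}$ and applying the conditional independence of the two observations gives the joint density. The intermediate steps are the standard mixture-model marginalization and are not shown on the slides.

```latex
\begin{align}
p(\mathbf{e}_{tf}, \mathbf{y}_{tf})
  &= \sum_k p(z_{tfk} = 1) \, p(\mathbf{e}_{tf}, \mathbf{y}_{tf} \mid z_{tfk} = 1) \\
  &= \sum_k p(z_{tfk} = 1) \, p(\mathbf{e}_{tf} \mid z_{tfk} = 1) \, p(\mathbf{y}_{tf} \mid z_{tfk} = 1) \\
  &= \sum_k \pi_{fk} \cdot \mathrm{vMF}(\mathbf{e}_{tf}; \boldsymbol{\mu}_k, \kappa_k)
     \cdot \mathcal{N}_{\mathbb{C}}(\mathbf{y}_{tf}; \mathbf{0}, \sigma_{tfk} \cdot \mathbf{R}_{fk})
\end{align}
```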
Update equations: E-step
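[Figure: graphical model of the integrated mixture with $z_{tfk}$, $\pi_{fk}$, $\mathbf{e}_{tf}$, $\mathbf{y}_{tf}$, $\boldsymbol{\mu}_k$, $\kappa_k$, $\sigma_{tfk}$, $\mathbf{R}_{fk}$]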
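• Update class affiliation posterior:
$$\gamma'_{tfk} = \underbrace{\pi_{fk}}_{\text{Prior } p(z_{tf})} \cdot \underbrace{\mathrm{vMF}^{\alpha}(\mathbf{e}_{tf}; \boldsymbol{\mu}_k, \kappa_k)}_{\text{Spectral model } p(\mathbf{e}_{tf} \mid z_{tf})} \cdot \underbrace{\mathcal{N}_{\mathbb{C}}^{(1-\alpha)}(\mathbf{y}_{tf}; \mathbf{0}, \sigma_{tfk} \cdot \mathbf{R}_{fk})}_{\text{Spatial model } p(\mathbf{y}_{tf} \mid z_{tf})}$$
$$\gamma_{tfk} = \gamma'_{tfk} \Big/ \sum_k \gamma'_{tfk}$$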
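The E-step above can be sketched in the log domain, where the exponents $\alpha$ and $1-\alpha$ become simple weights on the log-likelihoods. This sketch assumes the per-class log-terms have already been computed elsewhere; the function name `e_step` and the array shapes are hypothetical.

```python
import numpy as np

def e_step(log_prior, log_spectral, log_spatial, alpha):
    """Class affiliation posteriors gamma_tfk of shape (T, F, K).

    log_prior:    log pi_fk, shape (F, K), broadcast over time frames
    log_spectral: log vMF(e_tf; mu_k, kappa_k), shape (T, F, K)
    log_spatial:  log N_C(y_tf; 0, sigma_tfk R_fk), shape (T, F, K)
    alpha:        weight between the spectral and the spatial model
    """
    # log gamma'_tfk: exponents alpha and (1 - alpha) turn into weights.
    log_gamma = log_prior + alpha * log_spectral + (1.0 - alpha) * log_spatial
    # Subtract the per-bin maximum before exponentiating for numerical stability.
    log_gamma = log_gamma - log_gamma.max(axis=-1, keepdims=True)
    gamma = np.exp(log_gamma)
    # Normalise over the classes so the posteriors sum to one in each tf-bin.
    return gamma / gamma.sum(axis=-1, keepdims=True)
```

With `alpha=1.0` only the prior and the spectral model contribute to the posterior, and with `alpha=0.0` only the prior and the spatial model do, so `alpha` directly controls the weighting between both models.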
Evaluation: Setup
• Train/CV deep clustering on single-channel WSJ utterances [1, 3]
• GMM-HMM recognizer trained on clean speech, 3-gram LM [3]
• 3000 test mixtures, 2 speakers per mixture:
  • Random source and array positions (6 sensors)
  • Image method to generate room impulse responses
[3] Isik et al., Single-Channel Multi-Speaker Separation using Deep Clustering, Interspeech 2016
Example: Posterior Masks (200ms < T60 < 300ms)
[Figure: posterior masks (values from 0 to 1) over time frame (0 to 300) and frequency bin (0 to 256) in three panels: DC + vMFMM, vMF-TV-cGMM, TV-cGMM]
Evaluation: Signal to Distortion Ratio Gain
[Figure: bar charts of SDR gain / dB for DC + Masking, DC + BF, vMF-TV-cGMM, and spatial only, in two panels. Low reverb (T60 = 50 ... 100 ms): annotation 0.9 dB. Medium reverb (T60 = 200 ... 300 ms): annotation 2.3 dB.]
Word Error Rates (T60 = 50 . . . 100ms)
| Features | Model | Extraction | WER / % |
|---|---|---|---|
| spectral | DC [3] | Masking | 65.8 |
| spectral | DC | Beamforming | 42.4 |
| integrated | vMF-TV-cGMM | Beamforming | 29.7 |
| spatial | TV-cGMM [2] | Beamforming | 33.6 |

[2] Ito et al., Relaxed disjointness based clustering [...], IWAENC 2014
[3] Isik et al., Single-Channel Multi-Speaker Separation using Deep Clustering, Interspeech 2016
Audio Example
(Switch to external audio player.)
• Observed mixture: y.wav
• Estimates: z_1.wav, z_2.wav
• Clean speech: x_1.wav, x_2.wav
Conclusion and Future Work
• EM algorithm for joint model
Spectral / deep learning:
+ Leverages training data
+ Models speech characteristics
Spatial clustering:
+ Generalizes to unseen conditions
+ Exploits spatial selectivity
• Outperforms both very different baselines
• Bridges model-based research and deep learning research
Future Work:
• Joint training
• Real recordings