tight integration of spatial and spectral features for …...fk y tf ˙ tfk r fk exploit spatial...

22
Communications Engineering NT Tight integration of spatial and spectral features for BSS with Deep Clustering embeddings Lukas Drude, Reinhold Haeb-Umbach Department of Communications Engineering - Paderborn University Prof. Dr.-Ing. Reinhold Haeb-Umbach --

Upload: others

Post on 29-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

Communications Engineering

NT

Tight integration of spatial and spectral featuresfor BSS with Deep Clustering embeddings

Lukas Drude, Reinhold Haeb-Umbach

Department of Communications Engineering - Paderborn UniversityProf. Dr.-Ing. Reinhold Haeb-Umbach2017-08-20

Page 2: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Table of contents

1 Problem Statement2 Known SolutionsI Spectral: Deep ClusteringI Spatial: Time-Variant cGMM

3 Integrated Model (proposed)4 Evaluation5 Conclusion

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 1 / 13

Page 3: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Problem StatementGoal:

• Transcribe speechfrom an observed mixtureApplication:

• Meeting transcription• Home assistants/ automation• (Surveillance)

Focus:• Robust source separation front-endexploiting spectral and spatial features

stf1 y1tf

yDtf

y2tf

h1f1

...

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 2 / 13

Page 4: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Motivation

Spectral/ deep learning:+ Leverage training data+ Model speech characteristics– Overfit, possibly poorgeneralization

Spatial clustering (multi-channel):+ Training data agnostic– No concept of human speech+ Exploit spatial selectivity

Joint formulation desired!

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 3 / 13

Page 5: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Motivation

Spectral/ deep learning:+ Leverage training data+ Model speech characteristics– Overfit, possibly poorgeneralization

Spatial clustering (multi-channel):+ Training data agnostic– No concept of human speech+ Exploit spatial selectivity

Joint formulation desired!

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 3 / 13

Page 6: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Spectral: Deep Clustering [1]

ztfk

πfk

etfκk

µk

Exploit speaker specific spectral characteristics for separation.Concept:

• BLSTM yields an embedding vector etf for each tf-bin.• Encourage tendency to form clusters in embedding space.• Cluster using k-means on etf .

von Mises Fisher MMRelevance:

• Speaker independent• Number of speakers not fixed at training time

[1] Hershey et al., Deep clustering: Discriminative embeddings [...], ICASSP 2016

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 4 / 13

Page 7: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Spectral: Deep Clustering [1]

ztfk

πfk

etfκk

µk

Exploit speaker specific spectral characteristics for separation.Concept:

• BLSTM yields an embedding vector etf for each tf-bin.• Encourage tendency to form clusters in embedding space.• Cluster using k-means on etf .

von Mises Fisher MM

Relevance:• Speaker independent• Number of speakers not fixed at training time[1] Hershey et al., Deep clustering: Discriminative embeddings [...], ICASSP 2016

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 4 / 13

Page 8: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Spectral: Deep Clustering [1]

ztfk

πfk

etfκk

µk

Exploit speaker specific spectral characteristics for separation.Concept:

• BLSTM yields an embedding vector etf for each tf-bin.• Encourage tendency to form clusters in embedding space.• Cluster using k-means on etf .von Mises Fisher MM

Relevance:• Speaker independent• Number of speakers not fixed at training time[1] Hershey et al., Deep clustering: Discriminative embeddings [...], ICASSP 2016

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 4 / 13

Page 9: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Spatial: Time-Variant cGMM [2]

ztfk

πfk

ytf σtfk

Rfk

Exploit spatial diversity to separate speakers.Concept:

• Complex random vectors ytf• Acoustic transfer function captured by spatial correlation matrix.p(ytf) =

∑k

πfk · NC(ytf ;0, σtfk · Rfk)

Relevance:• Competitive: Winning system of CHiME 3 and CHiME 4 challenges

[2] Ito et al., Relaxed disjointness based clustering [...], IWAENC 2014

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 5 / 13

Page 10: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Spatial: Time-Variant cGMM [2]

ztfk

πfk

ytf σtfk

Rfk

Exploit spatial diversity to separate speakers.Concept:

• Complex random vectors ytf• Acoustic transfer function captured by spatial correlation matrix.p(ytf) =

∑k

πfk · NC(ytf ;0, σtfk · Rfk)

Relevance:• Competitive: Winning system of CHiME 3 and CHiME 4 challenges

[2] Ito et al., Relaxed disjointness based clustering [...], IWAENC 2014

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 5 / 13

Page 11: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Integrated Model

ztfk

πfk

etfetf ytfµkµk

κkκk

σtfk

Rfk

Beamforming ASRDeep Clustering

ytf

words

IntegratedEM Algorithm

Spectral observation model Spatial observation model

• Statistical model:p(etf , ytf) =

∑k

πfk · vMF(etf ;µk, κk) · NC(ytf ;0, σtfk · Rfk)

• Observations independent, given class label• Estimate of all parameters better, when estimated jointly• Risk/ opportunity: Weighting between both models

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 6 / 13

Page 12: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Integrated Model

ztfk

πfk

etfetf ytfµkµk

κkκk

σtfk

Rfk

Beamforming ASRDeep Clustering

ytf

words

IntegratedEM Algorithm

Spectral observation model Spatial observation model

• Statistical model:p(etf , ytf) =

∑k

πfk · vMF(etf ;µk, κk) · NC(ytf ;0, σtfk · Rfk)

• Observations independent, given class label• Estimate of all parameters better, when estimated jointly• Risk/ opportunity: Weighting between both models

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 6 / 13

Page 13: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Integrated Model

ztfk

πfk

etfetf ytfµkµk

κkκk

σtfk

Rfk

Beamforming ASRDeep Clustering

ytf

words

IntegratedEM Algorithm

Spectral observation model Spatial observation model

• Statistical model:p(etf , ytf) =

∑k

πfk · vMF(etf ;µk, κk) · NC(ytf ;0, σtfk · Rfk)

• Observations independent, given class label• Estimate of all parameters better, when estimated jointly• Risk/ opportunity: Weighting between both models

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 6 / 13

Page 14: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Integrated Model

ztfk

πfk

etfetf ytfµkµk

κkκk

σtfk

Rfk

Beamforming ASRDeep Clustering

ytf

words

IntegratedEM Algorithm

Spectral observation model Spatial observation model

• Statistical model:p(etf , ytf) =

∑k

πfk · vMF(etf ;µk, κk) · NC(ytf ;0, σtfk · Rfk)

• Observations independent, given class label• Estimate of all parameters better, when estimated jointly• Risk/ opportunity: Weighting between both models

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 6 / 13

Page 15: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Update equations: E-step

ztfk

πfk

etf ytfµk

κk

σtfk

Rfk

• Update class affiliation posterior:γ′tfk = πfk︸︷︷︸Prior

p(ztf)

· vMFα(etf ;µk, κk)︸ ︷︷ ︸Spectral Modelp(etf |ztf)

· N (1−α)C (ytf ;0, σtfk · Rfk)︸ ︷︷ ︸Spatial Model

p(ytf |ztf)

γtfk = γ′tfk/∑

k

γ′tfk

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 7 / 13

Page 16: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Evaluation: Setup

• Train/CV deep clustering on single channel WSJ utterances [1, 3]• GMM-HMM recognizer trained on clean, 3-Gram LM [3]• 3000 test mixtures, 2 speakers per mixture:

I Random source and array positions (6 sensors)I Image method to generate room impulse responses

[3] Isik et al., Single-Channel Multi-Speaker Separation using Deep Clustering,Interspeech 2016

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 8 / 13

Page 17: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Example: Posterior Masks (200ms < T60 < 300ms)

0 100 200 3000

128

256

Time frame

Frequen

cybin

DC + vMFMM

0 100 200 300Time frame

vMF-TV-cGMM

0 100 200 300Time frame

TV-cGMM

0

1

Mask

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 9 / 13

Page 18: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Evaluation: Signal to Distortion Ratio GainLow reverb (T60 = 50 . . . 100ms)

DC+M

asking

DC+BF

vMF-TV

-cGM

M

spatial

only

05101520

SDRgain

/dB

0.9 dB

Medium reverb (T60 = 200 . . . 300ms)

DC+M

asking

DC+BF

vMF-TV

-cGM

M

spatial

only

0

5

10

15

SDRgain

/dB

2.3 dB

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 10 / 13

Page 19: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Evaluation: Signal to Distortion Ratio GainLow reverb (T60 = 50 . . . 100ms)

DC+M

asking

DC+BF

vMF-TV

-cGM

M

spatial

only

05101520

SDRgain

/dB

0.9 dBMedium reverb (T60 = 200 . . . 300ms)

DC+M

asking

DC+BF

vMF-TV

-cGM

M

spatial

only

0

5

10

15

SDRgain

/dB

2.3 dB

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 10 / 13

Page 20: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Word Error Rates (T60 = 50 . . . 100ms)

Features Model Extraction WER /%spectral DC [3] Masking 65.8spectral DC Beamforming 42.4

integrated vMF-TV-cGMM Beamforming 29.7spatial TV-cGMM [2] Beamforming 33.6

[2] Ito et al., Relaxed disjointness based clustering [...], IWAENC 2014[3] Isik et al., Single-Channel Multi-Speaker Separation using Deep Clustering,Interspeech 2016

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 11 / 13

Page 21: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Audio Example

(Switch to external audio player.)• Observed mixture: y.wav• Estimates: z_1.wav, z_2.wav• Clean speech: x_1.wav, x_2.wav

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 12 / 13

Page 22: Tight integration of spatial and spectral features for …...fk y tf ˙ tfk R fk Exploit spatial diversity to separate speakers. Concept: • Complex random vectors y tf • Acoustic

NT

Conclusion and Future Work• EM algorithm for joint modelSpectral/ deep learning:+ Leverages training data+ Models speechcharacteristics

Spatial clustering:+ Generalizes to unseenconditions+ Exploits spatial selectivity• Outperforms both very different baselines• Bridges model-based research and deep learning research

Future Work:• Joint training• Real recordings

L. Drude, R. Haeb-Umbach Integration of spatial and spectral features 13 / 13