
Robust Direction Estimation with Convolutional Neural Networks-based Steered Response Power

Pasi Pertilä, Emre Cakir

Laboratory of Signal Processing, Audio Research Group
Tampere University of Technology, Tampere, Finland

The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017


Outline

1 Introduction

2 Time-Frequency masking for speech enhancement

3 Sound direction of arrival (DOA) estimation

4 Results

5 Conclusions and Discussion

Robust DOA estimation using CNN-based SRP – Pertilä, Cakir, ICASSP'17



Introduction

Uses of sound DOA estimation:

Spatial filtering: beamforming, speech enhancement

Surveillance, automatic camera management

Speaker DOA estimates can be degraded by reverberation and everyday noise

time-varying, e.g. household, cafeteria noise



Time-Frequency (TF) masking removes undesired TF components from the observation

successfully applied for speech enhancement using deep learning [1][2]

applied in Time Difference of Arrival (TDOA) estimation

equations based on insight to deal with static noise [3]

regression to deal with reverberation [4]

This work proposes deep learning-based (CNN) TF masking for DOA estimation

to deal with everyday noise and reverberation

[1] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," ICASSP, 2013.
[2] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 22, no. 12, pp. 1849–1858, 2014.
[3] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots," IROS, 2015.
[4] K. Wilson and T. Darrell, "Learning a precedence effect-like weighting function for the generalized cross-correlation framework," TASLP, 2006.



Why use convolutional neural networks (CNNs) for speech?

Speech signals contain significant local information in the spectral domain.

changing speaker and environmental conditions → shifts in spectral position

CNNs have a translational shift invariance property → suitable option for our task.

Convolutional neural networks (CNNs)

are discriminative classifiers

compute activations through shared weights over local receptive fields




Array signal model

The ith microphone signal is modeled as the mixture of reverberated signals in the presence of noise:

x_i(t, f) = Σ_n h_{m_i, r_n}(f) · s_n(t, f) + e_i(t, f),   (1)

where the left-hand side is the observation, each summand is the reverberated nth source signal, and e_i(t, f) is noise.

f = 0, ..., K − 1 is the discrete frequency index,

t is the processing frame index,

h_{m_i, r_n}(f) is the room impulse response (RIR) between source position r_n ∈ R³ and microphone position m_i ∈ R³ in Cartesian coordinates,

i = 0, ..., M − 1, where M is the number of microphones.
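The mixture model of Eq. (1) can be sketched in the STFT domain with NumPy. The transfer functions and source spectra below are random placeholders standing in for measured RIRs and speech STFTs; this is a sketch of the model, not the authors' data pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T, K = 16, 2, 32, 172  # microphones, sources, frames, frequency bins

# Placeholders: in the paper h comes from measured RWCP RIRs and
# s from STFTs of TIMIT speech; here both are random complex values.
h = rng.standard_normal((M, N, K)) + 1j * rng.standard_normal((M, N, K))
s = rng.standard_normal((N, T, K)) + 1j * rng.standard_normal((N, T, K))
e = 0.1 * (rng.standard_normal((M, T, K)) + 1j * rng.standard_normal((M, T, K)))

# Eq. (1): x_i(t,f) = sum_n h_{m_i,r_n}(f) * s_n(t,f) + e_i(t,f)
x = np.einsum('mnk,ntk->mtk', h, s) + e
```

The `einsum` contraction sums over the source index n for every microphone, frame, and frequency bin at once, yielding an (M, T, K) array of observations.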


TF masking

Enhances the signal by applying a mask η_i(t, f) to the observed signal:

ŝ_i(t, f) = x_i(t, f) · η_i(t, f)   (2)

Wiener filter:

η(t, f) = |s(t, f)|² / ( |s(t, f)|² + |e(t, f)|² ),   (3)

used to enhance the direct-path signal in the presence of interference:

s(t, f) contains the direct-path component of the source,

e(t, f) is a mixture of source reverberation, interference with reverberation, and noise.
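When the clean and noise components are available separately, as in simulated training data, the oracle Wiener mask of Eq. (3) and the masking step of Eq. (2) are a few lines of NumPy. The magnitude spectrograms below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 32, 172  # frames x frequency bins

s_mag = rng.rayleigh(size=(T, K))             # |s(t,f)|: direct-path speech
e_mag = rng.rayleigh(scale=0.5, size=(T, K))  # |e(t,f)|: reverb + interference + noise
x_mag = s_mag + e_mag                         # stand-in for the observed |x(t,f)|

# Eq. (3): Wiener-filter mask, bounded to [0, 1] by construction
eta = s_mag**2 / (s_mag**2 + e_mag**2)

# Eq. (2): enhance by masking the observation
s_hat = x_mag * eta
```

Because both terms in the denominator are non-negative, η never leaves [0, 1], which is what makes a sigmoid output layer a natural fit for predicting it.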



CNN training data generation process

TIMIT speech convolved with RWCP RIRs → array observations

16-channel circular array, radius 15 cm.

Seven rooms, RT60 ∈ [0.3, 1.3] s.

Mixed with interference and ambient noise.

Three classes of interference [1]:

Household, Interior background, Printer.

Speech-to-interference ratios (SIRs): {+12, +6, 0, −6} dB

Ambient noise SNR ∼ U[6, 12] dB

22440 training, 3000 validation, and 5160 test mixtures

[1] BBC sound effects library
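Mixing speech with an interference clip at a target SIR, as in the {+12, +6, 0, −6} dB conditions above, reduces to one gain computation. The signals here are white-noise stand-ins for the reverberated TIMIT utterances and BBC interference clips:

```python
import numpy as np

def mix_at_sir(speech, interference, sir_db):
    """Scale the interference so that 10*log10(P_speech / P_interf) = sir_db."""
    p_s = np.mean(speech ** 2)
    p_i = np.mean(interference ** 2)
    gain = np.sqrt(p_s / (p_i * 10 ** (sir_db / 10)))
    return speech + gain * interference

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # 1 s at 16 kHz, stand-in signal
interference = rng.standard_normal(16000)  # stand-in interference clip

mixtures = {sir: mix_at_sir(speech, interference, sir) for sir in (12, 6, 0, -6)}
```

The same scaling applied to an ambient-noise signal, with the target level drawn from U[6, 12] dB, reproduces the SNR condition above.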



CNN training data generation process

The CNN maps features to targets: log(|x(t, f)|) ↦ η(t, f).

Input/output block size is 32 frames × 172 frequency bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones → single mask: computationally efficient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions, 11-by-5 (time-by-frequency) convolution

Each followed by batch normalization and dropout (0.25)

First three followed by max-pooling over 4 frequency bins. Hyper-parameters obtained with grid search

Output layer: feed-forward (sigmoid).
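The layer stack described above can be sketched as follows. The framework (PyTorch), the 'same' padding, and the exact ordering of batch normalization, activation, and dropout are assumptions; the slide fixes only the layer counts, kernel sizes, pooling factors, and the sigmoid output:

```python
import torch
import torch.nn as nn

T, K = 32, 172  # input block: frames x frequency bins

layers = []
in_ch, freq = 1, K
for i in range(4):
    # 11x5 (time x frequency) convolution with 96 feature maps,
    # followed by batch normalization, ReLU, and dropout (0.25)
    layers += [nn.Conv2d(in_ch, 96, kernel_size=(11, 5), padding='same'),
               nn.BatchNorm2d(96), nn.ReLU(), nn.Dropout(0.25)]
    in_ch = 96
    if i < 3:  # first three conv layers: max-pooling over 4 frequency bins
        layers.append(nn.MaxPool2d(kernel_size=(1, 4)))
        freq = (freq - 4) // 4 + 1
# Output layer: feed-forward with sigmoid, producing the T x K mask
layers += [nn.Flatten(), nn.Linear(96 * T * freq, T * K), nn.Sigmoid()]

model = nn.Sequential(*layers).eval()
with torch.no_grad():
    mask = model(torch.randn(1, 1, T, K)).reshape(1, T, K)
```

Pooling only along frequency shrinks 172 bins to 2 (172 → 43 → 10 → 2) while the 32-frame time axis is preserved, so the final dense layer maps 96 × 32 × 2 activations to the 32 × 172 mask.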

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28



SRP-PHAT

Steered Response Power with PHAse Transform (SRP-PHAT):

L(\mathbf{k}, t) = \sum_{i,j} \sum_{f=0}^{K-1} \frac{x_i(t,f) \, x_j(t,f)^*}{|x_i(t,f)| \, |x_j(t,f)^*|} \exp(j \, \tau_{i,j} \, \omega_f),   (4)

where

\mathbf{k} ∈ R³ is the sound direction (unit vector),

\tau_{i,j} = \mathbf{k}^T(\mathbf{m}_i − \mathbf{m}_j)/c is the time difference between microphones i and j,

(\cdot)^* denotes the complex conjugate, \omega_f = 2\pi f / K, and

c is the speed of sound.
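The PHAT normalisation in eq. (4) is easiest to see for a single microphone pair: whitening the cross-spectrum leaves only phase, so the inverse FFT peaks at the pair's time difference. A minimal numpy sketch (`gcc_phat_delay` is an illustrative helper, not code from the paper):

```python
import numpy as np

def gcc_phat_delay(xi, xj, max_lag, eps=1e-12):
    """GCC-PHAT for one microphone pair: whiten the cross-spectrum (PHAT)
    so only phase remains, inverse-FFT, and return the integer delay d
    (in samples) such that xj(t) ~ xi(t - d)."""
    n = 2 * len(xi)                                  # zero-pad against circular wrap
    C = np.fft.rfft(xi, n) * np.conj(np.fft.rfft(xj, n))
    r = np.fft.irfft(C / (np.abs(C) + eps), n)
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))  # lags -max..+max
    return -(int(np.argmax(r)) - max_lag)

# toy check: white noise delayed by 7 samples
rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
y = np.roll(x, 7)   # xj is xi delayed by 7 samples
```

Summing such whitened cross-spectra over all pairs, steered to candidate directions, gives the SRP-PHAT response above.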


SRP-PHAT with TF masking

Based on weighted GCC-PHAT [1].

Weighted SRP-PHAT:

L(\mathbf{k}, t) = \sum_{i,j} \sum_{f=0}^{K-1} \eta(t,f)^2 \, \frac{x_i(t,f) \, x_j(t,f)^*}{|x_i(t,f)| \, |x_j(t,f)^*|} \exp(j \, \tau_{i,j} \, \omega_f),   (4)

where

\mathbf{k} ∈ R³ is the sound direction (unit vector),

\tau_{i,j} = \mathbf{k}^T(\mathbf{m}_i − \mathbf{m}_j)/c is the time difference between microphones i and j,

(\cdot)^* denotes the complex conjugate, \omega_f = 2\pi f / K, and

c is the speed of sound.

Due to feature and target averaging, \eta_i(t,f) = \eta_j(t,f) ≡ \eta(t,f).

[1] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots", IROS, 2015.
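The weighted response, with the shared mask η entering squared as on this slide, can be sketched directly in numpy over a grid of candidate unit directions; function names are illustrative, and the per-frame maximum-response pick matches the point estimate used later:

```python
import numpy as np

def weighted_srp_phat(X, freqs, mics, dirs, eta, c=343.0, eps=1e-12):
    """Mask-weighted SRP-PHAT, eq. (4): X is the (M, T, K) complex STFT of
    M microphones, freqs the K bin frequencies in Hz, mics the (M, 3)
    microphone positions, dirs the (D, 3) candidate unit direction vectors,
    eta the shared (T, K) TF mask (eta = 1 recovers plain SRP-PHAT).
    Returns the response L with shape (D, T)."""
    M = X.shape[0]
    L = 0.0
    for i in range(M):
        for j in range(i + 1, M):
            # PHAT-normalised, mask-weighted cross-spectrum, shape (T, K)
            C = eta**2 * X[i] * np.conj(X[j])
            C /= np.abs(X[i]) * np.abs(X[j]) + eps
            # TDOA for each candidate direction: tau = k . (m_i - m_j) / c
            tau = dirs @ (mics[i] - mics[j]) / c                    # (D,)
            steer = np.exp(1j * 2 * np.pi * np.outer(tau, freqs))   # (D, K)
            L = L + np.real(np.einsum('dk,tk->dt', steer, C))
    return L

def doa_point_estimate(L, dirs):
    """For each frame, pick the maximum-response direction."""
    return dirs[np.argmax(L, axis=0)]
```

Setting `eta = 1` everywhere reduces this to the plain SRP-PHAT of the previous slide, so both variants can share one implementation.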

SRP-PHAT with TF masking

The point estimate for the DOA:

\hat{\mathbf{k}}(t) = \arg\max_{\mathbf{k}} L(\mathbf{k}, t).   (5)

For each frame t separately, pick the maximum-response direction of the weighted SRP-PHAT L(\mathbf{k}, t).

DOA performance analysis

Speech: recorded Japanese sentences (RWCP)

Mixed with reverberated interference signals

Same SIR levels as in training; different interference instances, different RIRs

1 Speech played back from a moving loudspeaker: 4 different rooms (with different numbers of RIRs), total of 1440 mixtures

2 Speech played back from a static loudspeaker: 5 different rooms (with different numbers of RIRs), total of 1650 mixtures


DOA performance analysis

Used performance measures:

1 Portion of the SRP-PHAT likelihood in the ground-truth direction (±10°) w.r.t. all directions.

2 Percentage of DOA estimates in the ground-truth direction (±10°).

Only frames with detected speech are included.
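Measure 2, the percentage of frame-wise estimates within ±10° of the ground truth, can be sketched with circular angle differences; the helper name is hypothetical, and the speech-activity filtering mentioned above is assumed to be done by the caller:

```python
import numpy as np

def doa_accuracy(est_az, ref_az, tol=10.0):
    """Percentage of frame-wise azimuth estimates within +/- tol degrees
    of the ground truth, with wraparound (359 deg vs. 1 deg is a 2 deg
    error, not 358).  Only speech-active frames should be passed in."""
    err = np.abs((np.asarray(est_az, float) - np.asarray(ref_az, float)
                  + 180.0) % 360.0 - 180.0)
    return 100.0 * float(np.mean(err <= tol))
```

The modulo trick folds any signed difference into (−180°, 180°] before taking the absolute error, so estimates near the 0°/360° seam are scored correctly.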


DOA performance analysis

Compared methods:

1 SRP-PHAT (traditional approach)

2 SRP-PHAT weighted with the CNN-predicted mask

3 SRP-PHAT weighted with an interference-canceling mask (ICM)

ICM: a Wiener filter with access to the added interference and the original reverberated speech; it will not remove the reverberation of the target speaker.


DOA performance analysis

[Figure: azimuth angle (deg) vs. time (frame) in four panels; legend: DOA estimate, active frame; DOA estimate, inactive frame; ground truth ±10°]

a) SRP-PHAT without interference,
b) SRP-PHAT with added printer interference at +6 dB SIR,
c) SRP-PHAT with ICM weight for the mixture,
d) SRP-PHAT with CNN-predicted weight for the mixture.

Results

SRP-PHAT results[1], static source

[Bar chart: Relative SRP-PHAT mass within ±10° of ground truth (Static); methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT; interference types: House, Inter. bg., Print, Avg.; SIR +12, +6, 0, −6 dB]

CNN weighting improves SRP-PHAT

At SIR +12 and +6 dB, CNN outperforms ICM, most likely caused by the reduction of reverberation.

[1] Normalized with the SRP-PHAT result without interference.


Results

SRP-PHAT results, moving source

[Bar chart: Relative SRP-PHAT mass within ±10° of ground truth (Moving); methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT; interference types: House, Inter. bg., Print, Avg.; SIR +12, +6, 0, −6 dB]

Results

DOA point estimate, static source

[Bar chart: Correct DOA estimates within ±10° of ground truth (Static); methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT; interference types: House, Inter. bg., Print, Avg.; SIR +12, +6, 0, −6 dB]

CNN weighting improves SRP-PHAT

Results

DOA point estimate, moving source

[Bar chart: Correct DOA estimates within ±10° of ground truth (Moving); methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT; interference types: House, Inter. bg., Print, Avg.; SIR +12, +6, 0, −6 dB]


Conclusions and Discussion

A CNN-based time-frequency-mask-weighted SRP-PHAT function was proposed.

Results showed a reduction in the detrimental effects caused by reverberation and time-varying interference.

Relative performance was unchanged by source motion.

The CNN generalized to new speakers and to new instances of interference from unseen angles.

The CNN mask is obtained from log-magnitude features → should work on other arrays (without re-training).

SRP-PHAT weighting does not fully remove the effects of interference → phase errors are not affected by (real-valued) masking.


Thank you for listening

Time for questions?
