
Robust Direction Estimation with Convolutional Neural Networks-based Steered Response Power

Pasi Pertilä, Emre Cakir

Laboratory of Signal Processing, Audio Research Group
Tampere University of Technology, Tampere, Finland

The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017


Outline

1 Introduction

2 Time-Frequency masking for speech enhancement

3 Sound direction of arrival (DOA) estimation

4 Results

5 Conclusions and Discussion

Robust DOA estimation using CNN-based SRP – Pertilä, Cakir, ICASSP'17



Introduction

Uses of sound DOA estimation:

Spatial filtering: beamforming, speech enhancement

Surveillance, automatic camera management

Speaker DOA estimates can be degraded by reverberation and everyday noise

time-varying, e.g. household, cafeteria noise



Time-Frequency (TF) masking removes undesired TF components from the observation

successfully applied for speech enhancement using deep learning [1][2]

applied in Time Difference of Arrival (TDOA) estimation

equations based on insight to deal with static noise [3]

regression to deal with reverberation [4]

This work proposes deep learning-based (CNN) TF masking for DOA estimation

to deal with everyday noise and reverberation

[1] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," ICASSP, 2013.
[2] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 22, no. 12, pp. 1849–1858, 2014.
[3] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots," IROS, 2015.
[4] K. Wilson and T. Darrell, "Learning a precedence effect-like weighting function for the generalized cross-correlation framework," TASLP, 2006.



Why use convolutional neural networks (CNNs) for speech?

Speech signals contain significant local information in the spectral domain.

changing speaker and environmental conditions → shifts in spectral position

CNNs have a translational shift invariance property → suitable option for our task.

Convolutional neural networks (CNNs)

are discriminative classifiers

compute activations through shared weights over local receptive fields




Array signal model

The ith microphone signal is modeled as the mixture of reverberated signals in the presence of noise:

x_i(t, f) = Σ_n h_{m_i, r_n}(f) · s_n(t, f) + e_i(t, f),   (1)

where the left-hand side is the observation, each summand is the reverberated nth source signal, and e_i(t, f) is noise.

f = 0, ..., K − 1 is the discrete frequency index,

t is the processing frame index,

h_{m_i, r_n}(f) is the room impulse response (RIR) between source position r_n ∈ R³ and microphone position m_i ∈ R³ in Cartesian coordinates,

i = 0, ..., M − 1, where M is the number of microphones.
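The mixture model of Eq. (1) can be sketched in the STFT domain with NumPy. The transfer functions and source spectra below are random placeholders standing in for measured RIRs and speech STFTs; this is a sketch of the model, not the authors' data pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T, K = 16, 2, 32, 172  # microphones, sources, frames, frequency bins

# Placeholders: in the paper h comes from measured RWCP RIRs and
# s from STFTs of TIMIT speech; here both are random complex values.
h = rng.standard_normal((M, N, K)) + 1j * rng.standard_normal((M, N, K))
s = rng.standard_normal((N, T, K)) + 1j * rng.standard_normal((N, T, K))
e = 0.1 * (rng.standard_normal((M, T, K)) + 1j * rng.standard_normal((M, T, K)))

# Eq. (1): x_i(t,f) = sum_n h_{m_i,r_n}(f) * s_n(t,f) + e_i(t,f)
x = np.einsum('mnk,ntk->mtk', h, s) + e
```

The `einsum` contraction sums over the source index n for every microphone, frame, and frequency bin at once, yielding an (M, T, K) array of observations.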


TF masking

Enhances the signal by applying a mask η_i(t, f) to the observed signal:

ŝ_i(t, f) = x_i(t, f) · η_i(t, f)   (2)

Wiener filter:

η(t, f) = |s(t, f)|² / ( |s(t, f)|² + |e(t, f)|² ),   (3)

used to enhance the direct-path signal in the presence of interference:

s(t, f) contains the direct-path component of the source,

e(t, f) is a mixture of source reverberation, interference with reverberation, and noise.
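When the clean and noise components are available separately, as in simulated training data, the oracle Wiener mask of Eq. (3) and the masking step of Eq. (2) are a few lines of NumPy. The magnitude spectrograms below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 32, 172  # frames x frequency bins

s_mag = rng.rayleigh(size=(T, K))             # |s(t,f)|: direct-path speech
e_mag = rng.rayleigh(scale=0.5, size=(T, K))  # |e(t,f)|: reverb + interference + noise
x_mag = s_mag + e_mag                         # stand-in for the observed |x(t,f)|

# Eq. (3): Wiener-filter mask, bounded to [0, 1] by construction
eta = s_mag**2 / (s_mag**2 + e_mag**2)

# Eq. (2): enhance by masking the observation
s_hat = x_mag * eta
```

Because both terms in the denominator are non-negative, η never leaves [0, 1], which is what makes a sigmoid output layer a natural fit for predicting it.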



CNN training data generation process

TIMIT speech convolved with RWCP RIRs → array observations

16-channel circular array, radius 15 cm.

Seven rooms, RT60 ∈ [0.3, 1.3] s.

Mixed with interference and ambient noise.

Three classes of interference [1]:

Household, Interior background, Printer.

Speech-to-interference ratios (SIRs): {+12, +6, 0, −6} dB

Ambient noise SNR ∼ U[6, 12] dB

22440 training, 3000 validation, and 5160 test mixtures

[1] BBC sound effects library
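Mixing speech with an interference clip at a target SIR, as in the {+12, +6, 0, −6} dB conditions above, reduces to one gain computation. The signals here are white-noise stand-ins for the reverberated TIMIT utterances and BBC interference clips:

```python
import numpy as np

def mix_at_sir(speech, interference, sir_db):
    """Scale the interference so that 10*log10(P_speech / P_interf) = sir_db."""
    p_s = np.mean(speech ** 2)
    p_i = np.mean(interference ** 2)
    gain = np.sqrt(p_s / (p_i * 10 ** (sir_db / 10)))
    return speech + gain * interference

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # 1 s at 16 kHz, stand-in signal
interference = rng.standard_normal(16000)  # stand-in interference clip

mixtures = {sir: mix_at_sir(speech, interference, sir) for sir in (12, 6, 0, -6)}
```

The same scaling applied to an ambient-noise signal, with the target level drawn from U[6, 12] dB, reproduces the SNR condition above.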



CNN training data generation process

The CNN maps features to targets: log(|x(t, f)|) ↦ η(t, f).

Input/output block size is 32 frames × 172 frequency bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones → single mask: computationally efficient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions, 11-by-5 (time-by-frequency) convolution

Each followed by batch normalization and dropout (0.25)

First three followed by max-pooling over 4 frequency bins. Hyper-parameters obtained with grid search

Output layer: feed-forward (sigmoid).
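The layer stack described above can be sketched as follows. The framework (PyTorch), the 'same' padding, and the exact ordering of batch normalization, activation, and dropout are assumptions; the slide fixes only the layer counts, kernel sizes, pooling factors, and the sigmoid output:

```python
import torch
import torch.nn as nn

T, K = 32, 172  # input block: frames x frequency bins

layers = []
in_ch, freq = 1, K
for i in range(4):
    # 11x5 (time x frequency) convolution with 96 feature maps,
    # followed by batch normalization, ReLU, and dropout (0.25)
    layers += [nn.Conv2d(in_ch, 96, kernel_size=(11, 5), padding='same'),
               nn.BatchNorm2d(96), nn.ReLU(), nn.Dropout(0.25)]
    in_ch = 96
    if i < 3:  # first three conv layers: max-pooling over 4 frequency bins
        layers.append(nn.MaxPool2d(kernel_size=(1, 4)))
        freq = (freq - 4) // 4 + 1
# Output layer: feed-forward with sigmoid, producing the T x K mask
layers += [nn.Flatten(), nn.Linear(96 * T * freq, T * K), nn.Sigmoid()]

model = nn.Sequential(*layers).eval()
with torch.no_grad():
    mask = model(torch.randn(1, 1, T, K)).reshape(1, T, K)
```

Pooling only along frequency shrinks 172 bins to 2 (172 → 43 → 10 → 2) while the 32-frame time axis is preserved, so the final dense layer maps 96 × 32 × 2 activations to the 32 × 172 mask.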

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28

1 Introduction

2 Time-Frequencymasking for speechenhancementArray signal model

TF-mask learning with CNNs

CNN Training data generationprocess

3 Sound direction ofarrival (DOA)estimation

4 Results

5 Conclusions andDiscussion

CNN Training data generation processCNN maps feats. to targets: log (|x(t, f)|) 7→ η(t, f).

Input/output block size is 32 frames × 172 freq.bins.

Window length 21.4 ms, 50 % overlap, 16 kHz data.

Masks between microphones are highly similar

Features and targets are averaged over microphones→ Single mask: Computationally e�icient

CNN design

Four convolutional layers

96 feature maps, ReLU activation functions,11-by-5 (time-by-frequency) convolution

Each followed by batch-norm. and dropout (0.25)

First three followed by max-pooling ↓ 4 freq. bins.Hyper-params obtained with grid search

Output layer: feed-forward (sigmoid).Robust DOA estimation using CNN-based SRP –

Pertila, Cakir ICASSP’17 11/28



SRP-PHAT

Steered Response Power with PHAse Transform (SRP-PHAT):

L(\mathbf{k}, t) = \sum_{i,j} \sum_{f=0}^{K-1} \frac{x_i(t,f) \, x_j(t,f)^*}{|x_i(t,f)| \, |x_j(t,f)^*|} \exp(j \, \tau_{i,j} \, \omega_f),   (4)

where

\mathbf{k} ∈ R³ is the sound direction (unit vector),

\tau_{i,j} = \mathbf{k}^T(\mathbf{m}_i − \mathbf{m}_j)/c is the time difference between microphones i and j,

(\cdot)^* denotes the complex conjugate, \omega_f = 2\pi f / K, and

c is the speed of sound.
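The PHAT normalisation in eq. (4) is easiest to see for a single microphone pair: whitening the cross-spectrum leaves only phase, so the inverse FFT peaks at the pair's time difference. A minimal numpy sketch (`gcc_phat_delay` is an illustrative helper, not code from the paper):

```python
import numpy as np

def gcc_phat_delay(xi, xj, max_lag, eps=1e-12):
    """GCC-PHAT for one microphone pair: whiten the cross-spectrum (PHAT)
    so only phase remains, inverse-FFT, and return the integer delay d
    (in samples) such that xj(t) ~ xi(t - d)."""
    n = 2 * len(xi)                                  # zero-pad against circular wrap
    C = np.fft.rfft(xi, n) * np.conj(np.fft.rfft(xj, n))
    r = np.fft.irfft(C / (np.abs(C) + eps), n)
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))  # lags -max..+max
    return -(int(np.argmax(r)) - max_lag)

# toy check: white noise delayed by 7 samples
rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
y = np.roll(x, 7)   # xj is xi delayed by 7 samples
```

Summing such whitened cross-spectra over all pairs, steered to candidate directions, gives the SRP-PHAT response above.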


SRP-PHAT with TF masking

Based on weighted GCC-PHAT [1].

Weighted SRP-PHAT:

L(\mathbf{k}, t) = \sum_{i,j} \sum_{f=0}^{K-1} \eta(t,f)^2 \, \frac{x_i(t,f) \, x_j(t,f)^*}{|x_i(t,f)| \, |x_j(t,f)^*|} \exp(j \, \tau_{i,j} \, \omega_f),   (4)

where

\mathbf{k} ∈ R³ is the sound direction (unit vector),

\tau_{i,j} = \mathbf{k}^T(\mathbf{m}_i − \mathbf{m}_j)/c is the time difference between microphones i and j,

(\cdot)^* denotes the complex conjugate, \omega_f = 2\pi f / K, and

c is the speed of sound.

Due to feature and target averaging, \eta_i(t,f) = \eta_j(t,f) ≡ \eta(t,f).

[1] F. Grondin and F. Michaud, "Time Difference of Arrival Estimation based on Binary Frequency Mask for Sound Source Localization on Mobile Robots", IROS, 2015.
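The weighted response, with the shared mask η entering squared as on this slide, can be sketched directly in numpy over a grid of candidate unit directions; function names are illustrative, and the per-frame maximum-response pick matches the point estimate used later:

```python
import numpy as np

def weighted_srp_phat(X, freqs, mics, dirs, eta, c=343.0, eps=1e-12):
    """Mask-weighted SRP-PHAT, eq. (4): X is the (M, T, K) complex STFT of
    M microphones, freqs the K bin frequencies in Hz, mics the (M, 3)
    microphone positions, dirs the (D, 3) candidate unit direction vectors,
    eta the shared (T, K) TF mask (eta = 1 recovers plain SRP-PHAT).
    Returns the response L with shape (D, T)."""
    M = X.shape[0]
    L = 0.0
    for i in range(M):
        for j in range(i + 1, M):
            # PHAT-normalised, mask-weighted cross-spectrum, shape (T, K)
            C = eta**2 * X[i] * np.conj(X[j])
            C /= np.abs(X[i]) * np.abs(X[j]) + eps
            # TDOA for each candidate direction: tau = k . (m_i - m_j) / c
            tau = dirs @ (mics[i] - mics[j]) / c                    # (D,)
            steer = np.exp(1j * 2 * np.pi * np.outer(tau, freqs))   # (D, K)
            L = L + np.real(np.einsum('dk,tk->dt', steer, C))
    return L

def doa_point_estimate(L, dirs):
    """For each frame, pick the maximum-response direction."""
    return dirs[np.argmax(L, axis=0)]
```

Setting `eta = 1` everywhere reduces this to the plain SRP-PHAT of the previous slide, so both variants can share one implementation.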

SRP-PHAT with TF masking

The point estimate for the DOA:

\hat{\mathbf{k}}(t) = \arg\max_{\mathbf{k}} L(\mathbf{k}, t).   (5)

For each frame t separately, pick the maximum-response direction of the weighted SRP-PHAT L(\mathbf{k}, t).

DOA performance analysis

Speech: recorded Japanese sentences (RWCP)

Mixed with reverberated interference signals

Same SIR levels as in training; different interference instances, different RIRs

1 Speech played back from a moving loudspeaker: 4 different rooms (with different numbers of RIRs), total of 1440 mixtures

2 Speech played back from a static loudspeaker: 5 different rooms (with different numbers of RIRs), total of 1650 mixtures


DOA performance analysis

Used performance measures:

1 Portion of the SRP-PHAT likelihood in the ground-truth direction (±10°) w.r.t. all directions.

2 Percentage of DOA estimates in the ground-truth direction (±10°).

Only frames with detected speech are included.
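Measure 2, the percentage of frame-wise estimates within ±10° of the ground truth, can be sketched with circular angle differences; the helper name is hypothetical, and the speech-activity filtering mentioned above is assumed to be done by the caller:

```python
import numpy as np

def doa_accuracy(est_az, ref_az, tol=10.0):
    """Percentage of frame-wise azimuth estimates within +/- tol degrees
    of the ground truth, with wraparound (359 deg vs. 1 deg is a 2 deg
    error, not 358).  Only speech-active frames should be passed in."""
    err = np.abs((np.asarray(est_az, float) - np.asarray(ref_az, float)
                  + 180.0) % 360.0 - 180.0)
    return 100.0 * float(np.mean(err <= tol))
```

The modulo trick folds any signed difference into (−180°, 180°] before taking the absolute error, so estimates near the 0°/360° seam are scored correctly.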


DOA performance analysis

Compared methods:

1 SRP-PHAT (traditional approach)

2 SRP-PHAT weighted with the CNN-predicted mask

3 SRP-PHAT weighted with an interference-canceling mask (ICM)

ICM: a Wiener filter with access to the added interference and the original reverberated speech; it will not remove the reverberation of the target speaker.


DOA performance analysis

[Figure: azimuth angle (deg) vs. time (frame) in four panels; legend: DOA estimate, active frame; DOA estimate, inactive frame; ground truth ±10°]

a) SRP-PHAT without interference,
b) SRP-PHAT with added printer interference at +6 dB SIR,
c) SRP-PHAT with ICM weight for the mixture,
d) SRP-PHAT with CNN-predicted weight for the mixture.

Results

SRP-PHAT results[1], static source

[Bar chart: Relative SRP-PHAT mass within ±10° of ground truth (Static); methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT; interference types: House, Inter. bg., Print, Avg.; SIR +12, +6, 0, −6 dB]

CNN weighting improves SRP-PHAT

At SIR +12 and +6 dB, CNN outperforms ICM, most likely caused by the reduction of reverberation.

[1] Normalized with the SRP-PHAT result without interference.


Results

SRP-PHAT results, moving source

[Bar chart: Relative SRP-PHAT mass within ±10° of ground truth (Moving); methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT; interference types: House, Inter. bg., Print, Avg.; SIR +12, +6, 0, −6 dB]

Results

DOA point estimate, static source

[Bar chart: Correct DOA estimates within ±10° of ground truth (Static); methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT; interference types: House, Inter. bg., Print, Avg.; SIR +12, +6, 0, −6 dB]

CNN weighting improves SRP-PHAT

Results

DOA point estimate, moving source

[Bar chart: Correct DOA estimates within ±10° of ground truth (Moving); methods: SRP-PHAT, CNN-W-SRP-PHAT, ICM-W-SRP-PHAT; interference types: House, Inter. bg., Print, Avg.; SIR +12, +6, 0, −6 dB]


Conclusions and Discussion

A CNN-based time-frequency-mask-weighted SRP-PHAT function was proposed.

Results showed a reduction in the detrimental effects caused by reverberation and time-varying interference.

Relative performance was unchanged by source motion.

The CNN generalized to new speakers and to new instances of interference from unseen angles.

The CNN mask is obtained from log-magnitude features → should work on other arrays (without re-training).

SRP-PHAT weighting does not fully remove the effects of interference → phase errors are not affected by (real-valued) masking.


Thank you for listening

Time for questions?
