Speech Enhancement

EE 516 Spring 2009

Alex Acero

Outline

• A model of the acoustical environment
• Simple things first!
• Microphones
• Echo cancellation
• Microphone arrays
• Single channel noise suppression

Additive noise

• Stationary noise: properties don’t change over time
  – White noise x[n]
    • Flat power spectrum: $S_{xx}(f) = q$
    • Samples are uncorrelated: $R_{xx}[n] = q\,\delta[n]$
  – White Gaussian noise
    • Pdf is Gaussian (see chapter 10)
  – Typical noise is colored
    • Pink noise: low-pass in nature
• Non-stationary noise: properties change over time
  – Babble noise
  – Cocktail party effect

Reverberation

• Impulse response of an average office

[Figure: room impulse response; x-axis: time (samples), 0 to 2000; y-axis: room impulse response amplitude, -3000 to 7000]

$$h[n] = \sum_{k=0}^{\infty} \frac{\rho^k}{r_k}\,\delta[n - T_k] = \frac{1}{c}\sum_{k=0}^{\infty} \frac{\rho^k}{T_k}\,\delta[n - T_k]$$

Model of the Environment

[Diagram: x[m] passes through the filter h[m]; the noise n[m] is added to give y[m]]

$$y[m] = x[m] * h[m] + n[m]$$
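The environment model can be tried on a toy signal. This is a minimal sketch, not material from the slides: the three-tap impulse response, the Gaussian noise, and the helper names `convolve` and `corrupt` are all assumptions made for illustration.

```python
import random


def convolve(x, h):
    """Linear convolution x[m] * h[m] (full length)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def corrupt(x, h, noise_std, seed=0):
    """y[m] = x[m] * h[m] + n[m], with n[m] assumed white Gaussian."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in convolve(x, h)]


clean = [1.0, 0.0, -1.0, 2.0]
room = [1.0, 0.5, 0.25]  # toy impulse response: direct path plus two echoes
noiseless = corrupt(clean, room, noise_std=0.0)
noisy = corrupt(clean, room, noise_std=0.1)
```

With `noise_std=0.0` the output is just the reverberated signal; a nonzero value adds the n[m] term.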

Cepstral Mean Normalization

Compute mean of cepstrum: $\bar{\mathbf{x}} = \frac{1}{T}\sum_{t=0}^{T-1} \mathbf{x}_t$

And subtract it from input: $\hat{\mathbf{x}}_t = \mathbf{x}_t - \bar{\mathbf{x}}$

CMN robust to channel distortion

Normalizes average vocal tract or short filters

Average must include > 2 sec of speech

[Figure: word error rate (%), 0 to 16, versus SNR (dB) at 10, 15, 20, and 30 dB; series: No CMN, CMN-2]
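The two CMN steps can be sketched in a few lines. This is an illustrative sketch only; the function name `cepstral_mean_normalize` and the toy cepstral vectors are assumptions, and each frame is taken to be a plain list of cepstral coefficients.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all T frames from each frame."""
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / num_frames for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]


cepstra = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
# A fixed channel shows up as a constant offset in the cepstral domain.
with_channel = [[c[0] + 10.0, c[1] - 4.0] for c in cepstra]

normalized = cepstral_mean_normalize(cepstra)
normalized_with_channel = cepstral_mean_normalize(with_channel)
```

Both inputs normalize to the same vectors, which is the sense in which CMN is robust to a fixed channel.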

RASTA

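• The mean that CMN subtracts is the output of a low-pass filter with a rectangular window:

$$\hat{\mathbf{x}}_t = \mathbf{x}_t - \frac{1}{T}\sum_{\tau=0}^{T-1} \mathbf{x}_\tau$$

• Can use other low-pass filters too
• RASTA filter is band-pass:

$$H(z) = 0.1\,z^{4}\,\frac{2 + z^{-1} - z^{-3} - 2z^{-4}}{1 - 0.98\,z^{-1}}$$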
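The RASTA transfer function above can be run as a difference equation on one cepstral coefficient track. This sketch makes one assumption that is not in the slide: it drops the $z^4$ advance and uses the causal form $y[t] = 0.98\,y[t-1] + 0.1\,(2x[t] + x[t-1] - x[t-3] - 2x[t-4])$, which only delays the output by four frames. The function name `rasta_filter` is hypothetical.

```python
def rasta_filter(track, pole=0.98):
    """Causal RASTA filtering of one coefficient trajectory (list of floats)."""
    numerator = (0.2, 0.1, 0.0, -0.1, -0.2)  # 0.1 * (2, 1, 0, -1, -2)
    out = []
    prev = 0.0
    for t in range(len(track)):
        acc = pole * prev
        for k, b in enumerate(numerator):
            if t - k >= 0:
                acc += b * track[t - k]
        out.append(acc)
        prev = acc
    return out


# A constant offset (a fixed channel) is removed after the start-up transient,
# because the numerator coefficients sum to zero (zero gain at DC).
constant = rasta_filter([5.0] * 600)
```

The zero at DC plays the role of the mean subtraction in CMN, while the pole at 0.98 sets how slowly the filter forgets past frames.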

Retrain with noisy data

• Mismatches between training and testing are bad for pattern recognition systems

• Retrain with noisy data
• Approximation: add noise to clean data and retrain

[Figure: word error rate (%), 0 to 100, versus SNR (dB), 0 to 30; curves: Mismatched, Matched (Noisy)]

Multi-condition training

• Very hard to predict exactly the type of noise we’ll encounter at test time

• Too expensive to retrain the system for each noise condition
• Train system offline with several noise types and levels

[Figure: word error rate (%), 0 to 30, versus SNR (dB), 5 to 30; curves: Matched Noise, Multistyle]

Condenser Microphone

[Figure: condenser microphone and preamplifier circuit; labels: b, h, v(t), Z_M, R_L, G, Microphone, Preamplifier]

Omnidirectional microphones

• Polar response

[Figure: polar response plot of an omnidirectional microphone; the drawing labels the diaphragm and the mic opening]

Bidirectional microphones

[Figure: bidirectional microphone; speech sound wave from the front, noise sound wave from the side]

[Figure: geometry with a source at distance r and the points (d, 0) and (-d, 0) at distances r1 and r2 from the source; polar response plot]

Bidirectional microphones

• Bidirectional microphone with d = 1 cm at 0°
• Solid line corresponds to far field conditions ($d/r = 0.02$) and the dotted line to near field conditions ($d/r = 0.5$)

[Figure: difference in air pressure (dB), -30 to 0, versus frequency (Hz), 10^2 to 10^4]

Unidirectional microphones

[Figure: unidirectional microphone; speech sound wave from the front, noise sound wave from the side; polar response plot]

Dynamic microphones

[Figure: dynamic microphone; labeled parts: diaphragm, coil, magnet, output voltage]

Acoustic Echo cancellation

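[Diagram: acoustic echo canceller; blocks: Loudspeaker, Acoustic path H, Microphone, Adaptive filter; signals: x[n], r[n], s[n] (speech signal), v[n] (local noise), d[n], $\hat{d}[n]$, e[n]]

$$\mathrm{ERLE}\,(\mathrm{dB}) = 10\log_{10}\frac{E\{d^2[n]\}}{E\{(d[n] - \hat{d}[n])^2\}}$$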
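As a quick illustration of the ERLE definition, the expectations can be replaced by sample averages. This is a sketch under that assumption; the function name `erle_db` and the toy signals are made up for the example.

```python
import math


def erle_db(d, d_hat):
    """ERLE in dB, using sample means in place of the expectations."""
    echo_power = sum(v * v for v in d) / len(d)
    residual_power = sum((a - b) ** 2 for a, b in zip(d, d_hat)) / len(d)
    return 10.0 * math.log10(echo_power / residual_power)


d = [1.0, -2.0, 3.0, -4.0]
d_hat = [0.9 * v for v in d]  # an estimate that leaves 10% of the echo amplitude
erle = erle_db(d, d_hat)      # about 20 dB: residual power is 1/100 of the echo power
```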

Line echo cancellation

[Diagram: line echo canceller; blocks: Speaker A, Hybrid circuit H, Speaker B, Adaptive filter; signals: x[n], r[n], s[n], v[n] (noise), d[n], $\hat{d}[n]$, e[n]]

$$\mathrm{ERLE}\,(\mathrm{dB}) = 10\log_{10}\frac{E\{d^2[n]\}}{E\{(d[n] - \hat{d}[n])^2\}}$$

Least Mean Squares (LMS)

• Given input: $\mathbf{X}[n] = \{x[n], x[n-1], \ldots, x[n-L+1]\}$
• Estimate output: $y[n] = \sum_{k=0}^{L-1} w_k[n]\,x[n-k] = \mathbf{W}^T[n]\,\mathbf{X}[n]$
• Compute error: $e[n] = d[n] - y[n]$
• Update filter: $\mathbf{W}[n+1] = \mathbf{W}[n] + \varepsilon\,e[n]\,\mathbf{X}[n]$
• Need to tune step size $\varepsilon$
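The four LMS steps map directly onto a short loop. The sketch below is illustrative: the three-tap echo path, the uniform random input, and the step size 0.1 are assumptions chosen so that the noise-free problem converges, and `lms` is a hypothetical helper name.

```python
import random


def lms(x, d, num_taps, step):
    """Plain LMS: returns the final weights and the error signal e[n]."""
    w = [0.0] * num_taps
    errors = []
    for n in range(len(x)):
        # Input vector X[n] = {x[n], x[n-1], ..., x[n-L+1]}, zero before the start.
        X = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, X))              # estimate output
        e = d[n] - y                                          # compute error
        w = [wk + step * e * xk for wk, xk in zip(w, X)]      # update filter
        errors.append(e)
    return w, errors


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
echo_path = [0.5, -0.3, 0.1]  # assumed unknown system to identify
d = [sum(echo_path[k] * x[n - k] for k in range(3) if n - k >= 0)
     for n in range(len(x))]

w, errors = lms(x, d, num_taps=3, step=0.1)
```

With this input power the chosen step is well inside the stable range; a much larger step, or a louder input, would make the same loop diverge, which is why the step size needs tuning.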

Normalized LMS

• Make step size adaptive to ensure convergence:

$$\varepsilon[n] = \frac{\varepsilon}{L\,\hat{\sigma}_x^2[n]}$$

• Where we track the input energy:

$$\hat{\sigma}_x^2[n] = (1-\beta)\,\hat{\sigma}_x^2[n-1] + \beta\,x^2[n]$$
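A sketch of the normalized update follows, using an input 100 times louder than a unit-scale signal to show that the same nominal step still converges. Several details are assumptions rather than slide content: the initial energy estimate taken from the first 32 samples, the small constant `eps` guarding against division by zero, the values of `step` and `beta`, and the helper name `nlms`.

```python
import random


def nlms(x, d, num_taps, step, beta=0.05, init_frames=32, eps=1e-12):
    """Normalized LMS with a recursively tracked input energy estimate."""
    w = [0.0] * num_taps
    head = x[:init_frames]
    power = sum(v * v for v in head) / max(1, len(head))  # assumed initialization
    for n in range(len(x)):
        power = (1.0 - beta) * power + beta * x[n] * x[n]  # track input energy
        X = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        e = d[n] - sum(wk * xk for wk, xk in zip(w, X))
        step_n = step / (num_taps * power + eps)           # adaptive step size
        w = [wk + step_n * e * xk for wk, xk in zip(w, X)]
    return w


rng = random.Random(1)
loud_x = [rng.uniform(-100.0, 100.0) for _ in range(5000)]
echo_path = [0.5, -0.3, 0.1]
loud_d = [sum(echo_path[k] * loud_x[n - k] for k in range(3) if n - k >= 0)
          for n in range(len(loud_x))]

w_nlms = nlms(loud_x, loud_d, num_taps=3, step=0.2)
```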

Recursive Least Squares (RLS)

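• Newton-Raphson:

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}$$

[Figure: Newton-Raphson iteration on f(x), from x0 to x1]

• New weights:

$$\mathbf{w}_{i+1} = \mathbf{w}_i - \varepsilon[n]\left(\nabla^2 e(\mathbf{w}_i)\right)^{-1}\nabla e(\mathbf{w}_i)$$

$$\nabla^2 e(\mathbf{w}_i) = \mathbf{R}[n] = E\{\mathbf{x}[n]\,\mathbf{x}^T[n]\}$$

$$\mathbf{R}[n] = \lambda\,\mathbf{R}[n-1] + \mathbf{x}[n]\,\mathbf{x}^T[n]$$

• Faster convergence, but more CPU intensive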

Microphone arrays: delay & sum

5 microphones spaced 5 cm apart. Source located at 5 m.

Angle 0°

[Figure: four polar beam patterns, at 400 Hz, 880 Hz, 4400 Hz, and 8000 Hz]

[Diagram: microphones M-2, M-1, M0, M1, M2 with spacing a, and source S]

$$y[n] = \frac{1}{N}\sum_{i=0}^{N-1} y_i[n - i a \sin\theta]$$

$$\hat{\theta} = \arg\max_{\theta} \sum_n \left(\frac{1}{N}\sum_{i=0}^{N-1} y_i[n - i a \sin\theta]\right)^2$$
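The frequency dependence of the delay-and-sum beam can be reproduced numerically for the 5-microphone, 5 cm array. This sketch assumes a far-field plane wave and a speed of sound of 343 m/s, neither of which is stated on the slide; `beam_response` is a hypothetical helper.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def beam_response(freq_hz, arrival_deg, steer_deg=0.0, num_mics=5, spacing_m=0.05):
    """Magnitude response of a delay-and-sum beamformer steered to steer_deg."""
    arrival = math.sin(math.radians(arrival_deg))
    steer = math.sin(math.radians(steer_deg))
    total = 0j
    for i in range(num_mics):
        # Residual delay at microphone i after the steering delay is applied.
        delay = i * spacing_m * (arrival - steer) / SPEED_OF_SOUND
        total += cmath.exp(-2j * math.pi * freq_hz * delay)
    return abs(total) / num_mics


for freq in (400, 880, 4400, 8000):
    # Response to a source arriving from the side (90 degrees), beam steered to 0.
    print(freq, round(beam_response(freq, 90.0), 3))
```

At 400 Hz the array is short compared with the wavelength, so the side response stays close to 1 (little directivity); at higher frequencies the beam narrows.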

Microphone arrays: delay & sum

5 microphones spaced 5 cm apart. Source located at 5 m.

Angle 30°

[Figure: four polar beam patterns, at 400 Hz, 880 Hz, 4400 Hz, and 8000 Hz]

[Diagram: microphones M-2, M-1, M0, M1, M2 with spacing a, and source S]

$$y[n] = \frac{1}{N}\sum_{i=0}^{N-1} y_i[n - i a \sin\theta]$$

$$\hat{\theta} = \arg\max_{\theta} \sum_n \left(\frac{1}{N}\sum_{i=0}^{N-1} y_i[n - i a \sin\theta]\right)^2$$

WITTY: Who Is Talking To You?

$$Y(f) = X(f) + V(f)$$

$$B(f) = H(f)\,X(f) + G(f)\,V(f) + W(f)$$

Bone microphone for noise robust ASR

• Conventional microphones are sensitive to noise
• Bone microphones are more noise resistant, but distort the signal

• Not enough data to retrain recognizer with bone microphone

• Fusion between acoustic microphone and bone microphone

[Diagram: acoustic microphone and bone microphone signals combined by microphone fusion]

Relationship between acoustic mic and bone mic

[Figure: acoustic microphone signal and contact (bone) microphone signal]

Blind source separation

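• Linear mixing: $\mathbf{y}[n] = \mathbf{G}\,\mathbf{x}[n]$
• Estimate filter: $\mathbf{H} = \mathbf{G}^{-1}$
• Separate signals: $\mathbf{x}[n] = \mathbf{H}\,\mathbf{y}[n]$
• Using assumption signals are independent:

$$p_{\mathbf{y}}(\mathbf{y}[n]) = |\mathbf{H}|\;p_{\mathbf{x}}(\mathbf{H}\mathbf{y}[n])$$

$$p_{\mathbf{y}}(\mathbf{y}[0], \mathbf{y}[1], \ldots, \mathbf{y}[N-1]) = \prod_{n=0}^{N-1} p_{\mathbf{y}}(\mathbf{y}[n]) = |\mathbf{H}|^N \prod_{n=0}^{N-1} p_{\mathbf{x}}(\mathbf{H}\mathbf{y}[n])$$

• Do gradient descent:

$$\mathbf{H}_{n+1} = \mathbf{H}_n + \varepsilon\left[(\mathbf{H}_n^T)^{-1} - \phi(\mathbf{H}_n\mathbf{y}[n])\,(\mathbf{y}[n])^T\right]$$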

Blind source separation

Idea: Estimate filters h11[n] and h12[n] that maximize $p(z_1[n] \mid \lambda)$, where $\lambda$ is an HMM.

Approximate the HMM by a Gaussian Mixture Model with LPC parameters => EM algorithm with a linear set of equations

[Diagram: y1[n] and y2[n] pass through the filters h11[n], h12[n], h21[n], and h22[n]; the filter outputs are summed to give z1[n] and z2[n]]

Spectral subtraction

Corrupted signal: $y[m] = x[m] + n[m]$

Power spectrum: $|Y(f)|^2 = |X(f)|^2 + |N(f)|^2 + 2\,|X(f)|\,|N(f)|\cos\theta$

but $E\{\cos\theta\} = 0$

So $|Y(f)|^2 \approx |X(f)|^2 + |N(f)|^2$

Estimate noise power spectrum from M frames that contain only noise:

$$|\hat{N}(f)|^2 = \frac{1}{M}\sum_{i=0}^{M-1} |Y_i(f)|^2$$

Estimate clean power spectrum as

$$|\hat{X}(f)|^2 = |Y(f)|^2 - |\hat{N}(f)|^2 = |Y(f)|^2\left(1 - \frac{1}{SNR(f)}\right)$$

$$SNR(f) = \frac{|Y(f)|^2}{|\hat{N}(f)|^2}$$

Spectral subtraction

Keep original phase: $\hat{X}(f) = Y(f)\,H_{ss}(f)$

Ensure it’s positive: $H_{ss}(f) = \sqrt{\max\left(1 - \frac{1}{SNR(f)},\; a\right)}$

[Figure: gain (dB), -12 to 0, versus instantaneous SNR (dB), -5 to 20; curves: spectral subtraction, magnitude subtraction, oversubtraction]
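The noise estimate and the floored gain can be sketched on a few frequency bins. The inputs here are power spectra given as plain lists; the floor value `a = 0.1`, the toy numbers, and the helper names are assumptions for illustration. The gain is applied to the magnitude, so the power is scaled by its square.

```python
import math


def estimate_noise_power(noise_frames):
    """Average |Y_i(f)|^2 over frames assumed to contain only noise."""
    num_frames = len(noise_frames)
    num_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / num_frames
            for k in range(num_bins)]


def spectral_subtraction_gain(noisy_power, noise_power, floor=0.1):
    """Per-bin magnitude gain sqrt(max(1 - 1/SNR(f), a)); noise_power must be > 0."""
    gains = []
    for y2, n2 in zip(noisy_power, noise_power):
        snr = y2 / n2
        gains.append(math.sqrt(max(1.0 - 1.0 / snr, floor)))
    return gains


noise_frames = [[1.0, 2.0, 4.0], [3.0, 2.0, 4.0]]
noise_power = estimate_noise_power(noise_frames)
noisy_power = [8.0, 2.0, 1.0]
gains = spectral_subtraction_gain(noisy_power, noise_power)
clean_power = [g * g * y2 for g, y2 in zip(gains, noisy_power)]
```

In the first bin the result equals the plain subtraction (8 - 2 = 6); in the other two bins the subtraction would be zero or negative, so the floor takes over.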

Aurora 2

• ETSI STQ group
• TIDigits
• Added noise at SNRs: -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB
• Set A: subway, babble, car, exhibition
• Set B: restaurant, airport, street, station
• Set C: one noise from set A and one noise from set B
• Aurora 3 recorded in car (no digital mixing!)
• Aurora 4 for large vocabulary
• Advanced Front-End (AFE) standard (2001) uses a variant of spectral subtraction

Aurora 2 (Clean training)

Using SPLICE algorithm

Word accuracy (%)

          Set A                                                  Set B                                                  Set C                            Overall
SNR       Subway     Babble     Car        Exhibition Average    Restaurant Street     Airport    Station    Average    Subway M   Street M   Average    Average
Clean
20 dB     98.16      98.52      98.72      98.27      98.42      98.65      97.58      98.81      98.7       98.44      98.34      98.04      98.19      98.38
15 dB     96.65      97.64      98.09      96.61      97.25      97.88      96.89      97.97      97.84      97.65      96.81      96.4       96.61      97.28
10 dB     93.77      94.68      95.71      93.09      94.31      94.75      93.44      95.85      94.6       94.66      93.18      91.23      92.21      94.03
5 dB      87.47      84.46      88.46      85.53      86.48      85.08      83.71      87.03      84.94      85.19      84.31      80.35      82.33      85.13
0 dB      65.92      57.13      63.67      63.78      62.63      59.72      57.83      63.11      57.42      59.52      59.23      52.9       56.07      60.07
-5 dB
Average   88.39      86.49      88.93      87.46      87.82      87.22      85.89      88.55      86.70      87.09      86.37      83.78      85.08      86.98

Relative improvement (%)

          Set A                                                  Set B                                                  Set C                            Overall
SNR       Subway     Babble     Car        Exhibition Average    Restaurant Street     Airport    Station    Average    Subway M   Street M   Average    Average
Clean
20 dB     37.63      84.97      50.58      52.08      56.31      86.51      43.19      87.29      75.38      73.09      74.62      59.75      67.19      65.20
15 dB     48.54      91.01      80.82      57.41      69.45      91.08      73.07      91.17      86.79      85.53      75.89      67.54      71.71      76.33
10 dB     70.72      89.48      87.00      71.61      79.70      88.39      80.05      91.01      86.40      86.46      73.87      65.70      69.79      80.42
5 dB      73.81      78.77      82.49      73.77      77.21      78.37      73.53      81.38      79.11      78.10      67.80      61.31      64.56      75.04
0 dB      53.94      52.74      57.53      55.80      55.00      54.76      48.67      56.90      51.85      53.05      45.33      38.90      42.12      51.64
-5 dB
Average   61.96      73.03      71.90      63.75      68.48      73.03      63.33      75.52      70.02      70.83      59.73      52.14      55.93      67.39

Aurora 2 (multi-condition training)

Using SPLICE algorithm

Word accuracy (%)

          Set A                                                  Set B                                                  Set C                            Overall
SNR       Subway     Babble     Car        Exhibition Average    Restaurant Street     Airport    Station    Average    Subway M   Street M   Average    Average
Clean
20 dB     98.53      98.64      98.51      98.64      98.58      98.46      97.91      98.6       98.58      98.39      98.4       98.25      98.33      98.45
15 dB     97.64      98.07      98.33      97.69      97.93      97.79      97.49      97.44      97.47      97.55      97.88      97.16      97.52      97.70
10 dB     95.98      96.37      96.84      95.65      96.21      95.27      94.41      95.11      95.12      94.98      95.79      93.8       94.80      95.43
5 dB      92.08      88.94      92.78      90.25      91.01      87.63      88.06      88.16      87.04      87.72      90.97      85.85      88.41      89.18
0 dB      78.02      65.57      76.83      74.42      73.71      65.37      68.23      69.49      65.57      67.17      72.67      65.42      69.05      70.16
-5 dB
Average   92.45      89.52      92.66      91.33      91.49      88.90      89.22      89.76      88.76      89.16      91.14      88.10      89.62      90.18

Relative improvement (%)

          Set A                                                  Set B                                                  Set C                            Overall
SNR       Subway     Babble     Car        Exhibition Average    Restaurant Street     Airport    Station    Average    Subway M   Street M   Average    Average
Clean
20 dB     38.49      40.09      24.37      47.49      37.61      50.80      13.64      45.31      52.51      40.56      40.74      49.28      45.01      40.27
15 dB     33.14      34.80      30.13      30.63      32.17      52.98      31.98      34.02      43.40      40.59      41.92      36.47      39.19      36.95
10 dB     27.70      23.09      25.82      26.15      25.69      41.17      1.06       27.12      31.56      25.23      36.79      17.33      27.06      25.78
5 dB      31.96      11.16      40.82      21.37      26.33      24.85      17.03      13.89      21.36      19.28      48.66      19.00      33.83      25.01
0 dB      33.60      9.04       50.24      28.23      30.27      14.93      17.82      12.55      21.54      16.71      48.61      24.10      36.35      26.06
-5 dB
Average   32.85      13.01      45.52      27.57      30.15      24.04      16.83      17.14      24.99      21.05      47.14      24.13      36.01      27.87

Wiener Filtering

• Find linear estimate of clean signal:

$$y[n] = x[n] + v[n]$$

$$\hat{x}[n] = \sum_m h[m]\,y[n-m]$$

• MMSE (Minimum Mean Squared Error): minimize

$$E\left\{\left(x[n] - \sum_m h[m]\,y[n-m]\right)^2\right\}$$

• Wiener-Hopf equation:

$$R_{xy}[l] = \sum_m h[m]\,R_{yy}[l-m]$$

$$R_{xy}[l] = \sum_m x[m]\,y[m-l], \qquad R_{yy}[l] = \sum_m y[m]\,y[m-l]$$

• In Freq domain:

$$H(f) = \frac{S_{xy}(f)}{S_{yy}(f)}$$

• If noise and signal are uncorrelated:

$$H(f) = \frac{S_{xx}(f)}{S_{xx}(f) + S_{vv}(f)}$$

Wiener Filtering

• Find linear estimate of clean signal:

$$y[n] = x[n] + v[n]$$

$$\hat{x}[n] = \sum_m h[m]\,y[n-m]$$

• If noise and signal are uncorrelated:

$$H(f) = \frac{S_{xx}(f)}{S_{yy}(f)} = \frac{S_{yy}(f) - S_{vv}(f)}{S_{yy}(f)} = 1 - \frac{1}{SNR(f)}$$

• With

$$SNR(f) = \frac{S_{yy}(f)}{S_{vv}(f)}$$

• Compare with Spectral Subtraction:

$$H_{ss}(f) = \sqrt{\max\left(1 - \frac{1}{SNR(f)},\; a\right)}$$
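A small numerical check that the two forms of the Wiener gain agree when speech and noise are uncorrelated, so that $S_{yy}(f) = S_{xx}(f) + S_{vv}(f)$. The spectra below are made-up values and the function names are hypothetical.

```python
def wiener_gain(speech_psd, noise_psd):
    """H(f) = Sxx(f) / (Sxx(f) + Svv(f)) per frequency bin."""
    return [sxx / (sxx + svv) for sxx, svv in zip(speech_psd, noise_psd)]


def wiener_gain_from_snr(noisy_psd, noise_psd):
    """H(f) = 1 - 1/SNR(f), with SNR(f) = Syy(f) / Svv(f)."""
    return [1.0 - 1.0 / (syy / svv) for syy, svv in zip(noisy_psd, noise_psd)]


speech_psd = [9.0, 1.0, 0.25]
noise_psd = [1.0, 1.0, 1.0]
noisy_psd = [s + v for s, v in zip(speech_psd, noise_psd)]

direct = wiener_gain(speech_psd, noise_psd)
from_snr = wiener_gain_from_snr(noisy_psd, noise_psd)
```

In practice $S_{yy}$ and $S_{vv}$ are estimates, so the second form can go negative and is usually floored, as in spectral subtraction.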

Spectral Subtraction

[Figure: word error rate (%), 0 to 100, versus SNR (dB), 0 to 30; curves: Clean Speech Training, Spectral Subtraction, Matched Noisy Training]

Vector Taylor Series (VTS)

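• Acero, Moreno

$$y[m] = x[m] * h[m] + n[m]$$

• The power spectrum, on the average:

$$|Y(f_i)|^2 = |X(f_i)|^2\,|H(f_i)|^2 + |N(f_i)|^2$$

• Taking logs:

$$\ln|Y(f_i)|^2 = \ln|X(f_i)|^2 + \ln|H(f_i)|^2 + \ln\left(1 + \exp\left(\ln|N(f_i)|^2 - \ln|X(f_i)|^2 - \ln|H(f_i)|^2\right)\right)$$

• Cepstrum is DCT (matrix C) of log power spectrum:

$$\mathbf{y} = \mathbf{x} + \mathbf{h} + \mathbf{g}(\mathbf{n} - \mathbf{x} - \mathbf{h})$$

$$\mathbf{g}(\mathbf{z}) = \mathbf{C}\,\ln\left(1 + e^{\mathbf{C}^{-1}\mathbf{z}}\right)$$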
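The log-domain identity can be checked on a single frequency bin. This sketch works with scalar natural-log power values, which corresponds to taking C as the identity; that simplification and the function names are assumptions for illustration.

```python
import math


def log_mismatch(z):
    """Scalar g(z) = ln(1 + e^z), the C = identity case."""
    return math.log1p(math.exp(z))


def corrupted_log_power(x, h, n):
    """y = x + h + g(n - x - h) for log power values x, h, n."""
    return x + h + log_mismatch(n - x - h)


x, h, n = 3.0, -0.5, 1.0
direct = math.log(math.exp(x) * math.exp(h) + math.exp(n))  # ln(|X|^2 |H|^2 + |N|^2)
via_g = corrupted_log_power(x, h, n)

high_snr = corrupted_log_power(20.0, 0.0, 0.0)  # noise far below speech
```

When the noise is far below the filtered speech, g is close to zero and y is close to x + h; when the noise dominates, y approaches n.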

Vector Taylor Series (VTS)

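• x, h, and n are Gaussian random vectors with means $\boldsymbol{\mu}_x$, $\boldsymbol{\mu}_h$, and $\boldsymbol{\mu}_n$ and covariance matrices $\boldsymbol{\Sigma}_x$, $\boldsymbol{\Sigma}_h$, and $\boldsymbol{\Sigma}_n$

• Expand y in first-order Taylor series:

$$\mathbf{y} \approx \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + \mathbf{g}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h) + \mathbf{A}(\mathbf{x} - \boldsymbol{\mu}_x) + \mathbf{A}(\mathbf{h} - \boldsymbol{\mu}_h) + (\mathbf{I} - \mathbf{A})(\mathbf{n} - \boldsymbol{\mu}_n)$$

$$\mathbf{A} = \mathbf{C}\,\mathbf{F}\,\mathbf{C}^{-1}$$

where F is a diagonal matrix whose elements are given by the vector f evaluated at $\boldsymbol{\mu} = \boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h$:

$$\mathbf{f}(\boldsymbol{\mu}) = \frac{1}{1 + e^{\mathbf{C}^{-1}\boldsymbol{\mu}}}$$

$$\boldsymbol{\mu}_y \approx \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + \mathbf{g}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h)$$

$$\boldsymbol{\Sigma}_y \approx \mathbf{A}\boldsymbol{\Sigma}_x\mathbf{A}^T + \mathbf{A}\boldsymbol{\Sigma}_h\mathbf{A}^T + (\mathbf{I} - \mathbf{A})\boldsymbol{\Sigma}_n(\mathbf{I} - \mathbf{A})^T$$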
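In one dimension the VTS mean and variance formulas reduce to a few arithmetic operations. The sketch below assumes scalar natural-log power values with C equal to the identity, so A becomes the scalar $1/(1 + e^{\mu})$; `vts_first_order` is a hypothetical name.

```python
import math


def vts_first_order(mu_x, mu_h, mu_n, var_x, var_h, var_n):
    """First-order VTS approximation of the mean and variance of y (scalar case)."""
    mu = mu_n - mu_x - mu_h
    a = 1.0 / (1.0 + math.exp(mu))  # scalar version of A = C F C^-1
    mean_y = mu_x + mu_h + math.log1p(math.exp(mu))
    var_y = a * a * var_x + a * a * var_h + (1.0 - a) ** 2 * var_n
    return mean_y, var_y, a


# Noise and filtered speech at equal level: A = 0.5, so both contribute equally.
mean_y, var_y, a = vts_first_order(2.0, 1.0, 3.0, 4.0, 0.0, 8.0)

# Speech far above the noise: A is close to 1 and the speech variance passes through.
_, var_clean, a_clean = vts_first_order(20.0, 0.0, 0.0, 4.0, 0.0, 8.0)
```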

Vector Taylor Series

• Distribution of corrupted log-spectra
• Noise with mean of 0 dB and std dev of 2 dB
• Speech with mean of 25 dB
• Monte Carlo simulation
• Std dev of speech: 25 dB, 10 dB, 5 dB

[Figure: three histograms of the corrupted log-spectrum (dB), one for each speech std dev]
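The Monte Carlo experiment is easy to reproduce approximately. This sketch assumes no channel (h = 0), independent Gaussian speech and noise log-spectra in dB, 20000 samples, and a fixed seed; none of these settings come from the slide, and `simulate_corrupted_db` is a hypothetical helper.

```python
import math
import random


def simulate_corrupted_db(speech_std_db, num_samples=20000, seed=0):
    """Draw (x, n, y) triples in dB, with y the log of the summed powers."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        x_db = rng.gauss(25.0, speech_std_db)  # speech: mean 25 dB
        n_db = rng.gauss(0.0, 2.0)             # noise: mean 0 dB, std dev 2 dB
        y_db = 10.0 * math.log10(10.0 ** (x_db / 10.0) + 10.0 ** (n_db / 10.0))
        samples.append((x_db, n_db, y_db))
    return samples


results = {}
for std in (25.0, 10.0, 5.0):
    ys = [y for _, _, y in simulate_corrupted_db(std)]
    results[std] = sum(ys) / len(ys)
    print(std, round(results[std], 2))
```

With a 25 dB speech spread many speech samples fall below the noise, where y follows the noise instead, so the distribution of y is clearly not Gaussian; with a 5 dB spread the speech almost always dominates and y stays close to the speech distribution.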

Phase matters

Corrupted signal: $y[m] = x[m] + n[m]$

Spectrum: $|Y(f)|^2 = |X(f)|^2 + |N(f)|^2 + 2\,|X(f)|\,|N(f)|\cos\theta$

But $|Y(f)|^2 \approx |X(f)|^2 + |N(f)|^2$ is only an approximation: $E\{\cos\theta\} = 0$, but in a given frame

$$2\,|X_t(f)|\,|N_t(f)|\cos\theta_t \neq 0$$

[Figure: two scatter plots with axes from -6 to 2]
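The effect of the cross term can be seen by adding two complex spectral values with a known phase difference. The magnitudes 3 and 2 and the helper names are made up for this sketch.

```python
import cmath
import math


def noisy_power(x_mag, n_mag, theta):
    """|Y|^2 when X and N have magnitudes x_mag, n_mag and phase difference theta."""
    return abs(x_mag + n_mag * cmath.exp(1j * theta)) ** 2


def cross_term(x_mag, n_mag, theta):
    """The term 2 |X| |N| cos(theta) that the approximation drops."""
    return 2.0 * x_mag * n_mag * math.cos(theta)


x_mag, n_mag = 3.0, 2.0
in_phase = noisy_power(x_mag, n_mag, 0.0)          # cross term adds
out_of_phase = noisy_power(x_mag, n_mag, math.pi)  # cross term subtracts

# Averaged over uniformly spread phases the cross term vanishes.
thetas = [2.0 * math.pi * k / 360 for k in range(360)]
average_power = sum(noisy_power(x_mag, n_mag, t) for t in thetas) / len(thetas)
```

The approximation $|X|^2 + |N|^2 = 13$ holds only for the average; individual frames range from 1 to 25 in this example.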

Non-stationary noise

• Speech/Noise decomposition (Varga et al.)

[Figure: observations, speech HMM, and noise HMM]
