ee269 signal processing for machine learning - cepstrum

16
EE269 Signal Processing for Machine Learning Cepstrum Instructor : Mert Pilanci Stanford University October 11 2021

Upload: others

Post on 09-May-2022

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: EE269 Signal Processing for Machine Learning - Cepstrum

EE269Signal Processing for Machine Learning

Cepstrum

Instructor : Mert Pilanci

Stanford University

October 11 2021

Page 2: EE269 Signal Processing for Machine Learning - Cepstrum

Linear systems and additive noise

I Linear systems, e.g., filters, can easily separate additive noisefrom useful information when we know the frequency range ofthe noise and information

y[n] = x[n] + w[n]

I In vector notation

Hy = Hx+Hw

Page 3: EE269 Signal Processing for Machine Learning - Cepstrum

Multiplicative or convolutive noise

I This is harder if the signal and noise are convoluted, e.g., inspeech processing

y[n] = x[n] ∗ w[n]

I w[n] is the flowing air (noise source)

I h[n] is the vocal tract (filter)

We can develop an operator that can separate convolutedcomponents by transforming convolution into addition

Page 4: EE269 Signal Processing for Machine Learning - Cepstrum

Cepstrum

I Developed to separate convoluted signals

y[n] = x[n] ∗ w[n]

Discrete Fourier Domain:

Y [k] = X[k]W [k]

I Take logarithms

log[Y [k]] = logX[k] + logW [k]

I we can apply a linear filter to log Y [k] to separate

I equivalently we can take DFT of log Y [k] and process infrequency domain

cepstrum is the DFT (or DCT) of the log spectrum

Page 5: EE269 Signal Processing for Machine Learning - Cepstrum
Page 6: EE269 Signal Processing for Machine Learning - Cepstrum
Page 7: EE269 Signal Processing for Machine Learning - Cepstrum
Page 8: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Mel-frequency spectrum

I perceptual scale of pitches

I 1 mels = 1000 Hz

I a formula to convert f hertz into m mels

m = 2595 log10

⇣1 + f

700

Page 9: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Mel-frequency spectrum

I weighted DFT magnitude

I mel-frequency spectrum MF [r] is defined as

MF [r] =X

k

|Vr[k]X[k]|2

I Vr[k] is the triangular weighting function for the rth filter.

I bandwidths are constant for center frequencies ¡ 1kHz and

then increase exponentially

I identical to convolutions with 22 filters

Page 10: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Mel-frequency spectrum

I weighted DFT magnitude

I mel-frequency spectrum MF [r] is defined as

MF [r] =X

k

|Vr[k]X[k]|2

I Vr[k] is the triangular weighting function for the rth filter.

I bandwidths are constant for center frequencies ¡ 1kHz and

then increase exponentially

I identical to convolutions with 22 filters

Page 11: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Mel-frequency spectrum

MF[r] =X

k

|Vr[k]X[k]|2

I Mel Frequency Cepstral Coe�cient (MFCC)

MFCC[m] =RX

r=1

log(MF[r]) cos

2⇡

R

✓r +

1

2

◆m

�(1)

I i.e., inner-product with cosines MFCC[m] = hlogMF[r], cm[r]i

Page 12: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Speaker Identification

I train a k-Nearest Neighbor classifier to classify frames

Page 13: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Speaker Identification

I AN4 dataset (CMU): 5 male and 5 female subjects speaking

words and numbers

I collect the training samples into frames of 30 ms with an

overlap of 75%

I calculate MFCC

I train a k-Nearest Neighbor classifier on the frames

I for a given test signal, predictions are made every frame

I most frequently occurring label is declared as the speaker

Page 14: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Speaker Identification

speaker 1 (blue) and speaker 2 (red) time domain signals

frame based MFCC features

Page 15: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Speaker Identification

Page 16: EE269 Signal Processing for Machine Learning - Cepstrum

Application: Speaker Identification

I average accuracy is 92.93%