MFCC tutorial - ocw.nthu.edu.tw

Post on 24-Jan-2021


MFCC tutorial

Brought to you by EE6641 TAs


Outline

History

Mel frequency

Cepstrum

MFCC

Applications

Conclusions

History

Signal Processing

Speech recognition

Pitch detection

Cover-song detection, and so on…


1930s


1950s


1960s


1960s

Raj Reddy

Soviet researchers => dynamic time warping (DTW), capable of recognizing 200 words


1960s

IDA (Institute for Defense Analyses): Leonard Baum


1980s

Can recognize 20,000 words

4 MB of RAM => 100 minutes to process 30 seconds of speech


1990s

Commercial opportunities

Vocabulary size larger than an average human's


2000s

Lernout & Hauspie

Dragon Systems

Later bankrupt


2010s

Deep learning

Reduced error rate by 30%

“the most dramatic change”


Mel-frequency

perceptual scale of pitch

1000 Hz corresponds to 1000 mel

Hearing threshold ("聽閾")

Not all conversion formulas are the same


Hz    40  161  200  404  693  867  1000  2022  3000  3393  4109  5526  6500  7743  12000

mel   43  257  300  514  771  928  1000  1542  2000  2142  2314  2600  2771  2914   3228
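
Since several Hz-to-mel formulas are in circulation, a small sketch may help make the mapping concrete. It assumes one commonly used formula, m = 2595 * log10(1 + f / 700); the slides do not say which formula they use, the function names are illustrative, and the results need not reproduce the table values exactly.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mel using the assumed 2595 * log10(1 + f/700) formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel under the same assumed formula."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

if __name__ == "__main__":
    # Print the mapping for a few of the frequencies listed in the table.
    for f in (40, 200, 1000, 3000, 12000):
        print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

With this formula 1000 Hz lands very close to 1000 mel, which is the anchor point of the scale; above that the mel value grows roughly logarithmically with frequency.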


Cepstrum

FFT => abs() => log() => IFFT (or FFT)

“quefrency”
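
The chain above can be sketched in a few lines of NumPy. This is a minimal illustration of the real cepstrum, not the course's reference code; the function name and the small `eps` added before the logarithm (to avoid log(0)) are assumptions.

```python
import numpy as np

def real_cepstrum(x, eps=1e-12):
    """FFT => abs() => log() => IFFT, returned as a real sequence."""
    spectrum = np.fft.fft(x)
    log_magnitude = np.log(np.abs(spectrum) + eps)  # eps guards against log(0)
    # For a real input the log-magnitude is real and even,
    # so the inverse FFT is real up to rounding error.
    return np.fft.ifft(log_magnitude).real

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(256)
    c = real_cepstrum(signal)
    print(c[:5])  # low-"quefrency" coefficients describe the smooth envelope
```

The horizontal axis of the result is the "quefrency" axis: low indices capture the slowly varying spectral envelope, higher indices capture fine structure such as pitch harmonics.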


Spectral Envelope


Why dB?


Discrete Cosine Transform vs. Discrete Fourier Transform

DCT

DFT

Difference?

DCT compared with DFT:

2x frequency resolution

0.5x memory (real-valued coefficients)

O(N*N) when computed directly
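
One way to see the link between the two transforms is that a DCT-II of N samples equals (up to a phase factor) the DFT of the 2N-sample mirrored signal. The sketch below, with illustrative function names and an assumed unnormalized scaling (factor 2), computes the DCT-II both directly from its O(N*N) definition and through an FFT, and checks that they agree.

```python
import numpy as np

def dct2_direct(x):
    """DCT-II from its definition: an N x N cosine matrix, hence O(N*N)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    return 2.0 * (basis @ x)  # unnormalized convention (an assumption)

def dct2_via_fft(x):
    """Same DCT-II obtained from the DFT of the mirrored (even) extension."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    mirrored = np.concatenate([x, x[::-1]])   # length 2N, symmetric
    spectrum = np.fft.fft(mirrored)[:N]       # keep the first N bins
    phase = np.exp(-1j * np.pi * np.arange(N) / (2 * N))
    return (spectrum * phase).real            # imaginary part is ~0

if __name__ == "__main__":
    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(dct2_direct(x))
    print(np.allclose(dct2_direct(x), dct2_via_fft(x)))
```

Because the mirrored signal is symmetric, its spectrum carries no extra information in the imaginary part, which is why the DCT can store real coefficients only.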


MFCC

FFT => power spectrum => triangular filter banks (usually 26) => log => DCT (or IDCT)

Take the coefficients (usually 13)
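
The pipeline can be sketched for a single frame as follows. The 26 filters and 13 coefficients come from the slide; everything else is an assumption made for illustration: the Hamming window, the 512-point FFT, the 2595 * log10(1 + f/700) mel formula, keeping the first 13 DCT outputs, and all function names.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling edge, peak of 1 at center
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13, n_fft=512):
    """FFT => power spectrum => mel filter bank => log => DCT => keep n_coeffs."""
    windowed = frame * np.hamming(len(frame))  # windowing is an assumption
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2 / n_fft
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(energies + 1e-10)    # small offset avoids log(0)
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct_basis @ log_energies            # first n_coeffs DCT-II outputs

if __name__ == "__main__":
    sr = 16000
    t = np.arange(400) / sr                    # one 25 ms frame
    print(mfcc_frame(np.sin(2 * np.pi * 440 * t), sr))
```

A full front end would slide this over overlapping frames of the recording and stack the resulting 13-dimensional vectors into a feature matrix.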


Why MFCC?

Simplicity (only a few coefficients)

Smoothness

Applications

Machine Learning

Unsupervised learning

Supervised learning

Semi-supervised learning


Unsupervised learning

Expectation maximization

E-step vs M-step


K-means
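
The slide only names k-means, so the following is a generic sketch rather than the course's code. It is written to echo the E-step/M-step alternation mentioned earlier: assigning each point to its nearest center plays the role of a (hard) E-step, and recomputing the centers plays the role of the M-step. The function name, iteration count, and toy data are assumptions.

```python
import numpy as np

def kmeans(points, k, n_iters=20, seed=0):
    """Plain k-means on an (n_points, n_dims) array."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # "E-step": assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # "M-step": move each center to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels

if __name__ == "__main__":
    data = np.array([[0, 0], [0, 1], [1, 0],
                     [10, 10], [10, 11], [11, 10]], dtype=float)
    centers, labels = kmeans(data, k=2)
    print(centers)
    print(labels)
```

In an MFCC setting, the rows of `points` would be per-frame coefficient vectors instead of 2-D toy points.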


GMM


Supervised learning

Based on “labels”

Empirical vs General

Error minimization


SVM


kNN
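
As with k-means, the slide gives only the name, so this is a generic k-nearest-neighbours sketch under stated assumptions: Euclidean distance, majority vote, and made-up 2-D feature vectors with hypothetical instrument labels standing in for real MFCC features.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbours."""
    dists = np.linalg.norm(train_x - query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    values, counts = np.unique(train_y[nearest], return_counts=True)
    return values[counts.argmax()]

if __name__ == "__main__":
    # Toy features; in practice each row could be an averaged MFCC vector.
    train_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                        [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
    train_y = np.array(["flute", "flute", "flute",
                        "violin", "violin", "violin"])  # hypothetical labels
    print(knn_predict(train_x, train_y, np.array([0.2, 0.1])))
```

kNN needs no training phase beyond storing labelled examples, which makes it a convenient first baseline for small self-recorded datasets.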


Neural nets


Musical Instruments Identification

Using audio we recorded ourselves


Quick Recap

Why Mel-Filterbank?

Human auditory perception

Why DCT?

Spectral symmetry; twice the resolution

Why dB?

System decomposition
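
The answer to "Why dB?" can be written out in one line. The sketch below assumes the usual source-filter view of speech, in which the signal x[n] is an excitation e[n] convolved with a vocal-tract response h[n]; the slides do not state this model explicitly, and the symbols are chosen here for illustration.

```latex
x[n] = e[n] * h[n]
\;\Longrightarrow\;
X(\omega) = E(\omega)\,H(\omega)
\;\Longrightarrow\;
\log\lvert X(\omega)\rvert = \log\lvert E(\omega)\rvert + \log\lvert H(\omega)\rvert
```

Taking the logarithm turns the product into a sum, so the slowly varying envelope term and the rapidly varying excitation term can be separated by the transform that follows.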


Conclusions

What’s next?

"To do a good job, one must first sharpen one's tools." (工欲善其事,必先利其器。)
