
Page 1

Spectral Features for Automatic Text-Independent Speaker Recognition

Tomi Kinnunen

Research seminar, 27.2.2004

Department of Computer Science, University of Joensuu

Page 2

Based on a True Story …

T. Kinnunen: Spectral Features for Automatic Text-Independent Speaker Recognition, Ph.Lic. thesis, 144 pages, Department of Computer Science, University of Joensuu, 2004.

Downloadable in PDF from:

http://cs.joensuu.fi/pages/tkinnu/research/index.html

Page 3

Introduction

Page 4

Why Study Feature Extraction ?

• As the first component in the recognition chain, feature extraction largely determines the accuracy of the classification stage

Page 5

Why Study Feature Extraction ? (cont.)

• Typical feature extraction methods are directly “borrowed” from the speech recognition task

Quite contradictory, considering the “opposite” nature of the two tasks

• In general, it seems that currently we are at best guessing what might be individual in our speech!

• Because it is interesting & challenging!

Page 6

Principle of Feature Extraction

Page 7

Studied Features

1. FFT-implemented filterbanks (subband processing)

2. FFT-cepstrum

3. LPC-derived features

4. Dynamic spectral features (delta features)

Page 8

Speech Material & Evaluation Protocol

• Each test file is split into segments of T = 350 vectors (about 3.5 seconds of speech)

• Each segment is classified by vector quantization (see the sketch below)

• Speaker models (codebooks) are constructed from the training data by the RLS clustering algorithm

• Performance measure = classification error rate (%)
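To make the protocol concrete, the following is a minimal sketch (not code from the thesis) of how VQ-based segment classification is typically done: each speaker is represented by a codebook, and a segment is assigned to the speaker whose codebook yields the smallest average quantization distortion. Function and variable names are illustrative.

```python
import numpy as np

def avg_quantization_distortion(segment, codebook):
    """Mean squared Euclidean distance from each feature vector in the
    segment (T x d array) to its nearest code vector (K x d array)."""
    dists = ((segment[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).mean()

def classify_segment(segment, speaker_codebooks):
    """Assign the segment to the speaker whose codebook quantizes it
    with the smallest average distortion."""
    distortions = [avg_quantization_distortion(segment, cb)
                   for cb in speaker_codebooks]
    return int(np.argmin(distortions))
```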

Page 9

1. Subband Features

Page 10

Computation of Subband Features

Windowed speech frame

Magnitude spectrum by FFT

Smoothing by a filterbank

Nonlinear mapping of the filter outputs

Compressed filter outputs f = (f1, f2, …, fM)^T (a sketch of these steps follows below)

Parameters of the filterbank:
• Number of subbands
• Filter shapes & bandwidths
• Type of frequency warping
• Filter output nonlinearity
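A minimal sketch of the steps above, assuming numpy; the filterbank matrix and the choice of cubic-root compression are illustrative, not fixed by the thesis:

```python
import numpy as np

def subband_features(frame, filterbank, compress=np.cbrt):
    """Compressed filterbank outputs f = (f1, ..., fM)^T for one windowed
    frame. `filterbank` is an (M x n_bins) matrix of filter gains, where
    n_bins = len(frame) // 2 + 1; `compress` is the output nonlinearity."""
    spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum by FFT
    outputs = filterbank @ spectrum         # smoothing by the filterbank
    return compress(outputs)                # nonlinear mapping of the outputs
```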

Page 11

Frequency Warping… What’s That?!

[Figures: the Bark scale (frequency in Bark vs. frequency in kHz) and the gain responses of a 24-channel Bark-warped triangular filterbank, 0–4000 Hz.]

• The “real” frequency axis (Hz) is stretched and compressed locally according to a (bijective) warping function
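For reference, two common warping functions in closed form; these are standard approximations, and the thesis may use slightly different variants:

```python
import numpy as np

def hz_to_mel(f_hz):
    """A common mel-scale warping: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    """A common Bark-scale approximation (Zwicker & Terhardt style)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)
```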

Page 12

Discrimination of Individual Subbands (F-ratio)

[Figure: F-ratio as a function of frequency for the Helsinki and TIMIT corpora.]

Low-end (~0–200 Hz) and mid/high frequencies (~2–4 kHz) are important; the region ~200–2000 Hz is less important. (However, not consistently!)

(Fixed parameters: 30 linearly spaced triangular filters)
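For clarity, a small sketch of one common way to compute the F-ratio of a single subband output; exact normalization conventions vary, and this is not necessarily the formulation used in the thesis:

```python
import numpy as np

def f_ratio(values_per_speaker):
    """F-ratio of one scalar feature: variance of the speaker means
    (between-speaker) divided by the average within-speaker variance.
    `values_per_speaker` is a list of 1-D arrays, one array per speaker."""
    means = np.array([v.mean() for v in values_per_speaker])
    between = means.var()
    within = np.mean([v.var() for v in values_per_speaker])
    return between / within
```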

Page 13

Subband Features : The Effect of the Filter Output Nonlinearity

Helsinki TIMIT

Consistent ordering (!) : cubic < log < linear

Fixed parameters: 30 linearly spaced triangular filters

1. Linear: f(x) = x
2. Logarithmic: f(x) = log(1 + x)
3. Cubic: f(x) = x^(1/3)

Page 14

Subband Features : The Effect of the Filter Shape

Helsinki TIMIT

The differences are small, with no consistent ordering

The filter shape is probably not as crucial as the other parameters

Fixed parameters: 30 linearly spaced filters, log-compression

1. Rectangular
2. Triangular
3. Hanning

Page 15

Subband Features : The Number of Subbands (1)

Helsinki TIMIT

Observation: error rates decrease monotonically with increasing number of subbands (in most cases) …

Fixed parameters: linearly spaced / triangular-shaped filters, log-compression

Experiment 1: From 5 to 50

Page 16

Subband Features : The Number of Subbands (2)

Fixed parameters: linearly spaced / triangular-shaped filters, log-compression

Experiment 2: From 50 to 250

Helsinki: (Almost) monotonic decrease in errors with increasing number of subbands

TIMIT: Optimum number of bands is in the range 50..100

Differences between corpora are (partly) explained by the discrimination curves

Page 17

Discussion of the Subband Features

• (Typically used) log-compression should be replaced with cubic compression or some better nonlinearity

• Number of subbands should be relatively high (at least 50 based on these experiments)

• Shape of the filter does not seem to be important
• Discriminative information is not evenly distributed along the frequency axis
• The relative discriminatory powers of the subbands depend on the selected speaker population / language / speech content…

Page 18

2. FFT-Cepstral Features

Page 19

Computation of FFT-Cepstrum

Windowed speech frame

Magnitude spectrum by FFT

Smoothing by a filterbank

Nonlinear mapping of the filter outputs

(common steps with the subband features)

Decorrelation by DCT

Coefficient selection

Cepstrum vector c = (c1, …, cM)^T

Processing is very similar to the “raw” subband processing
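A minimal sketch of the whole chain, assuming numpy/scipy; the filterbank construction is omitted, and the small epsilon inside the log is just a numerical guard, not something from the thesis:

```python
import numpy as np
from scipy.fftpack import dct

def fft_cepstrum(frame, filterbank, n_coeffs=15):
    """FFT-cepstrum of one windowed frame: FFT magnitude spectrum,
    filterbank smoothing, log compression, DCT decorrelation, and
    coefficient selection (c[0] excluded)."""
    spectrum = np.abs(np.fft.rfft(frame))
    log_outputs = np.log(filterbank @ spectrum + 1e-10)   # log compression
    cepstrum = dct(log_outputs, type=2, norm='ortho')     # decorrelation by DCT
    return cepstrum[1:n_coeffs + 1]                        # drop c[0], keep n_coeffs
```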

Page 20

FFT-Cepstrum : Type of Frequency Warping

Helsinki TIMIT

Fixed parameters: 30 triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c[0]

Helsinki: Mel-frequency warped cepstrum gives the best results on average

TIMIT: Linearly warped cepstrum gives the best results on average

Same explanation as before: discrimination curves

1. Linear warping
2. Mel-warping
3. Bark-warping
4. ERB-warping

Page 21

FFT-Cepstrum : Number of Cepstral Coefficients

( Fixed parameters: mel-frequency warped triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c[0], codebook size = 64)

Helsinki TIMIT

Minimum number of coefficients is around 10, rather independent of the number of filters

Page 22

Discussion About the FFT-Cepstrum

• Same performance as with the subband features, but with a smaller number of features

For computational and modeling reasons, cepstrum is the preferred method of these two in automatic recognition

• The commonly used mel-warped filterbank is not the best choice in the general case!

There is no reason to assume that it would be, since the mel-cepstrum is based on modeling human hearing and was originally meant for speech recognition purposes

• I prefer/recommend linear frequency warping, since:

It is easier to control the amount of resolution on desired subbands (e.g. by linear weighting). In nonlinear warping, the relationship between the “real” and “warped” frequency axes is more complicated

Page 23

3. LPC-Derived Features

Page 24

What Is Linear Predictive Coding (LPC)?

• In the time domain, the current sample is approximated as a linear combination of the past p samples:

ŝ[n] = a[1]·s[n-1] + a[2]·s[n-2] + … + a[p]·s[n-p]

• The objective is to determine the LPC coefficients a[k], k = 1, …, p, such that the squared prediction error is minimized

• In the frequency domain, the LPC coefficients define an all-pole IIR filter whose poles correspond to local maxima of the magnitude spectrum

[Figure: an LPC pole illustrated on the magnitude spectrum.]

Page 25

Computation of LPC and LPC-Based Features

Windowed speech frame

Autocorrelation computation

Levinson-Durbin algorithm (solving the Yule-Walker AR equations)

LPC coefficients (LPC) and reflection coefficients (REFL)

Derived feature sets:
• Arcus sine coefficients (ARCSIN): asin(.) of the reflection coefficients
• Log area ratios (LAR): LAR conversion of the reflection coefficients
• Line spectral frequencies (LSF): complex polynomial expansion + root-finding algorithm
• Formants (FMT): LPC pole finding
• Linear predictive cepstral coefficients (LPCC): Atal's recursion
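As an illustration of the first stages of this chain, here is a sketch of the autocorrelation method with the Levinson-Durbin recursion, plus the standard LPC-to-cepstrum recursion. Sign conventions vary between texts, and this is not the thesis implementation:

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """LPC coefficients a[1..p] by the autocorrelation method, solving the
    Yule-Walker equations with the Levinson-Durbin recursion.
    Convention: s[n] is predicted as sum_k a[k] * s[n - k]."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # r[0], r[1], ...
    a = np.zeros(order + 1)   # a[0] unused; a[1..p] are the predictor coefficients
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] -= k * a[i - 1:0:-1]                        # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                               # prediction error energy
    return a[1:]

def lpc_to_cepstrum(a, n_coeffs):
    """Linear predictive cepstral coefficients (LPCC) from LPC coefficients
    via the standard recursion (Atal); a sketch only."""
    p = len(a)
    c = np.zeros(n_coeffs + 1)
    for n in range(1, n_coeffs + 1):
        c[n] = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                c[n] += (k / n) * c[k] * a[n - k - 1]
    return c[1:]
```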

Page 26

Linear Prediction (LPC) : Number of LPC coefficients

Helsinki TIMIT

• Minimum is around 15 coefficients (not consistent, however)

• Error rates are surprisingly small in general!

• The LPC coefficients were used directly in a Euclidean-distance-based classifier. The literature usually warns against this: “Do not ever use LPCs directly, at least not with the Euclidean metric.”

Page 27

Comparison of the LPC-Derived Features

• Overall performance is very good
• Raw LPC coefficients give the worst performance on average
• Differences between the feature sets are rather small

Other factors to be considered:
• Computational complexity
• Ease of implementation

Fixed parameters: LPC predictor order p = 15

Helsinki TIMIT

A programming bug???

Page 28

LPC-Derived Formants

Fixed parameters: Codebook size = 64

Helsinki TIMIT

• Formants give comparable, and surprisingly good results !

• Why “surprisingly good”?
1. The analysis procedure was very simple (it produces spurious formants)
2. Subband processing, LPC, cepstrum, etc. describe the spectrum continuously; formants, on the other hand, pick only a discrete number of maximum peaks' amplitudes from the spectrum (and a small number!)

Page 29

Discussion About the LPC-Derived Features

• In general, results are promising, even for the raw LPC coefficients

• The differences between the feature sets were small
– From the implementation and efficiency viewpoint, the most attractive are LPCC, LAR and ARCSIN

• Formants also give (surprisingly) good results, which indicates indirectly that:
– The regions of the spectrum with high amplitude might be important for speaker recognition

[Figure: magnitude spectrum of a speech frame (Frequency [Hz], 0–6000; Magnitude [dB]).]

An idea for future study :

How about selecting subbands around local maxima?

Page 30

4. Dynamic Features

Page 31

Dynamic Spectral Features

• Dynamic feature: an estimate of the time derivative of a feature
• Can be applied to any feature

Time trajectory of the original feature

Estimate of the 1st time derivative (Δ-feature)

Estimate of the 2nd time derivative (ΔΔ-feature)

• Two widely used estimation methods are the differentiator and the linear regression method, in their common forms:

Differentiator: Δx[t] = x[t+M] - x[t-M]
Regression: Δx[t] = Σ k·(x[t+k] - x[t-k]) / (2·Σ k²), summing over k = 1, …, M

(M = number of neighboring frames, typically M = 1..3)

• Typical phrase: “Don't use the differentiator, it emphasizes noise”
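In their common forms above, the two estimators can be sketched as follows for a (T x d) feature matrix; edge frames are simply replicated here, which is one of several possible boundary conventions:

```python
import numpy as np

def delta_differentiator(features, M=1):
    """Differentiator estimate: delta[t] = x[t+M] - x[t-M],
    with edges padded by repeating the end frames."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')
    return padded[2 * M:] - padded[:-2 * M]

def delta_regression(features, M=2):
    """Linear regression estimate over 2M+1 neighboring frames:
    delta[t] = sum_{k=1..M} k*(x[t+k] - x[t-k]) / (2*sum_k k^2)."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')
    denom = 2.0 * sum(k * k for k in range(1, M + 1))
    out = np.zeros_like(features, dtype=float)
    T = len(features)
    for k in range(1, M + 1):
        out += k * (padded[M + k: M + k + T] - padded[M - k: M - k + T])
    return out / denom
```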

Page 32

Delta Features: Comparison of the Two Estimation Methods

Best results per corpus and estimation method:

Differentiator: Helsinki best Δ-LSF (7.0 %) with M=1; TIMIT best Δ-ARCSIN (8.1 %) with M=4
Regression: Helsinki best Δ-LSF (10.6 %) with M=2; TIMIT best Δ-ARCSIN (8.8 %) with M=1

Page 33

Delta Features: Comparison with the Static Features

Discussion About the Delta Features :

• Optimum order is small (In most cases M=1,2 neighboring frames)

• The differentiator method is better in most cases (surprising result, again!)

• Delta features are worse than static features but might provide uncorrelated extra information (for multiparameter recognition)

• The commonly used delta-cepstrum gives quite poor results !

Page 34

Towards Concluding Remarks ...

Page 35

FFT-Cepstrum Revisited
Question: Is Log-Compression / Mel-Cepstrum Best?

Helsinki TIMIT

Answer: NO !

Please note: the segment length is now reduced to T = 100 vectors, which is why the absolute recognition rates are worse than before (ran out of time for the thesis…)

Page 36

FFT- vs. LPC-Cepstrum
Question: Is it really the case that “FFT-cepstrum is more accurate”?

Helsinki TIMIT

Answer: NO ! (TIMIT shows this quite clearly)

Page 37

The Essential Difference Between the FFT- and LPC-Cepstra ?

• The FFT-cepstrum approximates the spectrum by a linear combination of cosine functions (non-parametric model)

• LPC makes a least-squares fit of an all-pole filter to the spectrum (parametric model)

• The FFT-cepstrum first smoothes the original spectrum with a filterbank, whereas the LPC filter is fitted directly to the original spectrum

LPC captures more “details”

FFT-cepstrum represents “smooth” spectrum

However, one might argue that we could drop the filterbank from the FFT-cepstrum…

Page 38

General Summary and Discussion

• Number of subbands should be high (30-50 for these corpora)

• Number of cepstral coefficients (LPC/FFT-based) should be high (≥ 15)

• In particular, the numbers of subbands and coefficients and the LPC order are clearly higher than those generally used in speech recognition

• Formants give (surprisingly) good performance

• Number of formants should be high (≥ 8)

• In most cases, the differentiator method outperforms the regression method in delta-feature computation

All of these indicate indirectly the importance of spectral details and rapid spectral changes

Page 39

“Philosophical Discussion”

• The current knowledge of speaker individuality is far from perfect :

• Engineers concentrate on tuning complex feature compensation methods but don't (necessarily) understand what's individual in speech

• Phoneticians try to find the “individual code” in the speech signal, but they don’t (necessarily) know how to apply engineers’ methods

• Why do we believe that speech would be any less individual than e.g. fingerprints ?

• Compare the histories of the “fingerprint” and the “voiceprint”:

• Fingerprints have been studied systematically since the 17th century (1684)

• The spectrograph wasn't invented until 1946! How could we possibly claim that we know what speech is, with less than 60 years of research?

• Why do we believe that human beings are optimal speaker discriminators? Our ears can already be fooled (e.g. by MP3 encoding).

Page 40

That’s All, Folks !