
Page 1

Spectral Features for Automatic Text-Independent Speaker Recognition

Tomi Kinnunen

Research seminar, 27.2.2004

Department of Computer Science, University of Joensuu

Page 2

Based on a True Story …

T. Kinnunen: Spectral Features for Automatic Text-Independent Speaker Recognition, Ph.Lic. thesis, 144 pages, Department of Computer Science, University of Joensuu, 2004.

Downloadable in PDF from:

http://cs.joensuu.fi/pages/tkinnu/research/index.html

Page 3

Introduction

Page 4

Why Study Feature Extraction ?

• As the first component in the recognition chain, feature extraction largely determines the accuracy of the classification stage

Page 5

Why Study Feature Extraction ? (cont.)

• Typical feature extraction methods are directly “borrowed” from the speech recognition task

Quite contradictory, considering the “opposite” nature of the two tasks

• In general, it seems that currently we are at best guessing what might be individual in our speech!

• Because it is interesting & challenging!

Page 6

Principle of Feature Extraction

Page 7

Studied Features

1. FFT-implemented filterbanks (subband processing)

2. FFT-cepstrum

3. LPC-derived features

4. Dynamic spectral features (delta features)

Page 8

Speech Material & Evaluation Protocol

• Each test file is split into segments of T = 350 vectors (about 3.5 seconds of speech)

• Each segment is classified by vector quantization (see the sketch below)

• Speaker models (codebooks) are constructed from the training data by the RLS clustering algorithm

• Performance measure = classification error rate (%)
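To make the protocol concrete, the following is a minimal sketch (not code from the thesis) of how VQ-based segment classification is typically done: each speaker is represented by a codebook, and a segment is assigned to the speaker whose codebook yields the smallest average quantization distortion. Function and variable names are illustrative.

```python
import numpy as np

def avg_quantization_distortion(segment, codebook):
    """Mean squared Euclidean distance from each feature vector in the
    segment (T x d array) to its nearest code vector (K x d array)."""
    dists = ((segment[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).mean()

def classify_segment(segment, speaker_codebooks):
    """Assign the segment to the speaker whose codebook quantizes it
    with the smallest average distortion."""
    distortions = [avg_quantization_distortion(segment, cb)
                   for cb in speaker_codebooks]
    return int(np.argmin(distortions))
```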

Page 9

1. Subband Features

Page 10

Computation of Subband Features

Windowed speech frame

Magnitude spectrum by FFT

Smoothing by a filterbank

Nonlinear mapping of the filter outputs

Compressed filter outputs f = (f1, f2, …, fM)^T (a sketch of these steps follows below)

Parameters of the filterbank:
• Number of subbands
• Filter shapes & bandwidths
• Type of frequency warping
• Filter output nonlinearity
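A minimal sketch of the steps above, assuming numpy; the filterbank matrix and the choice of cubic-root compression are illustrative, not fixed by the thesis:

```python
import numpy as np

def subband_features(frame, filterbank, compress=np.cbrt):
    """Compressed filterbank outputs f = (f1, ..., fM)^T for one windowed
    frame. `filterbank` is an (M x n_bins) matrix of filter gains, where
    n_bins = len(frame) // 2 + 1; `compress` is the output nonlinearity."""
    spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum by FFT
    outputs = filterbank @ spectrum         # smoothing by the filterbank
    return compress(outputs)                # nonlinear mapping of the outputs
```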

Page 11

Frequency Warping… What’s That?!

[Figures: the Bark scale (frequency in Bark vs. frequency in kHz) and the gain responses of a 24-channel Bark-warped triangular filterbank, 0–4000 Hz.]

• The “real” frequency axis (Hz) is stretched and compressed locally according to a (bijective) warping function
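For reference, two common warping functions in closed form; these are standard approximations, and the thesis may use slightly different variants:

```python
import numpy as np

def hz_to_mel(f_hz):
    """A common mel-scale warping: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    """A common Bark-scale approximation (Zwicker & Terhardt style)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)
```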

Page 12

Discrimination of Individual Subbands (F-ratio)

[Figure: F-ratio as a function of frequency for the Helsinki and TIMIT corpora.]

Low-end (~0–200 Hz) and mid/high frequencies (~2–4 kHz) are important; the region ~200–2000 Hz is less important. (However, not consistently!)

(Fixed parameters: 30 linearly spaced triangular filters)
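For clarity, a small sketch of one common way to compute the F-ratio of a single subband output; exact normalization conventions vary, and this is not necessarily the formulation used in the thesis:

```python
import numpy as np

def f_ratio(values_per_speaker):
    """F-ratio of one scalar feature: variance of the speaker means
    (between-speaker) divided by the average within-speaker variance.
    `values_per_speaker` is a list of 1-D arrays, one array per speaker."""
    means = np.array([v.mean() for v in values_per_speaker])
    between = means.var()
    within = np.mean([v.var() for v in values_per_speaker])
    return between / within
```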

Page 13

Subband Features : The Effect of the Filter Output Nonlinearity

Helsinki TIMIT

Consistent ordering (!) : cubic < log < linear

Fixed parameters: 30 linearly spaced triangular filters

1. Linear: f(x) = x
2. Logarithmic: f(x) = log(1 + x)
3. Cubic: f(x) = x^(1/3)

Page 14

Subband Features : The Effect of the Filter Shape

Helsinki TIMIT

The differences are small, with no consistent ordering

The filter shape is probably not as crucial as the other parameters

Fixed parameters: 30 linearly spaced filters, log-compression

1. Rectangular
2. Triangular
3. Hanning

Page 15

Subband Features : The Number of Subbands (1)

Helsinki TIMIT

Observation: error rates decrease monotonically with increasing number of subbands (in most cases) …

Fixed parameters: linearly spaced / triangular-shaped filters, log-compression

Experiment 1: From 5 to 50

Page 16

Subband Features : The Number of Subbands (2)

Fixed parameters: linearly spaced / triangular-shaped filters, log-compression

Experiment 2: From 50 to 250

Helsinki: (Almost) monotonic decrease in errors with increasing number of subbands

TIMIT: Optimum number of bands is in the range 50..100

Differences between corpora are (partly) explained by the discrimination curves

Page 17

Discussion of the Subband Features

• (Typically used) log-compression should be replaced with cubic compression or some better nonlinearity

• Number of subbands should be relatively high (at least 50 based on these experiments)

• Shape of the filter does not seem to be important
• Discriminative information is not evenly distributed along the frequency axis
• The relative discriminatory powers of the subbands depend on the selected speaker population / language / speech content…

Page 18

2. FFT-Cepstral Features

Page 19

Computation of FFT-Cepstrum

Windowed speech frame

Magnitude spectrum by FFT

Smoothing by a filterbank

Nonlinear mapping of the filter outputs

(common steps with the subband features)

Decorrelation by DCT

Coefficient selection

Cepstrum vector c = (c1, …, cM)^T

Processing is very similar to the “raw” subband processing
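A minimal sketch of the whole chain, assuming numpy/scipy; the filterbank construction is omitted, and the small epsilon inside the log is just a numerical guard, not something from the thesis:

```python
import numpy as np
from scipy.fftpack import dct

def fft_cepstrum(frame, filterbank, n_coeffs=15):
    """FFT-cepstrum of one windowed frame: FFT magnitude spectrum,
    filterbank smoothing, log compression, DCT decorrelation, and
    coefficient selection (c[0] excluded)."""
    spectrum = np.abs(np.fft.rfft(frame))
    log_outputs = np.log(filterbank @ spectrum + 1e-10)   # log compression
    cepstrum = dct(log_outputs, type=2, norm='ortho')     # decorrelation by DCT
    return cepstrum[1:n_coeffs + 1]                        # drop c[0], keep n_coeffs
```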

Page 20

FFT-Cepstrum : Type of Frequency Warping

Helsinki TIMIT

Fixed parameters: 30 triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c[0]

Helsinki: Mel-frequency warped cepstrum gives the best results on average

TIMIT: Linearly warped cepstrum gives the best results on average

Same explanation as before: discrimination curves

1. Linear warping
2. Mel-warping
3. Bark-warping
4. ERB-warping

Page 21

FFT-Cepstrum : Number of Cepstral Coefficients

( Fixed parameters: mel-frequency warped triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c[0], codebook size = 64)

Helsinki TIMIT

Minimum number of coefficients is around 10, rather independent of the number of filters

Page 22

Discussion About the FFT-Cepstrum

• Same performance as with the subband features, but with a smaller number of features

For computational and modeling reasons, cepstrum is the preferred method of these two in automatic recognition

• The commonly used mel-warped filterbank is not the best choice in the general case!

There is no reason to assume that it would be, since the mel-cepstrum is based on modeling human hearing and was originally meant for speech recognition purposes

• I prefer/recommend linear frequency warping, since:

It is easier to control the amount of resolution on desired subbands (e.g. by linear weighting). In nonlinear warping, the relationship between the “real” and “warped” frequency axes is more complicated

Page 23

3. LPC-Derived Features

Page 24

What Is Linear Predictive Coding (LPC)?

• In the time domain, the current sample is approximated as a linear combination of the past p samples:

ŝ[n] = a[1]·s[n-1] + a[2]·s[n-2] + … + a[p]·s[n-p]

• The objective is to determine the LPC coefficients a[k], k = 1, …, p, such that the squared prediction error is minimized

• In the frequency domain, the LPC coefficients define an all-pole IIR filter whose poles correspond to local maxima of the magnitude spectrum

[Figure: an LPC pole illustrated on the magnitude spectrum.]

Page 25

Computation of LPC and LPC-Based Features

Windowed speech frame

Autocorrelation computation

Levinson-Durbin algorithm (solving the Yule-Walker AR equations)

LPC coefficients (LPC) and reflection coefficients (REFL)

Derived feature sets:
• Arcus sine coefficients (ARCSIN): asin(.) of the reflection coefficients
• Log area ratios (LAR): LAR conversion of the reflection coefficients
• Line spectral frequencies (LSF): complex polynomial expansion + root-finding algorithm
• Formants (FMT): LPC pole finding
• Linear predictive cepstral coefficients (LPCC): Atal's recursion
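As an illustration of the first stages of this chain, here is a sketch of the autocorrelation method with the Levinson-Durbin recursion, plus the standard LPC-to-cepstrum recursion. Sign conventions vary between texts, and this is not the thesis implementation:

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """LPC coefficients a[1..p] by the autocorrelation method, solving the
    Yule-Walker equations with the Levinson-Durbin recursion.
    Convention: s[n] is predicted as sum_k a[k] * s[n - k]."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # r[0], r[1], ...
    a = np.zeros(order + 1)   # a[0] unused; a[1..p] are the predictor coefficients
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] -= k * a[i - 1:0:-1]                        # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                               # prediction error energy
    return a[1:]

def lpc_to_cepstrum(a, n_coeffs):
    """Linear predictive cepstral coefficients (LPCC) from LPC coefficients
    via the standard recursion (Atal); a sketch only."""
    p = len(a)
    c = np.zeros(n_coeffs + 1)
    for n in range(1, n_coeffs + 1):
        c[n] = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                c[n] += (k / n) * c[k] * a[n - k - 1]
    return c[1:]
```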

Page 26

Linear Prediction (LPC) : Number of LPC coefficients

Helsinki TIMIT

• Minimum is around 15 coefficients (not consistent, however)

• Error rates are surprisingly small in general!

• The LPC coefficients were used directly in a Euclidean-distance-based classifier. The literature usually warns against this: “Do not ever use LPCs directly, at least not with the Euclidean metric.”

Page 27

Comparison of the LPC-Derived Features

• Overall performance is very good
• Raw LPC coefficients give the worst performance on average
• Differences between the feature sets are rather small

Other factors to be considered:
• Computational complexity
• Ease of implementation

Fixed parameters: LPC predictor order p = 15

Helsinki TIMIT

A programming bug???

Page 28

LPC-Derived Formants

Fixed parameters: Codebook size = 64

Helsinki TIMIT

• Formants give comparable, and surprisingly good results !

• Why “surprisingly good”?
1. The analysis procedure was very simple (it produces spurious formants)
2. Subband processing, LPC, cepstrum, etc. describe the spectrum continuously; formants, on the other hand, pick only a discrete number of maximum peaks' amplitudes from the spectrum (and a small number!)

Page 29

Discussion About the LPC-Derived Features

• In general, results are promising, even for the raw LPC coefficients

• The differences between the feature sets were small
– From the implementation and efficiency viewpoint, the most attractive are LPCC, LAR and ARCSIN

• Formants also give (surprisingly) good results, which indicates indirectly that:
– The regions of the spectrum with high amplitude might be important for speaker recognition

[Figure: magnitude spectrum of a speech frame (Frequency [Hz], 0–6000; Magnitude [dB]).]

An idea for future study :

How about selecting subbands around local maxima?

Page 30

4. Dynamic Features

Page 31

Dynamic Spectral Features

• Dynamic feature: an estimate of the time derivative of a feature
• Can be applied to any feature

Time trajectory of the original feature

Estimate of the 1st time derivative (Δ-feature)

Estimate of the 2nd time derivative (ΔΔ-feature)

• Two widely used estimation methods are the differentiator and the linear regression method, in their common forms:

Differentiator: Δx[t] = x[t+M] - x[t-M]
Regression: Δx[t] = Σ k·(x[t+k] - x[t-k]) / (2·Σ k²), summing over k = 1, …, M

(M = number of neighboring frames, typically M = 1..3)

• Typical phrase: “Don't use the differentiator, it emphasizes noise”
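In their common forms above, the two estimators can be sketched as follows for a (T x d) feature matrix; edge frames are simply replicated here, which is one of several possible boundary conventions:

```python
import numpy as np

def delta_differentiator(features, M=1):
    """Differentiator estimate: delta[t] = x[t+M] - x[t-M],
    with edges padded by repeating the end frames."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')
    return padded[2 * M:] - padded[:-2 * M]

def delta_regression(features, M=2):
    """Linear regression estimate over 2M+1 neighboring frames:
    delta[t] = sum_{k=1..M} k*(x[t+k] - x[t-k]) / (2*sum_k k^2)."""
    padded = np.pad(features, ((M, M), (0, 0)), mode='edge')
    denom = 2.0 * sum(k * k for k in range(1, M + 1))
    out = np.zeros_like(features, dtype=float)
    T = len(features)
    for k in range(1, M + 1):
        out += k * (padded[M + k: M + k + T] - padded[M - k: M - k + T])
    return out / denom
```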

Page 32

Delta Features: Comparison of the Two Estimation Methods

Best results per corpus and estimation method:

Differentiator: Helsinki best Δ-LSF (7.0 %) with M=1; TIMIT best Δ-ARCSIN (8.1 %) with M=4
Regression: Helsinki best Δ-LSF (10.6 %) with M=2; TIMIT best Δ-ARCSIN (8.8 %) with M=1

Page 33

Delta Features: Comparison with the Static Features

Discussion About the Delta Features :

• Optimum order is small (In most cases M=1,2 neighboring frames)

• The differentiator method is better in most cases (surprising result, again!)

• Delta features are worse than static features but might provide uncorrelated extra information (for multiparameter recognition)

• The commonly used delta-cepstrum gives quite poor results !

Page 34

Towards Concluding Remarks ...

Page 35

FFT-Cepstrum Revisited
Question: Is Log-Compression / Mel-Cepstrum Best?

Helsinki TIMIT

Answer: NO !

Please note: the segment length is now reduced to T = 100 vectors, which is why the absolute recognition rates are worse than before (ran out of time for the thesis…)

Page 36

FFT- vs. LPC-Cepstrum
Question: Is it really the case that “FFT-cepstrum is more accurate”?

Helsinki TIMIT

Answer: NO ! (TIMIT shows this quite clearly)

Page 37

The Essential Difference Between the FFT- and LPC-Cepstra ?

• The FFT-cepstrum approximates the spectrum by a linear combination of cosine functions (non-parametric model)

• LPC makes a least-squares fit of an all-pole filter to the spectrum (parametric model)

• The FFT-cepstrum first smoothes the original spectrum with a filterbank, whereas the LPC filter is fitted directly to the original spectrum

LPC captures more “details”

FFT-cepstrum represents “smooth” spectrum

However, one might argue that we could drop the filterbank from the FFT-cepstrum…

Page 38

General Summary and Discussion

• Number of subbands should be high (30-50 for these corpora)

• Number of cepstral coefficients (LPC/FFT-based) should be high (≥ 15)

• In particular, the numbers of subbands and coefficients and the LPC order are clearly higher than those generally used in speech recognition

• Formants give (surprisingly) good performance

• Number of formants should be high (≥ 8)

• In most cases, the differentiator method outperforms the regression method in delta-feature computation

All of these indicate indirectly the importance of spectral details and rapid spectral changes

Page 39

“Philosophical Discussion”

• The current knowledge of speaker individuality is far from perfect :

• Engineers concentrate on tuning complex feature compensation methods but don't (necessarily) understand what's individual in speech

• Phoneticians try to find the “individual code” in the speech signal, but they don’t (necessarily) know how to apply engineers’ methods

• Why do we believe that speech would be any less individual than e.g. fingerprints ?

• Compare the histories of the “fingerprint” and the “voiceprint”:

• Fingerprints have been studied systematically since the 17th century (1684)

• The spectrograph wasn't invented until 1946! How could we possibly claim that we know what speech is, with less than 60 years of research?

• Why do we believe that human beings are optimal speaker discriminators? Our ears can already be fooled (e.g. by MP3 encoding).

Page 40

That’s All, Folks !