research & development
Support Vector Machines Based Text-Dependent Speaker Verification Using HMM Supervectors
Chengyu Dong
France Telecom R&D Beijing
2008-01-21
Outline
Introduction
HMM supervectors
Normalized scores using SI HMM supervectors
Experimental results
Conclusions
Introduction
Subword-based HMMs are the state-of-the-art technology for text-dependent speaker verification (TDSV) systems.
Support vector machines (SVMs) using the GMM supervector linear (GSL) kernel have proven to be an effective method for text-independent tasks.
Both of these popular techniques inspire the ideas and methods in this research on TDSV tasks.
HMM Baseline Systems
A test utterance with observations O = O_1, O_2, \ldots, O_T is first segmented by forced alignment into N segments O^1, O^2, \ldots, O^N, where frames t_{i-1}+1 to t_i belong to the i-th phone.
Phone LLR scores:
s_i = \frac{1}{t_i - t_{i-1}} \left[ \log P(O^i \mid S_i^s) - \log P(O^i \mid S_i^b) \right]
where S_i^s and S_i^b denote the speaker-dependent and background models of the i-th phone.
Final verification score:
s(O; S) = \frac{1}{N} \sum_{i=1}^{N} s_i
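As a sketch, the baseline scoring above can be written out directly; the frame log-likelihoods and alignment boundaries below are hypothetical inputs assumed to come from the HMMs and the forced alignment:

```python
import numpy as np

def phone_llr_scores(frame_ll_spk, frame_ll_bkg, boundaries):
    """Duration-normalized log-likelihood ratio per phone segment.

    frame_ll_spk / frame_ll_bkg: per-frame log-likelihoods under the
    speaker-dependent and background HMMs (length-T arrays).
    boundaries: [t_0 = 0, t_1, ..., t_N = T] from forced alignment.
    """
    scores = []
    for i in range(len(boundaries) - 1):
        t0, t1 = boundaries[i], boundaries[i + 1]
        llr = frame_ll_spk[t0:t1].sum() - frame_ll_bkg[t0:t1].sum()
        scores.append(llr / (t1 - t0))  # normalize by segment duration
    return scores

def verification_score(scores):
    """Final score: average of the per-phone LLRs."""
    return sum(scores) / len(scores)
```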
Support Vector Machines
SVM is a two-class classifier and another widely used, powerful modeling method.
In the standard formulation, an SVM, f(v), is given by
f(v) = \sum_{i=1}^{M} \alpha_i t_i K(v, v_i) + d
where t_i \in \{+1, -1\} are the ideal outputs, \sum_{i=1}^{M} \alpha_i t_i = 0, \alpha_i > 0, and the v_i are the support vectors.
Each speaker is modeled by a set of support vectors that form a two-class separating hyperplane, as the figure below shows.
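A minimal sketch of this decision function; the kernel choice and the support-vector values are made-up toy data:

```python
import numpy as np

def svm_score(v, support_vectors, alphas, targets, d, kernel):
    """Standard SVM output: f(v) = sum_i alpha_i * t_i * K(v, v_i) + d."""
    return sum(a * t * kernel(v, sv)
               for a, t, sv in zip(alphas, targets, support_vectors)) + d

def linear(x, y):
    return float(np.dot(x, y))

# toy example: two support vectors, one per class
score = svm_score(np.array([1.0, 0.0]),
                  [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                  alphas=[0.5, 0.5], targets=[1, -1], d=0.0,
                  kernel=linear)
```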
HMM Supervectors
The Block Diagram of HMM Supervectors Extraction
HMM Supervectors
The Kullback-Leibler divergence (KLD) of two HMM models \lambda^a and \lambda^b is defined as:
D(\lambda^a \| \lambda^b) = \int_{R^n} p(x; \lambda^a) \log \frac{p(x; \lambda^a)}{p(x; \lambda^b)} \, dx
It has no closed form for HMMs, but it can be bounded by decomposing the models over their J states:
D(\lambda^a \| \lambda^b) \le D(a^a \| a^b) + \sum_{j=1}^{J} D(b_j^a \| b_j^b)
where a denotes the transition probabilities and b_j the output distribution of state j.
This finally yields a good upper-bound estimation of the divergence between two HMMs.
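Since the state output distributions are Gaussian, each per-state term in the bound has a closed form. A rough sketch, assuming shared diagonal covariances (a common simplification; the transition-probability term is omitted here):

```python
import numpy as np

def kld_gauss_shared_cov(mu_a, mu_b, var):
    """Closed-form KL divergence between two Gaussians that share a
    diagonal covariance: 0.5 * (mu_a - mu_b)^T Sigma^{-1} (mu_a - mu_b)."""
    d = mu_a - mu_b
    return 0.5 * float(np.sum(d * d / var))

def hmm_kld_upper_bound(states_a, states_b, var):
    """Sum the per-state Gaussian divergences (the state-emission part of
    the upper bound; transitions are ignored in this sketch)."""
    return sum(kld_gauss_shared_cov(ma, mb, var)
               for ma, mb in zip(states_a, states_b))
```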
HMM Supervectors
Linear kernel:
K_{LIN}(X, Y) = \frac{1}{L} \sum_{j=1}^{L} x_j^T y_j, \quad X = (x_j)_{j=1}^{L}, \; Y = (y_j)_{j=1}^{L}
Dynamic Time Alignment kernel:
K_{DTA}(X, Y) = \frac{1}{M_{XY}} \max_{\phi, \theta} \sum_{k} x_{\phi(k)}^T y_{\theta(k)}, \quad X = (x_j)_{j=1}^{I}, \; Y = (y_j)_{j=1}^{J}
subject to the normalization factor M_{XY}, which depends on the alignment lengths I and J.
Optimization function (dynamic programming):
D(i, j) = \max \left\{ D(i-1, j) + x_i^T y_j, \;\; D(i-1, j-1) + 2 \, x_i^T y_j, \;\; D(i, j-1) + x_i^T y_j \right\}
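The DP recursion above can be sketched directly in code; the frame sequences are toy inputs, and the normalizer M is taken as I + J (one common choice):

```python
import numpy as np

def dta_kernel(X, Y):
    """Dynamic Time Alignment kernel via dynamic programming:
    D(i,j) = max(D(i-1,j) + k, D(i-1,j-1) + 2k, D(i,j-1) + k),
    with k = x_i . y_j, normalized here by M = I + J."""
    I, J = len(X), len(Y)
    D = np.full((I + 1, J + 1), -np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            k = float(np.dot(X[i - 1], Y[j - 1]))
            D[i, j] = max(D[i - 1, j] + k,
                          D[i - 1, j - 1] + 2 * k,
                          D[i, j - 1] + k)
    return D[I, J] / (I + J)
```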
HMM Supervectors
Linear kernel function:
K_{LIN}(\lambda^a, \lambda^b) = \sum_{g=1}^{J \cdot M} \left( \sqrt{c_g} \, \Sigma_g^{-1/2} m_g^a \right)^T \left( \sqrt{c_g} \, \Sigma_g^{-1/2} m_g^b \right)
Nonlinear kernel function:
K_{DTA}(\lambda^a, \lambda^b) = \frac{1}{M_{IJ}} \max_{\phi, \theta} \sum_{k} \left( \sqrt{c_{\phi(k)}^a} \, \Sigma_{\phi(k)}^{-1/2} m_{\phi(k)}^a \right)^T \left( \sqrt{c_{\theta(k)}^b} \, \Sigma_{\theta(k)}^{-1/2} m_{\theta(k)}^b \right)
where c and m denote the mixture weights and mean vectors of the HMM output distributions.
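Under this construction, the linear supervector kernel reduces to a plain dot product of stacked, scaled mean vectors. A sketch, where the weights, means, and shared diagonal covariance are toy values:

```python
import numpy as np

def supervector(weights, means, var):
    """Stack sqrt(c_g) * Sigma^{-1/2} * m_g for every mixture g into one
    long vector, so the linear kernel becomes a plain dot product."""
    parts = [np.sqrt(c) * m / np.sqrt(var) for c, m in zip(weights, means)]
    return np.concatenate(parts)

def k_lin(sv_a, sv_b):
    """Linear supervector kernel: dot product of two supervectors."""
    return float(np.dot(sv_a, sv_b))
```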
Normalized scores using SI HMM supervectors
The SVM discriminant function can be summarized as:
S(\lambda^s) = f(d^s) = \sum_{j=1}^{T} \alpha_j y_j \, d_j^T d^s = W^T d^s
Normalization form:
\tilde{S}(\lambda^s) = f(\lambda^s) - f(\lambda^b)
The HMM supervector d^b derives from the background SI HMMs.
The concept of normalizing the SVM score comes from zero normalization (Z-Norm).
Normalized scores using SI HMM supervectors
The reason we use the normalized score is the lack of training data: with limited enrollment speech, some dimensions are not adapted, so SI HMM mean vectors remain in the supervector.
Suppose d^s = [\mu^s, \mu^b], where \mu^s denotes the dimensions that are adapted and \mu^b is the remaining part of the SI HMM means.
Normalized scores using SI HMM supervectors
Un-normalized SVM scores:
S = W^T d^s = W_1^T \mu^s + W_2^T \mu^b
Normalized SVM scores:
\tilde{S} = W^T d^s - W^T d^b = W_1^T \mu^s + W_2^T \mu^b - W_1^T \mu_1^b - W_2^T \mu^b = W_1^T (\mu^s - \mu_1^b)
The term W_2^T \mu^b carries no discrimination; it only shifts the input dimension space.
\tilde{S} removes this nuisance part of the supervectors.
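A quick numeric check of this cancellation; all vectors below are random toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
dim_adapted, dim_rest = 3, 2

W1 = rng.normal(size=dim_adapted)     # weights on adapted dimensions
W2 = rng.normal(size=dim_rest)        # weights on non-adapted dimensions
mu_s  = rng.normal(size=dim_adapted)  # adapted means (speaker)
mu_b1 = rng.normal(size=dim_adapted)  # SI means of those same dimensions
mu_b2 = rng.normal(size=dim_rest)     # SI means that were never adapted

W   = np.concatenate([W1, W2])
d_s = np.concatenate([mu_s, mu_b2])   # speaker supervector keeps the SI part
d_b = np.concatenate([mu_b1, mu_b2])  # background (SI) supervector

s_norm = W @ d_s - W @ d_b            # normalized score
direct = W1 @ (mu_s - mu_b1)          # W_1^T (mu^s - mu_1^b)
assert np.isclose(s_norm, direct)     # the W_2 term cancels exactly
```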
Experimental Results
134 speakers were involved in the evaluations, giving 5292 target trials and 7840 imposter trials in total. Each participant was required to utter one password twice. The imposters were assumed to know the exact password of the target speaker.
SD HMMs are constructed by MAP adaptation with the relevance factor set to 1. Context-independent phone units are used as a universal phone set.
The acoustic features are the first 12 PLP coefficients together with the log-energy of each frame, calculated every 10 ms using a 25 ms Hamming window. The features are processed through a RASTA channel-equalization filter. Including the first and second derivatives over a ±2 frame span yields the final 39-dimensional feature vectors.
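The derivative features can be sketched with the standard regression formula over a ±2 frame span; edge frames reuse the boundary frame here, which mirrors the usual HTK-style computation but is not necessarily the exact recipe used in these experiments:

```python
import numpy as np

def deltas(feats, span=2):
    """Regression-based delta features over a +/- span frame window:
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2)."""
    T = len(feats)
    denom = 2 * sum(n * n for n in range(1, span + 1))
    padded = np.pad(feats, [(span, span), (0, 0)], mode='edge')
    out = np.zeros_like(feats)
    for t in range(T):
        out[t] = sum(n * (padded[t + span + n] - padded[t + span - n])
                     for n in range(1, span + 1)) / denom
    return out
```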
Experimental Results
(Bar chart: EER, 0–6%, and DCF, 0–0.06, for the HMM, GMM, and SVM systems and their fusions HMM+GMM, HMM+SVM, GMM+SVM, and HMM+GMM+SVM)
System fusions on HMMs, GMMs and SVMs:
S_F(O; \lambda) = f \cdot S_1(O; \lambda) + (1 - f) \cdot S_2(O; \lambda)
where f is a weighting factor determined by a discriminant analysis procedure, like LDA, which follows Fisher's discrimination criterion.
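One way to realize such Fisher-criterion weighting is a two-class LDA on the per-trial score pairs; the sketch below is a hypothetical implementation, since the exact procedure is not detailed here:

```python
import numpy as np

def fisher_fusion_weight(target_scores, imposter_scores):
    """LDA direction w = Sw^{-1} (m_target - m_imposter) on (S_1, S_2)
    score pairs, rescaled so the two fusion weights sum to one."""
    mt = target_scores.mean(axis=0)
    mi = imposter_scores.mean(axis=0)
    Sw = np.cov(target_scores.T) + np.cov(imposter_scores.T)
    w = np.linalg.solve(Sw, mt - mi)
    return w[0] / w.sum()

def fuse(s1, s2, f):
    """Linear score fusion: S_F = f * S_1 + (1 - f) * S_2."""
    return f * s1 + (1 - f) * s2
```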
Experimental Results
2-D distribution of the scores for target and imposter trials (HMM and SVM scores)
3-D distribution of the scores for target and imposter trials (HMM, GMM and SVM scores)
Conclusions
SVMs using HMM supervectors provide another source of evidence for TDSV systems.
The DTA kernel performs a little better than the linear kernel, but at a much higher computational cost.
The normalized output score remarkably improves the performance of the SVM system.
Fusion of HMM and SVM yields excellent results: the EER is reduced from 4.01% to 3.47%.
When the GMM system is also incorporated into the fusion, the EER is further reduced to 2.95%.
Thanks