speaker detection without models dan gillick july 27, 2004

Speaker Detection Without Models

Dan Gillick

July 27, 2004


Motivation

Want to develop a speaker ID algorithm that:

• captures sequential information

• takes advantage of extended data

• combines well with existing baseline systems


The Algorithm

• Rather than build models (GMM, HMM, etc.) to describe the information in the training data, we directly compare test data frames to training data frames.

• We compare sequences of frames because we believe there is information in sequences that systems like the GMM do not capture.

• The comparisons are guided by token-level alignments extracted from a speech recognizer.


Front-End

Using 40 MFCC features per 10ms frame

– 19 Cepstrals and Energy (C0)

– Their deltas
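The feature count works out as 19 cepstra + 1 energy term (C0) = 20 static features, plus 20 deltas = 40 per frame. The slide does not say how the deltas are computed, so the sketch below assumes a simple central difference over neighbouring frames. The function name `add_deltas` and the toy values are illustrative only.

```python
def add_deltas(static_frames):
    """Append a delta (central difference over time) to every frame.

    static_frames: list of frames, each a list of static features.
    The central-difference formula is an assumption, not the talk's method.
    """
    n = len(static_frames)
    out = []
    for t in range(n):
        prev = static_frames[max(t - 1, 0)]      # clamp at the first frame
        nxt = static_frames[min(t + 1, n - 1)]   # clamp at the last frame
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(static_frames[t]) + delta)
    return out


# Three made-up frames of 20 static features (19 cepstra + C0).
static = [[float(t)] * 20 for t in range(3)]
frames = add_deltas(static)
print(len(frames[0]))  # 40 features per frame
```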


The Algorithm: Overview

Cut the test and target data into tokens

– use word or phone-level time-alignments from the SRI recognizer

– note that these alignments have lots of errors (both word errors and alignment errors)
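A minimal sketch of the token-cutting step, assuming the alignments arrive as `(label, start_sec, end_sec)` triples. That format is hypothetical: the actual SRI recognizer output is not shown in the slides. The n-gram helper simply concatenates the frames of consecutive tokens, which is also an assumption.

```python
FRAME_SHIFT_SEC = 0.010  # 10 ms frames, as on the front-end slide


def cut_into_tokens(frames, alignments):
    """Slice a frame sequence into (label, frames) tokens.

    alignments: list of (label, start_sec, end_sec); hypothetical format.
    """
    tokens = []
    for label, start, end in alignments:
        first = int(round(start / FRAME_SHIFT_SEC))
        last = int(round(end / FRAME_SHIFT_SEC))
        segment = frames[first:last]
        if segment:  # skip tokens that fall outside the available frames
            tokens.append((label, segment))
    return tokens


def make_ngrams(tokens, n):
    """Join n consecutive tokens into one longer token (e.g. word bigrams)."""
    ngrams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        label = " ".join(lbl for lbl, _ in window)
        seq = [frame for _, seg in window for frame in seg]
        ngrams.append((label, seq))
    return ngrams


# Toy data: 100 one-dimensional frames and three aligned words.
frames = [[float(i)] for i in range(100)]
alignments = [("hello", 0.00, 0.30), ("my", 0.30, 0.45), ("name", 0.45, 0.80)]
tokens = cut_into_tokens(frames, alignments)
bigrams = make_ngrams(tokens, 2)
print([(label, len(seq)) for label, seq in bigrams])
```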


The Algorithm: Overview

Compare test and target data

1. Take the first test token

2. Find every instance of this token in the target data

3. Measure the distance between the test token and each target instance

4. Move on to the next test token
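The four steps above can be sketched as a lookup followed by a loop. This is an illustration rather than the system's code: tokens are assumed to be `(label, frames)` pairs, and the placeholder distance simply pairs frames in order and stops at the shorter sequence.

```python
import math


def frame_distance_sum(a, b):
    """Sum of Euclidean distances over paired frames (stops at the shorter)."""
    return sum(math.dist(x, y) for x, y in zip(a, b))


def compare(test_tokens, target_tokens, distance=frame_distance_sum):
    """For each test token, measure the distance to every target instance
    that carries the same label. Returns a list of (label, distances)."""
    # Index the target data by token label so step 2 is a dictionary lookup.
    index = {}
    for label, seq in target_tokens:
        index.setdefault(label, []).append(seq)

    results = []
    for label, seq in test_tokens:                # steps 1 and 4
        instances = index.get(label, [])          # step 2
        distances = [distance(seq, inst) for inst in instances]  # step 3
        results.append((label, distances))
    return results


test_tokens = [("hello", [[0.0, 0.0], [1.0, 1.0]]), ("bye", [[0.0, 0.0]])]
target_tokens = [
    ("hello", [[3.0, 4.0], [1.0, 1.0]]),
    ("hello", [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]),
]
results = compare(test_tokens, target_tokens)
print(results)
```

A test token with no matching target instance (here "bye") gets an empty distance list. How such tokens should be treated is not covered in the slides.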


The Algorithm

[Figure labels: Test data | Training data]



“Take the first test token”: grab the sequence of frames corresponding to this token according to the recognizer output

Hello



“Find every instance of this token in the target data”

Test: Hello

Target: Hello (1), Hello (2), Hello (3)



“Measure the distance between the test token and each target instance”: distance = sum of the (Euclidean) distances between frames of the test and target instances

Euclidean distance function

Hello (test) vs. Hello (1): Distance = 25

Hello (test) vs. Hello (2): Distance = 40

Hello (test) vs. Hello (3): Distance = 18


The Algorithm: Distance Function

But these instances have different lengths. How do we line up the frames? Here are some possibilities:

1. Line up the first frames and cut off the longer at the shorter

2. Use a sliding window approach: slide the shorter through the longer, taking the best (smallest) total distance.

3. Use dynamic time warping (DTW)
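Options 1 and 2 can be sketched in a few lines. The function names are illustrative, and the frames here are one-dimensional toy values rather than 40-dimensional MFCC vectors.

```python
import math


def truncated_distance(a, b):
    """Option 1: line up the first frames and cut the longer at the shorter."""
    return sum(math.dist(x, y) for x, y in zip(a, b))


def sliding_window_distance(a, b):
    """Option 2: slide the shorter sequence through the longer one and keep
    the smallest total frame distance over all offsets."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    best = math.inf
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        total = sum(math.dist(x, y) for x, y in zip(short, window))
        best = min(best, total)
    return best


a = [[1.0], [2.0]]
b = [[0.0], [1.0], [2.0], [5.0]]
d_trunc = truncated_distance(a, b)          # pairs (1,0) and (2,1)
d_slide = sliding_window_distance(a, b)     # best offset pairs (1,1) and (2,2)
print(d_trunc, d_slide)
```

In this toy case the sliding window finds an exact match that simple truncation misses, because the shorter sequence starts one frame into the longer one.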

Example: Hello (test) vs. Hello (3), Euclidean distance function: Distance = 18
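For option 3, a textbook dynamic time warping recursion looks like the sketch below. It is not the DTW code used in the system; in particular, the step pattern (match, insertion, deletion) and the lack of path-length normalization are assumptions.

```python
import math


def dtw_distance(a, b):
    """Classic DTW: minimum summed Euclidean frame distance over all
    monotonic alignments of sequence a to sequence b."""
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i frames of a
    # with the first j frames of b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b frame is reused
                cost[i][j - 1],      # b advances, a frame is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


same_word_slower = dtw_distance([[0.0], [1.0], [2.0]],
                                [[0.0], [1.0], [1.0], [2.0]])
skipped_frame = dtw_distance([[0.0], [2.0]], [[0.0], [1.0], [2.0]])
print(same_word_slower, skipped_frame)
```

Unlike the sliding window, DTW lets a frame in one sequence match several frames in the other, so a token spoken more slowly can still score a distance of zero.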


The Algorithm: Take the 1-Best

Now what do we do with these scores? There are a number of options, but we only keep the 1-best score. One motivation for this decision is that we are mainly interested in positive information.

Hello (test) vs. Hello (1): Distance = 25

Hello (test) vs. Hello (2): Distance = 40

Hello (test) vs. Hello (3): Distance = 18

Token Score = 18
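In code, keeping the 1-best score is just a minimum over the per-instance distances. Returning `None` for a token with no target instance is an assumption made for this sketch.

```python
def token_score(distances):
    """1-best: keep only the smallest distance for this test token.
    Returns None when the token never occurs in the target data."""
    return min(distances) if distances else None


hello_score = token_score([25, 40, 18])
print(hello_score)  # 18, matching the example above
```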


The Algorithm: Scoring

So we accumulate scores for each token. What do we do with these? Some options:

1. Average them, normalizing either by the number of tokens or by the total number of frames (Basic score)

2. Focus on some subset of the scores

a. Positive evidence (Hit score): ∑ [ (#frames) / (k^score) ]

b. Negative evidence: ∑ [ (#frames*target count) / (k^(M-score)) ]

Hello: Token Score = 18

my: Token Score = 16.5

name: Token Score = 21

Etc…
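The three scoring options translate directly into code. The slides do not give values for the constants k and M, so the values below are placeholders. The token scores 18, 16.5 and 21 come from the example, while the frame counts and target counts are made up.

```python
def basic_score(tokens, per_frame=False):
    """Option 1: average token score, normalized by the number of tokens
    or (per_frame=True) by the total number of frames."""
    total = sum(t["score"] for t in tokens)
    denom = sum(t["frames"] for t in tokens) if per_frame else len(tokens)
    return total / denom


def hit_score(tokens, k):
    """Option 2a, positive evidence: sum of #frames / k^score."""
    return sum(t["frames"] / k ** t["score"] for t in tokens)


def negative_score(tokens, k, M):
    """Option 2b, negative evidence:
    sum of (#frames * target count) / k^(M - score)."""
    return sum(
        t["frames"] * t["target_count"] / k ** (M - t["score"]) for t in tokens
    )


tokens = [
    {"label": "Hello", "score": 18.0, "frames": 30, "target_count": 3},
    {"label": "my", "score": 16.5, "frames": 15, "target_count": 5},
    {"label": "name", "score": 21.0, "frames": 35, "target_count": 2},
]
print(basic_score(tokens))                  # 18.5
print(basic_score(tokens, per_frame=True))
print(hit_score(tokens, k=1.5))             # k is a placeholder value
print(negative_score(tokens, k=1.5, M=40))  # M is a placeholder value
```

For k > 1 each term of the hit score shrinks rapidly as the distance grows, so the sum is dominated by long tokens with very close matches. The negative-evidence term behaves the opposite way and grows with the distance.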


Normalization

• Most systems use a UBM (universal background model) to center the test pieces

– Since this system has no model, we create a background by lumping together speech from a number of different held-out speakers and running the algorithm with this group as training data

• ZNorm to center the “models”

– Find the mean score for each “model” or training set by running a number of held-out imposters against each one.
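A minimal sketch of the ZNorm step, assuming the impostor scores for a given training set are already available. The slide mentions only the mean; dividing by the standard deviation is the conventional Z-norm and is included here as an assumption, switchable with `use_std`. How the background score enters the final score is not stated in the slides, so it is not sketched.

```python
import statistics


def znorm(raw_score, impostor_scores, use_std=True):
    """Center a raw score using held-out impostor scores for the same
    training set. With use_std=False only the mean is removed."""
    mean = statistics.mean(impostor_scores)
    if not use_std:
        return raw_score - mean
    std = statistics.pstdev(impostor_scores)
    # Fall back to mean-centering if all impostor scores are identical.
    return (raw_score - mean) / std if std > 0 else raw_score - mean


impostors = [10.0, 12.0, 14.0]   # made-up scores of held-out impostors
centered = znorm(8.0, impostors, use_std=False)
scaled = znorm(8.0, impostors)
print(centered, scaled)
```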


Results

Results reported on split 1 (of 6) of Switchboard I

(1624 test vs. target scores)


Results

TOKEN           STYLE  BKG  ZNORM  BSCR EER  HS EER  COMB EER  COMB DCF
word unigrams   sw     14   none   6.82      4.83
word unigrams   dtw    14   none   4.16      3.16
word unigrams   dtw    14   16     2.66      2.16    2.00      0.0416
word bigrams    sw     14   16     5.80      3.68
word bigrams    dtw    14   16     2.83      2.16    1.83      0.0447
phone unigrams  dtw    14   16     2.64      2.48    1.98      0.0560
phone bigrams   dtw    14   16     1.83      1.83    1.33      0.0333
phone trigrams  dtw    14   16     1.65      1.65    1.16      0.0345

For reference: GMM performance on the same data set: 0.67% EER; 0.0491 DCF

Style: sw = sliding window, dtw = dynamic time warping; Bkg: # of speakers in the bkg set; Znorm: # of speakers in the znorm set; BSCR = basic score, HS = hit score; EER in %




Results

How do positive and negative evidence compare?

Word-bigrams + bkg (positive evidence) 3.16% EER

Word-bigrams + bkg (negative evidence) 26.5% EER


Results

How is the system affected by errorful recognizer transcripts?

Word bigrams + bkg + znorm (recognized transcripts) 1.83% EER

Word bigrams + bkg + znorm (true transcripts) 1.16% EER


Results

How does the system combine with the GMM?

This experiment was done on the first half (splits 1,2,3) of Switchboard I

System                    EER   DCF
SRI GMM system            0.97  0.04806
Best phone-bigram system  1.46  0.06110
GMM + phone-bigrams       0.49  0.02040


Future Stuff

• Try larger background population, larger znorm set

• Try other, non-Euclidean distance functions

• Change the front-end features (Feature mapping)

• Run the system on Switchboard II; 2004 eval. data

• Dynamic token selection

– While the system works well already, perhaps its real strength is one which has not been exploited. Since there are no models, we might dynamically select the longest available frame sequences in the test and target data for scoring.


Thanks

Steve (wrote all the DTW code, versions 1 through 5…)

Barry (tried to make my slides fancy)

Barbara

Everyone else in the Speaker ID group
