speaker detection without models dan gillick july 27, 2004
Speaker Detection Without Models
Dan Gillick
July 27, 2004
July 27, 2004 Speaker Detection Without Models Dan Gillick (2)
Motivation
Want to develop a speaker ID algorithm that:
• captures sequential information
• takes advantage of extended data
• combines well with existing baseline systems
The Algorithm
• Rather than build models (GMM, HMM, etc.) to describe the information in the training data, we directly compare test data frames to training data frames.
• We compare sequences of frames because we believe there is information in sequences that systems like the GMM do not capture.
• The comparisons are guided by token-level alignments extracted from a speech recognizer.
Front-End
Using 40 MFCC features per 10 ms frame:
– 19 cepstral coefficients and energy (C0)
– Their deltas
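The feature count works out as 20 static values (19 cepstra plus C0) and 20 deltas. The sketch below shows one way to build such a 40-dimensional frame. The slides do not say how the deltas were computed, so the symmetric difference used here, and the name `add_deltas`, are assumptions for illustration only.

```python
def add_deltas(frames):
    """Append one delta per static feature to every frame.

    Assumption: delta = half the difference between the next and the
    previous frame, with the first and last frames clamped at the edges.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        deltas = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(list(frames[t]) + deltas)
    return out


# Toy input: 5 frames of 20 static features (19 cepstra + C0).
static = [[float(t)] * 20 for t in range(5)]
full = add_deltas(static)  # each frame now has 20 static + 20 delta values
```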
The Algorithm: Overview
Cut the test and target data into tokens
– use word or phone-level time-alignments from the SRI recognizer
– note that these alignments have lots of errors (both word errors and alignment errors)
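A minimal sketch of the token-cutting step, assuming the alignments are available as (label, start time, end time) tuples in seconds. The actual output format of the SRI recognizer is not described in these slides, so `alignments` and `cut_into_tokens` are hypothetical.

```python
FRAME_SHIFT_SEC = 0.010  # one frame every 10 ms, as stated for the front-end


def cut_into_tokens(frames, alignments):
    """Return a list of (label, frame_sequence) pairs.

    `alignments` is a hypothetical list of (label, start_sec, end_sec)
    tuples; alignments that select no frames are skipped.
    """
    tokens = []
    for label, start, end in alignments:
        first = int(round(start / FRAME_SHIFT_SEC))
        last = int(round(end / FRAME_SHIFT_SEC))
        segment = frames[first:last]
        if segment:
            tokens.append((label, segment))
    return tokens


# One second of toy 1-dimensional frames and two aligned words.
frames = [[float(i)] for i in range(100)]
alignments = [("hello", 0.00, 0.30), ("my", 0.30, 0.42)]
tokens = cut_into_tokens(frames, alignments)
```

Because the alignments contain errors, some of the resulting frame sequences will carry the wrong label or the wrong boundaries; nothing in this sketch tries to detect that.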
The Algorithm: Overview
Compare test and target data
1. Take the first test token
2. Find every instance of this token in the target data
3. Measure the distance between the test token and each target instance
4. Move on to the next test token
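The four steps can be sketched as a loop over test tokens with a pluggable distance function. The names `index_by_label` and `compare` are illustrative, and skipping test tokens that never occur in the target data is an assumption, since the slides do not say how such tokens are handled.

```python
from collections import defaultdict


def index_by_label(tokens):
    """Group token instances (frame sequences) by their label."""
    index = defaultdict(list)
    for label, frames in tokens:
        index[label].append(frames)
    return index


def compare(test_tokens, target_tokens, distance):
    """For each test token, compute its distance to every target
    instance with the same label.

    Assumption: test tokens with no target instance are skipped.
    """
    target_index = index_by_label(target_tokens)
    results = []
    for label, frames in test_tokens:
        instances = target_index.get(label, [])
        if not instances:
            continue
        results.append((label, [distance(frames, inst) for inst in instances]))
    return results


def toy_distance(a, b):
    """Placeholder distance for the demo: difference in length."""
    return abs(len(a) - len(b))


test = [("hello", [[0.0]] * 3), ("there", [[0.0]] * 2)]
target = [("hello", [[0.0]] * 5), ("hello", [[0.0]] * 3), ("my", [[0.0]] * 1)]
out = compare(test, target, toy_distance)
```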
The Algorithm
[Figure: test data and training data. The test token "Hello" is compared with each target instance, Hello (1), Hello (2) and Hello (3), through a Euclidean distance function.]
“Take the first test token”: grab the sequence of frames corresponding to this token according to the recognizer output
“Find every instance of this token in the target data”
“Measure the distance between the test token and each target instance”: distance = sum of the (Euclidean) distances between frames of the test and target instances
Hello (test) vs. Hello (1): Distance = 25
Hello (test) vs. Hello (2): Distance = 40
Hello (test) vs. Hello (3): Distance = 18
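As a small illustration of the distance just described, the sketch below sums frame-by-frame Euclidean distances for two sequences that already have the same number of frames. The function names are illustrative, and the equal-length requirement is a simplification for this example.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def token_distance(test_frames, target_frames):
    """Sum of frame-wise Euclidean distances for equal-length sequences."""
    if len(test_frames) != len(target_frames):
        raise ValueError("sequences must have the same number of frames")
    return sum(frame_distance(x, y) for x, y in zip(test_frames, target_frames))


test_tok = [[0.0, 0.0], [1.0, 1.0]]
target_tok = [[3.0, 4.0], [1.0, 1.0]]
d = token_distance(test_tok, target_tok)  # 5.0 for the first pair, 0.0 for the second
```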
The Algorithm: Distance Function
But these instances have different lengths. How do we line up the frames? Here are some possibilities:
1. Line up the first frames and truncate the longer instance to the length of the shorter
2. Use a sliding window approach: slide the shorter through the longer, taking the best (smallest) total distance.
3. Use dynamic time warping (DTW)
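Options 1 and 2 can be sketched in a few lines. This is an illustration under simple assumptions (the window moves one frame at a time and no length normalization is applied); the function names are not from the original system.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def truncated_distance(a, b):
    """Option 1: align the first frames and ignore the tail of the longer one."""
    n = min(len(a), len(b))
    return sum(frame_distance(a[i], b[i]) for i in range(n))


def sliding_window_distance(a, b):
    """Option 2: slide the shorter sequence through the longer one and
    keep the smallest total distance over all offsets."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    n = len(short)
    best = math.inf
    for offset in range(len(long_) - n + 1):
        total = sum(frame_distance(short[i], long_[offset + i]) for i in range(n))
        best = min(best, total)
    return best


short_tok = [[1.0], [2.0]]
long_tok = [[5.0], [1.0], [2.0], [9.0]]
d_trunc = truncated_distance(short_tok, long_tok)        # compares against [5.0], [1.0]
d_slide = sliding_window_distance(short_tok, long_tok)   # best offset is 1
```

In this toy case truncation gives 5.0 while the sliding window finds the exact match at offset 1 and gives 0.0, which shows why the starting point of the alignment matters.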
[Figure: Hello (test) and Hello (3) passed through the Euclidean distance function; Distance = 18]
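For option 3, a textbook dynamic time warping recursion is sketched below. The slides do not specify the step pattern, slope constraints or normalization of the DTW actually used, so this unconstrained version, and the name `dtw_distance`, are assumptions.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw_distance(a, b):
    """Minimum summed frame distance over monotonic alignments of a and b.

    Assumptions: steps (i-1, j), (i, j-1), (i-1, j-1); no slope
    constraints; no normalization by path length.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences need at least one frame")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


a = [[1.0], [2.0], [3.0]]
b = [[1.0], [2.0], [2.0], [3.0]]   # same shape, middle frame held longer
d_same = dtw_distance(a, b)
d_diff = dtw_distance([[0.0]], [[1.0], [1.0]])
```

Unlike truncation or a sliding window, the warping lets one frame match several frames of the other sequence, so the stretched version of the same pattern costs nothing here.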
The Algorithm: Take the 1-Best
Now what do we do with these scores? There are a number of options, but we keep only the 1-best score, i.e. the smallest distance. One motivation for this decision is that we are mainly interested in positive information.
Hello (test) vs. Hello (1): Distance = 25
Hello (test) vs. Hello (2): Distance = 40
Hello (test) vs. Hello (3): Distance = 18
Token Score = 18
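In code, the 1-best rule is just a minimum over the distances to the target instances. The function name `token_score` is illustrative.

```python
def token_score(distances):
    """1-best score: the smallest distance to any target instance."""
    if not distances:
        raise ValueError("token has no target instances")
    return min(distances)


score = token_score([25, 40, 18])  # the "Hello" example
```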
The Algorithm: Scoring
So we accumulate scores for each token. What do we do with these? Some options:
1. Average them, normalizing either by the number of tokens or by the total number of frames (Basic score)
2. Focus on some subset of the scores
a. Positive evidence (Hit score): ∑ [ (#frames) / (k^score) ]
b. Negative evidence: ∑ [ (#frames*target count) / (k^(M-score)) ]
Hello: Token Score = 18
my: Token Score = 16.5
name: Token Score = 21
Etc…
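A sketch of the basic score and the hit score under stated assumptions: the value of k is not given in the slides (k = 2 below is a placeholder), and the frame counts in the example are made up. The negative-evidence score is left out because M is not defined here.

```python
def basic_score(token_scores, frame_counts, per_frame=True):
    """Average token score, normalized by total frames or by token count."""
    total = sum(token_scores)
    denom = sum(frame_counts) if per_frame else len(token_scores)
    return total / denom


def hit_score(token_scores, frame_counts, k=2.0):
    """Positive evidence: sum of #frames / k**score over tokens.

    Assumption: k > 1, so small distances (good matches) dominate.
    """
    return sum(n / (k ** s) for s, n in zip(token_scores, frame_counts))


scores = [18.0, 16.5, 21.0]   # token scores from the example
frames = [30, 12, 25]         # hypothetical frame counts
per_token = basic_score(scores, frames, per_frame=False)
per_frame = basic_score(scores, frames, per_frame=True)
hits = hit_score([1.0, 2.0], [4, 8], k=2.0)  # 4/2 + 8/4
```

Note the opposite directions: a lower basic score means a closer match, while a higher hit score means more positive evidence, because well-matched long tokens contribute the most.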
Normalization
• Most systems use a UBM (universal background model) to center the test pieces
– Since this system has no model, we create a background by lumping together speech from a number of different held-out speakers and running the algorithm with this group as training data
• ZNorm to center the “models”
– Find the mean score for each “model” or training set by running a number of held-out imposters against each one.
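A minimal sketch of the ZNorm step. The slide mentions only the mean impostor score; dividing by the standard deviation is the usual z-norm convention and is an assumption here, which is why the sketch makes it optional. The function names are illustrative.

```python
import statistics


def znorm_params(impostor_scores):
    """Mean and (population) standard deviation of the scores obtained
    by running held-out impostors against one target training set."""
    mean = statistics.fmean(impostor_scores)
    std = statistics.pstdev(impostor_scores)
    return mean, std


def znorm(raw_score, mean, std, scale=True):
    """Center a raw score; optionally scale by the impostor spread.

    Assumption: scaling by std follows common z-norm practice and is
    not stated in the slides.
    """
    centered = raw_score - mean
    if scale and std > 0:
        return centered / std
    return centered


impostors = [10.0, 12.0, 14.0]   # toy impostor scores for one target
mean, std = znorm_params(impostors)
centered_only = znorm(14.0, mean, std, scale=False)
normalized = znorm(12.0, mean, std)
```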
Results
Results reported on split 1 (of 6) of Switchboard I
(1624 test vs. target scores)
Results
TOKEN           STYLE  BKG  ZNORM  BSCR EER  HS EER  COMB EER  COMB DCF
word unigrams   sw     14   none   6.82      4.83    -         -
word unigrams   dtw    14   none   4.16      3.16    -         -
word unigrams   dtw    14   16     2.66      2.16    2.00      0.0416
word bigrams    sw     14   16     5.80      3.68    -         -
word bigrams    dtw    14   16     2.83      2.16    1.83      0.0447
phone unigrams  dtw    14   16     2.64      2.48    1.98      0.0560
phone bigrams   dtw    14   16     1.83      1.83    1.33      0.0333
phone trigrams  dtw    14   16     1.65      1.65    1.16      0.0345
For reference: GMM performance on the same data set: 0.67% EER; 0.0491 DCF
Style: sw = sliding window, dtw = dynamic time warping; Bkg: # of speakers in the bkg set; Znorm: # of speakers in the znorm set; BSCR = basic score; HS = hit score; COMB = combined; EER in %; - = not reported
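For readers unfamiliar with the metric, the sketch below shows one simple way to estimate an equal error rate from a set of target and impostor trial scores. It assumes that a higher score means "more likely the target speaker"; distance-like scores such as the basic score would need to be negated first. This is an illustration, not the scoring tool used for the numbers above.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    Assumption: higher score = more target-like. The EER is taken at the
    threshold where miss rate and false-alarm rate are closest.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for th in thresholds:
        miss = sum(s < th for s in target_scores) / len(target_scores)
        fa = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2.0
    return eer


targets = [0.9, 0.8, 0.7, 0.3]
impostors = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)
```

In this toy set one of four target trials scores below one of four impostor trials, so the miss and false-alarm rates meet at 25%.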
Results
How do positive and negative evidence compare?
Word-bigrams + bkg (positive evidence) 3.16% EER
Word-bigrams + bkg (negative evidence) 26.5% EER
Results
How is the system affected by errorful recognizer transcripts?
Word bigrams + bkg + znorm (recognized transcripts) 1.83% EER
Word bigrams + bkg + znorm (true transcripts) 1.16% EER
Results
How does the system combine with the GMM?
This experiment was done on the first half (splits 1,2,3) of Switchboard I
SYSTEM                    EER   DCF
SRI GMM system            0.97  0.04806
Best phone-bigram system  1.46  0.06110
GMM + phone-bigrams       0.49  0.02040
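To put the combination in relative terms, the small calculation below compares the combined system with the GMM alone, using only the figures in the table. The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, combined):
    """Fraction by which the combined system lowers the baseline error."""
    return (baseline - combined) / baseline


eer_gain = relative_reduction(0.97, 0.49)         # GMM EER vs. combined EER
dcf_gain = relative_reduction(0.04806, 0.02040)   # GMM DCF vs. combined DCF
print(f"EER reduction: {eer_gain:.1%}, DCF reduction: {dcf_gain:.1%}")
```

Both relative reductions come out at roughly a half or more, even though the phone-bigram system on its own is worse than the GMM.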
Future Stuff
• Try larger background population, larger znorm set
• Try other, non-Euclidean distance functions
• Change the front-end features (Feature mapping)
• Run the system on Switchboard II; 2004 eval. data
• Dynamic token selection
– While the system works well already, perhaps its real strength is one which has not been exploited. Since there are no models, we might dynamically select the longest available frame sequences in the test and target data for scoring.
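One possible reading of dynamic token selection is to look for the longest run of consecutive token labels shared by the test and target transcripts, and score the frames under that run as a single unit. The sketch below finds such a run; working on token labels rather than directly on frames, and the name `longest_common_run`, are assumptions, since this idea is only proposed here and not specified.

```python
def longest_common_run(test_labels, target_labels):
    """Return (length, test_start, target_start) of the longest contiguous
    run of identical labels shared by the two sequences.

    Uses the standard longest-common-substring dynamic program, keeping
    only the previous row of the table.
    """
    best = (0, 0, 0)
    prev = [0] * (len(target_labels) + 1)
    for i, a in enumerate(test_labels, start=1):
        cur = [0] * (len(target_labels) + 1)
        for j, b in enumerate(target_labels, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


test = ["hello", "my", "name", "is", "dan"]
target = ["well", "my", "name", "is", "steve"]
run = longest_common_run(test, target)  # "my name is" starts at index 1 in both
```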
Thanks
Steve (wrote all the DTW code, versions 1 through 5…)
Barry (tried to make my slides fancy)
Barbara
Everyone else in the Speaker ID group