Building a Robust Speaker Recognition System
Oldřich Plchot, Ondřej Glembek, Pavel Matějka
December 9th 2012


The PRISM Team

SRI International: Harry Bratt, Lukas Burget, Luciana Ferrer, Martin Graciarena, Aaron Lawson, Yun Lei, Nicolas Scheffer, Sachin Kajarekar, Elizabeth Shriberg, Andreas Stolcke

Brno University of Technology: Jan H. Černocký, Ondřej Glembek, Pavel Matějka, Oldřich Plchot

PRISM Robustness: How did we achieve these results?

What are the outstanding research issues?

BEST Phase I PI conference, Nov. 29th, 2011

Error rates lowered.

Robustness
- A need for effectiveness on non-ideal conditions
- Moving beyond biometric evaluation on clean, controlled acquisition environments
- Extract robust and discriminative biometric features, invariant to such variability types

A need for predictability
- A system claiming 99% accuracy should not give 80% on unseen data, unless the system warns otherwise

A comprehensive approach
- Multi-stream high-order and low-order features

- Advanced speaker modeling and system combination

- Prediction of difficult scenarios: QM vector

- Robustness vs. the unknown: carefully test on held-out data, beware of overtraining

A comprehensive approach
- Multi-stream high-order and low-order features: prosody, MLLR, constraints; and MFCC, PLP
- Multiple HOFs: new complementary information
- Multiple LOFs: ditto, plus redundancy for increased robustness

Advanced speaker modeling and system combination: a unified modeling framework
- i-vector / probabilistic linear discriminant analysis
- Robust variation-compensation scheme for multiple features and variability types
- i-vector / PLDA framework adapted to all high- and low-level features
- Discriminative training for more compact, and thus more robust, systems

THE MAGIC? iVectors
- The iVector extractor is a model similar to JFA but with a single subspace T, which is easier to train
- No speaker labels needed: the subspace can be trained on a large amount of unlabeled recordings
- We assume a standard normal prior on the factors i
- An iVector, a point estimate of i, can then be extracted for every recording as its low-dimensional, fixed-length representation (typically 200 dimensions)
- However, the iVector contains information about both speaker and channel; hopefully these can be separated by the subsequent classifier
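The extraction step described above can be sketched as follows. This is a minimal numpy sketch of the standard posterior-mean computation for the factor i, assuming a diagonal-covariance UBM and precomputed zeroth- and first-order Baum-Welch statistics; it is an illustration, not the PRISM implementation:

```python
import numpy as np

def ivector_posterior_mean(N, F, T, Sigma):
    """Posterior mean of the latent factor i (the iVector point estimate).

    N     : (C,)     zeroth-order stats (soft counts per UBM component)
    F     : (C, D)   centered first-order stats per component
    T     : (C*D, R) total-variability subspace, stacked per component
    Sigma : (C*D,)   diagonal UBM covariances, flattened

    Assumes the standard normal prior i ~ N(0, I) from the slide.
    """
    C, D = F.shape
    R = T.shape[1]
    T_prec = T / Sigma[:, None]                 # Sigma^{-1} T, row-wise
    # Posterior precision: L = I + sum_c N_c * T_c' Sigma_c^{-1} T_c
    L = np.eye(R)
    for c in range(C):
        Tc = T[c * D:(c + 1) * D]
        L += N[c] * (Tc.T @ T_prec[c * D:(c + 1) * D])
    # Posterior mean: L^{-1} T' Sigma^{-1} f
    return np.linalg.solve(L, T_prec.T @ F.reshape(-1))
```

Because no speaker labels enter this computation, the subspace T can indeed be trained on unlabeled data, as the slide notes.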

Dehak, N., et al., "Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification," in Proc. Interspeech 2009, Brighton, UK, September 2009.

[Presenter note: perhaps remark on iVector extractor training here.]

Illustration: a low-dimensional vector can represent complex patterns in a multi-dimensional space.
[Figure: the GMM mean supervector is built blockwise, each component mean m_c shifted by a combination of subspace columns t_ij weighted by the factors i_1, i_2, i_3.]

Probabilistic Linear Discriminant Analysis (PLDA)
Let every speech recording be represented by an iVector.

What would now be an appropriate probabilistic model for verification?
- iVectors are assumed to be normally distributed
- An iVector still contains channel information, so our model should consider both speaker and channel variability, just like JFA

A natural choice is a simplified JFA model with only a single Gaussian. Such a model is known as PLDA and is described by a familiar equation:
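The equation itself was an image and did not survive the transcript; the standard simplified-JFA (PLDA) model the slide presumably refers to is:

```latex
\phi = m + V\,y + U\,x, \qquad y \sim \mathcal{N}(0, I), \quad x \sim \mathcal{N}(0, I)
```

where phi is the iVector, y is the speaker factor, and x is the channel factor.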

For our low-dimensional iVectors, we usually choose U to be a full-rank matrix, so there is no need to consider a residual term. We can rewrite the definition of PLDA as

or equivalently as

Why PLDA?
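The rewritten forms were also images; a plausible reconstruction is the two-covariance view, first conditioning on the speaker factor and then integrating it out:

```latex
\phi \mid y \sim \mathcal{N}\!\left(m + V y,\; U U^{\top}\right),
\qquad
\phi \sim \mathcal{N}\!\left(m,\; V V^{\top} + U U^{\top}\right)
```

with VV' playing the role of the across-class covariance and UU' the within-class covariance.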

The familiar LDA assumptions!

[Presenter note, translated from Czech: point here and say whether or not it is the same speaker.]

PLDA-based verification
Let us again consider the verification score given by the log-likelihood ratio between the same-speaker and different-speaker hypotheses, now in the context of modeling iVectors using PLDA:

Before: intractable. With iVectors: feasible.

All the integrals are now convolutions of Gaussians and can be solved analytically, giving, after some manipulation, a closed-form score.
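As an illustration of that closed form, here is a small numpy sketch of the verification log-likelihood ratio under the two-covariance PLDA model (mean-subtracted iVectors assumed; this naive version builds the joint covariances explicitly rather than using the optimized expression the slide alludes to):

```python
import numpy as np

def plda_llr(phi1, phi2, Sigma_ac, Sigma_wc):
    """Same- vs. different-speaker log-likelihood ratio for two iVectors.

    Sigma_ac: across-class covariance (V V' in the slide's notation)
    Sigma_wc: within-class covariance (U U')
    """
    d = len(phi1)
    St = Sigma_ac + Sigma_wc          # total covariance of one iVector
    # Joint covariance of [phi1; phi2] under each hypothesis:
    # same speaker -> the iVectors share the speaker factor (correlated),
    # different speakers -> the iVectors are independent.
    same = np.block([[St, Sigma_ac], [Sigma_ac, St]])
    diff = np.block([[St, np.zeros((d, d))], [np.zeros((d, d)), St]])
    x = np.concatenate([phi1, phi2])

    def logpdf(S):
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (logdet + x @ np.linalg.solve(S, x))

    return logpdf(same) - logpdf(diff)
```

The score is symmetric in the two iVectors, and a pair of identical iVectors scores higher than an anti-correlated pair, as expected.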

FAST!

Performance compared to Eigenchannels and JFA
[Chart: NIST SRE 2010, tel-tel (cond. 5); systems compared: Baseline (relevance MAP), Eigenchannel adaptation, JFA, iVector+PLDA.]

The iVector+PLDA system:
- Implementation simpler than for JFA
- Allows for extremely fast verification
- Provides significant improvements, especially in the important low-false-alarm region

iVector+PLDA enhancements
[Chart: NIST SRE 2010, tel-tel (cond. 5); iVector+PLDA, iVector+PLDA with full-covariance UBM, + LDA to 150 dims + length normalization, + condition-based mean normalization.]

Ideas behind the enhancements:
- Make it easier for PLDA by preprocessing the data with LDA
- Make the heavy-tailed iVectors more Gaussian
- Help a little more with channel compensation by condition-based mean normalization
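A minimal sketch of the preprocessing chain these enhancements describe. The function name and argument layout are illustrative; in practice the LDA projection matrix and condition means would be estimated on labeled training data:

```python
import numpy as np

def preprocess_ivectors(ivectors, lda=None, cond_mean=None):
    """Apply the slide's enhancements to a batch of iVectors (rows).

    cond_mean: optional condition-based mean to subtract; falls back to
               the batch mean here for the sake of a runnable example.
    lda:       optional projection matrix, e.g. reducing to 150 dims.
    The final length normalization makes heavy-tailed iVectors more
    Gaussian, as motivated on the slide.
    """
    X = ivectors - (cond_mean if cond_mean is not None else ivectors.mean(0))
    if lda is not None:
        X = X @ lda
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```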

Diverse systems unified
- New technologies for prosody modeling, e.g. subspace multinomial modeling
- All features are now modeled using the i-vector paradigm, even for combination
[Chart: %FA @ 10% miss]

BEST Phase I Final review, Nov. 3rd, 2011

BEST evaluation submissions
- Complex multi-feature combinations of low- and high-level systems
- [Chart: % false alarms @ 10% miss for our PRISM MFCC system; PRIMARY: early iVector fusion, optimal]
- Look at another operating point? (if that low for the evaluation)

A comprehensive approach
- Multi-stream high-order and low-order features: prosody, MLLR, constraints; and MFCC, PLP
- Advanced speaker modeling and system combination: unified modeling framework

Prediction of difficult scenarios: universal audio characterization for system combination
- Detect the difficulty of the problem, e.g. enrollment in noise, test on telephone
- React appropriately, e.g. calibrate scores for sound decisions

Predicting challenging scenarios
Unified acoustic characterization: a novel approach to extract any metadata in a unified way
- Designed with the BEST program goal in mind: the ability to handle unseen data or compounded variability types
- Avoids the unnecessary burden of developing a new system for each new type of metadata
- A metadata identification system where the training data is divided into conditions
- Investigating how to integrate intrinsic conditions: language and vocal effort
[Table: condition-prediction scores across conditions: Microphone, Noise 20 dB, Noise 15 dB, Reverb 0.3, Reverb 0.5, Reverb 0.7, Tel, Noise 8 dB.]

Robust calibration / fusion
- Condition-prediction features as new higher-order information for calibration
- Calibration: scale and shift scores for sound decision making at all operating points
- Confidence under matched vs. mismatched conditions will differ
- Discriminative training of the bilinear form; the model gives a bias for each condition type

Further research
- Assess generalization
- Affect system fusion weights, not just calibration
- Earlier inclusion of the information

Fusion with QM

[Equation legend: an offset; linear combination weights; the score from system k; vectors of metadata; a bilinear combination matrix.]

A comprehensive approach
- Multi-stream high-order and low-order features: prosody, MLLR, constraints; and MFCC, PLP
- Advanced speaker modeling and system combination: unified modeling framework
- Prediction of difficult scenarios: unified condition prediction for system combination
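The fusion-with-QM score described above is presumably an offset plus a weighted sum of per-system scores plus a bilinear term over the enrollment- and test-side metadata vectors; the equation itself did not survive the transcript, so the function name and argument layout below are illustrative:

```python
import numpy as np

def qm_fused_score(scores, alpha, offset, A, q_enroll, q_test):
    """Fused score per the slide's legend.

    scores   : (K,)  score from each of the K subsystems
    alpha    : (K,)  linear combination weights
    offset   : scalar offset
    A        : (M, M) bilinear combination matrix
    q_enroll : (M,)  metadata vector for the enrollment side
    q_test   : (M,)  metadata vector for the test side
    """
    return offset + alpha @ scores + q_enroll @ A @ q_test
```

The bilinear term is what lets the calibration shift scores differently under matched vs. mismatched conditions, as the previous slide motivates.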

Robustness vs. the unknown: the PRISM data set
- Expose systems to a diverse enough set of variability types of interest
- Aim for generalization to non-ideal or unseen data scenarios
- Use advanced strategies to compensate for these degradations

The PRISM data set
- A multi-variability, large-scale speaker recognition evaluation set
- Unprecedented design effort across many data sets
- Simulation of extrinsic variability types: reverb & noise
- Incorporation of intrinsic and cross-language variability
- 1000 speakers, 30K audio files, and more than 70M trials
- Open design: recipe published at the SRE11 analysis workshop [Ferrer11]

Extrinsic data simulation
- Degradation of a clean interview data set from SRE08 and SRE10 (close mics)
- A variety of degradations aiming at generalization: a diversity of SNRs / reverbs to cover unseen data

Reverb data set
- Uses RIR + Fconv
- Three RT30 values chosen: 0.3, 0.5, 0.7
- 15 different room configurations: 9 for training, 3 for enrollment, 3 for test
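The RIR + Fconv step amounts to convolving clean speech with a room impulse response; a minimal sketch (the actual tooling named on the slide does this at scale):

```python
import numpy as np

def reverberate(speech, rir):
    """Convolve speech with a room impulse response, truncated back to
    the original length so the result aligns with the clean signal."""
    return np.convolve(speech, rir)[: len(speech)]
```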

Noisy data set
- Noises from freesound.org, mixed using FaNT (Aurora)
- Real noise samples: cocktail-party type, office noises
- Different noises for training and evaluation
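The mixing step can be sketched as follows. This is a bare-bones SNR scaling, not a reimplementation of FaNT (which also applies frequency weighting):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix a noise sample into speech at a target SNR in dB.

    The noise is looped/trimmed to the speech length, then scaled so
    that 10*log10(P_speech / P_noise) equals snr_db.
    """
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```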

[Chart: noisy-set results at 8 dB, 15 dB, and 20 dB SNR.]

A comprehensive approach
- Multi-stream high-order and low-order features: prosody, MLLR, constraints; and MFCC, PLP
- Advanced speaker modeling and system combination: unified modeling framework
- Prediction of difficult scenarios: unified condition prediction for system combination
- Robustness vs. the unknown: the PRISM data set

Executing this plan
- BEST evaluation: an order of magnitude bigger than other known evaluations
- Developed a very fast speaker recognition system
- Leveraged SRI's rapid application development framework for efficient idea assessment and system delivery: the SRI Idento system
- A diversely skilled team

Research opportunities: multi-feature systems
- Use novel low-level features for noise robustness
- Noise/reverb-robust pitch extraction algorithms

Deeper understanding of combination: aiming for simpler systems
- Information fusion at an earlier stage than score level
- New speech feature design?

Acoustic characterization
- Deep integration of condition prediction in the pipeline
- Affecting fusion weights during system combination
- Integrate language and intrinsic variations
- Assessing improvements on unseen data and compounded variations

Hard extrinsic variations bring in new domains of expertise borrowed from speech recognition and elsewhere (noise-robust modeling, speech enhancement: de-reverberation, de-noising, binary masks, etc.)

Research opportunities: relaxing constraints even more
- Compounded variations: reverb + noise + language switch

Explore new types of variations
- New kinds of intrinsic variations: vocal effort (furtive, oration), aging, sickness
- Naturally occurring reverberant and noisy speech

Other parametric relaxations
- Unconstrained duration for speaker enrollment and testing (as low as a second?)
- Robustness to multi-speaker audio in enrollment and testing: another kind of variability, VERY important for interview data processing


Assess technology generalization on unseen data

Styles (e.g. fast, hyper-articulated speech)

Questions?

Expanding the boundaries of speaker recognition
- More similar trials (close relatives, same dialect area)

Change verification to large-scale speaker search / tracking
- Large multi-speaker corpora; enroll as many speakers as possible
- Evaluation: in another large-scale corpus, find the enrolled speakers (HVI)

How much can you learn from a speaker? Enrolling a familiar voice.
- Plenty of enrollment data
- Limited test data
- Introduce new high-level features, towards social analytics

Research opportunities
- Understanding combination: leaner systems
- Information fusion at an earlier stage than score level
- New feature creation / early-stage fusion methods

New opportunities
- Need: enabling technologies are understudied
- Benefits: SRI + BUT gives a 2x improvement
- A niche for robustness

BUT and SRI worked for two years on a common setup / data set
- Still, BUT has a better system on telephone
- Same lists, data, and technology, except VAD

HLF Systems Explored on BEST

Prosodic systems
- Syllable-based features
- Contour modeling

Phonetic systems
- MLLR from speech recognition
- Constraints: analysis based on linguistically motivated regions
[Diagram: Constraints, MLLR, Acoustic characterization]

Critical BAA requirement