Centre for Vision, Speech and Signal Processing
Speaker and Speech Recognition: Speaker Recognition and Verification
Josef Kittler
Centre for Vision, Speech and Signal Processing
University of Surrey, Guildford GU2 7XH
2
OUTLINE
• Introduction
• Terminology
• Problem formulation
• Speech representation
• Text independent methods
• Text dependent methods
3
Introduction
Person identification is crucial to the fabric of society:
• Security
• Access to services
• Business transactions
• Law enforcement
• Border control
4
Establishing Identity
One or more of the following:
• What the entity knows (e.g. password)
• What the entity has (e.g. badge, smart card)
• What the entity is (e.g. fingerprints, retinal characteristics)
• Where the entity is (e.g. in front of a particular terminal)
5
Authentication Overview
Authenticator type | Proof | Defense | Traditional example | Digital example | Flaw
Secret | Secrecy, obscurity | Closely kept | Combo lock | Computer password | Less secret with each use
Token | Possession | Closely held | Metal key | Key-less car entry | Insecure if lost
ID | Uniqueness | Copy-resistant | Driver's license | Biometric access | Difficult to replace
6
Authentication Overview
Authenticator subtypes:
1. Secret
- secrecy, e.g., password
- obscurity, e.g., mother’s maiden name, SSN
2. Token
- active, e.g., synchronised password generator
- passive, e.g., smart card password storage
3. ID
- inalterable, e.g., fingerprint, face, hand, eye
- alterable biometric signal, e.g., voice, keystroke, signature
7
Biometric functionalities
Biometrics: a means to prevent identity theft
Biometric functionalities:
• Verification
• Identification
• Screening (watch list)
• Retrieval (detection of multiple identities)
• Negation of denial
8
Historical notes
• Bertillon system
• Advantages of biometrics
  - Convenience
  - Fraud reduction
• Disadvantages
  - Output is a score
  - Cannot be replaced
  - Open to attack
  - Privacy concerns
9
Application Characteristics
• Cooperative / non-cooperative
• Scalable / non-scalable
• Private / public
• Closed / open
• High-level / low-level security
• Habituated / non-habituated
10
Selection criteria
• Universality (all users possess this biometric)
• Permanence
• Uniqueness
• Collectability
• Acceptability
• Open to attack
• Failure to enroll
11
Introduction
Voice-based speaker recognition/verification: why?
Voice is one of the main biometric modalities used by humans for personal identity recognition and verification.
• User friendly
• Natural interface
• Conveys emotion
• Can be used over the telephone
12
Terminology
• Speaker identification -- using utterances from a speaker, determine who he/she is out of a set of known speakers
• Speaker verification -- using utterances from a speaker, determine whether the caller is who he/she claims to be (requires an identity claim)
• Training -- using utterances from a speaker to train a unique voiceprint that can later be used to identify/verify the speaker. Applies to both SI and SV.
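The difference between the two tasks can be sketched in a few lines. This is only an illustration: the speaker names, the similarity scores and the threshold are hypothetical, and it assumes that a higher score means a better match.

```python
def identify(scores: dict[str, float]) -> str:
    """Speaker identification: pick the best-matching enrolled speaker."""
    return max(scores, key=scores.get)


def verify(scores: dict[str, float], claimed_id: str, threshold: float) -> bool:
    """Speaker verification: test only the claimed identity against a threshold."""
    return scores[claimed_id] >= threshold


# Hypothetical similarity scores of one utterance against three voiceprints.
scores = {"alice": 0.82, "bob": 0.35, "carol": 0.51}

best_match = identify(scores)                        # 1-of-N decision
claim_alice = verify(scores, "alice", threshold=0.7)  # accept/reject decision
claim_bob = verify(scores, "bob", threshold=0.7)
```

Identification needs no identity claim but compares against every enrolled speaker; verification compares against a single claimed model and depends on the choice of threshold.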
13
Voice biometrics properties
Biometric signal (voice)
Speaker verification is very compelling:
• Voice is convenient.
• Voice is ubiquitous.
• Voice is inexpensive.
• Voice provides challenge-response security.
Unfortunately:
• It is sometimes inconvenient because it does not work well everywhere, and requires a back-up solution
14
Voice Recognition Applications
• Voice-based identity verification for
  - Web banking
  - Physical access control
  - Border control
  - Identity cards/driver’s licence
  - Personalisation
• Entertainment
  - Robotic toys
• Law enforcement/surveillance
  - Telephone surveillance
15
Application to web banking
[Figure: web banking architecture. Labels: Client, Microphone, Card Reader, Smart Card, Internet, Server, Private Network, Services Provider Server]
16
Biometric Applications
Forensic:
• Criminal investigation
Government:
• Identity card/passport
• Driver’s licence
• Social security
Commercial:
• ATM
• E-commerce/banking
• Mobile phone
• Nomadic working
• Personalisation
Market Outlook for the Future
* Voice ID: Applications and Markets for the New Millennium, J. Markowitz, Consultants, 1999
Context of speaker recognition
[Figure: speech processing taxonomy. Labels: Speech Processing; Digitized Speech; Input; Output; Speech Recognition; Speech Synthesis; Voice Biometrics (Speaker Verification, Speaker Identification)]
19
Voice template
• Set of features extracted from the raw biometric data during enrolment
• Represents typical values of the voice biometric
• Multiple templates may be stored to account for intra-class variability
• Template issues
  - Aging (maintenance)
  - Central/distributed storage
  - Privacy and protection
20
Deficiencies of existing commercial solutions
• Sensitive to speech acquisition conditions
• Sensitive to background noise
• Sensitive to emotional state
• Sensitive to physical state of the user
• Quite effective for closed set applications under tightly controlled voice acquisition conditions
21
Main causes of acoustic variation in speech
[Figure: sources of variation acting on the speech signal before it reaches the speaker recognition system]
Speaker
• Voice quality
• Pitch
• Gender
• Dialect
Speaking style
• Stress/Emotion
• Speaking rate
• Lombard effect
Task/Context
• Man-machine dialogue
• Dictation
• Free conversation
Phonetic/Prosodic context
Noise
• Other speakers
• Background noise
• Reverberations
Channel
• Distortion
• Noise
• Echoes
• Dropouts
Microphone
• Distortion
• Electrical noise
• Directional characteristics
22
Voice recognition challenge
• Large number of classes
• Segmentation
• Noise and distortion
• Variability of deployed microphones (interoperability)
• Population coverage and scalability
• System performance
• System attacks
• Aging
• Non-uniqueness of biometrics
• Privacy concerns (smart cards)
24
Generic Architecture
Main processing stages:
• Silence detection
• Energy normalisation
• Feature extraction
• Recognition
Related processes:
• Background noise removal
• Phoneme/speech recognition
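A minimal sketch of the first two stages, assuming a simple energy-based silence detector and mean-energy normalisation. The frame length, hop size and 30 dB margin are illustrative choices, not values taken from the slides.

```python
import numpy as np


def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]


def log_energy(frames, eps=1e-10):
    """Frame energy in dB."""
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + eps)


def drop_silence(frames, margin_db=30.0):
    """Keep frames whose energy is within margin_db of the loudest frame."""
    e = log_energy(frames)
    return frames[e > e.max() - margin_db]


def normalise_energy(frames):
    """Scale the retained frames so that their mean energy is 1."""
    mean_energy = np.mean(np.sum(frames ** 2, axis=1))
    return frames / np.sqrt(mean_energy)


# Synthetic input: 0.1 s of near-silence followed by 0.1 s of a 200 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
silence = 1e-4 * rng.standard_normal(1600)
speech = np.sin(2 * np.pi * 200 * np.arange(1600) / 16000)
frames = frame_signal(np.concatenate([silence, speech]))
voiced = drop_silence(frames)
normalised = normalise_energy(voiced)
```

Feature extraction and recognition would then operate on `normalised`; deployed systems typically use more robust voice activity detection than a fixed energy margin.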
25
Privacy issues
• Template protection
• Biometrics can be used to track people (secretly): a violation of their right to privacy (big brother)
• Biometric data may be used for other than intended purposes
• Biometric database linking
26
Evaluation protocol
• Test data and procedures adopted to evaluate a biometric system
• Evaluation should be conducted by an independent body
• Test and biometric data used should not have previously been seen by the system
• Data use cases
  - Training set
  - Evaluation set
  - Test set
• Data set sizes should allow statistically significant evaluation
27
XM2VTS database
• XM2VTS database (Messer et al 1999)
  - Face images and speech recordings of 295 people
  - Subjects recorded in 4 sessions
  - Frontal view images selected from the rotation sequence
• Lausanne Protocol
28
Performance Evaluation
• Performance criteria
  - Failure to enrol
  - Accuracy
  - Speed
  - Storage
  - Costs
  - Ease of use
  - Failure to acquire
29
Accuracy
• Measured in terms of
  - False rejections/identification
  - False acceptances
• Falsely accepted users are impostors
• Performance characterisation issues
  - Genuine ambiguity
  - Confidence
  - Competence
30
Performance characterisation (verification)
• False rejection
• False acceptance
• Total error rate/Half total error rate
• Operating point
  - Equal error rate (civilian)
  - Zero false rejection (high security)
  - Zero false acceptance (forensic)
• Test set/evaluation set
• Receiver operating characteristic
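The error measures listed above can be computed directly from two sets of scores. The sketch below assumes distance-type scores, where a claim is accepted when the score does not exceed the threshold; the score values are made up for illustration.

```python
import numpy as np


def error_rates(client_scores, impostor_scores, threshold):
    """Return (FAR, FRR, HTER) for distance-type scores.

    A claim is accepted when score <= threshold.
    """
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    frr = np.mean(client > threshold)      # genuine claims that are rejected
    far = np.mean(impostor <= threshold)   # impostor claims that are accepted
    return far, frr, 0.5 * (far + frr)     # HTER is half of the total error rate


# Hypothetical scores from an evaluation set.
client = [0.2, 0.4, 0.5, 0.9]
impostor = [0.6, 1.1, 1.3, 1.8]
far, frr, hter = error_rates(client, impostor, threshold=0.7)
```

Moving the threshold trades one error type against the other, which is what the choice of operating point expresses.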
31
Performance characterisation (identification)
• Confusion matrix
32
Reference
1. Douglas A. Reynolds, Thomas F. Quatieri, Robert B. Dunn. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, 10 (2000), 19-41.
2. Martin A., Doddington G., Kamm T., Ordowski M., Przybocki M. The DET curve in assessment of detection task performance. Eurospeech 97, 1895-1898.
33
Preprocessing and feature extraction
[Figure: processing chain. Labels: A/D converter; Hamming windowing; LPC analysis; cepstrum and delta cepstrum coefficients]
34
Speech input and spectra
[Figure: speech input and spectra for a client and an impostor]
35
Speech representation
• MFCC feature vectors (24 filterbank analysis), with delta coefficients and delta log-energy appended (2 coefficient-window)
• 33 component feature vectors
• Energy normalisation
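Delta coefficients approximate the time derivative of each static coefficient. The sketch below uses the common regression formula over a window of two frames on each side; whether this matches the exact formula used in the lecture is an assumption, and the input is a synthetic ramp rather than real MFCCs.

```python
import numpy as np


def delta(features, window=2):
    """Regression-based delta coefficients over +/- `window` frames.

    features: array of shape (n_frames, n_coeffs). Edge frames are handled
    by repeating the first and last frame.
    """
    feats = np.asarray(features, dtype=float)
    n = len(feats)
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = np.zeros_like(feats)
    for k in range(1, window + 1):
        ahead = padded[window + k : window + k + n]    # c[t + k]
        behind = padded[window - k : window - k + n]   # c[t - k]
        out += k * (ahead - behind)
    return out / denom


# Synthetic "static" features: two coefficients growing linearly with time.
static = np.outer(np.arange(10.0), [1.0, 2.0])
augmented = np.hstack([static, delta(static)])  # static + delta per frame
```

For a linear ramp the interior delta values equal the slope, which gives a quick sanity check of the implementation.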
36
Spectral envelope reconstructed from different feature parameters
[Figure: amplitude (dB) against frequency (Hz). Curves: FFT-based signal spectrum; LP spectrum; spectrum derived from LP-cepstrum; cepstral processing spectrum]
37
Speaker verification problem
• Consider that the system has been trained using samples of the input waveform provided by the client.
• Each sample is represented by a feature vector
• The feature vectors are assumed to be normally distributed with a mean vector and a covariance matrix
• The training speech segment is long enough to create a representative model.
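In symbols, and with notation that is assumed here rather than taken from the slides: let y_1, ..., y_M be the d-dimensional training vectors of the client, class ω_C. The maximum likelihood estimates of the model parameters and the resulting client density are:

```latex
\hat{\boldsymbol{\mu}} = \frac{1}{M}\sum_{j=1}^{M}\mathbf{y}_j ,
\qquad
\hat{\boldsymbol{\Sigma}} = \frac{1}{M}\sum_{j=1}^{M}
  (\mathbf{y}_j-\hat{\boldsymbol{\mu}})(\mathbf{y}_j-\hat{\boldsymbol{\mu}})^{T}

p(\mathbf{x}\mid\omega_C) = (2\pi)^{-d/2}\,|\boldsymbol{\Sigma}|^{-1/2}
  \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}
  \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big)
```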
38
Hypothesis testing
• Now a test speech segment is acquired from a speaker claiming to be the client
• Given a feature vector corresponding to a waveform sample, the probability that the claim is true is the a posteriori probability of the client class given that vector
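Written out with Bayes' rule, in assumed notation (x is the feature vector, ω_C the client class and ω_I the impostor class), this probability would take the form:

```latex
P(\omega_C \mid \mathbf{x}) =
  \frac{p(\mathbf{x}\mid\omega_C)\,P(\omega_C)}
       {p(\mathbf{x}\mid\omega_C)\,P(\omega_C) + p(\mathbf{x}\mid\omega_I)\,P(\omega_I)}
```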
39
Likelihood Ratio
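• The claim will be accepted if the a posteriori probability of the client exceeds that of an impostor
• Assuming the priors are equal, the test becomes a comparison of the client and impostor likelihoods
• This is also referred to as the likelihood ratio test
• Note: we base the decision on more than one sample, hence on the joint likelihood of all the samples in the test segment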
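A sketch of these steps in assumed notation, with ω_C the client class, ω_I the impostor class and x_1, ..., x_N the feature vectors of the test segment:

```latex
% Accept the claim if the client posterior is the larger one
P(\omega_C \mid \mathbf{x}) > P(\omega_I \mid \mathbf{x})
\;\Longleftrightarrow\;
p(\mathbf{x}\mid\omega_C)\,P(\omega_C) > p(\mathbf{x}\mid\omega_I)\,P(\omega_I)

% Equal priors: likelihood ratio test
\frac{p(\mathbf{x}\mid\omega_C)}{p(\mathbf{x}\mid\omega_I)} > 1

% Decision based on N samples
\frac{p(\mathbf{x}_1,\dots,\mathbf{x}_N\mid\omega_C)}
     {p(\mathbf{x}_1,\dots,\mathbf{x}_N\mid\omega_I)} > 1
```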
40
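• Assuming the samples are independent and identically distributed, we can express the joint probability density as the product of the marginal densities
• Substituting and taking a log turns the test into an inequality between sums of per-sample log-likelihoods
• The left hand side of the inequality can be expanded using the normal density of the client model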
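These steps can be sketched as follows. The notation is assumed: x_1, ..., x_N are the d-dimensional test vectors, ω_C and ω_I the client and impostor classes, and μ, Σ the mean and covariance of the client model.

```latex
% i.i.d. samples: the joint density factorises
p(\mathbf{x}_1,\dots,\mathbf{x}_N\mid\omega) = \prod_{i=1}^{N} p(\mathbf{x}_i\mid\omega)

% Substituting into the likelihood ratio test and taking the log
\sum_{i=1}^{N}\log p(\mathbf{x}_i\mid\omega_C) > \sum_{i=1}^{N}\log p(\mathbf{x}_i\mid\omega_I)

% Left hand side for the Gaussian client model
\sum_{i=1}^{N}\log p(\mathbf{x}_i\mid\omega_C) =
  -\frac{Nd}{2}\log 2\pi - \frac{N}{2}\log|\boldsymbol{\Sigma}|
  - \frac{1}{2}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})^{T}
    \boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\boldsymbol{\mu})
```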
41
The first term can be rewritten as
Thus the decision rule finally becomes
Its symmetric form
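A hedged sketch of this rewriting, in assumed notation: μ and Σ are the client model mean and covariance, x_1, ..., x_N the probe vectors, and m and S the probe sample mean and covariance. Taking the term in question to be the quadratic sum of the Gaussian log-likelihood, collecting the impostor side and the constants into a threshold θ, and symmetrising by adding the same criterion with the roles of model and probe exchanged are all assumptions about how the derivation proceeds.

```latex
% Probe statistics
\mathbf{m} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i ,
\qquad
\mathbf{S} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i-\mathbf{m})(\mathbf{x}_i-\mathbf{m})^{T}

% Quadratic term expressed through the probe statistics
\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\boldsymbol{\mu})
  = N\,\mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S})
  + N\,(\mathbf{m}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{m}-\boldsymbol{\mu})

% Resulting decision rule: accept the claim if
\log|\boldsymbol{\Sigma}| + \mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S})
  + (\mathbf{m}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{m}-\boldsymbol{\mu}) < \theta

% One possible symmetric criterion
\log|\boldsymbol{\Sigma}| + \log|\mathbf{S}|
  + \mathrm{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{S}) + \mathrm{tr}(\mathbf{S}^{-1}\boldsymbol{\Sigma})
  + (\mathbf{m}-\boldsymbol{\mu})^{T}(\boldsymbol{\Sigma}^{-1}+\mathbf{S}^{-1})(\mathbf{m}-\boldsymbol{\mu})
```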
42
Notes
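• The number of samples for both model and probe should be as large as possible
• Ignoring the means, the criterion depends only on the model and probe covariance matrices
• If model and probe match, i.e. their covariance matrices are equal, the product of one matrix with the inverse of the other is an identity matrix, i.e. an isotropic distribution. Hence the matching criterion measures ‘sphericity’.
• It is a sphericity measure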
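A small numerical illustration of the idea. The first function evaluates the normalised trace of the product of the inverse model covariance and the probe covariance. The second is the arithmetic-harmonic sphericity measure known from the speaker recognition literature; it is included as an example of a symmetric sphericity measure and is not necessarily the form derived in the lecture. The matrices are made up.

```python
import numpy as np


def trace_criterion(model_cov, probe_cov):
    """tr(Sigma^-1 S) / d: equals 1 when the two covariances coincide."""
    d = model_cov.shape[0]
    return np.trace(np.linalg.solve(model_cov, probe_cov)) / d


def arithmetic_harmonic_sphericity(model_cov, probe_cov):
    """Log of (arithmetic mean / harmonic mean) of the eigenvalues of Sigma^-1 S.

    Zero if and only if S is proportional to Sigma, positive otherwise.
    """
    d = model_cov.shape[0]
    a = np.trace(np.linalg.solve(model_cov, probe_cov))  # tr(Sigma^-1 S)
    b = np.trace(np.linalg.solve(probe_cov, model_cov))  # tr(S^-1 Sigma)
    return np.log(a * b / d ** 2)


sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # model covariance (illustrative)
other = np.array([[1.0, 0.0], [0.0, 4.0]])   # mismatched probe covariance

match = arithmetic_harmonic_sphericity(sigma, sigma)
scaled = arithmetic_harmonic_sphericity(sigma, 3.0 * sigma)
mismatch = arithmetic_harmonic_sphericity(sigma, other)
```

The measure is insensitive to an overall scaling of the probe covariance, so `scaled` is zero like `match`, while `mismatch` is strictly positive.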
43
Other sphericity measures
• In essence, in matching a probe and a model, we are measuring the distance between two Gaussian probability densities
• Any feature selection criterion could be used for that purpose
• The derived sphericity measure resembles divergence
• Bhattacharyya measure
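As an illustration of one such criterion, the Bhattacharyya distance between two Gaussian densities has a closed form. The sketch below implements that standard formula; the function name and the test values are illustrative.

```python
import numpy as np


def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mean term: (1/8) * diff^T cov^-1 diff
    mean_term = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Covariance term: (1/2) * ln(|cov| / sqrt(|cov1| |cov2|)), via log-determinants
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    cov_term = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return mean_term + cov_term


same = bhattacharyya_gaussian([0.0, 0.0], np.eye(2), [0.0, 0.0], np.eye(2))
shifted = bhattacharyya_gaussian([0.0], [[1.0]], [2.0], [[1.0]])
```

Identical densities give a distance of zero; for two unit-variance Gaussians whose means differ by 2, only the mean term contributes.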
44
Decision threshold
• The matching process maps multidimensional speech data into 1D space
• In theory, the decision threshold could be derived from the known parameters
• In practice
  - The distributions will not be exactly Gaussian
  - The parameters are estimates subject to error
  - We may wish to control the trade-off between false acceptances and false rejections
• Hence, the decision threshold is determined
  - Experimentally
  - By modelling score distributions
45
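Matching score distribution
If s > Threshold: reject the claimant
If s ≤ Threshold: accept the claimant
[Figure: matching score distribution along the score axis s, with the accept region below the threshold and the reject region above it]
The selected threshold defines an operating point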
46
ROC curve
• ROC: receiver operating characteristic
• Defines a relationship between the operating point, false acceptances and false rejections in verification
• DET curve: ROC error rates plotted on a non-linear (normal deviate) scale
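The points of such a curve can be obtained by sweeping the threshold over the observed scores. The sketch below assumes distance-type scores (accept when the score does not exceed the threshold) and uses made-up score values; the equal error rate is approximated by the sweep point where the two error rates are closest.

```python
import numpy as np


def roc_points(client_scores, impostor_scores):
    """Return thresholds with the matching FAR and FRR values."""
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([client, impostor]))
    far = np.array([np.mean(impostor <= t) for t in thresholds])
    frr = np.array([np.mean(client > t) for t in thresholds])
    return thresholds, far, frr


def equal_error_rate(far, frr):
    """Approximate EER: the operating point where FAR and FRR are closest."""
    i = int(np.argmin(np.abs(far - frr)))
    return 0.5 * (far[i] + frr[i]), i


client = [0.2, 0.4, 0.5, 0.9]
impostor = [0.6, 1.1, 1.3, 1.8]
thresholds, far, frr = roc_points(client, impostor)
eer, idx = equal_error_rate(far, frr)
eer_threshold = thresholds[idx]
```

Plotting `far` against `frr` gives the ROC; a DET plot shows the same pairs on warped axes.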
47
Score normalisation
• It may be desirable to normalise the scores, e.g. for the purposes of fusion or threshold determination
• Possibilities
  - Map to posterior probabilities
  - Map to designated means, e.g. so that the client and impostor means coincide with -1 and +1 respectively
  - Map so that the variance of the score is normalised
  - Others
48
Score normalisation (cont)
Min-max
Scaling
Z-score
49
Score normalisation (cont)
Median
Double sigmoid
Tanh
Min-max, Z-score and tanh are efficient; median, double sigmoid and tanh are robust
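The normalisation rules named above can be sketched as follows, using the forms in which they commonly appear in the score fusion literature. Two simplifications are assumed: the parameters are estimated from the same array that is normalised (in practice they would come from a training set), and the tanh rule uses the plain mean and standard deviation instead of robust estimates. The double sigmoid parameters t, r1 and r2 are user-chosen.

```python
import numpy as np


def min_max(s):
    """Map scores linearly onto [0, 1]."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min())


def z_score(s):
    """Zero mean, unit standard deviation."""
    s = np.asarray(s, dtype=float)
    return (s - s.mean()) / s.std()


def median_mad(s):
    """Centre on the median, scale by the median absolute deviation."""
    s = np.asarray(s, dtype=float)
    med = np.median(s)
    mad = np.median(np.abs(s - med))
    return (s - med) / mad


def tanh_norm(s):
    """Tanh rule, here with non-robust mean and standard deviation."""
    s = np.asarray(s, dtype=float)
    return 0.5 * (np.tanh(0.01 * (s - s.mean()) / s.std()) + 1.0)


def double_sigmoid(s, t, r1, r2):
    """Sigmoid centred at t with different slopes below (r1) and above (r2)."""
    s = np.asarray(s, dtype=float)
    r = np.where(s < t, r1, r2)
    return 1.0 / (1.0 + np.exp(-2.0 * (s - t) / r))


scores = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # the last value acts as an outlier
```

With these scores the outlier stretches the min-max and Z-score results, whereas the median-based rule still maps the bulk of the scores to small values.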
50
Score normalisation (cont)
Mapping to designated means (for verification)
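One linear map with this property, given as a sketch in assumed notation: s is the raw score, and \bar{s}_C and \bar{s}_I are the mean client and impostor scores estimated on an evaluation set. It sends the client mean to -1 and the impostor mean to +1.

```latex
s' = 2\,\frac{s-\bar{s}_C}{\bar{s}_I-\bar{s}_C} - 1 ,
\qquad
s=\bar{s}_C \;\Rightarrow\; s'=-1 ,
\qquad
s=\bar{s}_I \;\Rightarrow\; s'=+1
```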
51
Score normalisation: a posteriori class probabilities
• A posteriori class probabilities are automatically normalised to [0,1]
• Some systems compute a matching score rather than an a posteriori probability
• Scores have to be normalised to facilitate fusion by simple rules
  - a posteriori probability estimate
52
Score distribution modelling
• Probability density functions of authentic claims and impostors can be estimated
  - Parametric/nonparametric pdfs, e.g. Gaussian pdf
• Standard deviation for true claims is likely to be smaller than for impostors
• For distance type scores, the mean of true claim scores is lower than the mean of impostor scores
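A minimal sketch of the parametric route, assuming Gaussian score distributions and distance-type scores (accept when the score does not exceed the threshold). The fitted models predict the error rates at any threshold without counting individual errors; the scores below are made up.

```python
import math

import numpy as np


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal density."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def fit_gaussian(scores):
    """Sample mean and (unbiased) standard deviation of a set of scores."""
    s = np.asarray(scores, dtype=float)
    return float(s.mean()), float(s.std(ddof=1))


def predicted_error_rates(client_scores, impostor_scores, threshold):
    """Model-based (FAR, FRR) at the given threshold."""
    mc, sc = fit_gaussian(client_scores)
    mi, si = fit_gaussian(impostor_scores)
    far = normal_cdf(threshold, mi, si)         # impostor mass below the threshold
    frr = 1.0 - normal_cdf(threshold, mc, sc)   # client mass above the threshold
    return far, frr


client = [1.0, 2.0, 3.0]
impostor = [4.0, 5.0, 6.0]
far, frr = predicted_error_rates(client, impostor, threshold=2.0)
```

Placing the threshold at the client mean rejects half of the modelled true claims, so `frr` is 0.5 here, while `far` is the small impostor tail below 2.0.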
53
Example