speaker and speech recognition: speaker recognition and...

1

1

Centre for Vision, Speech and Signal Processing

Speaker and Speech Recognition: Speaker Recognition and Verification

Josef Kittler
Centre for Vision, Speech and Signal Processing

University of Surrey, Guildford GU2 7XH

[email protected]

2

OUTLINE

• Introduction
• Terminology
• Problem formulation
• Speech representation
• Text independent methods
• Text dependent methods

2

3

Introduction

Person identification is crucial to the fabric of society:
• Security
• Access to services
• Business transactions
• Law enforcement
• Border control

4

Establishing Identity

One or more of the following:
• What the entity knows (e.g. password)
• What the entity has (e.g. badge, smart card)
• What the entity is (e.g. fingerprints, retinal characteristics)
• Where the entity is (e.g. in front of a particular terminal)

3

5

Authentication Overview

Authenticator type | Proof              | Defense        | Traditional example | Digital example    | Flaw
Secret             | Secrecy, obscurity | Closely kept   | Combo lock          | Computer password  | Less secret with each use
Token              | Possession         | Closely held   | Metal key           | Key-less car entry | Insecure if lost
ID                 | Uniqueness         | Copy-resistant | Driver's license    | Biometric access   | Difficult to replace

6

Authentication Overview

Authenticator subtypes:
1. Secret
   - secrecy, e.g., password
   - obscurity, e.g., mother’s maiden name, SSN
2. Token
   - active, e.g., synchronised password generator
   - passive, e.g., smart card password storage
3. ID
   - inalterable, e.g., fingerprint, face, hand, eye
   - alterable biometric signal, e.g., voice, keystroke, signature

4

7

Biometric functionalities

Biometrics: a means to prevent identity theft

Biometric functionalities:
• Verification
• Identification
• Screening (watch list)
• Retrieval (detection of multiple identities)
• Negation of denial

8

Historical notes

• Bertillon system
• Advantages of biometrics
  - Convenience
  - Fraud reduction
• Disadvantages
  - Output is a score
  - Cannot be replaced
  - Open to attack
  - Privacy concerns

5

9

Application Characteristics

• Cooperative / non-cooperative
• Scalable / non-scalable
• Private / public
• Closed / open
• High-level / low-level security
• Habituated / non-habituated

10

Selection criteria

• Universality (all users possess this biometric)
• Permanence
• Uniqueness
• Collectability
• Acceptability
• Open to attack
• Failure to enroll

6

11

Introduction

Voice-based speaker recognition/verification: why?
Voice is one of the main biometric modalities used by humans for personal identity recognition and verification.
• User friendly
• Natural interface
• Conveys emotion
• Can be used over the telephone

12

Terminology

Speaker identification -- using utterances from a speaker, determine who he/she is out of a set of known speakers

Speaker verification -- using utterances from a speaker, determine whether the caller is who he/she claims to be (requires an identity claim)

Training -- using utterances from a speaker to train a unique voiceprint that can later be used to identify/verify a speaker. Applies to both SI and SV.

7

13

Voice biometrics properties

Biometric signal (voice)
Speaker verification is very compelling:
• Voice is convenient.
• Voice is ubiquitous.
• Voice is inexpensive.
• Voice provides challenge-response security.

Unfortunately:
• It is sometimes inconvenient because it does not work well everywhere, and it requires a back-up solution.

14

Voice Recognition Applications

• Voice-based identity verification for
  - Web banking
  - Physical access control
  - Border control
  - Identity cards/driver’s licence
  - Personalisation
• Entertainment
  - Robotic toys
• Law enforcement/surveillance
  - Telephone surveillance

8

15

Application to web banking

[Figure: system diagram. Labels: Client, Smart Card, Card Reader (twice), Microphone, Internet, Server, Services Provider Server, Private Network.]

16

Biometric Applications

• Forensic: criminal investigation
• Government: identity card/passport, driver’s licence, social security
• Commercial: ATM, e-commerce/banking, mobile phone, nomadic working, personalisation

9

Market Outlook for the Future

* Voice ID: Applications and Markets for the New Millennium, J. Markowitz, Consultants, 1999

Context of speaker recognition

[Figure: taxonomy diagram. Labels: Speech Processing (input/output: digitized speech), Speech Recognition, Speech Synthesis, Voice Biometrics, Speaker Verification, Speaker Identification.]

10

19

Voice template

• Set of features extracted from the raw biometric data during enrolment
• Represents typical values of the voice biometric
• Multiple templates may be stored to account for intra-class variability
• Template issues
  - Aging (maintenance)
  - Central/distributed storage
  - Privacy and protection

20

Deficiencies of existing commercial solutions

• Sensitive to speech acquisition conditions
• Sensitive to background noise
• Sensitive to emotional state
• Sensitive to physical state of the user
• Quite effective for closed-set applications under tightly controlled voice acquisition conditions

11

21

Main causes of acoustic variation in speech

• Speaker: voice quality, pitch, gender, dialect
• Speaking style: stress/emotion, speaking rate, Lombard effect
• Task/context: man-machine dialogue, dictation, free conversation
• Phonetic/prosodic context
• Noise: other speakers, background noise, reverberations
• Microphone: distortion, electrical noise, directional characteristics
• Channel: distortion, noise, echoes, dropouts
All of these act on the signal before it reaches the speaker recognition system.

22

Voice recognition challenge

• Large number of classes
• Segmentation
• Noise and distortion
• Variability of deployed microphones (interoperability)
• Population coverage and scalability
• System performance
• System attacks
• Aging
• Non-uniqueness of biometrics
• Privacy concerns (smart cards)

12

24

Generic Architecture

• Silence detection
• Energy normalisation
• Feature extraction
• Recognition

Related processes
• Background noise removal
• Phoneme/speech recognition

25

Privacy issues

• Template protection
• Biometrics can be used to track people secretly, a violation of their right to privacy ("big brother")
• Biometric data may be used for purposes other than those intended
• Biometric database linking

13

26

Evaluation protocol

• Test data and procedures adopted to evaluate a biometric system
• Evaluation should be conducted by an independent body
• Test and biometric data used should not have been seen previously by the system
• Data use cases
  - Training set
  - Evaluation set
  - Test set
• Data set sizes should allow statistically significant evaluation

27

XM2VTS database

• XM2VTS database (Messer et al. 1999)
  - Face images and speech recordings of 295 people
  - Subjects recorded in 4 sessions
  - Frontal view images selected from the rotation sequence
• Lausanne Protocol

14

28

Performance Evaluation

Performance criteria
• Failure to enrol
• Accuracy
• Speed
• Storage
• Costs
• Ease of use
• Failure to acquire

29

Accuracy

• Measured in terms of
  - False rejections/identifications
  - False acceptances
• Falsely accepted users are impostors
• Performance characterisation issues
  - Genuine ambiguity
  - Confidence
  - Competence

15

30

Performance characterisation (verification)

• False rejection
• False acceptance
• Total error rate / half total error rate
• Operating point
  - Equal error rate (civilian)
  - Zero false acceptance (high security)
  - Zero false rejection (forensic)
• Test set/evaluation set
• Receiver operating characteristic
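The error measures listed on this slide can be sketched as follows. This is a minimal illustration, assuming distance-type scores (a claim is accepted when the score is at or below the threshold) and NumPy; the function names are hypothetical and not taken from any particular toolkit.

```python
import numpy as np

def error_rates(client_scores, impostor_scores, threshold):
    """Return (FRR, FAR) for distance-type scores: accept when score <= threshold."""
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    frr = float(np.mean(client > threshold))      # true claims wrongly rejected
    far = float(np.mean(impostor <= threshold))   # impostor claims wrongly accepted
    return frr, far

def half_total_error_rate(client_scores, impostor_scores, threshold):
    """Average of the false rejection and false acceptance rates."""
    frr, far = error_rates(client_scores, impostor_scores, threshold)
    return 0.5 * (frr + far)

def equal_error_threshold(client_scores, impostor_scores):
    """Observed score at which |FAR - FRR| is smallest (approximate EER point)."""
    candidates = np.unique(np.concatenate([client_scores, impostor_scores]))
    gaps = []
    for t in candidates:
        frr, far = error_rates(client_scores, impostor_scores, t)
        gaps.append(abs(far - frr))
    return float(candidates[int(np.argmin(gaps))])
```

In keeping with the evaluation protocol, the threshold would normally be chosen on the evaluation set and the error rates then reported on the separate test set.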

31

Performance characterisation (identification)

Confusion matrix
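A confusion matrix for closed-set identification can be built as in the sketch below. It assumes speakers are labelled by integer indices 0..n-1, with rows for the true speaker and columns for the identified speaker; both conventions and the function names are assumptions made for illustration.

```python
import numpy as np

def confusion_matrix(true_ids, predicted_ids, n_speakers):
    """counts[i, j] = number of utterances from speaker i identified as speaker j."""
    counts = np.zeros((n_speakers, n_speakers), dtype=int)
    for true_id, predicted_id in zip(true_ids, predicted_ids):
        counts[true_id, predicted_id] += 1
    return counts

def identification_rate(counts):
    """Fraction of utterances on the diagonal, i.e. correctly identified."""
    return float(np.trace(counts) / counts.sum())
```

Off-diagonal entries show which pairs of speakers the system confuses, which a single accuracy figure hides.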

16

32

Reference

1. D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing 10 (2000), 19-41.

2. A. Martin, G. Doddington, T. Kamm, M. Ordowski, M. Przybocki, The DET curve in assessment of detection task performance, Eurospeech 97, 1895-1898.

33

Preprocessing and feature extraction

[Figure: processing chain. Blocks: A/D converter, Hamming windowing, LPC analysis. Output: cepstrum and delta cepstrum coefficients.]

17

34

Speech input and spectra

[Figure: speech input and spectra for a client and an impostor.]

35

Speech representation

• MFCC feature vectors (24-filterbank analysis), with delta coefficients and delta log-energy appended (2-coefficient window)
• 33-component feature vectors
• Energy normalisation:
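The delta coefficients mentioned on this slide can be sketched as below. The sketch assumes the common regression formula and reads the "2-coefficient window" as two frames on either side of the current frame; that reading, the edge padding and the function name are assumptions, not details given on the slide.

```python
import numpy as np

def delta(features, half_window=2):
    """Regression-based delta coefficients.

    features: array of shape (n_frames, n_coeffs), e.g. one MFCC vector per row.
    half_window: number of frames used on each side (assumed to be 2 here).
    """
    feats = np.asarray(features, dtype=float)
    n_frames = len(feats)
    # Repeat the first and last frames so that every frame has neighbours.
    padded = np.pad(feats, ((half_window, half_window), (0, 0)), mode="edge")
    norm = 2.0 * sum(k * k for k in range(1, half_window + 1))
    out = np.zeros_like(feats)
    for k in range(1, half_window + 1):
        ahead = padded[half_window + k : half_window + k + n_frames]
        behind = padded[half_window - k : half_window - k + n_frames]
        out += k * (ahead - behind)
    return out / norm
```

The same function applied to the log-energy track (as a single-column array) would give the delta log-energy term.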

18

36

Spectral envelope reconstructed from different feature parameters

[Figure: amplitude (dB) against frequency (Hz) for the FFT-based signal spectrum, the LP spectrum, the spectrum derived from the LP-cepstrum, and the cepstral-processing spectrum.]

37

Speaker verification problem

• Consider that the system has been trained using samples of the input waveform provided by the client.
• Each sample is represented by a feature vector.
• The feature vectors are assumed to be normally distributed with mean μ and covariance matrix Σ.
• The training speech segment is assumed to be long enough to create a representative model.

19

38

Hypothesis testing

Now a test speech segment is acquired from a speaker claiming to be the client.

Given a feature vector x corresponding to a waveform sample, the probability that the claim is true is given as

39

Likelihood Ratio

The claim will be accepted if

Assuming the priors are equal, the test becomes

This is also referred to as the likelihood ratio

Note: we base the decision on more than one sample, hence on
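As a hedged sketch of the quantities these two slides refer to, the standard Bayesian formulation can be written as follows. The symbols $\omega_C$ (the claim is true), $\omega_I$ (the speaker is an impostor) and $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ (the set of test vectors) are assumptions introduced here and may differ from the slide's own notation.

```latex
% Posterior probability that the claim is true (Bayes' rule)
P(\omega_C \mid \mathbf{x}) =
  \frac{p(\mathbf{x} \mid \omega_C)\, P(\omega_C)}
       {p(\mathbf{x} \mid \omega_C)\, P(\omega_C) + p(\mathbf{x} \mid \omega_I)\, P(\omega_I)}

% Accept the claim if the client hypothesis is the more probable one
P(\omega_C \mid \mathbf{x}) > P(\omega_I \mid \mathbf{x})

% With equal priors this reduces to a likelihood ratio test
\frac{p(\mathbf{x} \mid \omega_C)}{p(\mathbf{x} \mid \omega_I)} > 1

% Using all N test vectors rather than a single one
\frac{p(X \mid \omega_C)}{p(X \mid \omega_I)} > 1
```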

20

40

Assuming the samples are independent and identically distributed, we can express the joint probability density in terms of marginals as

Substituting and taking a log we find

The left hand side of the inequality can be expanded as

41

The first term can be rewritten as

Thus the decision rule finally becomes

Its symmetric form
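A minimal sketch of where this derivation leads, under stated assumptions: a single Gaussian client model with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, a probe of $N$ independent $d$-dimensional vectors $\mathbf{x}_j$ with sample mean $\mathbf{m}$ and maximum-likelihood covariance $S$, and a generic threshold $\theta$ that absorbs the impostor term. The symbols $\mathbf{m}$, $S$, $d$ and $\theta$ are introduced here for illustration.

```latex
% Average log-likelihood of the probe under the client model
\frac{1}{N} \sum_{j=1}^{N} \log p(\mathbf{x}_j \mid \omega_C)
  = -\frac{1}{2} \Big[ d \log 2\pi + \log |\Sigma|
    + \frac{1}{N} \sum_{j=1}^{N} (\mathbf{x}_j - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x}_j - \boldsymbol{\mu}) \Big]

% The quadratic term expressed through the probe statistics
\frac{1}{N} \sum_{j=1}^{N} (\mathbf{x}_j - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x}_j - \boldsymbol{\mu})
  = \operatorname{tr}\!\big( \Sigma^{-1} S \big)
    + (\mathbf{m} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{m} - \boldsymbol{\mu})

% Resulting decision rule: accept the claim if
\operatorname{tr}\!\big( \Sigma^{-1} S \big) + \log |\Sigma|
  + (\mathbf{m} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{m} - \boldsymbol{\mu}) \le \theta

% A symmetric, covariance-only variant, zero when S = \Sigma
\frac{1}{2} \Big[ \operatorname{tr}\!\big( \Sigma^{-1} S \big) + \operatorname{tr}\!\big( S^{-1} \Sigma \big) \Big] - d \ge 0
```

The second line follows from splitting $\mathbf{x}_j - \boldsymbol{\mu}$ into $(\mathbf{x}_j - \mathbf{m}) + (\mathbf{m} - \boldsymbol{\mu})$; the cross terms vanish because the deviations from the sample mean sum to zero.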

21

42

Notes

• The number of samples for both model and probe should be as large as possible

• Ignoring means we have

• If model and probe match, the product of the two matrices is an identity matrix, i.e. an isotropic distribution. Hence the matching criterion measures ‘sphericity’.

• It is a sphericity measure
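The matching criterion discussed in these notes can be sketched in code as follows. This is an illustration under assumptions: the model and the probe are each summarised by a mean and a maximum-likelihood covariance, features have at least two dimensions, and the score is normalised by the probe-only terms so that a perfect match gives zero. The function names are hypothetical.

```python
import numpy as np

def fit_gaussian(frames):
    """Mean and ML covariance (divide by N) of feature vectors, one per row."""
    x = np.asarray(frames, dtype=float)
    return x.mean(axis=0), np.cov(x, rowvar=False, bias=True)

def gaussian_match(model_mean, model_cov, probe_mean, probe_cov, use_means=True):
    """Distance-type score: 0 for a perfect match, larger for a worse match."""
    model_cov = np.asarray(model_cov, dtype=float)
    probe_cov = np.asarray(probe_cov, dtype=float)
    dim = model_cov.shape[0]
    ratio = np.linalg.solve(model_cov, probe_cov)   # model_cov^{-1} @ probe_cov
    _, logdet = np.linalg.slogdet(ratio)
    score = np.trace(ratio) - logdet - dim          # zero when ratio is the identity
    if use_means:
        diff = np.asarray(probe_mean, dtype=float) - np.asarray(model_mean, dtype=float)
        score += diff @ np.linalg.solve(model_cov, diff)
    return 0.5 * float(score)

def symmetric_sphericity(cov_a, cov_b):
    """Symmetric covariance-only measure: 0 when the two covariances are equal."""
    cov_a = np.asarray(cov_a, dtype=float)
    cov_b = np.asarray(cov_b, dtype=float)
    dim = cov_a.shape[0]
    ab = np.linalg.solve(cov_a, cov_b)
    ba = np.linalg.solve(cov_b, cov_a)
    return 0.5 * float(np.trace(ab) + np.trace(ba)) - dim
```

Setting `use_means=False` corresponds to ignoring the means, in which case the score depends only on how far the product of the two matrices is from the identity.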

43

Other sphericity measures

• In essence, in matching a probe and a model, we are measuring the distance between two Gaussian probability densities

• Any feature selection criterion could be used for that purpose

• The derived sphericity measure resembles divergence

• Bhattacharyya measure
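As one example of such a measure, the Bhattacharyya distance between two Gaussian densities has a closed form, sketched below with NumPy. The function name and argument order are illustrative choices.

```python
import numpy as np

def bhattacharyya_distance(mean_a, cov_a, mean_b, cov_b):
    """Bhattacharyya distance between N(mean_a, cov_a) and N(mean_b, cov_b)."""
    mean_a = np.asarray(mean_a, dtype=float)
    mean_b = np.asarray(mean_b, dtype=float)
    cov_a = np.asarray(cov_a, dtype=float)
    cov_b = np.asarray(cov_b, dtype=float)
    cov_avg = 0.5 * (cov_a + cov_b)
    diff = mean_a - mean_b
    # Term driven by the difference between the means.
    mean_term = 0.125 * diff @ np.linalg.solve(cov_avg, diff)
    # Term driven by the difference between the covariances.
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    _, logdet_a = np.linalg.slogdet(cov_a)
    _, logdet_b = np.linalg.slogdet(cov_b)
    cov_term = 0.5 * (logdet_avg - 0.5 * (logdet_a + logdet_b))
    return float(mean_term + cov_term)
```

The distance is zero only when both the means and the covariances coincide, and it is symmetric in its two arguments.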

22

44

Decision threshold

• The matching process maps multidimensional speech data into 1D space
• In theory, the decision threshold could be derived from the known parameters
• In practice
  - The distributions will not be exactly Gaussian
  - The parameters are estimates subject to error
  - We may wish to control the trade-off between false acceptances and false rejections
• Hence, the decision threshold is determined
  - Experimentally
  - By modelling score distributions

45

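Matching score distribution

[Figure: distribution of the matching score s, with the threshold separating the Accept and Reject regions.]
• If s > Threshold: reject the claimant
• If s ≤ Threshold: accept the claimant
• The selected threshold defines an operating point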

23

46

ROC curve

• ROC: receiver operating characteristic
• Defines the relationship between the operating point, false acceptances and false rejections in verification
• DET curve: the same trade-off plotted on non-linear (normal deviate) axes instead of linear ones
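The points of a ROC curve, and the axis transformation used for a DET plot, can be sketched as below. The sketch assumes distance-type scores (accept when the score is at or below the threshold); the clipping constant `eps` and the function names are assumptions made so that rates of exactly 0 or 1 remain finite on the transformed axis.

```python
from statistics import NormalDist

import numpy as np

def roc_points(client_scores, impostor_scores):
    """False acceptance and false rejection rates at every observed threshold."""
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.unique(np.concatenate([client, impostor]))
    far = np.array([np.mean(impostor <= t) for t in thresholds])
    frr = np.array([np.mean(client > t) for t in thresholds])
    return thresholds, far, frr

def det_axis(rates, eps=1e-4):
    """Map error rates to normal deviates, the axis scale used in DET plots."""
    clipped = np.clip(np.asarray(rates, dtype=float), eps, 1.0 - eps)
    inverse_cdf = NormalDist().inv_cdf
    return np.array([inverse_cdf(float(r)) for r in clipped])
```

Plotting `det_axis(frr)` against `det_axis(far)` gives the DET curve; plotting `frr` against `far` directly gives the ROC in the form used on this slide.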

47

Score normalisation

• It may be desirable to normalise the scores, e.g. for the purposes of fusion or threshold determination
• Possibilities
  - Map to posterior probabilities
  - Map to designated means, e.g. so that the client and impostor means coincide with -1 and +1 respectively
  - Map so that the variance of the score is normalised
  - Others

24

48

Score normalisation (cont)

Min-max

Scaling

Z-score

49


Median

Double sigmoid

Tanh

Min-max, Z-score and tanh are efficient; median, double sigmoid and tanh are robust.
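Four of the normalisation schemes named on these two slides can be sketched as follows. The formulas are the commonly used textbook forms and are assumptions here, since the slide text gives only the names. In particular, the tanh variant below uses the plain mean and standard deviation, whereas robust versions are usually given with Hampel estimators; the constant 0.01 is likewise a conventional choice.

```python
import numpy as np

def min_max(scores):
    """Map scores linearly onto [0, 1] using the observed minimum and maximum."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def z_score(scores):
    """Zero mean and unit standard deviation."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std()

def median_mad(scores):
    """Robust analogue of the z-score: median and median absolute deviation."""
    s = np.asarray(scores, dtype=float)
    med = np.median(s)
    mad = np.median(np.abs(s - med))
    return (s - med) / mad

def tanh_norm(scores, spread=0.01):
    """Tanh mapping into (0, 1); mean/std stand in for robust location/scale."""
    s = np.asarray(scores, dtype=float)
    return 0.5 * (np.tanh(spread * (s - s.mean()) / s.std()) + 1.0)
```

All four functions assume the scores are not all identical, since each divides by a measure of spread. In practice the statistics would be estimated on a training or evaluation set and then applied to new scores.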

25

50


Mapping to designated means (for verification)
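One way to realise this mapping is a linear transformation that sends the client mean to -1 and the impostor mean to +1. The sketch below assumes the two means have already been estimated from training or evaluation scores; the function and parameter names are hypothetical.

```python
import numpy as np

def map_to_designated_means(scores, client_mean, impostor_mean,
                            client_target=-1.0, impostor_target=1.0):
    """Linear map sending client_mean to client_target and impostor_mean to impostor_target."""
    s = np.asarray(scores, dtype=float)
    scale = (impostor_target - client_target) / (impostor_mean - client_mean)
    return client_target + scale * (s - client_mean)
```

After this mapping, scores from different systems share a common scale, and 0 lies midway between the two class means.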

51

Score normalisation: a posteriori class probabilities

• A posteriori class probabilities are automatically normalised to [0,1]
• Some systems compute a matching score rather than an a posteriori probability
• Scores have to be normalised to facilitate fusion by simple rules
  - a posteriori probability estimate

26

52

Score distribution modelling

• Probability density functions of authentic claims and impostors can be estimated
  - Parametric/nonparametric pdfs, e.g. Gaussian pdf
• Standard deviation for true claims is likely to be smaller than for impostors
• For distance-type scores, the mean of true claim scores is lower than the mean of impostor scores
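A parametric version of this modelling can be sketched with the Python standard library: fit one Gaussian to the client scores and one to the impostor scores, then convert a new score into an a posteriori probability of a true claim. The equal-prior default and the function names are assumptions for illustration.

```python
from statistics import NormalDist

def fit_score_model(scores):
    """Gaussian fitted to a list of scores (needs at least two values)."""
    return NormalDist.from_samples(scores)

def client_posterior(score, client_model, impostor_model, client_prior=0.5):
    """Posterior probability that a claim with this score is a true claim."""
    client_part = client_model.pdf(score) * client_prior
    impostor_part = impostor_model.pdf(score) * (1.0 - client_prior)
    return client_part / (client_part + impostor_part)
```

The posterior lies in [0, 1] by construction, so it can serve both as a normalised score for fusion and as a basis for choosing a threshold. For scores far from both distributions the two densities can underflow to zero, so a production version would work with log densities.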

53

Example
