voice conversion and spoofing countermeasure in speaker...

Voice Conversion and Spoofing

Countermeasure in Speaker Verification

Haizhou Li

1

SIDAS 2017, Beijing

Acknowledgement: Zhizheng Wu, Xiaohai Tian, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi

2

Kong Aik Lee, Bin Ma, and Haizhou Li, “Speaker

Verification Makes Its Debut in Smartphone”, IEEE

SLTC Newsletter, February 2013

Voice Biometrics

3

HSBC has been left red-faced after a BBC reporter and his non-identical twin tricked its voice ID authentication service.

The BBC says its “Click” (a weekly TV show) reporter Dan Simmons created an HSBC account and signed up to the bank’s service. HSBC states

that the system is secure because each person’s voice is “unique”.

As Banking Technology reported last year, HSBC launched voice recognition and touch security services in the UK, available to 15 million banking

customers. At that time, HSBC said the system “works by cross-checking against over 100 unique identifiers including both behavioural features

such as speed, cadence and pronunciation, and physical aspects including the shape of larynx, vocal tract and nasal passages”.

According to the BBC, the “bank let Dan Simmons’ non-identical twin, Joe, access the account via the telephone after he mimicked his brother’s

voice.

“Customers simply give their account details and date of birth and then say: ‘My voice is my password.’”

Despite this biometric bamboozle, Joe Simmons couldn’t withdraw money, but he was able to access balances and recent transactions, and was

offered the chance to transfer money between accounts.

Joe Simmons says: “What’s really alarming is that the bank allowed me seven attempts to mimic my brothers’ voiceprint and get it wrong, before I

got in at the eighth time of trying.”

Separately, the BBC says a Click researcher “found HSBC Voice ID kept letting them try to access their account after they deliberately failed on 20

separate occasions spread over 12 minutes”.

The BBC says Click’s successful thwarting of the system is believed to be “the first time the voice security measure has been breached”.

HSBC declined to comment to the BBC on “how secure the system had been until now”.

An HSBC spokesman says: “The security and safety of our customers’ accounts is of the utmost importance to us. Voice ID is a very secure

method of authenticating customers.

“Twins do have a similar voiceprint, but the introduction of this technology has seen a significant reduction in fraud, and has proven to be more

secure than PINS, passwords and memorable phrases.”

Not a great response is it? But very typical of the kind of bland statements that have taken hold in the UK. There is a problem and HSBC needs to

get it fixed.

The rest of the BBC report just contains security experts saying the same things – like “I’m shocked”. Whatever. No point in sharing such dull

insight.

You can see the full BBC Click investigation into biometric security in a special edition of the show on BBC News and on the iPlayer from 20 May.

Agenda

• Spoofing Attacks

• Voice Conversion

• Artifacts

• ASVspoof 2015

7

Speaker Verification

Speaker

Verification

Yes,

John!

This is

John!

Reject!

8

Speaker

Verification

Yes,

John!

Impersonation

Replay

Speech

Synthesis

Voice

Conversion

Spoofing Attacks

This is

John!Reject!

9

Spoofing attack AccessibilityEffectiveness (risk) Countermeasure

availabilityText-independent Text-dependent

Impersonation Low Low/unknown Low/unknown N.A.

Replay High LowLow

to highLow

Speech synthesis

Medium to high

High High Medium

Voice conversion

Medium to high

High High Medium

10

Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: a survey,”

Speech Communication, vol. 66, pp. 130–153, 2015.

Spoofing Attacks

11

Traits of Replay

• Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent speaker

verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2014

• Luca Cuccovillo, Patrick Aichroth, "OPEN-SET MICROPHONE CLASSIFICATION VIA BLIND CHANNEL ANALYSIS," Luca

Cuccovillo, Patrick Aichroth, ICASSP 2016

Time (Seconds)

Fre

qu

ency (

Hz)

Genuine speech

1.0 2.0 3.0

8000

0

Time (Seconds)

Fre

qu

en

cy (

Hz)

Replay speech

1.0 2.0 3.0

8000

0

12

1. A. Wang, “An industrial strength audio search algorithm,” in Proc. Int. Symposium on Music Information Retrieval (ISMIR),

2003, pp. 7–13.

2. Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent

speaker verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA

ASC) 2014.

Ideas Against Replay Attack

13

App uses smartphone compass to stop voice

hacking


Tandem double-channel pop noise detection

14


Sayaka Shiota et al, VOICE LIVENESS DETECTION FOR SPEAKER VERIFICATION BASED ON A TANDEM SINGLE/DOUBLE-

CHANNEL POP NOISE DETECTOR, Speaker Odyssey 2016

15

There are 3 subfields of Phonetics, i.e. Articulatory Phonetics, Acoustic Phonetics, and

Auditory Phonetics. Denes & Pinson (1993)

Tomi Kinnunen and Haizhou Li, “An Overview of Text-Independent Speaker Recognition: from Features to

Supervectors”, Speech Communication 52(1): 12--40, January 2010


16

• Modeling the human voice

production system

• Modeling the peripheral auditory

system


transducer, channel

state of health, mood, aging

session variability

Challenges and Opportunities

Systems assume natural speech inputs

More robust = More vulnerable

Machines and Humans listen in different ways**

Better speech perceptual quality ≠ less artifacts*

MFCC &

Hmm…

Spoofing: Speaker Verification

17

*K.K.Paliwal, et al,“Comparative Evaluation of Speech Enhancement Methods for Robust Automatic Speech Recognition,”

Int. Conf.Sig. Proce.and Comm.Sys.,Gold Coast, Australia, ICSPCS, Dec.2010

** Duc Hoang Ha Nguyen, Xiong Xiao, Eng Siong Chng, Haizhou Li, “Feature Adaptation Using Linear Spectro-Temporal

Transform for Robust Speech Recognition”, IEEE/ACM Trans. Audio, Speech & Language Processing 24(6): 1006-1019

(2016)

This is

Ming!

Speaker

Verification

Yes,

Ming!

Synthetic

Speech

Detection

Spoofing Attacks

No

Reject! Reject!

18

Agenda

19



• Artifacts

• ASVspoof 2015

AnalysisFeature

conversionSynthesis

20

Source

SpeakerTarget

Speaker

Voice Conversion: Vocoder

AnalysisFeature

conversionSynthesis

21

Source

SpeakerTarget

Speaker

Vocoder: Analysis - Synthesis

Vocoder: STRAIGHT

AnalysisFeature

conversionSynthesis

23

Source

SpeakerTarget

Speaker

Voice Conversion: Feature Conversion

24

Speaker A Speaker B

Z. Wu, Spectral Mapping for Voice Conversion, Ph.D Thesis, Nanyang Technological University, 2015

Differences between Speakers

25

Training Conversion

Basics of Voice Conversion

AMA[22]

Parametric

methods

Frequency

warping

methods

Exemplar-

based

methods

Neural

network-

based

methods

Codebook

mapping

methodsVQ[1

]

DFW[8]

GMM[3] &

JD-GMM[4]

JD-GMM with

MLPG and GV[5]

Formant

Mapping[10]VTLN[9]

DFW+AS[12]

NMF[13]

NMF+RC[14] CUTE[16]

ANN[17]Boltzmann

machine[18]

DNN[19]

LSTM[20]

PLS[6] DKPLS[7]

EFW+RC[15]

WFW[11]

Fuzzy VQ[2]

PPG[21]

Chronological Map of Voice Conversion

27

Subjective Analysis

1. Same/difference

2. Better/worse/same

Objective Analysis

1. Harmonics to Noise ratio

2. Jitter

3. Perceptual Evaluation of Speech Quality (PESQ)

4. …

“Spoofing Analysis”

1. Spectral distortion

2. Temporal discontinuity

3. Spectro-temporal artifacts

4. Pitch pattern

5. ASVspoof 2015 ?

Evaluation of Synthesized Voice

Agenda

28



• Artifacts

• ASVspoof 2015

Magnitude

• Short-time Fourier transform

• Smoothing effect (local vs global optimization)

• HMM/GMM optimization

• Vocoder effect

• Inaccuracy in statistical acoustic modeling

• Glottal flow

• Vocal tract modeling

Phase

• Minimum phase vocoding

• Phase distortion

• Phase discontinuity

29

Artifacts

𝑋(ω)= 𝑋 ω 𝑒𝑗φ(ω)

Y(ω)=H(ω) X(ω)

Gain & Phase shift

• Time-Frequency resolution• Fixed length windows

• Individual frequency bands are distributed linearly, which is different from

the distribution in the human cochlea

• Windowing & spectral leakage

30

Magnitude: STFT

31

Hideki Kawahara, STRAIGHT, exploitation of the other aspect of VOCODER:

Perceptually isomorphic decomposition of speech sounds, Acoust. Sci. & Tech.

27, 6 (2006)

Magnitude: Smoothing in Vocoder

32

Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi

Yamagishi, and Keiichiro Oura “Speech Synthesis Based on Hidden Markov

Models” Proceedings of The IEEE, 2013

Magnitude : Smoothing in synthesized/converted speech

A. Kain and M. W. Macon, “Spectral voice conversion for text-to-

speech synthesis,” in ICASSP 1998.

• Inter-Frame Difference of Log-Likelihood (IFDLL)

∆𝒍𝒕 = |𝒍𝒐𝒈 𝒑(𝒙𝒏|λ𝑪)- 𝒍𝒐𝒈 𝒑(𝒙𝒏−𝟏|λ𝑪)|

• Δ-Cepstrum and Δ2-Cepstrum

Magnitude: Difference of Log-Likelihood

• Takayuki Satoh, Takashi Masuko, Takao Kobayashi, Keiichi Tokuda, “A Robust Speaker Verification System against Imposture Using an HMM-based Speech Synthesis System”, EUROSPEECH 2001

• F. Soong and A. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 36, no. 6, pp. 871–879, Jun 1988

• Sahidullah, Md., Kinnunen, Tomi, Hanilçi, Cemal (2015): "A comparison of features for synthetic speech detection", In INTERSPEECH-2015, 2087-2091.

33

34

Magnitude: Pitch patterns in HMM-based synthesized speech

• Akio OGIHARA, Hitoshi UNNO, and Akira

SHIOZAKI, Discrimination Method of Synthetic

Speech Using Pitch Frequency against Synthetic

Speech Falsification, IEICE TRANS.

FUNDAMENTALS, VOL.E88–A, NO.1

JANUARY 2005

• P. L. De Leon, B. Stewart, and J. Yamagishi,

“Synthetic speech discrimination using pitch pattern

statistics derived from image analysis,” in Proc.

Interspeech, 2012.

• R. D. McClanahan, B. Stewart, and P. L. De Leon,

“Performance of ivector speaker verification and

the detection of synthetic speech,” in Proc. IEEE

Int. Conf. on Acoustics, Speech, and Signal

Processing (ICASSP), 2014.

• Wrapping

• Discontinuity

• Distortion

35

Phase

Oppenheim, Schafer & Buck, Discrete time digital signal

processing, 2nd Edition, Prentice Hall

Instantaneous Frequency

L. D Alsteris and K. K. Paliwal,“Short-time phase spectrum in speech processing: A review and some experimental results,” Digital Signal Processing, 2007.

Time-derivative of phase

for signal:

𝒔𝒂 𝒕 = 𝒂 𝒕 𝒆𝒋φ(𝒕)

Instantaneous

Frequency:

𝒇(𝒕) =𝟏

𝟐𝝅

𝑑𝝋(𝒕)

𝑑𝒕

36

Frequency-derivative of phase

Group Delay Function

B. Yegnanarayana and H. A Murthy, “Significance of group delay functions in spectrum estimation,” IEEE Transactions on Signal Processing,1992

37

High-dimensional features: phase & magnitude

38

Xiong Xiao, Xiaohai Tian, Steven Du, Haihua Xu, Eng Siong Chng, Haizhou Li, "Spoofing Speech Detection Using High Dimensional

Magnitude and Phase Features: the NTU Approach for ASVspoof 2015 Challenge", Interspeech 2015

System D (NTU)



• Artifacts

• ASVspoof 2015

ASVspoof 2015: Speaker verification spoofing

and countermeasures challenge

Organisers

Zhizheng Wu, University of Edinburgh, UK

Tomi Kinnunen, University of Eastern Finland, Finland

Nicholas Evans, EURECOM, France

Junichi Yamagishi, University of Edinburgh, UK 39

Spoofing Attacks

Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilci, Md Sahidullah, Aleksandr

Sizov, "ASVspoof 2015: the First Automatic Speaker Verification Spoofing and Countermeasures Challenge",

INTERSPEECH 2015

This is

Ming!

Synthetic

Speech

Detection

synthetic

Reject!

genuine

40

ASVspoof Database

Joint effort of speech synthesis, voice conversion

and speaker verification community

Zhizheng Wu, Ali Khodabakhsh, Cenk Demiroglu, Junichi Yamagishi, Daisuke Saito, Tomoki Toda, Simon King,

"SAS: A speaker verification spoofing database containing diverse attacks", ICASSP 2015

41

# utterances

Algorithm VocoderTraining(10 male/15 female)

Development(15 male/20 female)

Evaluation(20 male/26 female)

Genuine 3750 3497 9404 None None

S1 2525 9975 18400 VC :Frame-selection STRAIGHT

S2 2525 9975 18400 VC: Slope-shifting STRAIGHT

S3 2525 9975 18400 SS: HMM STRAIGHT

S4 2525 9975 18400 SS: HMM STRAIGHT

S5 2525 9975 18400 VC: GMM MLSA

S6 0 0 18400 VC: GMM STRAIGHT

S7 0 0 18400 VC: GMM STRAIGHT

S8 0 0 18400 VC: Tensor STRAIGHT

S9 0 0 18400 VC: KPLS STRAIGHT

S10 0 0 18400 SS: unit-selection None

Voice Conversion Algorithms

42

S1-S5: voice conversion in train/dev/eval sets

S1: VC - Frame selection

S2: VC - Slope shifting

S3: TTS – HTS with 20 adaptation sentences

S4: TTS – HTS with 40 adaptation sentences

S5: VC – Festvox (http://festvox.org/)

43

Database: Known Attacks

S6 – S10: Only appear in the eval set

S6: VC – ML-GMM with GV enhancement

S7: VC – Similar to S6 but using LSP features

S8: VC – Tensor (eigenvoice)-based approach

S9: VC – Nonlinear regression (KPLS)

S10: TTS – MARY TTS unit selection (http://mary.dfki.de/)

44

16 Primary Submission Results (EER)

Zhizheng Wu, Tomi Kinnunen, Nicholas Evans and Junichi Yamagishi, "ASVspoof 2015: the first automatic

speaker verification spoofing and countermeasures challenge", IEEE Signal Processing Society Speech and

Language Technical Committee Newsletter (SLTC Newsletter), 20 November 2015

Four times higher than

that of known attacks

Team Average (all) Average (without S10) S10

A 1.211 0.402 8.490

B 1.965 0.008 19.571

C 2.528 0.076 24.601

D 2.617 0.003 26.142

E 2.694 0.060 26.393

F 3.218 0.400 28.581

G 3.326 0.360 30.021

H 3.726 0.021 37.068

I 3.898 0.703 32.651

J 4.097 0.029 40.708

K 4.547 0.203 43.638

L 6.719 3.478 35.890

M 14.391 12.482 31.574

N 14.568 11.299 43.991

O 18.826 16.304 41.519

P 21.518 18.786 46.102

45

16 Primary Submission Results (EER)

46

Recent Progress: Fourier Transform vs Auditory Transform

“…, we use a longer window for a lower frequency band to average out the background

noise and a shorter window for a higher frequency band to protect high-frequency

information.”

Q. Li and Y. Huang, “An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions”,

IEEE Transactions on Audio, Speech and Language Processing, vol 19, no 6, pp. 1791-1801, 2011

47

Tanvina B. Patel, Hemant A. Patil, "Combining Evidences from Mel Cepstral, Cochlear Filter Cepstral and Instantaneous Frequency Features

for Detection of Natural vs. Spoofed Speech", Interspeech 2015

Recent Progress: Instantaneous Frequency of Auditory Transform (CFCCIF)

• In STFT, the time and

frequency resolutions are

constant.

• CQT employs a variable

time/frequency resolution:• greater time resolution for

higher frequencies

• greater frequency resolution for

lower frequencies

A comparison of the time-frequency resolution of the STFT (top) and CQT (down).

• Massimiliano Todisco, H´ector Delgado and Nicholas Evans, “A New Feature for Automatic Speaker Verification Anti-Spoofing:

Constant Q Cepstral Coefficients”, Odyssey 2016

• E. Ahissar, S. Nagarajan, M. Ahissar, A. Protopapas, H. Mahncke, and M. M. Merzenich, “Speech comprehension is correlated with

temporal response patterns recorded from auditory cortex,” Proc. Natl. Acad. Sci., vol. 98, no. 23, pp. 13 367–13 372, 2001

Recent Progress: Fourier Transform vs Constant Q

𝑁𝑘 =𝑓𝑠

𝑓𝑘Q𝑄 =

𝑓𝑘δ𝑓

Block diagram of CQCC feature extraction

Comparison of results (EER [%]) on ASVspoof2015 Database

Front-end: CQCC-A (19+0th second derivative coefficients)

Back-end: 2 GMMs (512 components, EM training), one for human speech and one for

spoofed speech

Matlab implementation of CQCC extraction can be downloaded from

http://audio.eurecom.fr/content/software

Constant-QTransform

Powerspectrum

LOG DCT

speech signal

Uniformresampling

CQCC

Recent Progress: Constant Q Cepstral Coefficients (CQCC)

http://audio.eurecom.fr/content/software

voice conversion and spoofing countermeasure in speaker...

Documents