voice conversion and spoofing countermeasure in speaker...
TRANSCRIPT
Voice Conversion and Spoofing
Countermeasure in Speaker Verification
Haizhou Li
1
SIDAS 2017, Beijing
Acknowledgement: Zhizheng Wu, Xiaohai Tian, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi
2
Kong Aik Lee, Bin Ma, and Haizhou Li, “Speaker
Verification Makes Its Debut in Smartphone”, IEEE
SLTC Newsletter, February 2013
Voice Biometrics
3
HSBC has been left red-faced after a BBC reporter and his non-identical twin tricked its voice ID authentication service.
The BBC says its “Click” (a weekly TV show) reporter Dan Simmons created an HSBC account and signed up to the bank’s service. HSBC states
that the system is secure because each person’s voice is “unique”.
As Banking Technology reported last year, HSBC launched voice recognition and touch security services in the UK, available to 15 million banking
customers. At that time, HSBC said the system “works by cross-checking against over 100 unique identifiers including both behavioural features
such as speed, cadence and pronunciation, and physical aspects including the shape of larynx, vocal tract and nasal passages”.
According to the BBC, the “bank let Dan Simmons’ non-identical twin, Joe, access the account via the telephone after he mimicked his brother’s
voice.
“Customers simply give their account details and date of birth and then say: ‘My voice is my password.’”
Despite this biometric bamboozle, Joe Simmons couldn’t withdraw money, but he was able to access balances and recent transactions, and was
offered the chance to transfer money between accounts.
Joe Simmons says: “What’s really alarming is that the bank allowed me seven attempts to mimic my brothers’ voiceprint and get it wrong, before I
got in at the eighth time of trying.”
Separately, the BBC says a Click researcher “found HSBC Voice ID kept letting them try to access their account after they deliberately failed on 20
separate occasions spread over 12 minutes”.
The BBC says Click’s successful thwarting of the system is believed to be “the first time the voice security measure has been breached”.
HSBC declined to comment to the BBC on “how secure the system had been until now”.
An HSBC spokesman says: “The security and safety of our customers’ accounts is of the utmost importance to us. Voice ID is a very secure
method of authenticating customers.
“Twins do have a similar voiceprint, but the introduction of this technology has seen a significant reduction in fraud, and has proven to be more
secure than PINS, passwords and memorable phrases.”
Not a great response is it? But very typical of the kind of bland statements that have taken hold in the UK. There is a problem and HSBC needs to
get it fixed.
The rest of the BBC report just contains security experts saying the same things – like “I’m shocked”. Whatever. No point in sharing such dull
insight.
You can see the full BBC Click investigation into biometric security in a special edition of the show on BBC News and on the iPlayer from 20 May.
4
5
6
Agenda
• Spoofing Attacks
• Voice Conversion
• Artifacts
• ASVspoof 2015
7
Speaker Verification
Speaker
Verification
Yes,
John!
This is
John!
Reject!
8
Speaker
Verification
Yes,
John!
Impersonation
Replay
Speech
Synthesis
Voice
Conversion
Spoofing Attacks
This is
John!Reject!
9
Spoofing attack AccessibilityEffectiveness (risk) Countermeasure
availabilityText-independent Text-dependent
Impersonation Low Low/unknown Low/unknown N.A.
Replay High LowLow
to highLow
Speech synthesis
Medium to high
High High Medium
Voice conversion
Medium to high
High High Medium
10
Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, “Spoofing and countermeasures for speaker verification: a survey,”
Speech Communication, vol. 66, pp. 130–153, 2015.
Spoofing Attacks
11
Traits of Replay
• Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent speaker
verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2014
• Luca Cuccovillo, Patrick Aichroth, "OPEN-SET MICROPHONE CLASSIFICATION VIA BLIND CHANNEL ANALYSIS," Luca
Cuccovillo, Patrick Aichroth, ICASSP 2016
Time (Seconds)
Fre
qu
ency (
Hz)
Genuine speech
1.0 2.0 3.0
8000
0
Time (Seconds)
Fre
qu
en
cy (
Hz)
Replay speech
1.0 2.0 3.0
8000
0
12
1. A. Wang, “An industrial strength audio search algorithm,” in Proc. Int. Symposium on Music Information Retrieval (ISMIR),
2003, pp. 7–13.
2. Zhizheng Wu, Sheng Gao, Eng Siong Chng, Haizhou Li, "A study on replay attack and anti-spoofing for text-dependent
speaker verification", Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA
ASC) 2014.
Ideas Against Replay Attack
13
App uses smartphone compass to stop voice
hacking
Ideas Against Replay Attack
Tandem double-channel pop noise detection
14
Ideas Against Replay Attack
Sayaka Shiota et al, VOICE LIVENESS DETECTION FOR SPEAKER VERIFICATION BASED ON A TANDEM SINGLE/DOUBLE-
CHANNEL POP NOISE DETECTOR, Speaker Odyssey 2016
15
There are 3 subfields of Phonetics, i.e. Articulatory Phonetics, Acoustic Phonetics, and
Auditory Phonetics. Denes & Pinson (1993)
Tomi Kinnunen and Haizhou Li, “An Overview of Text-Independent Speaker Recognition: from Features to
Supervectors”, Speech Communication 52(1): 12--40, January 2010
Speaker Verification
16
• Modeling the human voice
production system
• Modeling the peripheral auditory
system
Speaker Verification
transducer, channel
state of health, mood, aging
session variability
Challenges and Opportunities
Systems assume natural speech inputs
More robust = More vulnerable
Machines and Humans listen in different ways**
Better speech perceptual quality ≠ less artifacts*
MFCC &
Hmm…
Spoofing: Speaker Verification
17
*K.K.Paliwal, et al,“Comparative Evaluation of Speech Enhancement Methods for Robust Automatic Speech Recognition,”
Int. Conf.Sig. Proce.and Comm.Sys.,Gold Coast, Australia, ICSPCS, Dec.2010
** Duc Hoang Ha Nguyen, Xiong Xiao, Eng Siong Chng, Haizhou Li, “Feature Adaptation Using Linear Spectro-Temporal
Transform for Robust Speech Recognition”, IEEE/ACM Trans. Audio, Speech & Language Processing 24(6): 1006-1019
(2016)
This is
Ming!
Speaker
Verification
Yes,
Ming!
Synthetic
Speech
Detection
Spoofing Attacks
No
Reject! Reject!
18
Agenda
19
• Spoofing Attacks
• Voice Conversion
• Artifacts
• ASVspoof 2015
AnalysisFeature
conversionSynthesis
20
Source
SpeakerTarget
Speaker
Voice Conversion: Vocoder
AnalysisFeature
conversionSynthesis
21
Source
SpeakerTarget
Speaker
Vocoder: Analysis - Synthesis
Vocoder: STRAIGHT
AnalysisFeature
conversionSynthesis
23
Source
SpeakerTarget
Speaker
Voice Conversion: Feature Conversion
24
Speaker A Speaker B
Z. Wu, Spectral Mapping for Voice Conversion, Ph.D Thesis, Nanyang Technological University, 2015
Differences between Speakers
25
Training Conversion
Basics of Voice Conversion
AMA[22]
Parametric
methods
Frequency
warping
methods
Exemplar-
based
methods
Neural
network-
based
methods
Codebook
mapping
methodsVQ[1
]
DFW[8]
GMM[3] &
JD-GMM[4]
JD-GMM with
MLPG and GV[5]
Formant
Mapping[10]VTLN[9]
DFW+AS[12]
NMF[13]
NMF+RC[14] CUTE[16]
ANN[17]Boltzmann
machine[18]
DNN[19]
LSTM[20]
PLS[6] DKPLS[7]
EFW+RC[15]
WFW[11]
Fuzzy VQ[2]
PPG[21]
Chronological Map of Voice Conversion
27
Subjective Analysis
1. Same/difference
2. Better/worse/same
Objective Analysis
1. Harmonics to Noise ratio
2. Jitter
3. Perceptual Evaluation of Speech Quality (PESQ)
4. …
“Spoofing Analysis”
1. Spectral distortion
2. Temporal discontinuity
3. Spectro-temporal artifacts
4. Pitch pattern
5. ASVspoof 2015 ?
Evaluation of Synthesized Voice
Agenda
28
• Spoofing Attacks
• Voice Conversion
• Artifacts
• ASVspoof 2015
Magnitude
• Short-time Fourier transform
• Smoothing effect (local vs global optimization)
• HMM/GMM optimization
• Vocoder effect
• Inaccuracy in statistical acoustic modeling
• Glottal flow
• Vocal tract modeling
Phase
• Minimum phase vocoding
• Phase distortion
• Phase discontinuity
29
Artifacts
𝑋(ω)= 𝑋 ω 𝑒𝑗φ(ω)
Y(ω)=H(ω) X(ω)
Gain & Phase shift
• Time-Frequency resolution• Fixed length windows
• Individual frequency bands are distributed linearly, which is different from
the distribution in the human cochlea
• Windowing & spectral leakage
30
Magnitude: STFT
31
Hideki Kawahara, STRAIGHT, exploitation of the other aspect of VOCODER:
Perceptually isomorphic decomposition of speech sounds, Acoust. Sci. & Tech.
27, 6 (2006)
Magnitude: Smoothing in Vocoder
32
Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi
Yamagishi, and Keiichiro Oura “Speech Synthesis Based on Hidden Markov
Models” Proceedings of The IEEE, 2013
Magnitude : Smoothing in synthesized/converted speech
A. Kain and M. W. Macon, “Spectral voice conversion for text-to-
speech synthesis,” in ICASSP 1998.
• Inter-Frame Difference of Log-Likelihood (IFDLL)
∆𝒍𝒕 = |𝒍𝒐𝒈 𝒑(𝒙𝒏|λ𝑪)- 𝒍𝒐𝒈 𝒑(𝒙𝒏−𝟏|λ𝑪)|
• Δ-Cepstrum and Δ2-Cepstrum
Magnitude: Difference of Log-Likelihood
• Takayuki Satoh, Takashi Masuko, Takao Kobayashi, Keiichi Tokuda, “A Robust Speaker Verification System against Imposture Using an HMM-based Speech Synthesis System”, EUROSPEECH 2001
• F. Soong and A. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 36, no. 6, pp. 871–879, Jun 1988
• Sahidullah, Md., Kinnunen, Tomi, Hanilçi, Cemal (2015): "A comparison of features for synthetic speech detection", In INTERSPEECH-2015, 2087-2091.
33
34
Magnitude: Pitch patterns in HMM-based synthesized speech
• Akio OGIHARA, Hitoshi UNNO, and Akira
SHIOZAKI, Discrimination Method of Synthetic
Speech Using Pitch Frequency against Synthetic
Speech Falsification, IEICE TRANS.
FUNDAMENTALS, VOL.E88–A, NO.1
JANUARY 2005
• P. L. De Leon, B. Stewart, and J. Yamagishi,
“Synthetic speech discrimination using pitch pattern
statistics derived from image analysis,” in Proc.
Interspeech, 2012.
• R. D. McClanahan, B. Stewart, and P. L. De Leon,
“Performance of ivector speaker verification and
the detection of synthetic speech,” in Proc. IEEE
Int. Conf. on Acoustics, Speech, and Signal
Processing (ICASSP), 2014.
• Wrapping
• Discontinuity
• Distortion
35
Phase
Oppenheim, Schafer & Buck, Discrete time digital signal
processing, 2nd Edition, Prentice Hall
Instantaneous Frequency
L. D Alsteris and K. K. Paliwal,“Short-time phase spectrum in speech processing: A review and some experimental results,” Digital Signal Processing, 2007.
Time-derivative of phase
for signal:
𝒔𝒂 𝒕 = 𝒂 𝒕 𝒆𝒋φ(𝒕)
Instantaneous
Frequency:
𝒇(𝒕) =𝟏
𝟐𝝅
𝑑𝝋(𝒕)
𝑑𝒕
36
Frequency-derivative of phase
Group Delay Function
B. Yegnanarayana and H. A Murthy, “Significance of group delay functions in spectrum estimation,” IEEE Transactions on Signal Processing,1992
37
High-dimensional features: phase & magnitude
38
Xiong Xiao, Xiaohai Tian, Steven Du, Haihua Xu, Eng Siong Chng, Haizhou Li, "Spoofing Speech Detection Using High Dimensional
Magnitude and Phase Features: the NTU Approach for ASVspoof 2015 Challenge", Interspeech 2015
System D (NTU)
• Spoofing Attacks
• Voice Conversion
• Artifacts
• ASVspoof 2015
ASVspoof 2015: Speaker verification spoofing
and countermeasures challenge
Organisers
Zhizheng Wu, University of Edinburgh, UK
Tomi Kinnunen, University of Eastern Finland, Finland
Nicholas Evans, EURECOM, France
Junichi Yamagishi, University of Edinburgh, UK 39
Spoofing Attacks
Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilci, Md Sahidullah, Aleksandr
Sizov, "ASVspoof 2015: the First Automatic Speaker Verification Spoofing and Countermeasures Challenge",
INTERSPEECH 2015
This is
Ming!
Synthetic
Speech
Detection
synthetic
Reject!
genuine
40
ASVspoof Database
Joint effort of speech synthesis, voice conversion
and speaker verification community
Zhizheng Wu, Ali Khodabakhsh, Cenk Demiroglu, Junichi Yamagishi, Daisuke Saito, Tomoki Toda, Simon King,
"SAS: A speaker verification spoofing database containing diverse attacks", ICASSP 2015
41
# utterances
Algorithm VocoderTraining(10 male/15 female)
Development(15 male/20 female)
Evaluation(20 male/26 female)
Genuine 3750 3497 9404 None None
S1 2525 9975 18400 VC :Frame-selection STRAIGHT
S2 2525 9975 18400 VC: Slope-shifting STRAIGHT
S3 2525 9975 18400 SS: HMM STRAIGHT
S4 2525 9975 18400 SS: HMM STRAIGHT
S5 2525 9975 18400 VC: GMM MLSA
S6 0 0 18400 VC: GMM STRAIGHT
S7 0 0 18400 VC: GMM STRAIGHT
S8 0 0 18400 VC: Tensor STRAIGHT
S9 0 0 18400 VC: KPLS STRAIGHT
S10 0 0 18400 SS: unit-selection None
Voice Conversion Algorithms
42
S1-S5: voice conversion in train/dev/eval sets
S1: VC - Frame selection
S2: VC - Slope shifting
S3: TTS – HTS with 20 adaptation sentences
S4: TTS – HTS with 40 adaptation sentences
S5: VC – Festvox (http://festvox.org/)
43
Database: Known Attacks
S6 – S10: Only appear in the eval set
S6: VC – ML-GMM with GV enhancement
S7: VC – Similar to S6 but using LSP features
S8: VC – Tensor (eigenvoice)-based approach
S9: VC – Nonlinear regression (KPLS)
S10: TTS – MARY TTS unit selection (http://mary.dfki.de/)
44
16 Primary Submission Results (EER)
Zhizheng Wu, Tomi Kinnunen, Nicholas Evans and Junichi Yamagishi, "ASVspoof 2015: the first automatic
speaker verification spoofing and countermeasures challenge", IEEE Signal Processing Society Speech and
Language Technical Committee Newsletter (SLTC Newsletter), 20 November 2015
Four times higher than
that of known attacks
Team Average (all) Average (without S10) S10
A 1.211 0.402 8.490
B 1.965 0.008 19.571
C 2.528 0.076 24.601
D 2.617 0.003 26.142
E 2.694 0.060 26.393
F 3.218 0.400 28.581
G 3.326 0.360 30.021
H 3.726 0.021 37.068
I 3.898 0.703 32.651
J 4.097 0.029 40.708
K 4.547 0.203 43.638
L 6.719 3.478 35.890
M 14.391 12.482 31.574
N 14.568 11.299 43.991
O 18.826 16.304 41.519
P 21.518 18.786 46.102
45
16 Primary Submission Results (EER)
46
Recent Progress: Fourier Transform vs Auditory Transform
“…, we use a longer window for a lower frequency band to average out the background
noise and a shorter window for a higher frequency band to protect high-frequency
information.”
Q. Li and Y. Huang, “An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions”,
IEEE Transactions on Audio, Speech and Language Processing, vol 19, no 6, pp. 1791-1801, 2011
47
Tanvina B. Patel, Hemant A. Patil, "Combining Evidences from Mel Cepstral, Cochlear Filter Cepstral and Instantaneous Frequency Features
for Detection of Natural vs. Spoofed Speech", Interspeech 2015
Recent Progress: Instantaneous Frequency of Auditory Transform (CFCCIF)
• In STFT, the time and
frequency resolutions are
constant.
• CQT employs a variable
time/frequency resolution:• greater time resolution for
higher frequencies
• greater frequency resolution for
lower frequencies
A comparison of the time-frequency resolution of the STFT (top) and CQT (down).
• Massimiliano Todisco, H´ector Delgado and Nicholas Evans, “A New Feature for Automatic Speaker Verification Anti-Spoofing:
Constant Q Cepstral Coefficients”, Odyssey 2016
• E. Ahissar, S. Nagarajan, M. Ahissar, A. Protopapas, H. Mahncke, and M. M. Merzenich, “Speech comprehension is correlated with
temporal response patterns recorded from auditory cortex,” Proc. Natl. Acad. Sci., vol. 98, no. 23, pp. 13 367–13 372, 2001
Recent Progress: Fourier Transform vs Constant Q
𝑁𝑘 =𝑓𝑠
𝑓𝑘Q𝑄 =
𝑓𝑘δ𝑓
Block diagram of CQCC feature extraction
Comparison of results (EER [%]) on ASVspoof2015 Database
Front-end: CQCC-A (19+0th second derivative coefficients)
Back-end: 2 GMMs (512 components, EM training), one for human speech and one for
spoofed speech
Matlab implementation of CQCC extraction can be downloaded from
http://audio.eurecom.fr/content/software
Constant-QTransform
Powerspectrum
LOG DCT
speech signal
Uniformresampling
CQCC
Recent Progress: Constant Q Cepstral Coefficients (CQCC)
50