TRANSCRIPT
All Your Voices Are Belong to Us: Stealing Voices to Fool Humans and Machines
Dibya Mukhopadhyay, Maliheh Shirvanian, Nitesh Saxena
University of Alabama at Birmingham, USA
Premise
• We leave voice traces behind
• How difficult is it to make a machine talk like you?
• What are the consequences?
• Voice is used as a biometric -> attacking voice-based user authentication systems
• Voice makes us known to people -> attacking arbitrary speech contexts
Voice Morphing
• TTS Voice Synthesis (e.g., [AT&T voice synthesizer])
• Voice Conversion (e.g., Festvox)
Trained Voice Conversion System
• Training: the system takes source (attacker) speaker samples and target (victim) speaker samples, and learns to map the source voice to the target voice
• Testing: input is samples in the attacker's voice; output is samples spoken in the victim's voice (sketched below)
Voila!
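As a concrete illustration of the training/testing split above, here is a minimal sketch of joint-density GMM spectral mapping, the classic technique behind voice-conversion tools like Festvox. The function names, the use of NumPy/SciPy/scikit-learn, and the assumption of pre-aligned parallel frames are illustrative assumptions, not the talk's actual code.

```python
# Sketch only: joint-density GMM voice conversion (Stylianou/Toda style).
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_conversion(src_frames, tgt_frames, n_components=32):
    """Training: fit one GMM over time-aligned (source, target) spectral
    frame pairs, e.g. MFCCs already aligned by dynamic time warping."""
    joint = np.hstack([src_frames, tgt_frames])   # shape (n_frames, 2d)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full").fit(joint)

def convert_frame(gmm, x):
    """Testing: map one attacker frame x to the victim's spectral space
    via the conditional expectation E[target | source] under the GMM."""
    d = x.shape[0]
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    Sxx, Sxy = gmm.covariances_[:, :d, :d], gmm.covariances_[:, :d, d:]
    # Posterior responsibility of each mixture component given x.
    p = np.array([w * multivariate_normal.pdf(x, m, S)
                  for w, m, S in zip(gmm.weights_, mu_x, Sxx)])
    p /= p.sum()
    # Responsibility-weighted conditional means give the converted frame.
    return sum(p[k] * (mu_y[k] + Sxy[k].T @ np.linalg.solve(Sxx[k], x - mu_x[k]))
               for k in range(len(p)))
```

A full converter would also transform pitch and resynthesize a waveform from the converted frames; this sketch covers only the spectral mapping.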
Speaker Verification
• Machine-based Speaker Verification (e.g., [Douglas et al., DSP, 2006])
• A 2-class problem: accept or reject the claimant's identity
• The system creates a model of a speaker in the training phase; that model is used to verify the claimant in the testing phase (see the sketch below)
• Human-based Speaker Verification
• A human user serves as the verifier
• Implicit in arbitrary communication
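A minimal sketch of the two phases just described, with a per-speaker GMM over spectral frames standing in for the real systems evaluated later; the names and the scikit-learn model are assumptions for illustration.

```python
# Sketch only: speaker verification as train-a-model / threshold-a-score.
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(train_frames: np.ndarray) -> GaussianMixture:
    """Training phase: build a model of the claimed speaker's voice."""
    return GaussianMixture(n_components=16).fit(train_frames)

def verify(model: GaussianMixture, test_frames: np.ndarray,
           threshold: float) -> bool:
    """Testing phase (the 2-class decision): accept the identity claim
    iff the average per-frame log-likelihood clears the threshold."""
    return model.score(test_frames) >= threshold
```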
Our Contributions
• We study voice impersonation attacks
• We evaluate attack feasibility against state-of-the-art automated speaker verification algorithms as well as manual verification
• Our attacks represent realistic settings and are practical
• We use an off-the-shelf voice morphing engine
• We use a small amount of training speech for voice conversion: approx. 6-8 minutes
• Most training samples are recorded using low-end devices such as smartphones and laptops
System and Threat Model
Phase I: Collecting Audio Samples
• The attacker collects the target (victim) Bob's (T) audio samples OT = (t1…tn) via audio recording, wiretapping, or social media

Phase II: Building Voice Morphing Model
• Training: the attacker (source S) records the same utterances as OT in their own voice, OS = (s1…sn), and trains a morphing model M = µ(OS, OT) that maps the source voice to the target voice
• Conversion: any utterance A = (a1…am) spoken by the attacker is morphed into fT = M(A) = (f1…fm), spoken in the victim's voice

Phase III: Attacking Applications with Morphed Voices
• The attacker claims "I am Bob" and plays the fake utterance A in Bob's voice against machine-based and human-based speaker verification. Access granted?
Experiments and Measures
• Benign Setting: test samples spoken by the original speaker
• Attack Setting: Different Speaker Attack; Conversion Attack
• Metrics Used (computation sketched below):
• False Rejection Rate (FRR): fraction of genuine samples rejected in the benign setting
• False Acceptance Rate (FAR): fraction of attack samples accepted in the attack setting
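Given one list of scored trials per setting, the two metrics reduce to a few lines; this is an illustrative computation under that assumption, not the evaluation harness used in the talk.

```python
# Sketch only: FRR/FAR from scored trials at a given operating threshold.
import numpy as np

def frr_far(genuine_scores, attack_scores, threshold):
    """FRR: fraction of genuine samples rejected (benign setting).
    FAR: fraction of attack samples accepted (attack setting)."""
    genuine = np.asarray(genuine_scores)
    attack = np.asarray(attack_scores)
    return np.mean(genuine < threshold), np.mean(attack >= threshold)
```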
Attacking Machine-based Speaker Verification
Tools and Algorithms
• Festvox Voice Conversion System
• Bob Spear Speaker Verification System [E. Khoury; ICASSP, 2014]
• UBM-GMM: a modeling technique that uses spectral features and computes log-likelihoods under Gaussian Mixture Models for background modeling and speaker verification
• ISV (Inter-Session Variability modeling): an improvement to UBM-GMM in which a speaker's variability due to age, surroundings, etc., is compensated for, giving better performance for the same user across different scenarios (scoring sketched below)
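To make the UBM-GMM description concrete, here is a sketch of its scoring rule: the verification score is the log-likelihood ratio between a claimant model and a background model trained on many speakers. Illustrative only; Bob Spear's actual pipeline additionally MAP-adapts the speaker model from the UBM, and ISV adds session compensation on top.

```python
# Sketch only: UBM-GMM log-likelihood-ratio scoring.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_frames: np.ndarray) -> GaussianMixture:
    """Universal background model over pooled multi-speaker frames."""
    return GaussianMixture(n_components=256).fit(background_frames)

def llr_score(speaker_gmm: GaussianMixture, ubm: GaussianMixture,
              test_frames: np.ndarray) -> float:
    """Average per-frame log-likelihood ratio; larger values favor the
    claimed speaker over the background population."""
    return speaker_gmm.score(test_frames) - ubm.score(test_frames)
```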
Datasets
• Voxforge: recorded using standard recording devices; sample length: 5 secs; 28 speakers chosen (all male)
• MOBIO: recorded using laptop microphones; sample length: 7-30 secs; 152 speakers (99 male, 53 female)
Conversion Attack Setup
• Voxforge: attacker: 1 male speaker (CMU Arctic); victims: 8 speakers; training: 100 samples of 5 secs each (≈ 8 mins of speech)
• MOBIO: attackers: 6 male and 3 female speakers; victims: 32 male and 17 female speakers; training: 12 samples of 30 secs each (≈ 6 mins of speech)
CMU Arctic Databases: http://festvox.org/cmu_arctic/index.html
Different Speaker Attack Setup
• Testing on Voxforge: original samples were swapped with samples spoken by each of the chosen CMU Arctic speakers
• Testing on MOBIO: original samples were swapped with other speakers' samples
Results
Attacking Human-based Speaker Verification
User Studies
• Famous Speaker Study: Attackers mimic celebrities, users have to identify celebrities’ samples
• Briefly Familiar Speaker Study: Attackers mimic speakers, users have to identify speakers’ samples
• Study Platform: Amazon Mechanical Turk (M-Turk)
• # of Participants: 65 and 32 (for the two studies) M-Turk online users
• Related work: prior work [Shirvanian-Saxena; CCS'14] studied "Short Authenticated Strings"; we look at arbitrary speech
Famous Speaker Study Setup
• Samples collected using an application published on M-Turk
• 5 female speakers mimicked Oprah Winfrey (100 samples)
• 5 male speakers mimicked Morgan Freeman (100 samples)
• Users listen to a 2-min speech sample of Oprah and Morgan, followed by several benign and attack challenges
• Speaker Verification Test: identify the original speaker
• Voice Similarity Test: rate the similarity of a voice to the original speaker
Attack Setup
• Different Speaker Attack: female M-Turk speakers for Oprah; male M-Turk speakers for Morgan
• Conversion Attack: # of training samples: 100 sentences of 4 secs each; source: male/female M-Turk speakers; target: Oprah/Morgan
Tests
• Speaker Verification Test: Question: Is the speaker Oprah/Morgan? Answer options: Yes, No, Not Sure
• Voice Similarity Test: Question: How similar is each sample to Oprah/Morgan? Answer options: exactly similar, very similar, somehow similar, not very similar, different
Briefly Familiar Study Setup
• Male and female M-Turk speakers (from the previous dataset) as victims
• A 90-sec clip of the victim's voice played for familiarization
• Speaker Verification Test (as before)
• Voice Similarity Test (as before)
Attack Setup
• Different Speaker Attack: female M-Turk speakers for female victims; male M-Turk speakers for male victims
• Conversion Attack: source: female/male M-Turk speakers; target: female/male M-Turk speakers
Results: Speaker Verification Test
Results: Voice Similarity Test

Oprah:
• Original Speaker: 88.08% found "exactly similar" or "very similar"
• Different Speaker: 86.81% found "different" or "not very similar"
• Conversion Attack: 74.10% rated "somehow similar" or "very similar"

Morgan:
• Original Speaker: 95.77% found "exactly similar" or "very similar"
• Different Speaker: 94.36% found "different" or "not very similar"
• Conversion Attack: 59.74% rated "somehow similar" or "very similar"
Results: Voice Similarity Test (Briefly Familiar Speaker Study)
Conclusions
• The conversion attack succeeds in about 80-90% of cases against state-of-the-art speaker verification algorithms
• In about 50% of cases, human verifiers were fooled by morphed samples
• Attacks against human verifiers will improve as voice conversion/synthesis techniques continue to improve
Limitations and Future Work
• We only evaluated one known state-of-the-art biometric speaker verification system and one off-the-shelf voice conversion tool.
• The likelihood of accepting an attacked sample may be higher in real life, as people may not pay due attention.
• Attacks might be more effective when the human verifier has a hearing impairment.
• The current study does not tell us how the attacks might work in other scenarios, such as faking real-time communication or faking court evidence.
Thank You!