Neural Voice Cloning with a Few Samples - nips.cc

Sercan O. Arik, Jitong Chen, Kainan Peng*, Wei Ping, Yanqi Zhou



Motivations
• Text-to-speech (TTS) models can be conditioned on text and speaker identity.
  • Text: linguistic information, the content of the generated speech.
  • Speaker identity: speaker information (accent, pitch, speech rate…).
• Limitations:
  • Can only generate speech for speakers observed during training.
  • Require many speech samples per speaker (e.g., Deep Voice 2).


Voice Cloning
• Voice cloning: synthesize the voices of new speakers from a few speech samples (few-shot generative model).
• Applications: personalized speech interfaces, content creation, assistive technology…
• Challenges:
  • Generalization: learn the voice of a new speaker.
  • Efficiency: extract the speaker characteristics from a few speech samples.
  • Computational cost: cloning with low latency and small footprint.
• Two approaches:
  • Speaker adaptation.
  • Speaker encoding.


Speaker Adaptation
• Fine-tune a pre-trained multi-speaker model for a new speaker.
• Training data: a few text and audio pairs.
• Two options for speaker adaptation:
  • Fine-tune the whole model.
  • Fine-tune the speaker embedding only.
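
The two adaptation options differ only in which parameters are updated during fine-tuning. The sketch below illustrates this on a toy model, not on the multi-speaker TTS model from the slides: `w` stands in for the shared model weights and `e` for the speaker embedding. The function `adapt`, the model `y = w * x + e`, the squared-error loss, and all numbers are assumptions made for illustration.

```python
def adapt(params, samples, mode, lr=0.1, steps=500):
    """Fine-tune a toy model y = w * x + e on (x, y) pairs.

    w stands in for the shared model weights, e for the speaker embedding.
    mode="embedding" updates only e; mode="whole" updates both w and e.
    """
    if mode not in ("embedding", "whole"):
        raise ValueError("mode must be 'embedding' or 'whole'")
    w, e = params["w"], params["e"]
    n = len(samples)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and e.
        grad_w = sum(2 * (w * x + e - y) * x for x, y in samples) / n
        grad_e = sum(2 * (w * x + e - y) for x, y in samples) / n
        e -= lr * grad_e
        if mode == "whole":
            w -= lr * grad_w
    return {"w": w, "e": e}


# Hypothetical "pre-trained" parameters and a few samples from a new speaker.
pretrained = {"w": 1.5, "e": 0.0}
new_speaker = [(0.0, 3.0), (0.5, 4.0), (1.0, 5.0)]  # generated by y = 2x + 3

embedding_only = adapt(pretrained, new_speaker, mode="embedding")
whole_model = adapt(pretrained, new_speaker, mode="whole")
```

In this toy setting, embedding-only adaptation leaves `w` untouched and can only shift the offset, while whole-model adaptation can fit the new data exactly at the cost of storing a full copy of the parameters per speaker.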


Speaker Adaptation Analysis

                               Speaker Adaptation
                               Embedding-only    Whole-model
Cloning time                   8 h               5 min
# of parameters per speaker    128               25 million
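
To make the per-speaker footprint concrete, the parameter counts in the table can be converted to storage. This is a rough calculation that assumes each parameter is stored as a 32-bit float; the slides do not state the numeric precision, and the helper `per_speaker_bytes` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters stored as 32-bit floats


def per_speaker_bytes(num_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Storage needed to keep one speaker's adapted parameters."""
    return num_params * bytes_per_param


embedding_only_bytes = per_speaker_bytes(128)        # 128-dimensional embedding
whole_model_bytes = per_speaker_bytes(25_000_000)    # about 25 million weights
ratio = whole_model_bytes / embedding_only_bytes

print(f"embedding-only: {embedding_only_bytes} bytes")
print(f"whole-model:    {whole_model_bytes / 1e6:.0f} MB")
print(f"ratio:          {ratio:,.1f}x")
```

Under that assumption, embedding-only adaptation stores about half a kilobyte per speaker, against roughly 100 MB for a whole-model copy.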


Speaker Encoding
• Directly predict a new speaker embedding for a multi-speaker model.
• Train a speaker encoder with audio and speaker embedding pairs.
• Cloning time: a few seconds, more favorable for low-resource deployment.
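
The idea of predicting an embedding directly from audio can be sketched as a regression problem. The toy encoder below is an assumption for illustration and is far simpler than a real speaker encoder: it mean-pools feature frames over time and applies a linear projection. The function names, the L1 regression loss, and the plain averaging across cloning samples are also assumptions, not details taken from the slides.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def encode(frames, weights):
    """Toy encoder: mean-pool frames over time, then project linearly.

    weights holds one row per embedding dimension.
    """
    pooled = mean_pool(frames)
    return [sum(w * x for w, x in zip(row, pooled)) for row in weights]


def l1_loss(predicted, target):
    """Mean absolute error between predicted and target embeddings."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def clone(cloning_samples, weights):
    """Predict one embedding per audio sample and average them."""
    return mean_pool([encode(frames, weights) for frames in cloning_samples])


# Hypothetical 2-D features and encoder weights.
weights = [[1.0, 0.0], [0.0, 2.0]]
sample_a = [[1.0, 2.0], [3.0, 4.0]]   # two frames
sample_b = [[0.0, 1.0]]               # one frame
predicted = clone([sample_a, sample_b], weights)
loss = l1_loss(predicted, target=[1.5, 3.0])
```

Cloning then needs only a forward pass through the encoder, with no fine-tuning, which is consistent with the cloning time of a few seconds noted above.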


Results
• Vocoder: classical Griffin-Lim algorithm.
• Demo website: http://audiodemos.github.io

Mean Opinion Score (MOS)
                         Speaker Adaptation                Speaker Encoding
                         Embedding-only    Whole-model
Naturalness (5-scale)    2.67              3.16            2.99
Similarity (4-scale)     2.95              3.16            2.85
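
A mean opinion score is the average of listener ratings on a fixed scale. The sketch below shows how such a score and an approximate 95% confidence interval could be computed; the ratings are hypothetical, and the normal-approximation interval is an assumption rather than the evaluation procedure behind the table.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical naturalness ratings on a 5-point scale.
ratings = [3, 4, 3, 2, 4, 3, 3, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```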


Voice Morphing via Embedding Manipulation
• BritishMale + AveragedFemale - AveragedMale = BritishFemale
• BritishMale + AveragedAmerican - AveragedBritish = AmericanMale
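
The arithmetic above operates on speaker embedding vectors. Here is a minimal sketch with made-up 2-D embeddings, where the first axis loosely encodes gender and the second encodes accent; real embeddings are higher-dimensional and not axis-aligned like this, and the helpers `average` and `morph` are hypothetical.

```python
def average(vectors):
    """Element-wise mean of a list of embeddings."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def morph(base, add, subtract):
    """Compute base + add - subtract element-wise."""
    return [b + a - s for b, a, s in zip(base, add, subtract)]


# Hypothetical embeddings: axis 0 ~ gender, axis 1 ~ accent.
british_male = [1.0, 1.0]
averaged_female = average([[-1.0, 0.2], [-1.0, -0.2]])
averaged_male = average([[1.0, 0.3], [1.0, -0.3]])

# BritishMale + AveragedFemale - AveragedMale
british_female = morph(british_male, averaged_female, averaged_male)
```

In this toy example the result keeps the accent coordinate of the base speaker and flips the gender coordinate; the morphed embedding would then be fed to the multi-speaker model in place of a learned one.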


Thank you!

Welcome to our poster, and listen to samples!

Today, Session B, #91