*Work funded by NSF
Guanlong Zhao, Shaojin Ding, and Ricardo Gutierrez-Osuna
Presented by Guanlong Zhao ([email protected])
Department of Computer Science and Engineering, Texas A&M University, U.S.
Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams
Interspeech 2019
Introduction
• Foreign accent conversion
  – Create a new voice that has the voice quality of a given non-native speaker and the pronunciation patterns of a native speaker
• Challenge
  – Divide the speech signal into accent-related cues and voice quality
[Figure: native speech combines L1 gestures with the native speaker's identity (ID) carrier; L2 speech combines L2 gestures with the non-native speaker's ID carrier; the "golden speaker" combines L1 gestures with the L2 speaker's ID carrier.]
Related work
• Prior approaches to accent conversion
  – Acoustic [Aryal 2015, Zhao 2018]
  – Articulatory [Aryal 2016]
• Phonetic posteriorgrams (PPGs)
  – Spoken term detection [Hazen 2009]
  – Mispronunciation detection [Lee 2013]
  – Personalized TTS [Sun 2016]
  – Voice conversion [Xie 2016, Zhou 2019]
Aryal, Sandesh, and Ricardo Gutierrez-Osuna. "Can voice conversion be used to reduce non-native accents?" 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.
Zhao, Guanlong, et al. "Accent conversion using phonetic posteriorgrams." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
Aryal, Sandesh, and Ricardo Gutierrez-Osuna. "Data driven articulatory synthesis with deep neural networks." Computer Speech & Language 36 (2016): 260-273.
Hazen, Timothy J., Wade Shen, and Christopher White. "Query-by-example spoken term detection using phonetic posteriorgram templates." 2009 IEEE Workshop on Automatic Speech Recognition & Understanding. IEEE, 2009.
Lee, Ann, Yaodong Zhang, and James Glass. "Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams." 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.
Sun, Lifa, et al. "Personalized, cross-lingual TTS using phonetic posteriorgrams." Interspeech. 2016.
Xie, Feng-Long, Frank K. Soong, and Haifeng Li. "A KL divergence and DNN-based approach to voice conversion without parallel training sentences." Interspeech. 2016.
Zhou, Yi, et al. "Cross-lingual voice conversion with bilingual phonetic posteriorgram and average modeling." ICASSP 2019. IEEE, 2019.
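For readers unfamiliar with PPGs: a phonetic posteriorgram is a frames-by-phonetic-classes matrix of posterior probabilities produced by an acoustic model, one row per analysis frame. A minimal numpy sketch (the tiny shapes are illustrative only; the acoustic model in this work outputs 5816 senone classes):

```python
import numpy as np

# A PPG is a [frames x phonetic-classes] matrix of posterior
# probabilities: each row is one analysis frame and sums to 1.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))        # raw acoustic-model outputs
ppg = np.exp(logits)
ppg /= ppg.sum(axis=1, keepdims=True)   # softmax over classes

assert ppg.shape == (5, 4)
assert np.allclose(ppg.sum(axis=1), 1.0)
```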
Method overview
[Figure: method overview. Training: the non-native (L2) recordings are passed through a native (L1) acoustic model (AM) to obtain L2 PPGs; an L2 speech synthesizer is trained to map L2 PPGs to L2 mel-spectrograms (L2 PPG → L2 mel), and a WaveGlow vocoder to map mel-spectrograms to waveforms (L2 mel → L2 wav). Accent conversion (testing): native recordings are passed through the same L1 AM to obtain L1 PPGs, which drive the trained L2 speech synthesizer and WaveGlow vocoder to produce native-accented speech in the L2 speaker's voice.]
Acoustic modeling
[Figure: acoustic model, a generalized maxout (p-norm) DNN. Input feature vectors are de-correlated, then pass through hidden layers that each consist of a linear transform (fully connected), a group-sum p-norm non-linearity that reduces each group of units x to ‖x‖_p, and a normalization layer; a final fully connected layer with a soft-max produces posteriors over senones.]
[Zhang, ICASSP'14] Zhang, Xiaohui, et al. "Improving deep neural network acoustic models using generalized maxout networks." 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.
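The group-sum (p-norm) non-linearity from Zhang et al. (2014) can be sketched in numpy as below. The group size and p = 2 are typical choices, not values read off the slide, and the unit-RMS renormalization is only a stand-in for the normalization layer:

```python
import numpy as np

def pnorm_groups(x, group_size=2, p=2.0):
    """Group-sum (p-norm) non-linearity: hidden units are split into
    groups and each group is reduced to its p-norm, ||x_group||_p."""
    x = x.reshape(x.shape[0], -1, group_size)
    return (np.abs(x) ** p).sum(axis=-1) ** (1.0 / p)

def renorm(y):
    """Per-frame normalization to unit RMS, playing the role of the
    normalization layer that follows the non-linearity."""
    rms = np.sqrt((y ** 2).mean(axis=-1, keepdims=True))
    return y / np.maximum(rms, 1e-8)

h = np.array([[3.0, 4.0, 0.0, 0.0]])  # one frame, four hidden units
y = pnorm_groups(h)                    # two groups: ||(3,4)|| = 5, ||(0,0)|| = 0
```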
PPG → Mel-spectrogram
[Figure: the PPG-to-Mel synthesizer, a Tacotron-2-style architecture (Eqs. (3)-(9) in the paper). Encoder: PPG → PPG PreNet → convolutional layers → encoder LSTM. Decoder: attention LSTM with attention context, decoder PreNet, decoder LSTM, and two linear projections producing the stop token and the mel-spectrogram; a PostNet residual is summed with the projection output. "||" denotes concatenation and "+" denotes summation.]
Long sequences and attention
– The original Tacotron 2 was designed to accept character sequences as input, which are significantly shorter than our PPG sequences
  • Tens of characters vs. a few hundred frames
– Solutions
  • Train the PPG-to-Mel model with shorter PPG sequences
  • Add a locality constraint to the attention mechanism
Locality constraint on attention
[Figure: location-sensitive attention with a locality constraint. The hidden state of the attention LSTM and the encoder hidden states h_j determine the attention weights; the locality constraint suppresses weights far from the current alignment and renormalizes them into the constrained attention weights, which yield the attention context used to predict the acoustic feature y_i.]
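The idea of the constraint can be sketched as follows; the window width, the hard zero/one mask, and the renormalization are illustrative assumptions, since the exact form follows the equations in the paper:

```python
import numpy as np

def constrain_attention(alpha, prev_peak, width=5):
    """Zero the attention weights outside a window around the previous
    alignment peak, then renormalize so they sum to 1. `width` is an
    illustrative hyperparameter, not a value from the paper."""
    lo = max(0, prev_peak - width)
    hi = min(len(alpha), prev_peak + width + 1)
    mask = np.zeros_like(alpha)
    mask[lo:hi] = 1.0
    out = alpha * mask
    return out / out.sum()

# Toy uniform alignment over 100 encoder (PPG) frames
alpha = np.full(100, 0.01)
constrained = constrain_attention(alpha, prev_peak=50)
# All attention mass now lies on frames 45..55
```

Keeping the attention local prevents the alignment from jumping across a long PPG sequence, which is what makes utterance-level training feasible.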
Mel-spectrogram → Speech
• WaveGlow
[Prenger, Valle, and Catanzaro, ICASSP, 2019] Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "WaveGlow: A flow-based generative network for speech synthesis." ICASSP 2019. IEEE, 2019.
Experimental setup
• Acoustic model
  – Trained on Librispeech (960 h) with Kaldi
  – Five hidden layers and an output layer with 5816 senones
• Data
  – Native English speakers: General American; BDL (M) & CLB (F); CMU-ARCTIC corpus [Kominek and Black, 2004]
  – Non-native English speakers: YKWK (Korean, M) & ZHAA (Arabic, F); L2-ARCTIC corpus [Zhao et al., 2018]
  – One hour per speaker for training; 50 utterances for testing
  – Conversion pairs: BDL-YKWK, CLB-ZHAA
Kominek, John, and Alan W. Black. "The CMU Arctic speech databases." Fifth ISCA Workshop on Speech Synthesis. 2004.
Zhao, Guanlong, et al. "L2-ARCTIC: A non-native English speech corpus." Interspeech, 2018.
Baseline
[Figure: GMM baseline. Native and non-native speech are analyzed with STRAIGHT; frames are paired via PPGs from the L1 and L2 acoustic models by matching phonetic content (e.g., the triphones f−ɪ+l, s−i+l, p−ɔ+t, h−ʌ+t); spectral features are converted with a GMM plus maximum-likelihood parameter generation (MLPG) and global variance (GV); F0 is mapped by mean and variance normalization; aperiodicity (AP) is taken from the non-native speaker; STRAIGHT re-synthesizes the converted speech.]
[Zhao & Gutierrez-Osuna, TASLP, 2019] https://github.com/guanlongzhao/ppg-gmm
Zhao, Guanlong, and Ricardo Gutierrez-Osuna. "Using phonetic posteriorgram based frame pairing for segmental accent conversion." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.10 (2019): 1649-1660.
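The baseline's F0 mapping, mean and variance normalization, can be sketched as below. The function name, the log-domain choice, and the simplified unvoiced handling are assumptions for illustration, not the exact baseline implementation:

```python
import numpy as np

def normalize_f0(src_f0, tgt_mean, tgt_std):
    """Map the source speaker's log-F0 statistics to the target
    speaker's by mean and variance normalization. Unvoiced frames
    (F0 == 0) are left at zero."""
    out = np.zeros_like(src_f0, dtype=float)
    voiced = src_f0 > 0
    logf0 = np.log(src_f0[voiced])
    z = (logf0 - logf0.mean()) / logf0.std()   # standardize source log-F0
    out[voiced] = np.exp(tgt_mean + tgt_std * z)
    return out

src_f0 = np.array([0.0, 120.0, 150.0])         # frame-level F0 in Hz
out = normalize_f0(src_f0, tgt_mean=np.log(200.0), tgt_std=0.2)
```

After the mapping, the voiced frames carry the target speaker's log-F0 mean and spread while preserving the source contour shape.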
Results: MOS
[Figure: MOS ratings (1-5 scale), Quality / Naturalness per system:
  Baseline     3.04 / 2.92
  Proposed     3.53 / 3.46
  L2-ARCTIC    3.98 / 3.50
  CMU-ARCTIC   4.40 / 3.54]
• Rated 50 utterances per system, ≥ 17 ratings per utterance; Amazon MTurk
• No statistically significant difference between "Proposed" and either CMU-ARCTIC (p = 0.35) or L2-ARCTIC (p = 0.54) on the naturalness MOS
• All other differences are significant
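The slides do not state which significance test produced these p-values; as one common nonparametric way to compare two samples of MOS ratings, here is a permutation-test sketch on hypothetical ratings:

```python
import numpy as np

def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of mean ratings.
    This is only an illustrative choice of test, not necessarily the
    one used in the study."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        count += diff >= observed
    return (count + 1) / (n_perm + 1)

# Hypothetical 5-point ratings with identical means
proposed = np.array([3, 4, 3, 4, 3, 4, 3, 3, 4, 4], dtype=float)
reference = np.array([4, 3, 4, 3, 4, 3, 4, 4, 3, 3], dtype=float)
p = permutation_test(proposed, reference)   # large p: no evidence of a difference
```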
Results: voice similarity preference
[Figure: voice-similarity preference. Preference score: Baseline 28% vs. Proposed 72%. Preference confidence score: Baseline 1.05, Proposed 3.40.]
• XAB-style preference test with a 7-point confidence rating at the end
• 17 raters rated 50 random L2-ARCTIC:Proposed:Baseline pairs; each pair was also played in reversed order
Results: accentedness
[Figure: accentedness ratings (9-point scale, 9 = most accented): CMU-ARCTIC 1.20, Baseline 2.94, Proposed 3.93, L2-ARCTIC 7.17.]
• Rated on a 9-point scale, 9 being the most accented
• Each utterance was rated by 18 listeners who live in the U.S.
Discussion
• Better MOS
  – Proposed an easy-to-implement locality constraint on the attention mechanism that makes the PPG-to-Mel model trainable on utterance-level samples
  – MOS ratings are lower than those in the original Tacotron 2 and WaveGlow papers, largely because those systems were trained with 24× more data
• Better voice similarity
  – Generated the non-native speaker's excitation directly from the synthesized mel-spectrogram
• Accentedness
  – Significantly less accented than the non-native speech
  – Slightly increased accentedness ratings compared with the baseline
    • The AM inevitably produces recognition errors when extracting PPGs
    • The proposed model does not explicitly model stress and intonation patterns
Conclusion
• Future work
  – Train the PPG-to-Mel and WaveGlow models jointly
  – Incorporate intonation information into the modeling process
  – Reduce training data size via transfer learning
  – Eliminate the need for a reference utterance at synthesis time
• Open-source
  – Data: https://psi.engr.tamu.edu/l2-arctic-corpus/
  – Code and pre-trained models
    • Proposed: https://github.com/guanlongzhao/fac-via-ppg
    • Baseline: https://github.com/guanlongzhao/ppg-gmm
  – Demo: https://guanlongzhao.github.io/demo/fac-via-ppg/
Thanks. Q & A