Generative Model-Based Text-to-Speech Synthesis
TRANSCRIPT
Outline
Generative TTS
Generative acoustic models for parametric TTS: Hidden Markov models (HMMs), Neural networks
Beyond parametric TTS: Learned features, WaveNet, End-to-end
Conclusion & future topics
Text-to-speech as sequence-to-sequence mapping
Automatic speech recognition (ASR): speech → "Hello my name is Heiga Zen"
Machine translation (MT): "Hello my name is Heiga Zen" → "Ich heiße Heiga Zen"
Text-to-speech synthesis (TTS): "Hello my name is Heiga Zen" → speech
Heiga Zen, Generative Model-Based Text-to-Speech Synthesis, February 3rd, 2017
Speech production process
[Figure: the speech production process — text (a concept) modulates a carrier wave (air flow from the lungs) with speech information: the sound source (voiced: pulse, unvoiced: noise) sets the fundamental frequency and voiced/unvoiced switching, and the vocal tract shapes the frequency transfer characteristics, producing speech.]
Typical flow of a TTS system
Sentence segmentation
Word segmentation
Text normalization
Part-of-speech tagging
Pronunciation
Prosody prediction
Waveform generation
TEXT → Text analysis (frontend, NLP: discrete → discrete) → Speech synthesis (backend: discrete → continuous) → SYNTHESIZED SPEECH
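The frontend/backend split above can be sketched as two functions. This is a toy illustration only: the function names, the "lexicon", and the rules are hypothetical stand-ins, not part of any real TTS system.

```python
def text_analysis(text):
    """Frontend (NLP): discrete text -> discrete linguistic features."""
    # Text normalization + word segmentation (toy: strip punctuation, lowercase)
    words = [w.strip(".,!?").lower() for w in text.split()]
    # Pronunciation (toy "lexicon": treat each letter as a phone)
    return [{"word": w, "phones": list(w)} for w in words]

def speech_synthesis(linguistic_features, frames_per_phone=5):
    """Backend: discrete linguistic features -> continuous acoustic frames."""
    # Toy acoustic model: one constant 1-D vector per phone, repeated per frame
    frames = []
    for item in linguistic_features:
        for ph in item["phones"]:
            frames += [[float(ord(ph))]] * frames_per_phone
    return frames

feats = text_analysis("Hello, world.")   # discrete -> discrete
frames = speech_synthesis(feats)         # discrete -> continuous
```

The point of the sketch is the interface: the frontend never sees audio, and the backend never sees raw text.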
Speech synthesis approaches
- Rule-based, formant synthesis [1]
- Sample-based, concatenative synthesis [2]
- Model-based, generative synthesis
p(speech = [waveform] | text = "Hello, my name is Heiga Zen.")
Probabilistic formulation of TTS
Random variables
- X : speech waveforms (training data) — observed
- W : transcriptions (training data) — observed
- w : given text — observed
- x : synthesized speech — unobserved

Synthesis:
- Estimate the posterior predictive distribution p(x | w, X, W)
- Sample x̄ from the posterior distribution
Probabilistic formulation
Introduce auxiliary variables (representation) + factorize dependency
\[
p(\mathbf{x} \mid \mathbf{w}, \mathcal{X}, \mathcal{W}) = \sum_{\forall \mathbf{l}} \sum_{\forall \mathcal{L}} \iiint p(\mathbf{x} \mid \mathbf{o})\, p(\mathbf{o} \mid \mathbf{l}, \lambda)\, p(\mathbf{l} \mid \mathbf{w})\, p(\mathcal{X} \mid \mathcal{O})\, p(\mathcal{O} \mid \mathcal{L}, \lambda)\, p(\lambda)\, p(\mathcal{L} \mid \mathcal{W}) \,/\, p(\mathcal{X}) \; d\mathbf{o}\, d\mathcal{O}\, d\lambda
\]
where O, o: acoustic features; L, l: linguistic features; λ: model
Approximation (1)
Approximate {sum & integral} by best point estimates (like MAP) [3]
\[
p(\mathbf{x} \mid \mathbf{w}, \mathcal{X}, \mathcal{W}) \approx p(\mathbf{x} \mid \hat{\mathbf{o}})
\]
where
\[
\{\hat{\mathbf{o}}, \hat{\mathbf{l}}, \hat{\mathcal{O}}, \hat{\mathcal{L}}, \hat{\lambda}\} = \arg\max_{\mathbf{o}, \mathbf{l}, \mathcal{O}, \mathcal{L}, \lambda} p(\mathbf{x} \mid \mathbf{o})\, p(\mathbf{o} \mid \mathbf{l}, \lambda)\, p(\mathbf{l} \mid \mathbf{w})\, p(\mathcal{X} \mid \mathcal{O})\, p(\mathcal{O} \mid \mathcal{L}, \lambda)\, p(\lambda)\, p(\mathcal{L} \mid \mathcal{W})
\]
Approximation (2): Joint → step-by-step maximization [3]
\begin{align*}
\hat{\mathcal{O}} &= \arg\max_{\mathcal{O}} p(\mathcal{X} \mid \mathcal{O}) && \text{Extract acoustic features} \\
\hat{\mathcal{L}} &= \arg\max_{\mathcal{L}} p(\mathcal{L} \mid \mathcal{W}) && \text{Extract linguistic features} \\
\hat{\lambda} &= \arg\max_{\lambda} p(\hat{\mathcal{O}} \mid \hat{\mathcal{L}}, \lambda)\, p(\lambda) && \text{Learn mapping} \\
\hat{\mathbf{l}} &= \arg\max_{\mathbf{l}} p(\mathbf{l} \mid \mathbf{w}) && \text{Predict linguistic features} \\
\hat{\mathbf{o}} &= \arg\max_{\mathbf{o}} p(\mathbf{o} \mid \hat{\mathbf{l}}, \hat{\lambda}) && \text{Predict acoustic features} \\
\bar{\mathbf{x}} &\sim f_{\mathbf{x}}(\hat{\mathbf{o}}) = p(\mathbf{x} \mid \hat{\mathbf{o}}) && \text{Synthesize waveform}
\end{align*}
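The six maximization steps can be sketched as a training phase (the first three) and a synthesis phase (the last three). Everything below is a hypothetical toy stand-in for the argmax operations — the "vocoder", "text analysis", and "model" are deliberately trivial.

```python
def extract_acoustic_features(waveforms):      # O_hat = argmax_O p(X|O)
    """Toy 'vocoder analysis': one mean value per utterance."""
    return [[sum(x) / len(x)] for x in waveforms]

def extract_linguistic_features(transcripts):  # L_hat = argmax_L p(L|W)
    """Toy 'text analysis': whitespace tokenization."""
    return [t.split() for t in transcripts]

def train_model(O_hat, L_hat):                 # lambda_hat = argmax p(O_hat|L_hat,lam)p(lam)
    """Toy mapping: mean acoustic feature per word."""
    model = {}
    for o, l in zip(O_hat, L_hat):
        for w in l:
            model.setdefault(w, []).append(o[0])
    return {w: sum(v) / len(v) for w, v in model.items()}

def predict(model, text):                      # l_hat, then o_hat
    l_hat = text.split()                       # predict linguistic features
    return [model.get(w, 0.0) for w in l_hat]  # predict acoustic features

X = [[0.0, 2.0], [4.0, 4.0]]                   # training waveforms
W = ["hello world", "hello again"]             # training transcriptions
model = train_model(extract_acoustic_features(X), extract_linguistic_features(W))
o = predict(model, "hello")                    # synthesis (waveform step omitted)
```

The key property the sketch preserves is that each step only consumes the point estimate of the previous one — uncertainty is discarded at every stage.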
Representations: acoustic, linguistic, mapping
Representation – Linguistic features
Sentence: length, ...        Hello, world.
Phrase: intonation, ...
Word: POS, grammar, ...      hello world
Syllable: stress, tone, ...  h-e2 l-ou1 w-er1-l-d
Phone: voicing, manner, ...  h e l ou w er l d

→ Based on knowledge about spoken language:
- Lexicon, letter-to-sound rules
- Tokenizer, tagger, parser
- Phonology rules
Representation – Acoustic features
Piece-wise stationary, source-filter generative model p(x | o)

[Figure: source-filter model — the vocal source (pulse train for voiced sounds, white noise for unvoiced; fundamental frequency, aperiodicity, voicing, ...) produces an excitation e(n); a vocal tract filter h(n) (cepstrum, LPC, ...) shapes it into speech x(n) = h(n) * e(n), analyzed with overlapping, shifted windows.]

→ Needs to solve an inverse problem:
- Estimate parameters from signals
- Use the estimated parameters (e.g., cepstrum) as acoustic features
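The generative direction of the source-filter model — excitation convolved with a vocal-tract impulse response — is easy to sketch. The filter taps below are arbitrary illustrative numbers, not a real vocal-tract response.

```python
def pulse_train(n_samples, period):
    """Voiced excitation e(n): unit pulses every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def convolve(h, e):
    """x(n) = sum_k h(k) e(n-k): linear convolution, length len(e)+len(h)-1."""
    x = [0.0] * (len(e) + len(h) - 1)
    for n, en in enumerate(e):
        for k, hk in enumerate(h):
            x[n + k] += hk * en
    return x

e = pulse_train(8, period=4)   # pulses at n = 0 and n = 4 (toy F0)
h = [1.0, 0.5, 0.25]           # toy vocal-tract impulse response
x = convolve(h, e)             # speech: copies of h at each pulse
```

The inverse problem mentioned on the slide is the hard direction: recovering h(n) (and the excitation parameters) from x(n) alone.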
Representation – Mapping
Rule-based, formant synthesis [1]
\begin{align*}
\hat{\mathcal{O}} &= \arg\max_{\mathcal{O}} p(\mathcal{X} \mid \mathcal{O}) && \text{Vocoder analysis} \\
\hat{\mathcal{L}} &= \arg\max_{\mathcal{L}} p(\mathcal{L} \mid \mathcal{W}) && \text{Text analysis} \\
\hat{\lambda} &= \arg\max_{\lambda} p(\hat{\mathcal{O}} \mid \hat{\mathcal{L}}, \lambda)\, p(\lambda) && \text{Extract rules} \\
\hat{\mathbf{l}} &= \arg\max_{\mathbf{l}} p(\mathbf{l} \mid \mathbf{w}) && \text{Text analysis} \\
\hat{\mathbf{o}} &= \arg\max_{\mathbf{o}} p(\mathbf{o} \mid \hat{\mathbf{l}}, \hat{\lambda}) && \text{Apply rules} \\
\bar{\mathbf{x}} &\sim f_{\mathbf{x}}(\hat{\mathbf{o}}) = p(\mathbf{x} \mid \hat{\mathbf{o}}) && \text{Vocoder synthesis}
\end{align*}
→ Hand-crafted rules on knowledge-based features
Representation – Mapping
HMM-based [4], statistical parametric synthesis [5]
\begin{align*}
\hat{\mathcal{O}} &= \arg\max_{\mathcal{O}} p(\mathcal{X} \mid \mathcal{O}) && \text{Vocoder analysis} \\
\hat{\mathcal{L}} &= \arg\max_{\mathcal{L}} p(\mathcal{L} \mid \mathcal{W}) && \text{Text analysis} \\
\hat{\lambda} &= \arg\max_{\lambda} p(\hat{\mathcal{O}} \mid \hat{\mathcal{L}}, \lambda)\, p(\lambda) && \text{Train HMMs} \\
\hat{\mathbf{l}} &= \arg\max_{\mathbf{l}} p(\mathbf{l} \mid \mathbf{w}) && \text{Text analysis} \\
\hat{\mathbf{o}} &= \arg\max_{\mathbf{o}} p(\mathbf{o} \mid \hat{\mathbf{l}}, \hat{\lambda}) && \text{Parameter generation} \\
\bar{\mathbf{x}} &\sim f_{\mathbf{x}}(\hat{\mathbf{o}}) = p(\mathbf{x} \mid \hat{\mathbf{o}}) && \text{Vocoder synthesis}
\end{align*}
→ Replace the rules with an HMM-based generative acoustic model
Outline
Generative TTS
Generative acoustic models for parametric TTS: Hidden Markov models (HMMs), Neural networks
Beyond parametric TTS: Learned features, WaveNet, End-to-end
Conclusion & future topics
HMM-based generative acoustic model for TTS
- Context-dependent subword HMMs
- Decision trees to cluster & tie HMM states → interpretable

[Figure: linguistic features l = (l_1, ..., l_N) drive a left-to-right HMM with hidden states q_1, q_2, ... (discrete) emitting acoustic features o_1, o_2, ... (continuous).]

\begin{align*}
p(\mathbf{o} \mid \mathbf{l}, \lambda) &= \sum_{\forall q} \prod_{t=1}^{T} p(\mathbf{o}_t \mid q_t, \lambda)\, P(q \mid \mathbf{l}, \lambda) \qquad q_t : \text{hidden state at } t \\
&= \sum_{\forall q} \prod_{t=1}^{T} \mathcal{N}(\mathbf{o}_t; \boldsymbol{\mu}_{q_t}, \boldsymbol{\Sigma}_{q_t})\, P(q \mid \mathbf{l}, \lambda)
\end{align*}
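The sum over all state sequences q above is computed efficiently with the forward algorithm. A minimal sketch for a two-state left-to-right HMM with 1-D Gaussian emissions — all parameter values below are illustrative, not trained:

```python
import math

def gaussian(o, mu, var):
    """N(o; mu, var), the per-state emission density."""
    return math.exp(-0.5 * (o - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def forward_likelihood(obs, pi, A, mus, vars_):
    """p(o_1..o_T | lambda) = sum over all q of prod_t p(o_t|q_t) P(q),
    computed in O(T * n_states^2) via the alpha recursion."""
    n = len(pi)
    alpha = [pi[j] * gaussian(obs[0], mus[j], vars_[j]) for j in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * gaussian(o, mus[j], vars_[j])
            for j in range(n)
        ]
    return sum(alpha)

pi = [1.0, 0.0]                 # always start in state 0
A = [[0.5, 0.5], [0.0, 1.0]]    # left-to-right transitions
mus, vars_ = [0.0, 3.0], [1.0, 1.0]
lik = forward_likelihood([0.1, 2.9, 3.1], pi, A, mus, vars_)
```

The recursion replaces the exponential sum over sequences with a per-frame update, which is what makes HMM training and parameter generation tractable.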
HMM-based generative acoustic model for TTS
- Non-smooth, step-wise statistics → smoothing is essential
- Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
- Data fragmentation → ineffective, local representation

A great deal of research has been done to address these issues.
Alternative acoustic model
HMM: handles variable length & alignment. Decision tree: maps linguistic → acoustic.

[Figure: a binary regression tree (yes/no questions on linguistic features l) whose leaves hold statistics of the acoustic features o.]

Regression tree: linguistic features → statistics of acoustic features.
Replace the tree with a general-purpose regression model → artificial neural network.
FFNN-based acoustic model for TTS [6]
Input: frame-level linguistic feature l_t. Target: frame-level acoustic feature o_t.
\begin{align*}
\mathbf{h}_t &= g\left(W_{hl}\, \mathbf{l}_t + \mathbf{b}_h\right) \\
\hat{\mathbf{o}}_t &= W_{oh}\, \mathbf{h}_t + \mathbf{b}_o \\
\lambda &= \{W_{hl}, W_{oh}, \mathbf{b}_h, \mathbf{b}_o\} \\
\hat{\lambda} &= \arg\min_{\lambda} \sum_t \|\mathbf{o}_t - \hat{\mathbf{o}}_t\|^2
\end{align*}
ô_t ≈ E[o_t | l_t] → replaces the decision trees & Gaussian distributions
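One forward pass of this frame-level FFNN and its squared-error loss can be written out directly from the equations. The dimensions and weight values below are toy choices for illustration:

```python
import math

def tanh_vec(v):
    return [math.tanh(x) for x in v]          # g(.): elementwise non-linearity

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def ffnn_forward(l_t, W_hl, b_h, W_oh, b_o):
    """h_t = g(W_hl l_t + b_h); o_hat_t = W_oh h_t + b_o (linear output)."""
    h_t = tanh_vec([a + b for a, b in zip(matvec(W_hl, l_t), b_h)])
    return [a + b for a, b in zip(matvec(W_oh, h_t), b_o)]

def mse(o_seq, o_hat_seq):
    """sum_t ||o_t - o_hat_t||^2: the training criterion."""
    return sum(sum((a - b) ** 2 for a, b in zip(o, oh))
               for o, oh in zip(o_seq, o_hat_seq))

W_hl, b_h = [[0.5, -0.5], [1.0, 0.0]], [0.0, 0.0]  # 2 linguistic dims -> 2 hidden
W_oh, b_o = [[1.0, 1.0]], [0.1]                    # 2 hidden -> 1 acoustic dim
o_hat = ffnn_forward([1.0, 0.0], W_hl, b_h, W_oh, b_o)
err = mse([[1.0]], [o_hat])
```

Minimizing this loss makes the network output approximate the conditional mean E[o_t | l_t], which is exactly the replacement for the tied-state Gaussian means.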
RNN-based acoustic model for TTS [7]
Input: frame-level linguistic feature l_t, with recurrent connections. Target: frame-level acoustic feature o_t.
\begin{align*}
\mathbf{h}_t &= g\left(W_{hl}\, \mathbf{l}_t + W_{hh}\, \mathbf{h}_{t-1} + \mathbf{b}_h\right) \\
\hat{\mathbf{o}}_t &= W_{oh}\, \mathbf{h}_t + \mathbf{b}_o \\
\lambda &= \{W_{hl}, W_{hh}, W_{oh}, \mathbf{b}_h, \mathbf{b}_o\} \\
\hat{\lambda} &= \arg\min_{\lambda} \sum_t \|\mathbf{o}_t - \hat{\mathbf{o}}_t\|^2
\end{align*}
FFNN: ô_t ≈ E[o_t | l_t].  RNN: ô_t ≈ E[o_t | l_1, ..., l_t].
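The only change from the FFNN is the recurrent term W_hh h_{t−1}, which makes the prediction depend on the whole input history. A 1-D toy sketch (all weights are arbitrary illustrative values):

```python
import math

def rnn_forward(l_seq, w_hl, w_hh, b_h, w_oh, b_o):
    """h_t = tanh(w_hl*l_t + w_hh*h_{t-1} + b_h); o_hat_t = w_oh*h_t + b_o."""
    h, o_hats = 0.0, []
    for l_t in l_seq:
        h = math.tanh(w_hl * l_t + w_hh * h + b_h)  # recurrent hidden state
        o_hats.append(w_oh * h + b_o)               # linear output layer
    return o_hats

# Feed the same linguistic frame three times: an FFNN would emit the same
# value each time, while the RNN's output evolves as its state accumulates —
# this is the "smoothly varying" trajectory behaviour.
outs = rnn_forward([1.0, 1.0, 1.0], w_hl=1.0, w_hh=0.5, b_h=0.0, w_oh=1.0, b_o=0.0)
```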
NN-based generative acoustic model for TTS
- Non-smooth, step-wise statistics → an RNN predicts smoothly varying acoustic features [7, 8]
- Difficult to use high-dimensional acoustic features (e.g., raw spectra) → a layered architecture can handle high-dimensional features [9]
- Data fragmentation → distributed representation [10]

The NN-based approach is now mainstream in research & products:
- Models: FFNN [6], MDN [11], RNN [7], Highway network [12], GAN [13]
- Products: e.g., Google [14]
Outline
Generative TTS
Generative acoustic models for parametric TTS: Hidden Markov models (HMMs), Neural networks
Beyond parametric TTS: Learned features, WaveNet, End-to-end
Conclusion & future topics
NN-based generative model for TTS
[Figure: the speech production process again — text modulates the carrier wave (air flow) with speech information to produce speech.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform
Knowledge-based features → Learned features
Unsupervised feature learning
[Figure: unsupervised feature learning — speech: an auto-encoder compresses the raw FFT spectrum x(t) into an acoustic feature o(t); text: 1-hot representations of words w(n) (e.g., "hello", "world") are embedded into linguistic features l(n).]

- Speech: auto-encoders on FFT spectra [9, 15] → positive results
- Text: word [16], phone & syllable [17] → less positive
Relax approximation: Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:
\[
\begin{cases}
\hat{\mathcal{O}} = \arg\max_{\mathcal{O}} p(\mathcal{X} \mid \mathcal{O}) \\
\hat{\lambda} = \arg\max_{\lambda} p(\hat{\mathcal{O}} \mid \hat{\mathcal{L}}, \lambda)\, p(\lambda)
\end{cases}
\quad\Downarrow\quad
\{\hat{\lambda}, \hat{\mathcal{O}}\} = \arg\max_{\lambda, \mathcal{O}} p(\mathcal{X} \mid \mathcal{O})\, p(\mathcal{O} \mid \hat{\mathcal{L}}, \lambda)\, p(\lambda)
\]

Joint source-filter & acoustic model optimization:
- HMM [18, 19, 20]
- NN [21, 22]
Relax approximation: Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [22]

[Figure: linguistic features l_t feed an LSTM-RNN that predicts cepstra for voiced and unvoiced components; a pulse train p_n is passed through filters (G(z), H_u^{-1}(z)) to reconstruct the speech waveform x_n, and derivatives are back-propagated through the entire analysis/synthesis chain.]
Relax approximation: Direct mapping from linguistic features to waveform

No explicit acoustic features:
\[
\{\hat{\lambda}, \hat{\mathcal{O}}\} = \arg\max_{\lambda, \mathcal{O}} p(\mathcal{X} \mid \mathcal{O})\, p(\mathcal{O} \mid \hat{\mathcal{L}}, \lambda)\, p(\lambda)
\quad\Downarrow\quad
\hat{\lambda} = \arg\max_{\lambda} p(\mathcal{X} \mid \hat{\mathcal{L}}, \lambda)\, p(\lambda)
\]

Generative models for raw audio:
- LPC [23]
- WaveNet [24]
- SampleRNN [25]
WaveNet: A generative model for raw audio
Autoregressive (AR) modelling of speech signals:
\begin{align*}
\mathbf{x} &= \{x_0, x_1, \ldots, x_{N-1}\} : \text{raw waveform} \\
p(\mathbf{x} \mid \lambda) &= p(x_0, x_1, \ldots, x_{N-1} \mid \lambda) = \prod_{n=0}^{N-1} p(x_n \mid x_0, \ldots, x_{n-1}, \lambda)
\end{align*}
WaveNet [24] → p(x_n | x_0, ..., x_{n−1}, λ) is modeled by a convolutional NN.

Key components:
- Causal dilated convolution: captures long-term dependency
- Gated convolution + residual + skip connections: powerful non-linearity
- Softmax at the output: classification rather than regression
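The AR factorization implies sample-by-sample generation: draw x_n from p(x_n | x_0..x_{n−1}), append it to the history, repeat. The "model" below is a hypothetical stand-in for the WaveNet convolutional network, returning a categorical distribution over just 4 amplitude levels:

```python
import random

def toy_model(history):
    """Stand-in for the conv net: p(x_n | x_0..x_{n-1}) over 4 levels."""
    if not history:
        return [0.25, 0.25, 0.25, 0.25]       # uniform prior for the first sample
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[history[-1]] = 0.7                  # toy dependency: favor the last level
    return probs

def generate(n_samples, seed=0):
    """Ancestral sampling: each new sample is fed back as conditioning input."""
    random.seed(seed)
    x = []
    for _ in range(n_samples):
        probs = toy_model(x)                  # p(x_n | x_0, ..., x_{n-1})
        x.append(random.choices(range(4), weights=probs)[0])
    return x

samples = generate(100)
```

This loop is inherently sequential, which is why naive WaveNet synthesis is slow: each of the N samples requires a full forward pass conditioned on the ones before it.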
WaveNet – Causal dilated convolution
100 ms at 16 kHz sampling = 1,600 time steps → too long to be captured by a normal RNN/LSTM.

Dilated convolution: exponentially increases the receptive field size w.r.t. the number of layers.

[Figure: input x_{n−16}, ..., x_{n−1} passes through hidden layers with dilations 1, 2, 4, 8 to the output p(x_n | x_{n−16}, ..., x_{n−1}).]
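The exponential growth of the receptive field is easy to verify: with kernel size 2, each layer with dilation d adds d more samples of look-back, so dilations 1, 2, 4, ... double the total reach per layer.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of (current + past) samples visible to the top layer of a
    stack of causal dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

four_layers = receptive_field([1, 2, 4, 8])            # the 4-layer stack in the figure
ten_layers = receptive_field([2 ** i for i in range(10)])
```

Four layers already see 16 samples, matching the figure; ten layers see 1,024, and stacking a few such blocks covers the ~1,600 samples of 100 ms at 16 kHz.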
WaveNet – Non-linearity
[Figure: a stack of 30 residual blocks with skip connections. Each block: 2×1 dilated convolution → gated activation → 1×1 convolution (512 → 256 channels), feeding both the next residual block and a skip connection. The summed skip outputs pass through ReLU → 1×1 (256) → ReLU → 1×1 (256) → softmax (256) to give p(x_n | x_0, ..., x_{n−1}).]
WaveNet – Softmax
[Figure, three builds: an analog audio signal (amplitude over time) is sampled & quantized; each sample becomes a category index (16 shown), so the output is a categorical distribution — a histogram — which can be unimodal, multimodal, or skewed.]
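Turning a continuous sample into one of 256 categories for the softmax is a companding + quantization step; the WaveNet paper [24] uses μ-law companding so that the 256 levels are spent more evenly across typical amplitudes. A sketch of that encode/decode pair:

```python
import math

def mu_law_encode(x, mu=255):
    """x in [-1, 1] -> category index in {0, ..., mu} (256 softmax classes)."""
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)  # compand
    return int((y + 1) / 2 * mu + 0.5)                              # quantize

def mu_law_decode(idx, mu=255):
    """Category index -> approximate amplitude in [-1, 1]."""
    y = 2 * idx / mu - 1
    return math.copysign((math.exp(abs(y) * math.log1p(mu)) - 1) / mu, y)

idx = mu_law_encode(0.5)       # one softmax class per sample
x_rec = mu_law_decode(idx)     # close to 0.5, small quantization error
```

Treating each sample as a 256-way classification is what lets the output distribution be unimodal, multimodal, or skewed, as the figure shows — no Gaussian assumption is imposed.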
WaveNet – Conditional modelling
[Figure: conditional WaveNet — linguistic features l are embedded into a conditioning vector h_n at time n and added (via 1×1 convolutions) inside the gated activation of every residual block, giving p(x_n | x_0, ..., x_{n−1}, l).]
WaveNet vs conventional audio generative models
Assumptions in conventional audio generative models [23, 26, 27]:
- Stationary process with a fixed-length analysis window → the model is estimated within a short (tens of ms) window with a fixed shift
- Linear, time-invariant filter within a frame → but the relationship between samples can be non-linear
- Gaussian process → assumes speech signals are normally distributed

WaveNet:
- Sample-by-sample, non-linear, able to take additional inputs
- Arbitrary-shaped signal distribution

State-of-the-art subjective naturalness with WaveNet-based TTS [24].
[Figure: subjective scores — HMM, LSTM, Concatenative, WaveNet]
Relax approximation: Towards Bayesian end-to-end TTS

Integrated end-to-end:
\[
\begin{cases}
\hat{\mathcal{L}} = \arg\max_{\mathcal{L}} p(\mathcal{L} \mid \mathcal{W}) \\
\hat{\lambda} = \arg\max_{\lambda} p(\mathcal{X} \mid \hat{\mathcal{L}}, \lambda)\, p(\lambda)
\end{cases}
\quad\Downarrow\quad
\hat{\lambda} = \arg\max_{\lambda} p(\mathcal{X} \mid \mathcal{W}, \lambda)\, p(\lambda)
\]

Text analysis is integrated into the model.
Relax approximation: Towards Bayesian end-to-end TTS

Bayesian end-to-end:
\[
\begin{cases}
\hat{\lambda} = \arg\max_{\lambda} p(\mathcal{X} \mid \mathcal{W}, \lambda)\, p(\lambda) \\
\bar{\mathbf{x}} \sim f_{\mathbf{x}}(\mathbf{w}, \hat{\lambda}) = p(\mathbf{x} \mid \mathbf{w}, \hat{\lambda})
\end{cases}
\quad\Downarrow
\]
\begin{align*}
\bar{\mathbf{x}} \sim f_{\mathbf{x}}(\mathbf{w}, \mathcal{X}, \mathcal{W}) = p(\mathbf{x} \mid \mathbf{w}, \mathcal{X}, \mathcal{W}) &= \int p(\mathbf{x} \mid \mathbf{w}, \lambda)\, p(\lambda \mid \mathcal{X}, \mathcal{W})\, d\lambda \\
&\approx \frac{1}{K} \sum_{k=1}^{K} p(\mathbf{x} \mid \mathbf{w}, \hat{\lambda}_k) \quad \text{(ensemble)}
\end{align*}

Marginalize over model parameters & architecture.
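The Monte Carlo ensemble approximation above just averages K models' predictive distributions. A toy sketch where each hypothetical "model" λ_k is reduced to a categorical distribution over four outcomes:

```python
def ensemble_predictive(model_dists):
    """p(x | w, X, W) ~= (1/K) sum_k p(x | w, lambda_k):
    elementwise average of K predictive distributions."""
    K = len(model_dists)
    n = len(model_dists[0])
    return [sum(d[i] for d in model_dists) / K for i in range(n)]

# Three hypothetical models' predictive distributions p(x | w, lambda_k)
dists = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.4, 0.4, 0.1, 0.1],
]
p = ensemble_predictive(dists)   # still a valid distribution (sums to 1)
```

Averaging valid distributions always yields a valid distribution, so the ensemble is a proper (if crude) stand-in for the intractable integral over λ.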
Outline
Generative TTS
Generative acoustic models for parametric TTS: Hidden Markov models (HMMs), Neural networks
Beyond parametric TTS: Learned features, WaveNet, End-to-end
Conclusion & future topics
Generative model-based text-to-speech synthesis
- Bayes formulation + factorization + approximations
- Representation: acoustic features, linguistic features, mapping
  - Mapping: rules → HMM → NN
  - Features: engineered → unsupervised, learned
- Fewer approximations
  - Joint training, direct waveform modelling
  - Moving towards integrated & Bayesian end-to-end TTS
Naturalness: Concatenative ≈ Generative
Flexibility: Concatenative ≪ Generative (e.g., multiple speakers)
Beyond “text”-to-speech synthesis
TTS on conversational assistants
- Texts aren't fully self-contained
- Need more context:
  - Location, to resolve homographs
  - The user query, to put emphasis in the right place

We need a representation that can organize the world's information & make it accessible & useful from TTS generative models :)
Beyond “generative” TTS
Generative model-based TTS
- The model represents the process behind speech production
  - Trained to minimize error against human-produced speech
  - Learned model → speaker
- Speech is for communication
  - Goal: maximize the amount of information received

The "listener" is missing → include a "listener" in training, or in the model itself?
References I
[1] D. Klatt. Real-time speech synthesis by rule. Journal of ASA, 68(S1):S18, 1980.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] K. Tokuda. Speech synthesis as a statistical machine learning problem. https://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf. Invited talk given at ASRU 2011.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst., J83-D-II(11):2099–2107, 2000. (in Japanese).
[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.
References II
[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. http://research.google.com/pubs/pub44630.html. Invited talk given at ASRU 2015.
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representation. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW9, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.
References III
[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, pages 4879–4883, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., E99-D(10), 2016.
[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, 2008.
[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, pages 577–580, 2008.
[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW7, pages 88–93, 2010.
[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.
[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, 2016.
References IV
[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ, 2010. (in Japanese).
[Summary figure: evolution of the graphical model —
(1) Bayesian formulation
(2) Auxiliary variables + factorization
(3) Joint maximization
(4) Step-by-step maximization (e.g., statistical parametric TTS)
(5) Joint acoustic feature extraction + model training
(6) Conditional WaveNet-based TTS
(7) Integrated end-to-end
(8) Bayesian end-to-end]