![Page 1: Hidden Markov Model-based speech synthesishomepages.inf.ed.ac.uk/ckiw/rpml/HMM_speech_synthesis.pdf · 2009-09-13 · Hidden Markov Model-based speech synthesis Junichi Yamagishi,](https://reader034.vdocuments.site/reader034/viewer/2022042114/5e918d35fdd28d0c5b721133/html5/thumbnails/1.jpg)
Hidden Markov Model-based speech synthesis
Junichi Yamagishi, Korin Richmond, Simon King and many others
Centre for Speech Technology Research
University of Edinburgh, UK
www.cstr.ed.ac.uk
Note
• I did not invent HMM-based speech synthesis!
• Core idea: Tokuda (Nagoya Institute of Technology, Japan)
• Developments: many other people
• Speaker adaptation: Junichi Yamagishi (Edinburgh) and colleagues
Background
Speech synthesis mini-tutorial
• Text to speech
• input: text
• output: a waveform that can be listened to
• Two main components
• front end: analyses text and converts to linguistic specification
• waveform generation: converts linguistic specification to speech
From words to linguistic specification
sil^dh-ax+k=ae, "phrase initial", "unstressed syllable", ...
sil dh ax k ae t s ae t sil
((the cat) sat)
DET NN VB
phrase final, phrase initial, pitch accent
"the cat sat"
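Full context models used in synthesis
aa^b-l+ax=s@1_3/A:1_1_3/B:0-0-3@2-1&3-3#2-2$2-3!1- .....
• phonetic context: the current phone and its neighbours (aa^b-l+ax=s)
• prosodic context: fields such as /A:1_1_3/B:0-0-3@2-1&3-3...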
Example linguistic specification
pau^pau-pau+ao=th@x_x/A:0_0_0/B:x-x-x@x-x&x-x#x-x$.....
pau^pau-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....
pau^ao-th+er=ah@2_1/A:0_0_0/B:1-1-2@1-2&1-7#1-4$.....
ao^th-er+ah=v@1_1/A:1_1_2/B:0-0-1@2-1&2-6#1-4$.....
th^er-ah+v=dh@1_2/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....
er^ah-v+dh=ax@2_1/A:0_0_1/B:1-0-2@1-1&3-5#1-3$.....
ah^v-dh+ax=d@1_2/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....
v^dh-ax+d=ey@2_1/A:1_0_2/B:0-0-2@1-1&4-4#2-3$.....
“Author of the ...”
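To make the label layout concrete, here is a minimal Python sketch that splits one full-context label into its five-phone window and the remaining context string. The regular expression and the names `QUINPHONE` and `split_label` are assumptions made for illustration; they are inferred from the example labels on these slides, not taken from any official label specification.

```python
import re

# Assumed layout, read off the example labels: LL^L-C+R=RR@<remaining context>
QUINPHONE = re.compile(
    r"^(?P<ll>[^\^]+)\^(?P<l>[^-]+)-(?P<c>[^+]+)\+(?P<r>[^=]+)=(?P<rr>[^@]+)@(?P<rest>.*)$"
)


def split_label(label):
    """Return (phone window as a dict, remaining context string) for one label."""
    m = QUINPHONE.match(label)
    if m is None:
        raise ValueError(f"unrecognised label: {label!r}")
    phones = {k: m.group(k) for k in ("ll", "l", "c", "r", "rr")}
    return phones, m.group("rest")


phones, rest = split_label("pau^pau-ao+th=er@1_2/A:0_0_0/B:1-1-2@1-2&1-7#1-4$")
print(phones["c"], rest)  # centre phone, then the untouched remaining fields
```

The remaining fields are left unparsed on purpose, because their meaning is not spelled out on the slides.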
From linguistic specification to speech
• Two possible methods
• Concatenate small pieces of pre-recorded speech
• Generate speech from a model
HMM mini-tutorial
• HMMs are models of sequences
• speech signals
• gene sequences
• etc
HMMs
• an HMM consists of
• sequence model: a weighted finite state network of states and transitions
• observation model: multivariate Gaussian distribution in each state
• can generate from the model
• can also be used for pattern recognition (e.g., automatic speech recognition)
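As a rough illustration of "generating from the model", the sketch below samples a state sequence and observations from a small left-to-right HMM. It assumes one-dimensional Gaussians per state (the slide describes multivariate ones), and the function name `sample_hmm` and the example parameters are made up for this sketch.

```python
import numpy as np


def sample_hmm(trans, means, stds, n_frames, rng):
    """Sample (states, observations) from an HMM that starts in state 0."""
    state = 0
    states, obs = [], []
    for _ in range(n_frames):
        states.append(state)
        # observation model: draw from the Gaussian of the current state
        obs.append(rng.normal(means[state], stds[state]))
        # sequence model: move according to the transition probabilities
        state = rng.choice(len(means), p=trans[state])
    return np.array(states), np.array(obs)


# Illustrative 3-state left-to-right model (values are arbitrary)
trans = np.array([[0.8, 0.2, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
states, obs = sample_hmm(trans, means=[0.0, 2.0, -1.0], stds=[0.1, 0.1, 0.1],
                         n_frames=20, rng=np.random.default_rng(0))
```

Sampling like this gives noisy output; for synthesis the slides instead generate the most likely parameter sequence.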
HMMs are generative models
HMM-based speech synthesis mini-tutorial
• HMMs are used to generate sequences of speech (in a parameterised form)
• From the parameterised form, we can generate a waveform
• The parameterised form contains sufficient information to generate speech:
• spectral envelope
• fundamental frequency (F0) - sometimes called ‘pitch’
• aperiodic (noise-like) components (e.g. for sounds like ‘sh’ and ‘f’)
Trajectory HMMs
• Using an HMM to generate speech parameters
• because of the Markov assumption, the most likely output is the sequence of the means of the Gaussians in the states visited
• this is piecewise constant, and ignores important dynamic properties of speech
• Trajectory HMM algorithm (Tokuda and colleagues)
• solves this problem, by correctly using statistics of the dynamic properties during the generation process
Generation
• Generate the most likely observation sequence from the HMM
• but use the statistics of not only the static coefficients but also the delta and delta-delta coefficients
• Maximum Likelihood Parameter Generation Algorithm
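The idea can be sketched for a single one-dimensional parameter stream. The code below is a simplified illustration, not the HTS implementation: it assumes a known state alignment, diagonal covariances, static and delta features only (no delta-delta), and a delta window of 0.5 * (c[t+1] - c[t-1]) with clamping at the edges. The function name `mlpg` and the argument layout are assumptions. It solves (W' P W) c = W' P mu, where W maps the static trajectory c to stacked static and delta features, P is the diagonal precision matrix and mu the stacked means.

```python
import numpy as np


def mlpg(means, variances):
    """Most likely static trajectory given per-frame [static, delta] statistics.

    means, variances: arrays of shape (T, 2), one row per frame, taken from
    the Gaussian of the state assigned to that frame.
    """
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                        # static row: c[t]
        W[2 * t + 1, max(t - 1, 0)] -= 0.5       # delta row: 0.5 * (c[t+1] - c[t-1])
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    mu = means.reshape(-1)                       # [static_0, delta_0, static_1, ...]
    prec = 1.0 / variances.reshape(-1)           # diagonal precisions
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

With very large delta variances the result falls back to the piecewise-constant sequence of static means; tighter delta variances pull the trajectory towards a smooth curve.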
Trajectory HMMs
[Figure, built up over several slides: a speech parameter (vertical axis) plotted against time (horizontal axis)]
Constructing the HMM
• Linguistic specification (from the front end) is a sequence of phonemes, annotated with contextual information
• There is one 5-state HMM for each phoneme, in every required context
• To synthesise a given sentence,
• use front end to predict the linguistic specification
• concatenate the corresponding HMMs
• generate from the HMM
Sparsity problem!
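The synthesis steps can be sketched as a lookup and concatenation. In this hypothetical sketch, `model_set` is an assumed dictionary from full-context label to a list of per-state parameters (five entries per phoneme model); neither the name nor the container comes from the slides. The lookup failure shows where sparsity bites: a context never seen in training has no model of its own.

```python
def build_sentence_hmm(labels, model_set):
    """Concatenate per-context state lists into one left-to-right state sequence.

    labels: full-context labels predicted by the front end, in order.
    model_set: {label: list of per-state parameters} (hypothetical container).
    """
    states = []
    for label in labels:
        if label not in model_set:
            # sparsity: this exact context has no trained model
            raise KeyError(f"no model for context {label!r}")
        states.extend(model_set[label])
    return states
```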
HMM-based speech synthesis
• Differences from automatic speech recognition include
• Synthesis uses a much richer model set, with a lot more context
• For speech recognition: triphone models
• For speech synthesis: “full context” models
• “Full context” = both phonetic and prosodic factors
• Observation vector for HMMs contains the necessary parameters to generate speech, such as spectral envelope + F0 + multi-band noise amplitudes
Sparsity
• In practically all speech or language applications, sparsity is a problem
• Distribution of classes is usually long-tailed (Zipf-like)
• We also ‘create’ even more sparsity by using context-dependent models
• thus, most models have no training data at all
• Common solution is to merge classes or contexts
• i.e., use the same model for several classes or contexts
• for HMMs, we call this ‘parameter tying’
Decision-tree-based clustering
[Figure: context-dependent HMMs are clustered by a binary decision tree with yes/no questions such as "Voiced?" and "Vowel?". Formula annotations: description length; state occupancy probability for node; covariance matrix for node; dimension.]
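A single split of such a tree can be sketched as follows. This is a simplified stand-in, not the criterion on the slide: it scores each yes/no question by the plain gain in Gaussian log-likelihood for one-dimensional data, whereas a description-length criterion would additionally penalise the growth in model size. The names `gauss_loglik` and `best_question`, the variance floor and the toy data are all assumptions.

```python
import math


def gauss_loglik(values, var_floor=1e-3):
    """Log-likelihood of values under one Gaussian fitted to them (floored variance)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    sq = sum((v - mean) ** 2 for v in values)
    var = max(sq / n, var_floor)
    return -0.5 * (n * math.log(2 * math.pi * var) + sq / var)


def best_question(data, questions):
    """data: {context: [observations]}; questions: {name: set of contexts answering yes}."""
    pooled = [v for vals in data.values() for v in vals]
    base = gauss_loglik(pooled)
    gains = {}
    for name, yes_set in questions.items():
        yes = [v for c, vals in data.items() if c in yes_set for v in vals]
        no = [v for c, vals in data.items() if c not in yes_set for v in vals]
        gains[name] = gauss_loglik(yes) + gauss_loglik(no) - base
    return max(gains, key=gains.get), gains


# Toy data: contexts that answer a question the same way share one model
data = {"aa": [5.0, 5.2], "iy": [4.8, 5.1], "b": [0.1, -0.1], "s": [0.0, 0.2]}
questions = {"Vowel?": {"aa", "iy"}, "Voiced?": {"aa", "iy", "b"}}
best, gains = best_question(data, questions)
```

Applying the chosen question and recursing on each side until no split is worthwhile yields the tied model set.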
Model parameter estimation from ‘labelled’ data
• Actually, we only have word labels for the training data
• Convert these to full linguistic specification using the front end of our text-to-speech system (text processing, pronunciation, prosody)
• these labels will not exactly match the speech signal (we do a few tricks to try to make the match closer, but it’s never perfect)
• We still know only the model sequence, and have no information about the state alignment
• So, we use EM (we could call this ‘semi-supervised’ learning)
Model adaptation
• Training the models needs 1000+ sentences of data from one speaker
• What if we have insufficient data for this target speaker?
• Adaptation:
• Train the model on lots of data from other speakers
• Adapt the trained model’s parameters using a small amount of target speaker data
• estimate linear transforms to maximise the likelihood (MLLR)
• also in combination with MAP
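To give a feel for the linear-transform idea, the sketch below estimates one global affine transform of the Gaussian means, so that adapted mean = A @ mean + b. It is a deliberately simplified illustration: it assumes a hard state alignment of the adaptation frames and identity covariances, under which the maximum-likelihood estimate reduces to ordinary least squares. Real MLLR weights the statistics by state occupancies and covariances and usually uses several regression classes. The function name `estimate_mean_transform` is an assumption.

```python
import numpy as np


def estimate_mean_transform(frames, state_ids, means):
    """Least-squares (A, b) such that frames[t] ~= A @ means[state_ids[t]] + b.

    frames: (N, d) adaptation data; state_ids: length-N alignment;
    means: (S, d) state means of the average voice model.
    """
    # extended mean vectors [mean, 1], one row per adaptation frame
    xi = np.hstack([means[state_ids], np.ones((len(state_ids), 1))])
    w_t, *_ = np.linalg.lstsq(xi, frames, rcond=None)  # solves xi @ W.T ~= frames
    w = w_t.T
    return w[:, :-1], w[:, -1]
```

Because one transform is shared by every state, a small amount of target speaker data can move all the means, including those of states that never occur in the adaptation data.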
Training, adaptation, synthesis
[Figure, built up over several slides:
• Train: speech and labels from several speakers (awb, clb, rms, ...) give an average voice model
• Adapt: speech and labels from a target speaker (bdl), with the average voice model, give transforms; one build also shows a "Recognise" step
• Synthesise: labels for a test sentence, with the average voice model and transforms, give synthetic speech]
Evaluation
• Objective measures that compare synthetic speech with a natural example (e.g., spectral distortion) have their uses, but don’t necessarily correlate with human perception
• main problem: there is more than one ‘correct answer’ in speech synthesis
• a single natural example does not capture this
• So, we mainly rely on playing examples to listeners
• opinion scores for quality & naturalness, typically on 5-point scales
• objective measures of intelligibility (type-in tests)
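For type-in tests, intelligibility is commonly summarised as word error rate (WER): the word-level edit distance between what the listener typed and the reference text, divided by the number of reference words. The sketch below is one standard way to compute it; the function name `word_error_rate` is chosen for illustration, and real evaluations typically normalise spelling and punctuation first, which is not shown.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[j]: edit distance between the reference prefix seen so far and hyp[:j]
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dist[j] + 1,            # reference word missing from hypothesis
                      dist[j - 1] + 1,        # extra word in hypothesis
                      prev_diag + (r != h))   # substitution, or match at no cost
            prev_diag, dist[j] = dist[j], cur
    return dist[-1] / len(ref)
```

For example, typing "the hat sat" for the reference "the cat sat" is one substitution in three words.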
Intelligibility (WER), English
[Figure: two bar charts of word error rate (WER, %, axis 0 to 100) per system, all listeners:
• Word error rate for voice A: systems A J S K B P O V M C L E G Q T F H D R I N, n = 245 to 248 per system
• Word error rate for voice B: systems A J S B O V M C L E G Q T F H D R I N, n = 198 to 199 per system]
System key: A = natural speech, B = Festival benchmark, C = HTS 2005 benchmark, V = HTS 2008 (aka HTS 2007’)
• No significant difference between A, V and T
• No significant difference between A, C, V and T
• HTS is as intelligible as human speech
Recent extensions
Articulatory-controllable HMM-based speech synthesis
• can manipulate articulator positions explicitly
• ability to synthesise new phonemes, not seen in training data
• requires parallel articulatory+acoustic corpus, which we have in CSTR
[Figure: tongue height offsets of -1.5, -1.0, -0.5, default, +0.5, +1.0 and +1.5 cm; labelled example: "set"]
Dirichlet process HMMs
[Figure: three plots of duration (ms, axis 0 to 500):
• Japanese vowel: a e i o u
• English vowel: aa ae ah ao aw ax ay eh el em en er ey ih iy ow oy uh uw
• Mandarin final: a ai an ang ao e ei en eng er i ia ian iang iao ic ich ie in ing iong iu o ong ou u ua uai uan uang ui un uo v van ve vn]
• Fixed number of states may not be optimal
• Cross-validation, information criteria (AIC, BIC, or MDL) or variational Bayes can be used for determining the number of states
• Or use Dirichlet process (HDP-HMM or infinite HMM)
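As one example of the information-criterion route, the sketch below compares candidate numbers of states using BIC = k ln(n) - 2 ln(L), where lower is better. The parameter count assumes a left-to-right HMM without skips and with diagonal-covariance Gaussians; that structure, and the function names, are assumptions for illustration. The training log-likelihoods must come from models fitted elsewhere.

```python
import math


def n_free_params(n_states, dim):
    # per state: dim means + dim variances (diagonal covariance) + 1 self-loop probability
    return n_states * (2 * dim + 1)


def bic(log_lik, n_states, dim, n_frames):
    """Bayesian information criterion; lower is better."""
    return n_free_params(n_states, dim) * math.log(n_frames) - 2.0 * log_lik


def pick_n_states(log_liks, dim, n_frames):
    """log_liks: {n_states: training log-likelihood}; returns the n_states with lowest BIC."""
    return min(log_liks, key=lambda n: bic(log_liks[n], n, dim, n_frames))
```

The penalty term grows with the number of states, so extra states are accepted only when they raise the likelihood enough to pay for their parameters.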
Summary
• HMM-based speech synthesis has many opportunities for using machine learning:
• learning the model from data
• parameters (alternatives to maximum likelihood such as minimum generation error)
• model complexity (context clustering, number of mixture components, number of states, ...)
• semi-supervised and unsupervised learning (labels for data are unreliable or missing)
• adapting the model, given limited new data
• generation algorithms