speech synthesis as a statistical machine learning problem slides

8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

1/66

Speech Synthesis as A Statistical

Keiichi Tokuda

Nagoya Institute of Technology

ASRU2011, HawaiiDecember 14, 2011


2/66

Introduction

Rule-based, formant synthesis ~90s Hand-crafting each phonetic units by rules

Corpus-based, concatenative synthesis (90s~)

Concatenate speech units (waveform) from a database

Single inventory: diphone synthesis

Multiple inventory: unit selection synthesisorpus- ase , s a s ca parame r c syn es s

Source-filter model + statistical acoustic model

Hidden Markov model: HMM-based synthesis

How we can formulate and understand the whole

corpus-based speech synthesis process in aunified statistical framework?


3/66

Problem of speech synthesis

We have a speech database, i.e., a set of texts and

corresponding speech waveforms.

Given a text to be synthesized, what is the speech

: s eech waveforms

: textsdatabase

Given

W

: text to be synthesizedw

: speec wave orm


4/66

Statistical formulation of speech synthesis (1/8)

ayes an ramewor or pre c on

: set of texts

database Given

: text to be synthesized

1. Estimate redictive distribution iven variables

: speech waveform unknown

3

2. Draw sample from the distribution


5/66


1. Estimating predictive distribution is hard Introduce acoustic model parameters

: acoustic model (e.g. HMM )


6/66


2. Using speech waveform directly is difficult Introduce parametric its representation

: parametric representation of speech waveform

(e.g., cepstrum, LPC, LSP, F0, aperiodicity)


7/66


3. Same texts can have multiple pronunciations, POS, etc. Introduce labels

: labels derived from text

e.g. prons, , ex ca s ress, grammar, pause


8/66


4. Difficult to perform integral & sum over auxiliary variables Approximated by joint max


9/66


5. Joint maximization is hard Approximated by step-by-step maximizations


10/66


6. Training also requires parametric form of wav & labels Introduce them & approx by step-by-step maximizations

: parametric representation of speech waveforms

: a e s er ve rom ex s


11/66



12/66

Overview of this talk

1. Mathematical formulation2. Implementation of individual components

3. Examples demonstrating its flexibility


13/66

HMM-based speech synthesis system

SPEECH

DATABASEExcitation Spectral

Speech signal Training part

Parameter

extraction

Parameter

ExtractionSpectralparameters

Excitation

parameters

Text analysis

Training HMMsLabels

TEXTContext-dependent HMMs

Text analysisParameter generation

from HMMs

LabelsExcitation Spectral

Excitation

generation

Synthesis

Filter

SYNTHESIZED

SPEECH

parame ersExcitation

parame ers

Synthesis part


14/66


SPEECH



Parameterextraction

Parameter


Excitation

parameters

Text analysis

,maxargL

Training HMMsLabels ),|(maxarg LO

P

TEXTContext-dependent HMMs),|(maxarg loo

oP


from HMMs

LabelsExcitation Spectral,maxar wll P

Excitation

generation

Synthesis

Filter

SYNTHESIZED

SPEECH


parame ers

Synthesis part

l


15/66


Training partSPEECHDATABASE

Excitation Spectral

Speech signal

Text analysis

Parameter

extraction

Parameter


Excitation

parameters

Training HMMsLabels



from HMMs


Synthesis part

Excitation

generation

Synthesis

Filter

SYNTHESIZED

SPEECH


parame ers


16/66

Human speech production

Modulation of carrier wave

by speech information

SpeechFrequency

transfer

characteristicsSound source

Voiced: pulse

Unvoiced: noise

Magnitude

start--end

air flow

frequency


17/66

Source-filter model

LinearExcitationPulse train

Source excitation part Vocal tract resonance part

time-invariant

system

zH

)(ne

)(*)()( nenhnx White noise

Speech


18/66

ML estimation of spectral parameter

Mel-cepstral representation of speech spectra)

m

m

zmczH 0 )(exp)(ncy

(ra

M

m

m

zmczH 0

~)(exp)(

rped

frequ

warped frequencymel-scale frequency

~

1

11

1

~ jez

zz

Wa

Frequency (rad)

0 2/

-est mat on o me -cepstrum

x : speech waveform (Gaussian process)

c

c : mel-cepstrum


19/66

Waveform reconstruction

Original speech

Mel-cepstrumUnvoiced / voicedF0

u se ra n

Synthesis filter

H)(ne )(nxWhite noise

These speech parameters should be modeled by HMM


20/66


SPEECH



Parameter

extraction

Parameter

ExtractionText analysis Spectralparameters

Excitation

parameters

Training HMMsLabels

),

|(maxarg

LO P



from HMMsLabelsExcitation Spectral

Excitation

generation

Synthesis

Filter

SYNTHESIZED

SPEECH


parame ers

Synthesis part


21/66

Hidden Markov model (HMM)

11a 22a 33a

12a 23a

ij

)( tqb o

: output probability

b o b o b o

1o 2o 3o 4o 5o To oObservation

sequence

1 1 1 1 2 2 3 3 qState

sequence


22/66

Structure of state output (observation) vector

Spectral parameters

(e.g., mel-cepstrum, LSPs)

Spectrum part

log F0 with V/UV


23/66

Observation of F0

1R

voiced0

2 Runvoiced

req

uenc

LogF

Time

Multi-space distribution HMM (MSD-HMM)


24/66

Multi-space probablity distribution HMM(MSD-HMM)

1 2 3

1,1w 1,2w 1,3w1

1n

R

)(11 xN )(21 xN )(31 xN

2,1w 2,2w 2,3w2

2n

R)(12 xN )(22 xN )(32 xN

Mw ,1 Mw ,2 Mw ,3

M

M

)(1 xMN )(2 xMN )(3 xMN


25/66


26/66

Structure of state-output distributions

t

Mel-cepstrumtc

Voiced

pe

ctru

tc

Unvoiced

Voiced

tc2

Unvoiced

Voiced

itation

tptp

Unvoiced

Ex tp2


27/66

Contextual factors

Phoneme {preceding, succeeding} two phonemes

current phoneme

# of phonemes in {preceding, current, succeeding} syllable

{accent, stress} of {preceding, current, succeeding} syllable Position of current syllable in current word

o prece ng, succee ng accen e , s resse sy a e n curren p rase

# of syllables {from previous, to next} {accented, stressed} syllable

Vowel within current syllable

Word Part of speech of {preceding, current, succeeding} word

# of syllables in {preceding, current, succeeding} word

Position of current word in current phrase

, # of words {from previous, to next} content word

Phrase # of syllables in {preceding, current, succeeding} phrase

Huge # of combinations Difficult to have all possible models

..


28/66

Decision tree-based state clustering [Odell; 95]

k-a+b yes noR=silence?

L=voice?

L=w?t-a+h

yes

eses

yes no

no

no

no

L=gy?R=silence?s-i+n

C=unvoice?

C=vowel?

R=silence?

R=cl?

L=gy?

L=voice?


29/66

Stream-dependent tree-based clustering (1)

pectrum excitation have di erent contextdependency Build decision trees separately

for

mel-cepstrum

Decision trees

for F0


30/66

State duration modeling

HMM (Hidden Markov Model) State duration prob. depends only on transition prob.

State duration probability exponentially decreases

HSMM (Hidden Semi Markov Model)

HMM + explicit duration model HSMM

3-demensional Gaussian

1q 2q 3q

K

1O 2O 3O 4O 5O 6O 7O 9O8O

1d 2d 3d iii

1,


31/66

Stream-dependent tree-based clustering (2)

State durationThree dimensional Gaussian

Decision treeec s on rees

for

mel-cepstrum

for state dur.

models

Decision trees

for F0


32/66


SPEECHDATABASE

Excitation Spectral


Parameter

extraction

Parameter

ExtractionSpectral

parameters

Excitation

parameters

Text analysis

Training HMMsLabels


)

,

|(maxarg loo P


from HMMs


o

Excitation

generation

Synthesis

Filter

SYNTHESIZED

SPEECH

ExcitationSynthesis part

parame ers parame ers

C i i f HMM f i


33/66

Composition of sentence HMM for given text

TEXTG2P

w

Text analysisText normalization

Pause prediction

context-dependent label

sequence l

sentence HMM given labels

,

S h t ti l ith


35/66

Determination of state sequence

K

ii dpP ),|( lq

i 1

ip

id

: state-duration distribution of -th state

: state duration of -th state

i

i

K : # of states in a sentence HMM for l

Gaussian

iiiiiii mdmdNdp ,|

S h t ti l ith


36/66

Speech parameter generation algorithm

For given HMM ,determine a speech parameter vectorSequence which maximizes

TTTT ],,,[ 21 Toooo

),|(),|(),|( lqqolo PPP

)

,

|()

,|(max lqqoq

q

PP

),|(maxarg lqqq

P

),(maxarg qoo o P

With t d i f t


37/66

Without dynamic feature

becomes a sequence of mean vectors

D i f t


38/66

Dynamic features

ct

2

.

ttt

t

c112

ttttt

cccc

tc1tc 1tc tc

1tc 1tc

5.0 5.0 1 12

tc tc1tc 1tc 1t 1t

I t ti f d i f t


39/66

Integration of dynamic features

Relationship between speech parameter vectors &static feature vectors

o c

T

TTT

tttt ccco 2,, M M

M3

M

S l ti f th bl (1/2)


40/66

Solution for the problem (1/2)

By setting O

,,

O

c

,

1

1

qqq

WWcW

TT

TTTT ],,,[ 21 Tcccc w ere

TTTT ],,,[ 21 Tqqqq

TTTT ,,, 21 Tqqqq

S l ti f th bl (2/2)


41/66

Solution for the problem (2/2)

1qTW W c

TW 1

q

q

Generated speech parameter trajector


42/66

Generated speech parameter trajectory

Trajectory HMM


43/66

Trajectory HMM

is not a proper distribution of c),|(),|( lWclo PP

Conventional HMM Trajectory HMM

Training ),|(maxarg LO

P ),|(maxarg LC

P

Synthesis

maxar

,maxarg

lc

o Wcoo

P

,

maxar lcP

Solve inconsistenc between trainin & s nthesis

c

improving the model accuracy

Generated spectra


44/66

Generated spectra

a i silsil

0123

45

(kHz)

w/o dyn

01

(345

Hz)

w dyn

Spectra changing smoothly between phonemes

Generated F0


45/66

Generated F0

uency

100150200

natural speech

Fre 50

0 1 time (s)2

equency

100150200

without dynamic features

Fr 50

0 1 2 time (s)

requency

100150200

F

0 1 2 time (s)

Effect of dynamic features


46/66

Effect of dynamic features

Mel-cepstrum

static+ static2

static+

Smooth!2log F0static



47/66




Emotional speech synthesis


48/66

Emotional speech synthesis

text neutral angry

class! Turn off it!

You must attend the weekly meeting!

trained with 200 utterances

Speaker adaptation (mimicking voices)


49/66

Speaker adaptation (mimicking voices)

MLLR-based adaptation

Adaptation

data

speaker A

Speaker-independent Adapted model

adaptation

w/o adaptation (initial model)

Adapted with 50 utterances

pea er s spea er- epen en sys em

Speaker interpolation (mixing voices)


50/66

Speaker interpolation (mixing voices)

Linear combination of two speaker-dependent models

Model A Model BInterpolated model

1.00 0.75 0.50 0.25 0.00A:

0.00 0.25 0.50 0.75 1.00B:

Voice morphing


51/66

Voice morphing

Two voices:

Four voices:

Interpolation of speaking styles


52/66

Interpolation of speaking styles

Base model A Base model B

Interpolation extrapolation

Neutral High Tension

Eigenvoice (producing voices) [Shichiri; 02]


53/66

Eigenvoice (producing voices) [Shichiri; 02]

Speaker 1 Speaker 2 Speaker S

)1(e ( )ke ( )Ke

System OverviewClick here for a demo

Multilingual speech synthesis


54/66

Multilingual speech synthesis

apanese

American English

Chinese (Mandarin) (by ATR)

Brazilian Portuguese (by Nitech, and )

European Portuguese (by Nitech, Univ of Porto, and UFRJ)

(by Bostjan Vesnicer, University of Ljubljana, Slovenia)

Swedish (by Anders Lundgren, KTH, Sweden) ,

Korean (by Sang-Jin Kim, ETRI, Korea)

Finish (by TKK, Finland)

Baby English (by Univ of Edinburgh, UK)

Polish, Slovak, Arabic, Farsi, Croatian, Polyglot, etc.

Singing voice synthesis [Oura; 10] (1/2)


55/66

Singing voice synthesis [Oura; 10] (1/2)

Trained HMMs Male:

Female:

Musical score Singing voicedatabase

Synthesized

singing voice

Male:

Female:

ap a on:



56/66




Inclusion of all components


57/66

Inclusion of all components

Problem of statistical parametric speech synthesis

WXw ,,|P

lcP wlP c)(P

Draw from

Waveform generation Parameter generation Text processing

Ll,

LC ,|P

os er or o mo e parame er

Cc ddddP ,|WLP XC|P

Text processing PriorSpeech analysis

Relaxing approximations


58/66

Relaxing approximations

Marginalizing model parametersVariational Bayesian acoustic modeling for speech synthesis

an a u;

Marginalizing labelso n ron -en ac -en mo e ra n ng ura;

Inclusion of waveform generation partave orm- eve s a s ca mo e a a;

Summary


59/66

Summary

Statistical approach to speech synthesis

Whole speech synthesis process is described in astatistical framework

It gives a unified view and reveals what is correct

and what is wron

Importance of the database

Still we have many problems should be solved:

Combination with text processing part, etc.

Final message


60/66

Final message

s speec syn es s a messy pro em

Let us join speech synthesis research!

Thanks!

References (1/4)


61/66

References (1/4)

ag sa a; - nu- speec syn es s sys em, , .

Black;'96 - "Automatically clustering similar units...," Euro speech, '97.Beutnagel;'99 - "The AT&T Next-Gen TTS system," Joint ASA, EAA, & DAEA meeting, '99.

Yoshimura;'99 - "Simultaneous modeling of spectrum ...," Eurospeech, '99.

' - " " - ' ..., . , , .

Imai;'88 - "Unbiased estimator of log spectrum and its application to speech signal...," EURASIP, '88.

Kobayashi;'84 - "Spectral analysis using generalized cepstrum," IEEE Trans. ASSP, 32, '84.Tokuda;'94 - "Mel-generalized cepstral analysis -- A unified approach to speech spectral...," ICSLP, '94.

Imai;'83 - "Ce stral anal sis s nthesis on the mel fre uenc scale," ICASSP, '83.

Fukada;'92 - "An adaptive algorithm for mel-cepstral analysis of speech," ICASSP, '92.

Itakura;'75 - "Line spectrum representation of linear predictive coefficients of speech...," J. ASA (57), '75.

Tokuda;'02 - "Multi-space probability distribution HMM," IEICE Trans. E85-D(3), '02.

Odell;'95 - "The use of context in large vocaburary...," PhD thesis, University of Cambridge, '95.Shinoda;'00 - "MDL-based context-dependent subword modeling...," Journal of ASJ(E) 21(2), '00.

Yoshimura;'98 - "Duration modeling for HMM-based speech synthesis," ICSLP, '98.

Tokuda;'00 - "Speech parameter generation algorithms for HMM-based speech synthesis," ICASSP, '00.

Kobayashi;'85 - "Mel generalized-log spectrum approximation...," IEICE Trans. J68-A (6), '85.

' " " '- ..., , .Donovan;'95 - "Improvements in an HMM-based speech synthesiser," Eurospeech, '95.

Kawai;'04 - "XIMERA: A new TTS from ATR based on corpus-based technologies," ISCA SSW5, '04.

Hirai;'04 - "Using 5 ms segments in concatenative speech synthesis," Proc. ISCA SSW5, '04.

References (2/4)


62/66

References (2/4)

ou a; - n se ec on or speec syn es s ase on a new acous c arge cos , n erspeec , .

Huang;'96 - "Whistler: A trainable text-to-speech system," ICSLP, '96.Mizutani;'02 - "Concatenative speech synthesis based on HMM," ASJ autumn meeting, '02.

Ling;'07 - "The USTC and iFlytek speech synthesis systems...," Blizzard Challenge workshop, 07.

' - " - " ..., , .

Plumpe;'98 - "HMM-based smoothing for concatenative speech synthesis," ICSLP, '98.

Wouters;'00 - "Unit fusion for concatenative speech synthesis," ICSLP, '00.Okubo;'06 - "Hybrid voice conversion of unit selection and generation...," IEICE Trans. E89-D(11), '06.

A lett;'08 - "Combinin statistical arametric s eech s nthesis and unit selection..." Lan Tech, '08.

Pollet;'08 - "Synthesis by generation and concatenation of multiform segments," Interspeech, '08.

Yamagishi;'06 - "Average-voice-based speech synthesis," PhD thesis, Tokyo Inst. of Tech., '06.

Yoshimura;'97 - "Speaker interpolation in HMM-based speech synthesis system," Eurospeech, '97.

Tachibana;'05 - "Speech synthesis with various emotional expressions...," IEICE Trans. E88-D(11), '05.Kuhn;'00 - "Rapid speaker adaptation in eigenvoice space," IEEE Trans. SAP 8(6), '00.

Shichiri;'02 - "Eigenvoices for HMM-based speech synthesis," ICSLP, '02.

Fujinaga;'01 - "Multiple-regression hidden Markov model," ICASSP, '01.

Nose;'07 - "A style control technique for HMM-based expressive speech...," IEICE Trans. E90-D(9), '07.

' " " '- - , , .Kawahara;'97 - "Restructuring speech representations using a ...", Speech Communication, 27(3), '97.

Zen;'07 - "Details of the Nitech HMM-based speech synthesis system...", IEICE Trans. E90-D(1), '07.

Abdl-Hamid;'06 - "Improving Arabic HMM-based speech synthesis quality," Interspeech, '06.

References (3/4)


63/66

References (3/4)

emp nne; - n egra on o e armon c p us no se mo e n o e..., as er es s, , .

Banos;'08 - "Flexible harmonic/stochastic modeling...," V. Jornadas en Tecnologias del Habla, '08.Cabral;'07 - "Towards an improved modeling of the glottal source in...," ISCA SSW6, '07.

Maia;'07 - "An excitation model for HMM-based speech synthesis based on ...," ISCA SSW6, '07.

' - " - - - " ' , , .

Drugman;'09 - "Using a pitch-synchronous residual codebook for hybrid HMM/frame...", ICASSP, '09.

Dines;'01 - "Trainable speech synthesis with trended hidden Markov models," ICASSP, '01.Sun;'09 - "Polynomial segment model based statistical parametric speech synthesis...," ICASSP, '09.

Bul ko;'02 - "Robust s licin costs and efficient search with BMM models for...," ICASSP, '02.

Shannon;'09 - "Autoregressive HMMs for speech synthesis," Interspeech, '09.

Zen;'06 - "Reformulating the HMM as a trajectory model...", Computer Speech & Language, 21(1), '06.

Wu;'06 - "Minimum generation error training for HMM-based speech synthesis," ICASSP, '06.

Hashimoto;'09 - "A Bayesian approach to HMM-based speech synthesis," ICASSP, '09.Wu;'08 - "Minimum generation error training with log spectral distortion for...," Interspeech, '08.

Toda;'08 - "Statistical approach to vocal tract transfer function estimation based on...," ICASSP, '08.

Oura;'08 - "Simultaneous acoustic, prosodic, and phrasing model training for TTS...," ISCSLP, '08.

Ferguson;'80 - "Variable duration models...," Symposium on the application of HMM to text speech, '80.

' " " '- ..., , , .Beal;'03 - "Variational algorithms for approximate Bayesian inference," PhD thesis, Univ. of London, '03.

Masuko;'03 - "A study on conditional parameter generation from HMM...," Autumn meeting of ASJ, '03.

Yu;'07 - "A novel HMM-based TTS system using both continuous HMMs and discrete...," ICASSP, '07.

References (4/4)


64/66

References (4/4)

an; - enera ng na ura ra ec ory w a ve rees, n erspeec , .

Latorre;'08 - "Multilevel parametric-base F0 model for speech synthesis," Interspeech, '08.Tiomkin;'08 - "Statistical text-to-speech synthesis with improved dynamics," Interspeech, '08.

Toda;'07 - "A speech parameter generation algorithm considering global...," IEICE Trans. E90-D(5), '07.

' - " " ' ..., , .

Toda;'09 - "Trajectory training considering global variance for HMM-based speech...," ICASSP, '09.

Saino;'06 - "An HMM-based singing voice synthesis system," Interspeech, '06.

Tsuzuki;'04 - "Constructing emotional speech synthesizers with limited speech...," Interspeech, '04.

Sako;'00 - "HMM-based text-to-audio-visual s eech s nthesis," ICSLP, '00.

Tamura;'98 - "Text-to-audio-visual speech synthesis based on parameter generation...," ICASSP, '98.

Haoka;'02 - "HMM-based synthesis of hand-gesture animation," IEICE Technical report,102(517), '02.

Niwase;'05 - "Human walking motion synthesis with desired pace and...," IEICE Trans. E88-D(11), '05.

Hofer;'07 - "Speech driven head motion synthesis based on a trajectory model," SIGGRAPH, '07.Ma;'07 - "A MSD-HMM approach to pen trajectory modeling for online handwriting...," ICDAR, '07.

Morioka;'04 - "Miniaturization of HMM-based speech synthesis," Autumn meeting of ASJ, '04.

Kim;'06 - "HMM-Based Korean speech synthesis system for...," IEEE Trans. Consumer Elec., 52(4), '06.

Klatt;'82 - "The Klatt-Talk text-to-speech system," ICASSP, '82.

Acknowledgement


65/66

Acknowledgement

Keiichi Tokuda would like to thank HTS working

rou members, includin Hei a Zen, Keiichiro

Oura, Junichi Yamagichi, Tomoki Toda, Yoshihiko

Nankaku, Kei Hahimoto, and Sayaka Shiota fortheir help.


66/66

HTS Slidesreleased by HTS Working Group

htt ://hts.s .nitech.ac. /

Copyright (c) 1999 - 2011

Department of Computer Science

Some rights reserved.

This work is licensed under the Creative Commons Attribution 3.0 license.

See http://creativecommons.org/ for details.

speech synthesis as a statistical machine learning problem slides

Documents