speech synthesis as a statistical machine learning problem slides

Upload: aditvas

Post on 03-Jun-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    1/66

    Speech Synthesis as A Statistical

    Keiichi Tokuda

    Nagoya Institute of Technology

    ASRU2011, HawaiiDecember 14, 2011

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    2/66

    Introduction

    Rule-based, formant synthesis ~90s Hand-crafting each phonetic units by rules

    Corpus-based, concatenative synthesis (90s~)

    Concatenate speech units (waveform) from a database

    Single inventory: diphone synthesis

    Multiple inventory: unit selection synthesisorpus- ase , s a s ca parame r c syn es s

    Source-filter model + statistical acoustic model

    Hidden Markov model: HMM-based synthesis

    How we can formulate and understand the whole

    corpus-based speech synthesis process in aunified statistical framework?

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    3/66

    Problem of speech synthesis

    We have a speech database, i.e., a set of texts and

    corresponding speech waveforms.

    Given a text to be synthesized, what is the speech

    : s eech waveforms

    : textsdatabase

    Given

    W

    : text to be synthesizedw

    : speec wave orm

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    4/66

    Statistical formulation of speech synthesis (1/8)

    ayes an ramewor or pre c on

    : set of texts

    database Given

    : text to be synthesized

    1. Estimate redictive distribution iven variables

    : speech waveform unknown

    3

    2. Draw sample from the distribution

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    5/66

    Statistical formulation of speech synthesis (2/8)

    1. Estimating predictive distribution is hard Introduce acoustic model parameters

    : acoustic model (e.g. HMM )

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    6/66

    Statistical formulation of speech synthesis (3/8)

    2. Using speech waveform directly is difficult Introduce parametric its representation

    : parametric representation of speech waveform

    (e.g., cepstrum, LPC, LSP, F0, aperiodicity)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    7/66

    Statistical formulation of speech synthesis (4/8)

    3. Same texts can have multiple pronunciations, POS, etc. Introduce labels

    : labels derived from text

    e.g. prons, , ex ca s ress, grammar, pause

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    8/66

    Statistical formulation of speech synthesis (5/8)

    4. Difficult to perform integral & sum over auxiliary variables Approximated by joint max

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    9/66

    Statistical formulation of speech synthesis (6/8)

    5. Joint maximization is hard Approximated by step-by-step maximizations

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    10/66

    Statistical formulation of speech synthesis (7/8)

    6. Training also requires parametric form of wav & labels Introduce them & approx by step-by-step maximizations

    : parametric representation of speech waveforms

    : a e s er ve rom ex s

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    11/66

    Statistical formulation of speech synthesis (8/8)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    12/66

    Overview of this talk

    1. Mathematical formulation2. Implementation of individual components

    3. Examples demonstrating its flexibility

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    13/66

    HMM-based speech synthesis system

    SPEECH

    DATABASEExcitation Spectral

    Speech signal Training part

    Parameter

    extraction

    Parameter

    ExtractionSpectralparameters

    Excitation

    parameters

    Text analysis

    Training HMMsLabels

    TEXTContext-dependent HMMs

    Text analysisParameter generation

    from HMMs

    LabelsExcitation Spectral

    Excitation

    generation

    Synthesis

    Filter

    SYNTHESIZED

    SPEECH

    parame ersExcitation

    parame ers

    Synthesis part

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    14/66

    HMM-based speech synthesis system

    SPEECH

    DATABASEExcitation Spectral

    Speech signal Training part

    Parameterextraction

    Parameter

    ExtractionSpectralparameters

    Excitation

    parameters

    Text analysis

    ,maxargL

    Training HMMsLabels ),|(maxarg LO

    P

    TEXTContext-dependent HMMs),|(maxarg loo

    oP

    Text analysisParameter generation

    from HMMs

    LabelsExcitation Spectral,maxar wll P

    Excitation

    generation

    Synthesis

    Filter

    SYNTHESIZED

    SPEECH

    parame ersExcitation

    parame ers

    Synthesis part

    l

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    15/66

    HMM-based speech synthesis system

    Training partSPEECHDATABASE

    Excitation Spectral

    Speech signal

    Text analysis

    Parameter

    extraction

    Parameter

    ExtractionSpectralparameters

    Excitation

    parameters

    Training HMMsLabels

    TEXTContext-dependent HMMs

    Text analysisParameter generation

    from HMMs

    LabelsExcitation Spectral

    Synthesis part

    Excitation

    generation

    Synthesis

    Filter

    SYNTHESIZED

    SPEECH

    parame ersExcitation

    parame ers

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    16/66

    Human speech production

    Modulation of carrier wave

    by speech information

    SpeechFrequency

    transfer

    characteristicsSound source

    Voiced: pulse

    Unvoiced: noise

    Magnitude

    start--end

    air flow

    frequency

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    17/66

    Source-filter model

    LinearExcitationPulse train

    Source excitation part Vocal tract resonance part

    time-invariant

    system

    zH

    )(ne

    )(*)()( nenhnx White noise

    Speech

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    18/66

    ML estimation of spectral parameter

    Mel-cepstral representation of speech spectra)

    m

    m

    zmczH 0 )(exp)(ncy

    (ra

    M

    m

    m

    zmczH 0

    ~)(exp)(

    rped

    frequ

    warped frequencymel-scale frequency

    ~

    1

    11

    1

    ~ jez

    zz

    Wa

    Frequency (rad)

    0 2/

    -est mat on o me -cepstrum

    x : speech waveform (Gaussian process)

    c

    c : mel-cepstrum

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    19/66

    Waveform reconstruction

    Original speech

    Mel-cepstrumUnvoiced / voicedF0

    u se ra n

    Synthesis filter

    H)(ne )(nxWhite noise

    These speech parameters should be modeled by HMM

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    20/66

    HMM-based speech synthesis system

    SPEECH

    DATABASEExcitation Spectral

    Speech signal Training part

    Parameter

    extraction

    Parameter

    ExtractionText analysis Spectralparameters

    Excitation

    parameters

    Training HMMsLabels

    ),

    |(maxarg

    LO P

    TEXTContext-dependent HMMs

    Text analysisParameter generation

    from HMMsLabelsExcitation Spectral

    Excitation

    generation

    Synthesis

    Filter

    SYNTHESIZED

    SPEECH

    parame ersExcitation

    parame ers

    Synthesis part

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    21/66

    Hidden Markov model (HMM)

    11a 22a 33a

    12a 23a

    ij

    )( tqb o

    : output probability

    b o b o b o

    1o 2o 3o 4o 5o To oObservation

    sequence

    1 1 1 1 2 2 3 3 qState

    sequence

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    22/66

    Structure of state output (observation) vector

    Spectral parameters

    (e.g., mel-cepstrum, LSPs)

    Spectrum part

    log F0 with V/UV

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    23/66

    Observation of F0

    1R

    voiced0

    2 Runvoiced

    req

    uenc

    LogF

    Time

    Multi-space distribution HMM (MSD-HMM)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    24/66

    Multi-space probablity distribution HMM(MSD-HMM)

    1 2 3

    1,1w 1,2w 1,3w1

    1n

    R

    )(11 xN )(21 xN )(31 xN

    2,1w 2,2w 2,3w2

    2n

    R)(12 xN )(22 xN )(32 xN

    Mw ,1 Mw ,2 Mw ,3

    M

    M

    )(1 xMN )(2 xMN )(3 xMN

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    25/66

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    26/66

    Structure of state-output distributions

    t

    Mel-cepstrumtc

    Voiced

    pe

    ctru

    tc

    Unvoiced

    Voiced

    tc2

    Unvoiced

    Voiced

    itation

    tptp

    Unvoiced

    Ex tp2

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    27/66

    Contextual factors

    Phoneme {preceding, succeeding} two phonemes

    current phoneme

    # of phonemes in {preceding, current, succeeding} syllable

    {accent, stress} of {preceding, current, succeeding} syllable Position of current syllable in current word

    o prece ng, succee ng accen e , s resse sy a e n curren p rase

    # of syllables {from previous, to next} {accented, stressed} syllable

    Vowel within current syllable

    Word Part of speech of {preceding, current, succeeding} word

    # of syllables in {preceding, current, succeeding} word

    Position of current word in current phrase

    , # of words {from previous, to next} content word

    Phrase # of syllables in {preceding, current, succeeding} phrase

    Huge # of combinations Difficult to have all possible models

    ..

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    28/66

    Decision tree-based state clustering [Odell; 95]

    k-a+b yes noR=silence?

    L=voice?

    L=w?t-a+h

    yes

    eses

    yes no

    no

    no

    no

    L=gy?R=silence?s-i+n

    C=unvoice?

    C=vowel?

    R=silence?

    R=cl?

    L=gy?

    L=voice?

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    29/66

    Stream-dependent tree-based clustering (1)

    pectrum excitation have di erent contextdependency Build decision trees separately

    for

    mel-cepstrum

    Decision trees

    for F0

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    30/66

    State duration modeling

    HMM (Hidden Markov Model) State duration prob. depends only on transition prob.

    State duration probability exponentially decreases

    HSMM (Hidden Semi Markov Model)

    HMM + explicit duration model HSMM

    3-demensional Gaussian

    1q 2q 3q

    K

    1O 2O 3O 4O 5O 6O 7O 9O8O

    1d 2d 3d iii

    1,

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    31/66

    Stream-dependent tree-based clustering (2)

    State durationThree dimensional Gaussian

    Decision treeec s on rees

    for

    mel-cepstrum

    for state dur.

    models

    Decision trees

    for F0

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    32/66

    HMM-based speech synthesis system

    SPEECHDATABASE

    Excitation Spectral

    Speech signal Training part

    Parameter

    extraction

    Parameter

    ExtractionSpectral

    parameters

    Excitation

    parameters

    Text analysis

    Training HMMsLabels

    TEXTContext-dependent HMMs

    )

    ,

    |(maxarg loo P

    Text analysisParameter generation

    from HMMs

    LabelsExcitation Spectral

    o

    Excitation

    generation

    Synthesis

    Filter

    SYNTHESIZED

    SPEECH

    ExcitationSynthesis part

    parame ers parame ers

    C i i f HMM f i

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    33/66

    Composition of sentence HMM for given text

    TEXTG2P

    w

    Text analysisText normalization

    Pause prediction

    context-dependent label

    sequence l

    sentence HMM given labels

    ,

    S h t ti l ith

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    34/66

    Speech parameter generation algorithm [Tokuda; 00]

    For given sentence HMM, determine a speech parametervector sequence which maximizes

    TTTT ],,,[ 21 Toooo

    ),|(),|(),|( lqqolo PPP

    )

    ,

    |()

    ,|(max lqqoq

    q

    PP

    ),|(maxarg lqqq

    P

    ),|(maxarg qooo

    P

    D t i ti f t t

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    35/66

    Determination of state sequence

    K

    ii dpP ),|( lq

    i 1

    ip

    id

    : state-duration distribution of -th state

    : state duration of -th state

    i

    i

    K : # of states in a sentence HMM for l

    Gaussian

    iiiiiii mdmdNdp ,|

    S h t ti l ith

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    36/66

    Speech parameter generation algorithm

    For given HMM ,determine a speech parameter vectorSequence which maximizes

    TTTT ],,,[ 21 Toooo

    ),|(),|(),|( lqqolo PPP

    )

    ,

    |()

    ,|(max lqqoq

    q

    PP

    ),|(maxarg lqqq

    P

    ),(maxarg qoo o P

    With t d i f t

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    37/66

    Without dynamic feature

    becomes a sequence of mean vectors

    D i f t

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    38/66

    Dynamic features

    ct

    2

    .

    ttt

    t

    c112

    ttttt

    cccc

    tc1tc 1tc tc

    1tc 1tc

    5.0 5.0 1 12

    tc tc1tc 1tc 1t 1t

    I t ti f d i f t

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    39/66

    Integration of dynamic features

    Relationship between speech parameter vectors &static feature vectors

    o c

    T

    TTT

    tttt ccco 2,, M M

    M3

    M

    S l ti f th bl (1/2)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    40/66

    Solution for the problem (1/2)

    By setting O

    ,,

    O

    c

    ,

    1

    1

    qqq

    WWcW

    TT

    TTTT ],,,[ 21 Tcccc w ere

    TTTT ],,,[ 21 Tqqqq

    TTTT ,,, 21 Tqqqq

    S l ti f th bl (2/2)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    41/66

    Solution for the problem (2/2)

    1qTW W c

    TW 1

    q

    q

    Generated speech parameter trajector

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    42/66

    Generated speech parameter trajectory

    Trajectory HMM

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    43/66

    Trajectory HMM

    is not a proper distribution of c),|(),|( lWclo PP

    Conventional HMM Trajectory HMM

    Training ),|(maxarg LO

    P ),|(maxarg LC

    P

    Synthesis

    maxar

    ,maxarg

    lc

    o Wcoo

    P

    ,

    maxar lcP

    Solve inconsistenc between trainin & s nthesis

    c

    improving the model accuracy

    Generated spectra

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    44/66

    Generated spectra

    a i silsil

    0123

    45

    (kHz)

    w/o dyn

    01

    (345

    Hz)

    w dyn

    Spectra changing smoothly between phonemes

    Generated F0

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    45/66

    Generated F0

    uency

    100150200

    natural speech

    Fre 50

    0 1 time (s)2

    equency

    100150200

    without dynamic features

    Fr 50

    0 1 2 time (s)

    requency

    100150200

    F

    0 1 2 time (s)

    Effect of dynamic features

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    46/66

    Effect of dynamic features

    Mel-cepstrum

    static+ static2

    static+

    Smooth!2log F0static

    Overview of this talk

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    47/66

    Overview of this talk

    1. Mathematical formulation2. Implementation of individual components

    3. Examples demonstrating its flexibility

    Emotional speech synthesis

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    48/66

    Emotional speech synthesis

    text neutral angry

    class! Turn off it!

    You must attend the weekly meeting!

    trained with 200 utterances

    Speaker adaptation (mimicking voices)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    49/66

    Speaker adaptation (mimicking voices)

    MLLR-based adaptation

    Adaptation

    data

    speaker A

    Speaker-independent Adapted model

    adaptation

    w/o adaptation (initial model)

    Adapted with 50 utterances

    pea er s spea er- epen en sys em

    Speaker interpolation (mixing voices)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    50/66

    Speaker interpolation (mixing voices)

    Linear combination of two speaker-dependent models

    Model A Model BInterpolated model

    1.00 0.75 0.50 0.25 0.00A:

    0.00 0.25 0.50 0.75 1.00B:

    Voice morphing

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    51/66

    Voice morphing

    Two voices:

    Four voices:

    Interpolation of speaking styles

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    52/66

    Interpolation of speaking styles

    Base model A Base model B

    Interpolation extrapolation

    Neutral High Tension

    Eigenvoice (producing voices) [Shichiri; 02]

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    53/66

    Eigenvoice (producing voices) [Shichiri; 02]

    Speaker 1 Speaker 2 Speaker S

    )1(e ( )ke ( )Ke

    System OverviewClick here for a demo

    Multilingual speech synthesis

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    54/66

    Multilingual speech synthesis

    apanese

    American English

    Chinese (Mandarin) (by ATR)

    Brazilian Portuguese (by Nitech, and )

    European Portuguese (by Nitech, Univ of Porto, and UFRJ)

    (by Bostjan Vesnicer, University of Ljubljana, Slovenia)

    Swedish (by Anders Lundgren, KTH, Sweden) ,

    Korean (by Sang-Jin Kim, ETRI, Korea)

    Finish (by TKK, Finland)

    Baby English (by Univ of Edinburgh, UK)

    Polish, Slovak, Arabic, Farsi, Croatian, Polyglot, etc.

    Singing voice synthesis [Oura; 10] (1/2)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    55/66

    Singing voice synthesis [Oura; 10] (1/2)

    Trained HMMs Male:

    Female:

    Musical score Singing voicedatabase

    Synthesized

    singing voice

    Male:

    Female:

    ap a on:

    Overview of this talk

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    56/66

    Overview of this talk

    1. Mathematical formulation2. Implementation of individual components

    3. Examples demonstrating its flexibility

    Inclusion of all components

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    57/66

    Inclusion of all components

    Problem of statistical parametric speech synthesis

    WXw ,,|P

    lcP wlP c)(P

    Draw from

    Waveform generation Parameter generation Text processing

    Ll,

    LC ,|P

    os er or o mo e parame er

    Cc ddddP ,|WLP XC|P

    Text processing PriorSpeech analysis

    Relaxing approximations

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    58/66

    Relaxing approximations

    Marginalizing model parametersVariational Bayesian acoustic modeling for speech synthesis

    an a u;

    Marginalizing labelso n ron -en ac -en mo e ra n ng ura;

    Inclusion of waveform generation partave orm- eve s a s ca mo e a a;

    Summary

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    59/66

    Summary

    Statistical approach to speech synthesis

    Whole speech synthesis process is described in astatistical framework

    It gives a unified view and reveals what is correct

    and what is wron

    Importance of the database

    Still we have many problems should be solved:

    Combination with text processing part, etc.

    Final message

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    60/66

    Final message

    s speec syn es s a messy pro em

    Let us join speech synthesis research!

    Thanks!

    References (1/4)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    61/66

    References (1/4)

    ag sa a; - nu- speec syn es s sys em, , .

    Black;'96 - "Automatically clustering similar units...," Euro speech, '97.Beutnagel;'99 - "The AT&T Next-Gen TTS system," Joint ASA, EAA, & DAEA meeting, '99.

    Yoshimura;'99 - "Simultaneous modeling of spectrum ...," Eurospeech, '99.

    ' - " " - ' ..., . , , .

    Imai;'88 - "Unbiased estimator of log spectrum and its application to speech signal...," EURASIP, '88.

    Kobayashi;'84 - "Spectral analysis using generalized cepstrum," IEEE Trans. ASSP, 32, '84.Tokuda;'94 - "Mel-generalized cepstral analysis -- A unified approach to speech spectral...," ICSLP, '94.

    Imai;'83 - "Ce stral anal sis s nthesis on the mel fre uenc scale," ICASSP, '83.

    Fukada;'92 - "An adaptive algorithm for mel-cepstral analysis of speech," ICASSP, '92.

    Itakura;'75 - "Line spectrum representation of linear predictive coefficients of speech...," J. ASA (57), '75.

    Tokuda;'02 - "Multi-space probability distribution HMM," IEICE Trans. E85-D(3), '02.

    Odell;'95 - "The use of context in large vocaburary...," PhD thesis, University of Cambridge, '95.Shinoda;'00 - "MDL-based context-dependent subword modeling...," Journal of ASJ(E) 21(2), '00.

    Yoshimura;'98 - "Duration modeling for HMM-based speech synthesis," ICSLP, '98.

    Tokuda;'00 - "Speech parameter generation algorithms for HMM-based speech synthesis," ICASSP, '00.

    Kobayashi;'85 - "Mel generalized-log spectrum approximation...," IEICE Trans. J68-A (6), '85.

    ' " " '- ..., , .Donovan;'95 - "Improvements in an HMM-based speech synthesiser," Eurospeech, '95.

    Kawai;'04 - "XIMERA: A new TTS from ATR based on corpus-based technologies," ISCA SSW5, '04.

    Hirai;'04 - "Using 5 ms segments in concatenative speech synthesis," Proc. ISCA SSW5, '04.

    References (2/4)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    62/66

    References (2/4)

    ou a; - n se ec on or speec syn es s ase on a new acous c arge cos , n erspeec , .

    Huang;'96 - "Whistler: A trainable text-to-speech system," ICSLP, '96.Mizutani;'02 - "Concatenative speech synthesis based on HMM," ASJ autumn meeting, '02.

    Ling;'07 - "The USTC and iFlytek speech synthesis systems...," Blizzard Challenge workshop, 07.

    ' - " - " ..., , .

    Plumpe;'98 - "HMM-based smoothing for concatenative speech synthesis," ICSLP, '98.

    Wouters;'00 - "Unit fusion for concatenative speech synthesis," ICSLP, '00.Okubo;'06 - "Hybrid voice conversion of unit selection and generation...," IEICE Trans. E89-D(11), '06.

    A lett;'08 - "Combinin statistical arametric s eech s nthesis and unit selection..." Lan Tech, '08.

    Pollet;'08 - "Synthesis by generation and concatenation of multiform segments," Interspeech, '08.

    Yamagishi;'06 - "Average-voice-based speech synthesis," PhD thesis, Tokyo Inst. of Tech., '06.

    Yoshimura;'97 - "Speaker interpolation in HMM-based speech synthesis system," Eurospeech, '97.

    Tachibana;'05 - "Speech synthesis with various emotional expressions...," IEICE Trans. E88-D(11), '05.Kuhn;'00 - "Rapid speaker adaptation in eigenvoice space," IEEE Trans. SAP 8(6), '00.

    Shichiri;'02 - "Eigenvoices for HMM-based speech synthesis," ICSLP, '02.

    Fujinaga;'01 - "Multiple-regression hidden Markov model," ICASSP, '01.

    Nose;'07 - "A style control technique for HMM-based expressive speech...," IEICE Trans. E90-D(9), '07.

    ' " " '- - , , .Kawahara;'97 - "Restructuring speech representations using a ...", Speech Communication, 27(3), '97.

    Zen;'07 - "Details of the Nitech HMM-based speech synthesis system...", IEICE Trans. E90-D(1), '07.

    Abdl-Hamid;'06 - "Improving Arabic HMM-based speech synthesis quality," Interspeech, '06.

    References (3/4)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    63/66

    References (3/4)

    emp nne; - n egra on o e armon c p us no se mo e n o e..., as er es s, , .

    Banos;'08 - "Flexible harmonic/stochastic modeling...," V. Jornadas en Tecnologias del Habla, '08.Cabral;'07 - "Towards an improved modeling of the glottal source in...," ISCA SSW6, '07.

    Maia;'07 - "An excitation model for HMM-based speech synthesis based on ...," ISCA SSW6, '07.

    ' - " - - - " ' , , .

    Drugman;'09 - "Using a pitch-synchronous residual codebook for hybrid HMM/frame...", ICASSP, '09.

    Dines;'01 - "Trainable speech synthesis with trended hidden Markov models," ICASSP, '01.Sun;'09 - "Polynomial segment model based statistical parametric speech synthesis...," ICASSP, '09.

    Bul ko;'02 - "Robust s licin costs and efficient search with BMM models for...," ICASSP, '02.

    Shannon;'09 - "Autoregressive HMMs for speech synthesis," Interspeech, '09.

    Zen;'06 - "Reformulating the HMM as a trajectory model...", Computer Speech & Language, 21(1), '06.

    Wu;'06 - "Minimum generation error training for HMM-based speech synthesis," ICASSP, '06.

    Hashimoto;'09 - "A Bayesian approach to HMM-based speech synthesis," ICASSP, '09.Wu;'08 - "Minimum generation error training with log spectral distortion for...," Interspeech, '08.

    Toda;'08 - "Statistical approach to vocal tract transfer function estimation based on...," ICASSP, '08.

    Oura;'08 - "Simultaneous acoustic, prosodic, and phrasing model training for TTS...," ISCSLP, '08.

    Ferguson;'80 - "Variable duration models...," Symposium on the application of HMM to text speech, '80.

    ' " " '- ..., , , .Beal;'03 - "Variational algorithms for approximate Bayesian inference," PhD thesis, Univ. of London, '03.

    Masuko;'03 - "A study on conditional parameter generation from HMM...," Autumn meeting of ASJ, '03.

    Yu;'07 - "A novel HMM-based TTS system using both continuous HMMs and discrete...," ICASSP, '07.

    References (4/4)

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    64/66

    References (4/4)

    an; - enera ng na ura ra ec ory w a ve rees, n erspeec , .

    Latorre;'08 - "Multilevel parametric-base F0 model for speech synthesis," Interspeech, '08.Tiomkin;'08 - "Statistical text-to-speech synthesis with improved dynamics," Interspeech, '08.

    Toda;'07 - "A speech parameter generation algorithm considering global...," IEICE Trans. E90-D(5), '07.

    ' - " " ' ..., , .

    Toda;'09 - "Trajectory training considering global variance for HMM-based speech...," ICASSP, '09.

    Saino;'06 - "An HMM-based singing voice synthesis system," Interspeech, '06.

    Tsuzuki;'04 - "Constructing emotional speech synthesizers with limited speech...," Interspeech, '04.

    Sako;'00 - "HMM-based text-to-audio-visual s eech s nthesis," ICSLP, '00.

    Tamura;'98 - "Text-to-audio-visual speech synthesis based on parameter generation...," ICASSP, '98.

    Haoka;'02 - "HMM-based synthesis of hand-gesture animation," IEICE Technical report,102(517), '02.

    Niwase;'05 - "Human walking motion synthesis with desired pace and...," IEICE Trans. E88-D(11), '05.

    Hofer;'07 - "Speech driven head motion synthesis based on a trajectory model," SIGGRAPH, '07.Ma;'07 - "A MSD-HMM approach to pen trajectory modeling for online handwriting...," ICDAR, '07.

    Morioka;'04 - "Miniaturization of HMM-based speech synthesis," Autumn meeting of ASJ, '04.

    Kim;'06 - "HMM-Based Korean speech synthesis system for...," IEEE Trans. Consumer Elec., 52(4), '06.

    Klatt;'82 - "The Klatt-Talk text-to-speech system," ICASSP, '82.

    Acknowledgement

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    65/66

    Acknowledgement

    Keiichi Tokuda would like to thank HTS working

    rou members, includin Hei a Zen, Keiichiro

    Oura, Junichi Yamagichi, Tomoki Toda, Yoshihiko

    Nankaku, Kei Hahimoto, and Sayaka Shiota fortheir help.

  • 8/12/2019 Speech Synthesis as a Statistical Machine Learning Problem Slides

    66/66

    HTS Slidesreleased by HTS Working Group

    htt ://hts.s .nitech.ac. /

    Copyright (c) 1999 - 2011

    Department of Computer Science

    Some rights reserved.

    This work is licensed under the Creative Commons Attribution 3.0 license.

    See http://creativecommons.org/ for details.