
  • Automatic Speech Recognition: From Theory to Practice 1 (T-61.184)

    T-61.184 Automatic Speech Recognition: From Theory to Practice

    http://www.cis.hut.fi/Opinnot/T-61.184/
    October 4, 2004

    Prof. Bryan Pellom
    Department of Computer Science
    Center for Spoken Language Research
    University of Colorado
    [email protected]

  • Automatic Speech Recognition: From Theory to Practice 2 (T-61.184)

    Role of Hidden Markov Models in Speech Recognition

    [Block diagram: Speech → Feature Extraction → Optimization / Search → Text + Timing + Confidence, where the search component draws on the Lexicon, the Language Model, and the Acoustic Model.]

  • Automatic Speech Recognition: From Theory to Practice 3 (T-61.184)

    Motivation for Today’s Topic

    SPEECH

    S P IY CH

  • Automatic Speech Recognition: From Theory to Practice 4 (T-61.184)

    Expectations for this course

    Today: present the basic idea of Hidden Markov Models (HMMs)

    Next week we'll discuss how HMMs are used in practice to model the acoustics of speech

    Don't get lost in the math today: I don't expect you to absorb the theory from these slides alone. If you want to learn about HMMs, it's important to take a look at the resources listed on the next slide.

  • Automatic Speech Recognition: From Theory to Practice 5 (T-61.184)

    Resources for Today's Topic

    L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, Vol. 77, No. 2, pp. 257-286, February 1989.

    L. R. Rabiner and B. W. Juang, Fundamentals of Speech Recognition, Prentice-Hall, ISBN 0-13-015157-2, 1993 (see Chapter 6).

    The Cambridge HTK Toolkit has a user manual, the "HTK Book". It is an excellent resource and covers many of the equations related to HMM modeling.

  • Automatic Speech Recognition: From Theory to Practice 6 (T-61.184)

    Discrete-Time Markov Process

    Characterized by:

    A set of N states: $S = \{S_0, S_1, \ldots, S_N\}$

    Transition probabilities: $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$

    First-order Markov chain: the current system state depends only on the previous state. Let $q_t$ be the system state at time t; then the transition probability depends only on the previous state:

    $P(q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots) = P(q_t = S_j \mid q_{t-1} = S_i)$

    [State diagram: two states $S_0$ and $S_1$ with transition probabilities $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$.]

  • Automatic Speech Recognition: From Theory to Practice 7 (T-61.184)

    Discrete-Time Markov Process

    Properties:

    $a_{ij} \ge 0 \quad \forall\, i, j$

    $\sum_{j=0}^{N} a_{ij} = 1 \quad \forall\, i$

    "Observable Markov Model": the output of the process is the sequence of states, each of which corresponds to an observable event.

    $A = \begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix}$

    [State diagram: two states $S_0$ and $S_1$ with transition probabilities $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$.]

  • Automatic Speech Recognition: From Theory to Practice 8 (T-61.184)

    Example 1: Single Fair Coin (Observable Markov Process)

    Outcomes: Head (State 0), Tail (State 1)

    Observed outcomes uniquely define the state sequence,
    e.g., H H H T T T H H T T  →  S = 0 0 0 1 1 1 0 0 1 1

    Transition probabilities:

    $A = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}$

    [State diagram: HEAD ($S_0$) and TAILS ($S_1$) with transitions $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$.]

  • Automatic Speech Recognition: From Theory to Practice 9 (T-61.184)

    Example 2: Observable Markov Model of Weather

    States: $S_0$ = rainy, $S_1$ = cloudy, $S_2$ = sunny

    With state transition probabilities,

    $A = \{a_{ij}\} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$

    [State diagram: three fully connected states $S_0$, $S_1$, $S_2$ with transitions $a_{ij}$.]

  • Automatic Speech Recognition: From Theory to Practice 10 (T-61.184)

    Observable Markov Model of Weather

    What is the probability that the weather for 8 consecutive days is "sun, sun, sun, rain, rain, sun, cloudy, sun"?

    Solution:
    Observation sequence: O = {sun, sun, sun, rain, rain, sun, cloudy, sun}
    Corresponding state sequence: S = {2, 2, 2, 0, 0, 2, 1, 2}
    Want to determine P(O | model):
    P(O | model) = P(S = {2, 2, 2, 0, 0, 2, 1, 2} | model)

  • Automatic Speech Recognition: From Theory to Practice 11 (T-61.184)

    Observable Markov Model of Weather

    $P(O \mid \text{model}) = P(S = \{2,2,2,0,0,2,1,2\} \mid \text{model})$

    $= P(q_1 = 2)\, P(q_2 = 2 \mid q_1 = 2) \cdots P(q_8 = 2 \mid q_7 = 1)$

    $= \pi_2 \cdot a_{22}\, a_{22}\, a_{20}\, a_{00}\, a_{02}\, a_{21}\, a_{12}$

    $= \pi_2 \cdot (0.8)^2 (0.1)(0.4)(0.3)(0.1)(0.2)$

    where $\pi_i = P(q_1 = i)$ is the initial state probability, and

    $A = \{a_{ij}\} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$
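
    A minimal NumPy sketch of this calculation. The helper name markov_sequence_prob is mine, and the initial distribution (probability 1 on the observed first state, "sunny") is an assumption taken from Rabiner's tutorial rather than stated on the slide:

```python
import numpy as np

# Transition matrix from the weather example: states 0 = rainy, 1 = cloudy, 2 = sunny.
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def markov_sequence_prob(states, A, pi):
    """P(S | model) for an observable Markov chain: pi[s_1] * prod_t a[s_{t-1}, s_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

pi = np.array([0.0, 0.0, 1.0])            # assumed: start in the observed state "sunny"
S = [2, 2, 2, 0, 0, 2, 1, 2]              # sun, sun, sun, rain, rain, sun, cloudy, sun
print(markov_sequence_prob(S, A, pi))     # 1.536e-04
```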

  • Automatic Speech Recognition: From Theory to Practice 12 (T-61.184)

    "Hidden" Markov Models

    Observations are a probabilistic function of the state

    Underlying state sequence is not observable (hidden)

    First-order assumption

    Output independence: observations depend only on the state that generated them, not on each other.

  • Automatic Speech Recognition: From Theory to Practice 13 (T-61.184)

    Example: 2 Coins, Hidden Markov Process

    Observations: Head, Tail

    Observed outcomes do not uniquely define the state sequence

    Emission probabilities: state $S_0$ has $P(H) = P_1$, $P(T) = 1 - P_1$; state $S_1$ has $P(H) = P_2$, $P(T) = 1 - P_2$

    Transition probabilities:

    $A = \begin{bmatrix} a_{00} & 1 - a_{00} \\ 1 - a_{11} & a_{11} \end{bmatrix}$

    [State diagram: two states $S_0$ and $S_1$, one per coin, with transitions $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$.]

  • Automatic Speech Recognition: From Theory to Practice 14 (T-61.184)

    Urn-and-Ball Illustration

    A Genie plays a game with a blind observer:
    1. Genie initially selects an urn at random
    2. Genie picks a colored ball and tells the blind observer the color
    3. Genie puts the ball back in the urn
    4. Genie moves to the next urn based on a random selection procedure from the current urn
    5. Steps 2-4 are repeated

  • Automatic Speech Recognition: From Theory to Practice 15 (T-61.184)

    Urn-and-Ball Illustration

    Observations: the color sequence of the balls

    States: the identity of the urn

    State transitions: the process of selecting the urns

  • Automatic Speech Recognition: From Theory to Practice 16 (T-61.184)

    Urn-and-Ball Illustration

    Observation sequence: R B Y Y G B Y G R …
    Individual colors (observations) don't reveal the urn (state)

    Per-urn color probabilities:

    Urn 1: P(R) = 0.20, P(G) = 0.30, P(B) = 0.10, P(Y) = 0.60
    Urn 2: P(R) = 0.45, P(G) = 0.15, P(B) = 0.20, P(Y) = 0.20
    Urn 3: P(R) = 0.15, P(G) = 0.70, P(B) = 0.10, P(Y) = 0.05

  • Automatic Speech Recognition: From Theory to Practice 17 (T-61.184)

    Discrete Symbol Observation HMM

    A set of N states: $S = \{S_0, S_1, \ldots, S_N\}$

    Transition probabilities: $a_{ij} = P(q_t = S_j \mid q_{t-1} = S_i)$

    A set of M observation symbols: $V = \{v_1, v_2, \ldots, v_M\}$

    Probability distribution (for state j, symbol k): $b_j(k) = P(o_t = v_k \mid q_t = j)$

    [State diagram: states $S_i$ and $S_j$ with transitions $a_{ii}$, $a_{ij}$, $a_{ji}$, $a_{jj}$; each state carries an emission distribution $[P(v_1 \mid S_i), P(v_2 \mid S_i), \ldots, P(v_M \mid S_i)]$.]

  • Automatic Speech Recognition: From Theory to Practice 18 (T-61.184)

    Discrete Symbol Observation HMM

    Also characterized by an initial state distribution: $\pi = \{\pi_i\}, \quad \pi_i = P(q_1 = i)$

    Specification thus requires:
    2 model parameters, N and M
    Specification of the M symbols
    3 probability measures: A, B, π

    Compactly, $\lambda = (A, B, \pi)$

    [State diagram: as on the previous slide, states $S_i$ and $S_j$ with transition probabilities and per-state emission distributions.]

  • Automatic Speech Recognition: From Theory to Practice 19 (T-61.184)

    Left-to-Right HMMs

    $A = \{a_{ij}\} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$

    In a left-to-right topology, transitions only move forward through the states, so $a_{ij} = 0$ for $j < i$.

  • Automatic Speech Recognition: From Theory to Practice 20 (T-61.184)

    Ergodic HMMs

    $a_{ij} > 0 \quad \forall\, i, \forall\, j$

    (Every state can be reached from every other state in a single step.)

    [State diagram: three fully connected states $S_1$, $S_2$, $S_3$ with all transitions $a_{ij}$.]

  • Automatic Speech Recognition: From Theory to Practice 21 (T-61.184)

    HMM as a Symbol Generator

    Given parameters N, M, A, B, and π, an HMM can generate an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$:

    1. Choose an initial state $q_1 = S_i$ from the initial state distribution π
    2. Set $t = 1$
    3. Choose $o_t = v_k$ according to the symbol distribution of the current state, $b_i(k)$
    4. Transition to the next state $q_{t+1} = S_j$ according to the state-transition probability distribution $a_{ij}$
    5. Set $t = t + 1$ and go to step 3

    A code sketch of this generation procedure is given below.
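
    A minimal NumPy sketch of the generation procedure above; the function name sample_hmm and the array-shape conventions are assumptions, not something given on the slides:

```python
import numpy as np

def sample_hmm(A, B, pi, T, rng=None):
    """Generate an observation sequence O = o_1 .. o_T from a discrete-symbol HMM.

    A  : (N, N) state-transition matrix, A[i, j] = a_ij
    B  : (N, M) emission matrix, B[j, k] = b_j(k)
    pi : (N,)   initial state distribution
    """
    rng = np.random.default_rng() if rng is None else rng
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                   # step 1: draw q_1 from pi
    for _ in range(T):                              # steps 2-5
        obs.append(rng.choice(B.shape[1], p=B[q]))  # step 3: emit o_t ~ b_q(.)
        states.append(q)
        q = rng.choice(len(A), p=A[q])              # step 4: q_{t+1} ~ a_{q, .}
    return np.array(states), np.array(obs)
```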

  • Automatic Speech Recognition: From Theory to Practice 22 (T-61.184)

    HMM as a Symbol Generator

    Example output from symbol generation:

    Time, t:      1    2    3    4    5    6    …   T
    State:        q1   q2   q3   q4   q5   q6   …   qT
    Observation:  o1   o2   o3   o4   o5   o6   …   oT

    We can think of the HMM as generating the observation sequence as it transitions from state to state.

  • Automatic Speech Recognition: From Theory to Practice 23 (T-61.184)

    Three Interesting Problems

    Problem 1: Scoring & Evaluation
    How to efficiently compute the probability of an observation sequence (O) given a model (λ)? i.e., P(O | λ)

    Problem 2: Decoding
    Given an observation sequence (O) and a model (λ), how do we determine the corresponding state sequence (q) that "best explains" how the observations were generated?

    Problem 3: Training
    How to adjust the model parameters (λ = {A, B, π}) to maximize the probability of generating a given observation sequence? i.e., maximize P(O | λ).

  • Automatic Speech Recognition: From Theory to Practice 24 (T-61.184)

    Problem 1: Scoring and Evaluation

    Given an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$,

    we want to compute the probability of generating it: $P(O \mid \lambda)$.

    Let's assume a particular sequence of states, $q = \{q_1, q_2, \ldots, q_T\}$.

  • Automatic Speech Recognition: From Theory to Practice 25 (T-61.184)

    Problem 1: Scoring and Evaluation

    Using the chain rule we can decompose the problem by summing over all possible state sequences,

    $P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, \lambda)\, P(q \mid \lambda)$

    The first term is the likelihood of generating the observed symbol sequence given the assumed state sequence.

    The second term is the probability that the system steps through that sequence of states.

  • Automatic Speech Recognition: From Theory to Practice 26 (T-61.184)

    Problem 1: Scoring and Evaluation

    Probability of the observation sequence given the state sequence:

    $P(O \mid q, \lambda) = \prod_{t=1}^{T} p(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)$

    Probability of the state sequence:

    $P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$

  • Automatic Speech Recognition: From Theory to Practice 27 (T-61.184)

    Problem 1: Scoring and Evaluation

    Using the chain rule,

    $P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, \lambda)\, P(q \mid \lambda) = \sum_{\text{all } q} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)$

    This is not practical to compute directly: with N states and T observations there are $N^T$ possible state sequences, requiring on the order of $2T \cdot N^T$ operations.

  • Automatic Speech Recognition: From Theory to Practice 28 (T-61.184)

    Forward Algorithm

    Definition:

    $\alpha_t(i) = P(o_1 o_2 \cdots o_t, q_t = i \mid \lambda)$

    (Probability of seeing observations $o_1$ to $o_t$ and ending at state i, given HMM λ)

    1. Initialization: $\alpha_0(i) = \pi_i$

    2. Induction: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$

    3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

  • Automatic Speech Recognition: From Theory to Practice 29 (T-61.184)

    Forward Algorithm Illustration

    [Trellis diagram: each node $\alpha_{t+1}(j)$ at time t+1 is computed from all N nodes $\alpha_t(1), \ldots, \alpha_t(N)$ at time t via $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$.]

    Complexity: $O(T \cdot N^2)$

  • Automatic Speech Recognition: From Theory to Practice 30 (T-61.184)

    Forward Algorithm Example

    Two-state HMM with discrete observations "A" and "B":
    transitions $a_{00} = 0.6$, $a_{01} = 0.4$, $a_{10} = 0.0$, $a_{11} = 1.0$;
    emissions $b_0(A) = 0.8$, $b_0(B) = 0.2$, $b_1(A) = 0.3$, $b_1(B) = 0.7$;
    initial distribution $\pi = (1.0,\ 0.0)$.

    Given this HMM, what is the probability of generating the sequence O = {A, A, B}?

    In other words, find P(O = {A, A, B} | λ).

  • Automatic Speech Recognition: From Theory to Practice 31 (T-61.184)

    Forward Algorithm Example

    Forward trellis for O = {A, A, B} (same HMM as on the previous slide):

            t=0    t=1     t=2     t=3
    S0      1.0    0.48    0.23    0.03
    S1      0.0    0.12    0.09    0.13

    (e.g., $\alpha_1(S_0) = 1.0 \cdot 0.6 \cdot 0.8 = 0.48$ and $\alpha_1(S_1) = 1.0 \cdot 0.4 \cdot 0.3 = 0.12$)

    Answer: P(O | λ) = 0.03 + 0.13 = 0.16
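
    A minimal NumPy sketch of the forward algorithm, using the slides' convention $\alpha_0(i) = \pi_i$ (no emission at t = 0). The function name and the reading $a_{10} = 0$ (no arc from S1 back to S0 is drawn) are my assumptions:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm with the slides' convention alpha_0(i) = pi_i."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                    # initialization
    for t in range(T):                               # induction
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                    # termination: sum_i alpha_T(i)

# HMM from the example (a_10 = 0.0 is assumed since no S1 -> S0 arc is drawn).
A  = np.array([[0.6, 0.4],
               [0.0, 1.0]])
B  = np.array([[0.8, 0.2],      # state S0: P(A), P(B)
               [0.3, 0.7]])     # state S1: P(A), P(B)
pi = np.array([1.0, 0.0])
obs = [0, 0, 1]                 # the sequence A, A, B

p, alpha = forward(A, B, pi, obs)
print(np.round(alpha, 2))       # reproduces the trellis values above (0.48, 0.12, ...)
print(round(p, 3))              # ~0.158, i.e. 0.03 + 0.13 = 0.16 after rounding
```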

  • Automatic Speech Recognition: From Theory to Practice 32 (T-61.184)

    Backward Algorithm

    Definition:

    $\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T \mid q_t = i, \lambda)$

    (Probability of the observation sequence $o_{t+1}$ to $o_T$, given state i at time t and HMM λ)

    Initialization: $\beta_T(i) = 1$

    Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad t = T-1, T-2, \ldots, 1, \quad 1 \le i \le N$

  • Automatic Speech Recognition: From Theory to Practice 33 (T-61.184)

    Backward Algorithm Illustration

    [Trellis diagram: each node $\beta_t(i)$ at time t is computed from the N nodes $\beta_{t+1}(1), \ldots, \beta_{t+1}(N)$ at time t+1 via $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$.]

    Complexity: $O(T \cdot N^2)$

  • Automatic Speech Recognition: From Theory to Practice 34 (T-61.184)

    Backward Algorithm Example

    Backward trellis for O = {A, A, B} (same HMM as before), with $\pi_0 = 1$, $\pi_1 = 0$:

            t=0     t=1     t=2     t=3
    S0      0.16    0.28    0.40    1.0
    S1      0.00    0.21    0.70    1.0

    (The t=0 entries shown are $\pi_i\, \beta_0(i)$; they sum to $P(O \mid \lambda) \approx 0.16$, matching the forward algorithm.)
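
    A matching sketch of the backward algorithm under the same convention; reusing A, B, pi, and obs from the forward sketch above, it should again give P(O | λ) ≈ 0.158:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward algorithm; with the slides' convention, P(O | lambda) = sum_i pi_i * beta_0(i)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                    # initialization: beta_T(i) = 1
    for t in range(T - 1, -1, -1):                   # induction down to t = 0
        beta[t] = A @ (B[:, obs[t]] * beta[t + 1])
    return float((pi * beta[0]).sum()), beta

# With A, B, pi, obs from the forward sketch, beta reproduces 0.40/0.70, 0.28/0.21, ...
# and (pi * beta[0]).sum() is again ~0.158.
```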

  • Automatic Speech Recognition: From Theory to Practice 35 (T-61.184)

    Problem 1: Scoring and Evaluation

    In fact, there are two ways to compute $P(O \mid \lambda)$:

    Forward algorithm: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$

    Backward algorithm: $P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, \beta_0(i)$

    Computation of each is $O(N^2 T)$.

  • Automatic Speech Recognition: From Theory to Practice 36 (T-61.184)

    Problem 2: Decoding

    Given an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$,

    find the single best sequence of states $q = \{q_1, q_2, \ldots, q_T\}$

    which maximizes $P(O, q \mid \lambda)$.

  • Automatic Speech Recognition: From Theory to Practice 37 (T-61.184)

    Solution: Viterbi Algorithm

    Define:

    $\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \cdots q_{t-1}, q_t = i, o_1 o_2 \cdots o_t \mid \lambda)$

    "The highest probability along a single state sequence that accounts for observations 1 through t and ends in state i at time t."

    $\delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] \cdot b_j(o_{t+1})$

  • Automatic Speech Recognition: From Theory to Practice 38 (T-61.184)

    Viterbi Algorithm Illustration

    [Trellis diagram: each node $\delta_{t+1}(j)$ at time t+1 keeps only the best predecessor at time t: $\delta_{t+1}(j) = \max_{1 \le i \le N} [\delta_t(i)\, a_{ij}]\, b_j(o_{t+1})$.]

  • Automatic Speech Recognition: From Theory to Practice 39 (T-61.184)

    Viterbi Algorithm

    1. Initialization: $\delta_1(i) = \pi_i\, b_i(o_1)$, $\quad \psi_1(i) = 0$

    2. Recursion: $\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(o_t)$, $\quad \psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]$

    3. Termination: $P^{*} = \max_{1 \le i \le N} [\delta_T(i)]$, $\quad q_T^{*} = \arg\max_{1 \le i \le N} [\delta_T(i)]$

    4. Path backtrace: $q_t^{*} = \psi_{t+1}(q_{t+1}^{*})$

  • Automatic Speech Recognition: From Theory to Practice 40 (T-61.184)

    Viterbi Algorithm Illustration

    Viterbi trellis for O = {A, A, B} (same HMM as in the forward example):

            t=0    t=1     t=2     t=3
    S0      1.0    0.48    0.23    0.03
    S1      0.0    0.12    0.06    0.06
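
    A minimal NumPy sketch of Viterbi decoding, again with the $\delta_0(i) = \pi_i$ convention used in this trellis. With A, B, pi, and obs from the forward example it should give $P^{*} \approx 0.065$ and the path S0, S0, S0, S1:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding with the delta_0(i) = pi_i convention of this trellis."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T + 1, N))
    psi   = np.zeros((T + 1, N), dtype=int)
    delta[0] = pi
    for t in range(T):
        scores = delta[t][:, None] * A               # scores[i, j] = delta_t(i) * a_ij
        psi[t + 1]   = scores.argmax(axis=0)         # best predecessor of each state j
        delta[t + 1] = scores.max(axis=0) * B[:, obs[t]]
    # Termination and backtrace of the single best state sequence.
    path = [int(delta[T].argmax())]
    for t in range(T, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return float(delta[T].max()), path[::-1]

# With A, B, pi, obs from the forward example: P* ~ 0.065 (the 0.06 in the trellis),
# and the decoded path is S0, S0, S0, S1.
```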

  • Automatic Speech Recognition: From Theory to Practice 41 (T-61.184)

    Viterbi Algorithm in the Log-Domain

    Due to numerical underflow issues, it is common to carry out the Viterbi algorithm in the log domain,

    $\tilde{\pi}_i = \log(\pi_i), \qquad \tilde{b}_j(o_t) = \log(b_j(o_t)), \qquad \tilde{a}_{ij} = \log(a_{ij})$

    Viterbi search in speech recognition systems, for example, is implemented in the log domain.

  • Automatic Speech Recognition: From Theory to Practice 42 (T-61.184)

    Viterbi Algorithm in the Log-Domain

    1. Initialization: $\tilde{\delta}_1(i) = \tilde{\pi}_i + \tilde{b}_i(o_1)$, $\quad \psi_1(i) = 0$

    2. Recursion: $\tilde{\delta}_t(j) = \max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}] + \tilde{b}_j(o_t)$, $\quad \psi_t(j) = \arg\max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}]$

    3. Termination: $\tilde{P}^{*} = \max_{1 \le i \le N} [\tilde{\delta}_T(i)]$, $\quad q_T^{*} = \arg\max_{1 \le i \le N} [\tilde{\delta}_T(i)]$

    4. Path backtrace: $q_t^{*} = \psi_{t+1}(q_{t+1}^{*})$
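
    The same decoder in the log domain, as a sketch; mapping log(0) to -inf so that impossible transitions can never win the max is an assumption about handling zeros, not something specified on the slide:

```python
import numpy as np

def viterbi_log(logA, logB, logpi, obs):
    """Same recursion as viterbi() above, with products replaced by sums of logs."""
    N, T = len(logpi), len(obs)
    delta = np.full((T + 1, N), -np.inf)
    psi   = np.zeros((T + 1, N), dtype=int)
    delta[0] = logpi
    for t in range(T):
        scores = delta[t][:, None] + logA
        psi[t + 1]   = scores.argmax(axis=0)
        delta[t + 1] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[T].argmax())]
    for t in range(T, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return float(delta[T].max()), path[::-1]

# Building the log tables from the A, B, pi of the earlier example; log(0) -> -inf.
with np.errstate(divide="ignore"):
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
```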

  • Automatic Speech Recognition: From Theory to Practice 43 (T-61.184)

    Problem 3: Training

    Train the parameters of the HMM:
    Tune λ to maximize P(O | λ)
    There is no efficient algorithm for the global optimum
    An efficient iterative algorithm finds a local optimum

    (Baum-Welch) Forward-Backward Algorithm:
    Compute probabilities using the current model λ
    Refine λ into an updated estimate $\bar{\lambda}$ based on the computed values
    Uses α and β from the forward and backward algorithms

  • Automatic Speech Recognition: From Theory to Practice 44 (T-61.184)

    Forward-Backward Algorithm

    Define:

    $\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \dfrac{P(q_t = i, q_{t+1} = j, O \mid \lambda)}{P(O \mid \lambda)}$

    "Probability of being in state i at time t and state j at time t+1, given the model and observation sequence."

  • Automatic Speech Recognition: From Theory to Practice 45 (T-61.184)

    Forward-Backward Algorithm

    $\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}$

    [Trellis fragment: the numerator combines the forward probability $\alpha_t(i)$, the arc weight $a_{ij}\, b_j(o_{t+1})$, and the backward probability $\beta_{t+1}(j)$.]

  • Automatic Speech Recognition: From Theory to Practice 46 (T-61.184)

    Forward-Backward Algorithm

    $\xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}$

  • Automatic Speech Recognition: From Theory to Practice 47 (T-61.184)

    Forward-Backward Algorithm

    $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j)$ : probability of being in state i at time t

    $\sum_{t=1}^{T-1} \gamma_t(i)$ = expected number of transitions from state i in O

    $\sum_{t=1}^{T-1} \xi_t(i,j)$ = expected number of transitions from state i to state j in O

  • Automatic Speech Recognition: From Theory to Practice 48 (T-61.184)

    Forward-Backward Algorithm

    Using these definitions, we can compute the initial state occupancy probability,

    $\bar{\pi}_i$ = expected number of times in state i at time t = 1 $= \gamma_1(i)$

  • Automatic Speech Recognition: From Theory to Practice 49 (T-61.184)

    Forward-Backward Algorithm

    $\bar{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \xi_t(i,j)} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

  • Automatic Speech Recognition: From Theory to Practice 50 (T-61.184)

    Forward-Backward Algorithm

    $\bar{b}_j(k) = \dfrac{\text{expected number of times in state } j \text{ observing symbol } v_k}{\text{expected number of times in state } j} = \dfrac{\sum_{t=1,\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} = \dfrac{\sum_{t=1,\ o_t = v_k}^{T} \alpha_t(j)\, \beta_t(j)}{\sum_{t=1}^{T} \alpha_t(j)\, \beta_t(j)}$

  • Automatic Speech Recognition: From Theory to Practice 51 (T-61.184)

    Forward-Backward Algorithm

    1. Initialize $\lambda = (A, B, \pi)$
    2. Compute $\alpha$, $\beta$, and $\xi$
    3. Estimate $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})$
    4. Replace $\lambda$ with $\bar{\lambda}$
    5. If not converged, go to 2

    It can be shown that $P(O \mid \bar{\lambda}) > P(O \mid \lambda)$ unless $\bar{\lambda} = \lambda$, so each iteration improves the likelihood until a local optimum is reached.

    A compact code sketch of one re-estimation iteration follows below.
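
    A compact sketch of one Baum-Welch re-estimation step for a discrete-symbol HMM. It uses the textbook convention $\alpha_1(i) = \pi_i\, b_i(o_1)$ and unscaled probabilities, so it is illustrative only; real trainers scale α and β or work in the log domain:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One forward-backward re-estimation step; looping this until P(O | lambda)
    stops improving implements the procedure on the slide above."""
    N, M, T = A.shape[0], B.shape[1], len(obs)

    # Forward and backward passes (unscaled, for illustration).
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()                         # P(O | lambda)

    # State and transition posteriors gamma_t(i) and xi_t(i, j).
    gamma = alpha * beta / prob
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob

    # Re-estimation formulas from the preceding slides.
    new_pi = gamma[0]
    new_A  = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B  = np.zeros_like(B)
    for k in range(M):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, prob
```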

  • Automatic Speech Recognition: From Theory to Practice 52 (T-61.184)

    The FB Algorithm Satisfies the Constraints:

    $\sum_{i=1}^{N} \bar{\pi}_i = 1$

    $\sum_{j=1}^{N} \bar{a}_{ij} = 1, \quad 1 \le i \le N$

    $\sum_{k=1}^{M} \bar{b}_j(k) = 1, \quad 1 \le j \le N$

  • Automatic Speech Recognition: From Theory to Practice 53 (T-61.184)

    Continuous Observation Densities

    Discussions up until this point have considered observations characterized by discrete symbols

    Often, we work with continuous valued data

    Can convert continuous variables to discrete symbols using a Vector Quantizer (VQ) codebook [with information loss]

    How to model continuous observations directly?

  • Automatic Speech Recognition: From Theory to Practice 54 (T-61.184)

    Mixture Gaussian PDFs

    Mixture Gaussian PDF with M components,

    $b_j(\mathbf{o}_t) = \sum_{k=1}^{M} c_{jk}\, \mathcal{N}(\mathbf{o}_t; \mu_{jk}, \Sigma_{jk})$

    Constraints on the mixture weights,

    $\sum_{k=1}^{M} c_{jk} = 1, \qquad c_{jk} \ge 0, \quad 1 \le k \le M$
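
    A small sketch of evaluating such a mixture density for one state, assuming diagonal covariance matrices (a common simplification that the slide does not require):

```python
import numpy as np

def mixture_gaussian_pdf(o, weights, means, variances):
    """b_j(o) = sum_k c_jk * N(o; mu_jk, Sigma_jk) for one state j,
    with diagonal covariances so each component is a product of 1-D Gaussians."""
    o = np.asarray(o, dtype=float)
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** len(o) * np.prod(var))
        total += c * norm * np.exp(-0.5 * np.sum((o - mu) ** 2 / var))
    return total
```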

  • Automatic Speech Recognition: From Theory to Practice 55 (T-61.184)

    Definition

    Probability of being in state j at time t with the kth mixture component accounting for $\mathbf{o}_t$:

    $\gamma_t(j,k) = \left[ \dfrac{\alpha_t(j)\, \beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)} \right] \left[ \dfrac{c_{jk}\, \mathcal{N}(\mathbf{o}_t; \mu_{jk}, \Sigma_{jk})}{\sum_{m=1}^{M} c_{jm}\, \mathcal{N}(\mathbf{o}_t; \mu_{jm}, \Sigma_{jm})} \right]$

  • Automatic Speech Recognition: From Theory to Practice 56 (T-61.184)

    Resulting Equations for Mixture Weight & Mean Update

    $\bar{c}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j,k)}$

    $\bar{\mu}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, \mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}$

  • Automatic Speech Recognition: From Theory to Practice 57 (T-61.184)

    Resulting Equations for Transition Matrix & Variance Update

    The transition probability $a_{ij}$ is re-estimated as in the discrete-symbol case. Covariance matrix:

    $\bar{\Sigma}_{jk} = \dfrac{\sum_{t=1}^{T} \gamma_t(j,k)\, (\mathbf{o}_t - \mu_{jk})(\mathbf{o}_t - \mu_{jk})'}{\sum_{t=1}^{T} \gamma_t(j,k)}$

  • Automatic Speech Recognition: From Theory to Practice 58 (T-61.184)

    Estimation from Multiple Observation Sequences

    We estimate HMM parameters from multiple examples of speech:
    collected from different speakers,
    in different contexts.

    We attempt to model the variability in producing each sound unit by estimating parameters from large corpora of speech data.

  • Automatic Speech Recognition: From Theory to Practice 59 (T-61.184)

    Multiple Observations

    Assume K training patterns (observation sequences),

    $\mathbf{O} = [O^{(1)}, O^{(2)}, \ldots, O^{(K)}]$

    where $O^{(k)} = \{o_1^{(k)}, o_2^{(k)}, \ldots, o_{T_k}^{(k)}\}$

    and $P_k = P(O^{(k)} \mid \lambda)$.

  • Automatic Speech Recognition: From Theory to Practice 60 (T-61.184)

    Transition Probabilities with Multiple Observations

    $\bar{a}_{ij} = \dfrac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, a_{ij}\, b_j(o_{t+1}^{(k)})\, \beta_{t+1}^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, \beta_t^k(i)}$

  • Automatic Speech Recognition: From Theory to Practice 61 (T-61.184)

    Observation Probabilities with Multiple Observations

    $\bar{b}_j(l) = \dfrac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1,\ \text{s.t.}\ o_t^{(k)} = v_l}^{T_k} \alpha_t^k(j)\, \beta_t^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \alpha_t^k(j)\, \beta_t^k(j)}$

  • Automatic Speech Recognition: From Theory to Practice 62 (T-61.184)

    Single-State HMM Example

    How do we estimate the parameters of a single-state HMM with M component Gaussians?

    $b(\mathbf{o}_t) = \sum_{k=1}^{M} w_k\, b_k(\mathbf{o}_t) = \sum_{k=1}^{M} w_k\, \mathcal{N}(\mathbf{o}_t; \mu_k, \Sigma_k)$

    $b_k(\mathbf{o}_t) = \dfrac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (\mathbf{o}_t - \mu_k)'\, \Sigma_k^{-1} (\mathbf{o}_t - \mu_k) \right)$

  • Automatic Speech Recognition: From Theory to Practice 63 (T-61.184)

    Single-State HMM Example

    Define the probability of the tth observation being generated by the kth mixture component,

    $P(k \mid \mathbf{o}_t, \lambda) = \dfrac{w_k\, b_k(\mathbf{o}_t)}{\sum_{k=1}^{M} w_k\, b_k(\mathbf{o}_t)}$

    Note that k (in $b_k$) here refers to the mixture component, not the HMM state. We are assuming just one HMM state.

  • Automatic Speech Recognition: From Theory to Practice 64 (T-61.184)

    Single-State HMM Update Equations

    $\bar{w}_k = \dfrac{1}{T} \sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)$

    $\bar{\mu}_k = \dfrac{\sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)\, \mathbf{o}_t}{\sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)}$

    $\bar{\sigma}_k^2 = \dfrac{\sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)\, (\mathbf{o}_t - \mu_k)^2}{\sum_{t=1}^{T} P(k \mid \mathbf{o}_t, \lambda)}$

    The term $P(k \mid \mathbf{o}_t, \lambda)$ is computed using the model parameters from the previous algorithm iteration; the left-hand sides are the updated (new) parameter estimates.
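
    A sketch of one EM iteration for this single-state case, assuming diagonal covariances (so per-dimension variances rather than full covariance matrices); the function and variable names are mine:

```python
import numpy as np

def gmm_em_step(obs, weights, means, variances):
    """One EM update for a single-state HMM with M diagonal-covariance Gaussians.
    obs: (T, d) array; weights: (M,); means, variances: (M, d)."""
    T, d = obs.shape
    # E-step: P(k | o_t, lambda) using the *previous* iteration's parameters.
    comp = np.stack([
        w * np.exp(-0.5 * np.sum((obs - mu) ** 2 / var, axis=1))
            / np.sqrt((2 * np.pi) ** d * np.prod(var))
        for w, mu, var in zip(weights, means, variances)], axis=1)    # (T, M)
    post = comp / comp.sum(axis=1, keepdims=True)                     # P(k | o_t)

    # M-step: updated (new) parameter estimates from the slide.
    occ = post.sum(axis=0)                                            # (M,)
    new_weights = occ / T
    new_means = (post.T @ obs) / occ[:, None]
    new_vars  = np.stack([(post[:, k, None] * (obs - new_means[k]) ** 2).sum(axis=0)
                          for k in range(len(weights))]) / occ[:, None]
    return new_weights, new_means, new_vars
```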

  • Automatic Speech Recognition: From Theory to Practice 65 (T-61.184)

    Mixture Gaussian (1-D Case)

    [Figure: a one-dimensional mixture density $b(o)$ plotted against $o$; three mixture components are used to model an underlying random process of 3 Gaussians.]

  • Automatic Speech Recognition: From Theory to Practice 66 (T-61.184)

    Homework #3

    You will estimate the parameters of a single-state multivariate HMM given a sequence of observations (O).

    Training data (observations) are 13-dimensional MFCC vectors from 10 speakers.

    Test data come from unknown speakers… which speaker most likely produced each sequence?

  • Automatic Speech Recognition: From Theory to Practice 67 (T-61.184)

    Homework #3

    We can compute the probability of generating the observation sequence under each estimated single-state HMM:

    $\log P(O \mid \lambda_s) = \sum_{t=1}^{T} \log b_s(\mathbf{o}_t)$

    Find the model which has the maximum log-probability…
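
    A sketch of this scoring step, again assuming diagonal-covariance mixtures; the models dictionary in the usage comment is hypothetical:

```python
import numpy as np

def log_likelihood(obs, weights, means, variances):
    """log P(O | lambda_s) = sum_t log b_s(o_t) for one speaker's single-state GMM."""
    T, d = obs.shape
    comp = np.stack([
        w * np.exp(-0.5 * np.sum((obs - mu) ** 2 / var, axis=1))
            / np.sqrt((2 * np.pi) ** d * np.prod(var))
        for w, mu, var in zip(weights, means, variances)], axis=1)
    return float(np.log(comp.sum(axis=1)).sum())

# Hypothetical usage: `models` maps speaker id -> (weights, means, variances).
# best_speaker = max(models, key=lambda s: log_likelihood(test_obs, *models[s]))
```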

  • Automatic Speech Recognition: From Theory to Practice 68 (T-61.184)

    HMM Assumptions

    First-order Markov chain: the probability of transitioning to a state depends only on the current state.

    Time-independence of state transitions: the probability of transitioning from state A to state B is independent of time.

    Observation independence: observations don't depend on each other, just on the states that generated them.

    (Sometimes) left-to-right topology: as we will see, a left-to-right HMM is generally used to model speech units (phonemes).

  • Automatic Speech Recognition: From Theory to Practice 69 (T-61.184)

    Important Concepts from Today

    Do you understand the basic idea of an HMM?

    Could you implement the Viterbi algorithm if you had to?

    Could you at least estimate the parameters of a single-state HMM (mixture Gaussians)?

    What assumptions do HMMs impose on the data being modeled?

  • Automatic Speech Recognition: From Theory to Practice 70 (T-61.184)

    Next Week

    How to use HMMs for modeling speech units?

    Consider major strategies for acoustic training

    How to model context dependencies using HMMs

    Some practical notes on HMM training for speech recognition