LSA 352 Speech Recognition and Synthesis, Lecture 7: Baum-Welch/Learning and Disfluencies (Dan Jurafsky, Summer 2007)



    1LSA 352 Summer 2007

LSA 352: Speech Recognition and Synthesis

    Dan Jurafsky

Lecture 7: Baum-Welch/Learning and Disfluencies

IP Notice: Some of these slides were derived from Andrew Ng's CS 229 notes, as well as lecture notes from Chen, Picheny et al., and Bryan Pellom. I'll try to accurately give credit on each slide.

    2LSA 352 Summer 2007

    Outline for Today

Learning
EM for HMMs (the "Baum-Welch Algorithm")
Embedded Training
Training mixture Gaussians

    Disfluencies

    3LSA 352 Summer 2007

    The Learning Problem

Baum-Welch = Forward-Backward Algorithm (Baum 1972)
Is a special case of the EM or Expectation-Maximization algorithm (Dempster, Laird, Rubin)
The algorithm will let us train the transition probabilities A = {a_ij} and the emission probabilities B = {b_i(o_t)} of the HMM

    4LSA 352 Summer 2007

    Input to Baum-Welch

O: unlabeled sequence of observations
Q: vocabulary of hidden states

For the ice-cream task:
O = {1, 3, 2, ...}
Q = {H, C}

    5LSA 352 Summer 2007

Starting out with Observable Markov Models

How to train?
Run the model on the observation sequence O.
Since it's not hidden, we know which states we went through, hence which transitions and observations were used.
Given that information, training:

B = {b_k(o_t)}: Since every state can only generate one observation symbol, observation likelihoods B are all 1.0
A = {a_ij}:

$a_{ij} = \dfrac{C(i \to j)}{\sum_{q \in Q} C(i \to q)}$

    6LSA 352 Summer 2007

    Extending Intuition to HMMs

For HMM, cannot compute these counts directly from observed sequences
Baum-Welch intuitions:

Iteratively estimate the counts.
– Start with an estimate for a_ij and b_k, iteratively improve the estimates

Get estimated probabilities by:
– computing the forward probability for an observation
– dividing that probability mass among all the different paths that contributed to this forward probability


    7LSA 352 Summer 2007

    The Backward algorithm

We define the backward probability as follows:

$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \Phi)$

This is the probability of generating the partial observation O_{t+1}^{T} from time t+1 to the end, given that the HMM is in state i at time t and of course given Φ.

    8LSA 352 Summer 2007

    The Backward algorithm

    We compute backward prob by induction:

    9LSA 352 Summer 2007

Inductive step of the backward algorithm (figure inspired by Rabiner and Juang):
Computation of β_t(i) by weighted sum of all successive values β_{t+1}(j)
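The recursion itself appears only as a figure in the original slides; in the standard formulation it is β_T(i) = 1 (or a_{iF} with an explicit final state) and β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j). A minimal NumPy sketch of that induction, assuming A is an N x N transition matrix, B an N x V emission table, and obs a list of observation indices (no scaling, so only suitable for short sequences):

import numpy as np

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1} ... o_T | q_t = i)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1, :] = 1.0                     # base case: nothing left to generate
    for t in range(T - 2, -1, -1):           # induction, right to left
        # weighted sum over successors j of a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t, :] = A @ (B[:, obs[t + 1]] * beta[t + 1, :])
    return beta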

    10LSA 352 Summer 2007

    Intuition for re-estimation of aij

We will estimate $\hat{a}_{ij}$ via this intuition:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$

Numerator intuition:
Assume we had some estimate of the probability that a given transition i→j was taken at time t in the observation sequence.
If we knew this probability for each time t, we could sum over all t to get the expected value (count) for i→j.

    11LSA 352 Summer 2007

    Re-estimation of aij

Let ξ_t(i,j) be the probability of being in state i at time t and state j at time t+1, given O_{1..T} and model Φ:

$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \Phi)$

We can compute ξ from not-quite-ξ, which is:

$\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \Phi)$

    12LSA 352 Summer 2007

    Computing not-quite-ξ

The four components of $P(q_t = i, q_{t+1} = j, O \mid \Phi)$: $\alpha$, $\beta$, $a_{ij}$ and $b_j(o_{t+1})$


    13LSA 352 Summer 2007

    From not-quite-ξ to ξ

We want:

$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \Phi)$

We've got:

$\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \Phi)$

Which we compute as follows:

$\text{not\_quite\_}\xi_t(i, j) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$

    14LSA 352 Summer 2007

    From not-quite-ξ to ξ

We want:

$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \Phi)$

We've got:

$\text{not\_quite\_}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \Phi)$

Since:

$\xi_t(i, j) = \dfrac{\text{not\_quite\_}\xi_t(i, j)}{P(O \mid \Phi)}$

We need: $P(O \mid \Phi)$

    15LSA 352 Summer 2007

    From not-quite-ξ to ξ

$\xi_t(i, j) = \dfrac{\text{not\_quite\_}\xi_t(i, j)}{P(O \mid \Phi)}$

    16LSA 352 Summer 2007

    From ξ to aij

The expected number of transitions from state i to state j is the sum over all t of ξ_t(i,j).
The total expected number of transitions out of state i is the sum over all transitions out of state i.
Final formula for re-estimated a_ij:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}$
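A small NumPy sketch of this re-estimation step, under the assumption that alpha and beta are (T x N) arrays from forward and backward passes (as in the backward() sketch earlier), A and B are the current transition and emission matrices, and obs is the list of observation indices:

import numpy as np

def reestimate_transitions(A, B, obs, alpha, beta):
    """Compute xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, model) and return the new A."""
    T, N = alpha.shape
    P_O = alpha[-1].sum()                                  # P(O | model)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :]
                 * beta[t + 1][None, :]) / P_O
    # expected i->j transitions divided by expected transitions out of i
    return xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]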

    17LSA 352 Summer 2007

Re-estimating the observation likelihood b

$\hat{b}_j(v_k) = \dfrac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$

    We’ll need to know the probability of being in state j at time t:

    18LSA 352 Summer 2007

    Computing gamma
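The figure for this slide is not reproduced in the transcript; the standard definition, consistent with the α and β above, is

$\gamma_t(j) = \dfrac{\alpha_t(j)\,\beta_t(j)}{P(O \mid \Phi)}$

and the observation likelihoods are then re-estimated as

$\hat{b}_j(v_k) = \dfrac{\sum_{t\ \text{s.t.}\ o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$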


    19LSA 352 Summer 2007

    Summary

$\hat{a}_{ij}$: the ratio between the expected number of transitions from state i to j and the expected number of all transitions from state i

$\hat{b}_j(v_k)$: the ratio between the expected number of times the observation data emitted from state j is v_k, and the expected number of times any observation is emitted from state j

    20LSA 352 Summer 2007

Summary: Forward-Backward Algorithm

1) Initialize Φ = (A, B)
2) Compute α, β, ξ
3) Estimate new Φ' = (A, B)
4) Replace Φ with Φ'
5) If not converged, go to 2
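Putting the pieces together, a compact sketch of this loop for a discrete-output HMM (a toy version with no scaling or log-space arithmetic, so only usable on short sequences such as the ice-cream example; pi is an initial-state distribution standing in for the textbook's non-emitting start state, and backward() is repeated from the earlier sketch so the block is self-contained):

import numpy as np

def forward(A, B, pi, obs):
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch(A, B, pi, obs, n_iter=20):
    """EM loop: E-step computes alpha/beta/xi/gamma, M-step re-estimates A, B, pi."""
    A, B, pi = A.copy(), B.copy(), pi.copy()
    obs = np.asarray(obs)
    N, V = B.shape
    T = len(obs)
    for _ in range(n_iter):
        alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
        P_O = alpha[-1].sum()
        gamma = alpha * beta / P_O                           # P(q_t = i | O)
        xi = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A
                     * B[:, obs[t + 1]][None, :]
                     * beta[t + 1][None, :]) / P_O
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # expected-count ratio
        for k in range(V):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
        pi = gamma[0]
    return A, B, pi

# Example: two hidden states (e.g. Hot/Cold), three symbols (1-3 ice creams -> 0..2)
# rng = np.random.default_rng(0)
# A0 = rng.dirichlet(np.ones(2), size=2); B0 = rng.dirichlet(np.ones(3), size=2)
# A, B, pi = baum_welch(A0, B0, np.array([0.5, 0.5]), [2, 0, 2, 1, 0])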

    21LSA 352 Summer 2007

    The Learning Problem: Caveats

Network structure of HMM is always created by hand; no algorithm for double-induction of optimal structure and probabilities has been able to beat simple hand-built structures.
Always Bakis network = links go forward in time
Subcase of Bakis net: beads-on-string net

Baum-Welch only guaranteed to return local max, rather than global optimum

    22LSA 352 Summer 2007

    Complete Embedded Training

Setting all the parameters in an ASR system
Given:
training set: wavefiles & word transcripts for each sentence
hand-built HMM lexicon

Uses:
Baum-Welch algorithm

    We’ll return to this after we’ve introduced GMMs

    23LSA 352 Summer 2007

    Baum-Welch for Mixture Models

By analogy with ξ earlier, let's define the probability of being in state j at time t with the m-th mixture component accounting for o_t:

$\xi_{tm}(j) = \dfrac{\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\, \beta_t(j)}{\alpha_F(T)}$

Now,

$\hat{\mu}_{jm} = \dfrac{\sum_{t=1}^{T} \xi_{tm}(j)\, o_t}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}$

$\hat{c}_{jm} = \dfrac{\sum_{t=1}^{T} \xi_{tm}(j)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}$

$\hat{\Sigma}_{jm} = \dfrac{\sum_{t=1}^{T} \xi_{tm}(j)\, (o_t - \mu_{jm})(o_t - \mu_{jm})^{T}}{\sum_{t=1}^{T} \sum_{k=1}^{M} \xi_{tk}(j)}$

    24LSA 352 Summer 2007

    How to train mixtures?

• Choose M (often 16; or can tune M dependent on amount of training observations)
• Then can do various splitting or clustering algorithms
• One simple method for "splitting":
1) Compute global mean µ and global variance
2) Split into two Gaussians, with means µ ± ε (sometimes ε is 0.2σ)
3) Run Forward-Backward to retrain
4) Go to 2 until we have 16 mixtures
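A rough sketch of that splitting schedule, under some illustrative assumptions (diagonal covariances, ε = 0.2σ, one global pool of feature vectors); the retrain argument is a placeholder for the Forward-Backward retraining in step 3, which is not implemented here:

import numpy as np

def split_mixtures(data, M=16, eps_frac=0.2, retrain=None):
    """Grow a diagonal-covariance GMM by repeated mean splitting, from one Gaussian."""
    mu = data.mean(axis=0, keepdims=True)        # 1) global mean (1 x D)
    var = data.var(axis=0, keepdims=True)        #    and global variance
    weights = np.ones(1)
    while len(weights) < M:
        eps = eps_frac * np.sqrt(var)            # 2) split each Gaussian: mu - eps, mu + eps
        mu = np.concatenate([mu - eps, mu + eps], axis=0)
        var = np.concatenate([var, var], axis=0)
        weights = np.concatenate([weights, weights]) / 2.0
        if retrain is not None:                  # 3) e.g. a Forward-Backward pass
            mu, var, weights = retrain(mu, var, weights)
    return mu, var, weights                      # 4) repeat until M mixtures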


    25LSA 352 Summer 2007

    Embedded Training

Components of a speech recognizer:
Feature extraction: not statistical
Language model: word transition probabilities, trained on some other corpus
Acoustic model:
– Pronunciation lexicon: the HMM structure for each word, built by hand
– Observation likelihoods b_j(o_t)
– Transition probabilities a_ij

    26LSA 352 Summer 2007

Embedded training of acoustic model

If we had hand-segmented and hand-labeled training data
with word and phone boundaries
we could just compute the:

B: means and variances of all our triphone Gaussians
A: transition probabilities

And we'd be done!
But we don't have word and phone boundaries, nor phone labeling

    27LSA 352 Summer 2007

    Embedded training

    Instead:We’ll train each phone HMM embedded in an entiresentenceWe’ll do word/phone segmentation and alignmentautomatically as part of training process

    28LSA 352 Summer 2007

    Embedded Training

    29LSA 352 Summer 2007

    Initialization: “Flat start”

Transition probabilities: set to zero any that you want to be "structurally zero"
– The γ probability computation includes the previous value of a_ij, so if it's zero it will never change
Set the rest to identical values

Likelihoods:
initialize µ and σ of each state to the global mean and variance of all training data
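A minimal sketch of a flat start for a single left-to-right (Bakis) HMM with self-loops; the function name and topology are illustrative assumptions, not from any particular toolkit:

import numpy as np

def flat_start(n_states, features):
    """Flat start: structural zeros in A, identical values elsewhere, global Gaussians."""
    A = np.zeros((n_states, n_states))           # zeros stay zero under re-estimation
    for i in range(n_states):
        allowed = [i] if i == n_states - 1 else [i, i + 1]   # self-loop + next state
        A[i, allowed] = 1.0 / len(allowed)       # identical values for allowed arcs
    global_mean = features.mean(axis=0)          # likelihoods: every state starts the same
    global_var = features.var(axis=0)
    means = np.tile(global_mean, (n_states, 1))
    variances = np.tile(global_var, (n_states, 1))
    return A, means, variances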

    30LSA 352 Summer 2007

    Embedded Training

Now we have estimates for A and B
So we just run the EM algorithm
During each iteration, we compute forward and backward probabilities
Use them to re-estimate A and B
Run EM until convergence


    31LSA 352 Summer 2007

    Viterbi training

Baum-Welch training says:
We need to know what state we were in, to accumulate counts of a given output symbol o_t
We'll compute γ_t(i), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output o_t.

Viterbi training says:
Instead of summing over all possible paths, just take the single most likely path
Use the Viterbi algorithm to compute this "Viterbi" path
Via "forced alignment"

    32LSA 352 Summer 2007

    Forced Alignment

Computing the "Viterbi path" over the training data is called "forced alignment"
Because we know which word string to assign to each observation sequence.
We just don't know the state sequence.
So we use a_ij to constrain the path to go through the correct words
And otherwise do normal Viterbi
Result: state sequence!

    33LSA 352 Summer 2007

    Viterbi training equations

Viterbi counts (vs. the Baum-Welch expected counts above):

$\hat{a}_{ij} = \dfrac{n_{ij}}{n_i}$

$\hat{b}_j(v_k) = \dfrac{n_j(\text{s.t. } o_t = v_k)}{n_j}$

For all pairs of emitting states, $1 \le i, j \le N$
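A sketch of the counting step for discrete observations, assuming the Viterbi/forced-alignment state paths have already been produced by a decoder (not shown); the counts are then turned into exactly the ratios above:

import numpy as np

def viterbi_counts(state_paths, obs_seqs, N, V):
    """Re-estimate A and B from Viterbi (forced-alignment) state sequences."""
    trans = np.zeros((N, N))                      # n_ij
    emit = np.zeros((N, V))                       # n_j such that o_t = v_k
    for path, obs in zip(state_paths, obs_seqs):
        for t, (state, symbol) in enumerate(zip(path, obs)):
            emit[state, symbol] += 1
            if t + 1 < len(path):
                trans[state, path[t + 1]] += 1
    A_hat = trans / np.maximum(trans.sum(axis=1, keepdims=True), 1)   # n_ij / n_i
    B_hat = emit / np.maximum(emit.sum(axis=1, keepdims=True), 1)     # n_j(o_t=v_k) / n_j
    return A_hat, B_hat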


    37LSA 352 Summer 2007

    Repeating:

    With some rearrangement of terms

    Where:

Note that this looks like a weighted Mahalanobis distance! This may also justify why these aren't really probabilities (point estimates); they are really just distances.

    Log domain
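The formulas on this slide are not reproduced in the transcript. For a diagonal-covariance Gaussian output distribution, the rearrangement being referred to is the standard one:

$\log b_j(o_t) = -\frac{1}{2} \sum_{d=1}^{D} \left[ \log(2\pi\sigma_{jd}^2) + \frac{(o_{td} - \mu_{jd})^2}{\sigma_{jd}^2} \right]$

so, up to a per-state additive constant, the log likelihood is minus a variance-weighted (Mahalanobis) distance $\sum_d (o_{td} - \mu_{jd})^2 / \sigma_{jd}^2$ between the observation and the state mean, which is also why the computation is done in the log domain.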

    38LSA 352 Summer 2007

    State Tying: Young, Odell, Woodland 1994

The steps in creating CD phones:
Start with monophones, do EM training
Then clone Gaussians into triphones
Then build decision tree and cluster Gaussians
Then clone and train mixtures (GMMs)

    39LSA 352 Summer 2007

Summary: Acoustic Modeling for LVCSR

Increasingly sophisticated models
For each state:
Gaussians
Multivariate Gaussians
Mixtures of multivariate Gaussians

Where a state is progressively:
CI phone
CI subphone (3ish per phone)
CD phone (= triphones)
State-tying of CD phones

Forward-Backward training
Viterbi training

    40LSA 352 Summer 2007

    Outline

Disfluencies
Characteristics of disfluencies
Detecting disfluencies
MDE bakeoff
Fragments

    41LSA 352 Summer 2007

    Disfluencies

    42LSA 352 Summer 2007

    Disfluencies: standard terminology (Levelt)

Reparandum: thing repaired
Interruption point (IP): where speaker breaks off
Editing phase (edit terms): uh, I mean, you know
Repair: fluent continuation


    43LSA 352 Summer 2007

    Why disfluencies?

Need to clean them up to get understanding
Does American Airlines offer any one-way flights [uh] one-way fares for 160 dollars?
Delta leaving Boston seventeen twenty one arriving Fort Worth twenty two twenty one forty

Might help in language modeling
Disfluencies might occur at particular positions (boundaries of clauses, phrases)

Annotating them helps readability
Disfluencies cause errors in nearby words

    44LSA 352 Summer 2007

    Counts (from Shriberg, Heeman)

Sentence disfluency rate
ATIS: 6% of sentences disfluent (10% long sentences)
Levelt human dialogs: 34% of sentences disfluent
Swbd: ~50% of multiword sentences disfluent
TRAINS: 10% of words are in reparandum or editing phrase

Word disfluency rate
SWBD: 6%
ATIS: 0.4%
AMEX: 13%
– (human-human air travel)

    45LSA 352 Summer 2007

Prosodic characteristics of disfluencies

Nakatani and Hirschberg 1994
Fragments are good cues to disfluencies
Prosody:
Pause duration is shorter in disfluent silence than fluent silence
F0 increases from end of reparandum to beginning of repair, but only minor change
Repair interval offsets have minor prosodic phrase boundary, even in middle of NP:
– Show me all n- | round-trip flights | from Pittsburgh | to Atlanta

    46LSA 352 Summer 2007

Syntactic Characteristics of Disfluencies

Hindle (1983)
The repair often has the same structure as the reparandum
Both are Noun Phrases (NPs) in this example:

So if we could automatically find the IP, we could find and correct the reparandum!

    47LSA 352 Summer 2007

    Disfluencies and LM

Clark and Fox Tree
Looked at "um" and "uh"
"uh" includes "er" ("er" is just British non-rhotic dialect spelling for "uh")

Different meanings
Uh: used to announce minor delays
– Preceded and followed by shorter pauses
Um: used to announce major delays
– Preceded and followed by longer pauses

    48LSA 352 Summer 2007

Um versus uh: delays (Clark and Fox Tree)


    49LSA 352 Summer 2007

    Utterance Planning

The more difficulty speakers have in planning, the more delays
Consider 3 locations:
I: before intonation phrase: hardest
II: after first word of intonation phrase: easier
III: later: easiest

And then uh somebody said, . [I] but um -- [II] don't you think there's evidence of this, in the twelfth - [III] and thirteenth centuries?

    50LSA 352 Summer 2007

    Delays at different points in phrase

    51LSA 352 Summer 2007

    Disfluencies in language modeling

Should we "clean up" disfluencies before training the LM (i.e. skip over disfluencies)?

Filled pauses
– Does United offer any [uh] one-way fares?
Repetitions
– What what are the fares?
Deletions
– Fly to Boston from Boston
Fragments (we'll come back to these)
– I want fl- flights to Boston.

    52LSA 352 Summer 2007

Does deleting disfluencies improve LM perplexity?

Stolcke and Shriberg (1996)
Build two LMs:
Raw
Removing disfluencies
See which has a higher perplexity on the data.
Filled Pause
Repetition
Deletion
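A toy sketch of that comparison with an add-one-smoothed bigram LM; strip_disfluencies is a hypothetical placeholder for whichever removal rule (filled pause, repetition, deletion) is being tested, and note that the actual study measures perplexity specifically at the word following the disfluency rather than over the whole test set:

import math
from collections import Counter

def train_bigram(sentences):
    """sentences: lists of tokens. Returns (unigram counts, bigram counts, vocab size)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return unigrams, bigrams, len(vocab)

def perplexity(model, sentences):
    unigrams, bigrams, V = model
    logprob, n = 0.0, 0
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(toks[:-1], toks[1:]):
            p = (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)   # add-one smoothing
            logprob += math.log(p)
            n += 1
    return math.exp(-logprob / n)

# raw_lm = train_bigram(train_sentences)
# cleaned_lm = train_bigram([strip_disfluencies(s) for s in train_sentences])
# print(perplexity(raw_lm, test_sentences), perplexity(cleaned_lm, test_sentences))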

    53LSA 352 Summer 2007

Change in Perplexity when Filled Pauses (FP) are removed

LM perplexity goes up at the following word:

Removing filled pauses makes the LM worse!!
I.e., filled pauses seem to help to predict the next word.
Why would that be?

Stolcke and Shriberg 1996

54LSA 352 Summer 2007

Filled pauses tend to occur at clause boundaries

Word before FP is end of previous clause; word after is start of new clause

Best not to delete FP

Some of the different things we're doing [uh] there's not time to do it all
"there's" is very likely to start a sentence
So P(there's | uh) is better estimate than P(there's | doing)


    55LSA 352 Summer 2007

    Suppose we just delete medial FPs

Experiment 2:
Parts of SWBD hand-annotated for clauses
Build FP-model by deleting only medial FPs
Now prediction of post-FP word (perplexity) improves greatly!
Siu and Ostendorf found same with "you know"

    56LSA 352 Summer 2007

    What about REP and DEL

S+S built a model with "cleaned-up" REP and DEL
Slightly lower perplexity
But exact same word error rate (49.5%)
Why?

Rare: only 2 words per 100
Doesn't help because adjacent words are misrecognized anyhow!

    57LSA 352 Summer 2007

Stolcke and Shriberg conclusions wrt LM and disfluencies

Disfluency modeling purely in the LM probably won't vastly improve WER
But:
Disfluencies should be considered jointly with the sentence segmentation task
Need to do better at recognizing disfluent words themselves
Need acoustic and prosodic features

    58LSA 352 Summer 2007

WER reductions from modeling disfluencies + background events

    Rose and Riccardi (1999)

    59LSA 352 Summer 2007

    HMIHY Background events

    Out of 16,159 utterances:

Event               Count
Echoed Prompt       5353
Operator Utt.       5112
Background Speech   3585
Non-Speech Noise    8834
Breath              8048
Lipsmack            2171
Laughter             163
Hesitations          792
Word Fragments      1265
Filled Pauses       7189

    60LSA 352 Summer 2007

    Phrase-based LM

"I would like to make a collect call"  "a [wfrag]"  [brth]  "[brth] and I"


    61LSA 352 Summer 2007

Rose and Riccardi (1999): Modeling "LBEs" does help in WER

ASR Word Accuracy

System Configuration       Test Corpora
HMM       LM               Greeting (HMIHY)   Card Number
Baseline  Baseline         58.7               87.7
LBE       Baseline         60.8               88.1
LBE       LBE              60.8               89.8

    62LSA 352 Summer 2007

    More on location of FPs

Peters: Medical dictation task
Monologue rather than dialogue
In this data, FPs occurred INSIDE clauses
Trigram PP after FP: 367
Trigram PP after word: 51

Stolcke and Shriberg (1996b)
w_k FP w_k+1: looked at P(w_k+1 | w_k)
Transition probabilities lower for these transitions than normal ones

Conclusion:
People use FPs when they are planning difficult things, so following words likely to be unexpected/rare/difficult

    63LSA 352 Summer 2007

    Detection of disfluencies

Nakatani and Hirschberg
Decision tree at the w_i-w_j boundary
Pause duration
Word fragments
Filled pause
Energy peak within w_i
Amplitude difference between w_i and w_j
F0 of w_i
F0 differences
Whether w_i accented

Results: 78% recall / 89.2% precision

    64LSA 352 Summer 2007

    Detection/Correction

Bear, Dowding, Shriberg (1992)
System 1:
Hand-written pattern matching rules to find repairs
– Look for identical sequences of words
– Look for syntactic anomalies ("a the", "to from")
– 62% precision, 76% recall
– Rate of accurately correcting: 57%

    65LSA 352 Summer 2007

Using Natural Language Constraints

Gemini natural language system
Based on Core Language Engine
Full syntax and semantics for ATIS
Coverage of whole corpus:
70% syntax
50% semantics

    66LSA 352 Summer 2007

Using Natural Language Constraints

Gemini natural language system
Run the pattern matcher
For each sentence it returned:
Remove fragment sentences
Leaving 179 repairs, 176 false positives
Parse each sentence
– If it succeeds: mark as false positive
– If it fails:
run the pattern matcher, make corrections
parse again
if it succeeds, mark as repair
if it fails, mark no opinion


    67LSA 352 Summer 2007

    NL Constraints

Syntax Only
Precision: 96%

Syntax and Semantics
Correction: 70%

    68LSA 352 Summer 2007

Recent work: EARS Metadata Evaluation (MDE)

A recent multiyear DARPA bakeoff

Sentence-like Unit (SU) detection:
find end points of SUs
detect subtype (question, statement, backchannel)

Edit word detection:
find all words in the reparandum (words that will be removed)

Filler word detection:
filled pauses (uh, um)
discourse markers (you know, like, so)
editing terms (I mean)

    Interruption point detection

    Liu et al 2003

    69LSA 352 Summer 2007

    Kinds of disfluencies

Repetitions
I * I like it

Revisions
We * I like it

Restarts (false starts)
It's also * I like it

    70LSA 352 Summer 2007

    MDE transcription

Conventions:
./ for statement SU boundaries
<> for fillers
[] for edit words
* for IP (interruption point) inside edits

And wash your clothes wherever you are ./ and [ you ] * you really get used to the outdoors ./

    71LSA 352 Summer 2007

    MDE Labeled Corpora

                        BN      CTS
Training set (words)    182K    484K
Test set (words)        45K     35K
STT WER (%)             11.7    14.9
SU %                    8.1     13.6
Edit word %             1.8     7.4
Filler word %           1.8     6.8

    72LSA 352 Summer 2007

    MDE Algorithms

Use both text and prosodic features
At each interword boundary:
Extract prosodic features (pause length, durations, pitch contours, energy contours)
Use N-gram language model
Combine via HMM, Maxent, CRF, or other classifier
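A schematic sketch of that combination, using scikit-learn's logistic regression as a stand-in for the maxent combiner; the feature names and the lm_boundary_logprob value (N-gram evidence for a boundary at this position) are illustrative assumptions:

from sklearn.linear_model import LogisticRegression

def boundary_features(pause_len, word_dur, f0_delta, energy_delta, lm_boundary_logprob):
    """One feature vector per interword boundary: prosodic cues plus LM evidence."""
    return [pause_len, word_dur, f0_delta, energy_delta, lm_boundary_logprob]

def train_boundary_classifier(feature_rows, labels):
    """labels[i] = 1 if boundary i carries the metadata event (e.g. an SU boundary)."""
    clf = LogisticRegression(max_iter=1000)      # maxent-style combination of the cues
    clf.fit(feature_rows, labels)
    return clf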


    73LSA 352 Summer 2007

    State of the art: SU detection

2 stage:
Decision tree plus N-gram LM to decide boundary
Second maxent classifier to decide subtype

Current error rates:
Finding boundaries
– 40-60% using ASR
– 26-47% using transcripts

    74LSA 352 Summer 2007

State of the art: Edit word detection

Multi-stage model
– HMM combining LM and decision tree finds the IP
– Heuristic rules find the onset of the reparandum
– Separate repetition detector for repeated words

One-stage model
CRF jointly finds the edit region and IP
BIO tagging (each word has a tag for whether it is the beginning of an edit, inside an edit, or outside an edit)

Error rates:
43-50% using transcripts
80-90% using ASR

    75LSA 352 Summer 2007

    Using only lexical cues

3-way classification for each word:
Edit, filler, fluent

Using TBL
Templates: Change
– Word X from L1 to L2
– Word sequence X Y to L1
– Left side of simple repeat to L1
– Word with POS X from L1 to L2 if followed by word with POS Y

    76LSA 352 Summer 2007

    Rules learned

Label all fluent filled pauses as fillers
Label the left side of a simple repeat as an edit
Label "you know" as fillers
Label fluent well's as filler
Label fluent fragments as edits
Label "I mean" as a filler

    77LSA 352 Summer 2007

Error rates using only lexical cues

CTS, using transcripts:
Edits: 68%
Fillers: 18.1%

Broadcast News, using transcripts:
Edits: 45%
Fillers: 6.5%

Using speech:
Broadcast News filler detection goes from 6.5% error to 57.2%

Other systems (using prosody) better on CTS, not on Broadcast News

    78LSA 352 Summer 2007

    Conclusions: Lexical Cues Only

Can do pretty well with only words
(As long as the words are correct)

Much harder to do fillers and fragments from ASR output, since recognition of these is bad


    79LSA 352 Summer 2007

    Fragments

Incomplete or cut-off words:
Leaving at seven fif- eight thirty
uh, I, I d-, don't feel comfortable
You know the fam-, well, the families
I need to know, uh, how- how do you feel…

Uh yeah, yeah, well, it- it- that's right. And it-

SWBD: around 0.7% of words are fragments (Liu 2003)
ATIS: 60.2% of repairs contain fragments (6% of corpus sentences had at least 1 repair) Bear et al (1992)
Another ATIS corpus: 74% of all reparanda end in word fragments (Nakatani and Hirschberg 1994)

    80LSA 352 Summer 2007

    Fragment glottalization

    Uh yeah, yeah, well, it- it- that’s right. And it-

    81LSA 352 Summer 2007

    Why fragments are important

Frequent enough to be a problem:
Only 1% of words / 3% of sentences
But if we miss a fragment, we are likely to get the surrounding words wrong (word segmentation error).
So it could be 3 or more times worse (3% of words, 9% of sentences): a problem!

Useful for finding other repairs
In 40% of SRI-ATIS sentences containing fragments, the fragment occurred at the right edge of a long repair
74% of ATT-ATIS reparanda ended in fragments

Sometimes they are the only cue to the repair
"leaving at eight thirty"

    82LSA 352 Summer 2007

How fragments are dealt with in current ASR systems

In training, throw out any sentences with fragments
In test, get them wrong
Probably get neighboring words wrong too!!!!!!

    83LSA 352 Summer 2007

    Cues for fragment detection

49/50 cases examined ended in silence > 60 msec; average 282 ms (Bear et al)
24 of 25 vowel-final fragments glottalized (Bear et al)

Glottalization: increased time between glottal pulses

75% don't even finish the vowel in the first syllable (i.e., speaker stopped after first consonant) (O'Shaughnessy)

    84LSA 352 Summer 2007

    Cues for fragment detection

Nakatani and Hirschberg (1994)
Word fragments tend to be content words:

Lexical Class    Tokens   Percent
Content          121      42%
Function         12       4%
Untranscribed    155      54%


    85LSA 352 Summer 2007

    Cues for fragment detection

Nakatani and Hirschberg (1994)
91% are one syllable or less:

Syllables   Tokens   Percent
0           113      39%
1           149      52%
2           25       9%
3           1        0.3%

    86LSA 352 Summer 2007

    Cues for fragment detection

Nakatani and Hirschberg (1994)
Fricative-initial common; not vowel-initial:

Class    % words   % frags   % 1-C frags
Fric     33%       45%       73%
Vowel    25%       13%       0%
Stop     23%       23%       11%

    87LSA 352 Summer 2007

Liu (2003): Acoustic-Prosodic detection of fragments

Prosodic features
Duration (from alignments)
– Of word, pause, last rhyme in word
– Normalized in various ways
F0 (from pitch tracker)
– Modified to compute stylized speaker-specific contours
Energy
– Frame-level, modified in various ways

    88LSA 352 Summer 2007

Liu (2003): Acoustic-Prosodic detection of fragments

Voice quality features
Jitter
– A measure of perturbation in the pitch period; Praat computes this
Spectral tilt
– Overall slope of the spectrum
– Speakers modify this when they stress a word
Open Quotient (OQ)
– Ratio of times in which the vocal folds are open to the total length of the glottal cycle
– Can be estimated from the first and second harmonics
– Creaky voice (laryngealization): vocal folds held together, so short open quotient

    89LSA 352 Summer 2007

    Liu (2003)

Use Switchboard, 80%/20% split
Downsampled to 50% fragments, 50% words
Generated forced alignments with gold transcripts
Extract prosodic and voice quality features
Train a decision tree
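A sketch of that training setup with scikit-learn's decision tree standing in for the tree learner; X is assumed to already contain the prosodic and voice-quality features per word, and y marks fragments, so the feature extraction itself is out of scope here:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score

def train_fragment_detector(X, y):
    """X: per-word prosodic/voice-quality features; y: 1 = fragment, 0 = complete word."""
    # 80%/20% split, mirroring the setup on the slide (class balance handled upstream)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    tree = DecisionTreeClassifier(max_depth=8)
    tree.fit(X_tr, y_tr)
    pred = tree.predict(X_te)
    return tree, precision_score(y_te, pred), recall_score(y_te, pred)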

    90LSA 352 Summer 2007

    Liu (2003) results

    Precision 74.3%, Recall 70.1%

                          hypothesis
                          complete   fragment
reference    complete     109        35
             fragment     43         101


    91LSA 352 Summer 2007

    Liu (2003) features

    Features most queried by DT

Feature                                                        %
Jitter                                                         0.272
Energy slope difference between current and following word    0.241
Ratio between F0 before and after boundary                     0.238
Average OQ                                                     0.147
Position of current turn                                       0.084
Pause duration                                                 0.018

    92LSA 352 Summer 2007

    Liu (2003) conclusion

Very preliminary work
Fragment detection is a good problem that is understudied!

    93LSA 352 Summer 2007

    Fragments in other languages

Mandarin (Chu, Sung, Zhao, Jurafsky 2006)
Fragments cause similar errors as in English:

但我﹣我問的是
"I- I was asking…"

他是﹣卻很﹣活的很好
"He very- lived very well"

    94LSA 352 Summer 2007

    Fragments in Mandarin

Mandarin fragments unlike English; no glottalization.
Instead: mostly (unglottalized) repetitions

    So: best features are lexical, rather than voice quality

    95LSA 352 Summer 2007

    Summary

Disfluencies
Characteristics of disfluencies
Detecting disfluencies
MDE bakeoff
Fragments