SpeechTEK 2009: Optimizing Speech Recognizer Rejection Thresholds
Optimizing speech recognizer rejection thresholds
Dan Burnett, Director of Speech Technologies, Voxeo
August 24, 2009
Why this talk?
• Sometimes we forget the basics, which are:
• Recognizers are not perfect
• They can be optimized in a straightforward manner
• The simplest optimization is the rejection threshold
The Goal
• End-user goal: an optimal experience
• Our goal: determine the user experience at each possible rejection threshold, then choose the optimum threshold
• This requires comparing the true classification of each audio sample against the ASR engine’s classification
True classifications
• Assume human-level recognition
• The app should still distinguish (i.e., possibly behave differently) among the following cases:

No speech in audio sample (nospeech) -> Mention that you didn’t hear anything and ask for a repeat
Speech, but not intelligible (unintelligible) -> Ask for a repeat
Intelligible speech, but not in the app grammar (out-of-grammar) -> Encourage in-grammar speech
Intelligible speech, and within the app grammar (in-grammar) -> Respond to what the person said
ASR Engine Classifications
• Silence/nospeech (nospeech)
• Reject (rejected)
• Recognize (recognized)
Crossing these two . . . (true classification vs. ASR classification)

True \ ASR      | nospeech                      | rejected            | recognized
nospeech        | Correct classification        | Improperly rejected | Incorrect
unintelligible  | Improperly treated as silence | Correct behavior    | Assume incorrect
out-of-grammar  | Improperly treated as silence | Correct behavior    | Incorrect
in-grammar      | Improperly treated as silence | Improperly rejected | Either correct or incorrect

Errors in the recognized column are misrecognitions; errors in the rejected column are “misrejections”; errors in the nospeech column are “missilences.”
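The crossed classifications can be captured directly in a small lookup table. A minimal sketch in Python (the class and outcome labels follow the slides; the table and function names are illustrative):

```python
# Outcome of crossing the true classification of an utterance with the
# ASR engine's classification, per the table above.
OUTCOMES = {
    ("nospeech",       "nospeech"):   "correct classification",
    ("nospeech",       "rejected"):   "improperly rejected (misrejection)",
    ("nospeech",       "recognized"): "incorrect (misrecognition)",
    ("unintelligible", "nospeech"):   "improperly treated as silence (missilence)",
    ("unintelligible", "rejected"):   "correct behavior",
    ("unintelligible", "recognized"): "assume incorrect (misrecognition)",
    ("out-of-grammar", "nospeech"):   "improperly treated as silence (missilence)",
    ("out-of-grammar", "rejected"):   "correct behavior",
    ("out-of-grammar", "recognized"): "incorrect (misrecognition)",
    ("in-grammar",     "nospeech"):   "improperly treated as silence (missilence)",
    ("in-grammar",     "rejected"):   "improperly rejected (misrejection)",
    ("in-grammar",     "recognized"): "either correct or incorrect",
}

def outcome(true_class: str, asr_class: str) -> str:
    """Look up the behavior for one (true, ASR) classification pair."""
    return OUTCOMES[(true_class, asr_class)]
```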
Three types of errors
• Missilences -- called silence, but wasn’t
• Misrejections -- rejected inappropriately
• Misrecognitions -- recognized inappropriately or incorrectly
So how do we evaluate these?
Evaluating errors
1. Evaluation data set
2. Try every rejection threshold value
3. Plot errors as function of threshold
4. Select optimal value for your app
1. Evaluation data set(s)
• Data selection
  • Must be representative (e.g., “every nth call”)
  • Ideally at least 100 recordings per grammar path for good confidence in the results
• Transcription
  • The goal is to compare against recognition results, so no punctuation, coughs, etc. are needed in the transcription itself (but they are good to have in separate comments)
2. Try every rejection threshold value
• Run the recognizer in batch mode with a rejection threshold of 0 (i.e., no rejection); remember to collect confidence scores!
• Then, for each threshold from 0 to 100:
  • Calculate the number of misrecognitions, misrejections, and missilences
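The sweep described above can be sketched as follows. It assumes per-utterance records from a single batch run with rejection disabled (field and function names are illustrative, not from the talk):

```python
from collections import Counter

def count_errors(utterances, threshold):
    """Count misrecognitions, misrejections, and missilences at one
    rejection threshold, given batch-run results with no rejection.

    Each utterance is a dict with:
      asr_class  -- "nospeech" or "recognized" (engine output at threshold 0)
      confidence -- engine confidence score, 0-100
      true_class -- "nospeech", "unintelligible", "out-of-grammar", "in-grammar"
      correct    -- True if the recognized text matches the transcription
    """
    errors = Counter()
    for u in utterances:
        if u["asr_class"] == "nospeech":
            # Engine called it silence; an error unless it really was silence.
            if u["true_class"] != "nospeech":
                errors["missilence"] += 1
        elif u["confidence"] < threshold:
            # Rejected: correct behavior only for unintelligible / out-of-grammar.
            if u["true_class"] in ("nospeech", "in-grammar"):
                errors["misrejection"] += 1
        else:
            # Recognized: correct only for in-grammar input recognized correctly.
            if u["true_class"] != "in-grammar" or not u["correct"]:
                errors["misrecognition"] += 1
    return errors

def sweep(utterances):
    """Error counts for every rejection threshold from 0 to 100."""
    return {t: count_errors(utterances, t) for t in range(101)}
```

Because the confidence scores were collected once at threshold 0, the whole sweep is a cheap re-scoring pass, not 101 recognition runs.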
3. Plot errors
[Plot: misrecognitions, “misrejections,” and “missilences” as a function of rejection threshold (0 to 100), with the equal error rate marked where the misrecognition and misrejection curves cross.]
[Plot: the sum of the three error counts as a function of rejection threshold (0 to 100), with the minimum total error marked at its lowest point.]
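The two reference points from the plots can also be computed directly from the per-threshold counts. A sketch assuming a `counts` mapping from threshold to the three error counts (function names are illustrative):

```python
def minimum_total_error(counts):
    """Threshold minimizing the sum of all three error types."""
    return min(counts, key=lambda t: sum(counts[t].values()))

def equal_error_rate_threshold(counts):
    """Threshold where misrecognition and misrejection counts are closest;
    this is the equal-error-rate point when the two curves cross exactly."""
    return min(counts,
               key=lambda t: abs(counts[t]["misrecognition"]
                                 - counts[t]["misrejection"]))
```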
4. Select optimal value
• Equal-error-rate: not necessarily the optimum
• Minimum of the sum: good starting point, great for comparing across engines (on same data set only!!)
• Optimal: depends on your app; some errors may be more critical than others
• Question: if missilences are not affected by the threshold, why did I include them?
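When some errors are more critical than others, the app-specific optimum can be found by weighting each error type by its cost before minimizing. A sketch with illustrative weights (not from the talk):

```python
def best_threshold(counts, weights):
    """Threshold minimizing the cost-weighted error total.

    counts:  {threshold: {error_type: count}}
    weights: {error_type: relative cost}, chosen per application --
             an app where false acceptances are costly would weight
             misrecognitions heavily, for example.
    """
    def cost(t):
        # Unlisted error types default to weight 1.0.
        return sum(weights.get(err, 1.0) * n for err, n in counts[t].items())
    return min(counts, key=cost)
```

With all weights equal this reduces to the minimum-of-the-sum starting point from the previous slide.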
Further optimizations
• Move out-of-grammar (OOG) responses into the in-grammar (IG) category if they are semantically correct (“You bet” -> “yes”)
• Consider an additional threshold for confirmation
• Optimize endpointer parameters (affects missilences and/or “too much speech” errors)
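The first optimization above amounts to normalizing semantically equivalent responses onto their in-grammar form before scoring. A minimal sketch with an illustrative synonym table (entries beyond the slide’s “You bet” example are assumptions):

```python
# Illustrative table: out-of-grammar phrasings that carry the same
# meaning as an in-grammar item are rescored as in-grammar.
SEMANTIC_MAP = {
    "you bet": "yes",   # from the slide
    "yeah": "yes",      # assumed additional entry
    "nope": "no",       # assumed additional entry
}

def normalize(transcription: str) -> str:
    """Map a semantically equivalent phrase onto its in-grammar form."""
    key = transcription.strip().lower()
    return SEMANTIC_MAP.get(key, key)
```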