Bootstrap Estimates For Confidence Intervals In ASR Performance Evaluation Presented by Patty Liu

Page 1:

Bootstrap Estimates For Confidence Intervals In ASR Performance Evaluation

Presented by Patty Liu

Page 2:

Introduction (1/2)

• The most popular performance measure in automatic speech recognition is the word error rate

• W := Σ_i e_i / Σ_i n_i

• n_i : the number of words in sentence i

• e_i : the edit distance, or Levenshtein distance, between the recognizer output and the reference transcription of sentence i

• The edit distance is the minimum number of insert, substitute and delete operations necessary to transform one sentence into the other.
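The edit distance can be computed by standard dynamic programming; the following is a minimal sketch (the function name and structure are ours, not from the slides):

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    needed to turn hyp into ref (word-level Levenshtein distance)."""
    m, n = len(ref), len(hyp)
    # d[i][j]: distance between ref[:i] and hyp[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                              # delete all hyp words so far
    for j in range(n + 1):
        d[0][j] = j                              # insert all hyp words so far
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[m][n]
```

Summing these distances and the reference word counts over all sentences gives the word error rate W.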

Page 3:

Introduction (2/2)

• The word error rate is an attractive metric, because it is intuitive, it corresponds well with application scenarios and (unlike sentence error rate) it is sensitive to small changes.

• On the downside it is not very amenable to statistical analysis.

- W is really a rate (number of errors per spoken word), and not a probability (chance of misrecognizing a word).

- Moreover, error events do not occur independently.

• Nevertheless hardly any publication reports figures from significance tests. Instead it is common to report only the absolute and relative change in word error rate.

Page 4:

Motivation

• What error rate do we have to expect when changing to a different test set?

• How reliable is an observed improvement of a system?

Page 5:

Bootstrap (1/5)

• The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates.

• The core idea is to create replications of a statistic by random sampling from the data set with replacement (so-called Monte Carlo estimates).

• We assume that the test corpus can be divided into s segments for which the recognition result is independent and the number of errors can thus be evaluated independently.

• For speaker-independent CSR it seems appropriate to choose the set of all utterances of one speaker as a segment.

Page 6:

Bootstrap (2/5)

• For each sentence we record the number of words n_i and the number of errors e_i:

  X = ((n_1, e_1), …, (n_s, e_s))

• The following procedure is repeated B times (typically B = 10^3 … 10^4): For b = 1 … B we randomly select with replacement s pairs from X, to generate a bootstrap sample

  X*_b = ((n*_b1, e*_b1), …, (n*_bs, e*_bs))

• The sample will contain several of the original sentences multiple times, while others are missing.

Page 7:

Bootstrap (3/5)

Page 8:

Bootstrap (4/5)

• Then we calculate the word error rate on this sample:

  W*_b := Σ_i e*_bi / Σ_i n*_bi

• The W*_b are called bootstrap replications of W. They can be thought of as samples of the word error rate from an ensemble of virtual test sets. Their mean is

  W_boot := (1/B) Σ_b W*_b

• The uncertainty of W_boot can be quantified by the standard error, which has the following bootstrap estimate:

  se_boot(W*) := sqrt( (1/(B−1)) Σ_b (W*_b − W_boot)^2 )
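The resampling and standard-error estimate can be sketched as follows, using only the standard library (an illustration, not the authors' code; `segments` holds the per-segment (n_i, e_i) pairs):

```python
import random

def bootstrap_replications(segments, B=1000, seed=0):
    """segments: list of (n_i, e_i) pairs (word count, error count).
    Returns B bootstrap replications W*_b of the word error rate."""
    rng = random.Random(seed)
    s = len(segments)
    reps = []
    for _ in range(B):
        # draw s segments with replacement
        sample = [segments[rng.randrange(s)] for _ in range(s)]
        n_total = sum(n for n, _ in sample)
        e_total = sum(e for _, e in sample)
        reps.append(e_total / n_total)
    return reps

def se_boot(reps):
    """Bootstrap estimate of the standard error of the WER."""
    B = len(reps)
    w_boot = sum(reps) / B                       # mean of the replications
    return (sum((w - w_boot) ** 2 for w in reps) / (B - 1)) ** 0.5
```

The histogram of the returned replications is exactly the "ensemble of virtual test sets" described above.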

Page 9:

Bootstrap (5/5)

• For large s the distribution of W* is approximately Gaussian. In this case the true word error rate lies with 90% probability in the interval

  W_boot ± 1.64 · se_boot(W*)

• Even when s is small, we can use the table of replications W*_b to determine percentiles, which in turn can serve as confidence intervals.

• For a chosen error threshold α, let W↓_boot be the (αB)-th smallest value in the list W*_1 … W*_B, and W↑_boot be the (αB)-th largest.

• The interval C_boot(α) := (W↓_boot, W↑_boot) contains the true value of W with probability 1 − 2α. This is the bootstrap-t confidence interval. For example: with B = 1000 and α = 0.05, we sort the list of W*_b and use the values at positions 50 and 950.
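One convention for the percentile cut can be sketched as follows (off-by-one conventions differ between implementations; this version drops ⌊αB⌋ replications in each tail):

```python
def percentile_interval(reps, alpha=0.05):
    """Bootstrap percentile interval from the list of replications:
    sort the values and cut off a fraction alpha at each tail."""
    ordered = sorted(reps)
    B = len(ordered)
    k = int(alpha * B)              # e.g. B = 1000, alpha = 0.05 -> k = 50
    return ordered[k], ordered[B - 1 - k]
```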

Page 10:

EXAMPLE: NAB (North American Business News) ERROR RATES

(table of NAB error rates, annotated "narrower confidence intervals" and "smaller standard error")

Page 11:

EXAMPLE: NAB ERROR RATES

(figure: histogram of bootstrap replications, marking W_boot and the bounds C↓_boot, C↑_boot that enclose the central 90% with 5% in each tail; annotated "narrower confidence intervals")

Page 12:

Comparing Systems (1/2)

• Competing algorithms are usually tested on the same data.

• Given two recognition systems A and B with word error counts e_i^A and e_i^B, the (absolute) difference in word error rate is

  ΔW := W_A − W_B = Σ_i (e_i^A − e_i^B) / Σ_i n_i

• We can apply the same bootstrap technique to the quantity ΔW as we did to W. The crucial point is that we calculate the difference in the number of errors of the two systems on identical bootstrap samples.

• The important consequence is that ΔW* has much lower variance than W* of either system.

Page 13:

Comparing Systems (2/2)

• In addition to the two-tailed confidence interval C_boot(ΔW), we may be more interested in whether system B is a real improvement over system A:

  poi := (1/B) Σ_b Θ(ΔW*_b)

• Θ(x) is the step function, which is one for x > 0 and zero otherwise. So the poi function is the relative number of bootstrap samples which favor system B. We call this measure the "probability of improvement" (poi).
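The paired resampling and poi estimate can be sketched as follows (illustrative code; the key point is that the same segment indices are drawn for both systems, and Θ becomes a simple comparison):

```python
import random

def probability_of_improvement(errors_a, errors_b, words, B=1000, seed=0):
    """errors_a[i], errors_b[i]: error counts of systems A and B on segment i;
    words[i]: word count of segment i.
    Returns poi = (1/B) * sum_b Theta(dW*_b)."""
    rng = random.Random(seed)
    s = len(words)
    favors_b = 0
    for _ in range(B):
        idx = [rng.randrange(s) for _ in range(s)]   # same draw for A and B
        n = sum(words[i] for i in idx)
        delta_w = sum(errors_a[i] - errors_b[i] for i in idx) / n
        favors_b += 1 if delta_w > 0 else 0          # step function Theta
    return favors_b / B
```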

Page 14:

Example: System Comparison

• The system used for the examples described earlier now plays the role of system B, while a second system with slightly different acoustic models is system A.

• System B is apparently better by 0.3% to 0.4% absolute in terms of word error rate. The probability of improvement, ranging between 82% and 95%, indicates that we can be moderately confident that this reflects a real superiority of system B, but we should not be too surprised if a fourth test set favored system A.

• The notable advantage of this differential analysis is that the standard error of ΔW is approximately one third of the standard error of W.

Page 15:

Example: System Comparison

Page 16:

Example: System Comparison


Page 17:

Conclusion

• We would like to emphasize that what we propose is not a new metric for performance evaluation, but a refined analysis of an established metric (word error rate).

• The proposed method seems attractive, because it is easy to use, it makes no assumption about the distribution of errors, results are directly related to word error rate, and the “probability of improvement” provides an intuitive figure of significance.

Page 18:

Open Vocabulary Speech Recognition with Flat Hybrid Models

Page 19:

Introduction (1/3)

• Large vocabulary speech recognition systems operate with a fixed large but finite vocabulary.

• Systems operating with a fixed vocabulary are bound to encounter so-called out-of-vocabulary (OOV) words. These are problematic for a number of reasons:

- An OOV word will never be recognized (even if the user repeats it), but will be substituted by some in-vocabulary word.

- Neighboring words are also often misrecognized.

- Later processing stages (e.g. translation, understanding, document retrieval) cannot recover from OOV errors.

- OOV words are often content words.

Page 20:

Introduction (2/3)

• The decision rule and knowledge sources used by a large vocabulary speech recognition system:

  w'(x) = argmax_{w ∈ V*} [ p(w) · max_φ p(x|φ) p(φ|w) ]

• acoustic model p(x|φ) :
  relates acoustic features x to phoneme sequences φ, typically an HMM (vocabulary independent)

• pronunciation lexicon p(φ|w) :
  assigns one (or more) phoneme string(s) to each word w ∈ V

• language model p(w) :
  assigns probabilities to sentences w ∈ V* built from a finite set of words

Page 21:

Introduction (3/3)

• For open vocabulary recognition, we propose to conceptually abandon the words in favor of individual letters. Unlike words, the set of different letters G in a writing system is finite.

• Concerning the link to the acoustic realization, the set of phonemes can also be considered finite for a given language. These considerations suggest the following model:

- acoustic model p(x|φ)

- pronunciation model p(φ|g) :
  provides a pronunciation for any string of letters g ∈ G*

- sub-lexical language model p(g) :
  assigns probabilities to character strings g ∈ G*

- decision rule:

  g'(x) = argmax_{g ∈ G*} [ p(g) · max_φ p(x|φ) p(φ|g) ]

Page 22:

Introduction (3/3)

• Alternatively the pronunciation model and sub-lexical language model can be combined into a joint "graphonemic" model p(g, φ), giving the decision rule

  g'(x) = argmax_{g ∈ G*} max_φ [ p(x|φ) · p(g, φ) ]

Page 23:

Grapheme-to-Phoneme Conversion (1/3)

• Obviously this approach to open-vocabulary recognition is strongly connected to grapheme-to-phoneme conversion (G2P), where we seek the most likely pronunciation for a given orthographic form:

• The underlying assumption of this model is that, for each word, its orthographic form and its pronunciation are generated by a common sequence of graphonemic units.

• Each unit is a pair of a letter sequence and a phoneme sequence of possibly different length.

• We refer to such a unit as a "graphone". (Various other names have been suggested: grapheme-phoneme joint multigram, graphoneme, grapheme-to-phoneme correspondence (GPC), chunk.)

  φ*(g) = argmax_φ p(g, φ)   (most likely pronunciation of letter string g)

  q = (g, φ) ∈ Q ⊆ G* × Φ*   (a graphone: paired letter and phoneme sequences)

Page 24:

Grapheme-to-Phoneme Conversion (2/3)

• The joint probability distribution is thus reduced to a probability distribution over graphone sequences which we model using a standard M-gram:

  p(q_1 … q_N) = Π_i p(q_i | q_{i−M+1} … q_{i−1})

• The complexity of this model depends on two parameters: the range M of the M-gram model and the allowed size of the graphones. We allow the number of letters and phonemes in each graphone q to vary between zero and an upper limit L:  0 ≤ |g_q| ≤ L,  0 ≤ |φ_q| ≤ L

Page 25:

Grapheme-to-Phoneme Conversion (3/3)

• We were able to verify that shorter units in combination with longer-range M-gram modeling yield the best results for the grapheme-to-phoneme task.

Page 26:

Models for Open Vocabulary ASR (1/2)

• We combine the lexical entries with the (sub-lexical) graphones derived from grapheme-to-phoneme conversion to form a unified set of recognition units U.

• From the perspective of OOV detection the sub-lexical units Q have been called “fragments” or “fillers”.

• By treating words and fragments uniformly the decision rule becomes

• The sequence model p(u) can be characterized as “hybrid” because it contains mixed M-grams containing both words and fragments.

• It can also be characterized as “flat”, as opposed to structured approaches that predict and model OOV words with different models.

  U := V ∪ Q   (recognition units)

  u*(x) = argmax_u p(x|u) p(u)

Page 27:

Models for Open Vocabulary ASR (2/2)

• A shortcoming of this model is that it leaves undetermined where word boundaries (i.e., blanks) should be placed.

• The heuristic used in this study is to compose any consecutive sub-lexical units into a single word and to treat all lexical units as individual words.

Page 28:

Experiments

• We have three different established vocabularies with 5, 20 and 64 thousand words, each corresponding to the most frequent words in the language model training corpus.

• For each baseline pronunciation dictionary a grapheme-to-phoneme model was trained with different length constraints (2 ≤ L ≤ 6), using EM training with an M-gram length of 3. The recognition vocabulary was then augmented with all graphones inferred by this procedure.

• For quantitative analysis, we evaluated both word error rate (WER) and letter error rate (LER). Letter error rate is more favorable with respect to almost-correct words and corresponds with the correction effort in dictation applications.

Page 29:

Experiments

Page 30:

Experiments--Bootstrap Analysis of OOV Impact

• We are particularly interested in the effect of OOV words on the recognition error rate. This effect could be studied by varying the system’s vocabulary.

• However, changing the recognition system in this way might introduce secondary effects such as increased confusability between vocabulary entries.

• Alternatively we can alter the test set. By extending the bootstrap technique, we create an ensemble of virtual test corpora with a varying number of OOV words and the respective WERs. This distribution allows us to study the correlation between OOV rate and word error rate without changing the recognition system.

Page 31:

Experiments

• This procedure is detailed in the following: For each sentence i = 1 … s we record the number of words n_i, the number of OOV words o_i and the number of recognition errors e_i:

  X = ((n_1, o_1, e_1), …, (n_s, o_s, e_s))

• For b = 1 … B (typically B ≈ 10^3) we randomly select with replacement s tuples from X to generate a bootstrap sample

  X*_b = ((n*_b1, o*_b1, e*_b1), …, (n*_bs, o*_bs, e*_bs))

• The OOV rate and word error rate on this sample are

  OOV*_b := Σ_i o*_bi / Σ_i n*_bi,   WER*_b := Σ_i e*_bi / Σ_i n*_bi

Page 32:

Experiments

• The bootstrap replications OOV*_b and WER*_b can be visualized by a scatter plot. We quantify the observed linear relation between OOV rate and WER by a linear least-squares fit.

• The slope of the fitted line reflects the number of word errors per OOV word. For this reason we call this quantity “OOV impact”.
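The whole procedure, from resampling to the fitted slope, can be sketched end to end (illustrative code; `segments` holds the per-segment (n_i, o_i, e_i) counts):

```python
import random

def oov_impact(segments, B=1000, seed=0):
    """segments: list of (n_i, o_i, e_i) tuples (words, OOV words, errors).
    Returns the least-squares slope of WER*_b against OOV*_b over B
    bootstrap replications, i.e. the estimated word errors per OOV word."""
    rng = random.Random(seed)
    s = len(segments)
    oov_rates, wers = [], []
    for _ in range(B):
        sample = [segments[rng.randrange(s)] for _ in range(s)]
        n = sum(t[0] for t in sample)
        oov_rates.append(sum(t[1] for t in sample) / n)   # OOV*_b
        wers.append(sum(t[2] for t in sample) / n)        # WER*_b
    # least-squares slope = cov(x, y) / var(x)
    mx = sum(oov_rates) / B
    my = sum(wers) / B
    cov = sum((x - mx) * (y - my) for x, y in zip(oov_rates, wers))
    var = sum((x - mx) ** 2 for x in oov_rates)
    return cov / var
```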

Page 33:

Discussion (1/2)

Page 34:

Discussion (2/2)

• Obviously the improvement in error rate depends strongly on the OOV rate.

• It is interesting to compare the OOV impact factor (word errors per OOV word): The baseline systems have values between 1.7 and 2, supporting the common wisdom that each OOV word causes two word errors.

• Concerning the optimal choice of fragment size L, we note that there are two counteracting effects:

- Larger L values increase the size of the graphone inventory, which in turn causes data sparseness problems, leading to worse grapheme-to-phoneme performance.

- Smaller values for L cause the unit inventory to contain many very short words with high probabilities, leading to spurious insertions in the recognition result. The present experiments suggest that the best trade-off is at L = 4.

Page 35:

Conclusion

• We have shown that we can significantly improve a well optimized state-of-the-art recognition system by using a simple flat hybrid sub-lexical model. The improvement was observed on a wide range of out-of-vocabulary rates.

• Even for very low OOV rates, no deterioration occurred.

• We found that using fragments of up to four letters or phonemes yielded optimal recognition results, while using non-trivial chunks is detrimental to grapheme-to-phoneme conversion.

Page 36:

OPEN-VOCABULARY SPOKEN TERM DETECTION USING GRAPHONE-BASED HYBRID RECOGNITION SYSTEMS

Murat Akbacak, Dimitra Vergyri, Andreas Stolcke

Speech Technology and Research Laboratory

SRI International, Menlo Park, CA 94025, USA

Page 37:

Introduction (1/3)

• Recently, NIST defined a new task, spoken term detection (STD), in which the goal is to locate a specified term rapidly and accurately in large heterogeneous audio archives, to be used ultimately as input to more sophisticated audio search systems.

• The evaluation metric has two important characteristics:

(1) Missing a term is penalized more heavily than having a false alarm for that term.

(2) Detection results are averaged over all query terms rather than over their occurrences, i.e., the performance metric considers the contribution of each term equally.

Page 38:

Introduction (2/3)

• Results of the NIST 2006 STD evaluation have shown that systems based on word recognition have an accuracy advantage over systems based on sub-word recognition (although they typically pay a price in run time).

• Yet, word recognition systems are usually based on a fixed vocabulary, resulting in a word-based index that does not allow text-based searching for OOV words.

• To retrieve OOVs, as well as misrecognized IV words, audio search based on sub-word units (such as syllables and phone N-grams) has been employed in many systems.

• During recognition, shorter units are more robust to errors and word variants than longer units, but longer units capture more discriminative information and are less susceptible to false matches during retrieval.

Page 39:

Introduction (3/3)

• In order to move toward solutions that address the problem of misrecognition (both IV and OOV) during audio search, previous studies have employed fusion methods to recover from ASR errors during retrieval.

• Here, we propose a hybrid STD system that uses words and sub-word units together in the recognition vocabulary. The ASR vocabulary is augmented by graphone units.

• We extract from ASR lattices a hybrid index, which is then converted into a regular word index by a post-processing step that joins graphones into words.

• It is important to represent ASR lattices with only words (with an expanded vocabulary) rather than with words and sub-word units since the lattices might serve as input to other information processing algorithms, such as for named entity tagging or information extraction, which assume a word-based representation.

Page 40:

The STD Task ─Data

• The test data consists of audio waveforms, a list of regions to be searched, and a list of query terms.

• For expedience we focus in this study on English and the genre with the highest OOV rate, broadcast news (BN).

Page 41:

The STD Task ─Evaluation Metric

• Basic detection performance will be characterized in the usual way via standard detection error tradeoff (DET) curves of miss probability (P_Miss) versus false alarm probability (P_FA).

• θ : detection threshold

• N_correct(term, θ) : the number of correct (true) detections of term with a score greater than or equal to θ.

• N_true(term) : the true number of occurrences of term in the corpus.

• N_spurious(term, θ) : the number of spurious (incorrect) detections of term with a score greater than or equal to θ.

• N_NT(term) : the number of opportunities for incorrect detection of term in the corpus (= "Non-Target" term trials).

  P_Miss(term, θ) = 1 − N_correct(term, θ) / N_true(term)

  P_FA(term, θ) = N_spurious(term, θ) / N_NT(term)

• The actual term-weighted value (ATWV) averages over the T query terms, with a weight β on false alarms:

  ATWV = 1 − (1/T) Σ_t [ P_Miss(t, θ) + β · P_FA(t, θ) ]
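These quantities combine into a term-weighted score as follows; this is our own hedged sketch (the actual NIST scorer derives non-target trials from the speech duration, which is abstracted here into the per-term `n_nt` counts):

```python
def atwv(n_correct, n_spurious, n_true, n_nt, beta=999.9):
    """Sketch of the actual term-weighted value at one threshold setting.
    Each argument is a dict keyed by term; beta weights false alarms."""
    total = 0.0
    for term in n_true:
        p_miss = 1.0 - n_correct.get(term, 0) / n_true[term]
        p_fa = n_spurious.get(term, 0) / n_nt[term]
        total += p_miss + beta * p_fa
    # average over terms, so every query term contributes equally
    return 1.0 - total / len(n_true)
```

Because the average is over terms rather than occurrences, a rare term missed entirely costs as much as a frequent one.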

Page 42:

STD System Description (1/3)

Page 43:

STD System Description (2/3)

I. Indexing step

• First, audio input is run through the STT system that produces word or word + graphone recognition hypotheses and lattices. These are converted into a candidate term index with times and detection scores (posteriors).

• When hybrid recognition (word+graphone) is employed, graphones in the resulting index are combined into words. To be able to do this, we keep word start/end information with a tag in the graphone representation (e.g., "[[.]", "[.]]", "[.]" indicate a graphone at the beginning, end, or middle of a word, respectively).

Page 44:

STD System Description (3/3)

II. Searching step

• During the retrieval step, first the search terms are extracted from the candidate term list, and then a decision function is applied to accept or reject the candidate based on its detection score.

Page 45:

STD System Description ─Speech-to-Text System

I. Recognition Engine

• The STT system used for this task was a sped-up version of the STT systems used in the NIST evaluation for 2004 Rich Transcription (RT-04).

• The STT system uses SRI's Decipher(TM) speaker-independent continuous speech recognition system, which is based on continuous density, state-clustered hidden Markov models (HMMs), with a vocabulary optimized for the BN genre.

Page 46:

STD System Description ─Speech-to-Text System

II. Graphone-based Hybrid Recognition

• To compensate for OOV words during retrieval, we used an approach that uses sub-word units ─ graphones ─ to model OOV words.

• The underlying assumption used in this model is that, for each word, its orthographic form and its pronunciation are generated by a common sequence of graphonemic units.

• Each graphone is a pair of a letter sequence and a phoneme sequence of possibly different lengths.

Page 47:

STD System Description ─Speech-to-Text System

• In our experiments, we used 50K words (excluding the 10K most frequent ones in our vocabulary) to train the graphone module, with maximum window length, M, set to 4.

• A hybrid word + graphone LM was estimated and used for recognition.

• Following is an example of an OOV word modeled by graphones:

abromowitz: [[abro] [mo] [witz]]

where graphones are represented by their grapheme strings enclosed in brackets, and “[[” and “]]” tags are used to mark word boundary information that is later used to join graphones back into words for indexing.
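Joining graphones back into words from this tagged representation can be sketched as follows (a hypothetical helper based only on the tag convention shown above; plain tokens stand for in-vocabulary words):

```python
def join_graphones(tokens):
    """Join hypothesized units back into words. Graphone tokens carry
    '[[' / ']]' boundary tags, as in 'abromowitz: [[abro] [mo] [witz]]'.
    Ordinary word tokens are passed through unchanged."""
    words, buffer = [], []
    for tok in tokens:
        if tok.startswith("[") and tok.endswith("]"):
            buffer.append(tok.strip("[]"))    # grapheme string of the graphone
            if tok.endswith("]]"):            # ']]' closes the current word
                words.append("".join(buffer))
                buffer = []
        else:
            words.append(tok)                 # in-vocabulary word token
    return words
```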

Page 48:

STD System Description ─N-gram Indexing

• Since the lattice structure provides additional information about where the correct hypothesis could appear, to avoid misses (which have a higher cost in the evaluation score than false alarms) several studies have used the whole hypothesized word lattice to obtain the searchable index.

• We used the lattice-tool in SRILM (version 1.5.1) to extract the list of all word/graphone N-grams (up to N=5 for a word-only (W) STD system, N=8 for a hybrid (W+G) STD system).

• The term posterior for each N-gram is computed as the forward-backward combined score (acoustic, language, and prosodic scores were used) through all the lattice paths that share the N-gram nodes.

Page 49:

STD System Description ─Term Retrieval

• The term retrieval was implemented using the Unix join command, which concatenates the lines of the sorted term list and the index file for the terms common to both. The computational cost of this simple retrieval mechanism depends only on the size of the index.

• Each putative retrieved term is marked with a hard decision (YES/NO). Our decision-making module relies on the posterior probabilities generated by the STT system.

• One of two techniques was employed during the decision-making process.

- The first one determines a global threshold for posterior probability (GL-TH) by maximizing the ATWV score, which for this task was found to be 0.4 and 0.0001 for word-based and hybrid systems respectively.

- An alternative strategy can be formulated that computes a term-specific threshold (TERM-TH), which has a simple analytical solution.

Page 50:

STD System Description ─Term Retrieval

• Based on decision theory, the optimal threshold for each candidate with posterior probability p should satisfy

  p · V_hit − (1 − p) · C_FA = 0

  where V_hit is the value of a correct detection and C_FA is the cost of a false alarm.

• For the ATWV metric we have

  V_hit = 1 / N_true(term),   C_FA = β / (Total − N_true(term))

  Since the number of true occurrences of the term is unknown, we approximate it for the calculation of the optimal threshold by the sum of the posterior probabilities of the term in the corpus.
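Under these definitions the term-specific threshold has a closed form; a hedged sketch (`expected_true` stands in for the posterior-sum approximation of N_true, `total_trials` for the number of trial opportunities, and β defaults to the NIST false-alarm weight):

```python
def term_threshold(expected_true, total_trials, beta=999.9):
    """Posterior threshold above which accepting a candidate detection
    increases the expected ATWV: accept when p * V_hit > (1 - p) * C_FA,
    i.e. when p > C_FA / (V_hit + C_FA)."""
    v_hit = 1.0 / expected_true
    c_fa = beta / (total_trials - expected_true)
    return c_fa / (v_hit + c_fa)
```

Note that the threshold rises with `expected_true`: frequent terms demand more confident detections, because each miss for them is cheaper relative to a false alarm.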

Page 51:

Experiment Results (1/2)

• In the hybrid (W+G) STD system there are approximately 15K graphones added to the recognition vocabulary.

• On average every OOV word results in two errors (itself, and one for a neighboring word because of incorrect context in the language model scoring). (e.g., going from 60K vocabulary to 20K vocabulary leads to a 3.2% increase in WER, and the hybrid system brings this number down to 1.3%).

Page 52:

Experiment Results (2/2)

• An interesting observation is that even for IV terms the hybrid (W+G) STD yields better performance than the word-only (W) STD system.

• This is because hybrid recognition improves both IV-word and OOV-word recognition, resulting in better retrieval performance for IV and OOV words at the same time.