PhD Dissertation
International Doctorate School in Information and Communication Technologies
DIT - University of Trento
Applications of Word Graphs
In Spoken Language Processing
Vu Hai Quan
Advisor:
Dr Marcello Federico
ITC-irst, Centro per la Ricerca Scientifica e Tecnologica
February 2005
Abstract
This work explores the application of word graphs to spoken language processing, in particular to both automatic speech recognition and speech translation. For speech recognition, efficient algorithms for word graph generation, word graph expansion and word graph rescoring have been investigated within the ITC-irst large vocabulary system. Two domains are considered: the Italian Broadcast News Corpus (IBNC) and the Basic Traveling Expressions Corpus (BTEC). The first is a large vocabulary domain, while the second is a spontaneous-speech, limited vocabulary domain. To evaluate the quality of word graphs, various measures have been experimented with. Starting from the generated word graphs, I have further worked on confusion network construction and N−best list generation, which gave positive results compared with the baseline system. For the broadcast news domain, our best word error rate is 17.7%, which compares favorably with the baseline system word error rate of 18.0%. As in speech recognition, a word graph can also be generated as output by a machine translation algorithm. Unlike the speech decoder, the translation decoder cannot proceed synchronously with the input; however, partial theories examined during the search have an exact correspondence with those produced by a speech decoder. In my thesis, I extended word graph processing algorithms to word graphs generated by the ITC-irst statistical machine translation decoder, for instance to generate M−best lists of translations. The availability of the M−best list of translation candidates and the N−best list of hypotheses from speech recognition provides the system with richer additional knowledge sources, which can be exploited in further steps to improve system performance. In particular, the N × M best list has been used in a minimum error training scheme which improved translation quality. Specifically, the BLEU score improved from 39.66 to 41.22.
Keywords
statistical speech recognition, statistical speech translation, statistical ma-
chine translation, word graph, N−best list, confusion network, parameter
tuning.
Acknowledgment
I do not know how to express my love for Italy, the most beautiful country with its age-old culture, where I have lived and worked; for the Italians, the friendliest and kindest people in the world, whom I have met and with whom I have shared my life for three years; and for the University of Trento, specifically the International Doctorate School in Information and Communication Technologies, where I started my scientific research. They will surely stay deep in my heart for all my life.
I would like to thank Marcello Federico, my advisor, who showed me not only what a scientist should do but also how a gentleman should be. He is a true scholar, with deep knowledge of statistical models, especially in information retrieval, speech recognition and machine translation. His lectures and advice guided me to the place where I should begin my research. Without him, I would certainly not have been able to complete this study.
I also would like to thank Fabio Brugnara and Mauro Cettolo, senior researchers at ITC-irst. Fabio has had a great effect on my speech recognition education. Somehow, he knew the answer to all of my questions and has such a clear way of explaining them to me. Mauro Cettolo spent a lot of time reading and correcting this thesis. He also helped me by providing the speech translation results by which my work can be shown to have value.
I am indebted to Nicola, Vanessa, Roldano, Stemmer and my colleagues at ITC-irst for their support on all sorts of things.
I could not have enjoyed graduate student life as much as I did without my Vietnamese friends in Trento, especially Hoc, my roommate.
Finally, I would like to make my parents and my wife a present of my work.
Contents

1 Introduction
2 Speech Translation
  2.1 Speech Recognition
    2.1.1 Feature Extraction
    2.1.2 Acoustic Model
    2.1.3 Language Model
    2.1.4 Search
  2.2 Multiple-Pass Search
    2.2.1 N-Best Algorithms
    2.2.2 Word Graph Algorithm
    2.2.3 Consensus Decoding
  2.3 Statistical MT
    2.3.1 Log-linear Model
    2.3.2 Decoding
    2.3.3 Speech Translation
    2.3.4 Evaluation Criterion
3 Word Graph Generation
  3.1 Word Graph Definitions
    3.1.1 Word Graph Accessors
    3.1.2 Word Graph Properties
    3.1.3 Topological Ordering
  3.2 Word Graph Generation
    3.2.1 Hypothesis
    3.2.2 Best Predecessor
    3.2.3 Language Model State
    3.2.4 Algorithm
    3.2.5 Implementation Details
    3.2.6 Bigram and Trigram-Based Word Graphs
    3.2.7 Dead Path Removal
  3.3 WG Evaluation
    3.3.1 Word Graph Size
    3.3.2 Graph Word Error Rate
  3.4 Removing Empty Edges
    3.4.1 Algorithm
    3.4.2 Implementation Details
  3.5 FW-BW Pruning
    3.5.1 Edge Posterior Probability
    3.5.2 Forward-Backward Based Algorithm
    3.5.3 Implementation Details
    3.5.4 Forward-Backward Based Pruning
  3.6 Node Compression
  3.7 Word Graph Expansion
    3.7.1 Introduction
    3.7.2 Conventional Algorithm
    3.7.3 Compaction Algorithm
4 Word Graph Decoding
  4.1 1-Best WG Decoding
  4.2 N-Best Decoding
    4.2.1 The Stack-Based N-Best Word Graph Decoding
    4.2.2 Exact N-Best Decoding
  4.3 Consensus Decoding
    4.3.1 Word Posterior Probability
    4.3.2 Confusion Network Construction
    4.3.3 Pruning
    4.3.4 Confusion Network
    4.3.5 Consensus Decoding
5 Improvements of Speech Recognition
  5.1 ASR Experiments
  5.2 ASR System
    5.2.1 Segmentation and Clustering
    5.2.2 Acoustic Adaptation
    5.2.3 Speech Transcription
    5.2.4 Training and Testing Data
  5.3 Experimental Results
    5.3.1 Word Graph Decoding
    5.3.2 Impact of Beam-Width
    5.3.3 Language Model Factor Experiments
    5.3.4 Forward-Backward Based Pruning Experiments
    5.3.5 Node-Compression Experiments
    5.3.6 Word Graph Expansion Experiments
  5.4 N-Best Experiments
6 Speech Translation Experiments
  6.1 ITC-irst Machine Translation System
  6.2 N-Best and Word Graph
    6.2.1 N-Best Based Speech Translation
    6.2.2 Word Graph-Based Speech Translation
  6.3 ITC-irst Works
    6.3.1 System Parameter Tuning
    6.3.2 New BTEC Test and Development Sets
    6.3.3 The First Stage Results (for ASR)
    6.3.4 The Second Stage Results (for Pure Text MT)
    6.3.5 The Third Stage Results (for Speech Translation)
7 Conclusions and Future Works
  7.1 Efficient WG Generation
  7.2 Efficient WG Decoding
  7.3 Results on Parameter Tuning
  7.4 Future Works
Bibliography
List of Tables

3.1 A list of hypotheses output by the decoder at time frame t = 6, N_t = 7.
3.2 A list of hypotheses output by the decoder at time frame t = 3, N_t = 4.
4.1 Illustration of the stack-based N−best decoding.
5.1 Training and testing data for BTEC and IBNC.
5.2 BTEC: Word error rates with different rescoring methods.
5.3 IBNC: Word error rates with different rescoring methods.
5.4 BTEC: Costs of the decoder with different threshold values.
5.5 BTEC: Trigram-based graph word error rate.
5.6 BTEC: Bigram-based graph word error rate.
5.7 IBNC: Trigram-based graph word error rate.
5.8 IBNC: Bigram-based graph word error rate.
5.9 BTEC: Trigram-based confusion network word error rate.
5.10 BTEC: Bigram-based confusion network word error rate.
5.11 IBNC: Trigram-based confusion network word error rate.
5.12 IBNC: Bigram-based confusion network word error rate.
5.13 BTEC: Node compression experiments.
5.14 IBNC: Forward-backward pruning and node compression.
5.15 BTEC: Bigram-based word graph expansion experiments.
5.16 IBNC: The N−best experiments.
6.1 Results reported in [Shen, 04] comparing minimum error training with discriminative re-ranking (BLEU%).
6.2 Experiments of the splitting algorithm on BTEC data.
6.3 The new BTEC test and development sets.
6.4 WER of speech recognition on the 3006-test set when the LM weight is 9.25.
6.5 The pure text translation results of the second stage (BLEU score).
6.6 The speech translation results of the third stage (BLEU score).
7.1 Word graph results of [Ortmanns, 1997] on the NAB'94 task.
7.2 Word graph decoding results (WER) of [Mangu, 1999] on the DARPA Hub-4 task.
List of Figures

1.1 The ITC-irst Speech Translation System.
2.1 Source-Channel model of speech generation and recognition.
2.2 Source-Channel model of speech generation and recognition.
2.3 An example of Hidden Markov Model.
2.4 The construction of a word model by concatenating phoneme models.
2.5 The construction of a compound model for recognizing a sequence of numbers; from [De Mori, 1998], Chapter 5.
2.6 The multiple-pass search framework.
3.1 Bigram and trigram constraint word graphs.
3.2 The algorithm for counting the number of paths in a word graph.
3.3 A word graph with @BG edges.
3.4 A word graph with @BG edges removed.
3.5 A word graph with words placed on nodes.
3.6 Link posterior computation.
3.7 Illustration of the conventional word graph expansion algorithm.
3.8 Illustration of the compact word graph expansion where explicit trigram probability only exists for (w1, w4, w5).
4.1 A word graph example for the N−best stack-based decoding.
4.2 A word graph example for the exact N-best decoding.
4.3 The exact N-best decoding - Step 1: the initial FwSco and N-best tree.
4.4 The exact N-best decoding - Step 2: the N-best tree at step 2.
4.5 Time-dependent word posteriors.
4.6 A word graph example.
4.7 The resulting confusion network from the word graph in Fig 4.6.
5.1 Broadcast News Retrieval System.
5.2 BTEC: Time for generating word graphs vs. different threshold values.
5.3 BTEC: GER vs. different threshold values.
5.4 BTEC: Number of paths in word graphs vs. different threshold values.
5.5 BTEC: WER vs. different language model factors.
5.6 BTEC: Confusion network and its word graph representation.
5.7 IBNC: Confusion network size vs. word graph size.
5.8 IBNC: Consensus decoding word error rate vs. the beam-width.
5.9 BTEC: N different best sentences and N−best sentences vs. WER.
5.10 BTEC: N different best sentences and N−best sentences vs. time.
6.1 The architecture of the ITC-irst SMT system.
6.2 Training of the ITC-irst SMT system.
6.3 The estimation of parameters for speech recognition (the first stage).
6.4 The estimation of parameters for machine translation (the second stage).
6.5 The whole system parameter tuning (the first stage).
6.6 WER vs. LM weight on the development set.
Chapter 1
Introduction
From human prehistory to the new media of the future, speech communication has been and will be the dominant mode of human social bonding and information exchange. The spoken word is now extended through technological mediation such as telephony, movies, radio, television, and the Internet. Moreover, the demand for overcoming the barrier between different languages increases day by day. This trend reflects the primacy of spoken communication in human psychology. A spoken language system needs to have speech recognition, speech synthesis and speech translation capabilities. For all three components, significant challenges exist, including robustness, flexibility, ease of integration, and engineering efficiency. The goal of building commercially viable spoken language systems has long attracted the attention of scientists and engineers all over the world. The purpose of this work is to move a small step toward this goal. Concretely, we deal with some specific problems at the boundary between speech recognition and speech translation, aiming to improve the performance of speech translation.
In comparison with written language, speech, and especially spontaneous speech, poses additional difficulties for the task of automatic translation. Typically, these difficulties are caused by errors of the recognition component, which is carried out before the translation process. As a result, the sentence to be translated is not necessarily well-formed from a syntactic point of view. Even without recognition errors, speech translation has to cope with a lack of conventional syntactic structures, because the structures of spontaneous speech differ from those of written language. Recently, the statistical approach to machine translation has shown the potential to tackle these problems, for the following reasons. First, the statistical approach is able to avoid hard decisions at any level of the translation process. Second, for any source sentence, a translated sentence in the target language is guaranteed to be generated. In most cases, this will hopefully be a syntactically correct sentence in the target language; but even when it is not, the translated sentence will usually convey the meaning of the spoken sentence [Och, 2000].
Currently, statistical speech translation systems typically show a cascaded structure: speech recognition followed by machine translation. This structure, while explicit, lacks joint optimality in performance, since the speech recognition module and the translation module run rather independently. Moreover, the translation module of a speech translation system usually takes a single best recognition hypothesis, transcribed as text, and performs standard text-based translation. A lot of supplementary information available from speech recognition, such as N−best lists, word graphs, confusion networks and the likelihoods of the acoustic and language models, is not well utilized in the translation process. This kind of information can be effective for improving translation quality if employed properly [Zhang, 2004]. The main objective of this work is precisely to exploit these sources of information, providing a new interface for speech translation. Specifically, the results of the speech recognition process are represented by means of word graphs, N−best lists and confusion networks, which are subsequently used as the input of the machine translation process.
Figure 1.1: The ITC-irst Speech Translation System.
Fig. 1.1 illustrates the speech translation system currently developed at ITC-irst [Bertoldi, 2004], which can be virtually divided into two parts. On the left-hand side, beginning from the speech signal of the utterance, the automatic speech recognition (ASR) module produces a word graph that contains alternative recognition hypotheses. From the generated word graph, we can extract either the best hypothesis or the N−best list and pass it to the text machine translation module (text MT). Moreover, a confusion network can also be built from the word graph, with the notable property that it is a more compact representation and has a lower word error rate than the word graph itself. A special machine translation algorithm (confusion network MT) has recently been developed at ITC-irst for dealing with this kind of input, and has obtained promising results. Similarly, on the right-hand side, the output of machine translation is again a word graph, a compact representation of the translation hypotheses in the target language. Clearly, if we are just interested in the translation result, the best translation hypothesis can be extracted directly from the word graph. Additionally, the availability of word graph and N−best list outputs from machine translation allows us to adjust the parameters of the translation models or to rescore translation hypotheses with deeper and more extensive knowledge sources.
The thesis is organized as follows. In Chapter 2, we review the state of the art in speech translation in terms of speech recognition and speech translation. Specifically, a short introduction to the speech recognition components, including the acoustic model, language model and beam search, is given first. Then, multiple-pass search and methods for generating word graphs and N−best lists are reviewed. Finally, the last section covers statistical machine translation in terms of translation models, beam search and evaluation criteria.
Chapter 3 is about word graph generation for speech recognition, word graph evaluation, word graph pruning and word graph expansion. The key section of Chapter 3 is word graph generation, in which an efficient algorithm for constructing general m−gram word graphs is fully described. Moreover, three different word graph pruning techniques, namely beam search, forward-backward pruning and node compression, are sequentially presented. Finally, Chapter 3 ends with general m−gram word graph expansion, for which two algorithms, the conventional word graph expansion and the compact word graph expansion, are introduced.
Given the generated word graph, Chapter 4 presents word graph decoding algorithms, consisting of 1−best decoding, N−best decoding and consensus decoding. Note that N−best decoding works directly on word graphs output both by speech recognition and by machine translation.
In Chapter 5, experiments with word graphs for speech recognition on two datasets, namely the Italian Broadcast News Corpus and the Basic Traveling Expression Corpus, are described in detail. Using two datasets shows that our algorithms work for both large vocabulary and spontaneous speech tasks.
Results of using word graphs, N−best lists and confusion networks as inputs for machine translation are given in Chapter 6. Moreover, in Chapter 6 a new method for system parameter tuning, in which the N−best lists are used for optimizing model parameters, is presented. This new method gives significant improvements in translation performance compared to the baseline results.

Finally, we end with Chapter 7 by analyzing our results in conjunction with results from other research groups and highlighting some future work.
Chapter 2
Speech Translation
In this chapter, some of the theoretical foundations of speech recognition and machine translation based on the statistical approach are discussed. First, overviews of speech recognition and machine translation are given. Then, specific algorithms employed in the development of speech recognition and machine translation systems are described.
2.1 Speech Recognition
According to [Jelinek, 1998], in its most basic form, a speech recognizer is a device that automatically transcribes speech into text. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, as in spoken information retrieval, speech translation, etc.
As illustrated in Fig. 2.1, the person's mind decides the source word sequence W, which is delivered through his/her text generator. The source is passed through a noisy communication channel that consists of the speaker's vocal apparatus, producing the speech waveform, and of the speech signal processing component of the speech recognizer. Finally, the speech decoder aims to decode the representation X of the acoustic signal into a word sequence Ŵ which is hopefully close to the original word sequence W.

Figure 2.1: Source-Channel model of speech generation and recognition.
In fact, the recognizer is usually based on some finite vocabulary that
restricts the words that can be output. To discuss the problem of speech
recognition, we need its mathematical formulation.
Let X denote the acoustic evidence (data) on the basis of which the
recognizer will take its decision about which words were spoken. Without
loss of generality we may assume that X is a sequence of symbols taken
from some alphabet X :
X = x_1, x_2, ..., x_T ;   x_i ∈ X (2.1)
The symbol x_i could be thought of as having been generated in time, as indicated by the index i.
Let:
W = w_1, w_2, ..., w_N ;   w_i ∈ W (2.2)

denote a string of N words, each belonging to a fixed and known vocabulary W.
If P(W|X) denotes the probability that the word sequence W was spoken, given that the evidence X was observed, then the recognizer should decide in favor of a word string Ŵ satisfying:

Ŵ = arg max_W P(W|X) (2.3)
That is, the recognizer will pick the most likely word string given the observed acoustic evidence.
The well-known Bayes' formula of probability theory allows us to rewrite the right-hand side probability of Eq. 2.3 as

P(W|X) = P(W) · P(X|W) / P(X) (2.4)
where P (W) is the probability that the word sequence W will be ut-
tered, P (X|W) is the probability that when the speaker says W the acous-
tic evidence X will be observed, and P (X) is the average probability that
X will be observed, that is:
P(X) = Σ_{W′} P(W′) · P(X|W′) (2.5)
Since the maximization in Eq. 2.3 is carried out with the variable X
fixed, it follows from Eq. 2.3 and Eq. 2.4 that the recognizer’s aim is to
find the word sequence:
Ŵ = arg max_W P(W) · P(X|W) (2.6)
An added complication is the fact that the two factors in Eq. 2.6 have very different dynamic ranges. If they are simply multiplied as indicated in Eq. 2.6, the decision for a word sequence would be dominated by the acoustic scores, and the language model would have hardly any influence. To balance the probabilities, it is customary to use an exponential scaling factor for the language model, denoted by γ. Thus Eq. 2.6 can be written as follows:

Ŵ = arg max_W P(W)^γ · P(X|W) (2.7)
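In log-space, the decision rule of Eq. 2.7 becomes arg max_W [log P(X|W) + γ · log P(W)]. The toy sketch below, with hypothesis probabilities invented purely for illustration, shows how the language model factor rebalances the decision between two competing word strings:

```python
import math

def decide(hypotheses, lm_factor):
    """Pick the word string maximizing P(W)^gamma * P(X|W), computed
    in log-space as log P(X|W) + gamma * log P(W) (Eq. 2.7).

    `hypotheses` maps a word string to (acoustic_prob, lm_prob);
    the numbers below are invented for illustration.
    """
    return max(
        hypotheses,
        key=lambda w: math.log(hypotheses[w][0])
        + lm_factor * math.log(hypotheses[w][1]),
    )

hyps = {
    "recognize speech": (1e-40, 1e-3),    # (P(X|W), P(W))
    "wreck a nice beach": (2e-40, 1e-6),
}
# With no scaling, the acoustic score dominates the decision:
print(decide(hyps, lm_factor=0.0))   # wreck a nice beach
# With a typical language model factor, the LM prior wins out:
print(decide(hyps, lm_factor=8.0))   # recognize speech
```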
Fig. 2.2 shows the basic components of a typical speech recognition
system. Acoustic Models include the representation of knowledge about
acoustics, phonetics, environmental variability, gender and dialect differ-
ences among speakers etc. Language Models refer to a system’s knowledge
Figure 2.2: Source-Channel model of speech generation and recognition: the speech signal passes through feature extraction; a search algorithm combines the acoustic model (trained on acoustic data) and the language model (trained on a text corpus) to produce the recognized text.
about constituents of words, word co-occurrences and word sequences. In
the following section we will describe in detail the components of a speech
recognizer.
2.1.1 Feature Extraction
The aim of the feature extraction module is to parameterize the speech
waveform into a sequence of feature vectors which contain the relevant in-
formation about the utterance sounds. For any speech recognition system,
acoustic features should have the following properties:
• good discrimination in order to distinguish between similar speech
sounds;
• allowing the building of statistical models without the need for an
excessive amount of training data;
• having statistical properties which are invariant across speakers and
over a wide range of speaking environments.
Of course, there is no single feature set that possesses all the above
properties. Features used in speech recognition systems are largely derived
from their utility in speech analysis, speech coding and psycho-acoustics.
In statistical automatic speech recognition, the speech waveform is usu-
ally sampled at a rate between 6.6 kHz and 20 kHz and processed to pro-
duce a new representation as a sequence of vectors containing values which
are generally called parameters. These vectors typically comprise between 10 and 20 parameters, and are usually computed every 10 or 20 msec.
Parameter values are used in the estimation of the probability that the
portion of waveform under analysis is a particular acoustic phenomenon.
Currently, implementations of feature extraction typically include:
• Short-Time Spectral Features: Most speech recognition systems use either discrete Fourier transform (DFT) or linear predictive coding (LPC) spectral analysis methods based on fixed-size frames of windowed speech data, and extract spectral features, including LPC-derived features such as reflection coefficients, log area ratios, line spectral frequencies, composite sinusoidal model parameters, autocorrelations, etc. The short-time spectral feature set for each frame is extended to include dynamic information (e.g. the first and second order derivatives) of the features. The most popular representation includes cepstral features along with their first and second time derivatives.
• Frequency-Warped Spectral Features: Sometimes non-uniform fre-
quency scales are used in spectral analysis to provide the so-called
Mel-frequency or bark-scale spectral feature set. The motivation is
to mimic the human auditory system which processes the spectral
information on a non-uniform frequency scale.
The details of this topic can be found in [De Mori, 1998], Chapters 2, 3 and 4, and in [Lee, 1995].
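The short-time analysis described above can be sketched as follows. This is a simplified illustration with arbitrary parameter choices (25 ms frames, 10 ms hop at 16 kHz, a crude equal-width filterbank in place of a true Mel filterbank), not the front end of any particular system:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def log_spectral_features(x, frame_len=400, hop=160, n_bands=13):
    """Rough short-time features: Hamming window, power spectrum,
    log energies in equal-width bands, plus first-order (delta) features."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # crude filterbank: average the power spectrum in n_bands equal bands
    bands = np.array_split(power, n_bands, axis=1)
    logе = np.log(np.stack([b.mean(axis=1) for b in bands], axis=1) + 1e-10)
    delta = np.gradient(logе, axis=0)  # dynamic (first-derivative) features
    return np.concatenate([logе, delta], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)   # one second of fake 16 kHz "speech"
feats = log_spectral_features(x)
print(feats.shape)               # (98, 26): ~100 frames x 26 features
```

A real front end would use a Mel-spaced filterbank and a discrete cosine transform to obtain the cepstral features mentioned above; the sketch only illustrates the framing, windowing and log-energy steps.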
2.1.2 Acoustic Model
As shown in Eq. 2.6, the recognizer has to be able to determine the value
P (X|W) of the probability that when the speaker uttered the word se-
quence W the acoustic processor produced the data X. Thus to compute
P (X|W) we need a statistical model of the speaker’s interaction with the
acoustic process. The usual acoustic model employed in speech recognizers,
the hidden Markov model (HMM), will be briefly discussed in the following.
An HMM is a composition of two stochastic processes, a hidden Markov
chain, which accounts for temporal variability, and an observable process,
which accounts for spectral variability. This combination has proved to be
powerful enough to cope with the most important sources of speech am-
biguity, and flexible enough to allow the realization of recognition systems
with dictionaries of tens of thousands of words.
Figure 2.3: An example of Hidden Markov Model.
Fig. 2.3 illustrates an example of an HMM with 4 states. Formally, we can summarize the definition of an HMM as follows. Let x ∈ X be a variable representing observations and s_i, s_j ∈ S be variables representing model states; the model can be represented by the following parameters:
A ≡ {a_ij | s_i, s_j ∈ S}   transition probabilities (2.8)
B ≡ {b_ij(·) | s_i, s_j ∈ S}   output probabilities (2.9)
π ≡ {π_i}   initial probabilities (2.10)

where:

a_ij ≡ p(s_t = s_j | s_{t−1} = s_i) (2.11)
b_ij(x) ≡ p(x_t = x | s_{t−1} = s_i, s_t = s_j) (2.12)
π_i ≡ p(s_0 = s_i) (2.13)
The details of HMM and its use in speech recognition can be found in
[De Mori, 1998], Chapter 5.
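The probability p(X | model) assigned by such an HMM can be computed with the classical forward algorithm. The sketch below uses discrete observations and places the output distributions on transitions, matching the b_ij(·) notation of Eqs. 2.8-2.13; the model and all probabilities are invented toy values:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm for an HMM with output probabilities on
    transitions: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_ij(x_t).

    pi  : (S,) initial state probabilities pi_i
    A   : (S, S) transition probabilities a_ij
    B   : (S, S, K) output probabilities b_ij(x) for K discrete symbols
    obs : sequence of symbol indices x_1 ... x_T
    Returns p(obs | model) = sum_j alpha_T(j).
    """
    alpha = pi.copy()
    for x in obs:
        # one step of the recursion: sum over predecessor states i
        alpha = alpha @ (A * B[:, :, x])
    return alpha.sum()

# A toy 2-state, 2-symbol left-to-right model (numbers invented).
pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4],
              [0.0, 1.0]])
B = np.array([[[0.9, 0.1], [0.2, 0.8]],   # b_0j(x)
              [[0.5, 0.5], [0.3, 0.7]]])  # b_1j(x)
p = forward(pi, A, B, [0, 1])
print(p)  # probability of observing the symbol sequence (0, 1)
```

In practice the recursion is carried out in log-space (or with per-frame scaling) to avoid numerical underflow over long observation sequences.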
For a large-vocabulary system, there is typically a set of basic recognition units that are smaller than whole words, named subword units. Examples of such subword units are phonemes, demisyllables, and syllables. The word models are then obtained by concatenating the subword models according to the phonetic transcription of the words in a pronunciation lexicon or dictionary. In most systems, these subword units are modeled by HMMs. Fig. 2.4 illustrates the construction of a word model by linking phoneme models.
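The linking of phoneme models can be sketched as a block-diagonal assembly of the phonemes' transition matrices, with an exit-to-entry link between consecutive phonemes. This is a simplified sketch with invented probabilities; real systems also handle non-emitting entry/exit states and pronunciation variants:

```python
import numpy as np

def concat_phone_models(phone_mats, exit_probs):
    """Build a word-level transition matrix by chaining phoneme HMMs.

    phone_mats : list of (n_k, n_k) left-to-right transition matrices
    exit_probs : probability of leaving the last state of each phoneme,
                 routed to the first state of the next one
    Simplified sketch with invented numbers.
    """
    n = sum(m.shape[0] for m in phone_mats)
    W = np.zeros((n, n))
    offset = 0
    for k, m in enumerate(phone_mats):
        s = m.shape[0]
        W[offset : offset + s, offset : offset + s] = m
        if k + 1 < len(phone_mats):
            # link last state of phoneme k to first state of phoneme k+1
            W[offset + s - 1, offset + s] = exit_probs[k]
        offset += s
    return W

# "one" = /w/ /ah/ /n/, each phoneme a 2-state left-to-right model;
# the 0.2 probability mass missing from the last row is the exit prob.
loop = np.array([[0.7, 0.3],
                 [0.0, 0.8]])
word = concat_phone_models([loop] * 3, [0.2, 0.2, 0.2])
print(word.shape)  # (6, 6) word-level transition matrix
```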
Usually the choice of HMM topologies, as well as the type of probability
distribution, is decided by the developer. The values of the distribution
parameters, as well as transition probabilities, are estimated by a training
algorithm, which processes a set of labeled examples, the training set, for
computing an optimal set of values for HMM parameters. Optimality is
defined by means of an objective function depending both on the HMM
parameters and on the observations contained in the training set. Once the
objective function has been chosen, model training becomes a constrained
maximization problem. Training algorithms can differ in the optimality
criterion and/or in the method of performing the optimization. In general,
parameters of statistical models are estimated by iterative learning algo-
rithms (e.g. the EM algorithm) in which the likelihood of a set of training
Figure 2.4: The construction of a word model by concatenating phoneme models.
data is guaranteed to increase at each step. Details of these algorithms are
given in [De Mori, 1998], Chapter 6.
2.1.3 Language Model
Eq. 2.6 further requires that we compute, for every word sequence
W, the probability P(W) that the speaker wishes to utter W. P(W)
is interpreted as the language model. The language model functionally
captures the syntax, semantics, and pragmatics of a language and provides the
prior probability P(W) for a word sequence W. The probability P(W)
can be expressed by:
P(W) = ∏_{i=1}^{N} P(wi|w1, ..., wi−1)   (2.14)
     = ∏_{i=1}^{N} P(wi|hi)   (2.15)
     ≈ ∏_{i=1}^{N} P(wi|wi−n+1, ..., wi−1)   (2.16)
where hi = w1, ..., wi−1 is the history or context of word wi. The probabilities
P(wi|hi) may be difficult to estimate as the sequence of words hi grows.
In order to estimate these probabilities, it is usually assumed that a word
sequence follows an (n−1)-th order Markov process, as in Eq. 2.16. The
corresponding language models are called n-gram language models. Today,
bigram (n = 2) and trigram (n = 3) language models are used in most
ASR systems. In the following we will briefly introduce trigram language
models and methods for estimating their probabilities.
From Eq. 2.16 and setting n = 3, we have

P(W) = ∏_{i=1}^{N} P(wi|wi−2, wi−1)   (2.17)
The basic trigram probabilities can be estimated by:

P(w3|w1, w2) ≈ f(w3|w1, w2) = C(w1, w2, w3) / C(w1, w2)   (2.18)
where f(·|·) denotes the relative frequency function and C(hi) denotes the
number of times the event hi appeared in the training corpus. Unfor-
tunately, even in large real training texts, most of the possible trigrams do
not occur. Hence, for each W including any such unobserved event, the
model would assign P(W) = 0. In this case, the recognizer would be forced
to commit a large number of errors. It is therefore necessary to smooth
the trigram frequencies. There are two main methods for smoothing the
trigram probabilities.
The first one, namely linear smoothing, is done by interpolating
trigram, bigram and unigram relative frequencies as in Eq. 2.19, where the
non-negative weights satisfy the constraint λ1 + λ2 + λ3 = 1. Different
ways of choosing the weights λi lead to different interpolation schemes.

P(w3|w1, w2) = λ3 f(w3|w1, w2) + λ2 f(w3|w2) + λ1 f(w3)   (2.19)
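As a sketch of Eq. 2.19, the following code collects n-gram counts from a toy corpus and interpolates the relative frequencies with fixed illustrative weights; a real system would estimate the λi, e.g. on held-out data.

```python
from collections import Counter

corpus = "a b a b c a b a c a".split()
c1, c2, c3 = Counter(), Counter(), Counter()
for i in range(len(corpus)):
    c1[tuple(corpus[i:i+1])] += 1                 # unigram counts
    if i >= 1:
        c2[tuple(corpus[i-1:i+1])] += 1           # bigram counts
    if i >= 2:
        c3[tuple(corpus[i-2:i+1])] += 1           # trigram counts

def f(counts, hist_counts, ngram):
    """Relative frequency f(w|h) = C(h, w) / C(h); zero if h is unseen."""
    hist = ngram[:-1]
    return counts[ngram] / hist_counts[hist] if hist_counts[hist] else 0.0

def p_interp(w1, w2, w3, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas  # l1 + l2 + l3 = 1
    return (l3 * f(c3, c2, (w1, w2, w3))
            + l2 * f(c2, c1, (w2, w3))
            + l1 * c1[(w3,)] / len(corpus))
```

Even for an unseen trigram such as ("c", "c", "a") the interpolated probability is non-zero, because the bigram and unigram terms still contribute.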
The second method, namely backing-off, is quite prevalent in state-of-
the-art speech recognizers. It is defined through the formula:
P(w3|w1, w2) =
    f(w3|w1, w2)          if C(w1, w2, w3) ≥ K
    α QT(w3|w1, w2)       if 1 ≤ C(w1, w2, w3) < K
    β(w1, w2) P(w3|w2)    otherwise
(2.20)

where α, β are chosen so that the probability P(w3|w1, w2) is properly
normalized. Furthermore, QT(w3|w1, w2) is a Good-Turing type function
and P(w3|w2) is a bigram probability estimate having the same form as
P(w3|w1, w2):

P(w3|w2) =
    f(w3|w2)              if C(w2, w3) ≥ L
    α QT(w3|w2)           if 1 ≤ C(w2, w3) < L
    β(w2) f(w3)           otherwise
(2.21)
Eq. 2.20 then constitutes a recursion. The thresholds K and L are chosen
empirically. The argument for backing-off is that if there is enough evidence
for it, then the relative frequency is a very good estimate of the probability.
If not, then one should back off and rely on bigrams; if there is not enough
evidence even for the bigram, unigrams are needed. For details on language
models, see [De Mori, 1998], Chapter 7.
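The normalization role of β can be illustrated with a simplified sketch that replaces the Good-Turing function QT with a constant absolute discount and uses a single backing-off level (bigram to unigram, no edge-case handling); corpus, discount value and variable names are illustrative only.

```python
from collections import Counter

corpus = "a b a b c a b a c a".split()
V = sorted(set(corpus))
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
D = 0.5  # constant discount, 0 < D < 1, standing in for Good-Turing

def p_uni(w):
    return uni[w] / len(corpus)

def p_backoff(w3, w2):
    """P(w3|w2): discounted bigram if seen, else scaled-back unigram."""
    c_hist = sum(c for (u, v), c in bi.items() if u == w2)
    if bi[(w2, w3)] > 0:
        return (bi[(w2, w3)] - D) / c_hist
    # beta(w2): mass freed by discounting, renormalized over the
    # unigram probabilities of the unseen successors of w2
    # (no guard for the case where every successor was observed)
    n_seen = sum(1 for (u, v) in bi if u == w2)
    freed = D * n_seen / c_hist
    unseen_mass = sum(p_uni(v) for v in V if bi[(w2, v)] == 0)
    return freed * p_uni(w3) / unseen_mass
```

Because β redistributes exactly the discounted mass, the conditional distribution P(·|w2) sums to one over the vocabulary, which is the property the α and β factors in Eqs. 2.20-2.21 are chosen to guarantee.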
2.1.4 Search
The decision on which words have been spoken must be made by means of
an optimization procedure that combines information from several sources:
the language model, the acoustic-phonetic models of phonemes, and the
pronunciation lexicon.
Figure 2.5: The construction of a compound model for recognizing a sequence of numbers;
from [De Mori, 1998], Chapter 5.
For hypothesizing a word sequence w1, ..., wN, a compound HMM is
searched which includes the three knowledge sources mentioned. Fig 2.5
shows an example of such a compound model for recognizing digits. As we
can see, the construction of a compound model for recognition includes
three steps. First, the language is represented by a network with word-
labeled arcs. The connections between words are made by means of empty
transitions, which could be assigned a probability. Given the network
representing the language, each word-labeled arc is replaced by a sequence,
or possibly a network, of phoneme-labeled arcs, according to a set of lexical
rules. Finally, each phoneme-labeled arc is replaced by an instance of the
corresponding HMM, obtaining the final compound model as illustrated in
Fig 2.5.
With this kind of model, acoustic, lexical and linguistic knowledge
can be naturally represented in a graph structure, which
can become huge when dealing with large vocabularies. The search for
the most probable word sequence translates into the search of an optimal
path over a derived structure, called “trellis”, which corresponds to the
unfolding of this graph along the time axis.
There are two main search strategies used in speech recognition. The
first one is named Viterbi search and the second one is called stack decoding
(A* search). The first one is normally carried out in a time-synchronous
fashion by the so-called Viterbi algorithm, which relies on the principles of
dynamic programming. To avoid exhaustive exploration of a possibly huge
search space, the beam-search technique is used, which consists in pruning
the less promising hypotheses based on a local estimation.
Stack decoding represents the best attempt to use A* search instead
of time-synchronous search for continuous speech recognition. It is a
tree search algorithm, which takes a slightly different viewpoint than the
time-synchronous Viterbi search. Time-synchronous search is basically a
breadth-first search, so it is crucial to control the number of all possible
model states. On the other hand, stack decoding treats the search as
finding a path in a tree whose branches correspond to words in
the vocabulary V. The search tree has a constant branching factor of |V|,
if we allow every word to be followed by every word. In the following
subsection, we present the one-pass Viterbi-based search which is currently
used in most speech recognition systems. The details of the Viterbi-based
search algorithm and of stack decoding can be found in [Jelinek, 1998] and
[De Mori, 1998], Chapters 8 and 9.
One-Pass Viterbi Search
Let X = x1, ..., xT be the sequence of acoustic vectors and S = s1, ..., sT
be the sequence of states through the compound HMM. We can define the
joint probability as follows:
P(X, S|W) = ∏_{t=1}^{T} p(xt, st|st−1, W)   (2.22)
          = ∏_{t=1}^{T} p(st|st−1, W) p(xt|st)   (2.23)
where p(xt, st|st−1,W) denotes the transition and emission probabilities for
the compound HMM of W.
Denoting the language model probability by P (W), the Bayes decision
rule as in Eq. 2.6 results in the following optimization problem:
W = arg maxW { P(W) · Σ_{s1,...,sT} P(x1, ..., xT, s1, ..., sT | W) }   (2.24)
  ≈ arg maxW { P(W) · max_{s1,...,sT} P(x1, ..., xT, s1, ..., sT | W) }   (2.25)
Here we have made use of the so-called maximum approximation, which
is also referred to as Viterbi approximation. Instead of summing over all
paths, we consider only the most probable path. With the maximum
approximation, the search space can be described as a huge network through
which the best time alignment path has to be found. The search has to be
performed at two levels: at the state level (s1, ..., sT) and at the word level (W).
Viterbi Beam Search
A survey of the Viterbi search is given here. The details can be found
in [Ortmanns, 1997] and [De Mori, 1998], Chapters 9 and 10.
To explain the time-synchronous Viterbi search in a formal way, we
define some quantities:
Q(t, s; w) = score of the best path up to time t that ends in state s of
word w and
B(t, s; w) = start time of the best path up to time t that ends in state
s of word w
There are two types of dynamic programming transition rules, namely
intra-word and inter-word transition. In the word interior, we have the
recurrent equation:
Q(t, s; w) = maxs′ { p(xt, s|s′; w) · Q(t−1, s′; w) }   (2.26)
B(t, s; w) = B(t−1, smax(t, s; w); w)   (2.27)
where smax(t, s; w) is the optimum predecessor state for the hypothesis
(t, s; w). When encountering a potential word boundary, we must perform
the recombination over the predecessor words. For doing this, let us define:
H(w; t) = maxv{p(w|v).Q(t, Sv; v)} (2.28)
where p(w|v) is the conditional language model probability of word bigram
(v, w). The symbol Sv denotes the terminal state of word v. The algorithm
can be summarized as follows, [Ortmanns, 1997]. It is also important
to notice that the above one-pass Viterbi search uses a linear lexicon with a
bigram language model. A very similar search algorithm which uses a tree
lexicon was also presented in [Ortmanns, 1997]. With this approach, the
pronunciation lexicon is organized in the form of a prefix tree, in which each
arc represents a phoneme model. The lexical tree organization provides a
more compact space for the search algorithm.
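A minimal sketch of such a tree-organized lexicon follows: words sharing a phoneme prefix share the corresponding arcs, so the shared models are evaluated only once. The words and ARPAbet-style pronunciations are illustrative, not taken from the ITC-irst lexicon.

```python
LEXICON = {
    "two": ["t", "uw"],
    "ten": ["t", "eh", "n"],
    "tea": ["t", "iy"],
}

def build_prefix_tree(lexicon):
    root = {"children": {}, "word": None}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node["children"].setdefault(
                ph, {"children": {}, "word": None})
        node["word"] = word  # the word identity is only known at the leaf
    return root

def count_arcs(node):
    """Number of phoneme arcs in the tree."""
    return sum(1 + count_arcs(child) for child in node["children"].values())
```

These three words need 7 phoneme arcs in a linear lexicon but only 5 in the tree, since the initial "t" arc is shared; the fact that the word identity becomes known only at a leaf is what complicates word-level bookkeeping in tree search.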
Beam Search
Since, for a fixed time frame, all (word, state)-hypotheses cover the same
portion of the input, their scores can be directly compared. This enables
the system to avoid an exhaustive search, and to perform a data-driven
search instead, i.e., to focus the search on those hypotheses that are most
likely to result in the best state sequence. In detail, at every 10-ms frame,
the score of the best hypothesis is determined; then all hypotheses whose
scores fall below this optimum by more than a fixed factor are pruned, i.e.
One-Pass Viterbi Search
1  for t = 1 to T
2  do
   Acoustic level: process (word, state)-hypotheses
3    Initialization: Q(t−1, s = 0; w) = H(w; t−1)
                     B(t−1, s = 0; w) = t−1
4    Time alignment: compute Q(t, s; w) using Eq. 2.26
5    Propagate back-pointers B(t, s; w) using Eq. 2.27
6    Prune unlikely hypotheses
7    Purge bookkeeping lists
   Word pair level: process word-end hypotheses
8    for each word w
9    do
10     H(w; t) = maxv { p(w|v) · Q(t, Sv; v) }
11     ν0(w; t) = arg maxv { p(w|v) · Q(t, Sv; v) }
12     Store the best predecessor ν0 = ν0(w; t)
13     Store the best boundary τ0 = B(t, Sν0; ν0)
they are removed from further consideration. This beam search strategy
will be considered in full detail in Chapter 5, in the experiments on word
graph generation.
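The pruning step itself can be sketched in a few lines, assuming the active hypotheses are held in a map from (word, state) pairs to log scores; the names and values below are illustrative.

```python
def beam_prune(hyps, beam):
    """Keep only hypotheses within `beam` log units of the frame's best.

    hyps: dict mapping (word, state) -> log score at the current frame.
    """
    best = max(hyps.values())
    return {h: s for h, s in hyps.items() if s >= best - beam}
```

For example, with hypotheses scored -10.0, -11.5 and -25.0 and a beam of 5.0, the third is dropped; a wider beam trades more computation for a lower risk of pruning away the eventually best path.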
2.2 Multiple-Pass Search
A speech recognition system should take into account all available knowl-
edge sources when recognizing an utterance. Besides the speech signal
and the models of the recognition units, knowledge about syntax, se-
mantics, and other properties of the natural language might also be used when
searching for the most likely word sequence. One way to include these
knowledge sources in the search process is to use them simultaneously to
Figure 2.6: The multiple-pass search framework.
constrain a single search. Since many of the natural language knowledge
sources contain "long-distance" effects, the search can become quite com-
plex. Furthermore, the common left-to-right search strategy requires that
all knowledge sources be formulated in a predictive, left-to-right man-
ner, which restricts the type of knowledge that can be used.
One way to solve these problems is to apply the knowledge sources not
simultaneously but sequentially, so that the search for the most likely hypothesis is
constrained progressively. Thus the advantages provided by a knowledge
source can be traded off against the costs of applying it. First, the most
powerful and cheapest knowledge sources are applied to generate a list of
the top N hypotheses or word graphs (word lattices). Then, these hypothe-
ses are evaluated by means of the other, more expensive knowledge sources
so that the list of hypotheses can be reordered according to a more refined
likelihood score. The two-pass search paradigm is illustrated in Fig 2.6.
In this section we will review N -best and word graph algorithms and their
applications in speech recognition.
Besides the two-pass search paradigm, there are also other uses for these
algorithms.
• The N-best list and word graph generated during the recognition stage can
be used to investigate new knowledge sources. Since it is not necessary
to rerun the whole recognition process, experimental evaluation of
the additional information provided by a new knowledge source can
be done much more easily.
• Methods for discriminative training of HMMs usually require a list of
errors and near-misses so that the correct answer can be made more
likely and the errors and near-misses can be made less likely. Such a list
can be provided either by N-best list or by word graph algorithms.
• In a speech recognition system, some parameters, like the weights of the
different knowledge sources, cannot be easily estimated. For the fine-
tuning of these parameters, repeated recognition tests are normally
required. Using the N-best lists or word graphs generated during
a single run of the recognizer, the parameter optimization can then
be done much more easily. This topic will be explored in full detail in
the last chapter, when we apply N-best lists and word graphs for
parameter tuning in both speech recognition and machine translation.
2.2.1 N-Best Algorithms
Different algorithms for finding the N -best sentence hypotheses have been
proposed in [Schwartz, 1991]. Some of these algorithms are exact while
others use different approximations to reduce computational requirements.
Basically, the Viterbi algorithm typically used in an HMM-based speech
recognizer only finds the best word sequence (corresponding to the state
sequence with the highest likelihood score). To obtain not only the first
best hypothesis but also the list of the N best hypotheses, several modifications
of the Viterbi algorithm are necessary. Different algorithms that are able
to find the N-best list of hypotheses are presented in the following.
The Exact N-Best Algorithm
The exact N -best algorithm was proposed in [Schwartz, 1990]. This algo-
rithm is similar to the time-synchronous Viterbi algorithm, but instead of
likelihood scores for state sequences, likelihood scores for word sequences
are computed. To be able to find the N -best hypotheses, it is necessary
to keep separate records for theories (paths) with different word sequence
histories. When two or more paths come to the same state at the same
time and also have the same history (word sequence), their probabilities
are added. When all paths for a state have been calculated, only a spec-
ified number of these local theories are maintained. Their probabilities
have to be within a threshold of the probability of the most likely theory
at that state. Therefore, any word sequence hypothesis that reaches the
end of the utterance has an accumulated score. This score is the condi-
tional probability of the observed speech signal given the word sequence
hypothesis. Thus, the list of N-best hypotheses is generated. To reduce
the exponentially growing number of possible word sequences, pruning is
used. It can be shown that this algorithm will find all hypotheses that are
within a search beam specified by the pruning threshold. To reduce the
computational requirements connected with the exact N -best algorithm,
it is possible to combine the N -best algorithm with the forward-backward
search algorithm which will be described in detail in the next chapter. Ba-
sically, the forward-backward search algorithm takes place in two stages.
In the first stage, a fast time-synchronous search of the utterance in for-
ward direction is performed. In the second stage, a more expensive search
is performed, processing the utterance in reverse direction and using infor-
mation gathered by the forward search. The information from the forward
search is used to avoid expanding the backward tree toward non-promising
hypotheses thus saving computational costs.
The Tree-Trellis Algorithm
The tree-trellis algorithm for finding the N -best hypotheses was proposed
in [Soong, 1991]. This algorithm combines a frame-synchronous forward
trellis search with a frame-asynchronous backward tree search. In the for-
ward trellis search, a modified Viterbi algorithm is used. In a normal
Viterbi algorithm, only the back-pointer arrays necessary to trace-back the
best hypothesis would be stored. The modified algorithm used here also
stores rank-ordered predecessor lists for each grammar node and time frame.
For a given grammar node and time frame, such a list has an entry for
each predecessor of that grammar node. This entry contains the likelihood
score of the best partial path coming via that predecessor to the grammar
node. Before being stored, the entries in a predecessor list are rank-ordered
according to their likelihood scores. When the modified Viterbi search has
reached the end of the utterance, the best hypothesis can easily be obtained
by tracing-back. In the backward search, an A* tree search algorithm is
used to find the N -best hypotheses. This tree search starts from the end of
the utterance at the final grammar node. In each step, the backward
partial path is extended toward the beginning of the utterance by a time-reversed
Viterbi search for the single best word extension. The best single word
extension is found using the rank-ordered predecessor lists generated during the
forward search. When the backward partial path reaches the beginning of
the utterance, the best hypothesis is found (it is identical to the hypothesis
already found in the forward Viterbi search). By continuing the A* search,
the N -best hypotheses can be found sequentially. A good summary of the
theory behind the tree-trellis algorithm can be found in [Soong, 1991]. In
[Soong, 1991], a modified version of the tree-trellis algorithm is presented
where a simple grammar is used in the forward Viterbi search and a more
complex grammar is used in the backward tree search. This concept is
similar to the forward-backward algorithm mentioned in the previous section
and can result in reduced computational requirements.
Lattice N-Best Algorithms
The exact N-best algorithm and the tree-trellis algorithm both require a sig-
nificant computational overhead with respect to a normal Viterbi
search. For this reason, faster N-best algorithms adopting some approx-
imations have been suggested. These algorithms do not guarantee that the
exact list of N-best hypotheses will be found. It can either happen that the
likelihood score of an entry is underestimated or that an entry is missing
entirely. But the approximations might still be sufficient for many applica-
tions. Two N-best algorithms using different approximations are described
now.
The lattice N-best algorithm was proposed in [Schwartz, 1991]. It is
based on a standard time-synchronous forward Viterbi search but differs
in the back-pointer information stored during the search. At each
grammar node for each time frame, not only the best scoring word but all
words that arrive at that node are stored in a trace-back list, together with
their scores and the time when the word started. Instead of storing all
the arriving words, it is also possible to store only the best N local words
(or word theories). As in Viterbi search, only the score of the best word
is passed on as a base for further scoring together with a pointer to the
stored trace-back list. At the end of the utterance, a simple tree search
is used to step through the stored trace-back lists and obtain the N -best
complete sentence hypotheses sequentially. This tree search requires nearly
no computation and can be performed very fast. A serious disadvantage
of the lattice algorithm is that it underestimates or completely misses high
scoring hypotheses due to the fact that all (except the best) hypotheses
are derived from segmentations found for other higher scoring hypotheses.
This is caused by the segmentation assumption inherent in the lattice algorithm.
This problem can be mostly overcome by the word-dependent algorithm.
The Word-Dependent Algorithm
Like the lattice N -best algorithm, the word-dependent algorithm was also
proposed in [Schwartz, 1991]. It is a compromise between the exact N -best
algorithm and the lattice algorithm. Here it is assumed that the starting
time of a word depends on the previous word but not on any
word before that. Therefore, theories are distinguished if they differ in the
previous word. With this algorithm, within a word, the likelihood scores
for each of the different local theories are preserved. At the end of each
word, the likelihood score for each previous word is recorded along with
the name of the previous word. Then a single theory with the name of the
word that just ended is used to proceed. At the end of the utterance, a tree
search similar to the one used in the lattice algorithm, is used to obtain
the list of the N most likely hypotheses. To reduce the computational
requirements, the number Nlocal of theories kept locally should be limited.
Typically, values for Nlocal range from 3 to 6.
Word Graph Based N-Best Search
The last, and quite different, approach for finding the N-best hypotheses was
proposed in [Tran, 1996] and is based on the word graph. The details of
word graphs will be given in the next section. Here we will briefly review
the proposed algorithm.
The principle of the approach is based on the following considerations:
When several paths lead to the same node in the word graph, according to
the Viterbi criterion, only the best scored path is expanded. The remaining
paths are not considered for further expansions. Assuming that the first
best sentence hypothesis was found by the Viterbi decoding through a given
word graph, the second best path is the path which competed with the best
one but was recombined at some node of the best path. Thus in order to
find the second best sentence hypothesis, we have to consider all possible
partial paths in the word graph which reach some node of the best path
and might share the remaining section with the best path. By applying
this procedure repeatedly, N -best sentence hypotheses can be successively
extracted from a given word graph.
In more detail, the best path can be determined simply by comparing
the cumulative scores of all possible paths leading to the final node of
the word graph. In order to ensure that this best word sequence is not
taken into account while searching for the second best path, the complete
path is copied into a so-called N -best tree. Using this structure, a back-
ward cumulative score for each word copy is computed and stored at the
corresponding tree node. This allows for fast and efficient computation
of the complete path scores required to determine the next best sentence
hypothesis. The second best sentence hypothesis can be found by taking
the path with the best score among the candidate paths which might share
a remaining section of the best path. The partial path of this sentence hy-
pothesis is then copied into the N -best tree. Assuming the N -best paths
have been found, the (N + 1)-th best path can be determined by exam-
ining all existing nodes in the N -best tree, because it can share the last
part of some path among the top N paths. Thus it is important to point
out that this algorithm performs a full search within the word graph and
delivers exact results as defined by the word graph structure. The detailed
implementation of this algorithm will be presented in Chapter 4.
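The idea of combining backward cumulative scores with best-first expansion can be sketched on a toy word graph; this is an illustrative simplification, not the N-best tree implementation of [Tran, 1996]. Because the heuristic (the backward best score) is exact, complete hypotheses leave the priority queue in exact score order.

```python
import heapq

ARCS = {  # node -> list of (next_node, word, log_score); toy example
    0: [(1, "move", -1.0), (1, "remove", -1.5)],
    1: [(2, "the", -0.2), (2, "a", -0.9)],
    2: [],
}
FINAL = 2

def backward_scores():
    """Best score from each node to the final node (node ids here are
    already in topological order, so reverse numeric order works)."""
    best = {FINAL: 0.0}
    for node in sorted(ARCS, reverse=True):
        if node != FINAL:
            best[node] = max(sc + best[nxt] for nxt, _, sc in ARCS[node])
    return best

def nbest(n):
    h = backward_scores()
    # heap entries: (-(partial score + backward estimate), partial score,
    #                node, words so far)
    heap = [(-h[0], 0.0, 0, ())]
    out = []
    while heap and len(out) < n:
        _, score, node, words = heapq.heappop(heap)
        if node == FINAL:
            out.append((score, " ".join(words)))
            continue
        for nxt, word, sc in ARCS[node]:
            heapq.heappush(heap, (-(score + sc + h[nxt]),
                                  score + sc, nxt, words + (word,)))
    return out
```

Here `nbest(3)` yields "move the", "remove the" and "move a", in that order of total score; each heap entry is a distinct partial path, so every complete path is enumerated at most once.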
2.2.2 Word Graph Algorithm
The main obstacle of the N-best method is that the number of sentences needed
to include the correct hypothesis grows exponentially with the length of the
utterance. In order to find a way to compactly represent the alternative
hypotheses, word graphs, [Ney, 1993, 1994; Ortmanns, 1997; Neukirchen,
2001] and word lattices [Odell, 1995] were introduced. In a lattice or word
lattice, words are represented by weighted arcs where weights correspond to
acoustic scores (usually log probabilities) while in a word graph, words are
represented by nodes and node weights correspond to acoustic likelihood
scores. However, in principle they are similar, so from now on we refer to
both as word graphs (WGs for short). The main idea of a word
graph is to represent word alternatives in regions of the speech signal where
the ambiguity in the acoustic recognition is high. The advantage is that
the acoustic recognition is decoupled from the application of the language
model, in particular a long-span language model, which can be applied
in a subsequent post-processing step. The number of word alternatives
should be adapted to the level of ambiguity in the acoustic recognition.
In the following, we present two algorithms: the first one is for lattice
generation which was proposed by [Odell, 1995]; the second one is for word
graph generation, proposed by [Ney, 1993; Ney, 1994; Ortmanns, 1997;
Neukirchen, 2001].
Lattice generation
According to [Odell, 1995], only a few simple modifications are needed to
extend one-pass time-synchronous decoding to generate a lattice of
hypotheses. Rather than discarding all but the most likely partial path
when these merge at word ends, it is possible to retain information about
them to allow lattice traceback.
When only the most likely hypothesis is required, the language model
likelihoods are added to the word ending state and the traceback infor-
mation updated. The states from equivalent partial paths are recombined
and only the most likely survives to propagate into the following network.
When a lattice of hypotheses is needed, the less likely word ending states
are not discarded but are linked to the most likely one and the combined
structure propagates into the following network. The calculation in the
remainder of the network is only performed on the most likely word ending
state but when traceback occurs at the end of the sentence, all of the word
ending states are used to construct a lattice of hypotheses. At the end of
each utterance, traceback proceeds separately through each of the linked
word ending states and a lattice of alternative hypotheses is constructed.
Each node in the generated lattice has an associated time. Each arc
has an associated word identity, the acoustic likelihood and the language
model likelihood, and forms a link between two nodes which define the
start and end times of the word hypothesis.
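This structure can be sketched directly; the field names and numeric values below are illustrative, not from any particular toolkit.

```python
from dataclasses import dataclass

@dataclass
class Node:
    time: float  # seconds from utterance start

@dataclass
class Arc:
    word: str
    start: Node          # start node (defines the word's start time)
    end: Node            # end node (defines the word's end time)
    ac_score: float      # acoustic log-likelihood
    lm_score: float      # language model log-likelihood

    def duration(self):
        return self.end.time - self.start.time

    def total(self, lm_weight=1.0):
        # rescoring recombines the two scores, typically with a weight
        return self.ac_score + lm_weight * self.lm_score
```

Keeping the acoustic and language model likelihoods separate on each arc is what allows a second pass to rescore the lattice with a different language model or a different weighting, without redecoding the audio.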
Word Graph Generation
Two algorithms [Ney, 1993], [Ney, 1994] have been proposed to build word
graphs without a backward phase. Both are based on time-synchronous
beam search decoding using a tree-organized lexicon. The search space is
built dynamically, instantiating a new tree each time a leaf, corresponding
to a word end, is reached. Basically, each word ending at a given time
corresponds to a word hypothesis, which is kept and then used to build a
word graph. The word graph is defined as a directed acyclic graph, where
each arc is labeled with a word and a score, and each node corresponds to
a time frame.
In [Ney, 1993], a first algorithm is proposed, in which a word hypotheses
generator (WHG) finds, with a beam search, word hypotheses consisting of
a word identifier, an acoustic score, start and end times. Hypotheses can
be arranged in a large word graph that must be pruned and optimized by
the subsequent module, called word graph optimizer (WGO). A reduction
in the number of arcs can be obtained by the WHG if word hypotheses are
allowed to end only every other or every third frame. A possible WGO works as follows:
first the word graph is unfolded from the start node into a tree structure;
then, for each set of partial paths with identical start time, end time and
word sequence, only the most probable path is kept; finally, edges having
a score below a certain threshold, with respect to the best complete score,
are removed. Other actions that may be performed by a WGO concern
the merging of nodes with identical time, or merging of subgraphs having
identical time boundaries and word labels.
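The edge-pruning action can be sketched with forward and backward best scores over a toy graph: an edge survives if the best complete path through it is within a threshold of the global best. Graph, scores and threshold are invented for the example.

```python
EDGES = [  # (from_node, to_node, word, log_score); node ids topological
    (0, 1, "move", -1.0), (0, 1, "remove", -4.0),
    (1, 2, "the", -0.2), (1, 2, "a", -0.9),
]
START, END = 0, 2

def prune(edges, threshold):
    nodes = {n for e in edges for n in (e[0], e[1])}
    fwd = {n: float("-inf") for n in nodes}; fwd[START] = 0.0
    bwd = {n: float("-inf") for n in nodes}; bwd[END] = 0.0
    for a, b, _, sc in sorted(edges):               # increasing from-node
        fwd[b] = max(fwd[b], fwd[a] + sc)           # best score into b
    for a, b, _, sc in sorted(edges, reverse=True): # decreasing from-node
        bwd[a] = max(bwd[a], sc + bwd[b])           # best score out of a
    best = fwd[END]
    # best complete path through an edge = fwd(tail) + edge + bwd(head)
    return [e for e in edges if fwd[e[0]] + e[3] + bwd[e[1]] >= best - threshold]
```

With a threshold of 2.0 log units, the "remove" edge (whose best complete path scores -4.2 against a global best of -1.2) is removed while the other three edges survive.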
A different word graph builder is presented in [Ney 1994], [Ortmanns,
1997] which is integrated into a one-pass search algorithm. It exploits the
word-pair approximation, assuming that the boundary between two words
is independent of previous history. Using this assumption in conjunction
with an m-gram language model, it is possible to recombine, at time t, all
the word sequences having identical last m−1 words. The algorithm is based
on a dynamic programming recursion that finds, at a time t, the optimal
word boundary between words wi and wj at ending time t, say τ(t; wi, wj),
which, under the word-pair assumption, is independent of previous words.
The cumulative score for word wj from the optimal word boundary to t,
say h(wj, τ(t; wi, wj), t) is also computed. The algorithm is summarized as
follows. The details can be found in [Ortmanns, 1997]. Let:
• τ(t; v, w) = B(t, Sw; w): word boundary between the predecessor word
v and the current word w.
• h(w; τ, t) = P(xτ+1, ..., xt | w): probability that word w produces the
acoustic vectors xτ+1, ..., xt.
• H(w; t) = maxv { p(w|v) · Q(t, Sv; v) }: (joint) probability of generating
the acoustic vectors x1, ..., xt and a word sequence w1, ..., wn with ending
word w and ending time t.
The definitions of B(·), H(·), Q(·) are given in Section 2.1.4. For each
predecessor word v, along with word boundary τ = τ(t; v, w) the word
scores are recovered using the equation:
h(w; τ, t) = Q(t, Sw; w) / H(v; τ)   (2.29)
Given the above defined quantities, WGs can be built by the following
bigram, one-pass Viterbi search based algorithm [Neukirchen, 2001]:

Word Graph Constructing Algorithm
1  for t = 1 to T
2  do for each triple (v, w; t) ending at t
3     do keep track of:
4        - the (unique) word boundary τ(t; v, w)
5        - the acoustic score h(w; τ, t)
At the utterance end, the word graph is constructed
by tracing back through the bookkeeping list.

This algorithm takes into account the language model during the generation of the word
graph. One thing that makes it different from the previous
one is the node creation process: when creating a new node, it takes into
account the m-gram constraint (or the m-histories of the current word).
The advantages of this method are:
• better modeling of word boundaries due to an extended word m-tuple
boundary optimization.
• improved pruning of word graphs by exploiting the more detailed
knowledge sources.
• smaller costs for graph expansion since higher order context con-
straints are encoded in the word graph structure.
We will discuss in detail the implementation of this algorithm in the next
chapter.
2.2.3 Consensus Decoding
With word graph decoding, we face a computational problem:
in general, the number of paths through a word graph is exponential in
the number of links. These paths correspond to different segmentation
hypotheses (i.e. word sequences plus boundary times) of utterances. A
method to overcome this problem has been suggested in [Mangu, 1999].
The word graph is first transformed into a special form in which the calcu-
lation of the expected word error rate becomes trivial. This special form
of a word graph is called a confusion network. A confusion network itself
is a word graph. Each edge is labeled with a word and a probability. The
most important feature of these word graphs is that they are linear, in the
sense that every path from the start to the end node has to pass through
all nodes. A consequence of this (combined with the acyclicity) is that all
paths between two nodes have the same length. Thus the confusion net-
work naturally defines an alignment for all pairs of paths (called a multiple
alignment by Mangu). This alignment is used as the basis for the word
error rate calculation.
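Since every path passes through every slot of the network in order, picking the highest-posterior word in each slot minimizes the expected word error under this alignment. This can be sketched with a toy example (the network below, its words and posteriors, are all made up for illustration; a slot is simply a dict from word to posterior, with "" standing for the empty word):

```python
def consensus_decode(confusion_network):
    """confusion_network: list of slots; each slot maps a word
    (or "" for the empty word) to its posterior probability."""
    best = []
    for slot in confusion_network:
        word = max(slot, key=slot.get)  # highest-posterior alternative
        if word:                        # skip empty-word hypotheses
            best.append(word)
    return best

# Toy confusion network with three slots; probabilities are made up.
cn = [{"i": 0.9, "a": 0.1},
      {"have": 0.4, "had": 0.35, "": 0.25},
      {"been": 1.0}]
print(consensus_decode(cn))  # prints: ['i', 'have', 'been']
```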
An approach to actually construct a confusion network from a word
graph is also presented in [Mangu, 1999]. The task is treated as a clustering
problem, where the edges from the original word graph have to be clustered
into groups according to criteria such as the overlap in time between
edges and the phonetic similarity between words. The confusion network
construction algorithm will be described in Chapter 4.
2.3 Statistical Machine Translation
The goal is the translation of a text given in some source language into
a target language. Precisely, we are given a source string f = f_1...f_j...f_m,
which is to be translated into a target string e = e_1...e_i...e_l. The key idea
here is, among all possible target strings, to choose the string with the
highest probability given by the Bayes’ decision rule [Brown, 1993]:
e* = arg max_e {P(e|f)}   (2.30)
   = arg max_e {P(e) · P(f|e)}   (2.31)
Here, P (e) is the language model of the target language, P (f|e) is the string
translation model while arg max denotes the search problem, i.e., the gen-
eration of the output sentence in the target language. If we look back to
the Eq. 2.6, we will find a strong similarity between the problem of statis-
tical speech recognition and machine translation. The details of parameter
estimation and models for decoding can be found in [Brown, 1993]. In
the following we consider an alternative way of looking at statistical machine
translation, namely log-linear models [Och, 2002].
2.3.1 Log-linear Model
As originally proposed by [Brown, 1993], the most likely translation of a
foreign source sentence f into English is obtained by searching for the
sentence with the highest posterior probability:
e* = arg max_e Pr(e|f)   (2.32)

Usually, the hidden variable a is introduced:

e* = arg max_e Σ_a P(e, a|f)   (2.33)

which represents an alignment from source to target positions.
The framework of maximum entropy [Berger, 1996] provides a means
to directly estimate the posterior probability P (e, a|f). It is determined
through suitable real-valued feature functions h_i(e, f, a), i = 1...M, and
takes the parametric form:

p_λ(e, a|f) = exp{Σ_i λ_i h_i(e, f, a)} / Σ_{e,a} exp{Σ_i λ_i h_i(e, f, a)}   (2.34)
The maximum entropy solution corresponds to the values λ_i that maximize
the log-likelihood over a training sample T:

λ* = arg max_λ Σ_{(e,f,a)∈T} log p_λ(e, a|f)   (2.35)
Unfortunately, a closed-form solution of (2.35) does not exist. An iterative
procedure converging to the solution was proposed by [Darroch, 1972]; an
improved version is given in [Pietra, 1997].
If the following feature functions are chosen [Och, 2002]:

h_1(e, f, a) = log P(e)
h_2(e, f, a) = log P(f, a|e)

exploiting eq. (2.34), eq. (2.33) can be rewritten as:

e* = arg max_e P(e)^{λ_1} Σ_a Pr(f, a|e)^{λ_2}   (2.36)

where the λ_i's represent the scaling factors of the models.
In eq. (2.36), English strings e are ranked on the basis of the weighted
product of the language model probability P(e), usually computed through
an m-gram language model, and the marginal of the translation probability
P(f, a|e).
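As an illustration, the weighted-product ranking of eq. (2.36), with the sum over alignments replaced by its largest term, can be applied to a list of candidate translations. In the sketch below, the candidates, their log-probabilities and the weights are all made up:

```python
import math

def loglinear_score(log_lm, log_tm, lam1, lam2):
    # log of the weighted product P(e)^lam1 * P(f,a|e)^lam2
    return lam1 * log_lm + lam2 * log_tm

def best_translation(candidates, lam1=1.0, lam2=1.0):
    """candidates: list of (e, log P(e), log P(f,a|e)) triples,
    one per candidate translation e (best alignment a only).
    Returns the candidate maximizing the log-linear score."""
    return max(candidates,
               key=lambda c: loglinear_score(c[1], c[2], lam1, lam2))[0]

cands = [("hello world", math.log(0.2), math.log(0.1)),
         ("hi world",    math.log(0.1), math.log(0.3))]
print(best_translation(cands))            # the combination decides
print(best_translation(cands, lam2=0.0))  # the LM alone decides
```

Changing the scaling factors λ_i changes the ranking, which is why they are usually tuned on held-out data.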
In [Brown, 1993, Och, 2003] six translation models (Model 1 to 6) of in-
creasing complexity are introduced. These alignment models are usually es-
timated through the Expectation Maximization algorithm [Dempster, 1977],
or approximations of it, by exploiting a suitable parallel corpus of trans-
lation pairs. For computational reasons, the optimal translation of f is
computed with the approximated search criterion:
e* ≈ arg max_e P(e)^{λ_1} max_a P(f, a|e)^{λ_2}   (2.37)
In summary, given the string e = e_1, . . . , e_l, a string f and an alignment
a are generated as follows: (i) a non-negative integer φ_i, called fertility, is
generated for each word e_i and for the null word e_0; (ii) for each e_i, a list τ_i,
called tablet, of φ_i source words and a list π_i, called permutation, of φ_i source
positions are generated; (iii) finally, if the generated permutations cover
all the available source positions exactly once then the process succeeds,
otherwise it fails. Fertilities fix the number of source words to be aligned to
each target word, and the total length of the foreign string. Moreover, as
permutations of Model 4 are constrained to assign positions in ascending
order, it can be shown that if the process succeeds in generating a triple
(φ_0^l, τ_0^l, π_0^l), then there is exactly one corresponding pair (f, a), and vice
versa. This property justifies the following decomposition of Model 4:
• fertility model: p(φ|e)
• lexicon model: p(τ | φ, e)
• distortion model: p(π | φ, τ, e).
The detail of the decompositions is specified in [Brown, 1993].
2.3.2 Decoding
Given the source sentence f = f_1^m, the optimal translation e* is searched
through the approximate criterion (2.37). According to the dynamic
programming paradigm, the optimal solution can be computed through a
recursive formula which expands previously computed partial theories, and
recombines the new expanded theories. A theory can be described by its
state, which only includes information needed for its expansion; two par-
tial theories sharing the same state are identical (indistinguishable) for the
sake of expansion, i.e. they should be recombined.
Pruning
In order to reduce the huge number of theories to be generated, some methods
are used, which affect the optimality of the search algorithm:
• Comparison with the best theory: theories whose score is worse than
that of the best complete solution found so far are pruned, since theory
expansion always decreases the score.
• Beam search: at each expansion less promising theories are also pruned.
In particular, two types of pruning define the beam:
– threshold pruning: partial theories whose score falls below the
current optimum score by more than a given threshold are eliminated;
– histogram pruning: hypotheses not among the top N best scoring
ones are pruned.
These criteria are applied, first to all theories with a fixed coverage
set, then to all theories of fixed output length.
• Reordering constraint: a smaller number of theories is generated by
applying the so-called IBM constraint on each additionally covered
source position, i.e. by selecting only one of the first 4 empty positions,
from left to right.
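The two beam criteria can be sketched as follows (a minimal illustration with hypothetical theory log-scores; the threshold and beam size are assumed parameters, not values from the thesis):

```python
def beam_prune(theories, threshold, histogram_n):
    """theories: list of (theory, log_score) pairs.
    Threshold pruning keeps theories within `threshold` of the best
    log-score; histogram pruning then keeps at most `histogram_n`."""
    if not theories:
        return []
    best = max(score for _, score in theories)
    survivors = [(th, s) for th, s in theories if s >= best - threshold]
    survivors.sort(key=lambda ts: ts[1], reverse=True)
    return survivors[:histogram_n]

# Toy example: four partial theories with made-up log-scores.
theories = [("t1", -10.0), ("t2", -12.5), ("t3", -30.0), ("t4", -11.0)]
kept = beam_prune(theories, threshold=5.0, histogram_n=2)
# t3 is cut by the threshold, t2 by the histogram limit.
```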
2.3.3 Speech Translation
In the previous subsection, we have introduced the framework for statis-
tical machine translation which assumed that the input is written text.
Considering the problem of speech input rather than text input for trans-
lation, we can distinguish three levels, namely the acoustic vector x, the
source sentence f and the target sentence e.
x → f → e   (2.38)

From a strict point of view, the source sentence f is not of direct interest
for the speech translation task. Mathematically, this is captured by
introducing the source sentence hypotheses as hidden variables into Bayes'
decision rule, as in Eq. 2.6:
arg max_e P(e|x) = arg max_e {P(e) · P(x|e)}   (2.39)
              = arg max_e {P(e) · Σ_f P(f, x|e)}   (2.40)
              = arg max_e {P(e) · Σ_f P(f|e) · P(x|f, e)}   (2.41)
              = arg max_e {P(e) · Σ_f P(f|e) · P(x|f)}   (2.42)
              ≅ arg max_e {P(e) · max_f {P(f|e) · P(x|f)}}   (2.43)
In the above derivation, we have made only one reasonable assumption,

P(x|f, e) = P(x|f)   (2.44)

that is, the target string e does not help to predict the acoustic vectors
if the source string f is given. In addition, similarly to Eq. 2.25, here we
have also used the maximum approximation as in Eq. 2.43. The key issue
here is the question of how the requirement of having both a well-formed
source sentence f and a well-formed target sentence e at the same time is
satisfied. From the statistical point of view, this question is captured by
finding suitable models for the joint probability:
P(f, e) = P(e) · P(f|e)   (2.45)
2.3.4 Evaluation Criterion
A generally accepted criterion for evaluating automatic machine translation
does not yet exist. Therefore, the usual practice is to use a large variety of
different criteria. A good system is one that produces good-quality
translations according to most or all of these criteria. The following
criteria are widely used in the recent literature.
• SER (sentence error rate): The SER is computed as the fraction of
generated sentences that do not correspond exactly to any of the
reference translations.
• WER (word error rate): The WER is computed as the minimum
number of substitution, insertion and deletion operations that have
to be performed to convert the generated sentence into the target
sentence.
• PER (position-independent WER): A shortcoming of the WER is the
fact that it requires a perfect word order. The word order of an
acceptable sentence can be different from that of the target sentence,
so that the WER measure alone could be misleading. To overcome
this problem, the PER criterion is introduced as additional measure,
that compares the words in the two sentences ignoring the word order.
• mWER (multi-reference word error rate): For each test sentence, there
is not only a single reference translation, as for WER, but a set of ref-
erence translations. For each translation hypothesis, the edit distance
to the most similar sentence is calculated.
• mPER (multi-reference position independent WER): This criterion
ignores the word order by treating a sentence as a bag-of-words and
computing the minimum edit distance needed to transform
the hypothesis into the closest of the given reference translations.
• SSER (subjective sentence error rate): For a more detailed analysis,
subjective judgments by test persons are necessary. Each translated
sentence is judged by a human examiner according to some error scales
(i.e. from 1.0 to 5.0).
• IER (information item error rate): The test sentences are segmented
into information items. For each of them, if the intended information
is conveyed and there are no syntactic errors, the sentence is counted
as correct.
• BLEU score: This criterion computes the geometric mean of the precision
of n−grams of various lengths between a hypothesis and a set
of reference translations, multiplied by a factor that penalizes short
sentences.
• NIST score: This criterion computes a weighted precision of n−grams
between a hypothesis and a set of reference translations multiplied by
a factor that penalizes short sentences.
Both NIST and BLEU are accuracy measures, and thus larger values reflect
better translation quality.
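For instance, WER and PER can be sketched with a standard edit-distance and a bag-of-words computation (an illustrative implementation, not the evaluation code used in the thesis; the sentences are toy data):

```python
from collections import Counter

def wer_counts(hyp, ref):
    """Minimum substitutions + insertions + deletions needed to turn
    hyp into ref (Levenshtein distance over words)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # delete hyp word h
                                   d[j - 1] + 1,    # insert ref word r
                                   prev + (h != r)) # substitute / match
    return d[len(ref)]

def per_counts(hyp, ref):
    """Position-independent error count: compare the two sentences as
    bags of words, ignoring word order."""
    h, r = Counter(hyp), Counter(ref)
    return max(sum((r - h).values()), sum((h - r).values()))

hyp = "the cat sat on mat".split()
ref = "on the mat the cat sat".split()
print(wer_counts(hyp, ref), per_counts(hyp, ref))  # prints: 5 1
```

The gap between the two counts shows how much of the WER is due to word-order differences alone.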
Chapter 3
Word Graph Generation
In this chapter we first introduce some terms and common operators related
to word graphs. These terms and operators are then used to describe
algorithms that we have implemented, including word graph generation,
word graph evaluation, word graph pruning and word graph expansion.
3.1 Word Graph Definitions
A word graph is a directed, acyclic, weighted, labeled graph with distinct
start and end nodes. It is a quadruple G = (V, E, I, F ) with the following
components:
• A nonempty set of vertices or nodes V = {v1, ..., vN}.
• A nonempty set of weighted, labeled, directed edges E = {e1, ..., eM}.
Each edge e is defined by e = (vi, vj, τ, t, w, s) where:
– vi, vj ∈ V are the starting node and the ending node of e.
– w ∈ L is a word label where L denotes a non-empty set of words,
L = {w1, ..., w|L|}.
– τ, t ∈ R are the starting time and the ending time of the word
hypothesis w.
– s = (ac, lm · F_lm) ∈ R × R is the weight of e, consisting of the
acoustic score ac and the language model score lm, multiplied
by the language model factor F_lm (see Eq. 2.7). Later we will also
attach a posterior score p to each edge; this score will be discussed
in the next chapters.
• I ∈ V : the start node. This node represents the start of the utterance.
Every node in the word graph is reachable from the start node I. By
default, I = v1.
• F ∈ V : the end node. This node represents the end of the utterance.
By default, F = vN .
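The quadruple G = (V, E, I, F) maps naturally onto a small data structure. A minimal Python sketch (class and field names are my own, not the thesis's implementation), including simple fan-In/fan-Out, Expand and Inpand style accessors:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    vi: int      # starting node
    vj: int      # ending node
    tau: int     # starting time of the word hypothesis
    t: int       # ending time
    w: str       # word label
    s: float     # weight (acoustic score + scaled LM score)

@dataclass
class WordGraph:
    n_nodes: int
    edges: list = field(default_factory=list)

    def add_edge(self, e):
        self.edges.append(e)

    def expand(self, v):   # edges outgoing from node v
        return [e for e in self.edges if e.vi == v]

    def inpand(self, v):   # edges incoming to node v
        return [e for e in self.edges if e.vj == v]

    def fan_out(self, v):
        return len(self.expand(v))

    def fan_in(self, v):
        return len(self.inpand(v))

g = WordGraph(3)
g.add_edge(Edge(0, 1, 0, 3, "ho", -7.1))
g.add_edge(Edge(1, 2, 3, 6, "visto", -6.5))
```

The list-scan accessors are quadratic in the worst case; a real decoder would keep per-node adjacency lists instead.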
3.1.1 Word Graph Accessors
Accessors for Word Graph nodes
Given a node v we define the following accessors:
• fan-In(v): return the number of edges incoming to node v.
• fan-Out(v): return the number of edges outgoing from node v.
• Expand(v): return the set of edges outgoing from node v.
• Inpand(v): return the set of edges incoming to node v.
Accessors for Word Graph Edges
Given an edge e ∈ E we define the following accessors:
• s(e): return the weight s of e.
• ac(e): return the acoustic score ac of e.
• lm(e): return the language model score lm of e.
• w(e): return the word label w of e.
• b(e): return the starting node vi of e.
• f(e): return the ending node vj of e.
• τ(e): return the starting time τ of e.
• t(e): return the ending time t of e.
3.1.2 Word graph properties
Reachability
We define the relation of reachability for nodes (−→) as follows:
∀vi, vj ∈ V : vi −→ vj ⇐⇒ ∃e ∈ E : b(e) = vi ∧ f(e) = vj (3.1)
The transitive hull of the reachability relation −→ is denoted by −→*.
We define:

Reachable(vi, vj): return true if vi −→* vj
Paths
A path through a graph is a sequence of edges p = e1, e2, .., eK such that:
f(ei) = b(ei+1), i = 1, .., K − 1
We call the number K of edges in the sequence p the length of the path.
To express that two nodes are connected by a sequence p of edges, we write
vi −p→ vj. We also define the score of a given path p = e1, e2, .., eK as
follows:

path-Score(p) = Σ_{i=1}^{K} s(ei)   (3.2)
3.1.3 Topological ordering
A topological ordering is a function γ : V −→ {1, .., |V |} on the nodes of a
word graph having the property that:
∀e ∈ E : γ(b(e)) < γ(f(e))
The function topo-Sort(G) sorts all nodes of G in topological order, provided
there are no cycles in the graph.
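topo-Sort is not spelled out in the text; a standard realization that satisfies the property above is Kahn's algorithm, sketched here over a plain edge list (names are illustrative):

```python
from collections import defaultdict, deque

def topo_sort(n_nodes, edges):
    """Return a topological ordering gamma of nodes 0..n_nodes-1 such
    that gamma[b(e)] < gamma[f(e)] for every edge e = (b, f)."""
    fan_in = [0] * n_nodes
    succ = defaultdict(list)
    for b, f in edges:
        succ[b].append(f)
        fan_in[f] += 1
    queue = deque(v for v in range(n_nodes) if fan_in[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in succ[v]:          # removing v frees its successors
            fan_in[u] -= 1
            if fan_in[u] == 0:
                queue.append(u)
    if len(order) != n_nodes:
        raise ValueError("graph contains a cycle")
    return {v: i for i, v in enumerate(order)}

gamma = topo_sort(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
assert all(gamma[b] < gamma[f] for b, f in [(0, 1), (0, 2), (1, 3), (2, 3)])
```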
3.2 Word Graph Generation
In the previous chapter, it has been shown how acoustic, lexical and lan-
guage knowledge can be compiled into a stochastic finite-state integrated
network to be used for generating hypotheses. In general, the search for the
most probable word sequence W, given the sequence of acoustic vectors,
X, is translated into the search of an optimal path over a derived structure,
called trellis, which corresponds to the unfolding of the integrated network
along the time axis. The search, or decoding, is normally carried out in
time-synchronous fashion by the so-called Viterbi algorithm. To avoid ex-
haustive search of a huge search space, the beam search technique is used.
The main idea of beam search is to prune the less promising hypotheses
on the basis of a local estimation. At each time frame t, the decoder produces
a best hypothesis, namely the one among all hypotheses ending at time t
which has the highest local accumulated score. All hypotheses whose scores
fall below the best one with respect to a given threshold are pruned. The
pruned hypotheses are no longer considered in the next time frames. The
best local hypotheses are kept by using a back-pointer list. At the ending
time T of the utterance, the word sequence W will be found by tracing
back through the list.
The main idea of word graphs is to represent word alternatives in regions
of the speech signal where the ambiguity in the acoustic recognition is
high. The advantage is that the acoustic recognition is decoupled from
the application of the language model, in particular a long-span language
model can be applied in a subsequent post-processing step. The number
of word alternatives should be adapted to the level of ambiguity in the
acoustic recognition. In the following we present the m−gram word graph
generation algorithm proposed by [Neukirchen, 2001]. Similarly to the survey of
word graph algorithms presented in the previous chapter, we need to
define some quantities:
• W = (w1, ..., wN) : an N-word sequence.
• Sm(W) = (wN−m+2, ..., wN) : the m−gram LM state of a word sequence
W, given by its (m − 1) most recent words.
• h(w; τ, t) = P(x_{τ+1}^t | w) : probability that word w produces the acoustic
vectors x_{τ+1}...x_t.
• G(W) = P(W) · P(x_1^t | W) : (joint) probability of generating the acoustic
vectors x_1...x_t and a word sequence (w1, ..., wN) with ending time t.
• H(Sm; t) : (joint) probability of generating the acoustic vectors x_1...x_t
and a word sequence with the final (m − 1) words given by Sm at
ending time t.
The optimization in the search is conditioned by word history Sm. When
the search reaches the leaf for word w in the lexicon tree, this results in an
extension of the preceding partial hypothesis W to the new hypothesis
W′ = (W, w). Within the lexicon tree, recombination is applied to
all preceding hypotheses that are in an identical m−gram state Sm(W) but
entered the tree at different starting times τ. Using the m−gram
P(w|Sm(W)), the following optimization generates the probability of the
new partial hypothesis W′ ending at time t:

G(W′; t) = P(w|Sm(W)) · max_τ {H(Sm(W); τ) · h(w; τ, t)}   (3.3)

The optimal tree starting time (the boundary between W and w), τ_opt, is given
implicitly by the optimization in Eq. 3.3, and the boundary only depends
on the ending time t and the m most recent words in W′.
In order to complete the word graph construction algorithm, two more
quantities have to be computed: the word boundary τ(Sm(W), w) and the
word score h(w; τ, t). Directly from Eq. 3.3 we have:

τ_opt = arg max_τ {H(Sm(W); τ) · h(w; τ, t)}   (3.4)

score(w) = G(W′) / H_max(Sm(W); τ)   (3.5)

where H_max(Sm(W); τ) = max_τ {H(Sm(W); τ) · h(w; τ, t)} and score(w)
is the accumulated score of word w spanning from time τ + 1 to time t.
In order to describe the word graph generation algorithm in detail,
the following subsections introduce some concepts concerning the hypotheses
output by the decoder, the best-predecessor hypothesis and the language
model state.
3.2.1 Hypothesis
At each time frame t, after pruning, the decoder outputs a list of Nt
hypotheses {ϱ_t^i}, i = 1..Nt, which represents the set of hypothesized words
ending at time t. For a given hypothesis ϱ, we also define some accessors
as follows:
• b(ϱ): return the beginning state in the decoding network.
• e(ϱ): return the ending state in the decoding network.
Table 3.1: A list of hypotheses output by the decoder at time frame t = 6, Nt = 7.
e(ϱ) w(ϱ) S(ϱ) b(ϱ) τ(ϱ)
172 c -718.115601 3 3
213 a -717.378235 3 3
226 ho -718.540283 3 3
284 e‘ -709.984314 3 3
287 e -717.644409 3 3
2 @BG -605.717712 0 0
3 @BG -629.394775 0 0
• τ(ϱ): return the starting time.
• t(ϱ): return the ending time.
• w(ϱ): return the word label.
• S(ϱ): return the total likelihood score up to this hypothesis.
Table 3.1 shows an example of such a hypothesis list, containing Nt = 7
hypotheses, output by the decoder at time frame t = 6.
3.2.2 Best predecessor
For a given hypothesis ϱ_t^i we define its best predecessor hypothesis by:

ϱ*_τ = arg max_{ϱ_τ^j} { S(ϱ_τ^j) : e(ϱ_τ^j) = b(ϱ_t^i), t(ϱ_τ^j) = τ(ϱ_t^i) } ≡ f(ϱ_t^i)   (3.6)

The following function returns the best predecessor of a given hypothesis
ϱ_t^i, implemented as in Eq. 3.6:

find-Best-Predecessor(ϱ_t^i)

Let us consider the hypothesis in the first row of Table 3.1. We see
that it starts at time frame τ(ϱ) = 3 with beginning state b(ϱ) = 3.
Going back to the list of hypotheses ending at time frame t = 3, we find
the hypothesis ϱ* whose ending state equals 3, with the result shown in
Table 3.2.
Table 3.2: A list of hypotheses output by the decoder at time frame t = 3, Nt = 4.
e(ϱ) w(ϱ) S(ϱ) b(ϱ) τ(ϱ)
172 c -414.504028 0 0
284 e‘ -410.906921 0 0
2 @BG -302.052124 0 0
3 @BG -325.729187 0 0
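The lookup of Eq. 3.6 can be sketched over the entries of Tables 3.1 and 3.2 (field and function names here are assumptions for illustration, not the thesis's code):

```python
from dataclasses import dataclass

@dataclass
class Hyp:
    e: int      # ending state in the decoding network
    w: str      # word label
    S: float    # total likelihood score up to this hypothesis
    b: int      # beginning state
    tau: int    # starting time

def find_best_predecessor(hyp, beam):
    """beam: dict mapping an ending time t to the list of hypotheses
    ending at t. Following Eq. 3.6: among hypotheses ending at time
    tau(hyp) whose ending state equals b(hyp), pick the best-scoring."""
    cands = [h for h in beam.get(hyp.tau, []) if h.e == hyp.b]
    return max(cands, key=lambda h: h.S) if cands else None

# Hypotheses ending at t = 3 (Table 3.2); current hypothesis is the
# first row of Table 3.1.
beam = {3: [Hyp(172, "c", -414.504028, 0, 0),
            Hyp(284, "e‘", -410.906921, 0, 0),
            Hyp(2, "@BG", -302.052124, 0, 0),
            Hyp(3, "@BG", -325.729187, 0, 0)]}
cur = Hyp(172, "c", -718.115601, 3, 3)
best = find_best_predecessor(cur, beam)
```

Here the only hypothesis of Table 3.2 with ending state 3 is the @BG entry, which is therefore returned.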
3.2.3 Language Model State
Best hypothesis sequence
From the definition of the best predecessor of a given hypothesis, we define
a best hypothesis sequence ηm ending at the hypothesis ϱm as a sequence
of m hypotheses:

ηm = ϱ1, .., ϱm such that ϱi = f(ϱi+1), i = 1...m−1   (3.7)

The corresponding word sequence W(ηm), taken from the best hypothesis
sequence, is defined as follows:

W(ηm) = w1, .., wm such that wi = w(ϱi), i = 1...m   (3.8)
Language model state
Given a word sequence W = w1, ..., wm, we define its language model
states L-state[W] and R-state[W] as follows:
• L-state[W ] = w1, ..., wm−1
• R-state[W ] = w2, ..., wm
We call L-state[W ] the left language model state of a given word sequence,
W . In fact, it is the context of that word sequence, discussed in the previous
section. The right language model state R-state[W ] is defined in a similar
way.
Initially, at time frame t = 0, there are no hypotheses output, so:
• L-state(null) = R-state(null) = <s>.
Here, <s> is a special symbol used to denote the beginning of the utterance.
At time frame t = 1, the word sequence of each output hypothesis has
length 1, so:
• L-state[W(η1)] = <s>
• R-state[W(η1)] = <s>, w.
Finally, the left and the right language model state L-state[W (η m)],
R-state[W (ηm)] are built incrementally during word graph construction.
3.2.4 Algorithm
At each time frame t, keep the Nt hypotheses {ϱ_t^i}, i = 1..Nt, within the
beam search.
• For each hypothesis ϱ_t^i, i = 1..Nt, take its starting time τ(ϱ_t^i), its
starting state b(ϱ_t^i) and its word wm = w(ϱ_t^i):
– find the best predecessor ϱ*_τ, given by Eq. 3.6
– update the best hypothesis sequence: ηm(ϱ_t^i) = ηm−1(ϱ*_τ), ϱ_t^i
– update L-state[W(ηm)]
– update R-state[W(ηm)]
– create or find node vj = (t, R-state[W(ηm)])
– create a new edge e whose score is given by ϱ_t^i, ϱ*_τ and p(wm|w1, .., wm−1)
– create or find node vi = (τ, R-state[W(ηm−1)])
– insert edge e between nodes vi and vj
Generate-WG
▷ create the initial node I
1 create-Node(0, R-state(null))
2 for each time frame t, t = 1..T
3 do for each ϱ_t^i ∈ Beam(t), i = 1..Nt
4 do
▷ find the best predecessor of ϱ_t^i
5 ϱ*_τ = find-Best-Predecessor(ϱ_t^i)
▷ update the LM states
6 L-state(W(ηm(ϱ_t^i))) = R-state(W(ηm(ϱ*_τ)))
▷ find or create the ending node
7 vj = create-Node(t, R-state(W(ηm(ϱ_t^i))))
▷ create the edge
8 e = create-Edge(w(ϱ_t^i), score(ϱ_t^i))
▷ find or create the starting node
9 vi = create-Node(τ, R-state(W(ηm(ϱ*_τ))))
▷ insert the newly created edge
10 insert-Edge(vi, vj, e)
▷ create the final node F and its edges
11 create-Node(T + 1, R-state(null))
12 topo-Sort
3.2.5 Implementation Details
At line 3, Beam(t) denotes the set of hypotheses produced by
the decoder at time frame t.
At line 6, the left language model state of the current hypothesis ϱ_t^i is
updated according to its definition: it takes the right
language model state from its best predecessor.
At line 7, the function create-Node(t, context) creates a new node v
according to (t, context) and adds it to V . The parameter t is the current
time frame, and context denotes the R-state of the current word hypothesis
w. If this node has already been created before, this function returns
its index.
At line 8, the function create-Edge creates a new edge e and then
attaches information to it. The information of an edge includes:
• the current word label;
• the score, including both the acoustic score ac and the language model
score lm. The procedure to extract ac and lm from the hypothesis ϱ_t^i is
as follows:
– lm = p(w(ϱ_t^i) | L-state(W(ηm(ϱ_t^i))))
– ac = S(ϱ_t^i) − S(ϱ*_τ) − lm · F_lm, where F_lm is the language model
factor (see also Eq. 3.5).
At line 10, the function insert-Edge(vi, vj, e) inserts the edge e between
the two nodes vi and vj.
At line 11, the function create-Node(T + 1, context):
• creates a final node F at time T + 1 with context = R-state(null);
• creates a special edge e with w(e) = <s> and s(e) = 0;
• for each node v ∈ V with ending time t = T , calls insert-Edge(v, F, e).
Finally, at line 12, the generated word graph is topologically sorted by
calling the procedure topo-Sort.
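The score extraction at line 8 can be sketched numerically. In the toy computation below, the two total scores come from the first rows of Tables 3.1 and 3.2, while the trigram log-probability and the factor F_lm are made-up values:

```python
import math

F_LM = 10.0  # assumed language model factor (cf. Eq. 2.7), made up here

def edge_scores(S_cur, S_pred, lm_logprob, f_lm=F_LM):
    """Split a hypothesis score into its LM and acoustic parts:
    ac = S(cur) - S(pred) - lm * F_lm."""
    lm = lm_logprob
    ac = S_cur - S_pred - lm * f_lm
    return ac, lm

ac, lm = edge_scores(-718.115601, -325.729187, math.log(0.01))
print(ac, lm)
```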
Figure 3.1: Bigram and Trigram Constraint Word Graphs (left panel: a word graph using the bigram constraint; right panel: a word graph using the trigram constraint; dotted lines mark the dead paths arising from pruning/recombination).
3.2.6 Bigram and Trigram-Based Word Graph
In the Generate-WG algorithm described above, if we limit the length of
the word sequence in the left language model state, L-state[W(ηm)], to 1 or 2,
we get the so-called bigram-based or trigram-based word graph,
respectively. It is worth noticing that the bigram constraint is more
relaxed than the trigram constraint. This results in larger word graphs
in the bigram case than in the trigram case. Fig 3.1 shows an example of this
property. As we can see, due to recombination, the dotted line in the
right graph is not expanded further, while this is not true in the left graph.
The dotted line in this case is referred to as a dead path, which should be
removed by the algorithm described in the following subsection.
Since the decoder uses a trigram language model, only with trigram
word graphs are we able to separate the real trigram language model scores
(see Eq. 3.5). In the bigram case, the language model scores are computed
approximately, by using the trigram with an unseen word. This causes word
error rates (WERs) of bigram-based word graph decoding to be always
higher than WERs for trigram-based word graph decoding. In summary,
we have:
• graph error rates (GERs) of bigram-based word graphs are always
lower than the GERs of trigram-based word graphs,
• decoding WERs with bigram-based word graphs are always higher
than with trigram-based word graphs.
Experimental results given in Chapter 5 will confirm these properties.
3.2.7 Dead Paths Removal
When a partial hypothesis is not further expanded due to recombination
or pruning, it can cause the generation of dead paths in the word graph,
i.e. paths that do not reach the final graph node F . Once the word graph
has been built, we can safely remove all dead paths without affecting GERs
and WERs. Moreover, removing the dead paths also makes the word graph
smaller. The observation here is that, at a given node v ∈ V , v ≠ F ,
v ≠ I, if there are no outgoing edges from or no incoming edges to this node,
then all paths passing through v are dead and should be removed. Therefore
the algorithm simply traverses the word graph in topological and reverse
topological order and removes all dead nodes. Going back to Fig 3.1, the path
passing through the edges (v, w) drawn with a dotted line is no longer expanded;
in this case, it can be safely removed. The detailed implementation is given
in the remove-Dead-Paths procedure.
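The removal can be illustrated in Python over a plain edge-list representation (a simplified sketch; repeated sweeps to a fixpoint stand in for the topological and reverse-topological passes):

```python
def remove_dead_paths(n_nodes, edges, I, F):
    """edges: list of (vi, vj, word) tuples. Repeatedly removes interior
    nodes with no incoming or no outgoing edges, together with their
    edges, until no dead node remains."""
    nodes = set(range(n_nodes))
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        for v in list(nodes):
            if v in (I, F):
                continue
            fan_in = sum(1 for e in edges if e[1] == v)
            fan_out = sum(1 for e in edges if e[0] == v)
            if fan_in == 0 or fan_out == 0:   # dead node
                nodes.remove(v)
                edges = [e for e in edges if v not in (e[0], e[1])]
                changed = True
    return nodes, edges

# Node 3 is a dead end: the path 0 -> 3 never reaches the final node 2.
nodes, edges = remove_dead_paths(
    4, [(0, 1, "a"), (1, 2, "b"), (0, 3, "c")], I=0, F=2)
```

Removing a dead node can expose new dead nodes, which is why the sketch iterates until nothing changes.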
dead-Node(vi)
▷ Node vi is dead if fan-Out(vi) = 0 or fan-In(vi) = 0
1 return (fan-Out(vi) = 0 or fan-In(vi) = 0)

remove-Node(vi)
▷ Delete node vi by removing all edges incoming to and outgoing from vi
1 for each edge e(vk, vi, τ, t, w, s) ∈ Inpand(vi)
2 do remove-Edge(vk, vi, e)
3 for each edge e(vi, vj, τ, t, w, s) ∈ Expand(vi)
4 do remove-Edge(vi, vj, e)
5 delete vi

remove-Dead-Paths(wg)
▷ Traverse the word graph and remove all dead nodes
1 for each node vi, vi ≠ I, vi ≠ F in topological order
2 do if dead-Node(vi)
3 then remove-Node(vi)
4 for each node vi, vi ≠ I, vi ≠ F in reversed topological order
5 do if dead-Node(vi)
6 then remove-Node(vi)

3.3 Word Graph Evaluation Methods
In order to evaluate the quality of word graphs, we usually rely on two
criteria: the size of the generated word graph and the graph word error
rate (GER). This section will cover both criteria.
3.3.1 Word Graph Size
Restricting to our domain (the generation of word graphs for speech recognition
and machine translation), there are several different measures of
the quality of a word graph with respect to its size.
[Aubert, 1995] and [Woodland, 1995] used the word graph density, the
node graph density and the boundary graph density as measures. These
criteria have been widely used in most word graph evaluation systems.
They are informally defined as follows.
• The word graph density (WGD) is defined as the total number of word
graph arcs divided by the number of actually spoken words.
• The node graph density (NGD) is defined as the total number of
different words ending at each time frame divided by the number of
actually spoken words.
• The boundary graph density (BGD) denotes the number of word
boundaries, i.e. of different start times, per spoken word.
Finally, in [Amtrup, 1996], the number of paths (nPaths), the number
of derivations and the rank of a path are used to measure the quality
of the word graph, which are suitable for natural language and machine
translation. Here we describe the algorithm that enumerates the number of
paths in a word graph [Amtrup, 1996]. The algorithm that computes the
rank of a given path within a word graph is skipped, as we actually use the
N−best decoding algorithm, which will be described in Chapter 4.
Number of Paths
Given a word graph as an acyclic, directed graph, as defined in Section 3.1,
it is quite easy to compute the number of paths it contains.
Specifically, starting at the initial node I, whose number of paths is
assumed to be 1, traverse all graph nodes vi ∈ V
in topological order. For each incoming edge e(vj, vi) of node vi, add the
value of the predecessor node vj to that of node vi. The value of the final node
F is the number of paths in the graph. The detailed implementation of the
algorithm is given in the num-Paths(wg) procedure.
Fig 3.2 illustrates the num-Paths algorithm. The number in each node is the number of paths entering that node. For example, at the final node we get the value 4, which is indeed the number of paths in the graph. Moreover, examining this node we see that it has a total of three incoming edges: two come from nodes with value 1, while the remaining one comes from a node with value 2.

num-Paths(wg)
▷ return the total number of paths in a word graph.
▷ p[N] temporary variable.
1 p[I] = 1;
2 for each node vi ∈ V \ {I}
3   do p[vi] = 0;
4 for each node vi ∈ V in topological order
5   do for each edge e(vi, vj, τ, t, w, s) ∈ expand(vi)
6     do p[vj] += p[vi];
▷ return the total number of paths in the word graph
7 return p[F];

Figure 3.2: Counting the number of paths in a word graph.
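The counting scheme above can be sketched in a few lines of Python. The graph below is a hypothetical example with the same path count as Figure 3.2; edge words are omitted, since only the topology matters for counting.

```python
from collections import defaultdict

def num_paths(edges, initial, final, topo_order):
    """Count the paths of an acyclic word graph.

    edges: list of (from_node, to_node) pairs;
    topo_order: all nodes, listed in topological order."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    p = defaultdict(int)          # p[v]: number of paths from `initial` to v
    p[initial] = 1
    for u in topo_order:          # each outgoing edge adds u's count to v
        for v in succ[u]:
            p[v] += p[u]
    return p[final]

# A made-up six-node graph with four distinct paths from node 0 to node 5.
example = [(0, 1), (1, 2), (2, 3), (2, 4), (3, 4), (2, 5), (3, 5), (4, 5)]
```

Here num_paths(example, 0, 5, [0, 1, 2, 3, 4, 5]) returns 4, mirroring the value obtained at the final node of Figure 3.2.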
3.3.2 Graph Word Error Rate
The graph word error rate (GER) is computed by determining which sen-
tence through the word graph best matches the actually spoken sentence.
The match criterion is defined in terms of word substitutions (SUB), dele-
tions (DEL) and insertions (INS). This measure provides a lower bound of
the word error rate for this word graph. The algorithm for its computation is very similar to the one that computes the string edit distance. The following procedure, namely GER, computes the GER given the word graph wg and the reference sentence s of length T.
GER(s)
Initial:
1 t = 0; ∀ node vi ∈ V, νt[vi] = ∞; νt[I] = 0
Matching:
2 for t = 1 to T
3   do for each node vi in topological order
4     do ▷ compute the deletion error
5       νt[vi] = min(νt[vi], νt−1[vi] + 1)
6       for each edge e(vi, vj, τ, t, w, s) ∈ expand(vi)
7         do ▷ compute the substitution error
8           νt[vj] = min(νt[vj], νt−1[vi] + δ(w(e), s[t]))
    ▷ compute the insertion error
9   for each node vi in topological order
10    do for each e(vi, vj, τ, t, w, s) ∈ expand(vi)
11      do νt[vj] = min(νt[vj], νt[vi] + 1)
    ▷ update the score table for the next loop
12  copy(νt−1, νt); ∀vi ∈ V, νt[vi] = ∞;
13 return νT[F]
Let νt[vi] denote the total cost when the matching process is at graph node vi and sentence position t. The algorithm is based on dynamic programming. Specifically, at each position t, considering the word s[t] of the reference sentence s, the value νt[vi] is computed by choosing the minimum cost, in terms of INS, DEL and SUB, when comparing the words of the outgoing edges e(vi, vj) with the word s[t]. This value is then compared with the previous values νt−1[vi] and νt[vi], and the minimum is chosen. The value νT[F] at the final position T of the reference sentence and at the final node
F of the word graph is the GER.
The GER is one of the main criteria for evaluating the quality of a word
graph. All the results in Chapter 5 are given in terms of GER.
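The dynamic program above can be sketched in Python as follows, assuming the word graph is given as a list of (from-node, to-node, word) edges plus a topological ordering of its nodes; the representation and the example below are illustrative, not the thesis implementation.

```python
def graph_wer(edges, initial, final, topo, ref):
    """Graph word error rate: the minimum number of substitutions,
    deletions and insertions over all paths of the graph vs. `ref`."""
    INF = float('inf')
    inc = {v: [] for v in topo}               # incoming (pred, word) per node
    for u, v, w in edges:
        inc[v].append((u, w))
    # Row t = 0: reaching node v consumes no reference word, so every
    # graph word along the way counts as an insertion.
    prev = {v: INF for v in topo}
    prev[initial] = 0
    for v in topo:
        for u, w in inc[v]:
            prev[v] = min(prev[v], prev[u] + 1)
    for t in range(1, len(ref) + 1):
        cur = {v: INF for v in topo}
        for v in topo:
            cur[v] = min(cur[v], prev[v] + 1)          # deletion of ref[t-1]
            for u, w in inc[v]:                        # substitution / match
                cur[v] = min(cur[v], prev[u] + (0 if w == ref[t - 1] else 1))
        for v in topo:                                 # insertion of a graph word
            for u, w in inc[v]:
                cur[v] = min(cur[v], cur[u] + 1)
        prev = cur
    return prev[final]
```

For a toy graph containing the two paths "the cat sat" and "the cap sat", the GER of the reference "the cat sat" is 0, while "the dog sat" gives 1 (one substitution).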
3.4 Removing Empty-Edges
Empty edges, denoted by the label @BG in the word graph, represent the background-noise hypotheses produced by the decoder. They are assumed not to carry any meaning that has to be preserved for the language processing steps that follow recognition. Removing empty edges also simplifies other algorithms operating on the word graph, in our case the node-merging and word graph expansion algorithms, which will be described in Section 3.6 and Section 3.7. In this section we describe the algorithm for removing all empty edges from the word graph.
3.4.1 Algorithm
• For each node vi in topological order:
  – Get the list of nodes {vj} which can be reached from vi through empty transitions, together with the corresponding scores map[vj]. By map[vj] we mean the best accumulated score of the paths from vi to vj through empty transitions.
  – For each node vj:
    ∗ For each edge e(vj, vk, τ, t, w, s) with w ≠ empty:
      · update the score of edge e: s(e) = s(e) + map[vj];
      · insert-Edge(vi, vk, e)
    ∗ remove-Edge(vi, vj, w)
3.4.2 Implementation Details
Notation
▷ return the list of nodes and scores reachable from vi through empty transitions.
▷ The corresponding scores are kept in map.
▷ Here map[vj] is the accumulated score of @BG links from node vi to vj.
get-Empty-Node-Scores(vi, st, map)
1 for each e(vi, vj, τ, t, w, s) ∈ Expand(vi)
2   do if w(e) = empty
3     then if (map[vj] = 0)
4       then get-Empty-Node-Scores(vj, st + s(e), map)
5         map[vj] = st + s(e)
6       else tmp = st + s(e)
7         if map[vj] < tmp
8           then get-Empty-Node-Scores(vj, tmp, map)
9             map[vj] = tmp
Removing empty edges can result in a larger word graph than the one containing them. Fig 3.3 shows a word graph with @BG edges, while the equivalent word graph without @BG edges is illustrated in Fig 3.4.
A very important property of word graphs without @BG edges is that all incoming edges of a node vi, ∀vi ∈ V, have the same word label, because node vi must have a unique left language-model state. In this case, we can put the word label directly on node vi, assuming that the initial node I carries the word label <s>, as shown in Fig 3.5. This property is exploited in the word graph expansion and node compression algorithms, which will be described in the next sections.
remove-Empty-Edge(vi)
1 get-Empty-Node-Scores(vi, 0, map)
2 for each vj such that map[vj] > 0
3   do for each e(vj, vk, τ, t, w, s) ∈ Expand(vj)
4     do if w(e) ≠ empty
5       then s(e) = s(e) + map[vj]
6         Insert-Edge(vi, vk, e)
7     Remove-Edge(vi, vj, w)

remove-All-Empty-Edges
1 for each node vi ∈ V taken in topological order
2   do Remove-Empty-Edge(vi)
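A minimal Python sketch of the two procedures combined is given below. It assumes the word graph is stored as an adjacency map from each node to its (destination, word, log-score) edges, with higher scores being better; the node names and scores in the usage example are made up.

```python
def remove_empty_edges(succ, topo, empty='@BG'):
    """Bypass all empty edges, processing nodes in topological order.

    succ[v]: list of (dest, word, score) edges leaving node v.
    For each node vi, first collect the best accumulated score of the
    empty-edge paths to every reachable node (the role of map[vj] in
    get-Empty-Node-Scores), then copy each non-empty edge leaving a
    reachable node back to vi, and finally drop vi's empty edges."""
    for vi in topo:
        best = {}                                     # best empty-path score
        stack = [(vj, s) for (vj, w, s) in succ[vi] if w == empty]
        while stack:
            vj, acc = stack.pop()
            if vj in best and best[vj] >= acc:
                continue
            best[vj] = acc
            for (vk, w, s) in succ[vj]:
                if w == empty:
                    stack.append((vk, acc + s))
        for vj, acc in best.items():                  # rewire non-empty edges
            for (vk, w, s) in succ[vj]:
                if w != empty:
                    succ[vi].append((vk, w, s + acc))
        succ[vi] = [(vk, w, s) for (vk, w, s) in succ[vi] if w != empty]
    return succ
```

For example, a graph with an edge 0 →@BG 1 of score −1.0 and an edge 1 →"b" 2 of score −0.5 ends up with a direct edge 0 →"b" 2 of score −1.5.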
3.5 Forward-Backward Pruning
Since the word graph produced from the hypotheses directly output by the decoder can be very large, it is essential to use pruning methods to generate compact word graphs. Three different pruning techniques are considered:
• Forward pruning. At each time frame, only the most promising hypotheses are retained in the one-pass search. This pruning technique is usually called "beam search".
• Forward-backward pruning. It consists of two passes, applied after the generation of a huge word graph, and employs a beam with respect to the forward and backward scores of each hypothesis. Strictly speaking, for every word graph arc representing a word hypothesis (w, τ, t), with start time τ, end time t and its score, we compute the overall score of the best path passing through this specific arc. Word arcs whose scores are relatively close to the global score of the best path are kept in the word graph, while the others are
Figure 3.3: A word graph with @BG edges.
Figure 3.4: A word graph with @BG edges removed.
Figure 3.5: A word graph with words placed on nodes.
pruned.
• Node compression. This pruning technique is based on the topology of the word graph: if two nodes vi and vj have the same incoming or outgoing edges, they can be merged.
Experiments with all the pruning techniques mentioned above are given in Chapter 5. In this section we present forward-backward pruning; the next section describes the node compression algorithm.
First, the edge posterior probability is introduced. This posterior is used not only in forward-backward pruning but also in confusion network construction, which will be described in Chapter 4.
3.5.1 Edge Posterior Probability
In a word graph, each edge is labeled with a word, its starting and ending nodes, and a likelihood score computed from the acoustic and language models, as shown in the previous section. From these edge likelihood scores, a posterior probability p(e|X) can be calculated for each edge e. This is done by an algorithm very similar to the forward-backward algorithm used to train HMMs.
Definition
The edge posterior p(e|X) is defined as the sum of the probabilities of all paths q passing through the link e, normalized by the probability of the signal p(X):

p(e|X) = ( ∑_{q: e ∈ q} p(q, X) ) / p(X)        (3.9)

where:
• p(X) is approximated by the sum over all paths through the word graph.
Figure 3.6: Link posterior computation.
• p(q, X) is the probability of path q, composed of the acoustic likelihood pac(X|q) and the language model likelihood plm(W), as given in Eq. 3.2.
• A forward-backward based algorithm is used to compute p(e|X).
3.5.2 Forward-Backward Based Algorithm
For each node v ∈ V, two quantities are defined, the forward probability and the backward probability, as follows:
• α(v) is the sum of the likelihoods of all paths from node I to node v
• β(v) is the sum of the likelihoods of all paths from node v to node F.
We have:

α(v) = ∑_{e: f(e)=v} s(e) α(b(e))        (3.10)

β(v) = ∑_{e: b(e)=v} s(e) β(f(e))        (3.11)

From α(v) and β(v) we obtain the posterior of a given edge e ∈ E:

p(e|X) = α(b(e)) s(e) β(f(e)) / p(X)        (3.12)
An example of edge posterior computation is illustrated in Fig 3.6.
3.5.3 Implementation Details
This section describes the implementation of the forward-backward algorithm for the computation of the edge posteriors. The details of this algorithm can also be found in [Wessel, 2002]. As we can see, the Forward-Backward procedure consists of two phases.
In the first phase, the forward pass, all the initial values α[vi], ∀vi ∈ V, are set to log 0, except the value at the initial node, α[I], which is set to 0. Then the algorithm traverses all the nodes vi ∈ V, vi ≠ I, in topological order and computes the value α[vi] for each node by using Eq. 3.10 (from line 4 to line 6).
In the second phase, the backward pass, the values β[vi] are computed by using Eq. 3.11, in reversed topological order (from line 10 to line 12). After that, posterior probabilities are computed for the edges by exploiting Eq. 3.12.
Two things are worth mentioning here. First, logPlus(a, b) returns the log of the sum of two quantities expressed in logarithms. Second, the values α[F] and β[I] are both equal to the probability of all paths in the word graph.
3.5.4 Forward-Backward Based Pruning
One of the applications of edge’s posteriors is for word graph pruning.
In [Sixtus, 1999], the authors proposed a method to gain high quality word
graphs by using the forward-backward based pruning algorithm. In fact,
the forward-backward was used to compute the overall score of the best
path traversing a given arc. Word arcs whose scores are relatively close to
the global score of the best path are kept in the word graph, while the others
are pruned. Implementation details are given in the prune-Posterior(t)
procedure, which input is a threshold t. The algorithm works by traversing
Forward-Backward(wg)
▷ compute the posterior for all edges e ∈ wg
▷ based on the forward-backward algorithm.
▷ forward pass
▷ α[N] local variable for the forward probability α
1 for vi = I to F
2   do α[vi] = −∞;
3 α[I] = 0
4 for each node vi = I + 1 to F in topological order
5   do for each edge e(vj, vi, τ, t, w, s) ∈ Inpand(vi)
6     do α[vi] = logPlus(α[vi], α[vj] + s(e))
▷ backward pass
▷ β[N] local variable for the backward probability β
7 for vi = I to F − 1
8   do β[vi] = −∞;
9 β[F] = 0
10 for each node vi = F − 1 down to I in reversed topological order
11   do for each edge e(vi, vj, τ, t, w, s) ∈ Expand(vi)
12     do β[vi] = logPlus(β[vi], β[vj] + s(e))
▷ edge posterior computation
13 for each node vi = I to F
14   do for each edge e(vi, vj, τ, t, w, s) ∈ Expand(vi)
15     do p(e) = (α[vi] + s(e) + β[vj]) − α[F];
all the edges in the word graph and removing the edges whose posterior probabilities are lower than the given threshold t. Practically, the threshold value t is chosen such that an edge e is removed if p(e) < t, where the log posterior p(e) is computed in the Forward-Backward procedure. The experimental results of this algorithm are given in Chapter 5, and show that it dominates the other pruning algorithms.
prune-Posterior(t)
▷ prune the word graph based on edge posteriors
1 for each node vi = I to F
2   do for each edge e(vi, vj, τ, t, w, s) ∈ Expand(vi)
3     do if p(e) < t
4       then remove-Edge(vi, vj, e)
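The logPlus operation and the posterior computation can be sketched as follows; this simplified Python version keeps edges as (from, to, log-score) triples instead of the full e(vi, vj, τ, t, w, s) records, and prune_posterior mirrors the prune-Posterior procedure.

```python
import math

def log_plus(a, b):
    """log(exp(a) + exp(b)), computed stably in the log domain."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def edge_posteriors(edges, topo):
    """Log posterior of every edge; topo lists the nodes in topological
    order, with the initial node first and the final node last."""
    I, F = topo[0], topo[-1]
    alpha = {v: -math.inf for v in topo}
    alpha[I] = 0.0
    for v in topo[1:]:                   # forward pass (Eq. 3.10)
        for (u, d, s) in edges:
            if d == v:
                alpha[v] = log_plus(alpha[v], alpha[u] + s)
    beta = {v: -math.inf for v in topo}
    beta[F] = 0.0
    for v in reversed(topo[:-1]):        # backward pass (Eq. 3.11)
        for (u, d, s) in edges:
            if u == v:
                beta[v] = log_plus(beta[v], beta[d] + s)
    # log p(e|X) = alpha(b(e)) + s(e) + beta(f(e)) - alpha(F)   (Eq. 3.12)
    return [alpha[u] + s + beta[d] - alpha[F] for (u, d, s) in edges]

def prune_posterior(edges, topo, t_log):
    """Keep only the edges whose log posterior reaches the threshold."""
    post = edge_posteriors(edges, topo)
    return [e for e, p in zip(edges, post) if p >= t_log]
```

With two parallel edges of likelihood 0.6 and 0.4 between the initial and final nodes, the posteriors come out as 0.6 and 0.4, and pruning with a threshold of 0.5 keeps only the first edge.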
3.6 Node Compression
In this section, a word graph pruning technique is introduced which exploits
the graph topology [Weng, 1999]. Implementation details are also given.
Results on the application of this algorithm are given in Chapter 5.
The idea of this algorithm is to combine identical sub-paths in the word graph so that redundant nodes and edges are removed. The key observation underlying the algorithm is that if two nodes vi and vj have the same word label, w(vi) = w(vj), and meet one of the following criteria:
• Inpand(vi) = Inpand(vj)
• Expand(vi) = Expand(vj)
• (Inpand(vi) ⊂ Inpand(vj) or Inpand(vj) ⊂ Inpand(vi) ) or
(Expand(vi) ⊂ Expand(vj) or Expand(vj) ⊂ Expand(vi) )
then the two nodes vi and vj can be merged without changing the language of the word graph, where the language of a word graph is defined as the set of all word strings starting at the initial node and ending at the final node. The implementation of this algorithm is given in the procedure named Node-Compression. As we can see, the key operation of the procedure is at line 7, where the two nodes vi and vj are merged. The merging operation is only performed when both vi and vj have the same set of outgoing edges
Node-Compression(wg)
1 finish = false;
2 while (!finish)
3   do finish = true;
4     for each node v ∈ V in reverse topological order
5       do for each pair of predecessor nodes (vi, vj) of node v
6         do if Expand(vi) = Expand(vj)
7           then merge-Node(vi, vj)
8             finish = false;
or incoming edges. The loop from line 2 to line 8 terminates when no pair of nodes satisfies the condition in line 6.
The merge-Node(vi, vj) procedure, which merges the two nodes vi and vj, consists of two phases. In the first phase, all the outgoing edges of node vj are copied and added to the set of outgoing edges of node vi. Similarly, in the second phase, all the incoming edges of node vj are copied and added to the set of incoming edges of node vi. In case a copied edge of node vj is already among the edges incoming to or outgoing from node vi, the score of that edge is updated by keeping the best one, as specified at line 3 and line 9 of the merge-Node(vi, vj) procedure.
The experimental results of this algorithm can be found in Chapter 5.
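A simplified sketch of the merging idea is given below. It covers only the case of identical outgoing edge sets with equal word labels (the subset criteria and the score update of merge-Node are omitted), and the diamond-shaped example is hypothetical.

```python
def compress(nodes, edges, word):
    """Merge nodes with the same word label and identical outgoing edges.

    nodes: list of node ids; edges: set of (from, to, word) triples;
    word[v]: word label placed on node v. Iterates to a fixed point,
    redirecting every edge of the merged node to the kept node."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        out = {v: frozenset((w, d) for (u, d, w) in edges if u == v)
               for v in nodes}
        groups = {}
        for v in nodes:
            key = (word.get(v), out[v])
            if key in groups:                       # same label, same Expand
                keep = groups[key]
                edges = {(keep if u == v else u, keep if d == v else d, w)
                         for (u, d, w) in edges}
                nodes = [n for n in nodes if n != v]
                changed = True
                break
            groups[key] = v
    return nodes, edges
```

In a diamond 0 →"a" 1 →"b" 3 and 0 →"a" 2 →"b" 3, where nodes 1 and 2 both carry the label "a" and share their outgoing edges, the two middle nodes collapse into one, leaving only two edges.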
3.7 Word-Graph Expansion
3.7.1 Introduction
Since the cost of generating bigram word graphs is small, it is attractive to generate bigram word graphs and then expand them to m−grams. An m−gram word graph is a word graph whose transitions (edges) carry m−gram probabilities. Thus with an m−gram word
merge-Node(vi, vj)
▷ Merge two nodes vi and vj
▷ First copy the incoming edges of vj
1 for each edge e(vk, vj, w, s)
2   do if ∃ e′(vk, vi, w, s′)
3     then s′ = Min(s, s′)
4     else enew = create-Edge(w, s)
5       insert-Edge(vk, vi, enew)
6 ▷ Now copy the outgoing edges of vj
7 for each edge e(vj, vk, w, s)
8   do if ∃ e′(vi, vk, w, s′)
9     then s′ = Min(s, s′)
10    else enew = create-Edge(w, s)
11      insert-Edge(vi, vk, enew)
graph, more context information is encoded and can be used in the subsequent processing steps. In this section the method for word graph expansion by [Weng, 1999] is described. For simplicity, our discussion is focused on trigram word graphs only, but the algorithms described can easily be generalized to m−gram models of higher order.
Before describing the algorithms in detail, we make the following assumption:
• ∀ei, ej ∈ E, if f(ei) = f(ej) then w(ei) = w(ej)
The assumption ensures that all edges ei incoming to a node vi have the same word label. We can obtain this property by running remove-Empty-Edge after generating the word graph. With this assumption we can put the word label w(ei) on node vi and add the following accessor to nodes:
• w(vi) returns the word label of node vi. Moreover, we simply denote by wi the word label of node vi.
A consequence of this is that:
• ∀vi ∈ V, wi ≠ empty
In the following subsections, two algorithms for word graph expansion are described, namely the conventional algorithm and the compaction algorithm. The first simply traverses the word graph and duplicates nodes and edges to ensure that all edges in the word graph contain trigram language model scores. The compaction algorithm takes the back-off property of the language model into account when duplicating nodes: duplications are only performed where true trigram language model scores exist. Naturally, the second approach keeps the word graph as small as possible.
3.7.2 Conventional Algorithm
To place trigram probabilities on the graph edges, we must create a unique two-word context for each edge. This is very similar to the definition of the language model state, L-state(w), as given in Section 3.2.3. For example, in Fig 3.7, case a, a node with word label w4 has its edges duplicated to guarantee the uniqueness of the trigram contexts for placing p(w5|w1w4) and p(w5|w2w4) on edges e41 and e51, respectively. When a central node has two predecessor nodes labeled with the same word, only one additional node and its corresponding outgoing edges need to be duplicated. This case is illustrated in Fig 3.7, case b: the central node labeled with word w4 has two predecessor nodes carrying the same word label w1, so only one new node with word label w4 is duplicated. The conventional trigram expansion algorithm, given in the procedure named conventional-Expansion, works by duplicating nodes and edges in the manner indicated.
As we can see, at a given node vj, two loops are performed at line
Figure 3.7: Illustration of the conventional word graph expansion algorithm. Case a: a bigram word graph before expansion and the word graph after expansion. Case b: a bigram word graph before expansion and the word graph after expansion.
3 and line 4, over its incoming and outgoing edges, respectively. Specifically, in the outermost loop,
• ea(vi, vj, wj) is an incoming edge of node vj. By the above assumption, the word labels of vi and vj are wi and wj, respectively.
Similarly, in the inner loop,
• eb(vj, vk, wk) is an outgoing edge from node vj to node vk, which is labeled with word wk.
At line 5, if a node v? with word label wj and trigram context (wi, wj, wk) has already been created, we just need to insert edge eb from node v? to node vk (line 6). Otherwise, at line 7, node v? is duplicated from node vj and the word label wj is also assigned to node v?. Then, at line 8, edge ea is placed between nodes vi and v?.
The key of the algorithm is at lines 9 and 10, where edge eb is updated
conventional-Expansion
1 ▷ The conventional word graph expansion
2 for each vj ∈ V taken in topological order
3   do for each ea(vi, vj, wj) ∈ Inpand(vj)
4     do for each eb(vj, vk, wk) ∈ Expand(vj)
5       do if exists(v?, wj, (wi, wj, wk))
6         then insert-Edge(v?, vk, eb)
7         else v? = dup-Node(vj)
8           insert-Edge(vi, v?, ea)
9           lm(eb) = p(wk|wi, wj)
10          insert-Edge(v?, vk, eb)
11 remove-Node(vj)
with the new trigram language model score and then inserted between node v? and node vk. Now the edge between v? and vk carries exactly the trigram language model probability p(wk|wi, wj).
When all the incoming edges of node vj have been examined, the node can be safely removed, as in line 11.
3.7.3 Compaction Algorithm
The above algorithm considerably increases the word graph size. In fact, for most trigram language models, the number of observed trigrams is much smaller than the number of all possible trigrams. It is therefore attractive to share the bigram back-off weights across trigram contexts, since in this case we need to duplicate only enough nodes to uniquely represent the explicit trigram probabilities in the word graph.
The idea underlying the algorithm is to factorize the backed-off trigram probability p(wi+2|wi, wi+1) into the back-off weight bo(wi, wi+1) and the bigram probability p(wi+2|wi+1), and to multiply the back-off weight into the weight of the edge e(vi, vi+1), while keeping only the bigram estimate
Figure 3.8: Illustration of the compact word graph expansion, where an explicit trigram probability exists only for (w1, w4, w5).
on the edge e(vi+1, vi+2). Thus no node duplication is required. Since back-off weights and probabilities are multiplied, the total score along a path from vi through vi+1 to vi+2 includes the correct trigram probability p(wi+2|wi, wi+1).
Fig 3.8 illustrates the compact expansion idea, given that there is only one explicit trigram probability, p(w5|w1, w4). Notice that only one new node labeled with word w4 is duplicated, together with its incoming edge e14 and its outgoing edge e41. The trigram probability p(w5|w1, w4) is placed on the edge e41. The language model score on edge e14 is copied directly from the language model score on edge e1. After the explicit trigrams are processed, the language model scores on the edges leaving the original node labeled with word w4, which are e42 and e5 on the right-hand side of Fig 3.8, are set to the corresponding bigram probabilities p(w5|w4) and p(w6|w4). Furthermore, the bigram back-off weights bo(w1, w4), bo(w2, w4) and bo(w3, w4) are multiplied into the corresponding incoming edges e1, e2, e3 of the original node labeled with word w4.
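The factorization can be illustrated numerically; the log-probabilities below are made-up values, not taken from the thesis or from any real language model.

```python
import math

# Hypothetical model: one explicit trigram, two bigrams, two back-off weights.
trigram = {('w1', 'w4', 'w5'): math.log(0.5)}
bigram = {('w4', 'w5'): math.log(0.2), ('w4', 'w6'): math.log(0.1)}
backoff = {('w1', 'w4'): math.log(0.9), ('w2', 'w4'): math.log(0.7)}

def trigram_logprob(w1, w2, w3):
    """log p(w3|w1,w2): the explicit trigram if present, otherwise the
    back-off factorization bo(w1,w2) + log p(w3|w2) used by the
    compaction algorithm (a missing back-off weight defaults to log 1)."""
    if (w1, w2, w3) in trigram:
        return trigram[(w1, w2, w3)]
    return backoff.get((w1, w2), 0.0) + bigram[(w2, w3)]
```

Here p(w5|w1,w4) = 0.5 comes from the explicit trigram and stays on the outgoing edge of the duplicated node, while p(w5|w2,w4) = 0.7 · 0.2 = 0.14 is obtained by leaving the bigram 0.2 on the outgoing edge and multiplying the back-off weight 0.7 into the incoming edge.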
As we can see, at a given node vj, the compaction-Expansion procedure also contains two loops, at line 2 and line 3. In case there is an explicit trigram language model probability for wi, wj, wk (see line 4), a new node is duplicated together with its new incoming and outgoing edges. Lines
compaction-Expansion
▷ The compaction word graph expansion
1 for each vj ∈ V taken in topological order
2   do for each ea(vi, vj, wj) ∈ Inpand(vj)
3     do for each eb(vj, vk, wk) ∈ Expand(vj)
4       do if trigram(wi, wj, wk)
5         then if exists(v?, wj, (wi, wj, wk))
6           then insert-Edge(v?, vk, eb)
7           else v? = dup-Node(vj)
8             insert-Edge(vi, v?, ea)
9             lm(eb) = p(wk|wi, wj)
10            insert-Edge(v?, vk, eb)
11        else mark(ea)
12          mark(eb)
13    if !(mark(ea))
14      then remove-Edge(vi, vj, ea)
15      else lm(ea) = lm(ea) · bo(wi, wj)
16 for each eb(vj, vk, wk) ∈ Expand(vj)
17   do if !(mark(eb))
18     then remove-Edge(vj, vk, eb)
19     else lm(eb) = p(wk|wj)
20 if !(mark(ea)), ∀ea(vi, vj, wj) ∈ Inpand(vj)
21   then remove-Node(vj)
between 6 and 10 correspond exactly to the conventional-Expansion procedure. On the contrary, when no explicit trigram exists, no new node has to be duplicated; we just need to update the language model scores by using the back-off factors.
The experiments on the word graph expansion algorithms will be given in Chapter 5.
Chapter 4
Word Graph Decoding
Word graph decoding is the process of finding the best sentence or the N−best sentences through the word graph. In this chapter, three word graph decoding algorithms are fully presented, namely 1-best word graph decoding, N−best word graph decoding and, finally, consensus decoding.
4.1 1-Best Word Graph Decoding
Finding the best sentence in the word graph is equivalent to the shortest path problem in graph theory [Thomas, 2001]. In fact, the best sentence corresponds to the longest path, i.e. the path that has the highest score in the word graph. In general, the problem of 1-best word graph decoding can be formulated as follows: given a word graph G = (V, E, I, F), find the paths with the highest score from I to some vj ∈ V. When vj is equal to F, we have the complete best path, or simply the best path. Let d(vj) denote the best score of a path starting at the initial node I and ending at node vj. By using dynamic programming, the following recurrence can be formulated:

d(vj) = 0 if vj = I, and otherwise
d(vj) = max_{e: b(e)=vi, f(e)=vj} ( d(vi) + s(e) )        (4.1)
The interesting property of Eq. 4.1 is that each edge is visited only once. Hence, the best score up to node vj can be found by just comparing the values d(vi) + s(e) of all incoming edges e from nodes vi to node vj.
The implementation of this algorithm is straightforward. However, we are interested not only in the best score but also in the word sequence of the best path. Therefore, a back-pointer list is needed in order to trace back the best word sequence from the final node F to the initial node I. For this, to each node vj ∈ V an entry is associated as follows:

entry[vj] = ( vi → the best predecessor of node vj,
              w → the word label of the edge e(vi, vj),
              s → the best score up to node vj )
The 1Best-Decoding procedure consists of two passes. In the forward pass, dynamic programming is applied to compute the best path from the initial node I to the final node F; specifically, lines 7 to 11 implement Eq. 4.1. When the final node F is reached, the value of entry[F].s is the score of the best path. In the backward pass, the best word sequence is found by back-tracking from entry[F] to entry[I] using the back-pointer information mentioned above.
This algorithm is also referred to as 1-best Viterbi decoding. Experimental results of the algorithm on word graphs are given in Chapter 5, where we will compare different word graph rescoring methods.
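The two passes can be sketched in Python as follows, assuming edges are (from, to, word, log-score) tuples and topo lists the nodes in topological order with I first and F last; the representation is an assumption, not the thesis code.

```python
import math

def viterbi_best_path(edges, topo):
    """1-best Viterbi decoding: returns (best score, best word sequence)."""
    I, F = topo[0], topo[-1]
    best = {v: -math.inf for v in topo}   # plays the role of entry[v].s
    best[I] = 0.0
    back = {}                             # entry[v].f and entry[v].w
    for v in topo[1:]:                    # dynamic programming pass
        for (u, d, word, s) in edges:
            if d == v and best[u] + s > best[v]:
                best[v] = best[u] + s
                back[v] = (u, word)
    words, v = [], F                      # back-tracking pass
    while v != I:
        u, word = back[v]
        words.append(word)
        v = u
    return best[F], words[::-1]
```

For instance, with two parallel edges "a" (score −1) and "b" (score −2) followed by an edge "c" (score −1), the procedure returns the score −2 and the sequence "a c".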
4.2 N-Best Decoding
The problem of finding the N shortest paths of a weighted directed graph is a well-studied problem in computer science [Eppstein, 1998a]. The problem also admits a number of variants, such as finding just the N shortest paths with no cycle, or the N shortest paths with distinct scores, which
1Best-Decoding
▷ initialization
1 for each node vi = I to F
2   do entry[vi].f = −1
3     entry[vi].s = −∞
4     entry[vi].w = empty
▷ the search starts from the initial node
  entry[I].s = 0
▷ Dynamic programming pass
5 for each node vj = I to F in topological order
6   do for each edge e(vi, vj, τ, t, w, s) ∈ Inpand(vj)
7     do score = entry[vi].s + s(e)
8       if (entry[vj].s < score)
9         then entry[vj].s = score
10          entry[vj].f = vi
11          entry[vj].w = w(e)
▷ Back-tracking pass
12 f = F
13 repeat
14   to = f
15   word = entry[to].w
16   push word to sentence
17   f = entry[to].f
18 until f = I
have all been studied extensively as well. An efficient algorithm introduced by [Eppstein, 1998b] finds an implicit representation of the N shortest paths (allowing cycles and multiple edges) between two nodes in time O(|E| + |V| · log |V| + N). However, as mentioned in the previous section, the related problems that arise in speech recognition applications are quite different from the problem in graph theory. Concretely, the weighted graphs considered in our applications are the word graphs defined in Section 3.1, and the problem of determining the N shortest paths is interpreted as determining the N best paths, i.e. the paths with the highest scores, in word graphs.
The N word sequences corresponding to the N best paths are called the N−best word sequences, N−best sentences, N−best hypotheses or, more simply, the N−best list. It is often desirable to determine not just the N best word sequences in a word graph, but the N best distinct word sequences, also called the N−different best list. In the following subsections, we describe two algorithms for finding the N best word sequences in a word graph, namely the stack-based N−best decoder and the exact N−best decoder. Experimental results are given in Chapter 5 for speech recognition and in Chapter 6 for speech translation.
4.2.1 The Stack-Based N-Best Word Graph Decoding
In this section, a simple method is presented for finding all complete paths in a word graph, from the initial node I to the final node F, whose scores are within a prescribed threshold th of the best path score. Informally, the algorithm first computes the score of the best path using the 1Best-Decoding procedure (Section 4.1) and then compares it with the scores of all the other paths in the word graph. The paths whose scores are within the threshold th of the best score are output. [Thomas, 1984] proposed an efficient implementation of this algorithm, which uses a push-down (last-in, first-out) stack and has modest memory requirements.
The algorithm can be described as follows. Let dβ(vi) denote the score of the best path from the current node vi to the final node F, and let us assume that there exists a partial path p with score dα(vi), starting at the initial node I and ending at the current node vi. The edge e(vi, vj, w, τ, t, s) is said to enter the path p if:

dα(vi) + s + dβ(vj) ≥ dβ(I) − th        (4.2)
where dα(vi) + s + dβ(vj) is the score of the complete path extending p and dβ(I), by its definition, is the score of the best path of the word graph.
Hence, the edges that enter are exactly those on paths from node I to node F having scores within th of the best path score. The depth-first procedure lets us use the same path p for each entry in the stack, which allows very small objects to be put on the stack. In fact, each entry in the stack has three attributes:

entry = ( vi → the next-to-last node,
          vj → the last node,
          c → the score of the path (I, ..., vi, vj) having the sub-path (I, ..., vi) in p )

Since the word graph is an acyclic directed graph, the number of elements in the stack at any time is at most the number |E| of edges in the word graph.
The detailed implementation of the algorithm is given in the Stack-Decoding procedure. In the initialization step, the scores dβ[vi], ∀vi ∈ V, of the best path from node vi to the final node F are computed using the 1Best-Decoding procedure, run backward from the ending node F. The initial entries of the stack are also created during this step, at lines 3 and 4. The key of the algorithm is at line 13: only edges through which some path satisfies Eq. 4.2 are put on the stack. When the final node F is reached, the complete path p, whose score is within th of the best path score, is output.
The main problem of this algorithm is how to choose a suitable value for the threshold th. Experimentally, we first compute the score of the best path and then try values of th until approximately N best sentences are found.
Stack-Decoding
▷ Compute dβ[vi], ∀vi ∈ V
1 1Best-Decoding(F)
▷ Create the initial path p with node I.
2 set p = (I)
▷ Create the initial stack.
3 for each e(I, vj, τ, t, w, s) ∈ expand(I) satisfying Eq. 4.2
4   do create an entry(I, vj, s) in the stack.
▷ The main algorithm.
5 while the stack is not empty
6   do
7     pop the topmost entry(vi, vj, c) from the stack
8     replace p = (I, ..., vi) by p = (I, ..., vi, vj).
9     if (vj ≠ F)
10      then
11        vi = vj
12        d = c
13        for each e(vi, vj, τ, t, w, s) ∈ expand(vi) satisfying Eq. 4.2
14          do
15            c = d + s(e)
16            create an entry(vi, vj, c)
17      else
18        output p and c

In Fig 4.1, the optimal path from the initial node I = v1 to the final node F = v9 has a cost of −13 and passes through the set of nodes {v1, v3, v6, v8, v9}. Table 4.1 shows how the algorithm computes all paths from node I to node F whose lengths are within 2.6 units of the optimal path length.
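The stack-based decoder can be sketched as follows. Unlike the implementation above, which shares a single path p among all stack entries, this Python version copies the word list into each entry for clarity; edges are assumed to be (from, to, word, log-score) tuples.

```python
import math

def stack_nbest(edges, topo, th):
    """All complete paths whose score is within `th` of the best score.

    First computes d_beta[v], the best score from v to the final node
    (a backward Viterbi pass), then expands partial paths depth-first,
    pushing an edge only when Eq. 4.2 holds for it."""
    I, F = topo[0], topo[-1]
    d_beta = {v: -math.inf for v in topo}
    d_beta[F] = 0.0
    for v in reversed(topo[:-1]):
        for (u, d, w, s) in edges:
            if u == v:
                d_beta[v] = max(d_beta[v], s + d_beta[d])
    hyps = []
    stack = [(I, [], 0.0)]                # (node, words so far, d_alpha)
    while stack:
        v, words, d_alpha = stack.pop()
        if v == F:
            hyps.append((d_alpha, words))
            continue
        for (u, d, w, s) in edges:
            if u == v and d_alpha + s + d_beta[d] >= d_beta[I] - th:
                stack.append((d, words + [w], d_alpha + s))
    return sorted(hyps, reverse=True)
```

With a small two-path graph, widening th from 1.0 to 2.0 grows the output from the single best path to both paths, which mirrors the way th is tuned in practice to obtain approximately N hypotheses.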
4.2.2 Exact N-Best Decoding
The N−best decoding algorithm presented in the previous section is simple and fast. However, there is the problem of specifying the thresh-
Figure 4.1: A word graph example for the N−best stack-based decoding, with nodes v1, ..., v9, initial node I = v1, final node F = v9, and edge costs e1 = −2, e2 = 0, e3 = −2, e4 = −6, e5 = −8, e6 = −3, e7 = −5, e8 = −3, e9 = −4, e10 = −4, e11 = −5, e12 = −6.
Table 4.1: Illustration of the stack-based N−best decoding.

Step | Entries (vi, vj, c)          | Path P
1    | (v1, v2, −2), (v1, v3, 0)    | {v1}
2    | (v2, v4, −4), (v1, v3, 0)    | {v1, v2}
3    | (v4, v7, −9), (v1, v3, 0)    | {v1, v2, v4}
4    | (v7, v9, −14), (v1, v3, 0)   | {v1, v2, v4, v7}
5    | (v1, v3, 0)                  | Output path {v1, v2, v4, v7, v9}, −14
6    | (v3, v6, −3)                 | {v1, v3}
7    | (v6, v8, −7)                 | {v1, v3, v6}
8    | (v8, v9, −13)                | {v1, v3, v6, v8}
9    |                              | Output path {v1, v3, v6, v8, v9}, −13
old th. Experimentally, the score of the best path is determined by the 1Best-Decoding procedure, and the threshold value th is then chosen so that we obtain approximately N best sentences. In this section, an exact N−best decoding algorithm based on the word graph [Tran, 1996] is carefully described. The complexity of this algorithm is higher than that of the previous one, but with an efficient implementation we can obtain the expected results while keeping the processing time low.
The principle of the approach is based on the following consideration:
when several paths lead to the same node in the word graph, according to
the Viterbi criterion, only the best scored path is expanded. The remaining
paths are not considered any more due to this recombination.
Assuming that the first best sentence hypothesis was found by the
Viterbi decoding through a given word graph, the second best path is
the path which competed with the best one but was recombined at some
node of the best path. Thus in order to find the second best sentence hy-
pothesis, we have to consider all possible partial paths in the word graph
which reach some node of the best path and might share the remaining
section with the best path. By applying this procedure repeatedly, N -best
sentence hypotheses can be successively extracted from a given word graph.
More specifically, the best path can be determined simply by comparing
the accumulative scores of all possible paths leading to the final node of
the word graph. In order to ensure that this best word sequence is not
taken into account while searching for the second best path, the complete
path is copied into a so-called N -best tree. Using this structure, a back-
ward cumulative score for each word copy is computed and stored at the
corresponding tree node. This allows for fast and efficient computation
of the complete path scores required to determine the next best sentence
hypotheses. The second best sentence hypothesis can be found by taking
the path with the best score among the candidate paths which might share
a remaining section of the best path. The partial path of this sentence hy-
pothesis is then copied into the N -best tree. Assuming the N -best paths
have been found, the (N + 1)-th best path can be determined by exam-
ining all existing nodes in the N -best tree, because it can share the last
part of some path among the top N paths. Thus it is important to point
out that this algorithm performs a full search within the word graph and
delivers exact results as defined by the word graph structure. The detailed
algorithm is given in the Exact N-Best procedure.
Fig 4.2 shows the graph example of Fig 4.1 with an additional node v10
and an edge e13 with score 0. Note that the newly added node and edge do
not change the results of the N−best decoding. Some notes on the algorithm
are given below and in Fig 4.3 and Fig 4.4.
• Line 1 : perform the 1-best decoding to find the best path in the word
graph.
• Line 2 : for each edge e ∈ E ending at node vj, compute FwSco(vj, e),
the forward score of the best partial path leading to it.
• Line 3 : the initial N−best tree is created by using information from
the best path.
Fig 4.3 shows the FwSco and BwSco of the initial N−best tree. It is
important to notice that, by the definitions of FwSco(vj, ea) and BwSco(vi, eb),
we can compute the complete path score with the following equation:

Ψ(e) = FwSco(vj, ea) + BwSco(vi, eb) + score(e)    (4.3)

where, in Eq. 4.3, b(e) = vj and f(e) = vi.
For example, let us consider the edge e10 in Fig 4.2. The complete path
score passing through this edge is given by: Ψ(e10) = FwSco(v6, e6) +
BwSco(v8, e12) + score(e10) = (−3) + (−6) + (−4) = −13.
• The algorithm is repeated from line 4 until the N−best paths are found.
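As a concrete check of Eq. 4.3, the forward and backward Viterbi scores can be computed and combined with an edge score. The sketch below uses per-node scores rather than the per-(node, edge) FwSco/BwSco structures of the thesis, and again encodes only the recoverable part of the Fig 4.2 topology:

```python
edges = [  # (name, from, to, score): Fig 4.2 scores, partial topology
    ("e1", "v1", "v2", -2), ("e2", "v1", "v3", 0),
    ("e3", "v2", "v4", -2), ("e7", "v4", "v7", -5),
    ("e11", "v7", "v9", -5), ("e6", "v3", "v6", -3),
    ("e10", "v6", "v8", -4), ("e12", "v8", "v9", -6),
]

def best_scores(edges, source, forward=True):
    """Viterbi-style best cumulative score from `source` to every node."""
    sco = {source: 0}
    for _ in edges:  # enough relaxation passes for a small acyclic graph
        for _, u, v, s in edges:
            a, b = (u, v) if forward else (v, u)
            if a in sco and sco[a] + s > sco.get(b, float("-inf")):
                sco[b] = sco[a] + s
    return sco

fw = best_scores(edges, "v1", forward=True)    # forward scores (FwSco-like)
bw = best_scores(edges, "v9", forward=False)   # backward scores (BwSco-like)
psi_e10 = fw["v6"] + bw["v8"] + (-4)           # Eq. 4.3 applied to e10
# fw["v6"] = -3 and bw["v8"] = -6, so psi_e10 = -13, matching the text.
```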
Notation
e ∈ E : an edge in the word graph, as defined in Chapter 3.
FwSco(vj, ea) : cumulative forward score of the best partial path
    leading to node vj, passing through edge ea ∈ E.
BwSco(κi, eb) : cumulative backward score from the sentence end
    to node vi ∈ Υ ⊂ V, passing through edge eb ∈ E.
Υ : the N-best tree, containing all the tree nodes (vi, eb).

Exact N-Best(wg, N)
1  Do 1Best-Decoding(wg) to find the best path.
2  Store the cumulative path score FwSco and a back-pointer for every e ∈ E.
3  Create the initial N-best tree, which contains the best path.
4  for n = 1 ... N
5      do for each node (vi, eb) ∈ Υ
6          do for each e(vj, κi) ∈ E
7              do if w(e) ≠ w(eb)
8                  then compute the complete path score
                       Ψ(e) = FwSco(vj, ea) + BwSco(κi, eb) + score(e)
9          Save the edge with the best score: e* = arg max_e Ψ(e)
10         Save the best N-best tree node: κ* = arg max_{κi} Ψ(e)
11         Trace back from e* to the sentence start in the word graph,
               using the FwSco structure stored at line 2.
12         Copy this partial path and insert it into the N-best tree at κ*
               to obtain a complete path. The newly created nodes are denoted κj.
13         Compute the backward cumulative score BwSco for the
               newly created nodes κj.
14         Output the word sequence.
Figure 4.2: A word graph example for the exact N-best decoding. (The graph of Fig 4.1, i.e. nodes v1, ..., v9 with I = v1 and F = v9 and edges e1, ..., e12 with the same scores, extended with an additional node v10 and an edge e13 of score 0.)
Figure 4.3: The exact N-best decoding - Step 1: the initial FwSco and N-best tree.
FwSco for the best path: v3: 0 (e2); v6: -3 (e6); v8: -7 (e10); v9: -13 (e12); v10: -13 (e13).
BwSco for the initial N-best tree: v1: -13 (e2); v3: -13 (e3); v6: -10 (e10); v8: -6 (e12); v9: 0 (e13); v10: 0 (NULL).
Figure 4.4: The exact N-best decoding - Step 2: the N-best tree at step 2.
Newly inserted branch: v1: BwSco = -14 (e1); v2: BwSco = -12 (e3); v4: BwSco = -10 (e7); v7: BwSco = -5 (e11).
BwSco for the N-best tree at the first loop: v1: -13 (e2); v3: -13 (e3); v6: -10 (e10); v8: -6 (e12); v9: 0 (e13); v10: 0 (NULL).
4.3. CONSENSUS DECODING Vu Hai Quan
During the first iteration, all the elements in the N−best tree are
examined to find the edge with the highest complete path score passing
through it; that is, ∀(vi, eb) ∈ Υ, ∀e ∈ E with f(e) = vi, compute Ψ(e)
and save the best one. In the case of Fig 4.2, the edge found is e11, with
a complete score of −14. By using the FwSco back-pointer, we get the
best partial path and insert it into the N−best tree at node v9, resulting in
Fig 4.4.
The Complexity of the Algorithm
As we can see, the N−best tree must be updated after each best path is
found, to avoid extracting the same sentence hypothesis twice. Thus the
search effort depends on the current size of the N−best tree. Assuming a
sentence with M words and a very large N, the expected cost of the
computation is:

Σ_{n=1}^{N} (M/2)(n + 1) = (M/4) N(N + 3) ∈ O(N²)
The experimental results of N−best decoding are given in Chapter 5.
4.3 Consensus Decoding
With word graph decoding, we have to handle computational problems.
In general the number of paths through a word graph is exponential in
the number of edges. These paths correspond to different segmentation
hypotheses (i.e. word sequence plus boundary times) of the utterances.
A method to overcome the exponential problem has been suggested in
[Mangu, 1999]. The word graph is first transformed into a special form
called a “confusion network”. A confusion network is itself a word graph:
each edge is labeled with a word and a probability. The most important
feature of confusion networks is that they are linear, in the sense
that every path from the start to the end node has to pass through all
nodes. A consequence of this (combined with the acyclicity) is that all
paths between two nodes have the same length. Thus the confusion net-
work naturally defines an alignment for all pairs of paths, called a multiple
alignment by Mangu. This alignment is used as the basis for the word error
rate calculation. The special entries labeled with the empty word (“eps”
in the figures) correspond to deletions in the alignment.
To estimate the word error rate, sets Ei are defined that contain all the
edges that connect the node vi with the next node. Thus each Ei contains
the alternative hypotheses for the word in position i in the final word string.
The posteriors of the elements of each set sum to one. Before presenting
the confusion network construction algorithm, we first introduce the word
posterior probability.
4.3.1 Word Posterior Probability
The sentence posterior probability is straightforwardly defined as the
probability of the word sequence given the acoustic feature vectors:
p(w1...wn|X). By definition the sentence hypothesis covers the whole
sequence of feature vectors. The situation for a word-level posterior
probability is more complicated, as the boundaries of the word in question
are not known a priori. Depending on the application, different variants of
a word posterior probability might be useful. In the following, some of
these variants are listed and discussed.
The simplest way to define a word posterior probability is to treat
the start and end times as additional random variables [Evermann, 1999],
[Wessel, 2002]. This leads to posteriors of the form p(w|τ, t, X). The
calculation of these values can be achieved relatively easily:
p(w|τ, t, X) = Σ_{e : τ(e)=τ ∧ t(e)=t ∧ w(e)=w} p(e|X)    (4.4)
The problem with these posteriors is that they are too specific, because
they depend on the exact boundary times. For many applications the
concrete start and end times are not relevant; instead, it is desirable to
find a more general definition that does not involve the specific times.
A solution to this is to calculate posterior probabilities for each time
in the utterance [Evermann, 1999], [Wessel, 2002]. Informally speaking,
these “instantaneous” word posterior distributions capture which words
the decoder considers likely at that particular time.
The big advantage of this approach is that it not only takes the
best path into account but also encodes information about the number and
likelihoods of all alternatives (different segmentations of the same word and
different words). The calculation from the link posteriors is simple:
p(w|t, X) = Σ_{e : τ(e) ≤ t ≤ t(e) ∧ w(e)=w} p(e|X)    (4.5)
Fig 4.5 illustrates the computation of the time-dependent word posteriors.
As an example, the posterior of word w1 at time ta, given by Eq. 4.5, is:
p(w1|ta) = p(e1) + p(e3) + p(e5)
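In code, Eq. 4.5 is a filtered sum over edge posteriors. The edge tuples below (word, start time, end time, posterior) are invented for illustration:

```python
# Hypothetical edges: (word, start time, end time, posterior p(e|X)).
edges = [
    ("w1", 0, 10, 0.30), ("w2", 0, 12, 0.20), ("w1", 2, 11, 0.25),
    ("w3", 5, 15, 0.10), ("w1", 3, 9, 0.10), ("w2", 11, 20, 0.05),
]

def word_posterior(edges, word, t):
    """p(w | t, X) as in Eq. 4.5: sum the posteriors of all edges
    labelled `word` whose span [tau, t_end] covers time t."""
    return sum(p for w, tau, t_end, p in edges
               if w == word and tau <= t <= t_end)

p_w1 = word_posterior(edges, "w1", 5)  # sums the three w1 edges covering t = 5
```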
4.3.2 Confusion Network Construction
[Mangu, 1999] also presented an approach to construct a confusion network
from a word graph. The task is treated as a clustering problem, where
the edges from the original word graph have to be clustered into the sets
Ei mentioned above. To achieve a suitable clustering, the algorithm is
constrained to keep the precedence ordering of the original word graph. If
Figure 4.5: Time-dependent word posteriors. (Six edges e1, ..., e6, labeled w1, w2, w1, w3, w1, w2 respectively, spanning the time axis t around two instants ta and tc.)
an edge ea precedes an edge eb in the word graph (i.e. there is a path from
ea to eb) then the cluster in which ea falls must also precede eb’s cluster.
Very informally speaking, this ensures that the order of words is kept, i.e.
the word graph is just collapsed along the vertical axis. In fact, this can
be ensured by defining a precedence relation or a partial order on the sets
of edges. A set E1 precedes (≺) E2 if each of E1's members precedes (≤) all
of E2's members in the original lattice. We define the partial order ≤ on the
edges of a given word graph as follows. For e1, e2 ∈ E, e1 ≤ e2 iff:
• e1 = e2, or
• t(e1) = τ(e2), or
• ∃e′ ∈ E such that e1 ≤ e′ and e′ ≤ e2.
Two equivalence sets E1, E2 are said to be consistent with the word graph
order ≤ if e1 ≤ e2 implies that E1 ≺ E2. By combining equivalence sets,
additional precedences are introduced. Starting from the partial order
defined by the word graph this is repeated until a total order of all sets
is reached, which corresponds to a linear structure of the lattice. The
algorithm by [Mangu, 1999] consists of the following steps:
• calculation of all edge posteriors as described in Eq. 3.9
• calculation of p(w|τ, t, X) for all words (with associated times) found
in the word graph, as in Eq. 4.4. These constitute the initial sets.
• Intra-Clustering: combination of overlapping sets corresponding to
the same word, using the similarity measure:

SIM(E1, E2) = max_{e1∈E1, e2∈E2} overlap(e1, e2) · p(e1) · p(e2)    (4.6)
• Inter-Clustering: combination of the remaining overlapping equivalence
sets (with different words) until a total order of the sets is achieved,
using the similarity measure:

SIM(E1, E2) = avg_{w1∈E1, w2∈E2} sim(w1, w2) · p_{E1}(w1) · p_{E2}(w2)    (4.7)
where the function overlap(e1, e2) measures the overlap in time between
edges e1 and e2, and sim(w1, w2) is the phonetic similarity of the canonical
pronunciations of the words w1 and w2. The details of the intra-word
clustering algorithm are given in the Intra-Word Clustering procedure.
The inter-word clustering is analogous, with the similarity measure
replaced by the phonetic one.
After the above algorithm has run until no further merge candidates are
available, we obtain the confusion network and also the best
alignment, which is called the consensus hypothesis. The properties of
confusion networks and consensus hypotheses will be given in the next
sections.
As we can see, the algorithm consists of three stages. The initial edge
equivalence sets are formed by word identity and the starting and ending
Intra-Word Clustering(wg, pro)
1  for E1 ∈ E
2      do for E2 ∈ E
3          do if word(E1) = word(E2) and E1 ⊀ E2 and E2 ⊀ E1
4              then sim = SIM(E1, E2)
5                  if sim > maxSim
6                      then maxSim = sim
7                          bestSet1 = E1
8                          bestSet2 = E2
9  Enew = bestSet1 ∪ bestSet2
10 for Ei ∈ E, i = 1, ..., N
11     do if Ei ≺ bestSet1 or Ei ≺ bestSet2
12         then Ei ≺ Enew
13 for Ei ∈ E, i = 1, ..., N
14     do if bestSet1 ≺ Ei or bestSet2 ≺ Ei
15         then Enew ≺ Ei
16 for Ei ∈ E, i = 1, ..., N
17     do for Ej ∈ E, j = 1, ..., N
18         do if (Ei ≺ bestSet1 and bestSet2 ≺ Ej) or
               (Ei ≺ bestSet2 and bestSet1 ≺ Ej)
19             then Ei ≺ Ej
20 E = E ∪ {Enew} \ {bestSet1, bestSet2}
times:

E_{w,t1,t2} = {e ∈ E | w(e) = w, τ(e) = t1, t(e) = t2}
The initial partial order ≺ is given as the transitive closure of the edge order
≤ defined above. The partial order is updated minimally upon merging of
equivalence sets so as to keep it consistent with the previous order.
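Forming the initial equivalence sets E_{w,t1,t2} is a simple grouping pass over the edges; a minimal sketch with hypothetical edge tuples:

```python
from collections import defaultdict

# Hypothetical edges: (word, start time, end time, posterior).
edges = [
    ("LA", 0, 10, 0.4), ("LA", 0, 10, 0.3),
    ("MA", 10, 20, 0.5), ("LA", 2, 10, 0.2),
]

def initial_sets(edges):
    """Group edges sharing word identity and identical boundary times."""
    sets = defaultdict(list)
    for e in edges:
        word, t1, t2, _ = e
        sets[(word, t1, t2)].append(e)
    return dict(sets)

sets = initial_sets(edges)
# Three sets: ("LA", 0, 10) holds two edges, the other two hold one each.
```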
Fig 4.6 shows a real word graph example, while its confusion network is
illustrated in Fig 4.7. The symbol eps represents the empty label.
Figure 4.6: A word graph example. (A real lattice over the Italian words LA, LAMENTI, MA, VITTIMA, A, E, E', ERA, DI, GIUSEPPE, CHE, TE, DE, with background labels @BG and sentence-end markers </s>.)
Figure 4.7: The confusion network resulting from the word graph in Fig 4.6. (Four positions with alternatives: {LA, LAMENTI, eps}; {VITTIMA, MA, eps}; {DI, E', eps, E, A}; {GIUSEPPE, eps}.)
4.3.3 Pruning
In Chapter 3, a pruning algorithm, namely the forward-backward based
algorithm, was presented. It works by computing the posterior probability
of each edge and then removing edges whose posterior probabilities are low
compared to the best one. The same procedure can be applied to a word
graph before constructing the confusion network from that word graph.
In particular, edges which have very low posterior probability
are negligible in computing the total posterior probabilities of word hy-
potheses, but can have a detrimental effect on the alignment. This occurs
because the alignment preserves consistency with the word graph order, no
matter how low the probability of the links imposing the order is. In order
to eliminate low posterior probability edges, we use a preliminary pruning
step. Word graph pruning removes all edges whose posteriors are below an
empirically determined threshold. The clustering procedure merges only
edges that survive the initial pruning. Chapter 5 gives results showing the
effect of lattice pruning on the overall effectiveness of our algorithm.
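The pruning step itself reduces to a thresholded filter on edge posteriors; a minimal sketch with invented posterior values (the actual threshold is determined empirically, as noted above):

```python
# Invented edge posteriors; theta is the empirical pruning threshold,
# expressed relative to the best edge posterior.
posteriors = {"e1": 0.50, "e2": 0.30, "e3": 0.002, "e4": 0.15}

def prune(posteriors, theta):
    """Keep only edges whose posterior is at least theta times the best."""
    best = max(posteriors.values())
    return {e: p for e, p in posteriors.items() if p >= theta * best}

kept = prune(posteriors, theta=0.01)  # e3 (0.002 < 0.005) is removed
```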
4.3.4 Confusion Network
The total posterior probability of an alignment set can be strictly less
than 1. This happens when there are paths in the original word graph
that do not contain a word at that position; the missing probability mass
corresponds precisely to the probability of a deletion (or null word). We
explicitly represent deletions by an empty (eps) link. We can think of the
confusion network as a highly compact representation of the original lattice with the
property that all word hypotheses are totally ordered. Moreover, confusion
networks have other interesting uses besides word error minimization, some
of which will be discussed in Chapter 6 (for Machine Translation).
4.3.5 Consensus Decoding
Once we have a complete alignment, it is straightforward to extract the
hypothesis with the lowest expected word error. Let Ei, i = 1, ..., L be the
final link equivalence sets making up the alignment. We need to choose a
hypothesis W = w1, ..., wL such that wi = eps or wi = w(ei) for some ei ∈ Ei.
It is easy to see that the expected word error of W is the total sum of the
word errors at each position in the alignment. Specifically, the expected
word error at position i is:

1 − p(ei)            if wi ≠ eps
1 − Σ_{e∈Ei} p(e)    if wi = eps

This is equivalent to finding the path through the confusion network with
the highest combined link weight.
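The position-wise choice described above can be sketched as follows; the alignment sets and posteriors are hypothetical, and any mass not assigned to a word is treated as the posterior of the null word:

```python
# Hypothetical confusion network: one dict of word -> posterior per
# alignment position; mass not assigned to any word is the null word.
network = [
    {"LA": 0.6, "LAMENTI": 0.3},   # the remaining 0.1 goes to the deletion
    {"VITTIMA": 0.7, "MA": 0.25},
    {"DI": 0.4, "E": 0.35},
    {"GIUSEPPE": 0.9},
]

def consensus(network):
    """Pick, per position, the word (or deletion) with the highest posterior."""
    hyp = []
    for slot in network:
        p_del = 1.0 - sum(slot.values())          # posterior of the null word
        word, p = max(slot.items(), key=lambda kv: kv[1])
        if p_del > p:
            word = ""                             # the deletion wins
        if word:
            hyp.append(word)
    return hyp

print(consensus(network))  # ['LA', 'VITTIMA', 'DI', 'GIUSEPPE']
```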
Chapter 5
Improvements of Speech Recognition
5.1 Speech Recognition Experiments
In this chapter, we present experiments on the use of word graphs
in speech recognition. Two datasets have been used for evaluation
purposes. The first one, named BTEC (Basic Travelling Expressions Corpus),
is a small corpus of spontaneous speech. The second one, named IBNC
(Italian Broadcast News Corpus), is a large corpus of broadcast news. The
two corpora are thus quite different in terms of vocabulary size and type
of speech, allowing an exhaustive evaluation of word graph algorithms.
This chapter is organized as follows. In the first section, we describe
the ITC-irst recognition system in terms of acoustic and language models,
the training datasets, etc. In the second section, experiments on the use
of word graphs for speech recognition are presented. Finally, the last
section covers the experiments on N−best rescoring methods.
5.2 Speech Recognition System
The automatic speech recognition system employed for the experiments has
been developed at ITC-irst since the 1990s; the first experiments on broadcast news
5.2. ASR SYSTEM Vu Hai Quan
Figure 5.1: Broadcast News Retrieval System
were presented in [Brugnara, 2000] and [Federico, 2000]. Fig. 5.1 shows
the ITC-irst system for recognizing broadcast news, which includes the
following components: segmentation and classification, speaker clustering,
acoustic adaptation, and speech transcription. Hereinafter, we briefly
describe each component.
5.2.1 Segmentation and Clustering
As shown in Fig 5.1, the audio signal of a news program can contain speech,
possibly from different speakers, as well as music and other non-speech
events. The purpose of the segmentation and classification stage is to
identify, in the audio signal, segments of music, non-speech events,
male/female wide-band speech, and male/female narrow-band speech. The
clustering stage tries to gather speech segments of the same speaker.
For segmentation, the Bayesian Information Criterion (BIC) is applied
CHAPTER 5. IMPROVEMENTS OF SPEECH RECOGNITION
to segment the input audio into acoustically homogeneous chunks. Gaus-
sian mixture models are then used to classify segments in terms of acoustic
source and channel. Emission probability densities consist of mixtures of
1024 multi-variate Gaussian components having diagonal covariance ma-
trix. Observations are the same 39-dimension vectors used for speech recog-
nition (see below).
Clustering of speech segments is done by a bottom-up scheme that
groups segments which are acoustically close with respect to BIC.
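A one-dimensional sketch of BIC-based change detection may make the criterion concrete (the actual system applies it to 39-dimensional feature vectors; the data and the penalty weight here are illustrative):

```python
import math

def delta_bic(x, split, lam=1.0):
    """Positive values favour splitting x at `split` into two segments."""
    def n_log_var(seg):
        n = len(seg)
        mean = sum(seg) / n
        var = sum((v - mean) ** 2 for v in seg) / n
        return n * math.log(var)
    penalty = lam * math.log(len(x))  # 0.5 (d + d(d+1)/2) log N with d = 1
    gain = 0.5 * (n_log_var(x) - n_log_var(x[:split]) - n_log_var(x[split:]))
    return gain - penalty

two_sources = [0.0, 1.0] * 25 + [10.0, 11.0] * 25  # clear change at index 50
one_source = [0.0, 1.0] * 50                       # homogeneous signal
# delta_bic(two_sources, 50) > 0 (split); delta_bic(one_source, 50) < 0.
```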
5.2.2 Acoustic Adaptation
Gaussian components in the system are adapted using the Maximum Like-
lihood Linear Regression (MLLR) technique. A global regression class is
considered for adapting only the means or both means and variances. Mean
vectors are adapted using a full transformation matrix, while a diagonal
transformation matrix is used to adapt variances.
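A minimal sketch of the mean transformation mu' = A mu + b follows; the matrix, bias and mean values are invented, and the actual MLLR step, estimating A and b from adaptation data by maximum likelihood, is not shown:

```python
# Adapted mean: mu' = A @ mu + b, one global regression class.
def adapt_mean(A, b, mu):
    return [sum(a * m for a, m in zip(row, mu)) + bi
            for row, bi in zip(A, b)]

A = [[1.0, 0.5], [0.0, 2.0]]   # full (here 2x2) transformation matrix
b = [0.5, -1.0]                # bias vector
mu = [2.0, 4.0]                # original Gaussian mean
adapted = adapt_mean(A, b, mu)  # [4.5, 7.0]
```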
5.2.3 Speech Transcription
The core of speech transcription is the recognition engine. It includes the
acoustic model, the language model and the search algorithm.
Acoustic Model
Acoustic modeling is based on continuous-density HMMs. The acoustic
parameter vector comprises 12 mel-scaled cepstral coefficients, the log-
energy and their first and second time derivatives. A total of 85 phone-
like units are used, in which 50 are needed for representing the Italian
phonemes, while the remaining 35 are needed for representing foreign words
coming from the English and German languages. A set of 16 additional
units has also been introduced to cover silence, background noise and a
number of spontaneous speech phenomena.
Context-dependent models include a set of triphones and a set of left-
dependent or right-dependent diphones used as backoff models for unseen
contexts. Acoustic training was performed with Baum-Welch re-estimation
using the training portion of the IBNC corpus, augmented with other
datasets collected at ITC-irst.
Language Model
A trigram language model was developed for recognition, by mainly ex-
ploiting newspaper texts. For estimating the trigram language model, a
133M-word collection of the nationwide newspaper La Stampa was em-
ployed. Moreover, the broadcast news transcriptions of the training data
were also added.
A lexicon of the most frequent 64K words was selected. The lexicon
gives a 2.2% OOV rate on the newspaper corpus and about 1.6% on the
IBNC corpus. An interpolated trigram LM [Federico, 2000] was estimated
by employing a non-linear discounting function and a pruning strategy that
deletes trigrams on the basis of their context frequency. This results in a
pruned LM with a perplexity of 188 and a size of 14M.
Recognition
The recognizer is a single-step Viterbi decoder. The 64K-word trigram LM
is mapped into a static network with a shared-tail topology [Antoniol, 1995].
The main network has about 11M states, 10M labeled transitions and 17M
empty transitions.
Table 5.1: Training and Testing Data for BTEC and IBNC.

      | Training Data      | Testing Data
      | Speech | Language  | Speech | Vocab. Size | Words per Sent.
BTEC  | 130 h  | 133M      | 00:29m | 14K         | 7.4
IBNC  | 60 h   | 133M      | 1h:15m | 64K         | 20.2
5.2.4 Training and Testing Data
The IBNC corpus consists of around 60 hours of news programs from Radio
RAI (the major Italian broadcasting company) for training and 1h:15m
of news programs for testing. The programs mainly contain clean studio
speech reports of anchors or other reporters (52 %), clean telephone reports
and interviews (21 %), studio speech with background music/noise (20 %),
telephone speech with background noise (5 %) and other (2 %).
The Basic Travelling Expression Corpus-BTEC jointly developed by
the C-STAR partners is a collection of sentences that bilingual travel ex-
perts consider useful for people going to or coming from another coun-
try. The initial collection of Japanese and English sentence pairs is be-
ing translated into Chinese, Korean, and (partially) Italian, as reported
in [Paul, 2004]. Currently, the Italian part of BTEC consists of 506 short
sentences, recorded by 10 speakers for a total of 28.7 minutes. This
dataset is used for evaluation purposes.
In summary, Table 5.1 shows the training and testing data for both
BTEC and IBNC tasks.
5.3 Experimental Results
This section presents the experimental results of the algorithms described
in Chapter 3 and Chapter 4, including word graph generation, word graph
pruning, word graph expansion and word graph decoding.
5.3. EXPERIMENTAL RESULTS Vu Hai Quan
Table 5.2: BTEC: Word error rates with different rescoring methods.

          | Word Graph Decoding        | Consensus Decoding
Decoder   | trigram-case | bigram-case | trigram-case | bigram-case
21.8      | 21.8         | 22.2        | 21.8         | 22.2
Table 5.3: IBNC: Word error rates with different rescoring methods.

          | Word Graph Decoding        | Consensus Decoding
Decoder   | trigram-case | bigram-case | trigram-case | bigram-case
18.0      | 18.0         | 19.8        | 17.8         | 19.6
5.3.1 Word Graph Decoding
Let us begin with the word graph decoding experiments. Three decoding
methods, namely the decoder, the 1−best word graph decoding and the
consensus decoding, are evaluated in terms of WER. Here, the WER is
the edit distance between the best hypothesis and the reference transcription
of the utterance (see Chapter 3).
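For reference, the word-level edit distance underlying the WER can be computed by standard dynamic programming; a minimal sketch, not the scoring tool used in the experiments:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("la vittima di giuseppe", "la vittima de giuseppe"))  # 0.25
```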
The difference between the three decoding methods is in the way the
best hypothesis is chosen. With the decoder, the best output hypothesis
is taken directly from the recognizer, while with the 1−best word graph
decoding, the best hypothesis is the best path in the word graph. Finally,
in consensus decoding, the best hypothesis is the consensus hypothesis,
which has the highest posterior probability among all possible hypotheses
taken from the confusion network (see Chapter 4).
This evaluation should fulfil two important requirements:
• correctness : for the implementation of the word graph generation
algorithm.
• improvement : for the consensus decoding.
From the theoretical descriptions of the word graph generation in Chap-
ter 3 and the word graph decoding in Chapter 4, we expect that the WER
of the 1−best word graph decoding for trigram-based word graphs should
be exactly the WER of the decoder. This property cannot hold for
the bigram-based word graphs, as approximate language model scores
have been used to separate the acoustic scores from the accumulated word
scores, given by Eq. 3.5. Moreover, the WER of consensus decoding for
trigram-based word graphs should be no higher than the WER of the
decoder.
Table 5.2 and Table 5.3 report the word error rate for the BTEC and
the IBNC datasets with respect to the three different decoding methods.
For the BTEC dataset, the three methods give the same WER of 21.8%
in the trigram case, while in the bigram case the WER is 22.2% for both
the 1−best word graph decoding and the consensus decoding methods. The
situation is slightly different for the IBNC dataset. Specifically, while
the decoder and the 1−best word graph decoding give the same WER of
18.0% in the trigram case, the WER of the consensus decoding for this type
of word graphs is smaller (17.8%). In addition, in the bigram case the
WER of the consensus decoding is also smaller than the WER of the 1−best
decoding: 19.6% compared to 19.8%. These results confirm both
the correctness and the improvement requirements.
A possible reason why consensus decoding does not improve the word error
rate on the BTEC dataset, as it does on the IBNC dataset, may be the
following: the average sentence length in IBNC is longer than in BTEC
(see Table 5.1), so the language model has more influence on IBNC than
on BTEC.
5.3.2 Impact of Beam-Width
Results reported in Table 5.2 and Table 5.3 have been measured on word
graphs of different quality obtained by three different pruning techniques.
The first pruning technique is the beam search with different beam-width.
Table 5.4: BTEC: Costs of the decoder with different threshold values.
threshold Time Active States Active Trans Active Model
1 · 10−40 158.45 1629469 2985586 1770564
1 · 10−50 173.36 4448169 8184684 3897707
1 · 10−60 242.02 12252734 21518788 7572416
1 · 10−70 335.03 29111720 48728725 13010224
1 · 10−80 612.41 60120619 95958020 20431202
The second one is based on the forward-backward pruning algorithm. Fi-
nally, the last one exploits the topology of word graphs for compressing
them into a more compact representation. In this and in the following
subsection, we will examine the effect of the pruning methods on the word
graph quality.
Threshold
With the first approach, by setting different beam-widths, the number of
pruned hypotheses changes. Specifically, when a large beam is searched,
the number of pruned hypotheses is small (see Chapter 3). This results
in larger word graphs. As a consequence, the time needed to complete
the recognition and word graph generation tasks is longer. In fact, the
recognition system uses a variable named threshold to represent the
beam-width. It is worth noting that a small value of the threshold
corresponds to a large beam-width and vice-versa; hence, the terms
threshold and beam-width are used interchangeably. The following
experiments show the impact of the beam-width on the performance of the
recognizer and on the word graph quality.
Costs of the Recognition System vs. Beam Width
Table 5.4 shows the costs of the recognition system for a part of the BTEC
dataset, with respect to different beam-widths. When the threshold value
decreases from 10−40 to 10−80, the time that the system needs to complete
the recognition task grows by a ratio of approximately 1 : 3.4, and the
number of active states by a ratio of 1 : 13.37. This means that not only
the time but also the memory usage increases as the beam-width is enlarged.
Word graphs vs. Beam Width
As noticed in the previous section, when the beam-width is large, a small
number of hypotheses is pruned and a longer time is needed to complete
the recognition task. The same result holds for the word graph generation.
Fig 5.2 shows the relationship between the time used for generating the
word graph and the threshold value. The dotted-line is used for trigram-
based word graphs and the solid line refers to bigram-based word graphs.
We also report the GER and the word graph size (in terms of number
of paths) for different threshold values, as shown in Fig 5.3 and Fig 5.4.
Note that the threshold values on the x−axis of those figures are exponents;
for example, the value −40 stands for 1 · 10−40.
The choice of the beam-width is very important: it should satisfy both
the time and the GER constraints. Keeping this in mind and looking at
the experimental results, we have chosen the threshold value of 10−50 for
all the following experiments.
5.3.3 Language Model Factor Experiments
As shown in Eq. 2.7, the language model factor plays an important role in
computing the likelihood. Actually, the likelihood is computed by combin-
ing two scores, given by the acoustic and by the language models respec-
tively, which have very different dynamic scales. If they are just multiplied
as indicated in Eq 2.6 the decision for a word sequence would be dominated
by the acoustic likelihood and the language model would have hardly any
Figure 5.2: BTEC: Time for generating word graphs vs. different threshold values. (Time on the y-axis against threshold exponents from −80 to −40; one curve for bigram-based and one for trigram-based word graphs.)
Figure 5.3: BTEC: GER vs. different threshold values. (GER on the y-axis against threshold exponents from −85 to −40; curves for bigram- and trigram-based word graphs.)
[Plot omitted: number of paths, in log scale (y-axis), vs. threshold (x-axis).]
Figure 5.4: BTEC: Number of paths in word graphs vs. different threshold values.
influence. To balance the two contributions, it is customary to use an expo-
nential weighting factor for the language model. The difference in dynamic
scales is mainly caused by the fact that the acoustic likelihoods are severely
underestimated since consecutive frames are assumed to be independent in
the HMM framework.
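The weighted combination described above can be sketched as follows; the function and the variable names are illustrative, not those of the ITC-irst decoder.

```python
def combined_score(log_acoustic: float, log_lm: float, lm_scale: float) -> float:
    """Combine acoustic and language model log-likelihoods.

    Because acoustic log-likelihoods are severely underestimated (frames are
    assumed independent), the LM log-probability is raised to an exponential
    weight, which becomes a multiplicative scale in the log domain.
    """
    return log_acoustic + lm_scale * log_lm

# With lm_scale = 7.0 (the value chosen in Section 5.3.3), the LM term
# contributes far more to the decision than with an unweighted combination.
unweighted = combined_score(-5000.0, -30.0, 1.0)
weighted = combined_score(-5000.0, -30.0, 7.0)
```

In the log domain the exponential LM weight reduces to a simple multiplier, which is why it is often called the language model scale.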
Fig 5.5 shows the WER vs. the language model factor, for a part of the
BTEC dataset. As we can see, the WER is smallest at the value 7.0. This
value of the language model scale was used for all the following
experiments.
5.3.4 Forward-Backward Based Pruning Experiments
The previous sub-sections showed the effect of the “beam search” on the
generation of word graphs by measuring the generation time, the word
graph sizes and the GERs. The following subsections present experiments
on the algorithms which work directly on the generated word graphs. For
each experiment, we report the results for both trigram-based word graphs
and bigram-based word graphs, on both the BTEC and the IBNC datasets.
Word Graph Experiments
Let us begin with the forward-backward based pruning algorithm which
was presented in Chapter 3. Table 5.5 and Table 5.6 show the impact of
this pruning algorithm on the word graph quality for the BTEC dataset.
Table 5.7 and Table 5.8 give results of the same experiments for the IBNC
dataset.
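The idea behind the Chapter 3 algorithm can be sketched on a toy word graph; the function name and the edge representation are illustrative, and Viterbi (max) forward-backward scores are assumed.

```python
import math
from collections import defaultdict

def forward_backward_prune(edges, start, end, beam):
    """Keep only edges whose best complete path through them scores within
    `beam` (log domain) of the overall best path.

    edges: list of (from_node, to_node, word, log_score) tuples, assumed to
    be sorted in topological order of their source nodes.
    """
    fwd = defaultdict(lambda: -math.inf)
    bwd = defaultdict(lambda: -math.inf)
    fwd[start] = 0.0
    bwd[end] = 0.0
    for u, v, _, s in edges:              # Viterbi forward pass
        fwd[v] = max(fwd[v], fwd[u] + s)
    for u, v, _, s in reversed(edges):    # Viterbi backward pass
        bwd[u] = max(bwd[u], s + bwd[v])
    best = fwd[end]
    return [(u, v, w, s) for u, v, w, s in edges
            if fwd[u] + s + bwd[v] >= best - beam]
```

The beam-width values in the tables below play the role of `beam`: a smaller beam discards more edges and shrinks the graph, at the cost of a higher GER.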
In the first lines of the tables, the beam-width value Inf means that
no pruning is performed. The GERs in this initial case are 15.3% for
trigram-based word graphs and 7.9% for bigram-based word graphs.
When using the threshold value of 100.0, we obtained an absolute increase
[Plot omitted: WER (y-axis) vs. LM scale (x-axis).]
Figure 5.5: BTEC: WER vs. different language model factors.
Table 5.5: BTEC: Trigram-based graph word error rate.
Beam-Width WGD NGD BGD NumPaths(log) GER
Inf 447.7 224.8 23.0 42.8 15.3
300 431.6 236.9 23.2 41.0 15.4
250 418.3 230.3 23.2 39.7 15.4
200 393.2 218.0 23.2 38.5 15.5
150 338.0 191.2 23.1 35.1 15.5
120 259.6 152.2 22.6 32.2 15.6
100 156.7 93.8 16.0 29.5 16.3
50 19.2 12.2 3.3 19.3 18.1
Table 5.6: BTEC: Bigram-based graph word error rate.
Beam-Width WGD NGD BGD NumPaths(log) GER
Inf 1558.7 390.4 38.4 126.4 7.9
300 1444.0 367.6 38.0 119.6 8.0
250 1338.3 347.2 37.4 111.9 8.1
200 1113.0 308.0 36.0 104.3 8.1
150 781.0 237.2 32.9 92.0 8.1
100 298.2 111.4 21.0 65.5 8.8
50 38.0 19.1 4.9 30.0 12.1
Table 5.7: IBNC: Trigram-based graph word error rate.
Beam-Width WGD NGD BGD NumPaths(log) GER
Inf 139.2 75.5 8.7 84.3 14.2
300 135.8 73.9 8.7 82.3 14.2
250 132.8 72.4 8.7 79.8 14.2
200 126.8 69.4 8.7 74.5 14.2
150 114.1 63.3 8.7 66.8 14.2
120 98.3 55.9 8.7 59.9 14.3
100 71.9 42.4 7.5 53.8 14.3
50 11.0 7.4 2.3 30.6 14.8
Table 5.8: IBNC: Bigram-based graph word error rate.
Beam-Width WGD NGD BGD NumPaths(log) GER
Inf 2363.5 335.0 34.7 321.2 5.1
300 2200.5 335.9 34.2 310.0 5.1
250 2014.7 316.1 33.5 308.0 5.1
200 1635.8 274.5 31.8 299.2 5.1
150 991.3 197.3 27.5 263.1 5.2
120 549.5 134.2 22.8 208.9 5.3
100 298.8 87.7 17.6 159.3 5.5
50 27.3 13.3 4.1 54.6 7.1
in GER of nearly 1%, but the reduction in WGD is significant: ratios of
1 : 2.8 and 1 : 5.2 are obtained with respect to the initial case for trigram-
and bigram-based word graphs respectively.
For the IBNC dataset, the results are even better. At the beam-width
value of 100.0, the WGD for trigram-based word graphs is reduced by
nearly 50% while the GER is increased by just 0.1%, compared to the
case where the beam-width value is set to Inf. For bigram-based word
graphs, the results are better still: at the same beam-width value of 100,
the graph size in terms of WGD is reduced by nearly a factor of 10, while
the GER is increased by just 0.4%.
Confusion Network Experiments
We applied the same pruning procedure to word graphs before passing
them to the confusion network construction. As mentioned in Chapter 4,
the confusion network construction stage takes a word graph as its input
and produces a compact representation, namely the confusion network,
by grouping edges whose words have similar pronunciations and overlap in
time.
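The intra-word grouping step can be illustrated with a minimal sketch. This is a simplification of the Chapter 4 construction (which also merges bins of different words into confusion sets); the function name and the edge representation are illustrative.

```python
def build_bins(edges):
    """Group word-graph edges that carry the same word and overlap in time.

    edges: list of (word, start_time, end_time, posterior) tuples.
    Returns mutable bins [word, start, end, accumulated_posterior].
    """
    bins = []
    for word, s, e, p in edges:
        for b in bins:
            bw, bs, be, bp = b
            if bw == word and s < be and bs < e:   # same word, overlapping span
                b[1], b[2] = min(bs, s), max(be, e)
                b[3] = bp + p                      # accumulate posterior mass
                break
        else:
            bins.append([word, s, e, p])
    return bins
```

Edges for the same word that overlap in time collapse into a single bin whose posterior is the sum of the merged edges, which is the source of the size reduction reported below.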
In order to use the same procedure developed for the word graph eval-
uation, we have converted the confusion network to the word graph-based
[Diagram omitted: a confusion network over words a, b, c, d, e and the corresponding word graph, with @BG edges.]
Figure 5.6: BTEC: Confusion network and its word graph representation.
representation, where each node is labeled with a unique word label. Fig 5.6
shows an example of a confusion network and its graph-based representation.
The @BG label stands for the empty word.
Table 5.9 and Table 5.10 show the impact of the pruning threshold
values on the quality of confusion networks for the BTEC dataset, while
Table 5.11 and Table 5.12 report the corresponding results for the IBNC
corpus. From those tables, two observations arise:
Firstly, it is important to notice that the GER of confusion networks is
always smaller than the GER of word graphs. At the beam-width value of
1400, which can be considered equivalent to the Inf value for word graphs,
the GER is 13.5% for trigram-based confusion networks and 6.6% for
bigram-based confusion networks, while the GER for word graphs, as reported
in Table 5.5 and Table 5.6, is 15.3% and 7.9% respectively.
Secondly, confusion network sizes are always smaller than word graph
sizes, at the same GER values. Fig 5.7 illustrates the relationship between
the GERs for the word graphs (the solid line) and for their confusion net-
works (the dotted-line) and the WGD. As we can see, at the same value
Table 5.9: BTEC: Trigram-based confusion network word error rate.
beam-width WGD NGD BGD NumPaths(log) GER
1400 57.6 9.47 6.6 24.4 13.5
1200 52.2 9.4 6.4 23.4 13.6
900 42.8 8.4 5.7 21.0 13.7
700 35.9 7.7 5.1 19.4 13.8
500 27.6 6.7 4.4 17.9 14.0
300 18.2 5.5 3.6 14.1 14.5
200 13.8 4.7 3.0 12.5 14.8
100 8.8 3.8 3.7 10.9 15.4
50 6.4 3.1 1.9 9.6 16.0
Table 5.10: BTEC: Bigram-based confusion network word error rate.
beam-width WGD NGD BGD NumPaths(log) GER
1400 88.7 18.1 8.8 81.6 6.6
1200 81.7 17.0 6.7 78.8 6.7
900 69.8 15.3 7.6 74.3 6.9
700 60.9 13.9 6.9 66.3 7.3
500 49.3 12.3 6.1 55.9 7.6
300 35.2 9.7 5.1 42.7 8.4
200 26.8 8.1 4.4 36.7 9.0
100 16.8 6.0 3.4 25.1 10.3
50 10.9 4.6 2.6 18.5 11.9
Table 5.11: IBNC: Trigram-based confusion network word error rate.
beam-width WGD NGD BGD NumPaths(log) GER
1400 35.97 6.0 3.7 33.3 12.1
1200 32.9 5.7 3.5 33.1 12.1
900 27.6 5.3 3.2 32.8 12.2
700 23.0 4.9 2.9 32.6 12.2
500 18.3 4.5 2.7 32.4 12.2
300 13.5 3.9 2.3 32.0 12.3
200 10.8 3.5 2.0 31.4 12.4
100 7.5 3.0 1.7 30.2 12.7
50 5.7 2.7 1.5 29.6 12.8
[Plot omitted: WGD (y-axis) vs. GER (x-axis), for word graphs and confusion networks.]
Figure 5.7: IBNC: Confusion network size vs. word graph size.
[Plot omitted: WER (y-axis) vs. pruning threshold (x-axis).]
Figure 5.8: IBNC: Consensus decoding word error rate vs. the beam-width.
Table 5.12: IBNC: Bigram-based confusion network word error rate.
beam-width WGD NGD BGD NumPaths(log) GER
1400 103.2 16.7 8.0 185.1 3.6
1200 94.7 15.7 7.6 178.3 3.7
900 80.1 14.0 6.8 159.4 3.9
700 68.1 12.6 6.1 142.9 4.0
500 53.8 10.9 5.3 127.8 4.2
300 37.0 8.7 4.3 101.8 4.6
200 28.0 7.3 3.7 87.4 5.0
100 17.2 5.4 2.8 63.0 5.7
50 10.7 4.0 2.1 53.0 6.7
of GER, the WGD of word graphs is larger than the WGD of confusion
networks.
However, also in the confusion network case a trade-off between quality
and computational cost has to be established: the smaller the confusion
network, the higher the WER. This is shown in Fig 5.8, which plots the
WER as a function of the beam-width. The dotted line refers to the
trigram case, while the solid line refers to the bigram case.
In summary, the following conclusions can be drawn for the forward-
backward based pruning algorithm.
• In the best case, the lower bound of the error is 3.6%, namely the
GER of bigram-based confusion networks for the IBNC dataset. This
is a very positive result.
• In most cases, the word graph sizes can be at least halved, with a
relatively small effect on their GERs.
• In all cases, the GERs of confusion networks are smaller than the
GERs of the corresponding word graphs, under the same pruning
condition. The same holds for the WER.
5.3.5 Node-Compression Experiments
In the previous subsection, the experimental results of the forward-backward
based pruning on word graphs and confusion networks were presented.
Here, we present experiments on the node-compression method. As
mentioned in Chapter 3, the idea of this algorithm is to combine identical
(sub)paths in the word graph so that redundant nodes and edges are
removed. In fact, if two nodes in the word graph have the same set of
incoming edges (or outgoing edges), they can be merged without changing
the language of the word graph, where the language of the word graph is
defined as the set of all word strings starting at the initial node and
ending at the final node.
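The incoming-edge case of this merging rule can be sketched as follows; the function name and the edge representation are illustrative, and the thesis algorithm additionally merges nodes with identical outgoing edge sets.

```python
from collections import defaultdict

def merge_equivalent_nodes(edges):
    """Merge word-graph nodes that have identical sets of incoming edges.

    edges: list of (from_node, word, to_node) tuples. Two nodes whose
    incoming (source, word) sets coincide accept exactly the same prefixes,
    so merging them leaves the language of the graph unchanged.
    """
    incoming = defaultdict(set)
    for u, w, v in edges:
        incoming[v].add((u, w))
    canon = {}
    for node, inc in incoming.items():
        canon.setdefault(frozenset(inc), node)   # first node seen is canonical
    rename = {node: canon[frozenset(inc)] for node, inc in incoming.items()}
    # Rewriting both endpoints collapses duplicate edges into one.
    merged = {(rename.get(u, u), w, rename.get(v, v)) for u, w, v in edges}
    return sorted(merged)
```

In the example below, two nodes reached by the same edge set are merged and one duplicate edge disappears, which is the kind of reduction measured in Table 5.13.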
Table 5.13 reports on the reductions obtained by means of the node-
compression algorithm on bigram-based and trigram-based word graphs.
As we can see, the algorithm reduces word graph sizes, in terms of
WGD, by about 44% in the bigram case and 30% in the trigram case.
Compared to the results of the forward-backward based pruning, in which
word graph sizes are halved, the reduction obtained by this algorithm
is not significant. Moreover, if we perform the forward-backward pruning
first and then apply the node compression to the pruned word graphs, the
additional gains are even smaller, as shown in Table 5.14. The first line of
Table 5.14 contains different measures of the size of the initial word graphs
and the corresponding GER, without any pruning. The second and third
lines give the word graphs resulting from the forward-backward pruning
(at the beam-width value of 120, see also Table 5.8) and from the node
compression, respectively. Finally, the last line gives the word graph sizes
and GER when the two pruning methods are applied in the order mentioned above.
Clearly, the forward-backward pruning gives the most significant reduction.
In addition, the node-merging procedure, which is required in
Table 5.13: BTEC: Node compression experiments.
Methods WGD NGD BGD NumPaths
Bigram-based WG before compression 1558.7 390.4 38.4 126.4
Bigram-based WG after compression 891.5 210.2 28.6 90.3
Trigram-based WG before compression 447.7 224.8 23.0 42.8
Trigram-based WG after compression 312.4 198.8 18.6 36.6
Table 5.14: IBNC: Forward-backward pruning and node compression.
Methods WGD NGD BGD NumPaths(log) GER
Initial word graph 2363.5 335.0 34.7 321.2 5.1
Fw-Bw based pruning 549.5 134.2 22.8 208.9 5.3
Node-compression 1456.2 274.5 28.1 281.7 5.1
Combined two pruning 493.6 114.7 18.3 188.4 5.3
the node-compression algorithm, is quite expensive.
5.3.6 Word Graph Expansion Experiments
Results reported in Table 5.5 and Table 5.8 show for sure that the GER
of bigram-based word graphs is always smaller than the GER of trigram-
based word graphs at the same pruning condition. In contrast, the WER
of trigram-based word graphs is always lower than the WER of bigram-
based word graphs (see Table 5.2 and Table 5.3). In fact, the language
model constraints, which were used for generating bigram word graphs,
are looser than the ones used for generating trigram-based word graphs.
Hence bigram word graphs contain more paths than trigram word graphs.
However, when constructing bigram word graphs, we have approximately
used the bigram language model scores instead of the real trigram language
model scores - the values that are actually used by decoder. This explains
the higher WER for bigram-based word graph rescoring, compared to the
WER for trigram-based word graph rescoring. It would be interesting to
construct the word graph using bigram constraints and then expand it to
Table 5.15: BTEC: Bigram-based word graph expansion experiments.
Method WGD NGD BGD WER
Trigram-Based Word Graph 338.0 191.2 23.0 21.8
Bigram-Based Word Graph 781.0 237.2 32.9 22.2
Simple Method 7681.2 2064.7 30.2 22.0
Back-off Method 1681.5 824.2 30.2 22.0
put the trigram language scores on its edges. The word graph expansion
algorithms were presented in Chapter 3.
Table 5.15 shows the results for the bigram word graph expansion.
The first and second lines report the word graph quality, in terms of
WGD, NGD, BGD and WER, for the original trigram-based and bigram-based
word graphs respectively. The third line contains the corresponding
results for the simple word graph expansion method. Finally, the last line
shows the results for the expansion method which exploits the back-off
language model property.
the back-off language model property. As we can see, the improvements
of WER is relatively small, 22.0% vs. 22.2% with respect to the expanded
word graph sizes, which are nearly ten times bigger, for the simple method,
and two times for the back-off based method, compared to the original
bigram-based word graph sizes. Moreover, as mentioned in Chapter 3, in
order to apply the word graph expansion all edges labeled @BG have to
be removed from word graphs.
A benefit of the word graph expansion is that the real trigram
language model scores are put on the edges of bigram word graphs, which
have a very low GER compared to the GER of trigram word graphs. If we
apply the same procedure to expand trigram word graphs to four-gram
word graphs, we obtain expanded word graphs which carry four-gram
language model scores on their edges while keeping the GERs of trigram
word graphs, and so on. This is a very efficient way to decode the word
graph with a long-span language model.
5.4 N-Best Experiments
As shown in Chapter 4, once the word graph is generated, we can extract
the N−best sentence hypotheses directly from it. This N−best list can be
used for MAP decoding [Evermann, 2000] or rescored with long-span
language models [Tran, 1996]. In our applications, we used the N−best
list for machine translation, which gave promising results. This topic
will be discussed in the next chapter. The details of the N−best decoding
algorithm were described in Chapter 4.
Similarly to the previous section, we evaluated the N−best lists on the
BTEC and IBNC datasets by using the criterion called N−best word error
rate (see also Chapter 2). The N−best word error rate is calculated by
choosing, among the top N sentence hypotheses, either the sentence with
the minimum number of errors, which is the oracle (or best) case, or the
one with the maximum number of errors, which is the anti-oracle (or
worst) case. The N−best word error rate when N = ∞ is interpreted as the GER.
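The oracle and anti-oracle computation just described can be sketched as follows; the function names are illustrative, and the sketch scores each hypothesis with a standard word-level Levenshtein distance.

```python
def word_errors(hyp, ref):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i] + [0] * len(ref)
        for j, r in enumerate(ref, 1):
            cur[j] = min(prev[j] + 1,            # deletion from reference
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (h != r)) # match or substitution
        prev = cur
    return prev[len(ref)]

def oracle_antioracle_wer(nbest, ref):
    """Return (oracle, anti-oracle) WER in percent over an N-best list.

    nbest: list of hypotheses, each a list of words; ref: reference words.
    """
    errs = [word_errors(h, ref) for h in nbest]
    return 100.0 * min(errs) / len(ref), 100.0 * max(errs) / len(ref)
```

The oracle rate picks the best-matching hypothesis in the list, the anti-oracle the worst; as N grows the oracle rate approaches the GER of the underlying word graph.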
Table 5.16 presents the experimental results on the IBNC dataset.
The leftmost column of Table 5.16 contains the values of N, which
range from 5 to 1000. For each value of N, the N−best hypotheses are
ranked according to their scores. The second column, labeled WER-best,
is the N−best word error rate in the oracle case, while the third column,
labeled WER-worst, gives the anti-oracle case. The remaining columns
have the same meaning, but for bigram-based word graphs.
The top-most line in Table 5.16 shows the WER with N = ∞, which is
actually the GER of trigram word graphs on the IBNC dataset. As we can
see, with just the top 30 best sentences, we get a WER of 15.2%, compared
to the 18.0% WER of the 1−best word graph decoding. With N = 400,
the WER of 14.4% is very close to the 14.2% GER.
Table 5.16: IBNC: The N−best experiments.
Trigram-Based Bigram-Based
N WER (best) WER (worst) WER (best) WER (worst)
∞ 14.2 - 5.1 -
1000 14.4 37.9 10.2 53.1
500 14.4 37.6 10.4 49.0
400 14.4 37.4 10.4 47.6
300 14.5 37.2 10.5 46.2
200 14.6 36.8 10.6 43.9
100 14.8 35.6 11.1 40.2
50 14.9 32.0 11.7 36.8
30 15.2 29.7 12.5 34.5
20 15.6 27.4 13.0 32.4
10 16.1 23.9 14.2 29.2
5 17.0 20.8 16.6 24.7
The N-Different Best List Experiments
In this experiment, we investigate the effect of duplicates in the N−best
list. In the N−best decoding algorithm described in Chapter 4, all the
output sentences are ranked according to their scores. Therefore, it can
happen that two or more output sentences contain the same word
sequence; we refer to them as duplicates. Among them, it is essential to
keep just the one with the highest score and eliminate all the others.
Experimentally, on the BTEC dataset, it has been observed that among
the top 1000−best sentences, the maximum number of different sentences
is 532 while the minimum number is just 57.
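The deduplication step can be sketched as follows; the function name is illustrative, and the input is assumed to be already sorted by descending score, as produced by the N−best decoder.

```python
def n_different_best(nbest, n):
    """Keep only the first (highest-scoring) occurrence of each distinct
    word sequence, up to n entries.

    nbest: list of (sentence, score) pairs sorted by descending score.
    """
    seen, out = set(), []
    for sentence, score in nbest:
        key = tuple(sentence.split())
        if key not in seen:            # later duplicates have lower scores
            seen.add(key)
            out.append((sentence, score))
            if len(out) == n:
                break
    return out
```

Because the list is score-sorted, scanning once and keeping the first occurrence of each word sequence automatically keeps the highest-scoring duplicate.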
Fig 5.9 shows the N−best experiments on the BTEC dataset. On the
x−axis, N ranges from 5 to 50 and on the y−axis the N−best word
error rate is represented.
To illustrate the relationship between the N−different best sentences and
the N−best sentences, four lines are drawn in Fig 5.9. The solid line, which
[Plot omitted: WER (y-axis) vs. number of N-best sentences (x-axis), for N-different best and generic N-best lists, bigram and trigram cases.]
Figure 5.9: BTEC: N−different best sentences and N−best sentences vs. WER.
[Plot omitted: generation time (y-axis) vs. number of N-best sentences (x-axis).]
Figure 5.10: BTEC: N−different best sentences and N−best sentences vs. time.
is located at the lowest position in the figure, represents the N−different
best word error rate for bigram-based word graphs. Conversely, the highest
dotted line is the generic N−best error rate for trigram-based word graphs.
As we can see, the lines representing the N−different best sentences go
down faster than the lines of the generic N−best sentences. At the same
value of N = 50, for bigram-based word graphs, the WER on different
hypotheses is around 11.5% while the WER of the general case is around
16.0%. Fig. 5.10 shows the corresponding time required to generate the
sets of N−best sentences. Of course, generating the N−different best list
takes longer than generating the N−best list, since the duplicate sentences
have to be removed.
Moreover, eliminating the duplicates from the N−best list plays a crucial
role in machine translation. In fact, two source sentences that contain
the same word sequence are always translated into the same target,
regardless of their scores. So, with the N−different best list, the machine
translation system does not have to repeat useless translations.
Chapter 6
Speech Translation Experiments
In this chapter, we present speech translation experiments on the BTEC
dataset. The organization of this chapter is as follows. Firstly, the
machine translation system recently developed at ITC-irst is briefly
introduced. Secondly, a survey of the state of the art in speech translation,
with emphasis on the use of N−best lists and word graphs, is presented. The
core section is about our current efforts, which exploit N−best lists and
word graphs to improve translation quality. Specifically, the N−best lists
and word graphs have been used to optimize model parameters by a
parameter tuning scheme. Finally, detailed experimental results are given
in the last section.
6.1 ITC-irst Machine Translation System
The architecture of the ITC-irst statistical machine translation system at
run time is shown in Fig 6.1. After a preprocessing step, the sentence in
the source language is given as input to the decoder, which outputs the
best hypothesis in the target language; the actual translation is obtained
by a further post-processing step.
Preprocessing and post-processing consist of a sequence of actions aim-
ing at normalizing text and are applied both for preparing training data
and for managing the text to be translated. The same steps can be applied
to both source and target sentences, according to the language. Input
strings are tokenized and lowercased. Text is labeled with a few classes
including cardinal and ordinal numbers, week-day and month names, years
and percentages.
Parameters of the statistical translation model described in Section 2.3
can be divided into two groups: parameters of the basic phrase-based mod-
els and weights of their log-linear combination. Accordingly, the training
procedure of the system, shown in Figure 6.2, consists of two separate
phases.
• In the first phase, basic phrase-based models are estimated starting
from a parallel training corpus. After preprocessing, Viterbi align-
ments from source to target words, and vice-versa, are computed by
means of the GIZA++ toolkit [Och, 2000]. Phrase pairs are then ex-
tracted taking into account both direct and inverse alignments, and
finally phrase-based models are estimated.
• In the second phase, scaling factors of the log-linear model are esti-
mated by a minimum error training procedure. An iterative method
searches for a set of factors that minimizes a given error measure on
a development corpus. The simplex method is used to explore the
space of scaling factors. A detailed description of the minimum error
training approach is reported in [Cettolo, 2004].
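The minimum error training loop of the second phase can be illustrated with a toy implementation. The real system explores the space of scaling factors with the simplex method; the exhaustive grid below is only a didactic stand-in, and all names are illustrative.

```python
from itertools import product

def tune_scaling_factors(nbest_lists, references, grid):
    """Toy minimum error training: among a grid of candidate weight vectors,
    pick the one minimizing sentence errors on a development set.

    nbest_lists: per source sentence, a list of (translation, feature_vector).
    references: the reference translation for each source sentence.
    grid: candidate values per weight, e.g. [0.0, 0.5, 1.0].
    """
    n_feats = len(nbest_lists[0][0][1])
    best_weights, best_errors = None, None
    for weights in product(grid, repeat=n_feats):
        errors = 0
        for nbest, ref in zip(nbest_lists, references):
            # Rescore the N-best list under the candidate weights.
            top = max(nbest, key=lambda c: sum(w * f for w, f in zip(weights, c[1])))
            errors += (top[0] != ref)
        if best_errors is None or errors < best_errors:
            best_weights, best_errors = weights, errors
    return best_weights, best_errors
```

The structure mirrors the training loop of Figure 6.2: decode with candidate weights, evaluate against the development references, and let the search propose new weights, iterating until the error stops improving.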
6.2 N-Best List and Word Graphs for MT
Recently, experimental results reported in [Och, 2003a], [Shen, 04],
[Cettolo, 2004], [Zhang, 2004], [Ueffing, 2002] have shown that the use of
N−best lists and word graphs in machine translation, especially in speech
Figure 6.1: The architecture of the ITC-irst SMT system.
[Diagram omitted: run-time pipeline with source-side preprocessing, decoder, and target-side postprocessing; the decoder draws on phrase tables, word alignments, and model parameters (LM, phrase, distortion, fertility, and translation distributions).]
Figure 6.2: Training of the ITC-irst SMT system.
[Diagram omitted: Phase 1, phrase-based model training (word alignment, phrase extraction, parameter estimation from the preprocessed training set); Phase 2, minimum error training (decoder, evaluator, and simplex search for the scaling factors λ1–λ4 on the preprocessed development set).]
translation, can help to improve translation quality. Most of these works
are based on log-linear models whose parameters are trained and optimized
according to some given criterion by using N−best lists. In principle, they
can be summarized as follows:
• The N−best list approach. These systems usually involve at least
one of the two following phases:
– In the first phase, the recognizer generates either a list of
N−best hypotheses or just the best hypothesis in the source
language, using the algorithms mentioned in Chapter 3 and
Chapter 4.
– In the second phase, the N−best hypothesis list is given as input
to the text MT system. The MT system then produces a list of
N × M−best hypotheses in the target language as its output.
Some additional parameters are used in an additional module,
named rescore, in order to select the best translation hypothesis
from the N × M−best list.
• The word graph approach. Similarly to the above, these systems
require at least one of the following stages:
– In the first stage, the speech recognition system outputs a word
graph of hypotheses in the source language.
– In the second stage, this word graph is used as input to the
text MT system. The MT system then produces a word graph
of hypotheses in the target language as its output.
In the following subsections, we review current work on both of the
approaches mentioned above.
6.2.1 N-Best based Speech Translation
In [Och, 2003a], the authors directly modeled the posterior probability
P(e|f) by using a log-linear model, as shown in Section 2.3.1. There
is a set of F feature functions h_m(e, f), m = 1, ..., F. For each feature
function, there exists a model parameter λ_m, m = 1, ..., F. The translation
probability is:
P(e|f) = p_{\lambda_1^F}(e|f)   (6.1)
       = \frac{\exp[\sum_{m=1}^{F} \lambda_m h_m(e,f)]}{\sum_{e'} \exp[\sum_{m=1}^{F} \lambda_m h_m(e',f)]}   (6.2)
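Eq. 6.2 can be computed directly from the feature values; a minimal sketch with illustrative names is:

```python
import math

def loglinear_posterior(features_e, features_all, lambdas):
    """P(e|f) under the log-linear model of Eq. 6.2.

    features_e: feature values h_m(e, f) for the candidate of interest;
    features_all: feature vectors for every candidate e' (including e);
    lambdas: the model parameters lambda_m.
    """
    def score(h):
        return sum(l * v for l, v in zip(lambdas, h))
    numerator = math.exp(score(features_e))
    denominator = sum(math.exp(score(h)) for h in features_all)
    return numerator / denominator
```

The denominator normalizes over all competing candidates, so in practice it is approximated over an M−best list rather than the full search space.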
The modeling problem is how to choose suitable feature functions that
capture relevant properties of the translation task. The training problem
is how to obtain suitable parameter values λ_1^F. Parameter estimation was
performed by optimizing the error rate according to one of the following
criteria:
• mWER
• mPER
• BLEU score
• NIST score
Error minimization relied on the availability of a set of M−best candidate
translations for each input sentence, produced by the search algorithm
(note that we distinguish the M−best lists generated by text MT from the
N−best lists generated by speech recognition). During training, optimal
parameters were searched for by using Powell's procedure. Since the
M−best list can change significantly when the parameters are modified,
the procedure is iterated until the M−best list remains stable.
The authors report that, for a given error criterion used in training, they
obtained in most cases the best results when the same criterion was used
as the evaluation metric on the test data.
A similar approach for training the parameters λ_1^F was proposed
in [Cettolo, 2004], with two main differences compared to [Och, 2003a]:
• The simplex algorithm was used instead of Powell's algorithm.
• All the solutions explored during the search were exploited, instead
of just the M−best candidates.
Another way of using the M−best list in machine translation is given in
[Shen, 04], namely discriminative re-ranking for machine translation. In
this work, like the works mentioned above, the authors only experimented
with text translation, but in principle the proposed algorithm can be
applied to speech translation as well. Informally, the re-ranking approach
for machine translation is defined as follows. First, for each source sentence,
a baseline system generates M−best target sentence candidates. Features
that can potentially discriminate between good and bad translations are
extracted from these M−best candidates. These features are then used
Table 6.1: Results reported in [Shen, 04] comparing minimum error training with discriminative re-ranking (BLEU%).
Algorithm Baseline BestFeat FeatComb
Minimum Error 31.6 32.6 32.9
Splitting 31.7 32.8 32.6
to determine a new ranking for the M−best list, by using the so-called
splitting algorithm. The new top-ranked candidate in this M−best list
is the new best candidate translation. Formally, the splitting algorithm
searches for a linear function f(x) = w · x that successfully splits the top
R−ranked and bottom K−ranked translations of each sentence, where
K + R ≤ N, w being the weight vector and x the feature vector. The
algorithm is in fact a perceptron-like algorithm. Its idea is as follows. For
every two translations x_{i,j} and x_{i,l}, if:
• x_{i,j} is ranked in the top R, i.e. y_{i,j} ≤ R,
• x_{i,l} is ranked in the bottom K, i.e. y_{i,l} ≥ N − K + 1,
• the weight vector w cannot successfully separate x_{i,j} and x_{i,l}
with a learning margin τ, i.e. w · x_{i,j} < w · x_{i,l} + τ,
then w is updated by adding x_{i,j} − x_{i,l}. The updating is
finished when all the inconsistent pairs have been processed.
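The perceptron-like update rule just described can be sketched as follows; the function name, the flat list of (top, bottom) pairs, and the fixed epoch cap are illustrative simplifications of the algorithm in [Shen, 04].

```python
def splitting_update(pairs, tau, epochs=10):
    """Learn a weight vector w that scores top-R translations at least
    tau above bottom-K translations.

    pairs: list of (x_top, x_bottom) feature-vector pairs, one per
    (x_{i,j}, x_{i,l}) candidate pair subject to the margin condition.
    """
    dim = len(pairs[0][0])
    w = [0.0] * dim
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    for _ in range(epochs):
        updated = False
        for x_top, x_bot in pairs:
            if dot(w, x_top) < dot(w, x_bot) + tau:      # margin violated
                w = [wi + a - b for wi, a, b in zip(w, x_top, x_bot)]
                updated = True
        if not updated:   # every pair separated with margin tau
            break
    return w
```

Each violated pair pushes w toward the top-ranked candidate's features and away from the bottom-ranked one, exactly as in the standard perceptron update.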
The authors present the experimental results of this algorithm using
four different kinds of feature combination, as shown in Table 6.1. It is
clear that the improvements are not significant. We have also implemented
this algorithm; some initial results on the BTEC corpus, which are
consistent with those of the authors, are reported in Table 6.2.
The only work that used N × M−best lists for speech translation was
proposed in [Zhang, 2004]. In this paper, after introducing log-linear
models for speech translation, the authors presented a method for training
Table 6.2: Experiments of the splitting algorithm on BTEC data
Rank NIST BLEU MWER MPER MSER
baseline 9.7352 0.5208 35.3 30.2 75.7
Splitting 9.7832 0.5309 34.5 30.0 74.4
and optimizing parameters which is very similar to that of [Och, 2003a].
The details of the method are given below.
The log-linear model used to translate an acoustic vector X (see
Section 2.3.3) gives the following optimization criterion:

e^{*} = \arg\max_{e} \sum_{i=1}^{F} \lambda_i \log P_i(X, e)   (6.3)
A total of 12 features (F = 12), including 2 features from speech
recognition, 5 features from machine translation and 5 additional features,
has been used for the experiments.
Powell's algorithm was used to optimize the model parameters λ_1^F,
based on different objective translation metrics. The authors used the four
metrics mentioned in Chapter 3:
• BLEU score
• NIST score
• mWER
• mPER
With the optimization scheme mentioned above, the authors built four
log-linear models in order to quantify the translation improvement given
by features from speech recognition and machine translation respectively.
To optimize the λ parameters of the log-linear models, they used
development data consisting of 510 speech utterances and adopted an
M−best hypothesis approach [Och, 2003a] to train λ. The experimental
results reported in this work show that:
• Models with optimized parameters performed better than models with
uniform parameters.
• Translation performance with N−best recognition is better than with
single-best recognition.
• Translation performance improves when more features are incorpo-
rated into the log-linear model.
6.2.2 Word Graph-based Speech Translation
As we have shown, the word graph offers a very compact way of
representing competing hypotheses. From the word graph, we can also
extract the exact N−best list for further experiments (see Chapter 3),
whereas it is difficult to keep an explicit M−best list for a large value of M.
In [Ueffing, 2002], the authors proposed a method for generating word
graphs in a statistical text MT decoder. With just a small modification of
the beam search, a word graph of candidate translation sentences can be
obtained, given the source sentence f. The details of this method are given
as follows.
Word Graph Structure
During the search in statistical machine translation system, a bookkeeping
tree is kept, with the following information:
• the output target word, e_i,
• the covered source sentence position, j,
• a backpointer to the preceding bookkeeping entry.
After the search has finished, the best sentence is found by tracing back
through this bookkeeping tree. If we want to generate a word graph, we
have to store both alternatives in the bookkeeping structure when two
hypotheses are recombined. Thus an entry in the bookkeeping structure may
have several backpointers to different preceding entries. For ease of
maintenance, the word graph is defined with nodes and edges containing the
following information:
• node: the last covered source sentence position j.
• edge:
– the target word ei,
– the probabilities according to the different models: the language
model and the translation sub-models,
– the backpointer to the preceding bookkeeping entry.
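Following the description above, the bookkeeping and word graph structures of [Ueffing, 2002] might be sketched as follows. This is a minimal illustration: the field names are my own, and the translation sub-model scores are collapsed into a single value:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    word: str          # target word e_i
    lm_logprob: float  # language model score
    tm_logprob: float  # translation sub-model scores (collapsed here)
    origin: "Node"     # backpointer to the preceding bookkeeping entry

@dataclass
class Node:
    position: int      # last covered source sentence position j
    incoming: list = field(default_factory=list)

def recombine(node, edge):
    """When two hypotheses are recombined, keep both alternatives:
    the node simply collects another incoming edge."""
    node.incoming.append(edge)

start = Node(position=0)
n1 = Node(position=1)
recombine(n1, Edge("hello", -1.2, -0.7, start))
recombine(n1, Edge("hi", -1.5, -0.4, start))
print(len(n1.incoming))  # prints 2: two alternatives kept for position 1
```

Keeping every recombined alternative as an extra incoming edge is exactly what turns the bookkeeping tree into a graph.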
Word Graph Pruning
After pruning in the beam search, hypotheses that are no longer active do
not have to be kept in the bookkeeping structure, which reduces its size
significantly.
The generated word graph can be further pruned by using the beam-search
concept, in a way very similar to the one mentioned in Chapter 3.
Specifically, the probability of the best sentence in the word graph is
determined first. Then, all hypotheses in the graph whose probabilities are
lower than this maximum probability multiplied by a pruning threshold t,
0 ≤ t ≤ 1, are discarded.
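The pruning rule above can be sketched as follows. This is a toy illustration over sentence probabilities; real implementations typically work with log-probabilities:

```python
def prune_wordgraph(hyp_probs, t):
    """Beam-style word graph pruning: keep only hypotheses whose
    probability is at least t times the best probability (0 <= t <= 1)."""
    assert 0.0 <= t <= 1.0
    p_max = max(hyp_probs.values())  # probability of the best sentence
    return {h: p for h, p in hyp_probs.items() if p >= t * p_max}

hyps = {"h1": 0.50, "h2": 0.20, "h3": 0.04}
print(sorted(prune_wordgraph(hyps, 0.1)))  # prints ['h1', 'h2']
```

With t = 0.1, the cutoff is 0.05, so "h3" is discarded; t = 0 keeps everything and t = 1 keeps only the best hypothesis.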
6.3 ITC-irst Works
The use of N ×M−best lists for speech translation at ITC-irst was illus-
trated in Fig 1.1. As we mentioned in Chapter 1, there are two modules
for the text translation component, namely the N−best MT and the con-
fusion network MT. Both translation modules produce word graphs, which
contain alternative translation candidates. The best translation hypothesis
can be obtained by finding the highest-scoring path in the word graph,
using the algorithm described in Chapter 3.
6.3.1 System Parameter Tuning
Similarly to [Och, 2003a], [Shen, 04], [Cettolo, 2004] and [Zhang, 2004],
the objective here is to find the optimal parameter values {λi} on the
development set. These values are then used to evaluate the performance of the
speech translation system on the test set by using the criteria mentioned
in Section 2.3.4.
Figure 6.3: The estimation of parameters for speech recognition (the first stage).
Figures 6.3, 6.4 and 6.5 illustrate the scheme of our parameter tuning
method, which includes the following three stages:
• Estimation of parameters for speech recognition, λASR.
Figure 6.4: The estimation of parameters for machine translation (the second stage).
• Estimation of parameters for machine translation, λMT.
• Estimation of parameters for the whole system, λSYS.
In total, 8 parameters are used by the whole system. The first two,
{λ1, λ2}, are for the ASR component, corresponding to the acoustic model
and the source language model weights respectively. The remaining six
parameters, {λ3, ..., λ8}, are for the MT component, corresponding to the
lexicon (1), fertility (2), distortion (2) and target language model (1)
scores.
As shown in Fig 6.3 and Fig 6.4, the first two stages use a similar scheme.
Specifically, their objective is to find the optimal parameters for the
speech recognition and text machine translation components on the
development sets. Those values are then used as the initial parameters for
the last stage, which searches for the optimum parameters of the whole
system.

Figure 6.5: The whole system parameter tuning (the third stage).

The details of the second stage, which optimizes the {λ3, ..., λ8}
parameters of the log-linear models on the development set, are given in
the MT-tuning procedure. Using the initial values {λ3, ..., λ8} (at line
1), the text MT module translates each source sentence seni of the
development set Ds (see line 3) into its best translation hypothesis hypi
in the target language. The optimize step uses the simplex method to
optimize the parameters according to the BLEU score, given the best
translation hypotheses and the reference translations in the target
language. The procedure is iterated until the optimal values of the
parameters, λ*MT, are found (see also Fig 6.4).
MT-tuning
1  Initialize λ3, ..., λ8
2  repeat
3      for each seni ∈ Ds
4          do hypi ← Translate(seni, {λ3, ..., λ8});
5      λ*MT ← optimize({λ3, ..., λ8}, refs);
6  until convergence;
7  return λ*MT

In general, we could apply the above procedure to the first stage as well,
to estimate the speech recognition parameters {λ1, λ2}, with the
optimization subject to the WER. For simplicity, however, we chose the
values of {λ1, λ2} empirically, by running the speech recognizer with
different language model weights and taking the values (denoted as λ*ASR
in Fig 6.3) for which it achieves the lowest WER.
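The MT-tuning loop can be sketched in code. For self-containment, the simplex optimizer is replaced here by an exhaustive search over a small weight grid, and Translate and the evaluator are toy stand-ins; none of this is the ITC-irst implementation:

```python
import itertools

def tune_lambdas(translate, evaluate, dev_set, refs,
                 grid=(0.5, 1.0, 2.0), n_params=2):
    """Simplified stand-in for the MT-tuning loop: instead of a simplex
    step, try every weight combination from a small grid and keep the one
    whose 1-best translations score highest on the development set."""
    best_lams, best_score = None, float("-inf")
    for lams in itertools.product(grid, repeat=n_params):
        hyps = [translate(sen, lams) for sen in dev_set]  # lines 3-4
        score = evaluate(hyps, refs)
        if score > best_score:                            # "optimize" step
            best_lams, best_score = lams, score
    return best_lams, best_score

# Toy components: each dev "sentence" is an N-best list of
# (translation, [log feature scores]); evaluate counts exact matches.
def translate(nbest, lams):
    return max(nbest, key=lambda e: sum(l * h for l, h in zip(lams, e[1])))[0]

def evaluate(hyps, refs):
    return sum(h == r for h, r in zip(hyps, refs))

dev_set = [[("good", [-1.0, -3.0]), ("bad", [-3.0, -1.0])]]
refs = ["good"]
lams, score = tune_lambdas(translate, evaluate, dev_set, refs)
print(score)  # prints 1: the reference translation is recovered
```

The simplex method used in the thesis explores the weight space far more efficiently than this grid, but the loop structure, translate with current weights, evaluate, update, is the same.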
Finally, in the last stage the whole-system parameter tuning is applied.
Concretely, this stage estimates the {λ1, ..., λ8} parameters on the
development set and uses the estimated values to rescore the best
translations on the test set. The detailed implementation of this stage is
given in the sys-para-tuning procedure.
sys-para-tuning
1  N ×M−list ← text-MT(N−best list, λ*MT);
2  repeat
3      {hypi} ← re-ranking(N ×M−list, λ*ASR, λ*MT);
4      BLEU ← evaluator({hypi}, ref. trans.);
5      λ*SYS ← simplex-step(BLEU);
6  until convergence;
7  return λ*SYS
As also illustrated in Fig 6.5, the re-ranking takes the N ×M−best list
and the previously estimated values {λ1, ..., λ8} as its inputs and
rescores the translation hypotheses. Next, the evaluator computes the
BLEU score of the best translation hypotheses against the reference
translations. Finally, the simplex-step returns a new set of parameters
{λ′}. The loop finishes when the final optimal values of the parameters,
{λ*}, are found.
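The re-ranking step can be sketched as follows, assuming each candidate in the N ×M−best list carries the ASR feature scores of its source hypothesis together with its own MT feature scores. The feature layout and values here are a toy illustration, not the ITC-irst implementation:

```python
def rerank(nxm_list, lam_asr, lam_mt):
    """Re-rank an N x M-best list: each candidate carries the ASR feature
    scores of the recognition hypothesis it was translated from plus its
    own MT feature scores; the winner maximizes the joint log-linear score."""
    def score(cand):
        _, asr_feats, mt_feats = cand
        return (sum(l * h for l, h in zip(lam_asr, asr_feats)) +
                sum(l * h for l, h in zip(lam_mt, mt_feats)))
    return max(nxm_list, key=score)

# Toy 2 x 2 list: (translation, [AM, source LM], [TM, target LM]) log scores;
# the first two candidates come from ASR hypothesis a, the others from b.
candidates = [
    ("trans a1", [-5.0, -2.0], [-4.0, -1.0]),
    ("trans a2", [-5.0, -2.0], [-3.0, -2.5]),
    ("trans b1", [-6.0, -1.5], [-2.0, -1.5]),
    ("trans b2", [-6.0, -1.5], [-5.0, -0.5]),
]
best = rerank(candidates, lam_asr=[1.0, 1.0], lam_mt=[1.0, 1.0])
print(best[0])  # prints "trans b1"
```

Note that the winning translation does not come from the best ASR hypothesis: combining recognition and translation scores is precisely what the N ×M−best interface makes possible.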
In the following subsections, we present the experimental results of our
work following the scheme described above. Specifically, the first
subsection describes the data sets, including both the development and test
sets used for tuning and testing the whole system. The next subsection
presents the results of the first two stages on the development sets.
Finally, the results of parameter tuning on the development and test sets
are given in the last subsection.
6.3.2 New BTEC Test and Development Sets
In Chapter 5, we described the BTEC test set, which contains 506 sentences
in Italian, recorded by 10 speakers (see Table 5.1). In these experiments,
a larger test set was used:
• 3006 sentences with one reference each;
• 10 speakers;
• in addition, 50 sentences from each speaker for the development set.
Table 6.3 shows statistics of the new data sets.

              #sent.  W      |V|   #spk        speech (min)
dev set       500     3961   953   10 (5f+5m)  34.0
old test set  506     2985   940   10 (5f+5m)  28.7
added         2500    20527  2410  10 (5f+5m)  176.5
new test set  3006    23512  2768  17 (8f+9m)  205.2

Table 6.3: The new BTEC test and development sets
6.3.3 The First Stage Results (for ASR)
As mentioned in the previous subsection, the first stage was carried out
empirically. Concretely, the development set was recognized using
different pre-defined language model weights.
Figure 6.6: WER vs. LM weight on the development set.
Fig 6.6 plots the WER on the development set as a function of the language
model weight. As we can see, the lowest WER, 20.93%, is obtained at a
weight value of 9.25. Using this weight to recognize the test set, we
obtain the WER reported in Table 6.4.
Test Set WER 95% Conf. Interval
3006 22.09 21.14 - 23.03
Table 6.4: WER of the speech recognition on the 3006−test set when the LM weight is
9.25.
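The empirical first stage then reduces to picking the weight with the lowest development-set WER. A trivial sketch, with numbers loosely read off Fig 6.6 for illustration only:

```python
# WERs (%) measured on the development set for several LM weights
# (illustrative values approximating Fig 6.6, not the exact measurements).
wer_by_weight = {7.0: 22.6, 8.0: 21.6, 9.25: 20.93, 10.0: 21.2, 11.0: 22.1}

# Pick the weight that minimizes WER on the development set.
best_weight = min(wer_by_weight, key=wer_by_weight.get)
print(best_weight, wer_by_weight[best_weight])  # prints 9.25 20.93
```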
6.3.4 The Second Stage Results (for Pure Text MT)
Experiments for the second stage were carried out by the following steps:
1. translation of the manual transcriptions of the test set using uniform
parameters (all λ's = 1).
2. parameter optimization on the manual transcriptions of the develop-
ment set.
3. translation of the manual transcriptions of the test set with optimized
parameters.
Method BLEU 95% Conf. Interval
Baseline 52.63 51.45 - 53.75
MT tuning 53.15 52.00 - 54.30
Table 6.5: The pure text translation results of the second stage (BLEU score)
The results of these steps are given in Table 6.5. As we can see, the
improvement is not statistically significant: the BLEU score of the MT
tuning, 53.15, lies inside the confidence interval of the baseline's BLEU
score, 51.45 − 53.75.
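The thesis does not state how these confidence intervals were computed; one common way to obtain such intervals for corpus-level scores is bootstrap resampling over per-sentence scores, sketched here with made-up data:

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=7):
    """Percentile bootstrap confidence interval for the mean of
    per-sentence scores (one common way, though not necessarily the one
    used for Table 6.5, to obtain such intervals)."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up per-sentence scores, for illustration only.
per_sentence = [0.52, 0.61, 0.48, 0.55, 0.50, 0.58, 0.47, 0.60, 0.53, 0.49]
lo, hi = bootstrap_ci(per_sentence)
print(round(lo, 3), round(hi, 3))
```

If a competing system's score falls inside this interval, as happens for the MT tuning result above, the difference cannot be called statistically significant.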
6.3.5 The Third Stage Results (for the Speech Translation)
In this final subsection, the results of the system parameter tuning, the
third stage, are presented. Concretely, it was carried out in the
following steps:
1. translate the test set output by the speech recognizer, using uniform
parameters. The translation results of this step are taken as the
baseline.
2. estimate the system parameters λ*SYS on the development set by using
the sys-para-tuning procedure.
3. produce the 100×100−best translation list for the test set and rescore
the best translations by using the estimated values from the previous
step.
Methods BLEU Conf. Interval
1asrbest (baseline) 39.66 38.49 - 40.79
sys.para.tuning 41.22 40.03 - 42.42
Table 6.6: The speech translation results of the third stage (BLEU score).
Table 6.6 reports the results of the third stage. The following
observations can be drawn:
• There is a large gap between the translation performance on the manual
transcripts and on the transcripts produced by the speech recognizer:
the BLEU scores of the two tasks are 52.63 and 39.66 respectively.
• System tuning gives a statistically significant improvement to speech
translation performance.
Chapter 7
Conclusions and Future Work
As discussed in the previous chapters, the main goal of this work was to
provide a new interface for speech translation in terms of word graphs,
N−best lists and confusion networks. With this new interface, machine
translation can exploit deeper and more extensive knowledge sources to
improve translation quality. In particular, the following aspects were
studied:
7.1 Efficiency and Quality of Word Graph Genera-
tion
The word graph construction algorithm, which was implemented in Chapter 3,
can be fully integrated in a general m−gram language model speech decoder.
Moreover, by using the m−gram language model state constraints for
optimizing the word boundaries, the algorithm yields better word boundaries
and enhanced capabilities for pruning the word graphs. Concretely, bigram
and trigram word graphs were generated and their results were reported in
Chapter 5. Furthermore, we have also implemented and evaluated various
pruning methods, namely the beam search, the forward-backward pruning, and
the node compression algorithm, in order to obtain quality word graphs. An
important capability is the word graph expansion, which can be used to
rescore the word graph with a higher-order language model.

For comparison with other work, Table 7.1 reports the word graph results of
[Ortmanns, 1997] on the NAB’94 task.

Beam-Width  WGD     NGD    BGD   GER
300         1476.2  181.8  19.7  4.2
150         1415.9  175.9  19.3  4.2
100         684.5   104.2  14.5  4.2
90          460.4   77.4   12.3  4.3
80          269.2   51.8   9.8   4.5
70          137.4   31.4   7.4   4.8
50          25.2    9.0    3.8   5.8

Table 7.1: Word graph results of [Ortmanns, 1997] on the NAB’94 task.
As our word graph experiments were carried out on different test sets (the
IBNC and BTEC test sets), we can only compare the relationship between
graph sizes, GERs and WERs. At a WGD value of 684.5, [Ortmanns, 1997]
reported a GER of 4.2 and a WER of 14.3. In our case, on the IBNC test
set, the WGD is 991.3 at a GER of 5.2 and a WER of 19.8, as shown in Table
5.6. This means that, even though the experiments were run on different
data sets and the WER of our speech recognizer was higher, the graph
qualities are comparable. Moreover, as shown in Table 5.12, our GERs for
the confusion networks are even lower than the GERs of [Ortmanns, 1997] on
the two data sets mentioned above: at a WGD value of 103.2, we achieve a
GER of 3.6, while the smallest GER reported in [Ortmanns, 1997] is 4.2, at
a WGD value of 684.5.
Finally, evaluating the algorithms on two different data sets, a large
vocabulary speech corpus (IBNC) and a spontaneous speech corpus (BTEC),
confirms that our algorithms work properly and efficiently.

7.2 Efficiency and Quality of Word Graph Decoding
Three word graph decoding algorithms, namely 1−best decoding, N−best
decoding and consensus decoding, have been developed and evaluated. Table
7.2 reports the word graph decoding results of [Mangu, 1999] on the DARPA
Hub-4 task.

Method     WER
1-best     33.1
Consensus  32.5

Table 7.2: Word graph decoding results (WER) of [Mangu, 1999] on the DARPA
Hub-4 task.

As noted in the previous section, it is difficult to compare results across
different tasks, except for the relationship between quantities.
Clearly, the WERs reported by [Mangu, 1999] on the DARPA Hub-4 task are
considerably higher than the WERs on the IBNC test set that we reported in
Table 5.3. However, consensus decoding consistently achieves a lower WER
than 1−best decoding on the same task: on the IBNC test set, we obtain
WERs of 17.8% and 18.0% for consensus decoding and 1−best decoding
respectively, while the corresponding results of Mangu's work are 32.5%
and 33.1%.
Moreover, the capability of generating N−best lists and confusion net-
works from word graphs has played a crucial role in our speech translation
system. As presented in Chapter 6, the use of word graphs for speech
translation yields significant improvements in translation quality. The
N−best decoding algorithms, which were presented in Chapter 4 and
evaluated on both the BTEC and IBNC datasets, are robust and efficient.
They run much faster than real time and produce N−best lists in a general
format that can be used by other groups.
7.3 New Results on the System Parameter Tuning
Finally, in Chapter 6 we reported the application of word graphs to
improving speech translation quality. This is an extension of our previous
work [Cettolo, 2004]. Specifically, by exploiting word graph generation
and word graph decoding, we were able to produce the final N ×M-best
translation candidates as the output of the speech translation system. The
N ×M-best list was used in a parameter tuning scheme, which we proposed,
to optimize the system parameters. To evaluate the new tuning scheme, an
extension of the BTEC corpus was recorded and transcribed, comprising the
500-sentence development set and the 3006-sentence test set. The
experiments reported in Chapter 6 showed significant improvements in
translation quality compared to the baseline.
7.4 Future Work
There are several directions that should be investigated more carefully in
the future.
• Efficient word graph rescoring with higher-order language models. This
can be easily achieved in two steps:
– first, the word graph expansion algorithm is applied to expand bigram
or trigram word graphs into four-gram or even higher-order word graphs;
– second, the word graph decoding algorithms are used to rescore the
expanded word graphs. The algorithms that we proposed and implemented
can be naturally adopted for this work.
• Estimation and use of confidence measures for speech recognition. The
quantities derived from confusion networks have very nice characteristics.
Specifically, the confusion network defines a total ordering among words
and makes it possible to associate posterior probabilities with single
words. Within the most probable path, these probabilities can be
interpreted as confidence scores. The availability of reliable confidence
scores is of paramount importance for many applications that make use of
automatic transcripts, e.g. content-based indexing, text summarization,
text classification, information extraction, etc.
• Experimenting with the system parameter tuning scheme and additional
features. Currently, our tuning scheme only works with the 8 features used
inside the whole system, as described in Chapter 6. In addition to these 8
features, several other features could be integrated into the log-linear
models, such as part-of-speech language models, the length model, the jump
weight, the maximum entropy alignment model, the example matching score,
the dynamic example matching score, etc. We expect that adding these
features to the log-linear models and then applying the system parameter
tuning to them can further improve speech translation quality.
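As a sketch of the confidence-measure direction above, assume a confusion network is represented as a list of slots, each a word-to-posterior dictionary (a toy encoding of my own, not the thesis data structure). The consensus hypothesis and its word-level confidence scores could then be extracted as:

```python
def consensus_with_confidence(confusion_network):
    """A confusion network is a sequence of slots, each mapping candidate
    words to posterior probabilities. The consensus hypothesis picks the
    most probable word per slot; its posterior serves as a confidence score."""
    hypothesis = []
    for slot in confusion_network:
        word = max(slot, key=slot.get)   # most probable word in this slot
        hypothesis.append((word, slot[word]))
    return hypothesis

# Toy two-slot network; "*DELETE*" marks the empty-word alternative.
cn = [
    {"the": 0.9, "a": 0.1},
    {"cat": 0.6, "hat": 0.3, "*DELETE*": 0.1},
]
print(consensus_with_confidence(cn))  # prints [('the', 0.9), ('cat', 0.6)]
```

A downstream application such as content-based indexing could then, for example, discard words whose confidence falls below a chosen threshold.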
Bibliography
[Amtrup, 1996] J. W. Amtrup, H. Heine, U. Jost, “What's in a
Word Graph: Evaluation and Enhancement of Word
Lattices”, Technical Report, University of Hamburg,
1996.
[Antoniol, 1995] G. Antoniol, F. Brugnara, M. Cettolo and M. Fed-
erico. “Language Model Representations for Beam-
Search Decoding”. Proceedings of ICASSP 95, Inter-
national Conference on Acoustics, Speech and Signal
Processing, Detroit, USA, pp.588-591, 8th-12th May
1995.
[Aubert, 1995] X. Aubert and H. Ney, “Large vocabulary continu-
ous speech recognition using word graphs”, in Pro-
ceedings IEEE International Conference on Acous-
tics, Speech, and Signal Processing 1995, Detroit,
MI, USA, May 1995, vol. 1, pp. 49-52.
[Bertoldi, 2004] Bertoldi, Nicola; Cattoni, Roldano; Cettolo, Mauro;
Federico, Marcello. “The ITC-irst Statistical Ma-
chine Translation System for IWSLT-2004”. In Pro-
ceedings of the International Workshop on Spoken
Language Translation (IWSLT). pp. 51-58. Septem-
ber, 2004. Kyoto, Japan.
[Brown, 1993] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra,
and R. L. Mercer, “The Mathematics of Statistical
Machine Translation: Parameter Estimation,” Com-
putational Linguistics, vol. 19, no. 2, pp. 263–313,
1993.
[Berger, 1996] A. Berger, S. Della Pietra, and V. Della Pietra, “A
Maximum Entropy Approach to Natural Language
Processing,” Computational Linguistics, vol. 22,
no. 1, pp. 39–71, 1996.
[Brugnara, 2000] F. Brugnara, M. Cettolo, M. Federico, D. Giuliani,
“A Baseline for the Transcription of Italian Broad-
cast News”, Proceedings IEEE International Con-
ference on Acoustics, Speech, and Signal Processing
2000, Istanbul, Turkey, June 2000.
[Chase, 1997] L. Chase, “Word and acoustic confidence annotation
for large vocabulary speech recognition”, in Proceed-
ings ISCA European Conference on Speech Com-
munication and Technology 1997, Rhodes, Greece,
September 1997, vol. 2, pp. 815-818.
[Cettolo, 2004] M. Cettolo, M.Federico, “Minimum Error Train-
ing of Log-Linear Translation Model”, In Proceed-
ings of the International Workshop on Spoken Lan-
guage Translation (IWSLT), September, 2004. Ky-
oto, Japan.
[De Mori, 1998] R. de Mori et al., “Spoken Dialogues with Comput-
ers”, Academic Press, San Diego, CA, USA, 1998.
[Darroch, 1972] J. Darroch and D. Ratcliff, “Generalized Iterative
Scaling for Log-Linear Models,” The Annals of Math-
ematical Statistics, vol. 43, no. 5, pp. 1470–1480,
1972.
[Pietra, 1997] S. Della Pietra, V. Della Pietra, and J. Lafferty, “In-
ducing features of random fields,” IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 19,
no. 4, pp. 380–393, 1997.
[Dempster, 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin,
“Maximum-likelihood from incomplete data via the
EM algorithm,” Journal of the Royal Statistical So-
ciety, B, vol. 39, pp. 1–38, 1977.
[Eppstein, 1998a] David Eppstein, “K shortest paths and other
‘k best’ problems”, http://www1.ics.uci.edu/~eppstein/bibs/kpath.bib.
[Eppstein, 1998b] David Eppstein, ”Finding the k shortest paths,”
SIAM J.Computing, vol. 28, no. 2, pp. 652-673, 1998.
[Evermann, 1999] G. Evermann, “Minimum Word Error Rate Decoding”,
MPhil thesis in Computer Speech and Language
Processing, University of Cambridge, 1999.
[Evermann, 2000] G. Evermann and P. C. Woodland, “Large vocabulary
decoding and confidence estimation using word
posterior probabilities”, in Proceedings IEEE Inter-
national Conference on Acoustics, Speech, and Sig-
nal Processing 2000, Istanbul, Turkey, June 2000,
vol. 3, pp. 1655-1658.
[Fetter, 1996] P. Fetter, F. Dandurand, and P. Regel-Brietzmann,
“Word graph rescoring using confidence measures”,
in Proceedings International Conference on Spoken
Language Processing 1996, Philadelphia, PA, USA,
October 1996, vol. 1, pp. 10-13.
[Federico, 2000] M. Federico, ”A Baseline System for the Retrieval
of Italian Broadcast News”, Speech Communication,
Special Issue on ”Accessing Information in Spoken
Audio”, 32:37-47, 2000.
[Federico, 1995] M. Federico, M. Cettolo, F. Brugnara and G. An-
toniol, ”Language Modelling for Efficient Beam-
Search”, Computer Speech and Language, 9:353-379,
1995.
[Goel, 1999] V. Goel and W. Byrne, “Task dependent loss func-
tions in speech recognition: A* search over recogni-
tion lattices”, in Proceedings ISCA European Con-
ference on Speech Communication and Technology
1999, Budapest, Hungary, September 1999, vol. 3,
pp. 1243-1246.
[Jelinek, 1998] F. Jelinek, “Statistical Methods for Speech Recogni-
tion”, The MIT Press, 1998.
[Johnson, 2000] T. Johnson, “Incorporating prosodic information
and language structure into speech recognition sys-
tems”, Ph.D. Thesis, Purdue University , 2000.
[Kemp, 1997] T. Kemp and T. Schaaf, “Estimating confidence us-
ing word lattices”, in Proceedings ISCA European
Conference on Speech Communication and Technol-
ogy 1997, Rhodes, Greece, September 1997, vol. 2,
pp. 827-830.
[Lee, 1989] C. H. Lee and L. R. Rabiner, “ Frame synchronous
network search algorithm for connected word recog-
nition”, IEEE Transactions on Acoustics, Speech,
and Signal Processing, vol. 27, no. 11, pp. 1649-1658,
November 1989.
[Lee, 1995] C. H. Lee, F. K. Soong, and K. K. Paliwal, editors,
“Automatic Speech and Speaker Recognition, Ad-
vanced Topics”, pages 1-30. Kluwer Academic Pub-
lishers, 1996.
[Mangu, 1999] L. Mangu, E. Brill, and A. Stolcke, “Finding con-
sensus among words: Lattice-based word error min-
imization”, in Proceedings ISCA European Con-
ference on Speech Communication and Technology
1999, Budapest, Hungary, September 1999, vol. 1,
pp. 495-498.
[Ney, 1994] H. Ney and X. Aubert, “Word graph algorithm
for large vocabulary continuous speech recognition”,
in Proceedings International Conference on Spo-
ken Language Processing 1994, Yokohama, Japan,
September 1994, vol. 3, pp. 1355-1358.
[Ney, 1993] H. Ney and M. Oerder, “Word graphs: An efficient
interface between continuous speech recognition and
language understanding”, in Proceedings IEEE In-
ternational Conference on Acoustics, Speech, and
Signal Processing 1993, Minneapolis, MN, USA,
April 1993, vol. 2, pp. 119-122.
[Ney, 1987] H. Ney, D. Mergel, A. Noll, and A. Paeseler, “Data-
driven organization of the dynamic programming
beam search for continuous speech recognition”,
in Proceedings IEEE International Conference on
Acoustics, Speech, and Signal Processing 1987, Dal-
las, TX, USA, April 1987, pp. 833-836.
[Ney, 1997] H. Ney, S. Ortmanns, and I. Lindam, “Extensions to
the word graph method for large vocabulary continu-
ous speech recognition”, in Proceedings IEEE Inter-
national Conference on Acoustics, Speech, and Sig-
nal Processing 1997, Munich, Germany, April 1997,
vol. 4, pp. 1787-1790.
[Neukirchen,2001] C. Neukirchen, D. Klakow, X. Aubert, “Generation
and expansion of Word Graph using Long Span Con-
text Information”, ICASSP 2001, pp. 41-44.
[Noord, 2001] G. van Noord., “Robust Parsing of Word Graphs”,
Robustness in Language and Speech Processing,
Kluwer Academic Publishers, 2001.
[Och, 2000] F. J. Och and H. Ney, “Improved statistical align-
ment models,” in Proc. of the 38th Annual Meet-
ing of the Association for Computational Linguistics,
Hong Kong, China, October 2000, pp. 440–447.
[Och, 2002] F. Och and H. Ney, “Discriminative training and
maximum entropy models for statistical machine
translation,” in ACL02: Proc. of the 40th Annual
Meeting of the Association for Computational Lin-
guistics, PA, Philadelphia, 2002, pp. 295–302.
[Och, 2003] F. J. Och and H. Ney, “A systematic comparison of
various statistical alignment models,” Computational
Linguistics, vol. 29, no. 1, pp. 19–51, 2003.
[Och, 2003a] F.J. Och, ”Minimum Error Rate Training in Statis-
tical Machine Translation”, In Proc. of ACL’2003,
pages 160-167.
[Odell, 1995] J. J. Odell, “The Use of Context in Large Vocab-
ulary Speech Recognition”, Ph.D. thesis, University
of Cambridge, Cambridge, UK, 1995.
[Ortmanns, 1997] S. Ortmanns, H. Ney, and X. Aubert, “A word graph
algorithm for large vocabulary continuous speech
recognition”, Computer, Speech, and Language, vol.
11, no. 1, pp. 43-72, January 1997.
[Paul,2004] M. Paul, H.Nakaiwa, M.Federico, “Towards Innova-
tive Evaluation Methodologies for Speech Transla-
tion”. In Working notes of the NTCIR-4 2004 Meet-
ing. pp. 17-21. 2004. Tokyo.
[Rabiner, 1993] L. Rabiner and B. H. Juang, “Fundamentals of
Speech Recognition”, Prentice Hall Publishers, 1993.
[Schwartz, 1990] R. Schwartz, Y.L. Chow, “The N-Best Algorithm:
An Efficient and Exact Procedure for Finding the
N Most Likely Sentence Hypotheses” in Proceedings
IEEE International Conference on Acoustics, Speech
and Signal Processing 1990, pp. 81-84, Albuquerque,
April 1990.
[Schwartz, 1991] R. Schwartz and S. Austin, “A comparison of several
approximate algorithms for finding multiple (N-best)
sentence hypotheses”, in Proceedings IEEE Interna-
tional Conference on Acoustics, Speech, and Signal
Processing 1991, Toronto, Canada, May 1991, vol. 1,
pp. 701-704.
[Sixtus, 2001] A. Sixtus and H. Ney, “From within-word model
search to acrossword model search in large vocab-
ulary continuous speech recognition”, submitted to
Computer, Speech, and Language, 2001.
[Sixtus, 1999] A. Sixtus and S. Ortmanns, “High quality of word
graphs using forward - backward pruning”, ICASSP
1999, pp. 593-596.
[Shen, 04] L. Shen, A. Sarkar, F. J. Och, “Discriminative Rerank-
ing for Machine Translation”, in Proceedings of HLT-NAACL, 2004.
[Soong, 1991] F.K. Soong, E.F. Huang: “A Tree Trellis Based Fast
Search for Finding the N Best Sentence Hypothesis
in Continuous Speech Recognition”, in Proceedings
IEEE International Conference on Acoustics, Speech
and Signal Processing 1991, pp. 705–708, Toronto,
May 1991.
[Stolcke, 1997] A. Stolcke, Y. Konig, and M. Weintraub, “Explicit
word error rate minimization in N-best list rescor-
ing”, in Proceedings ISCA European Conference
on Speech Communication and Technology 1997,
Rhodes, Greece, September 1997, vol. 2, pp. 163-166.
[Stolcke, 2002] A. Stolcke, “SRILM - an extensible language model-
ing toolkit”, ICSLP 2002.
[Thomas, 1984] Thomas H. Byers, Michael S. Waterman, “Determin-
ing All Optimal and Near-Optimal Solutions when
Solving Shortest Path Problems by Dynamic Pro-
gramming”, Operations Research, Vol. 32, No. 6,
Nov-Dec, 1984.
[Thomas, 2001] Thomas H. Cormen, “Introduction to Algorithms”,
MIT Press, 2001.
[Tran, 1996] B. H. Tran, F. Seide, V. Steinbiss, “A word graph
based N-Best search in continuous speech recogni-
tion”, in Proceedings of ICSLP 1996.
[Valtchev, 1997] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J.
Young, “MMIE training of large vocabulary recogni-
tion systems”, EURASIP/ISCA Speech Communi-
cation, vol. 22, no. 4, pp. 303-314, September 1997.
[Weng, 1999] F. Weng, A. Stolcke, A. Sankar, “Efficient Lattice
Representation and Generation”, in Proceedings of
ICSLP, Sydney, Australia, 1998.
[Wessel, 2002] F. Wessel, “Word Posterior Probabilities for Large
Vocabulary Continuous Speech Recognition”, Ph.D.
thesis, Aachen University of Technology, 2002.
[Ueffing, 2002] N. Ueffing, F.J. Och, H. Ney. ”Generation of Word
Graphs in Statistical Machine Translation”. In Proc.
Conference on Empirical Methods for Natural Lan-
guage Processing, pp. 156-163, Philadelphia, PA,
July 2002.
[Zhang, 2004] R. Zhang et al., “A Unified Approach in Speech-to-
Speech Translation: Integrating Features of Speech
Recognition and Machine Translation”, in Proceedings of COLING, 2004.