PhD Dissertation
International Doctorate School in Information and Communication Technologies
DIT - University of Trento
Applications of Word Graphs
In Spoken Language Processing
Vu Hai Quan
Advisor:
Dr Marcello Federico
ITC-irst, Centro per la Ricerca Scientifica e Tecnologica
February 2005
Abstract
This work explores the application of word graphs to spoken language processing, in particular to both automatic speech recognition and speech translation. For speech recognition, efficient algorithms for word graph generation, word graph expansion and word graph rescoring have been investigated within the ITC-irst large vocabulary system. Two domains are considered: the Italian Broadcast News Corpus (IBNC) and the Basic Traveling Expressions Corpus (BTEC). The first is a large vocabulary domain, while the second is a spontaneous-speech, limited vocabulary domain. To evaluate the quality of word graphs, various measures have been experimented with. Starting from the generated word graphs, I have further worked on confusion network construction and N−best list generation, which gave positive results compared with the baseline system. For the broadcast news domain, our best word error rate is 17.7%, which compares favorably with the baseline system word error rate of 18.0%. As in speech recognition, a word graph can also be generated as output by a machine translation algorithm. Unlike the speech decoder, the translation decoder cannot proceed synchronously with the input; however, partial theories examined during the search have an exact correspondence with those produced by a speech decoder. In my thesis, I extended word graph processing algorithms to word graphs generated by the ITC-irst statistical machine translation decoder, for instance to generate M−best lists of translations. The availability of the M−best list of translation candidates and the N−best list of hypotheses from speech recognition provides the system with richer additional knowledge sources, which can be exploited in further steps to improve system performance. In particular, the N × M best list has been used in a minimum error training scheme which improved translation quality. Specifically, the BLEU score improved from 39.66 to 41.22.
Keywords
statistical speech recognition, statistical speech translation, statistical ma-
chine translation, word graph, N−best list, confusion network, parameter
tuning.
Acknowledgment
I do not know how to express my love for Italy, the most beautiful country with its age-old culture, where I have lived and worked; for the Italians, the friendliest and kindest people in the world, whom I have met and with whom I have shared my life for three years; and for the University of Trento, specifically the International Doctorate School in Information and Communication Technologies, where I started my scientific research. They will surely stay deep in my heart for all my life.
I would like to thank Marcello Federico, my advisor, who showed me not only what a scientist should do but also how a gentleman should be. He is a true scholar, with deep knowledge of statistical models, especially in information retrieval, speech recognition and machine translation. His lectures and advice guided me to the place where I should begin my research. Without him, I would certainly not have been able to complete this study.
I also would like to thank Fabio Brugnara and Mauro Cettolo, senior researchers at ITC-irst. Fabio has had a great effect on my speech recognition education. Somehow, he knew the answer to all of my questions and has such a clear way of explaining them to me. Mauro Cettolo spent a lot of time reading and correcting this thesis. He also helped me by providing the speech translation results by which my work can be shown to have value.
I am indebted to Nicola, Vanessa, Roldano, Stemmer and my colleagues at ITC-irst for their support on all sorts of things.
I could not have enjoyed graduate student life as much as I did without my Vietnamese friends in Trento, especially Hoc, my roommate.
Finally, I would like to make my parents and my wife a present of my work.
Contents

1 Introduction
2 Speech Translation
  2.1 Speech Recognition
    2.1.1 Feature Extraction
    2.1.2 Acoustic Model
    2.1.3 Language Model
    2.1.4 Search
  2.2 Multiple-Pass Search
    2.2.1 N-Best Algorithms
    2.2.2 Word Graph Algorithm
    2.2.3 Consensus Decoding
  2.3 Statistical MT
    2.3.1 Log-linear Model
    2.3.2 Decoding
    2.3.3 Speech Translation
    2.3.4 Evaluation Criterion
3 Word Graph Generation
  3.1 Word Graph Definitions
    3.1.1 Word Graph Accessors
    3.1.2 Word Graph Properties
    3.1.3 Topological Ordering
  3.2 Word Graph Generation
    3.2.1 Hypothesis
    3.2.2 Best Predecessor
    3.2.3 Language Model State
    3.2.4 Algorithm
    3.2.5 Implementation Details
    3.2.6 Bigram and Trigram-Based Word Graphs
    3.2.7 Dead Path Removal
  3.3 WG Evaluation
    3.3.1 Word Graph Size
    3.3.2 Graph Word Error Rate
  3.4 Removing Empty Edges
    3.4.1 Algorithm
    3.4.2 Implementation Details
  3.5 FW-BW Pruning
    3.5.1 Edge Posterior Probability
    3.5.2 Forward-Backward Based Algorithm
    3.5.3 Implementation Details
    3.5.4 Forward-Backward Based Pruning
  3.6 Node Compression
  3.7 Word Graph Expansion
    3.7.1 Introduction
    3.7.2 Conventional Algorithm
    3.7.3 Compaction Algorithm
4 Word Graph Decoding
  4.1 1-Best WG Decoding
  4.2 N-Best Decoding
    4.2.1 The Stack-Based N-Best Word Graph Decoding
    4.2.2 Exact N-Best Decoding
  4.3 Consensus Decoding
    4.3.1 Word Posterior Probability
    4.3.2 Confusion Network Construction
    4.3.3 Pruning
    4.3.4 Confusion Network
    4.3.5 Consensus Decoding
5 Improvements of Speech Recognition
  5.1 ASR Experiments
  5.2 ASR System
    5.2.1 Segmentation and Clustering
    5.2.2 Acoustic Adaptation
    5.2.3 Speech Transcription
    5.2.4 Training and Testing Data
  5.3 Experimental Results
    5.3.1 Word Graph Decoding
    5.3.2 Impact of Beam-Width
    5.3.3 Language Model Factor Experiments
    5.3.4 Forward-Backward Based Pruning Experiments
    5.3.5 Node-Compression Experiments
    5.3.6 Word Graph Expansion Experiments
  5.4 N-Best Experiments
6 Speech Translation Experiments
  6.1 ITC-irst Machine Translation System
  6.2 N-Best and Word Graph
    6.2.1 N-Best Based Speech Translation
    6.2.2 Word Graph-Based Speech Translation
  6.3 ITC-irst Works
    6.3.1 System Parameter Tuning
    6.3.2 New BTEC Test and Development Sets
    6.3.3 The First Stage Results (for ASR)
    6.3.4 The Second Stage Results (for Pure Text MT)
    6.3.5 The Third Stage Results (for Speech Translation)
7 Conclusions and Future Works
  7.1 Efficient WG Generation
  7.2 Efficient WG Decoding
  7.3 Results on Parameter Tuning
  7.4 Future Works
Bibliography
List of Tables

3.1 A list of hypotheses output by the decoder at time frame t = 6, N_t = 7.
3.2 A list of hypotheses output by the decoder at time frame t = 3, N_t = 4.
4.1 Illustration of the stack-based N−best decoding.
5.1 Training and testing data for BTEC and IBNC.
5.2 BTEC: Word error rates with different rescoring methods.
5.3 IBNC: Word error rates with different rescoring methods.
5.4 BTEC: Costs of the decoder with different threshold values.
5.5 BTEC: Trigram-based graph word error rate.
5.6 BTEC: Bigram-based graph word error rate.
5.7 IBNC: Trigram-based graph word error rate.
5.8 IBNC: Bigram-based graph word error rate.
5.9 BTEC: Trigram-based confusion network word error rate.
5.10 BTEC: Bigram-based confusion network word error rate.
5.11 IBNC: Trigram-based confusion network word error rate.
5.12 IBNC: Bigram-based confusion network word error rate.
5.13 BTEC: Node compression experiments.
5.14 IBNC: Forward-backward pruning and node compression.
5.15 BTEC: Bigram-based word graph expansion experiments.
5.16 IBNC: The N−best experiments.
6.1 Results reported in [Shen, 04] comparing minimum error training with discriminative re-ranking (BLEU%).
6.2 Experiments of the splitting algorithm on BTEC data.
6.3 The new BTEC test and development sets.
6.4 WER of speech recognition on the 3006-test set when the LM weight is 9.25.
6.5 The pure text translation results of the second stage (BLEU score).
6.6 The speech translation results of the third stage (BLEU score).
7.1 Word graph results of [Ortmanns, 1997] on the NAB'94 task.
7.2 Word graph decoding results (WER) of [Mangu, 1999] on the DARPA Hub-4 task.
List of Figures

1.1 The ITC-irst Speech Translation System.
2.1 Source-Channel model of speech generation and recognition.
2.2 Source-Channel model of speech generation and recognition.
2.3 An example of Hidden Markov Model.
2.4 The construction of a word model by concatenating phoneme models.
2.5 The construction of a compound model for recognizing a sequence of numbers; from [De Mori, 1998], Chapter 5.
2.6 The multiple-pass search framework.
3.1 Bigram and trigram constraint word graphs.
3.2 The algorithm for counting the number of paths in a word graph.
3.3 A word graph with @BG edges.
3.4 A word graph with @BG edges removed.
3.5 A word graph with words placed on nodes.
3.6 Link posterior computation.
3.7 Illustration of the conventional word graph expansion algorithm.
3.8 Illustration of the compact word graph expansion where explicit trigram probability only exists for (w1, w4, w5).
4.1 A word graph example for the N−best stack-based decoding.
4.2 A word graph example for the exact N-best decoding.
4.3 The exact N-best decoding - Step 1: the initial FwSco and N-best tree.
4.4 The exact N-best decoding - Step 2: the N-best tree at step 2.
4.5 Time-dependent word posteriors.
4.6 A word graph example.
4.7 The resulting confusion network from the word graph in Fig 4.6.
5.1 Broadcast News Retrieval System.
5.2 BTEC: Time for generating word graphs vs. different threshold values.
5.3 BTEC: GER vs. different threshold values.
5.4 BTEC: Number of paths in word graphs vs. different threshold values.
5.5 BTEC: WER vs. different language model factors.
5.6 BTEC: Confusion network and its word graph representation.
5.7 IBNC: Confusion network size vs. word graph size.
5.8 IBNC: Consensus decoding word error rate vs. the beam-width.
5.9 BTEC: N different best sentences and N−best sentences vs. WER.
5.10 BTEC: N different best sentences and N−best sentences vs. time.
6.1 The architecture of the ITC-irst SMT system.
6.2 Training of the ITC-irst SMT system.
6.3 The estimation of parameters for speech recognition (the first stage).
6.4 The estimation of parameters for machine translation (the second stage).
6.5 The whole system parameter tuning (the first stage).
6.6 WER vs. LM weight on the development set.
Chapter 1
Introduction
From human prehistory to the new media of the future, speech communication has been and will be the dominant mode of human social bonding and information exchange. The spoken word is now extended through technological mediation such as telephony, movies, radio, television, and the Internet. Moreover, the demand for overcoming the barrier between different languages increases day by day. This trend reflects the primacy of spoken communication in human psychology. A spoken language system needs to have speech recognition, speech synthesis and speech translation capabilities. For all three components, significant challenges exist, including robustness, flexibility, ease of integration, and engineering efficiency. The goal of building commercially viable spoken language systems has long attracted the attention of scientists and engineers all over the world. The purpose of this work is to move a small step toward this goal. Concretely, we deal with some specific problems at the boundary between speech recognition and speech translation, aiming to improve the performance of speech translation.
In comparison with written language, speech, and especially spontaneous speech, poses additional difficulties for the task of automatic translation. Typically, these difficulties are caused by errors of the recognition component, which is carried out before the translation process. As a result, the sentence to be translated is not necessarily well-formed from a syntactic point of view. Even without recognition errors, speech translation has to cope with a lack of conventional syntactic structures, because the structures of spontaneous speech differ from those of written language. Recently, the statistical approach to machine translation has shown the potential to tackle these problems, for the following reasons. First, the statistical approach is able to avoid hard decisions at any level of the translation process. Second, for any source sentence, a translated sentence in the target language is guaranteed to be generated. In most cases, this will hopefully be a syntactically correct sentence in the target language; but even when it is not, the translated sentence will usually convey the meaning of the spoken sentence [Och, 2000].
Currently, statistical speech translation systems typically show a cascaded structure: speech recognition followed by machine translation. This structure, while explicit, lacks joint optimality in performance, since the speech recognition module and the translation module run rather independently. Moreover, the translation module of a speech translation system usually takes a single best recognition hypothesis, transcribed as text, and performs standard text-based translation. A lot of supplementary information available from speech recognition, such as N−best lists, word graphs, confusion networks and the likelihoods of the acoustic and language models, is not well utilized in the translation process. This kind of information can be effective for improving translation quality if employed properly [Zhang, 2004]. The main objective of this work is precisely to exploit these sources of information, providing a new interface for speech translation. Specifically, the results of the speech recognition process are represented by means of word graphs, N−best lists and confusion networks, which are subsequently used as the input of the machine translation process.
Figure 1.1: The ITC-irst Speech Translation System.
Fig. 1.1 illustrates the speech translation system currently developed at ITC-irst [Bertoldi, 2004], which can be virtually divided into two parts. On the left-hand side, beginning from the speech signal of the utterance, the automatic speech recognition (ASR) module produces a word graph that contains alternative recognition hypotheses. From the generated word graph, we can extract either the best hypothesis or the N−best list and pass it to the text machine translation module (text MT). Moreover, a confusion network can also be built from the word graph, with the notable property that it is a more compact representation and has a lower word error rate than the word graph itself. A special machine translation algorithm (confusion network MT) has recently been developed at ITC-irst for dealing with this kind of input, and has obtained promising results. Similarly, on the right-hand side, the output of machine translation is again a word graph, a compact representation of the translation hypotheses in the target language. Clearly, if we are just interested in the translation result, the best translation hypothesis can be extracted directly from the word graph. Additionally, the availability of word graph and N−best list outputs from machine translation allows us to adjust the parameters of the translation models or to rescore translation hypotheses with deeper and more extensive knowledge sources.
The thesis is organized as follows. In Chapter 2, we review the state of the art in speech translation in terms of speech recognition and speech translation. Specifically, a short introduction to the speech recognition components, including the acoustic model, language model and beam search, is given first. Then, multiple-pass search and methods for generating word graphs and N−best lists are reviewed. Finally, the last section covers statistical machine translation in terms of translation models, beam search and evaluation criteria.
Chapter 3 is about word graph generation for speech recognition, word graph evaluation, word graph pruning and word graph expansion. The key section of Chapter 3 is word graph generation, in which an efficient algorithm for constructing general m−gram word graphs is fully described. Moreover, three different word graph pruning techniques, namely beam search, forward-backward pruning and node compression, are sequentially presented. Finally, Chapter 3 ends with general m−gram word graph expansion, for which two algorithms, the conventional word graph expansion and the compact word graph expansion, are introduced.
Given the generated word graph, Chapter 4 presents word graph decoding algorithms, consisting of 1−best decoding, N−best decoding and consensus decoding. Note that N−best decoding works directly on word graphs output both by speech recognition and by machine translation.
In Chapter 5, experiments with word graphs for speech recognition on two datasets, namely the Italian Broadcast News Corpus and the Basic Traveling Expression Corpus, are described in detail. Using two datasets shows that our algorithms work for both large vocabulary and spontaneous speech tasks.
Results of using word graphs, N−best lists and confusion networks as inputs for machine translation are given in Chapter 6. Moreover, in Chapter 6 a new method for system parameter tuning, in which the N−best lists are used for optimizing model parameters, is presented. This new method gives significant improvements in translation performance compared to the baseline results.

Finally, we end with Chapter 7 by analyzing our results in conjunction with results from other research groups and highlighting some future work.
Chapter 2
Speech Translation
In this chapter, some of the theoretical foundations of speech recognition and machine translation based on the statistical approach are discussed. First, overviews of speech recognition and machine translation are given. Then, specific algorithms employed in the development of speech recognition and machine translation systems are described.
2.1 Speech Recognition
According to [Jelinek, 1998], in its most basic form, a speech recognizer is a device that automatically transcribes speech into text. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, as in spoken information retrieval, speech translation, etc.
As illustrated in Fig. 2.1, the person's mind decides the source word sequence W, which is delivered through his/her text generator. The source is passed through a noisy communication channel that consists of the speaker's vocal apparatus, producing the speech waveform, and of the speech signal processing component of the speech recognizer. Finally, the speech decoder aims to decode the representation X of the acoustic signal into a word sequence Ŵ which is hopefully close to the original word sequence W.

Figure 2.1: Source-Channel model of speech generation and recognition.
In fact, the recognizer is usually based on some finite vocabulary that
restricts the words that can be output. To discuss the problem of speech
recognition, we need its mathematical formulation.
Let X denote the acoustic evidence (data) on the basis of which the
recognizer will take its decision about which words were spoken. Without
loss of generality we may assume that X is a sequence of symbols taken
from some alphabet X :
X = x_1, x_2, ..., x_T ;   x_i ∈ X (2.1)
The symbol x_i could be thought of as having been generated in time, as indicated by the index i.
Let:
W = w_1, w_2, ..., w_N ;   w_i ∈ W (2.2)

denote a string of N words, each belonging to a fixed and known vocabulary W.
If P(W|X) denotes the probability that the word sequence W was spoken, given that the evidence X was observed, then the recognizer should decide in favor of a word string Ŵ satisfying:

Ŵ = arg max_W P(W|X) (2.3)
That is, the recognizer will pick the most likely word string given the observed acoustic evidence.
The well-known Bayes' formula of probability theory allows us to rewrite the right-hand side probability of Eq. 2.3 as

P(W|X) = P(W) · P(X|W) / P(X) (2.4)
where P (W) is the probability that the word sequence W will be ut-
tered, P (X|W) is the probability that when the speaker says W the acous-
tic evidence X will be observed, and P (X) is the average probability that
X will be observed, that is:
P(X) = Σ_{W′} P(W′) · P(X|W′) (2.5)
Since the maximization in Eq. 2.3 is carried out with the variable X
fixed, it follows from Eq. 2.3 and Eq. 2.4 that the recognizer’s aim is to
find the word sequence:
Ŵ = arg max_W P(W) · P(X|W) (2.6)
An added complication is the fact that the two factors in Eq. 2.6 have very different dynamic ranges. If they are simply multiplied as indicated in Eq. 2.6, the decision for a word sequence would be dominated by the acoustic scores, and the language model would have hardly any influence. To balance the probabilities, it is customary to use an exponential scaling factor for the language model, denoted by γ. Thus Eq. 2.6 can be written as follows:

Ŵ = arg max_W P(W)^γ · P(X|W) (2.7)
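In log-space, the decision rule of Eq. 2.7 becomes arg max_W [log P(X|W) + γ · log P(W)]. The toy sketch below, with hypothesis probabilities invented purely for illustration, shows how the language model factor rebalances the decision between two competing word strings:

```python
import math

def decide(hypotheses, lm_factor):
    """Pick the word string maximizing P(W)^gamma * P(X|W), computed
    in log-space as log P(X|W) + gamma * log P(W) (Eq. 2.7).

    `hypotheses` maps a word string to (acoustic_prob, lm_prob);
    the numbers below are invented for illustration.
    """
    return max(
        hypotheses,
        key=lambda w: math.log(hypotheses[w][0])
        + lm_factor * math.log(hypotheses[w][1]),
    )

hyps = {
    "recognize speech": (1e-40, 1e-3),    # (P(X|W), P(W))
    "wreck a nice beach": (2e-40, 1e-6),
}
# With no scaling, the acoustic score dominates the decision:
print(decide(hyps, lm_factor=0.0))   # wreck a nice beach
# With a typical language model factor, the LM prior wins out:
print(decide(hyps, lm_factor=8.0))   # recognize speech
```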
Fig. 2.2 shows the basic components of a typical speech recognition
system. Acoustic Models include the representation of knowledge about
acoustics, phonetics, environmental variability, gender and dialect differ-
ences among speakers etc. Language Models refer to a system’s knowledge
Figure 2.2: Source-Channel model of speech generation and recognition: the speech signal passes through feature extraction; a search algorithm combines the acoustic model (trained on acoustic data) and the language model (trained on a text corpus) to produce the recognized text.
about constituents of words, word co-occurrences and word sequences. In
the following section we will describe in detail the components of a speech
recognizer.
2.1.1 Feature Extraction
The aim of the feature extraction module is to parameterize the speech
waveform into a sequence of feature vectors which contain the relevant in-
formation about the utterance sounds. For any speech recognition system,
acoustic features should have the following properties:
• good discrimination in order to distinguish between similar speech
sounds;
• allowing the building of statistical models without the need for an
excessive amount of training data;
• having statistical properties which are invariant across speakers and
over a wide range of speaking environments.
Of course, there is no single feature set that possesses all the above
properties. Features used in speech recognition systems are largely derived
from their utility in speech analysis, speech coding and psycho-acoustics.
In statistical automatic speech recognition, the speech waveform is usu-
ally sampled at a rate between 6.6 kHz and 20 kHz and processed to pro-
duce a new representation as a sequence of vectors containing values which
are generally called parameters. These vectors typically comprise between 10 and 20 parameters, and are usually computed every 10 or 20 msec.
Parameter values are used in the estimation of the probability that the
portion of waveform under analysis is a particular acoustic phenomenon.
Currently, implementations of feature extraction typically include:
• Short-Time Spectral Features: Most speech recognition systems use either discrete Fourier transform (DFT) or linear predictive coding (LPC) spectral analysis methods based on fixed-size frames of windowed speech data, and extract spectral features, including LPC-derived features such as reflection coefficients, log area ratios, line spectral frequencies, composite sinusoidal model parameters, autocorrelations, etc. The short-time spectral feature set for each frame is extended to include dynamic information (e.g. the first and second order derivatives) of the features. The most popular representation includes cepstral features along with their first and second time derivatives.
• Frequency-Warped Spectral Features: Sometimes non-uniform fre-
quency scales are used in spectral analysis to provide the so-called
Mel-frequency or bark-scale spectral feature set. The motivation is
to mimic the human auditory system which processes the spectral
information on a non-uniform frequency scale.
The details of this topic can be found in [De Mori, 1998], Chapters 2, 3 and 4, and in [Lee, 1995].
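The short-time analysis described above can be sketched as follows. This is a simplified illustration with arbitrary parameter choices (25 ms frames, 10 ms hop at 16 kHz, a crude equal-width filterbank in place of a true Mel filterbank), not the front end of any particular system:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def log_spectral_features(x, frame_len=400, hop=160, n_bands=13):
    """Rough short-time features: Hamming window, power spectrum,
    log energies in equal-width bands, plus first-order (delta) features."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # crude filterbank: average the power spectrum in n_bands equal bands
    bands = np.array_split(power, n_bands, axis=1)
    logе = np.log(np.stack([b.mean(axis=1) for b in bands], axis=1) + 1e-10)
    delta = np.gradient(logе, axis=0)  # dynamic (first-derivative) features
    return np.concatenate([logе, delta], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)   # one second of fake 16 kHz "speech"
feats = log_spectral_features(x)
print(feats.shape)               # (98, 26): ~100 frames x 26 features
```

A real front end would use a Mel-spaced filterbank and a discrete cosine transform to obtain the cepstral features mentioned above; the sketch only illustrates the framing, windowing and log-energy steps.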
2.1.2 Acoustic Model
As shown in Eq. 2.6, the recognizer has to be able to determine the value
P (X|W) of the probability that when the speaker uttered the word se-
quence W the acoustic processor produced the data X. Thus to compute
P (X|W) we need a statistical model of the speaker’s interaction with the
acoustic process. The usual acoustic model employed in speech recognizers,
the hidden Markov model (HMM), will be briefly discussed in the following.
An HMM is a composition of two stochastic processes, a hidden Markov
chain, which accounts for temporal variability, and an observable process,
which accounts for spectral variability. This combination has proved to be
powerful enough to cope with the most important sources of speech am-
biguity, and flexible enough to allow the realization of recognition systems
with dictionaries of tens of thousands of words.
Figure 2.3: An example of Hidden Markov Model.
Fig. 2.3 illustrates an example of an HMM with 4 states. Formally, we can summarize the definition of an HMM as follows. Let x ∈ X be a variable representing observations and s_i, s_j ∈ S be variables representing model states; the model can be represented by the following parameters:
A ≡ {a_ij | s_i, s_j ∈ S}   transition probabilities (2.8)
B ≡ {b_ij(·) | s_i, s_j ∈ S}   output probabilities (2.9)
π ≡ {π_i}   initial probabilities (2.10)

where:

a_ij ≡ p(s_t = s_j | s_{t−1} = s_i) (2.11)
b_ij(x) ≡ p(x_t = x | s_{t−1} = s_i, s_t = s_j) (2.12)
π_i ≡ p(s_0 = s_i) (2.13)
The details of HMM and its use in speech recognition can be found in
[De Mori, 1998], Chapter 5.
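The probability p(X | model) assigned by such an HMM can be computed with the classical forward algorithm. The sketch below uses discrete observations and places the output distributions on transitions, matching the b_ij(·) notation of Eqs. 2.8-2.13; the model and all probabilities are invented toy values:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm for an HMM with output probabilities on
    transitions: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_ij(x_t).

    pi  : (S,) initial state probabilities pi_i
    A   : (S, S) transition probabilities a_ij
    B   : (S, S, K) output probabilities b_ij(x) for K discrete symbols
    obs : sequence of symbol indices x_1 ... x_T
    Returns p(obs | model) = sum_j alpha_T(j).
    """
    alpha = pi.copy()
    for x in obs:
        # one step of the recursion: sum over predecessor states i
        alpha = alpha @ (A * B[:, :, x])
    return alpha.sum()

# A toy 2-state, 2-symbol left-to-right model (numbers invented).
pi = np.array([1.0, 0.0])
A = np.array([[0.6, 0.4],
              [0.0, 1.0]])
B = np.array([[[0.9, 0.1], [0.2, 0.8]],   # b_0j(x)
              [[0.5, 0.5], [0.3, 0.7]]])  # b_1j(x)
p = forward(pi, A, B, [0, 1])
print(p)  # probability of observing the symbol sequence (0, 1)
```

In practice the recursion is carried out in log-space (or with per-frame scaling) to avoid numerical underflow over long observation sequences.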
For a large-vocabulary system, there is typically a set of basic recognition units that are smaller than whole words, named subword units. Examples of such subword units are phonemes, demisyllables, and syllables. The word models are then obtained by concatenating the subword models according to the phonetic transcription of the words in a pronunciation lexicon or dictionary. In most systems, these subword units are modeled by HMMs. Fig. 2.4 illustrates the construction of a word model by linking phoneme models.
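The linking of phoneme models can be sketched as a block-diagonal assembly of the phonemes' transition matrices, with an exit-to-entry link between consecutive phonemes. This is a simplified sketch with invented probabilities; real systems also handle non-emitting entry/exit states and pronunciation variants:

```python
import numpy as np

def concat_phone_models(phone_mats, exit_probs):
    """Build a word-level transition matrix by chaining phoneme HMMs.

    phone_mats : list of (n_k, n_k) left-to-right transition matrices
    exit_probs : probability of leaving the last state of each phoneme,
                 routed to the first state of the next one
    Simplified sketch with invented numbers.
    """
    n = sum(m.shape[0] for m in phone_mats)
    W = np.zeros((n, n))
    offset = 0
    for k, m in enumerate(phone_mats):
        s = m.shape[0]
        W[offset : offset + s, offset : offset + s] = m
        if k + 1 < len(phone_mats):
            # link last state of phoneme k to first state of phoneme k+1
            W[offset + s - 1, offset + s] = exit_probs[k]
        offset += s
    return W

# "one" = /w/ /ah/ /n/, each phoneme a 2-state left-to-right model;
# the 0.2 probability mass missing from the last row is the exit prob.
loop = np.array([[0.7, 0.3],
                 [0.0, 0.8]])
word = concat_phone_models([loop] * 3, [0.2, 0.2, 0.2])
print(word.shape)  # (6, 6) word-level transition matrix
```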
Usually the choice of HMM topologies, as well as the type of probability
distribution, is decided by the developer. The values of the distribution
parameters, as well as transition probabilities, are estimated by a training
algorithm, which processes a set of labeled examples, the training set, for
computing an optimal set of values for HMM parameters. Optimality is
defined by means of an objective function depending both on the HMM
parameters and on the observations contained in the training set. Once the
objective function has been chosen, model training becomes a constrained
maximization problem. Training algorithms can differ in the optimality
criterion and/or in the method of performing the optimization. In general,
parameters of statistical models are estimated by iterative learning algo-
rithms (e.g. the EM algorithm) in which the likelihood of a set of training
Figure 2.4: The construction of a word model by concatenating phoneme models.
data is guaranteed to increase at each step. Details of these algorithms are
given in [De Mori, 1998], Chapter 6.
2.1.3 Language Model
Eq. 2.6 further requires that we compute, for every word sequence
W, the probability P(W) that the speaker wishes to utter W. P(W)
is interpreted as the language model. The language model functionally
captures the syntax, semantics, and pragmatics of a language and provides the
prior probability P(W) for a word sequence W. The probability P(W)
can be expressed by:
P(W) = ∏_{i=1}^{N} P(wi|w1, ..., wi−1)   (2.14)
     = ∏_{i=1}^{N} P(wi|hi)   (2.15)
     ≈ ∏_{i=1}^{N} P(wi|wi−n+1, ..., wi−1)   (2.16)
where hi = w1, ..., wi−1 is the history or context of word wi. The probabilities
P(wi|hi) may be difficult to estimate as the sequence of words hi grows.
In order to estimate these probabilities, it is usually assumed that a word
sequence follows an (n−1)-th order Markov process, as in Eq. 2.16. The
corresponding language models are called n-gram language models. Today,
bigram (n = 2) and trigram (n = 3) language models are used in most
ASR systems. In the following we will briefly introduce trigram language
models and methods for estimating their probabilities.
From Eq. 2.16 and setting n = 3, we have

P(W) = ∏_{i=1}^{N} P(wi|wi−2, wi−1)   (2.17)
The basic trigram probabilities can be estimated by:

P(w3|w1, w2) ≈ f(w3|w1, w2) = C(w1, w2, w3) / C(w1, w2)   (2.18)
where f(·|·) denotes the relative frequency function and C(hi) denotes the
number of times the event hi appeared in the training corpus. Unfor-
tunately, even in large real training texts, most of the possible trigrams do
not occur. Hence, for each W including any such unobserved event, the
model would assign P(W) = 0. In this case, the recognizer would be forced
to commit a large number of errors. It is therefore necessary to smooth
the trigram frequencies. There are two main methods for smoothing the
trigram probabilities.
The first one, namely linear smoothing, is done by interpolating
trigram, bigram and unigram relative frequencies as in Eq. 2.19, where the
non-negative weights satisfy the constraint λ1 + λ2 + λ3 = 1. Different
ways of choosing the weights λi lead to different interpolation schemes.

P(w3|w1, w2) = λ3 f(w3|w1, w2) + λ2 f(w3|w2) + λ1 f(w3)   (2.19)
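As a sketch of Eq. 2.19, the following code collects n-gram counts from a toy corpus and interpolates the relative frequencies with fixed illustrative weights; a real system would estimate the λi, e.g. on held-out data.

```python
from collections import Counter

corpus = "a b a b c a b a c a".split()
c1, c2, c3 = Counter(), Counter(), Counter()
for i in range(len(corpus)):
    c1[tuple(corpus[i:i+1])] += 1                 # unigram counts
    if i >= 1:
        c2[tuple(corpus[i-1:i+1])] += 1           # bigram counts
    if i >= 2:
        c3[tuple(corpus[i-2:i+1])] += 1           # trigram counts

def f(counts, hist_counts, ngram):
    """Relative frequency f(w|h) = C(h, w) / C(h); zero if h is unseen."""
    hist = ngram[:-1]
    return counts[ngram] / hist_counts[hist] if hist_counts[hist] else 0.0

def p_interp(w1, w2, w3, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas  # l1 + l2 + l3 = 1
    return (l3 * f(c3, c2, (w1, w2, w3))
            + l2 * f(c2, c1, (w2, w3))
            + l1 * c1[(w3,)] / len(corpus))
```

Even for an unseen trigram such as ("c", "c", "a") the interpolated probability is non-zero, because the bigram and unigram terms still contribute.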
The second method, namely backing-off, is quite prevalent in state-of-
the-art speech recognizers. It is defined through the formula:
P(w3|w1, w2) =
    f(w3|w1, w2)          if C(w1, w2, w3) ≥ K
    α QT(w3|w1, w2)       if 1 ≤ C(w1, w2, w3) < K
    β(w1, w2) P(w3|w2)    otherwise
(2.20)

where α, β are chosen so that the probability P(w3|w1, w2) is properly
normalized. Furthermore, QT(w3|w1, w2) is a Good-Turing type function
and P(w3|w2) is a bigram probability estimate having the same form as
P(w3|w1, w2):

P(w3|w2) =
    f(w3|w2)              if C(w2, w3) ≥ L
    α QT(w3|w2)           if 1 ≤ C(w2, w3) < L
    β(w2) f(w3)           otherwise
(2.21)
Eq. 2.20 then constitutes a recursion. The thresholds K and L are chosen
empirically. The argument for backing-off is that if there is enough evidence
for it, then the relative frequency is a very good estimate of the probability.
If not, then one should back off and rely on bigrams; if there is not enough
evidence even for the bigram, unigrams are needed. For details on language
models, see [De Mori, 1998], Chapter 7.
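The normalization role of β can be illustrated with a simplified sketch that replaces the Good-Turing function QT with a constant absolute discount and uses a single backing-off level (bigram to unigram, no edge-case handling); corpus, discount value and variable names are illustrative only.

```python
from collections import Counter

corpus = "a b a b c a b a c a".split()
V = sorted(set(corpus))
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
D = 0.5  # constant discount, 0 < D < 1, standing in for Good-Turing

def p_uni(w):
    return uni[w] / len(corpus)

def p_backoff(w3, w2):
    """P(w3|w2): discounted bigram if seen, else scaled-back unigram."""
    c_hist = sum(c for (u, v), c in bi.items() if u == w2)
    if bi[(w2, w3)] > 0:
        return (bi[(w2, w3)] - D) / c_hist
    # beta(w2): mass freed by discounting, renormalized over the
    # unigram probabilities of the unseen successors of w2
    # (no guard for the case where every successor was observed)
    n_seen = sum(1 for (u, v) in bi if u == w2)
    freed = D * n_seen / c_hist
    unseen_mass = sum(p_uni(v) for v in V if bi[(w2, v)] == 0)
    return freed * p_uni(w3) / unseen_mass
```

Because β redistributes exactly the discounted mass, the conditional distribution P(·|w2) sums to one over the vocabulary, which is the property the α and β factors in Eqs. 2.20-2.21 are chosen to guarantee.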
2.1.4 Search
The decision on which words have been spoken must be made by means of
an optimization procedure that combines information from several sources:
the language model, the acoustic-phonetic models of phonemes, and the
pronunciation lexicon.
Figure 2.5: The construction of a compound model for recognizing a sequence of numbers;
from [De Mori, 1998], Chapter 5.
For hypothesizing a word sequence w1, ..., wN, a compound HMM is
searched which includes the three knowledge sources mentioned. Fig 2.5
shows an example of such a compound model for recognizing digits. As we
can see, the construction of a compound model for recognition includes
three steps. First, the language is represented by a network with word-
labeled arcs. The connections between words are made by means of empty
transitions, which could be assigned a probability. Given the network
representing the language, each word-labeled arc is replaced by a sequence,
or possibly a network, of phoneme-labeled arcs, according to a set of lexical
rules. Finally, each phoneme-labeled arc is replaced by an instance of the
corresponding HMM, obtaining the final compound model as illustrated in
Fig 2.5.
With this kind of model, acoustic, lexical and linguistic knowledge
can be naturally represented in a graph structure, which
can become huge when dealing with large vocabularies. The search for
the most probable word sequence translates into the search of an optimal
path over a derived structure, called “trellis”, which corresponds to the
unfolding of this graph along the time axis.
There are two main search strategies used in speech recognition. The
first one is named Viterbi search and the second one is called stack decoding
(A* search). The first one is normally carried out in a time-synchronous
fashion by the so-called Viterbi algorithm, which relies on the principles of
dynamic programming. To avoid exhaustive exploration of a possibly huge
search space, the beam-search technique is used, which consists in pruning
the less promising hypotheses based on a local estimation.
Stack decoding represents the best attempt to use A* search instead
of time-synchronous search for continuous speech recognition. It is a
tree search algorithm, which takes a slightly different viewpoint than the
time-synchronous Viterbi search. Time-synchronous search is basically a
breadth-first search, so it is crucial to control the number of all possible
model states. On the other hand, stack decoding treats the search as
finding a path in a tree whose branches correspond to words in
the vocabulary V. The search tree has a constant branching factor of |V|,
if we allow every word to be followed by every word. In the following
subsection, we present the one-pass Viterbi-based search which is currently
used in most speech recognition systems. The details of the Viterbi-based
search algorithm and of stack decoding can be found in [Jelinek, 1998] and
[De Mori, 1998], Chapters 8 and 9.
One-Pass Viterbi Search
Let X = x1, ..., xT be the sequence of acoustic vectors and S = s1, ..., sT
be the sequence of states through the compound HMM. We can define the
joint probability as follows:
P(X, S|W) = ∏_{t=1}^{T} p(xt, st|st−1, W)   (2.22)
          = ∏_{t=1}^{T} p(st|st−1, W) p(xt|st)   (2.23)
where p(xt, st|st−1,W) denotes the transition and emission probabilities for
the compound HMM of W.
Denoting the language model probability by P (W), the Bayes decision
rule as in Eq. 2.6 results in the following optimization problem:
W = arg maxW { P(W) · Σ_{s1,...,sT} P(x1, ..., xT, s1, ..., sT | W) }   (2.24)
  ≈ arg maxW { P(W) · max_{s1,...,sT} P(x1, ..., xT, s1, ..., sT | W) }   (2.25)
Here we have made use of the so-called maximum approximation, which
is also referred to as Viterbi approximation. Instead of summing over all
paths, we consider only the most probable path. With the maximum
approximation, the search space can be described as a huge network through
which the best time alignment path has to be found. The search has to be
performed at two levels: at the state level (s1, ..., sT) and at the word level (W).
Viterbi Beam Search
A survey of the Viterbi search is given here. The details can be found
in [Ortmanns, 1997] and [De Mori, 1998], Chapters 9 and 10.
To explain the time-synchronous Viterbi search in a formal way, we
define some quantities:
Q(t, s; w) = score of the best path up to time t that ends in state s of
word w and
B(t, s; w) = start time of the best path up to time t that ends in state
s of word w
There are two types of dynamic programming transition rules, namely
intra-word and inter-word transition. In the word interior, we have the
recurrent equation:
Q(t, s; w) = maxs′ { p(xt, s|s′; w) · Q(t−1, s′; w) }   (2.26)
B(t, s; w) = B(t−1, smax(t, s; w); w)   (2.27)
where smax(t, s; w) is the optimum predecessor state for the hypothesis
(t, s; w). When encountering a potential word boundary, we must perform
the recombination over the predecessor words. For doing this, let us define:
H(w; t) = maxv{p(w|v).Q(t, Sv; v)} (2.28)
where p(w|v) is the conditional language model probability of word bigram
(v, w). The symbol Sv denotes the terminal state of word v. The algorithm
can be summarized as follows, [Ortmanns, 1997]. It is also important
to notice that the above one-pass Viterbi search uses a linear lexicon with a
bigram language model. A very similar search algorithm which uses a tree
lexicon was also presented in [Ortmanns, 1997]. With this approach, the
pronunciation lexicon is organized in the form of a prefix tree, in which each
arc represents a phoneme model. The lexical tree organization provides a
more compact space for the search algorithm.
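A minimal sketch of such a tree-organized lexicon follows: words sharing a phoneme prefix share the corresponding arcs, so the shared models are evaluated only once. The words and ARPAbet-style pronunciations are illustrative, not taken from the ITC-irst lexicon.

```python
LEXICON = {
    "two": ["t", "uw"],
    "ten": ["t", "eh", "n"],
    "tea": ["t", "iy"],
}

def build_prefix_tree(lexicon):
    root = {"children": {}, "word": None}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node["children"].setdefault(
                ph, {"children": {}, "word": None})
        node["word"] = word  # the word identity is only known at the leaf
    return root

def count_arcs(node):
    """Number of phoneme arcs in the tree."""
    return sum(1 + count_arcs(child) for child in node["children"].values())
```

These three words need 7 phoneme arcs in a linear lexicon but only 5 in the tree, since the initial "t" arc is shared; the fact that the word identity becomes known only at a leaf is what complicates word-level bookkeeping in tree search.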
Beam Search
Since, for a fixed time frame, all (word, state)-hypotheses cover the same
portion of the input, their scores can be directly compared. This enables
the system to avoid an exhaustive search, and to perform a data-driven
search instead, i.e., to focus the search on those hypotheses that are most
likely to result in the best state sequence. In detail, at every 10-ms frame,
the score of the best hypothesis is determined; then all hypotheses whose
scores fall below this optimum by more than a fixed factor are pruned, i.e.
One-Pass Viterbi Search
1  for t = 1 to T
2  do
   Acoustic level: process (word, state)-hypotheses
3    Initialization: Q(t−1, s = 0; w) = H(w; t−1)
                     B(t−1, s = 0; w) = t−1
4    Time alignment: compute Q(t, s; w) using Eq. 2.26
5    Propagate back-pointers B(t, s; w) using Eq. 2.27
6    Prune unlikely hypotheses
7    Purge bookkeeping lists
   Word pair level: process word-end hypotheses
8    for each word w
9    do
10     H(w; t) = maxv { p(w|v) · Q(t, Sv; v) }
11     ν0(w; t) = arg maxv { p(w|v) · Q(t, Sv; v) }
12     Store the best predecessor ν0 = ν0(w; t)
13     Store the best boundary τ0 = B(t, Sν0; ν0)
they are removed from further consideration. This beam search strategy
will be considered in full detail in Chapter 5, in the experiments on word
graph generation.
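The pruning step itself can be sketched in a few lines, assuming the active hypotheses are held in a map from (word, state) pairs to log scores; the names and values below are illustrative.

```python
def beam_prune(hyps, beam):
    """Keep only hypotheses within `beam` log units of the frame's best.

    hyps: dict mapping (word, state) -> log score at the current frame.
    """
    best = max(hyps.values())
    return {h: s for h, s in hyps.items() if s >= best - beam}
```

For example, with hypotheses scored -10.0, -11.5 and -25.0 and a beam of 5.0, the third is dropped; a wider beam trades more computation for a lower risk of pruning away the eventually best path.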
2.2 Multiple-Pass Search
A speech recognition system should take into account all available knowl-
edge sources when recognizing an utterance. Besides the speech signal
and the models of the recognition units, knowledge about syntax, se-
mantics, and other properties of the natural language might also be used when
searching for the most likely word sequence. One way to include these
knowledge sources in the search process is to use them simultaneously to
Figure 2.6: The multiple-pass search framework.
constrain a single search. Since many of the natural language knowledge
sources contain "long-distance" effects, the search can become quite com-
plex. Furthermore, the common left-to-right search strategy requires that
all knowledge sources be formulated in a predictive, left-to-right man-
ner, which restricts the type of knowledge that can be used.
One way to solve these problems is to apply the knowledge sources not
simultaneously but sequentially, so that the search for the most likely hypothesis is
constrained progressively. Thus the advantages provided by a knowledge
source can be traded off against the costs of applying it. First, the most
powerful and cheapest knowledge sources are applied to generate a list of
the top N hypotheses or word graphs (word lattices). Then, these hypothe-
ses are evaluated by means of the other, more expensive knowledge sources
so that the list of hypotheses can be reordered according to a more refined
likelihood score. The two-pass search paradigm is illustrated in Fig 2.6.
In this section we will review N -best and word graph algorithms and their
applications in speech recognition.
Besides the two-pass search paradigm, there are also other uses for these
algorithms.
• The N-best list and word graph generated during the recognition stage can
be used to investigate new knowledge sources. Since it is not necessary
to rerun the whole recognition process, experimental evaluation of
the additional information provided by a new knowledge source can
be done much more easily.
• Methods for discriminative training of HMMs usually require a list of
errors and near-misses so that the correct answer can be made more
likely and the errors and near-misses can be made less likely. Such a list
can be provided either by N-best list or by word graph algorithms.
• In a speech recognition system, some parameters, like the weights of the
different knowledge sources, cannot be easily estimated. For the fine-
tuning of these parameters, repeated recognition tests are normally
required. Using the N-best lists or word graphs generated during
a single run of the recognizer, the parameter optimization can then
be done much more easily. This topic will be explored in full detail in
the last chapter, when we apply N-best lists and word graphs for
parameter tuning in both speech recognition and machine translation.
2.2.1 N-Best Algorithms
Different algorithms for finding the N -best sentence hypotheses have been
proposed in [Schwartz, 1991]. Some of these algorithms are exact while
others use different approximations to reduce computational requirements.
Basically, the Viterbi algorithm typically used in an HMM-based speech
recognizer only finds the best word sequence (corresponding to the state
sequence with the highest likelihood score). To obtain not only the first
best hypothesis but also the list of the N best hypotheses, several modifications
of the Viterbi algorithm are necessary. Different algorithms that are able
to find the N-best list of hypotheses are presented in the following.
The Exact N-Best Algorithm
The exact N -best algorithm was proposed in [Schwartz, 1990]. This algo-
rithm is similar to the time-synchronous Viterbi algorithm, but instead of
likelihood scores for state sequences, likelihood scores for word sequences
are computed. To be able to find the N -best hypotheses, it is necessary
to keep separate records for theories (paths) with different word sequence
histories. When two or more paths come to the same state at the same
time and also have the same history (word sequence), their probabilities
are added. When all paths for a state have been calculated, only a spec-
ified number of these local theories are maintained. Their probabilities
have to be within a threshold of the probability of the most likely theory
at that state. Therefore, any word sequence hypothesis that reaches the
end of the utterance has an accumulated score. This score is the condi-
tional probability of the observed speech signal given the word sequence
hypothesis. Thus, the list of N-best hypotheses is generated. To reduce
the exponentially growing number of possible word sequences, pruning is
used. It can be shown that this algorithm will find all hypotheses that are
within a search beam specified by the pruning threshold. To reduce the
computational requirements connected with the exact N -best algorithm,
it is possible to combine the N -best algorithm with the forward-backward
search algorithm which will be described in detail in the next chapter. Ba-
sically, the forward-backward search algorithm takes place in two stages.
In the first stage, a fast time-synchronous search of the utterance in for-
ward direction is performed. In the second stage, a more expensive search
is performed, processing the utterance in reverse direction and using infor-
mation gathered by the forward search. The information from the forward
search is used to avoid expanding the backward tree toward non-promising
hypotheses thus saving computational costs.
The Tree-Trellis Algorithm
The tree-trellis algorithm for finding the N -best hypotheses was proposed
in [Soong, 1991]. This algorithm combines a frame-synchronous forward
trellis search with a frame-asynchronous backward tree search. In the for-
ward trellis search, a modified Viterbi algorithm is used. In a normal
Viterbi algorithm, only the back-pointer arrays necessary to trace-back the
best hypothesis would be stored. The modified algorithm used here also
stores rank-ordered predecessor lists for each grammar node and time frame.
For a given grammar node and time frame, such a list has an entry for
each predecessor of that grammar node. This entry contains the likelihood
score of the best partial path coming via that predecessor to the grammar
node. Before being stored, the entries in a predecessor list are rank-ordered
according to their likelihood scores. When the modified Viterbi search has
reached the end of the utterance, the best hypothesis can easily be obtained
by tracing-back. In the backward search, an A* tree search algorithm is
used to find the N -best hypotheses. This tree search starts from the end of
the utterance at the final grammar node. In each step, the backward
partial path is extended toward the beginning of the utterance by a time-reversed
Viterbi search for the single best word extension. The best single word
extension is found using the rank-ordered predecessor lists generated during the
forward search. When the backward partial path reaches the beginning of
the utterance, the best hypothesis is found (it is identical to the hypothesis
already found in the forward Viterbi search). By continuing the A* search,
the N -best hypotheses can be found sequentially. A good summary of the
theory behind the tree-trellis algorithm can be found in [Soong, 1991]. In
[Soong, 1991], a modified version of the tree-trellis algorithm is presented
where a simple grammar is used in the forward Viterbi search and a more
complex grammar is used in the backward tree search. This concept is
similar to the forward-backward algorithm mentioned in the previous section
and can result in reduced computational requirements.
Lattice N-Best Algorithms
The exact N-best algorithm and the tree-trellis algorithm both require a sig-
nificant computational overhead with respect to a normal Viterbi
search. For this reason, faster N-best algorithms adopting some approx-
imations have been suggested. These algorithms do not guarantee that the
exact list of N-best hypotheses will be found. It can either happen that the
likelihood score of an entry is underestimated or that an entry is missing
entirely. But the approximations might still be sufficient for many applica-
tions. Two N-best algorithms using different approximations are described
now.
The lattice N-best algorithm was proposed in [Schwartz, 1991]. It is
based on a standard time-synchronous forward Viterbi search but differs
in the back-pointer information stored during the search. At each
grammar node for each time frame, not only the best scoring word but all
words that arrive at that node are stored in a trace-back list, together with
their scores and the time when the word started. Instead of storing all
the arriving words, it is also possible to store only the best N local words
(or word theories). As in Viterbi search, only the score of the best word
is passed on as a base for further scoring together with a pointer to the
stored trace-back list. At the end of the utterance, a simple tree search
is used to step through the stored trace-back lists and obtain the N -best
complete sentence hypotheses sequentially. This tree search requires nearly
no computation and can be performed very fast. A serious disadvantage
of the lattice algorithm is that it underestimates or completely misses high
scoring hypotheses due to the fact that all (except the best) hypotheses
are derived from segmentations found for other higher scoring hypotheses.
This is caused by the segmentation assumption inherent in the lattice algorithm.
This problem can be mostly overcome by the word-dependent algorithm.
The Word-Dependent Algorithm
Like the lattice N -best algorithm, the word-dependent algorithm was also
proposed in [Schwartz, 1991]. It is a compromise between the exact N -best
algorithm and the lattice algorithm. Here it is assumed that the starting
time of a word depends on the previous word but not on any
word before that. Therefore, theories are distinguished if they differ in the
previous word. With this algorithm, within a word, the likelihood scores
for each of the different local theories are preserved. At the end of each
word, the likelihood score for each previous word is recorded along with
the name of the previous word. Then a single theory with the name of the
word that just ended is used to proceed. At the end of the utterance, a tree
search similar to the one used in the lattice algorithm, is used to obtain
the list of the N most likely hypotheses. To reduce the computational
requirements, the number Nlocal of theories kept locally should be limited.
Typically, values for Nlocal range from 3 to 6.
Word Graph Based N-Best Search
The last, and quite different, approach for finding the N-best hypotheses was
proposed in [Tran, 1996] and is based on the word graph. The details of
word graphs will be given in the next section. Here we will briefly review
the proposed algorithm.
The principle of the approach is based on the following considerations:
When several paths lead to the same node in the word graph, according to
the Viterbi criterion, only the best scored path is expanded. The remaining
paths are not considered for further expansions. Assuming that the first
best sentence hypothesis was found by the Viterbi decoding through a given
word graph, the second best path is the path which competed with the best
one but was recombined at some node of the best path. Thus in order to
find the second best sentence hypothesis, we have to consider all possible
partial paths in the word graph which reach some node of the best path
and might share the remaining section with the best path. By applying
this procedure repeatedly, N -best sentence hypotheses can be successively
extracted from a given word graph.
In more detail, the best path can be determined simply by comparing
the cumulative scores of all possible paths leading to the final node of
the word graph. In order to ensure that this best word sequence is not
taken into account while searching for the second best path, the complete
path is copied into a so-called N -best tree. Using this structure, a back-
ward cumulative score for each word copy is computed and stored at the
corresponding tree node. This allows for fast and efficient computation
of the complete path scores required to determine the next best sentence
hypothesis. The second best sentence hypothesis can be found by taking
the path with the best score among the candidate paths which might share
a remaining section of the best path. The partial path of this sentence hy-
pothesis is then copied into the N -best tree. Assuming the N -best paths
have been found, the (N + 1)-th best path can be determined by exam-
ining all existing nodes in the N -best tree, because it can share the last
part of some path among the top N paths. Thus it is important to point
out that this algorithm performs a full search within the word graph and
delivers exact results as defined by the word graph structure. The detailed
implementation of this algorithm will be presented in Chapter 4.
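The idea of combining backward cumulative scores with best-first expansion can be sketched on a toy word graph; this is an illustrative simplification, not the N-best tree implementation of [Tran, 1996]. Because the heuristic (the backward best score) is exact, complete hypotheses leave the priority queue in exact score order.

```python
import heapq

ARCS = {  # node -> list of (next_node, word, log_score); toy example
    0: [(1, "move", -1.0), (1, "remove", -1.5)],
    1: [(2, "the", -0.2), (2, "a", -0.9)],
    2: [],
}
FINAL = 2

def backward_scores():
    """Best score from each node to the final node (node ids here are
    already in topological order, so reverse numeric order works)."""
    best = {FINAL: 0.0}
    for node in sorted(ARCS, reverse=True):
        if node != FINAL:
            best[node] = max(sc + best[nxt] for nxt, _, sc in ARCS[node])
    return best

def nbest(n):
    h = backward_scores()
    # heap entries: (-(partial score + backward estimate), partial score,
    #                node, words so far)
    heap = [(-h[0], 0.0, 0, ())]
    out = []
    while heap and len(out) < n:
        _, score, node, words = heapq.heappop(heap)
        if node == FINAL:
            out.append((score, " ".join(words)))
            continue
        for nxt, word, sc in ARCS[node]:
            heapq.heappush(heap, (-(score + sc + h[nxt]),
                                  score + sc, nxt, words + (word,)))
    return out
```

Here `nbest(3)` yields "move the", "remove the" and "move a", in that order of total score; each heap entry is a distinct partial path, so every complete path is enumerated at most once.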
2.2.2 Word Graph Algorithm
The main obstacle of the N-best method is that the number of sentences needed
to include the correct hypothesis grows exponentially with the length of the
utterance. In order to find a way to compactly represent the alternative
hypotheses, word graphs, [Ney, 1993, 1994; Ortmanns, 1997; Neukirchen,
2001] and word lattices [Odell, 1995] were introduced. In a lattice or word
lattice, words are represented by weighted arcs where weights correspond to
acoustic scores (usually log probabilities) while in a word graph, words are
represented by nodes and node weights correspond to acoustic likelihood
scores. However, in principle they are similar, so from now on we refer to
both as word graphs (WGs for short). The main idea of a word
graph is to represent word alternatives in regions of the speech signal where
the ambiguity in the acoustic recognition is high. The advantage is that
the acoustic recognition is decoupled from the application of the language
model, in particular a long-span language model, which can be applied
in a subsequent post-processing step. The number of word alternatives
should be adapted to the level of ambiguity in the acoustic recognition.
In the following, we present two algorithms: the first one is for lattice
generation which was proposed by [Odell, 1995]; the second one is for word
graph generation, proposed by [Ney, 1993; Ney, 1994; Ortmanns, 1997;
Neukirchen, 2001].
Lattice generation
According to [Odell, 1995], only a few simple modifications are needed to
extend one-pass time-synchronous decoding to generate a lattice of
hypotheses. Rather than discarding all but the most likely partial path
when these merge at word ends, it is possible to retain information about
them to allow lattice traceback.
When only the most likely hypothesis is required, the language model
likelihoods are added to the word ending state and the traceback infor-
mation updated. The states from equivalent partial paths are recombined
and only the most likely survives to propagate into the following network.
When a lattice of hypotheses is needed, the less likely word ending states
are not discarded but are linked to the most likely one and the combined
structure propagates into the following network. The calculation in the
remainder of the network is only performed on the most likely word ending
state but when traceback occurs at the end of the sentence, all of the word
ending states are used to construct a lattice of hypotheses. At the end of
each utterance, traceback proceeds separately through each of the linked
word ending states and a lattice of alternative hypotheses is constructed.
Each node in the generated lattice has an associated time. Each arc
has an associated word identity, the acoustic likelihood and the language
model likelihood, and forms a link between two nodes which define the
start and end times of the word hypothesis.
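This structure can be sketched directly; the field names and numeric values below are illustrative, not from any particular toolkit.

```python
from dataclasses import dataclass

@dataclass
class Node:
    time: float  # seconds from utterance start

@dataclass
class Arc:
    word: str
    start: Node          # start node (defines the word's start time)
    end: Node            # end node (defines the word's end time)
    ac_score: float      # acoustic log-likelihood
    lm_score: float      # language model log-likelihood

    def duration(self):
        return self.end.time - self.start.time

    def total(self, lm_weight=1.0):
        # rescoring recombines the two scores, typically with a weight
        return self.ac_score + lm_weight * self.lm_score
```

Keeping the acoustic and language model likelihoods separate on each arc is what allows a second pass to rescore the lattice with a different language model or a different weighting, without redecoding the audio.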
Word Graph Generation
Two algorithms [Ney, 1993], [Ney, 1994] have been proposed to build word
graphs without a backward phase. Both are based on time-synchronous
beam search decoding using a tree-organized lexicon. The search space is
built dynamically, instantiating a new tree each time a leaf, corresponding
to a word end, is reached. Basically, each word ending at a given time
corresponds to a word hypothesis, which is kept and then used to build a
word graph. The word graph is defined as a directed acyclic graph, where
each arc is labeled with a word and a score, and each node corresponds to
a time frame.
In [Ney, 1993], a first algorithm is proposed, in which a word hypotheses
generator (WHG) finds, with a beam search, word hypotheses consisting of
a word identifier, an acoustic score, start and end times. Hypotheses can
be arranged in a large word graph that must be pruned and optimized by
the subsequent module, called word graph optimizer (WGO). A reduction
in the number of arcs can be obtained by the WHG if word hypotheses are
allowed to end only every other or every third frame. A possible WGO works as follows:
first the word graph is unfolded from the start node into a tree structure;
then, for each set of partial paths with identical start time, end time and
word sequence, only the most probable path is kept; finally, edges having
a score below a certain threshold, with respect to the best complete score,
are removed. Other actions that may be performed by a WGO concern
the merging of nodes with identical time, or merging of subgraphs having
identical time boundaries and word labels.
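The edge-pruning action can be sketched with forward and backward best scores over a toy graph: an edge survives if the best complete path through it is within a threshold of the global best. Graph, scores and threshold are invented for the example.

```python
EDGES = [  # (from_node, to_node, word, log_score); node ids topological
    (0, 1, "move", -1.0), (0, 1, "remove", -4.0),
    (1, 2, "the", -0.2), (1, 2, "a", -0.9),
]
START, END = 0, 2

def prune(edges, threshold):
    nodes = {n for e in edges for n in (e[0], e[1])}
    fwd = {n: float("-inf") for n in nodes}; fwd[START] = 0.0
    bwd = {n: float("-inf") for n in nodes}; bwd[END] = 0.0
    for a, b, _, sc in sorted(edges):               # increasing from-node
        fwd[b] = max(fwd[b], fwd[a] + sc)           # best score into b
    for a, b, _, sc in sorted(edges, reverse=True): # decreasing from-node
        bwd[a] = max(bwd[a], sc + bwd[b])           # best score out of a
    best = fwd[END]
    # best complete path through an edge = fwd(tail) + edge + bwd(head)
    return [e for e in edges if fwd[e[0]] + e[3] + bwd[e[1]] >= best - threshold]
```

With a threshold of 2.0 log units, the "remove" edge (whose best complete path scores -4.2 against a global best of -1.2) is removed while the other three edges survive.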
A different word graph builder is presented in [Ney 1994], [Ortmanns,
1997] which is integrated into a one-pass search algorithm. It exploits the
word-pair approximation, assuming that the boundary between two words
is independent of previous history. Using this assumption in conjunction
with an m-gram language model, it is possible to recombine, at time t, all
the word sequences having identical last m−1 words. The algorithm is based
on a dynamic programming recursion that finds, at a time t, the optimal
word boundary between words wi and wj at ending time t, say τ(t; wi, wj),
which, under the word-pair assumption, is independent of previous words.
The cumulative score for word wj from the optimal word boundary to t,
say h(wj, τ(t; wi, wj), t) is also computed. The algorithm is summarized as
follows. The details can be found in [Ortmanns, 1997]. Let:
• τ(t; v, w) = B(t, Sw; w): word boundary between the predecessor word
v and the current word w.
• h(w; τ, t) = P(xτ+1, ..., xt | w): probability that word w produces the
acoustic vectors xτ+1, ..., xt.
• H(w; t) = maxv { p(w|v) · Q(t, Sv; v) }: (joint) probability of generating
the acoustic vectors x1, ..., xt and a word sequence w1, ..., wn with ending
word w and ending time t.
The definitions of B(·), H(·), Q(·) are given in Section 2.1.4. For each
predecessor word v, along with word boundary τ = τ(t; v, w) the word
scores are recovered using the equation:
h(w; τ, t) = Q(t, Sw; w) / H(v; τ)   (2.29)
Given the above defined quantities, WGs can be built by the following
bigram, one-pass Viterbi search based algorithm [Neukirchen, 2001]:

Word Graph Constructing Algorithm
1  for t = 1 to T
2  do for each triple (v, w; t) ending at t
3     do keep track of:
4        - the (unique) word boundary τ(t; v, w)
5        - the acoustic score h(w; τ, t)
At the utterance end, the word graph is constructed
by tracing back through the bookkeeping list.

This algorithm takes into account the language model during the generation of the word
graph. One thing that makes it different from the previous
one is the node creation process: when creating a new node, it takes into
account the m-gram constraint (or the m-histories of the current word).
The advantages of this method are:
• better modeling of word boundaries due to an extended word m-tuple
boundary optimization.
• improved pruning of word graphs by exploiting the more detailed
knowledge sources.
• smaller costs for graph expansion since higher order context con-
straints are encoded in the word graph structure.
We will discuss in detail the implementation of this algorithm in the next
chapter.
2.2.3 Consensus Decoding
With word graph decoding, we face a computational problem:
in general, the number of paths through a word graph is exponential in
the number of links. These paths correspond to different segmentation
hypotheses (i.e. word sequences plus boundary times) of utterances. A
method to overcome this problem has been suggested in [Mangu, 1999].
The word graph is first transformed into a special form in which the calcu-
lation of the expected word error rate becomes trivial. This special form
of a word graph is called a confusion network. A confusion network itself
is a word graph. Each edge is labeled with a word and a probability. The
most important feature of these word graphs is that they are linear, in the
sense that every path from the start to the end node has to pass through
all nodes. A consequence of this (combined with the acyclicity) is that all
paths between two nodes have the same length. Thus the confusion net-
work naturally defines an alignment for all pairs of paths (called a multiple
alignment by Mangu). This alignment is used as the basis for the word
error rate calculation.
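Since every path passes through every slot of the network in order, picking the highest-posterior word in each slot minimizes the expected word error under this alignment. This can be sketched with a toy example (the network below, its words and posteriors, are all made up for illustration; a slot is simply a dict from word to posterior, with "" standing for the empty word):

```python
def consensus_decode(confusion_network):
    """confusion_network: list of slots; each slot maps a word
    (or "" for the empty word) to its posterior probability."""
    best = []
    for slot in confusion_network:
        word = max(slot, key=slot.get)  # highest-posterior alternative
        if word:                        # skip empty-word hypotheses
            best.append(word)
    return best

# Toy confusion network with three slots; probabilities are made up.
cn = [{"i": 0.9, "a": 0.1},
      {"have": 0.4, "had": 0.35, "": 0.25},
      {"been": 1.0}]
print(consensus_decode(cn))  # prints: ['i', 'have', 'been']
```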
An approach to actually construct a confusion network from a word
graph is also presented in [Mangu, 1999]. The task is treated as a clustering
problem, where the edges from the original word graph have to be clustered
into groups according to criteria such as the overlap in time between
edges and the phonetic similarity between words. The confusion network
construction algorithm will be described in Chapter 4.
2.3 Statistical Machine Translation
The goal is the translation of a text given in some source language into
a target language. Precisely, we are given a source string f = f_1...f_j...f_m,
which is to be translated into a target string e = e_1...e_i...e_l. The key idea
here is, among all possible target strings, to choose the string with the
highest probability given by the Bayes’ decision rule [Brown, 1993]:
e* = arg max_e {P(e|f)}   (2.30)
   = arg max_e {P(e) · P(f|e)}   (2.31)
Here, P (e) is the language model of the target language, P (f|e) is the string
translation model while arg max denotes the search problem, i.e., the gen-
eration of the output sentence in the target language. If we look back to
the Eq. 2.6, we will find a strong similarity between the problem of statis-
tical speech recognition and machine translation. The details of parameter
estimation and models for decoding can be found in [Brown, 1993]. In
the following we consider an alternative way of looking at statistical machine
translation, namely log-linear models [Och, 2002].
2.3.1 Log-linear Model
As originally proposed by [Brown, 1993], the most likely translation of a
foreign source sentence f into English is obtained by searching for the
sentence with the highest posterior probability:
e* = arg max_e Pr(e|f)   (2.32)

Usually, the hidden variable a is introduced:

e* = arg max_e Σ_a P(e, a|f)   (2.33)

which represents an alignment from source to target positions.
The framework of maximum entropy [Berger, 1996] provides a means
to directly estimate the posterior probability P (e, a|f). It is determined
through suitable real-valued feature functions h_i(e, f, a), i = 1...M, and
takes the parametric form:

p_λ(e, a|f) = exp{Σ_i λ_i h_i(e, f, a)} / Σ_{e,a} exp{Σ_i λ_i h_i(e, f, a)}   (2.34)
The maximum entropy solution corresponds to the values λ_i that maximize
the log-likelihood over a training sample T:

λ* = arg max_λ Σ_{(e,f,a)∈T} log p_λ(e, a|f)   (2.35)
Unfortunately, a closed-form solution of (2.35) does not exist. An iterative
procedure converging to the solution was proposed by [Darroch, 1972]; an
improved version is given in [Pietra, 1997].
If the following feature functions are chosen [Och, 2002]:

h_1(e, f, a) = log P(e)
h_2(e, f, a) = log P(f, a|e)

exploiting eq. (2.34), eq. (2.33) can be rewritten as:

e* = arg max_e P(e)^{λ_1} Σ_a Pr(f, a|e)^{λ_2}   (2.36)

where the λ_i's represent the scaling factors of the models.
In eq. (2.36), English strings e are ranked on the basis of the weighted
product of the language model probability P(e), usually computed through
an m-gram language model, and the marginal of the translation probability
P(f, a|e).
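As an illustration, the weighted-product ranking of eq. (2.36), with the sum over alignments replaced by its largest term, can be applied to a list of candidate translations. In the sketch below, the candidates, their log-probabilities and the weights are all made up:

```python
import math

def loglinear_score(log_lm, log_tm, lam1, lam2):
    # log of the weighted product P(e)^lam1 * P(f,a|e)^lam2
    return lam1 * log_lm + lam2 * log_tm

def best_translation(candidates, lam1=1.0, lam2=1.0):
    """candidates: list of (e, log P(e), log P(f,a|e)) triples,
    one per candidate translation e (best alignment a only).
    Returns the candidate maximizing the log-linear score."""
    return max(candidates,
               key=lambda c: loglinear_score(c[1], c[2], lam1, lam2))[0]

cands = [("hello world", math.log(0.2), math.log(0.1)),
         ("hi world",    math.log(0.1), math.log(0.3))]
print(best_translation(cands))            # the combination decides
print(best_translation(cands, lam2=0.0))  # the LM alone decides
```

Changing the scaling factors λ_i changes the ranking, which is why they are usually tuned on held-out data.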
In [Brown, 1993, Och, 2003] six translation models (Model 1 to 6) of in-
creasing complexity are introduced. These alignment models are usually es-
timated through the Expectation Maximization algorithm [Dempster, 1977],
or approximations of it, by exploiting a suitable parallel corpus of trans-
lation pairs. For computational reasons, the optimal translation of f is
computed with the approximated search criterion:
e* ≈ arg max_e P(e)^{λ_1} max_a P(f, a|e)^{λ_2}   (2.37)
In summary, given the string e = e_1, . . . , e_l, a string f and an alignment
a are generated as follows: (i) a non-negative integer φ_i, called fertility, is
generated for each word e_i and for the null word e_0; (ii) for each e_i, a list τ_i,
called tablet, of φ_i source words and a list π_i, called permutation, of φ_i source
positions are generated; (iii) finally, if the generated permutations cover
all the available source positions exactly once then the process succeeds,
otherwise it fails. Fertilities fix the number of source words to be aligned to
each target word, and the total length of the foreign string. Moreover, as
permutations of Model 4 are constrained to assign positions in ascending
order, it can be shown that if the process succeeds in generating a triple
(φ_0^l, τ_0^l, π_0^l), then there is exactly one corresponding pair (f, a), and vice
versa. This property justifies the following decomposition of Model 4:
• fertility model: p(φ|e)
• lexicon model: p(τ | φ, e)
• distortion model: p(π | φ, τ, e).
The detail of the decompositions is specified in [Brown, 1993].
2.3.2 Decoding
Given the source sentence f = f_1^m, the optimal translation e* is searched
through the approximate criterion (2.37). According to the dynamic
programming paradigm, the optimal solution can be computed through a
recursive formula which expands previously computed partial theories, and
recombines the new expanded theories. A theory can be described by its
state, which only includes information needed for its expansion; two par-
tial theories sharing the same state are identical (indistinguishable) for the
sake of expansion, i.e. they should be recombined.
Pruning
In order to reduce the huge number of theories to be generated, some methods
are used, which affect the optimality of the search algorithm:
• Comparison with the best theory: theories whose score is worse than
that of the best complete solution found so far are pruned, since theory
expansion always decreases the score.
• Beam search: at each expansion less promising theories are also pruned.
In particular, two types of pruning define the beam:
– threshold pruning: partial theories whose score falls below the
current optimum score by more than a given threshold are eliminated;
– histogram pruning: hypotheses not among the top N best scoring
ones are pruned.
These criteria are applied, first to all theories with a fixed coverage
set, then to all theories of fixed output length.
• Reordering constraint: a smaller number of theories is generated by
applying the so-called IBM constraint on each additionally covered
source position, i.e. by selecting only one of the first 4 empty positions,
from left to right.
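The two beam criteria can be sketched as follows (a minimal illustration with hypothetical theory log-scores; the threshold and beam size are assumed parameters, not values from the thesis):

```python
def beam_prune(theories, threshold, histogram_n):
    """theories: list of (theory, log_score) pairs.
    Threshold pruning keeps theories within `threshold` of the best
    log-score; histogram pruning then keeps at most `histogram_n`."""
    if not theories:
        return []
    best = max(score for _, score in theories)
    survivors = [(th, s) for th, s in theories if s >= best - threshold]
    survivors.sort(key=lambda ts: ts[1], reverse=True)
    return survivors[:histogram_n]

# Toy example: four partial theories with made-up log-scores.
theories = [("t1", -10.0), ("t2", -12.5), ("t3", -30.0), ("t4", -11.0)]
kept = beam_prune(theories, threshold=5.0, histogram_n=2)
# t3 is cut by the threshold, t2 by the histogram limit.
```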
2.3.3 Speech Translation
In the previous subsection, we have introduced the framework for statis-
tical machine translation which assumed that the input is written text.
Considering the problem of speech input rather than text input for trans-
lation, we can distinguish three levels, namely the acoustic vector x, the
source sentence f and the target sentence e.
x → f → e   (2.38)

From a strict point of view, the source sentence f is not of direct interest
for the speech translation task. Mathematically, this is captured by
introducing the source sentence hypotheses as hidden variables into Bayes'
decision rule, as in Eq. 2.6:
arg max_e P(e|x) = arg max_e {P(e) · P(x|e)}   (2.39)
              = arg max_e {P(e) · Σ_f P(f, x|e)}   (2.40)
              = arg max_e {P(e) · Σ_f P(f|e) · P(x|f, e)}   (2.41)
              = arg max_e {P(e) · Σ_f P(f|e) · P(x|f)}   (2.42)
              ≅ arg max_e {P(e) · max_f {P(f|e) · P(x|f)}}   (2.43)
In the above derivation, we have made only one reasonable assumption,

P(x|f, e) = P(x|f)   (2.44)

that is, the target string e does not help to predict the acoustic vectors
if the source string f is given. In addition, similarly to Eq. 2.25, here we
have also used the maximum approximation as in Eq. 2.43. The key issue
here is the question of how the requirement of having both a well-formed
source sentence f and a well-formed target sentence e at the same time is
satisfied. From the statistical point of view, this question is captured by
finding suitable models for the joint probability:
P(f, e) = P(e) · P(f|e)   (2.45)
2.3.4 Evaluation Criterion
A generally accepted criterion for evaluating automatic machine translation
does not yet exist. Therefore, the usual practice is to use a large variety of
different criteria. A good system is one that produces good-quality
translations according to most or all of these criteria. The following
criteria are widely used in the recent literature.
• SER (sentence error rate): The SER is computed as the fraction of
generated sentences that do not correspond exactly to any of the
reference translations.
• WER (word error rate): The WER is computed as the minimum
number of substitution, insertion and deletion operations that have
to be performed to convert the generated sentence into the target
sentence.
• PER (position-independent WER): A shortcoming of the WER is the
fact that it requires a perfect word order. The word order of an
acceptable sentence can be different from that of the target sentence,
so that the WER measure alone could be misleading. To overcome
this problem, the PER criterion is introduced as additional measure,
that compares the words in the two sentences ignoring the word order.
• mWER (multi-reference word error rate): For each test sentence, there
is not only a single reference translation, as for WER, but a set of ref-
erence translations. For each translation hypothesis, the edit distance
to the most similar sentence is calculated.
• mPER (multi-reference position independent WER): This criterion
ignores the word order by treating a sentence as a bag-of-words and
computing the minimum edit distance needed to transform
the hypothesis into the closest of the given reference translations.
• SSER (subjective sentence error rate): For a more detailed analysis,
subjective judgments by test persons are necessary. Each translated
sentence is judged by a human examiner according to some error scales
(i.e. from 1.0 to 5.0).
• IER (information item error rate): The test sentences are segmented
into information items. For each of them, if the intended information
is conveyed and there are no syntactic errors, the sentence is counted
as correct.
• BLEU score: This criterion computes the geometric mean of the precision
of n−grams of various lengths between a hypothesis and a set
of reference translations, multiplied by a factor that penalizes short
sentences.
• NIST score: This criterion computes a weighted precision of n−grams
between a hypothesis and a set of reference translations multiplied by
a factor that penalizes short sentences.
Both NIST and BLEU are accuracy measures, and thus larger values reflect
better translation quality.
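For instance, WER and PER can be sketched with a standard edit-distance and a bag-of-words computation (an illustrative implementation, not the evaluation code used in the thesis; the sentences are toy data):

```python
from collections import Counter

def wer_counts(hyp, ref):
    """Minimum substitutions + insertions + deletions needed to turn
    hyp into ref (Levenshtein distance over words)."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # delete hyp word h
                                   d[j - 1] + 1,    # insert ref word r
                                   prev + (h != r)) # substitute / match
    return d[len(ref)]

def per_counts(hyp, ref):
    """Position-independent error count: compare the two sentences as
    bags of words, ignoring word order."""
    h, r = Counter(hyp), Counter(ref)
    return max(sum((r - h).values()), sum((h - r).values()))

hyp = "the cat sat on mat".split()
ref = "on the mat the cat sat".split()
print(wer_counts(hyp, ref), per_counts(hyp, ref))  # prints: 5 1
```

The gap between the two counts shows how much of the WER is due to word-order differences alone.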
Chapter 3
Word Graph Generation
In this chapter we first introduce some terms and common operators related
to word graphs. These terms and operators are then used to describe
algorithms that we have implemented, including word graph generation,
word graph evaluation, word graph pruning and word graph expansion.
3.1 Word Graph Definitions
A word graph is a directed, acyclic, weighted, labeled graph with distinct
start and end nodes. It is a quadruple G = (V, E, I, F ) with the following
components:
• A nonempty set of vertices or nodes V = {v1, ..., vN}.
• A nonempty set of weighted, labeled, directed edges E = {e1, ..., eM}.
Each edge e is defined by e = (vi, vj, τ, t, w, s) where:
– vi, vj ∈ V are the starting node and the ending node of e.
– w ∈ L is a word label where L denotes a non-empty set of words,
L = {w1, ..., w|L|}.
– τ, t ∈ R are the starting time and the ending time of the word
hypothesis w.
– s = (ac, lm · F_lm) ∈ R × R is the weight of e, consisting of the
acoustic score ac and the language model score lm, multiplied
by the language model factor F_lm (see Eq. 2.7). Later we will also
attach a posterior score p to each edge; this score will be discussed
in the next chapters.
• I ∈ V : the start node. This node represents the start of the utterance.
Every node in the word graph is reachable from the start node I. By
default, I = v1.
• F ∈ V : the end node. This node represents the end of the utterance.
By default, F = vN .
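The quadruple G = (V, E, I, F) maps naturally onto a small data structure. A minimal Python sketch (class and field names are my own, not the thesis's implementation), including simple fan-In/fan-Out, Expand and Inpand style accessors:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    vi: int      # starting node
    vj: int      # ending node
    tau: int     # starting time of the word hypothesis
    t: int       # ending time
    w: str       # word label
    s: float     # weight (acoustic score + scaled LM score)

@dataclass
class WordGraph:
    n_nodes: int
    edges: list = field(default_factory=list)

    def add_edge(self, e):
        self.edges.append(e)

    def expand(self, v):   # edges outgoing from node v
        return [e for e in self.edges if e.vi == v]

    def inpand(self, v):   # edges incoming to node v
        return [e for e in self.edges if e.vj == v]

    def fan_out(self, v):
        return len(self.expand(v))

    def fan_in(self, v):
        return len(self.inpand(v))

g = WordGraph(3)
g.add_edge(Edge(0, 1, 0, 3, "ho", -7.1))
g.add_edge(Edge(1, 2, 3, 6, "visto", -6.5))
```

The list-scan accessors are quadratic in the worst case; a real decoder would keep per-node adjacency lists instead.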
3.1.1 Word Graph Accessors
Accessors for Word Graph nodes
Given a node v we define the following accessors:
• fan-In(v): return the number of edges incoming to node v.
• fan-Out(v): return the number of edges outgoing from node v.
• Expand(v): return the set of edges outgoing from node v.
• Inpand(v): return the set of edges incoming to node v.
Accessors for Word Graph Edges
Given an edge e ∈ E we define the following accessors:
• s(e): return the weight s of e.
• ac(e): return the acoustic score ac of e.
• lm(e): return the language model score lm of e.
• w(e): return the word label w of e.
• b(e): return the starting node vi of e.
• f(e): return the ending node vj of e.
• τ(e): return the starting time τ of e.
• t(e): return the ending time t of e.
3.1.2 Word graph properties
Reachability
We define the relation of reachability for nodes (−→) as follows:
∀vi, vj ∈ V : vi −→ vj ⇐⇒ ∃e ∈ E : b(e) = vi ∧ f(e) = vj (3.1)
The transitive hull of the reachability relation −→ is denoted by −→*.
We define:

Reachable(vi, vj): return true if vi −→* vj
Paths
A path through a graph is a sequence of edges p = e1, e2, .., eK such that:
f(ei) = b(ei+1), i = 1, .., K − 1
We call the number K of edges in the sequence p the length of the path.
To express that two nodes are connected by a sequence p of edges, we write
vi −p→ vj. We also define the score of a given path p = e1, e2, .., eK as
follows:

path-Score(p) = Σ_{i=1}^{K} s(ei)   (3.2)
3.1.3 Topological ordering
A topological ordering is a function γ : V −→ {1, .., |V |} on the nodes of a
word graph having the property that:
∀e ∈ E : γ(b(e)) < γ(f(e))
The function topo-Sort(G) sorts all nodes of G in topological order, provided
there are no cycles in the graph.
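topo-Sort is not spelled out in the text; a standard realization that satisfies the property above is Kahn's algorithm, sketched here over a plain edge list (names are illustrative):

```python
from collections import defaultdict, deque

def topo_sort(n_nodes, edges):
    """Return a topological ordering gamma of nodes 0..n_nodes-1 such
    that gamma[b(e)] < gamma[f(e)] for every edge e = (b, f)."""
    fan_in = [0] * n_nodes
    succ = defaultdict(list)
    for b, f in edges:
        succ[b].append(f)
        fan_in[f] += 1
    queue = deque(v for v in range(n_nodes) if fan_in[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in succ[v]:          # removing v frees its successors
            fan_in[u] -= 1
            if fan_in[u] == 0:
                queue.append(u)
    if len(order) != n_nodes:
        raise ValueError("graph contains a cycle")
    return {v: i for i, v in enumerate(order)}

gamma = topo_sort(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
assert all(gamma[b] < gamma[f] for b, f in [(0, 1), (0, 2), (1, 3), (2, 3)])
```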
3.2 Word Graph Generation
In the previous chapter, it has been shown how acoustic, lexical and lan-
guage knowledge can be compiled into a stochastic finite-state integrated
network to be used for generating hypotheses. In general, the search for the
most probable word sequence W, given the sequence of acoustic vectors,
X, is translated into the search of an optimal path over a derived structure,
called trellis, which corresponds to the unfolding of the integrated network
along the time axis. The search, or decoding, is normally carried out in
time-synchronous fashion by the so-called Viterbi algorithm. To avoid ex-
haustive search of a huge search space, the beam search technique is used.
The main idea of beam search is to prune the less promising hypotheses
on the basis of a local estimation. At each time frame t, the decoder produces
a best hypothesis, namely the one among all hypotheses ending at time t
which has the highest local accumulated score. All hypotheses whose scores
fall below the best one with respect to a given threshold are pruned. The
pruned hypotheses are no longer considered in the next time frames. The
best local hypotheses are kept by using a back-pointer list. At the ending
time T of the utterance, the word sequence W will be found by tracing
back through the list.
The main idea of word graphs is to represent word alternatives in regions
of the speech signal where the ambiguity in the acoustic recognition is
high. The advantage is that the acoustic recognition is decoupled from
the application of the language model, in particular a long-span language
model can be applied in a subsequent post-processing step. The number
of word alternatives should be adapted to the level of ambiguity in the
acoustic recognition. In the following we present the m−gram word graph
generation algorithm proposed by [Neukirchen, 2001]. Similarly to the survey of
word graph algorithms presented in the previous chapter, we need to
define some quantities:
• W = (w1, ..., wN) : an N-word sequence.
• Sm(W) = (wN−m+2, ..., wN) : the m−gram LM state of a word sequence
W, given by its (m − 1) most recent words.
• h(w; τ, t) = P(x_{τ+1}^t | w) : probability that word w produces the acoustic
vectors x_{τ+1}...x_t.
• G(W) = P(W) · P(x_1^t | W) : (joint) probability of generating the acoustic
vectors x_1...x_t and a word sequence (w1, ..., wN) with ending time t.
• H(Sm; t) : (joint) probability of generating the acoustic vectors x_1...x_t
and a word sequence with the final (m − 1) words given by Sm at
ending time t.
The optimization in the search is conditioned by word history Sm. When
the search reaches the leaf for word w in the lexicon tree, this results in an
extension of the preceding partial hypothesis W to the new hypothesis
W′ = (W, w). Within the lexicon tree, recombination is applied to
all preceding hypotheses that are in an identical m−gram state Sm(W) but
entered the tree at different starting times τ. Using the m−gram
P(w|Sm(W)), the following optimization generates the probability of the
new partial hypothesis W′ ending at time t:

G(W′; t) = P(w|Sm(W)) · max_τ {H(Sm(W); τ) · h(w; τ, t)}   (3.3)

The optimal tree starting time (the boundary between W and w), τ_opt, is given
implicitly by the optimization in Eq. 3.3, and the boundary only depends
on the ending time t and the m most recent words in W′.
In order to complete the word graph construction algorithm, two more
quantities have to be computed: the word boundary τ(Sm(W), w) and the
word score h(w; τ, t). Directly from Eq. 3.3 we have:

τ_opt = arg max_τ {H(Sm(W); τ) · h(w; τ, t)}   (3.4)

score(w) = G(W′) / H_max(Sm(W); τ)   (3.5)

where H_max(Sm(W); τ) = max_τ {H(Sm(W); τ) · h(w; τ, t)} and score(w)
is the accumulated score of word w spanning from time τ + 1 to time t.
In order to describe the word graph generation algorithm in detail,
the following subsections introduce some concepts concerning the hypotheses
output by the decoder, the best-predecessor hypothesis and the language
model state.
3.2.1 Hypothesis
At each time frame t, after pruning, the decoder outputs a list of Nt
hypotheses {ϱ_t^i}, i = 1..Nt, which represents the set of hypothesized words
ending at time t. For a given hypothesis ϱ, we also define some accessors
as follows:
• b(ϱ): return the beginning state in the decoding network.
• e(ϱ): return the ending state in the decoding network.
Table 3.1: A list of hypotheses output by the decoder at time frame t = 6, Nt = 7.
e(ϱ) w(ϱ) S(ϱ) b(ϱ) τ(ϱ)
172 c -718.115601 3 3
213 a -717.378235 3 3
226 ho -718.540283 3 3
284 e‘ -709.984314 3 3
287 e -717.644409 3 3
2 @BG -605.717712 0 0
3 @BG -629.394775 0 0
• τ(ϱ): return the starting time.
• t(ϱ): return the ending time.
• w(ϱ): return the word label.
• S(ϱ): return the total likelihood score up to this hypothesis.
Table 3.1 shows an example of such a hypothesis list, containing Nt = 7
hypotheses, output by the decoder at time frame t = 6.
3.2.2 Best predecessor
For a given hypothesis ϱ_t^i we define its best predecessor hypothesis by:

ϱ*_τ = arg max_{ϱ_τ^j} { S(ϱ_τ^j) : e(ϱ_τ^j) = b(ϱ_t^i), t(ϱ_τ^j) = τ(ϱ_t^i) } ≡ f(ϱ_t^i)   (3.6)

The following function returns the best predecessor of a given hypothesis
ϱ_t^i, implemented as in Eq. 3.6:

find-Best-Predecessor(ϱ_t^i)

Let us consider the hypothesis in the first row of Table 3.1. We see
that it starts at time frame τ(ϱ) = 3 with beginning state b(ϱ) = 3.
Going back to the list of hypotheses ending at time frame t = 3, we find
the hypothesis ϱ* whose ending state equals 3, with the result shown in
Table 3.2.
Table 3.2: A list of hypotheses output by the decoder at time frame t = 3, Nt = 4.
e(ϱ) w(ϱ) S(ϱ) b(ϱ) τ(ϱ)
172 c -414.504028 0 0
284 e‘ -410.906921 0 0
2 @BG -302.052124 0 0
3 @BG -325.729187 0 0
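The lookup of Eq. 3.6 can be sketched over the entries of Tables 3.1 and 3.2 (field and function names here are assumptions for illustration, not the thesis's code):

```python
from dataclasses import dataclass

@dataclass
class Hyp:
    e: int      # ending state in the decoding network
    w: str      # word label
    S: float    # total likelihood score up to this hypothesis
    b: int      # beginning state
    tau: int    # starting time

def find_best_predecessor(hyp, beam):
    """beam: dict mapping an ending time t to the list of hypotheses
    ending at t. Following Eq. 3.6: among hypotheses ending at time
    tau(hyp) whose ending state equals b(hyp), pick the best-scoring."""
    cands = [h for h in beam.get(hyp.tau, []) if h.e == hyp.b]
    return max(cands, key=lambda h: h.S) if cands else None

# Hypotheses ending at t = 3 (Table 3.2); current hypothesis is the
# first row of Table 3.1.
beam = {3: [Hyp(172, "c", -414.504028, 0, 0),
            Hyp(284, "e‘", -410.906921, 0, 0),
            Hyp(2, "@BG", -302.052124, 0, 0),
            Hyp(3, "@BG", -325.729187, 0, 0)]}
cur = Hyp(172, "c", -718.115601, 3, 3)
best = find_best_predecessor(cur, beam)
```

Here the only hypothesis of Table 3.2 with ending state 3 is the @BG entry, which is therefore returned.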
3.2.3 Language Model State
Best hypothesis sequence
From the definition of the best predecessor of a given hypothesis, we define
a best hypothesis sequence ηm ending at the hypothesis ϱm as a sequence
of m hypotheses:

ηm = ϱ1, .., ϱm such that ϱi = f(ϱi+1), i = 1...m−1   (3.7)

The corresponding word sequence W(ηm), taken from the best hypothesis
sequence, is defined as follows:

W(ηm) = w1, .., wm such that wi = w(ϱi), i = 1...m   (3.8)
Language model state
Given a word sequence W = w1, ..., wm, we define its language model
states L-state[W] and R-state[W] as follows:
• L-state[W ] = w1, ..., wm−1
• R-state[W ] = w2, ..., wm
We call L-state[W ] the left language model state of a given word sequence,
W . In fact, it is the context of that word sequence, discussed in the previous
section. The right language model state R-state[W ] is defined in a similar
way.
Initially, at time frame t = 0, there are no hypotheses output, so:
• L-state(null) = R-state(null) = <s>.
Here, <s> is a special symbol used to denote the beginning of the utterance.
At time frame t = 1, the word sequence of each output hypothesis has
length 1, so:
• L-state[W(η1)] = <s>
• R-state[W(η1)] = <s>, w.
Finally, the left and the right language model state L-state[W (η m)],
R-state[W (ηm)] are built incrementally during word graph construction.
3.2.4 Algorithm
At each time frame t, keep the Nt hypotheses {ϱ_t^i}, i = 1..Nt, within the
beam search.
• For each hypothesis ϱ_t^i, i = 1..Nt, take its starting time τ(ϱ_t^i), its
starting state b(ϱ_t^i) and its word wm = w(ϱ_t^i):
– find the best predecessor ϱ*_τ, given by Eq. 3.6
– update the best hypothesis sequence: ηm(ϱ_t^i) = ηm−1(ϱ*_τ), ϱ_t^i
– update L-state[W(ηm)]
– update R-state[W(ηm)]
– create or find node vj = (t, R-state[W(ηm)])
– create a new edge e whose score is given by ϱ_t^i, ϱ*_τ and p(wm|w1, .., wm−1)
– create or find node vi = (τ, R-state[W(ηm−1)])
– insert edge e between nodes vi and vj
Generate-WG
▷ create the initial node I
1 create-Node(0, R-state(null))
2 for each time frame t, t = 1..T
3 do for each ϱ_t^i ∈ Beam(t), i = 1..Nt
4 do
▷ find the best predecessor of ϱ_t^i
5 ϱ*_τ = find-Best-Predecessor(ϱ_t^i)
▷ update the LM states
6 L-state(W(ηm(ϱ_t^i))) = R-state(W(ηm(ϱ*_τ)))
▷ find or create the ending node
7 vj = create-Node(t, R-state(W(ηm(ϱ_t^i))))
▷ create the edge
8 e = create-Edge(w(ϱ_t^i), score(ϱ_t^i))
▷ find or create the starting node
9 vi = create-Node(τ, R-state(W(ηm(ϱ*_τ))))
▷ insert the newly created edge
10 insert-Edge(vi, vj, e)
▷ create the final node F and its edges
11 create-Node(T + 1, R-state(null))
12 topo-Sort
3.2.5 Implementation Details
At line 3, Beam(t) denotes the set of hypotheses produced by
the decoder at time frame t.
At line 6, the left language model state of the current hypothesis ϱ_t^i is
updated according to its definition: it takes the right
language model state from its best predecessor.
At line 7, the function create-Node(t, context) creates a new node v
according to (t, context) and adds it to V . The parameter t is the current
time frame, and context denotes the R-state of the current word hypothesis
w. If this node has already been created before, this function returns
its index.
At line 8, the function create-Edge creates a new edge e and then
attaches information to it. The information of an edge includes:
• the current word label;
• the score, including both the acoustic score ac and the language model
score lm. The procedure to extract ac and lm from the hypothesis ϱ_t^i is
as follows:
– lm = p(w(ϱ_t^i) | L-state(W(ηm(ϱ_t^i))))
– ac = S(ϱ_t^i) − S(ϱ*_τ) − lm · F_lm, where F_lm is the language model
factor (see also Eq. 3.5).
At line 10, the function insert-Edge(vi, vj, e) inserts the edge e between
the two nodes vi and vj.
At line 11, the function create-Node(T + 1, context):
• creates a final node F at time T + 1 with context = R-state(null);
• creates a special edge e with w(e) = <s> and s(e) = 0;
• for each node v ∈ V with ending time t = T , calls insert-Edge(v, F, e).
Finally, at line 12, the generated word graph is topologically sorted by
calling the procedure topo-Sort.
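The score extraction at line 8 can be sketched numerically. In the toy computation below, the two total scores come from the first rows of Tables 3.1 and 3.2, while the trigram log-probability and the factor F_lm are made-up values:

```python
import math

F_LM = 10.0  # assumed language model factor (cf. Eq. 2.7), made up here

def edge_scores(S_cur, S_pred, lm_logprob, f_lm=F_LM):
    """Split a hypothesis score into its LM and acoustic parts:
    ac = S(cur) - S(pred) - lm * F_lm."""
    lm = lm_logprob
    ac = S_cur - S_pred - lm * f_lm
    return ac, lm

ac, lm = edge_scores(-718.115601, -325.729187, math.log(0.01))
print(ac, lm)
```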
Figure 3.1: Bigram and Trigram Constraint Word Graphs (left panel: a word graph using the bigram constraint; right panel: a word graph using the trigram constraint; dotted lines mark the dead paths arising from pruning/recombination).
3.2.6 Bigram and Trigram-Based Word Graph
In the Generate-WG algorithm described above, if we limit the length of
the word sequence in the left language model state, L-state[W(ηm)], to 1 or 2,
we get the so-called bigram-based or trigram-based word graph,
respectively. It is worth noticing that the bigram constraint is more
relaxed than the trigram constraint. This results in larger word graphs
in the bigram case than in the trigram case. Fig 3.1 shows an example of this
property. As we can see, due to recombination, the dotted line in the
right graph is not expanded further, while this is not true in the left graph.
The dotted line in this case is referred to as a dead path, which should be
removed by the algorithm described in the following subsection.
Since the decoder uses a trigram language model, only with trigram
word graphs are we able to separate the real trigram language model scores
(see Eq. 3.5). In the bigram case, the language model scores are computed
approximately, by using the trigram with an unseen word. This causes word
error rates (WERs) of bigram-based word graph decoding to be always
higher than WERs for trigram-based word graph decoding. In summary,
we have:
• graph error rates (GERs) of bigram-based word graphs are always
lower than the GERs of trigram-based word graphs,
• decoding WERs with bigram-based word graphs are always higher
than with trigram-based word graphs.
Experimental results given in Chapter 5 will confirm these properties.
3.2.7 Dead Paths Removal
When a partial hypothesis is not further expanded due to recombination
or pruning, it can cause the generation of dead paths in the word graph,
i.e. paths that do not reach the final graph node F . Once the word graph
has been built, we can safely remove all dead paths without affecting GERs
and WERs. Moreover, removing the dead paths also makes the word graph
smaller. The observation here is that, at a given node v ∈ V , v ≠ F ,
v ≠ I, if there are no outgoing edges from or no incoming edges to this node,
then all paths passing through v are dead and should be removed. Therefore
the algorithm simply traverses the word graph in topological and reverse
topological order and removes all dead nodes. Going back to Fig 3.1, the path
passing through the edges (v, w) drawn with a dotted line is no longer expanded;
in this case, it can be safely removed. The detailed implementation is given
in the remove-Dead-Paths procedure.
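The removal can be illustrated in Python over a plain edge-list representation (a simplified sketch; repeated sweeps to a fixpoint stand in for the topological and reverse-topological passes):

```python
def remove_dead_paths(n_nodes, edges, I, F):
    """edges: list of (vi, vj, word) tuples. Repeatedly removes interior
    nodes with no incoming or no outgoing edges, together with their
    edges, until no dead node remains."""
    nodes = set(range(n_nodes))
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        for v in list(nodes):
            if v in (I, F):
                continue
            fan_in = sum(1 for e in edges if e[1] == v)
            fan_out = sum(1 for e in edges if e[0] == v)
            if fan_in == 0 or fan_out == 0:   # dead node
                nodes.remove(v)
                edges = [e for e in edges if v not in (e[0], e[1])]
                changed = True
    return nodes, edges

# Node 3 is a dead end: the path 0 -> 3 never reaches the final node 2.
nodes, edges = remove_dead_paths(
    4, [(0, 1, "a"), (1, 2, "b"), (0, 3, "c")], I=0, F=2)
```

Removing a dead node can expose new dead nodes, which is why the sketch iterates until nothing changes.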
dead-Node(vi)
▷ Node vi is dead if fan-Out(vi) = 0 or fan-In(vi) = 0
1 return (fan-Out(vi) = 0 or fan-In(vi) = 0)

remove-Node(vi)
▷ Delete node vi by removing all edges incoming to and outgoing from vi
1 for each edge e(vk, vi, τ, t, w, s) ∈ Inpand(vi)
2 do remove-Edge(vk, vi, e)
3 for each edge e(vi, vj, τ, t, w, s) ∈ Expand(vi)
4 do remove-Edge(vi, vj, e)
5 delete vi

remove-Dead-Paths(wg)
▷ Traverse the word graph and remove all dead nodes
1 for each node vi, vi ≠ I, vi ≠ F in topological order
2 do if dead-Node(vi)
3 then remove-Node(vi)
4 for each node vi, vi ≠ I, vi ≠ F in reversed topological order
5 do if dead-Node(vi)
6 then remove-Node(vi)

3.3 Word Graph Evaluation Methods
In order to evaluate the quality of word graphs, we usually rely on two
criteria: the size of the generated word graph and the graph word error
rate (GER). This section will cover both criteria.
3.3.1 Word Graph Size
Restricting to our domain (the generation of word graphs for speech recognition
and machine translation), there are several different measures of
the quality of a word graph with respect to its size.
[Aubert, 1995] and [Woodland, 1995] used the word graph density, the
node graph density and the boundary graph density as measures. These
criteria have been widely used in most word graph evaluation systems.
They are informally defined as follows.
• The word graph density (WGD) is defined as the total number of word
graph arcs divided by the number of actually spoken words.
• The node graph density (NGD) is defined as the total number of
different words ending at each time frame divided by the number of
actually spoken words.
• The boundary graph density (BGD) denotes the number of word
boundaries, i.e. of different start times, per spoken word.
Finally, in [Amtrup, 1996], the number of paths (nPaths), the number
of derivations and the rank of a path are used to measure the quality
of the word graph, which are suitable for natural language and machine
translation. Here we describe the algorithm that enumerates the number of
paths in a word graph [Amtrup, 1996]. The algorithm that computes the
rank of a given path within a word graph is skipped, as we actually use the
N−best decoding algorithm, which will be described in Chapter 4.
Number of Paths
Given a word graph as an acyclic, directed graph, as defined in Section 3.1,
it is quite easy to compute the number of paths it contains.
Specifically, starting at the initial node I, whose number of paths is
assumed to be 1, traverse all graph nodes vi ∈ V
in topological order. For each incoming edge e(vj, vi) of node vi, add the
value of the predecessor node vj to that of node vi. The value of the final node
F is the number of paths in the graph. The detailed implementation of the
algorithm is given in the num-Paths(wg) procedure.
Fig 3.2 illustrates the num-Paths algorithm. The number in each node is the number of paths entering that node. For example, at the final node we get the value 4, which is indeed the number of paths in the graph. Moreover, examining this node we see that it has a total of three incoming edges: two come from nodes with value 1, while the remaining one comes from a node with value 2.

num-Paths(wg)
▷ return the total number of paths in a word graph.
▷ p[N] temporary variable.
1 p[I] = 1;
2 for each node vi ∈ V \ {I}
3   do p[vi] = 0;
4 for each node vi ∈ V in topological order
5   do for each edge e(vi, vj, τ, t, w, s) ∈ expand(vi)
6     do p[vj] += p[vi];
▷ return the total number of paths in the word graph
7 return p[F];

Figure 3.2: Counting the number of paths in a word graph.
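The counting scheme above can be sketched in a few lines of Python. The graph below is a hypothetical example with the same path count as Figure 3.2; edge words are omitted, since only the topology matters for counting.

```python
from collections import defaultdict

def num_paths(edges, initial, final, topo_order):
    """Count the paths of an acyclic word graph.

    edges: list of (from_node, to_node) pairs;
    topo_order: all nodes, listed in topological order."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    p = defaultdict(int)          # p[v]: number of paths from `initial` to v
    p[initial] = 1
    for u in topo_order:          # each outgoing edge adds u's count to v
        for v in succ[u]:
            p[v] += p[u]
    return p[final]

# A made-up six-node graph with four distinct paths from node 0 to node 5.
example = [(0, 1), (1, 2), (2, 3), (2, 4), (3, 4), (2, 5), (3, 5), (4, 5)]
```

Here num_paths(example, 0, 5, [0, 1, 2, 3, 4, 5]) returns 4, mirroring the value obtained at the final node of Figure 3.2.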
3.3.2 Graph Word Error Rate
The graph word error rate (GER) is computed by determining which sen-
tence through the word graph best matches the actually spoken sentence.
The match criterion is defined in terms of word substitutions (SUB), dele-
tions (DEL) and insertions (INS). This measure provides a lower bound of
the word error rate for this word graph. The algorithm for its computation is very similar to the one that computes the string edit distance. The following procedure, namely GER, computes the GER given the word graph wg and the reference sentence s of length T.
GER(s)
Initial:
1 t = 0; ∀ node vi ∈ V, νt[vi] = ∞; νt[I] = 0
Matching:
2 for t = 1 to T
3   do for each node vi in topological order
4     do ▷ compute the deletion error
5       νt[vi] = min(νt[vi], νt−1[vi] + 1)
6       for each edge e(vi, vj, τ, t, w, s) ∈ expand(vi)
7         do ▷ compute the substitution error
8           νt[vj] = min(νt[vj], νt−1[vi] + δ(w(e), s[t]))
    ▷ compute the insertion error
9   for each node vi in topological order
10    do for each e(vi, vj, τ, t, w, s) ∈ expand(vi)
11      do νt[vj] = min(νt[vj], νt[vi] + 1)
    ▷ update the score table for the next loop
12  copy(νt−1, νt); ∀vi ∈ V, νt[vi] = ∞;
13 return νT[F]
Let νt[vi] denote the total cost when the matching process is at graph node vi and sentence position t. The algorithm is based on dynamic programming. Specifically, at each position t, considering the word s[t] of the reference sentence s, the value νt[vi] is computed by choosing the minimum cost, in terms of INS, DEL and SUB, when comparing the words of the outgoing edges e(vi, vj) with the word s[t]. This value is then compared with the previous values νt−1[vi] and νt[vi], and the minimum is chosen. The value νT[F] at the final position T of the reference sentence and at the final node
F of the word graph is the GER.
The GER is one of the main criteria for evaluating the quality of a word
graph. All the results in Chapter 5 are given in terms of GER.
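The dynamic program above can be sketched in Python as follows, assuming the word graph is given as a list of (from-node, to-node, word) edges plus a topological ordering of its nodes; the representation and the example below are illustrative, not the thesis implementation.

```python
def graph_wer(edges, initial, final, topo, ref):
    """Graph word error rate: the minimum number of substitutions,
    deletions and insertions over all paths of the graph vs. `ref`."""
    INF = float('inf')
    inc = {v: [] for v in topo}               # incoming (pred, word) per node
    for u, v, w in edges:
        inc[v].append((u, w))
    # Row t = 0: reaching node v consumes no reference word, so every
    # graph word along the way counts as an insertion.
    prev = {v: INF for v in topo}
    prev[initial] = 0
    for v in topo:
        for u, w in inc[v]:
            prev[v] = min(prev[v], prev[u] + 1)
    for t in range(1, len(ref) + 1):
        cur = {v: INF for v in topo}
        for v in topo:
            cur[v] = min(cur[v], prev[v] + 1)          # deletion of ref[t-1]
            for u, w in inc[v]:                        # substitution / match
                cur[v] = min(cur[v], prev[u] + (0 if w == ref[t - 1] else 1))
        for v in topo:                                 # insertion of a graph word
            for u, w in inc[v]:
                cur[v] = min(cur[v], cur[u] + 1)
        prev = cur
    return prev[final]
```

For a toy graph containing the two paths "the cat sat" and "the cap sat", the GER of the reference "the cat sat" is 0, while "the dog sat" gives 1 (one substitution).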
3.4 Removing Empty-Edges
Empty edges, denoted by the label @BG in the word graph, represent the background-noise hypotheses produced by the decoder. They are assumed not to carry any meaning that has to be preserved for the language processing steps that follow recognition. Removing empty edges also simplifies other algorithms operating on the word graph, in our case the node-merging and word graph expansion algorithms, which will be described in Section 3.6 and Section 3.7. In this section we describe the algorithm for removing all empty edges from the word graph.
3.4.1 Algorithm
• For each node vi in topological order:
  – Get the list of nodes {vj} which can be reached from vi through empty transitions, together with the corresponding scores map[vj]. By map[vj] we mean the best accumulated score of the paths from vi to vj through empty transitions.
  – For each node vj:
    ∗ For each edge e(vj, vk, τ, t, w, s) with w ≠ empty:
      · update the score of edge e: s(e) = s(e) + map[vj];
      · insert-Edge(vi, vk, e)
    ∗ remove-Edge(vi, vj, w)
3.4.2 Implementation Details
Notation
▷ return the list of nodes and scores reachable from vi through empty transitions.
▷ The corresponding scores are kept in map.
▷ Here map[vj] is the accumulated score of @BG links from node vi to vj.
get-Empty-Node-Scores(vi, st, map)
1 for each e(vi, vj, τ, t, w, s) ∈ Expand(vi)
2   do if w(e) = empty
3     then if (map[vj] = 0)
4       then get-Empty-Node-Scores(vj, st + s(e), map)
5         map[vj] = st + s(e)
6       else tmp = st + s(e)
7         if map[vj] < tmp
8           then get-Empty-Node-Scores(vj, tmp, map)
9             map[vj] = tmp
Removing empty edges can result in a larger word graph than the one containing them. Fig 3.3 shows a word graph with @BG edges, while the equivalent word graph without @BG edges is illustrated in Fig 3.4.
A very important property of word graphs without @BG edges is that all incoming edges of a node vi, ∀vi ∈ V, have the same word label, because node vi must have a unique left language-model state. In this case, we can put the word label directly on node vi, assuming that the initial node I carries the word label <s>, as shown in Fig 3.5. This property is exploited in the word graph expansion and node compression algorithms, which will be described in the next sections.
remove-Empty-Edge(vi)
1 get-Empty-Node-Scores(vi, 0, map)
2 for each vj such that map[vj] > 0
3   do for each e(vj, vk, τ, t, w, s) ∈ Expand(vj)
4     do if w(e) ≠ empty
5       then s(e) = s(e) + map[vj]
6         Insert-Edge(vi, vk, e)
7     Remove-Edge(vi, vj, w)

remove-All-Empty-Edges
1 for each node vi ∈ V taken in topological order
2   do Remove-Empty-Edge(vi)
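A minimal Python sketch of the two procedures combined is given below. It assumes the word graph is stored as an adjacency map from each node to its (destination, word, log-score) edges, with higher scores being better; the node names and scores in the usage example are made up.

```python
def remove_empty_edges(succ, topo, empty='@BG'):
    """Bypass all empty edges, processing nodes in topological order.

    succ[v]: list of (dest, word, score) edges leaving node v.
    For each node vi, first collect the best accumulated score of the
    empty-edge paths to every reachable node (the role of map[vj] in
    get-Empty-Node-Scores), then copy each non-empty edge leaving a
    reachable node back to vi, and finally drop vi's empty edges."""
    for vi in topo:
        best = {}                                     # best empty-path score
        stack = [(vj, s) for (vj, w, s) in succ[vi] if w == empty]
        while stack:
            vj, acc = stack.pop()
            if vj in best and best[vj] >= acc:
                continue
            best[vj] = acc
            for (vk, w, s) in succ[vj]:
                if w == empty:
                    stack.append((vk, acc + s))
        for vj, acc in best.items():                  # rewire non-empty edges
            for (vk, w, s) in succ[vj]:
                if w != empty:
                    succ[vi].append((vk, w, s + acc))
        succ[vi] = [(vk, w, s) for (vk, w, s) in succ[vi] if w != empty]
    return succ
```

For example, a graph with an edge 0 →@BG 1 of score −1.0 and an edge 1 →"b" 2 of score −0.5 ends up with a direct edge 0 →"b" 2 of score −1.5.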
3.5 Forward-Backward Pruning
Since the word graph produced from the hypotheses directly output by the decoder can be very large, it is essential to use pruning methods to generate compact word graphs. Three different pruning techniques are considered:
• Forward pruning. At each time frame, only the most promising hypotheses are retained in the one-pass search. This pruning technique is usually called "beam search".
• Forward-backward pruning. It consists of two passes, applied after the generation of a huge word graph, and employs a beam with respect to the forward and backward scores of each hypothesis. Strictly speaking, for every word graph arc representing a word hypothesis (w, τ, t), with start time τ, end time t and its score, we compute the overall score of the best path passing through this specific arc. Word arcs whose scores are relatively close to the global score of the best path are kept in the word graph, while the others are
Figure 3.3: A word graph with @BG edges.
Figure 3.4: A word graph with @BG edges removed.
Figure 3.5: A word graph with words placed on nodes.
pruned.
• Node compression. This pruning technique is based on the topology of the word graph: if two nodes vi and vj have the same incoming or outgoing edges, they can be merged.
Experiments with all the pruning techniques mentioned above are given in Chapter 5. In this section we present forward-backward pruning; the next section describes the node compression algorithm.
First, the edge posterior probability is introduced. This posterior is used not only in forward-backward pruning but also in confusion network construction, which will be described in Chapter 4.
3.5.1 Edge Posterior Probability
In a word graph, each edge is labeled with a word, its starting and ending nodes, and a likelihood score computed from the acoustic and language models, as shown in the previous section. From these edge likelihood scores, a posterior probability p(e|X) can be calculated for each edge e. This is done by an algorithm very similar to the forward-backward algorithm used to train HMMs.
Definition
The edge posterior p(e|X) is defined as the sum of the probabilities of all paths q passing through the link e, normalized by the probability of the signal p(X):

p(e|X) = ( ∑_{q: e ∈ q} p(q, X) ) / p(X)        (3.9)

where:
• p(X) is approximated by the sum over all paths through the word graph.
Figure 3.6: Link posterior computation.
• p(q, X) is the probability of path q, composed of the acoustic likelihood pac(X|q) and the language model likelihood plm(W), as given in Eq. 3.2.
• A forward-backward based algorithm is used to compute p(e|X).
3.5.2 Forward-Backward Based Algorithm
For each node v ∈ V, two quantities are defined, the forward probability and the backward probability, as follows:
• α(v) is the sum of the likelihoods of all paths from node I to node v
• β(v) is the sum of the likelihoods of all paths from node v to node F.
We have:

α(v) = ∑_{e: f(e)=v} s(e) α(b(e))        (3.10)

β(v) = ∑_{e: b(e)=v} s(e) β(f(e))        (3.11)

From α(v) and β(v) we obtain the posterior of a given edge e ∈ E:

p(e|X) = α(b(e)) s(e) β(f(e)) / p(X)        (3.12)
An example of edge posterior computation is illustrated in Fig 3.6.
3.5.3 Implementation Details
This section describes the implementation of the forward-backward algorithm for the computation of the edge posteriors. The details of this algorithm can also be found in [Wessel, 2002]. As we can see, the Forward-Backward procedure consists of two phases.
In the first phase, the forward pass, all the initial values α[vi], ∀vi ∈ V, are set to log 0, except the value at the initial node, α[I], which is set to 0. Then the algorithm traverses all the nodes vi ∈ V, vi ≠ I, in topological order and computes the value α[vi] for each node by using Eq. 3.10 (from line 4 to line 6).
In the second phase, the backward pass, the values β[vi] are computed by using Eq. 3.11, in reversed topological order (from line 10 to line 12). After that, posterior probabilities are computed for the edges by exploiting Eq. 3.12.
Two things are worth mentioning here. First, logPlus(a, b) returns the log of the sum of two quantities expressed in logarithms. Second, the values α[F] and β[I] are both equal to the probability of all paths in the word graph.
3.5.4 Forward-Backward Based Pruning
One of the applications of edge’s posteriors is for word graph pruning.
In [Sixtus, 1999], the authors proposed a method to gain high quality word
graphs by using the forward-backward based pruning algorithm. In fact,
the forward-backward was used to compute the overall score of the best
path traversing a given arc. Word arcs whose scores are relatively close to
the global score of the best path are kept in the word graph, while the others
are pruned. Implementation details are given in the prune-Posterior(t)
procedure, which input is a threshold t. The algorithm works by traversing
Forward-Backward(wg)
▷ compute the posterior for all edges e ∈ wg
▷ based on the forward-backward algorithm.
▷ forward pass
▷ α[N] local variable for the forward probability α
1 for vi = I to F
2   do α[vi] = −∞;
3 α[I] = 0
4 for each node vi = I + 1 to F in topological order
5   do for each edge e(vj, vi, τ, t, w, s) ∈ Inpand(vi)
6     do α[vi] = logPlus(α[vi], α[vj] + s(e))
▷ backward pass
▷ β[N] local variable for the backward probability β
7 for vi = I to F − 1
8   do β[vi] = −∞;
9 β[F] = 0
10 for each node vi = F − 1 down to I in reversed topological order
11   do for each edge e(vi, vj, τ, t, w, s) ∈ Expand(vi)
12     do β[vi] = logPlus(β[vi], β[vj] + s(e))
▷ edge posterior computation
13 for each node vi = I to F
14   do for each edge e(vi, vj, τ, t, w, s) ∈ Expand(vi)
15     do p(e) = (α[vi] + s(e) + β[vj]) − α[F];
all the edges in the word graph and removing the edges whose posterior probabilities are lower than the given threshold t. Practically, the threshold value t is chosen such that an edge e is removed if p(e) < t, where the log posterior p(e) is computed in the Forward-Backward procedure. The experimental results of this algorithm are given in Chapter 5, and show that it dominates the other pruning algorithms.
prune-Posterior(t)
▷ prune the word graph based on edge posteriors
1 for each node vi = I to F
2   do for each edge e(vi, vj, τ, t, w, s) ∈ Expand(vi)
3     do if p(e) < t
4       then remove-Edge(vi, vj, e)
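The logPlus operation and the posterior computation can be sketched as follows; this simplified Python version keeps edges as (from, to, log-score) triples instead of the full e(vi, vj, τ, t, w, s) records, and prune_posterior mirrors the prune-Posterior procedure.

```python
import math

def log_plus(a, b):
    """log(exp(a) + exp(b)), computed stably in the log domain."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def edge_posteriors(edges, topo):
    """Log posterior of every edge; topo lists the nodes in topological
    order, with the initial node first and the final node last."""
    I, F = topo[0], topo[-1]
    alpha = {v: -math.inf for v in topo}
    alpha[I] = 0.0
    for v in topo[1:]:                   # forward pass (Eq. 3.10)
        for (u, d, s) in edges:
            if d == v:
                alpha[v] = log_plus(alpha[v], alpha[u] + s)
    beta = {v: -math.inf for v in topo}
    beta[F] = 0.0
    for v in reversed(topo[:-1]):        # backward pass (Eq. 3.11)
        for (u, d, s) in edges:
            if u == v:
                beta[v] = log_plus(beta[v], beta[d] + s)
    # log p(e|X) = alpha(b(e)) + s(e) + beta(f(e)) - alpha(F)   (Eq. 3.12)
    return [alpha[u] + s + beta[d] - alpha[F] for (u, d, s) in edges]

def prune_posterior(edges, topo, t_log):
    """Keep only the edges whose log posterior reaches the threshold."""
    post = edge_posteriors(edges, topo)
    return [e for e, p in zip(edges, post) if p >= t_log]
```

With two parallel edges of likelihood 0.6 and 0.4 between the initial and final nodes, the posteriors come out as 0.6 and 0.4, and pruning with a threshold of 0.5 keeps only the first edge.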
3.6 Node Compression
In this section, a word graph pruning technique is introduced which exploits
the graph topology [Weng, 1999]. Implementation details are also given.
Results on the application of this algorithm are given in Chapter 5.
The idea of this algorithm is to combine identical sub-paths in the word graph so that redundant nodes and edges are removed. The key observation underlying the algorithm is that if two nodes vi and vj have the same word label, w(vi) = w(vj), and meet one of the following criteria:
• Inpand(vi) = Inpand(vj)
• Expand(vi) = Expand(vj)
• (Inpand(vi) ⊂ Inpand(vj) or Inpand(vj) ⊂ Inpand(vi) ) or
(Expand(vi) ⊂ Expand(vj) or Expand(vj) ⊂ Expand(vi) )
then the two nodes vi and vj can be merged without changing the language of the word graph, where the language of a word graph is defined as the set of all word strings starting at the initial node and ending at the final node. The implementation of this algorithm is given in the procedure named Node-Compression. As we can see, the key operation of the procedure is at line 7, where the two nodes vi and vj are merged. The merging operation is only performed when both vi and vj have the same set of outgoing edges
Node-Compression(wg)
1 finish = false;
2 while (!finish)
3   do finish = true;
4     for each node v ∈ V in reverse topological order
5       do for each pair of predecessor nodes (vi, vj) of node v
6         do if Expand(vi) = Expand(vj)
7           then merge-Node(vi, vj)
8             finish = false;
or incoming edges. The loop from line 2 to line 8 terminates when no pair of nodes satisfies the condition in line 6.
The merge-Node(vi, vj) procedure, which merges the two nodes vi and vj, consists of two phases. In the first phase, all the outgoing edges of node vj are copied and added to the set of outgoing edges of node vi. Similarly, in the second phase, all the incoming edges of node vj are copied and added to the set of incoming edges of node vi. In case a copied edge of node vj is already among the edges incoming to or outgoing from node vi, the score of that edge is updated by keeping the best one, as specified at line 3 and line 9 of the merge-Node(vi, vj) procedure.
The experimental results of this algorithm can be found in Chapter 5.
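A simplified sketch of the merging idea is given below. It covers only the case of identical outgoing edge sets with equal word labels (the subset criteria and the score update of merge-Node are omitted), and the diamond-shaped example is hypothetical.

```python
def compress(nodes, edges, word):
    """Merge nodes with the same word label and identical outgoing edges.

    nodes: list of node ids; edges: set of (from, to, word) triples;
    word[v]: word label placed on node v. Iterates to a fixed point,
    redirecting every edge of the merged node to the kept node."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        out = {v: frozenset((w, d) for (u, d, w) in edges if u == v)
               for v in nodes}
        groups = {}
        for v in nodes:
            key = (word.get(v), out[v])
            if key in groups:                       # same label, same Expand
                keep = groups[key]
                edges = {(keep if u == v else u, keep if d == v else d, w)
                         for (u, d, w) in edges}
                nodes = [n for n in nodes if n != v]
                changed = True
                break
            groups[key] = v
    return nodes, edges
```

In a diamond 0 →"a" 1 →"b" 3 and 0 →"a" 2 →"b" 3, where nodes 1 and 2 both carry the label "a" and share their outgoing edges, the two middle nodes collapse into one, leaving only two edges.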
3.7 Word-Graph Expansion
3.7.1 Introduction
Since the cost of generating bigram word graphs is small, it is attractive to generate bigram word graphs and then expand them to m−grams. An m−gram word graph is a word graph whose transitions (edges) carry m−gram probabilities. Thus with an m−gram word
merge-Node(vi, vj)
▷ Merge two nodes vi and vj
▷ First copy the incoming edges of vj
1 for each edge e(vk, vj, w, s)
2   do if ∃ e′(vk, vi, w, s′)
3     then s′ = Min(s, s′)
4     else enew = create-Edge(w, s)
5       insert-Edge(vk, vi, enew)
6 ▷ Now copy the outgoing edges of vj
7 for each edge e(vj, vk, w, s)
8   do if ∃ e′(vi, vk, w, s′)
9     then s′ = Min(s, s′)
10    else enew = create-Edge(w, s)
11      insert-Edge(vi, vk, enew)
graph, more context information is encoded and can be used in the subsequent processing steps. In this section the method for word graph expansion by [Weng, 1999] is described. For simplicity, our discussion is focused on trigram word graphs only, but the algorithms described can easily be generalized to m−gram models of higher order.
Before describing the algorithms in detail, we make the following assumption:
• ∀ei, ej ∈ E, if f(ei) = f(ej) then w(ei) = w(ej)
The assumption ensures that all edges ei incoming to a node vi have the same word label. We can obtain this property by running remove-Empty-Edge after generating the word graph. With this assumption we can put the word label w(ei) on node vi and add the following accessor to nodes:
• w(vi) returns the word label of node vi. Moreover, we simply denote by wi the word label of node vi.
A consequence of this is that:
• ∀vi ∈ V, wi ≠ empty
In the following subsections, two algorithms for word graph expansion are described, namely the conventional algorithm and the compaction algorithm. The first simply traverses the word graph and duplicates nodes and edges to ensure that all edges in the word graph contain trigram language model scores. The compaction algorithm takes the back-off property of the language model into account when duplicating nodes: duplications are only performed where true trigram language model scores exist. Naturally, the second approach keeps the word graph as small as possible.
3.7.2 Conventional Algorithm
To place trigram probabilities on the graph edges, we must create a unique two-word context for each edge. This is very similar to the definition of the language model state, L-state(w), as given in Section 3.2.3. For example, in Fig 3.7, case a, a node with word label w4 has its edges duplicated to guarantee the uniqueness of the trigram contexts for placing p(w5|w1w4) and p(w5|w2w4) on edges e41 and e51, respectively. When a central node has two predecessor nodes labeled with the same word, only one additional node and its corresponding outgoing edges need to be duplicated. This case is illustrated in Fig 3.7, case b: the central node labeled with word w4 has two predecessor nodes carrying the same word label w1, so only one new node with word label w4 is duplicated. The conventional trigram expansion algorithm, given in the procedure named conventional-Expansion, works by duplicating nodes and edges in the manner indicated.
As we can see, at a given node vj, two loops are performed at line
Figure 3.7: Illustration of the conventional word graph expansion algorithm. Case a: a bigram word graph before expansion and the word graph after expansion. Case b: a bigram word graph before expansion and the word graph after expansion.
3 and line 4, over its incoming and outgoing edges, respectively. Specifically, in the outermost loop,
• ea(vi, vj, wj) is an incoming edge of node vj. By the above assumption, the word labels of vi and vj are wi and wj, respectively.
Similarly, in the inner loop,
• eb(vj, vk, wk) is an outgoing edge from node vj to node vk, which is labeled with word wk.
At line 5, if a node v? with word label wj and trigram context (wi, wj, wk) has already been created, we just need to insert edge eb from node v? to node vk (line 6). Otherwise, at line 7, node v? is duplicated from node vj and the word label wj is also assigned to node v?. Then, at line 8, edge ea is placed between nodes vi and v?.
The key of the algorithm is at lines 9 and 10, where edge eb is updated
conventional-Expansion
1 ▷ The conventional word graph expansion
2 for each vj ∈ V taken in topological order
3   do for each ea(vi, vj, wj) ∈ Inpand(vj)
4     do for each eb(vj, vk, wk) ∈ Expand(vj)
5       do if exists(v?, wj, (wi, wj, wk))
6         then insert-Edge(v?, vk, eb)
7         else v? = dup-Node(vj)
8           insert-Edge(vi, v?, ea)
9           lm(eb) = p(wk|wi, wj)
10          insert-Edge(v?, vk, eb)
11 remove-Node(vj)
with the new trigram language model score and then inserted between node v? and node vk. Now the edge between v? and vk carries exactly the trigram language model probability p(wk|wi, wj).
When all the incoming edges of node vj have been examined, the node can be safely removed, as in line 11.
3.7.3 Compaction Algorithm
The above algorithm considerably increases the word graph size. In fact, for most trigram language models, the number of observed trigrams is much smaller than the number of all possible trigrams. It is therefore attractive to share the bigram back-off weights across trigram contexts, since in this case we need to duplicate only enough nodes to uniquely represent the explicit trigram probabilities in the word graph.
The idea underlying the algorithm is to factorize the backed-off trigram probability p(wi+2|wi, wi+1) into the back-off weight bo(wi, wi+1) and the bigram probability p(wi+2|wi+1), and to multiply the back-off weight into the weight of the edge e(vi, vi+1), while keeping only the bigram estimate
Figure 3.8: Illustration of the compact word graph expansion, where an explicit trigram probability exists only for (w1, w4, w5).
on the edge e(vi+1, vi+2). Thus no node duplication is required. Since back-off weights and probabilities are multiplied, the total score along a path from vi through vi+1 to vi+2 includes the correct trigram probability p(wi+2|wi, wi+1).
Fig 3.8 illustrates the compact expansion idea, given that there is only one explicit trigram probability, p(w5|w1, w4). Notice that only one new node labeled with word w4 is duplicated, together with its incoming edge e14 and its outgoing edge e41. The trigram probability p(w5|w1, w4) is placed on the edge e41. The language model score on edge e14 is copied directly from the language model score on edge e1. After the explicit trigrams are processed, the language model scores on the edges leaving the original node labeled with word w4, which are e42 and e5 on the right-hand side of Fig 3.8, are set to the corresponding bigram probabilities p(w5|w4) and p(w6|w4). Furthermore, the bigram back-off weights bo(w1, w4), bo(w2, w4) and bo(w3, w4) are multiplied into the corresponding incoming edges e1, e2, e3 of the original node labeled with word w4.
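The factorization can be illustrated numerically; the log-probabilities below are made-up values, not taken from the thesis or from any real language model.

```python
import math

# Hypothetical model: one explicit trigram, two bigrams, two back-off weights.
trigram = {('w1', 'w4', 'w5'): math.log(0.5)}
bigram = {('w4', 'w5'): math.log(0.2), ('w4', 'w6'): math.log(0.1)}
backoff = {('w1', 'w4'): math.log(0.9), ('w2', 'w4'): math.log(0.7)}

def trigram_logprob(w1, w2, w3):
    """log p(w3|w1,w2): the explicit trigram if present, otherwise the
    back-off factorization bo(w1,w2) + log p(w3|w2) used by the
    compaction algorithm (a missing back-off weight defaults to log 1)."""
    if (w1, w2, w3) in trigram:
        return trigram[(w1, w2, w3)]
    return backoff.get((w1, w2), 0.0) + bigram[(w2, w3)]
```

Here p(w5|w1,w4) = 0.5 comes from the explicit trigram and stays on the outgoing edge of the duplicated node, while p(w5|w2,w4) = 0.7 · 0.2 = 0.14 is obtained by leaving the bigram 0.2 on the outgoing edge and multiplying the back-off weight 0.7 into the incoming edge.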
As we can see, at a given node vj, the compaction-Expansion procedure also contains two loops, at line 2 and line 3. In case there is an explicit trigram language model probability for wi, wj, wk (see line 4), a new node is duplicated together with its new incoming and outgoing edges. Lines
compaction-Expansion
▷ The compaction word graph expansion
1 for each vj ∈ V taken in topological order
2   do for each ea(vi, vj, wj) ∈ Inpand(vj)
3     do for each eb(vj, vk, wk) ∈ Expand(vj)
4       do if trigram(wi, wj, wk)
5         then if exists(v?, wj, (wi, wj, wk))
6           then insert-Edge(v?, vk, eb)
7           else v? = dup-Node(vj)
8             insert-Edge(vi, v?, ea)
9             lm(eb) = p(wk|wi, wj)
10            insert-Edge(v?, vk, eb)
11        else mark(ea)
12          mark(eb)
13    if !(mark(ea))
14      then remove-Edge(vi, vj, ea)
15      else lm(ea) = lm(ea) · bo(wi, wj)
16 for each eb(vj, vk, wk) ∈ Expand(vj)
17   do if !(mark(eb))
18     then remove-Edge(vj, vk, eb)
19     else lm(eb) = p(wk|wj)
20 if !(mark(ea)), ∀ea(vi, vj, wj) ∈ Inpand(vj)
21   then remove-Node(vj)
between 6 and 10 correspond exactly to the conventional-Expansion procedure. On the contrary, when no explicit trigram exists, no new node has to be duplicated; we just need to update the language model scores by using the back-off factors.
The experiments on the word graph expansion algorithms will be given in Chapter 5.
Chapter 4
Word Graph Decoding
Word graph decoding is the process of finding the best sentence or the N−best sentences through the word graph. In this chapter, three word graph decoding algorithms are fully presented, namely 1-best word graph decoding, N−best word graph decoding and, finally, consensus decoding.
4.1 1-Best Word Graph Decoding
Finding the best sentence in the word graph is equivalent to the shortest path problem in graph theory [Thomas, 2001]. In fact, the best sentence corresponds to the longest path, i.e. the path that has the highest score in the word graph. In general, the problem of 1-best word graph decoding can be formulated as follows: given a word graph G = (V, E, I, F), find the paths with the highest score from I to some vj ∈ V. When vj is equal to F, we have the complete best path, or simply the best path. Let d(vj) denote the best score of a path starting at the initial node I and ending at node vj. By using dynamic programming, the following recurrence can be formulated:

d(vj) = 0 if vj = I, and otherwise
d(vj) = max_{e: b(e)=vi, f(e)=vj} ( d(vi) + s(e) )        (4.1)
The interesting property of Eq. 4.1 is that each edge is visited only once. Hence, the best score up to node vj can be found by just comparing the values d(vi) + s(e) of all incoming edges e from nodes vi to node vj.
The implementation of this algorithm is straightforward. However, we are interested not only in the best score but also in the word sequence of the best path. Therefore, a back-pointer list is needed in order to trace back the best word sequence from the final node F to the initial node I. For this, to each node vj ∈ V an entry is associated as follows:

entry[vj] = ( vi → the best predecessor of node vj,
              w → the word label of the edge e(vi, vj),
              s → the best score up to node vj )
The 1Best-Decoding procedure consists of two passes. In the forward pass, dynamic programming is applied to compute the best path from the initial node I to the final node F; specifically, lines 7 to 11 implement Eq. 4.1. When the final node F is reached, the value of entry[F].s is the score of the best path. In the backward pass, the best word sequence is found by back-tracking from entry[F] to entry[I] using the back-pointer information mentioned above.
This algorithm is also referred to as 1-best Viterbi decoding. Experimental results of the algorithm on word graphs are given in Chapter 5, where we will compare different word graph rescoring methods.
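The two passes can be sketched in Python as follows, assuming edges are (from, to, word, log-score) tuples and topo lists the nodes in topological order with I first and F last; the representation is an assumption, not the thesis code.

```python
import math

def viterbi_best_path(edges, topo):
    """1-best Viterbi decoding: returns (best score, best word sequence)."""
    I, F = topo[0], topo[-1]
    best = {v: -math.inf for v in topo}   # plays the role of entry[v].s
    best[I] = 0.0
    back = {}                             # entry[v].f and entry[v].w
    for v in topo[1:]:                    # dynamic programming pass
        for (u, d, word, s) in edges:
            if d == v and best[u] + s > best[v]:
                best[v] = best[u] + s
                back[v] = (u, word)
    words, v = [], F                      # back-tracking pass
    while v != I:
        u, word = back[v]
        words.append(word)
        v = u
    return best[F], words[::-1]
```

For instance, with two parallel edges "a" (score −1) and "b" (score −2) followed by an edge "c" (score −1), the procedure returns the score −2 and the sequence "a c".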
4.2 N-Best Decoding
The problem of finding the N shortest paths of a weighted directed graph is a well-studied problem in computer science [Eppstein, 1998a]. The problem also admits a number of variants, such as finding just the N shortest paths with no cycle, or the N shortest paths with distinct scores, which
1Best-Decoding
▷ initialization
1 for each node vi = I to F
2   do entry[vi].f = −1
3     entry[vi].s = −∞
4     entry[vi].w = empty
▷ the search starts from the initial node
  entry[I].s = 0
▷ Dynamic programming pass
5 for each node vj = I to F in topological order
6   do for each edge e(vi, vj, τ, t, w, s) ∈ Inpand(vj)
7     do score = entry[vi].s + s(e)
8       if (entry[vj].s < score)
9         then entry[vj].s = score
10          entry[vj].f = vi
11          entry[vj].w = w(e)
▷ Back-tracking pass
12 f = F
13 repeat
14   to = f
15   word = entry[to].w
16   push word to sentence
17   f = entry[to].f
18 until f = I
have all been studied extensively as well. An efficient algorithm introduced by [Eppstein, 1998b] finds an implicit representation of the N shortest paths (allowing cycles and multiple edges) between two nodes in time O(|E| + |V| · log |V| + N). However, as mentioned in the previous section, the related problems that arise in speech recognition applications are quite different from the problem in graph theory. Concretely, the weighted graphs considered in our applications are the word graphs defined in Section 3.1, and the problem of determining the N shortest paths is interpreted as determining the N best paths, i.e. the paths with the highest scores, in word graphs.
The N word sequences corresponding to the N best paths are called the N−best word sequences, N−best sentences, N−best hypotheses or, more simply, the N−best list. It is often desirable to determine not just the N best word sequences in a word graph, but the N best distinct word sequences, also called the N−different best list. In the following subsections, we describe two algorithms for finding the N best word sequences in a word graph, namely the stack-based N−best decoder and the exact N−best decoder. Experimental results are given in Chapter 5 for speech recognition and in Chapter 6 for speech translation.
4.2.1 The Stack-Based N-Best Word Graph Decoding
In this section, a simple method is presented for finding all complete paths in a word graph, from the initial node I to the final node F, whose scores are within a prescribed threshold th of the best path score. Informally, the algorithm first computes the score of the best path using the 1Best-Decoding procedure (Section 4.1) and then compares it with the scores of all the other paths in the word graph. The paths whose scores are within the threshold th of the best score are output. [Thomas, 1984] proposed an efficient implementation of this algorithm, which uses a push-down (last-in, first-out) stack and has modest memory requirements.
The algorithm can be described as follows. Let dβ(vi) denote the score of the best path from the current node vi to the final node F, and let us assume that there exists a partial path p with score dα(vi), starting at the initial node I and ending at the current node vi. The edge e(vi, vj, w, τ, t, s) is said to enter the path p if:

dα(vi) + s + dβ(vj) ≥ dβ(I) − th        (4.2)
where dα(vi) + s + dβ(vj) is the score of the complete path extending p and dβ(I), by its definition, is the score of the best path of the word graph.
Hence, the edges that enter are exactly those on paths from node I to node F having scores within th of the best path score. The depth-first procedure lets us use the same path p for each entry in the stack, which allows very small objects to be put on the stack. In fact, each entry in the stack has three attributes:

entry = ( vi → the next-to-last node,
          vj → the last node,
          c → the score of the path (I, ..., vi, vj) having the sub-path (I, ..., vi) in p )

Since the word graph is an acyclic directed graph, the number of elements in the stack at any time is at most the number |E| of edges in the word graph.
The detailed implementation of the algorithm is given in the Stack-Decoding procedure. In the initialization step, the scores dβ[vi], ∀vi ∈ V, of the best path from node vi to the final node F are computed using the 1Best-Decoding procedure, run backward from the ending node F. The initial entries of the stack are also created during this step, at lines 3 and 4. The key of the algorithm is at line 13: only edges through which some path satisfies Eq. 4.2 are put on the stack. When the final node F is reached, the complete path p, whose score is within th of the best path score, is output.
The main problem of this algorithm is how to choose a suitable value for the threshold th. Experimentally, we first compute the score of the best path and then try values of th until approximately N best sentences are found.
Stack-Decoding
▷ Compute dβ[vi], ∀vi ∈ V
1 1Best-Decoding(F)
▷ Create the initial path p with node I.
2 set p = (I)
▷ Create the initial stack.
3 for each e(I, vj, τ, t, w, s) ∈ expand(I) satisfying Eq. 4.2
4   do create an entry(I, vj, s) in the stack.
▷ The main algorithm.
5 while the stack is not empty
6   do
7     pop the topmost entry(vi, vj, c) from the stack
8     replace p = (I, ..., vi) by p = (I, ..., vi, vj).
9     if (vj ≠ F)
10      then
11        vi = vj
12        d = c
13        for each e(vi, vj, τ, t, w, s) ∈ expand(vi) satisfying Eq. 4.2
14          do
15            c = d + s(e)
16            create an entry(vi, vj, c)
17      else
18        output p and c

In Fig 4.1, the optimal path from the initial node I = v1 to the final node F = v9 has a cost of −13 and passes through the set of nodes {v1, v3, v6, v8, v9}. Table 4.1 shows how the algorithm computes all paths from node I to node F whose lengths are within 2.6 units of the optimal path length.
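The stack-based decoder can be sketched as follows. Unlike the implementation above, which shares a single path p among all stack entries, this Python version copies the word list into each entry for clarity; edges are assumed to be (from, to, word, log-score) tuples.

```python
import math

def stack_nbest(edges, topo, th):
    """All complete paths whose score is within `th` of the best score.

    First computes d_beta[v], the best score from v to the final node
    (a backward Viterbi pass), then expands partial paths depth-first,
    pushing an edge only when Eq. 4.2 holds for it."""
    I, F = topo[0], topo[-1]
    d_beta = {v: -math.inf for v in topo}
    d_beta[F] = 0.0
    for v in reversed(topo[:-1]):
        for (u, d, w, s) in edges:
            if u == v:
                d_beta[v] = max(d_beta[v], s + d_beta[d])
    hyps = []
    stack = [(I, [], 0.0)]                # (node, words so far, d_alpha)
    while stack:
        v, words, d_alpha = stack.pop()
        if v == F:
            hyps.append((d_alpha, words))
            continue
        for (u, d, w, s) in edges:
            if u == v and d_alpha + s + d_beta[d] >= d_beta[I] - th:
                stack.append((d, words + [w], d_alpha + s))
    return sorted(hyps, reverse=True)
```

With a small two-path graph, widening th from 1.0 to 2.0 grows the output from the single best path to both paths, which mirrors the way th is tuned in practice to obtain approximately N hypotheses.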
4.2.2 Exact N-Best Decoding
The N−best decoding algorithm presented in the previous section is simple and fast. However, there is the problem of specifying the thresh-
Figure 4.1: A word graph example for the N−best stack-based decoding, with nodes v1, ..., v9, initial node I = v1, final node F = v9, and edge costs e1 = −2, e2 = 0, e3 = −2, e4 = −6, e5 = −8, e6 = −3, e7 = −5, e8 = −3, e9 = −4, e10 = −4, e11 = −5, e12 = −6.
Table 4.1: Illustration of the stack-based N−best decoding.

Step | Entries (vi, vj, c)          | Path P
1    | (v1, v2, −2), (v1, v3, 0)    | {v1}
2    | (v2, v4, −4), (v1, v3, 0)    | {v1, v2}
3    | (v4, v7, −9), (v1, v3, 0)    | {v1, v2, v4}
4    | (v7, v9, −14), (v1, v3, 0)   | {v1, v2, v4, v7}
5    | (v1, v3, 0)                  | Output path {v1, v2, v4, v7, v9}, −14
6    | (v3, v6, −3)                 | {v1, v3}
7    | (v6, v8, −7)                 | {v1, v3, v6}
8    | (v8, v9, −13)                | {v1, v3, v6, v8}
9    |                              | Output path {v1, v3, v6, v8, v9}, −13
old th. Experimentally, the score of the best path is determined by the 1Best-Decoding procedure, and the threshold value th is then chosen so that we obtain approximately N best sentences. In this section, an exact N−best decoding algorithm based on the word graph [Tran, 1996] is carefully described. The complexity of this algorithm is higher than that of the previous one, but with an efficient implementation we can obtain the expected results while keeping the processing time low.
The principle of the approach is based on the following consideration:
when several paths lead to the same node in the word graph, according to
the Viterbi criterion, only the best scored path is expanded. The remaining
paths are not considered any more due to this recombination.
Assuming that the first best sentence hypothesis was found by the
Viterbi decoding through a given word graph, the second best path is
the path which competed with the best one but was recombined at some
node of the best path. Thus in order to find the second best sentence hy-
pothesis, we have to consider all possible partial paths in the word graph
which reach some node of the best path and might share the remaining
section with the best path. By applying this procedure repeatedly, N -best
sentence hypotheses can be successively extracted from a given word graph.
More specifically, the best path can be determined simply by comparing
the accumulative scores of all possible paths leading to the final node of
the word graph. In order to ensure that this best word sequence is not
taken into account while searching for the second best path, the complete
path is copied into a so-called N -best tree. Using this structure, a back-
ward cumulative score for each word copy is computed and stored at the
corresponding tree node. This allows for fast and efficient computation
of the complete path scores required to determine the next best sentence
hypotheses. The second best sentence hypothesis can be found by taking
the path with the best score among the candidate paths which might share
a remaining section of the best path. The partial path of this sentence hy-
pothesis is then copied into the N -best tree. Assuming the N -best paths
have been found, the (N + 1)-th best path can be determined by exam-
ining all existing nodes in the N -best tree, because it can share the last
part of some path among the top N paths. Thus it is important to point
out that this algorithm performs a full search within the word graph and
delivers exact results as defined by the word graph structure. The detailed
algorithm is given in the Exact N-Best procedure.
Fig 4.2 shows the graph example of Fig 4.1 with an additional node v10
and an edge e13 with score 0. Note that the newly added node and edge do
not change the results of the N−best decoding. Some notes on the algorithm
are given below and in Fig 4.3 and Fig 4.4.
• Line 1 : perform the 1-best decoding to find the best path in the word
graph.
• Line 2 : for each edge e ∈ E ending at node vj, compute FwSco(vj, e),
the forward score of the best partial path leading to it.
• Line 3 : the initial N−best tree is created by using information from
the best path.
Fig 4.3 shows the FwSco and BwSco of the initial N−best tree. It is
important to notice that, by the definitions of FwSco(vj, ea) and BwSco(vi, eb),
we can compute the complete path score with the following equation:

Ψ(e) = FwSco(vj, ea) + BwSco(vi, eb) + score(e)    (4.3)

where, in Eq. 4.3, b(e) = vj and f(e) = vi.
For example, let us consider the edge e10 in Fig 4.2. The complete path
score passing through this edge is given by: Ψ(e10) = FwSco(v6, e6) +
BwSco(v8, e12) + score(e10) = (−3) + (−6) + (−4) = −13.
• The algorithm is repeated from line 4 until the N−best paths are found.
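As a concrete check of Eq. 4.3, the forward and backward Viterbi scores can be computed and combined with an edge score. The sketch below uses per-node scores rather than the per-(node, edge) FwSco/BwSco structures of the thesis, and again encodes only the recoverable part of the Fig 4.2 topology:

```python
edges = [  # (name, from, to, score): Fig 4.2 scores, partial topology
    ("e1", "v1", "v2", -2), ("e2", "v1", "v3", 0),
    ("e3", "v2", "v4", -2), ("e7", "v4", "v7", -5),
    ("e11", "v7", "v9", -5), ("e6", "v3", "v6", -3),
    ("e10", "v6", "v8", -4), ("e12", "v8", "v9", -6),
]

def best_scores(edges, source, forward=True):
    """Viterbi-style best cumulative score from `source` to every node."""
    sco = {source: 0}
    for _ in edges:  # enough relaxation passes for a small acyclic graph
        for _, u, v, s in edges:
            a, b = (u, v) if forward else (v, u)
            if a in sco and sco[a] + s > sco.get(b, float("-inf")):
                sco[b] = sco[a] + s
    return sco

fw = best_scores(edges, "v1", forward=True)    # forward scores (FwSco-like)
bw = best_scores(edges, "v9", forward=False)   # backward scores (BwSco-like)
psi_e10 = fw["v6"] + bw["v8"] + (-4)           # Eq. 4.3 applied to e10
# fw["v6"] = -3 and bw["v8"] = -6, so psi_e10 = -13, matching the text.
```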
Notation
e ∈ E : an edge in the word graph, as defined in Chapter 3.
FwSco(vj, ea) : cumulative forward score of the best partial path
    leading to node vj, passing through edge ea ∈ E.
BwSco(κi, eb) : cumulative backward score from the sentence end
    to node vi ∈ Υ ⊂ V, passing through edge eb ∈ E.
Υ : the N-best tree, containing all the tree nodes (vi, eb).

Exact N-Best(wg, N)
1  Do 1Best-Decoding(wg) to find the best path.
2  Store the cumulative path score FwSco and a back-pointer for every e ∈ E.
3  Create the initial N-best tree, which contains the best path.
4  for n = 1 ... N
5      do for each node (vi, eb) ∈ Υ
6          do for each e(vj, κi) ∈ E
7              do if w(e) ≠ w(eb)
8                  then compute the complete path score
                       Ψ(e) = FwSco(vj, ea) + BwSco(κi, eb) + score(e)
9          Save the edge with the best score: e* = arg max_e Ψ(e)
10         Save the best N-best tree node: κ* = arg max_{κi} Ψ(e)
11         Trace back from e* to the sentence start in the word graph,
               using the FwSco structure stored at line 2.
12         Copy this partial path and insert it into the N-best tree at κ*
               to obtain a complete path. The newly created nodes are denoted κj.
13         Compute the backward cumulative score BwSco for the
               newly created nodes κj.
14         Output the word sequence.
Figure 4.2: A word graph example for the exact N-best decoding. (The graph of Fig 4.1, i.e. nodes v1, ..., v9 with I = v1 and F = v9 and edges e1, ..., e12 with the same scores, extended with an additional node v10 and an edge e13 of score 0.)
Figure 4.3: The exact N-best decoding - Step 1: the initial FwSco and N-best tree.
FwSco for the best path: v3: 0 (e2); v6: -3 (e6); v8: -7 (e10); v9: -13 (e12); v10: -13 (e13).
BwSco for the initial N-best tree: v1: -13 (e2); v3: -13 (e3); v6: -10 (e10); v8: -6 (e12); v9: 0 (e13); v10: 0 (NULL).
Figure 4.4: The exact N-best decoding - Step 2: the N-best tree at step 2.
Newly inserted branch: v1: BwSco = -14 (e1); v2: BwSco = -12 (e3); v4: BwSco = -10 (e7); v7: BwSco = -5 (e11).
BwSco for the N-best tree at the first loop: v1: -13 (e2); v3: -13 (e3); v6: -10 (e10); v8: -6 (e12); v9: 0 (e13); v10: 0 (NULL).
4.3. CONSENSUS DECODING Vu Hai Quan
During the first iteration, all the elements in the N−best tree are
examined to find the edge with the highest complete path score passing
through it; that is, ∀(vi, eb) ∈ Υ, ∀e ∈ E with f(e) = vi, compute Ψ(e)
and save the best one. In the case of Fig 4.2, the edge found is e11, with
a complete score of −14. By using the FwSco back-pointer, we get the
best partial path and insert it into the N−best tree at node v9, resulting in
Fig 4.4.
The Complexity of the Algorithm
As we can see, the N−best tree must be updated after each best path is
found, to avoid extracting the same sentence hypothesis twice. Thus the
search effort depends on the current size of the N−best tree. Assuming a
sentence with M words and a very large N, the expected cost of the
computation is:

Σ_{n=1}^{N} (M/2)(n + 1) = (M/4) N(N + 3) ∈ O(N²)
The experimental results of N−best decoding are given in Chapter 5.
4.3 Consensus Decoding
With word graph decoding, we have to handle computational problems.
In general the number of paths through a word graph is exponential in
the number of edges. These paths correspond to different segmentation
hypotheses (i.e. word sequence plus boundary times) of the utterances.
A method to overcome the exponential problem has been suggested in
[Mangu, 1999]. The word graph is first transformed into a special form
called a “confusion network”. A confusion network is itself a word graph:
each edge is labeled with a word and a probability. The most important
feature of confusion networks is that they are linear, in the sense
that every path from the start to the end node has to pass through all
nodes. A consequence of this (combined with the acyclicity) is that all
paths between two nodes have the same length. Thus the confusion net-
work naturally defines an alignment for all pairs of paths, called a multiple
alignment by Mangu. This alignment is used as the basis for the word error
rate calculation. The special entries labeled with the empty word (“eps”
in the figures) correspond to deletions in the alignment.
To estimate the word error rate, sets Ei are defined that contain all the
edges that connect the node vi with the next node. Thus each Ei contains
the alternative hypotheses for the word in position i in the final word string.
The posteriors of the elements of each set sum to one. Before presenting
the confusion network construction algorithm, we first introduce the word
posterior probability.
4.3.1 Word Posterior Probability
The sentence posterior probability is straightforwardly defined as the
probability of the word sequence given the acoustic feature vectors:
p(w1...wn|X). By definition the sentence hypothesis covers the whole
sequence of feature vectors. The situation for a word-level posterior
probability is more complicated, as the boundaries of the word in question
are not known a priori. Depending on the application, different variants of
a word posterior probability might be useful. In the following, some of
these variants are listed and discussed.
The simplest way to define a word posterior probability is to treat
the start and end times as additional random variables [Evermann, 1999],
[Wessel, 2002]. This leads to posteriors of the form p(w|τ, t, X). The
calculation of these values can be achieved relatively easily:
p(w|τ, t, X) = Σ_{e : τ(e)=τ ∧ t(e)=t ∧ w(e)=w} p(e|X)    (4.4)
The problem with these posteriors is that they are too specific, because
they depend on the exact boundary times. For many applications the
concrete start and end times are not relevant; instead, it is desirable to
find a more general definition that does not involve the specific times.
A solution to this is to calculate posterior probabilities for each time
in the utterance [Evermann, 1999], [Wessel, 2002]. Informally speaking,
these “instantaneous” word posterior distributions capture which words
the decoder considers likely at that particular time.
The big advantage of this approach is that it not only takes the
best path into account but also encodes information about the number and
likelihoods of all alternatives (different segmentations of the same word and
different words). The calculation from the link posteriors is simple:
p(w|t, X) = Σ_{e : τ(e) ≤ t ≤ t(e) ∧ w(e)=w} p(e|X)    (4.5)
Fig 4.5 illustrates the computation of the time-dependent word posteriors.
As an example, the posterior of word w1 at time ta, given by Eq. 4.5, is:
p(w1|ta) = p(e1) + p(e3) + p(e5)
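In code, Eq. 4.5 is a filtered sum over edge posteriors. The edge tuples below (word, start time, end time, posterior) are invented for illustration:

```python
# Hypothetical edges: (word, start time, end time, posterior p(e|X)).
edges = [
    ("w1", 0, 10, 0.30), ("w2", 0, 12, 0.20), ("w1", 2, 11, 0.25),
    ("w3", 5, 15, 0.10), ("w1", 3, 9, 0.10), ("w2", 11, 20, 0.05),
]

def word_posterior(edges, word, t):
    """p(w | t, X) as in Eq. 4.5: sum the posteriors of all edges
    labelled `word` whose span [tau, t_end] covers time t."""
    return sum(p for w, tau, t_end, p in edges
               if w == word and tau <= t <= t_end)

p_w1 = word_posterior(edges, "w1", 5)  # sums the three w1 edges covering t = 5
```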
4.3.2 Confusion Network Construction
[Mangu, 1999] also presented an approach to construct a confusion network
from a word graph. The task is treated as a clustering problem, where
the edges from the original word graph have to be clustered into the sets
Ei mentioned above. To achieve a suitable clustering, the algorithm is
constrained to keep the precedence ordering of the original word graph. If
Figure 4.5: Time-dependent word posteriors. (Six edges e1, ..., e6, labeled w1, w2, w1, w3, w1, w2 respectively, spanning the time axis t around two instants ta and tc.)
an edge ea precedes an edge eb in the word graph (i.e. there is a path from
ea to eb) then the cluster in which ea falls must also precede eb’s cluster.
Very informally speaking, this ensures that the order of words is kept, i.e.
the word graph is just collapsed along the vertical axis. In fact, this can
be ensured by defining a precedence relation or a partial order on the sets
of edges. A set E1 precedes (≺) E2 if each of E1's members precedes (≤) all
of E2's members in the original lattice. We define the partial order ≤ on the
edges of a given word graph as follows. For e1, e2 ∈ E, e1 ≤ e2 iff:
• e1 = e2, or
• t(e1) = τ(e2), or
• ∃e′ ∈ E such that e1 ≤ e′ and e′ ≤ e2.
Two equivalence sets E1, E2 are said to be consistent with the word graph
order ≤ if e1 ≤ e2 implies that E1 ≺ E2. By combining equivalence sets,
additional precedences are introduced. Starting from the partial order
defined by the word graph this is repeated until a total order of all sets
is reached, which corresponds to a linear structure of the lattice. The
algorithm by [Mangu, 1999] consists of the following steps:
• calculation of all edge posteriors as described in Eq. 3.9
• calculation of p(w|τ, t, X) for all words (with associated times) found
in the word graph, as in Eq. 4.4. These constitute the initial sets.
• Intra-Clustering: combination of overlapping sets corresponding to
the same word, using the similarity measure:

SIM(E1, E2) = max_{e1∈E1, e2∈E2} overlap(e1, e2) · p(e1) · p(e2)    (4.6)
• Inter-Clustering: combination of the remaining overlapping equivalence
sets (with different words) until a total order of the sets is achieved,
using the similarity measure:

SIM(E1, E2) = avg_{w1∈E1, w2∈E2} sim(w1, w2) · p_{E1}(w1) · p_{E2}(w2)    (4.7)
where the function overlap(e1, e2) measures the overlap in time between
edges e1 and e2, and sim(w1, w2) is the phonetic similarity of the canonical
pronunciations of the words w1 and w2. The details of the intra-word
clustering algorithm are given in the Intra-Word Clustering procedure.
The inter-word clustering is analogous, with the similarity measure
replaced by the phonetic one.
After the above algorithm has run until no further merge candidates are
available, we obtain the confusion network and also the best
alignment, which is called the consensus hypothesis. The properties of
confusion networks and consensus hypotheses will be given in the next
sections.
As we can see, the algorithm consists of three stages. The initial edge
equivalence sets are formed by word identity and the starting and ending
Intra-Word Clustering(wg, pro)
1  for E1 ∈ E
2      do for E2 ∈ E
3          do if word(E1) = word(E2) and E1 ⊀ E2 and E2 ⊀ E1
4              then sim = SIM(E1, E2)
5                  if sim > maxSim
6                      then maxSim = sim
7                          bestSet1 = E1
8                          bestSet2 = E2
9  Enew = bestSet1 ∪ bestSet2
10 for Ei ∈ E, i = 1, ..., N
11     do if Ei ≺ bestSet1 or Ei ≺ bestSet2
12         then Ei ≺ Enew
13 for Ei ∈ E, i = 1, ..., N
14     do if bestSet1 ≺ Ei or bestSet2 ≺ Ei
15         then Enew ≺ Ei
16 for Ei ∈ E, i = 1, ..., N
17     do for Ej ∈ E, j = 1, ..., N
18         do if (Ei ≺ bestSet1 and bestSet2 ≺ Ej) or
               (Ei ≺ bestSet2 and bestSet1 ≺ Ej)
19             then Ei ≺ Ej
20 E = E ∪ {Enew} \ {bestSet1, bestSet2}
times:

E_{w,t1,t2} = {e ∈ E | w(e) = w, τ(e) = t1, t(e) = t2}
The initial partial order ≺ is given as the transitive closure of the edge order
≤ defined above. The partial order is updated minimally upon merging of
equivalence sets so as to keep it consistent with the previous order.
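Forming the initial equivalence sets E_{w,t1,t2} is a simple grouping pass over the edges; a minimal sketch with hypothetical edge tuples:

```python
from collections import defaultdict

# Hypothetical edges: (word, start time, end time, posterior).
edges = [
    ("LA", 0, 10, 0.4), ("LA", 0, 10, 0.3),
    ("MA", 10, 20, 0.5), ("LA", 2, 10, 0.2),
]

def initial_sets(edges):
    """Group edges sharing word identity and identical boundary times."""
    sets = defaultdict(list)
    for e in edges:
        word, t1, t2, _ = e
        sets[(word, t1, t2)].append(e)
    return dict(sets)

sets = initial_sets(edges)
# Three sets: ("LA", 0, 10) holds two edges, the other two hold one each.
```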
Fig 4.6 shows a real word graph example, while its confusion network is
illustrated in Fig 4.7. The symbol eps represents the empty label.
Figure 4.6: A word graph example. (A real lattice over the Italian words LA, LAMENTI, MA, VITTIMA, A, E, E', ERA, DI, GIUSEPPE, CHE, TE, DE, with background labels @BG and sentence-end markers </s>.)
Figure 4.7: The confusion network resulting from the word graph in Fig 4.6. (Four positions with alternatives: {LA, LAMENTI, eps}; {VITTIMA, MA, eps}; {DI, E', eps, E, A}; {GIUSEPPE, eps}.)
4.3.3 Pruning
In Chapter 3, a pruning algorithm, namely the forward-backward based
algorithm, was presented. It works by computing the posterior probability
of each edge and then removing edges whose posterior probabilities are low
compared to the best one. The same procedure can be applied to a word
graph before constructing the confusion network from that word graph.
In particular, edges which have very low posterior probability
are negligible in computing the total posterior probabilities of word hy-
potheses, but can have a detrimental effect on the alignment. This occurs
because the alignment preserves consistency with the word graph order, no
matter how low the probability of the links imposing the order is. In order
to eliminate low posterior probability edges, we use a preliminary pruning
step. Word graph pruning removes all edges whose posteriors are below an
empirically determined threshold. The clustering procedure merges only
edges that survive the initial pruning. Chapter 5 gives results showing the
effect of lattice pruning on the overall effectiveness of our algorithm.
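The pruning step itself reduces to a thresholded filter on edge posteriors; a minimal sketch with invented posterior values (the actual threshold is determined empirically, as noted above):

```python
# Invented edge posteriors; theta is the empirical pruning threshold,
# expressed relative to the best edge posterior.
posteriors = {"e1": 0.50, "e2": 0.30, "e3": 0.002, "e4": 0.15}

def prune(posteriors, theta):
    """Keep only edges whose posterior is at least theta times the best."""
    best = max(posteriors.values())
    return {e: p for e, p in posteriors.items() if p >= theta * best}

kept = prune(posteriors, theta=0.01)  # e3 (0.002 < 0.005) is removed
```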
4.3.4 Confusion Network
The total posterior probability of an alignment set can be strictly less
than 1. This happens when there are paths in the original word graph
that do not contain a word at that position; the missing probability mass
corresponds precisely to the probability of a deletion (or null word). We
explicitly represent deletions by an empty (eps) link. We can think of the
confusion network as a highly compact representation of the original lattice with the
property that all word hypotheses are totally ordered. Moreover, confusion
networks have other interesting uses besides word error minimization, some
of which will be discussed in Chapter 6 (for Machine Translation).
4.3.5 Consensus Decoding
Once we have a complete alignment, it is straightforward to extract the
hypothesis with the lowest expected word error. Let Ei, i = 1, ..., L be the
final link equivalence sets making up the alignment. We need to choose a
hypothesis W = w1, ..., wL such that wi = eps or wi = w(ei) for some ei ∈ Ei.
It is easy to see that the expected word error of W is the total sum of the
word errors at each position in the alignment. Specifically, the expected
word error at position i is:

1 − p(ei)            if wi ≠ eps
1 − Σ_{e∈Ei} p(e)    if wi = eps

This is equivalent to finding the path through the confusion network with
the highest combined link weight.
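The position-wise choice described above can be sketched as follows; the alignment sets and posteriors are hypothetical, and any mass not assigned to a word is treated as the posterior of the null word:

```python
# Hypothetical confusion network: one dict of word -> posterior per
# alignment position; mass not assigned to any word is the null word.
network = [
    {"LA": 0.6, "LAMENTI": 0.3},   # the remaining 0.1 goes to the deletion
    {"VITTIMA": 0.7, "MA": 0.25},
    {"DI": 0.4, "E": 0.35},
    {"GIUSEPPE": 0.9},
]

def consensus(network):
    """Pick, per position, the word (or deletion) with the highest posterior."""
    hyp = []
    for slot in network:
        p_del = 1.0 - sum(slot.values())          # posterior of the null word
        word, p = max(slot.items(), key=lambda kv: kv[1])
        if p_del > p:
            word = ""                             # the deletion wins
        if word:
            hyp.append(word)
    return hyp

print(consensus(network))  # ['LA', 'VITTIMA', 'DI', 'GIUSEPPE']
```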
Chapter 5
Improvements of Speech Recognition
5.1 Speech Recognition Experiments
In this chapter, we present experiments on the use of word graphs
in speech recognition. Two datasets have been used for evaluation
purposes. The first one, named BTEC (Basic Travelling Expressions Corpus),
is a small corpus of spontaneous speech. The second one, named IBNC
(Italian Broadcast News Corpus), is a large corpus of broadcast news. The
two corpora are thus quite different in terms of vocabulary size and type
of speech, allowing an exhaustive evaluation of word graph algorithms.
This chapter is organized as follows. In the first section, we describe
the ITC-irst recognition system in terms of acoustic and language models,
the training datasets, etc. In the second section, experiments on the use
of word graphs for speech recognition are presented. Finally, the last
section covers the experiments on N−best rescoring methods.
5.2 Speech Recognition System
The automatic speech recognition system employed for the experiments has
been developed at ITC-irst since the 1990s; the first experiments on broadcast news
5.2. ASR SYSTEM Vu Hai Quan
Figure 5.1: Broadcast News Retrieval System
were presented in [Brugnara, 2000] and [Federico, 2000]. Fig. 5.1 shows
the ITC-irst system for recognizing broadcast news, which includes the
following components: segmentation and classification, speaker clustering,
acoustic adaptation, and speech transcription. Hereinafter, we briefly
describe each component.
5.2.1 Segmentation and Clustering
As shown in Fig 5.1, the audio signal of a news program can contain speech,
possibly from different speakers, as well as music and other non-speech
events. The purpose of the segmentation and classification stage is to
identify, in the audio signal, segments of music, non-speech events,
male/female wide-band speech, and male/female narrow-band speech. The
clustering stage tries to gather speech segments of the same speaker.
For segmentation, the Bayesian Information Criterion (BIC) is applied
CHAPTER 5. IMPROVEMENTS OF SPEECH RECOGNITION
to segment the input audio into acoustically homogeneous chunks. Gaus-
sian mixture models are then used to classify segments in terms of acoustic
source and channel. Emission probability densities consist of mixtures of
1024 multi-variate Gaussian components having diagonal covariance ma-
trix. Observations are the same 39-dimension vectors used for speech recog-
nition (see below).
Clustering of speech segments is done by a bottom-up scheme that
groups segments which are acoustically close with respect to BIC.
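A one-dimensional sketch of BIC-based change detection may make the criterion concrete (the actual system applies it to 39-dimensional feature vectors; the data and the penalty weight here are illustrative):

```python
import math

def delta_bic(x, split, lam=1.0):
    """Positive values favour splitting x at `split` into two segments."""
    def n_log_var(seg):
        n = len(seg)
        mean = sum(seg) / n
        var = sum((v - mean) ** 2 for v in seg) / n
        return n * math.log(var)
    penalty = lam * math.log(len(x))  # 0.5 (d + d(d+1)/2) log N with d = 1
    gain = 0.5 * (n_log_var(x) - n_log_var(x[:split]) - n_log_var(x[split:]))
    return gain - penalty

two_sources = [0.0, 1.0] * 25 + [10.0, 11.0] * 25  # clear change at index 50
one_source = [0.0, 1.0] * 50                       # homogeneous signal
# delta_bic(two_sources, 50) > 0 (split); delta_bic(one_source, 50) < 0.
```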
5.2.2 Acoustic Adaptation
Gaussian components in the system are adapted using the Maximum Like-
lihood Linear Regression (MLLR) technique. A global regression class is
considered for adapting only the means or both means and variances. Mean
vectors are adapted using a full transformation matrix, while a diagonal
transformation matrix is used to adapt variances.
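A minimal sketch of the mean transformation mu' = A mu + b follows; the matrix, bias and mean values are invented, and the actual MLLR step, estimating A and b from adaptation data by maximum likelihood, is not shown:

```python
# Adapted mean: mu' = A @ mu + b, one global regression class.
def adapt_mean(A, b, mu):
    return [sum(a * m for a, m in zip(row, mu)) + bi
            for row, bi in zip(A, b)]

A = [[1.0, 0.5], [0.0, 2.0]]   # full (here 2x2) transformation matrix
b = [0.5, -1.0]                # bias vector
mu = [2.0, 4.0]                # original Gaussian mean
adapted = adapt_mean(A, b, mu)  # [4.5, 7.0]
```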
5.2.3 Speech Transcription
The core of speech transcription is the recognition engine. It includes the
acoustic model, the language model and the search algorithm.
Acoustic Model
Acoustic modeling is based on continuous-density HMMs. The acoustic
parameter vector comprises 12 mel-scaled cepstral coefficients, the log-
energy and their first and second time derivatives. A total of 85 phone-
like units are used, in which 50 are needed for representing the Italian
phonemes, while the remaining 35 are needed for representing foreign words
coming from the English and German languages. A set of 16 additional
units has also been introduced to cover silence, background noise and a
number of spontaneous speech phenomena.
Context-dependent models include a set of triphones and a set of left-
dependent or right-dependent diphones used as backoff models for unseen
contexts. Acoustic training was performed with Baum-Welch re-estimation
using the training portion of the IBNC corpus, augmented with other
datasets collected at ITC-irst.
Language Model
A trigram language model was developed for recognition, by mainly ex-
ploiting newspaper texts. For estimating the trigram language model, a
133M-word collection of the nationwide newspaper La Stampa was em-
ployed. Moreover, the broadcast news transcriptions of the training data
were also added.
A lexicon of the most frequent 64K words was selected. The lexicon
gives a 2.2% OOV rate on the newspaper corpus and about 1.6% on the
IBNC corpus. An interpolated trigram LM [Federico, 2000] was estimated
by employing a non-linear discounting function and a pruning strategy that
deletes trigrams on the basis of their context frequency. This results in a
pruned LM with a perplexity of 188 and a size of 14M.
Recognition
The recognizer is a single-step Viterbi decoder. The 64K-word trigram LM
is mapped into a static network with a shared-tail topology [Antoniol, 1995].
The main network has about 11M states, 10M labeled transitions and 17M
empty transitions.
Table 5.1: Training and Testing Data for BTEC and IBNC.

      | Training Data      | Testing Data
      | Speech | Language  | Speech | Vocab. Size | Words per Sent.
BTEC  | 130 h  | 133M      | 00:29m | 14K         | 7.4
IBNC  | 60 h   | 133M      | 1h:15m | 64K         | 20.2
5.2.4 Training and Testing Data
The IBNC corpus consists of around 60 hours of news programs from Radio
RAI (the major Italian broadcasting company) for training and 1h:15m
of news programs for testing. The programs mainly contain clean studio
speech reports of anchors or other reporters (52 %), clean telephone reports
and interviews (21 %), studio speech with background music/noise (20 %),
telephone speech with background noise (5 %) and other (2 %).
The Basic Travelling Expression Corpus-BTEC jointly developed by
the C-STAR partners is a collection of sentences that bilingual travel ex-
perts consider useful for people going to or coming from another coun-
try. The initial collection of Japanese and English sentence pairs is be-
ing translated into Chinese, Korean, and (partially) Italian, as reported
in [Paul, 2004]. Currently, the Italian part of BTEC consists of 506 short
sentences, recorded by 10 speakers for a total of 28.7 minutes. This
dataset is used for evaluation purposes.
In summary, Table 5.1 shows the training and testing data for both
BTEC and IBNC tasks.
5.3 Experimental Results
This section presents the experimental results of the algorithms described
in Chapter 3 and Chapter 4, including word graph generation, word graph
pruning, word graph expansion and word graph decoding.
5.3. EXPERIMENTAL RESULTS Vu Hai Quan
Table 5.2: BTEC: Word error rates with different rescoring methods.

          | Word Graph Decoding        | Consensus Decoding
Decoder   | trigram-case | bigram-case | trigram-case | bigram-case
21.8      | 21.8         | 22.2        | 21.8         | 22.2
Table 5.3: IBNC: Word error rates with different rescoring methods.

          | Word Graph Decoding        | Consensus Decoding
Decoder   | trigram-case | bigram-case | trigram-case | bigram-case
18.0      | 18.0         | 19.8        | 17.8         | 19.6
5.3.1 Word Graph Decoding
Let us begin with the word graph decoding experiments. Three decoding
methods, namely the decoder, the 1−best word graph decoding and the
consensus decoding, are evaluated in terms of WER. Here, the WER is
the edit distance between the best hypothesis and the reference transcription
of the utterance (see Chapter 3).
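For reference, the word-level edit distance underlying the WER can be computed by standard dynamic programming; a minimal sketch, not the scoring tool used in the experiments:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("la vittima di giuseppe", "la vittima de giuseppe"))  # 0.25
```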
The difference between the three decoding methods is in the way the
best hypothesis is chosen. With the decoder, the best output hypothesis
is taken directly from the recognizer, while with the 1−best word graph
decoding, the best hypothesis is the best path in the word graph. Finally,
in consensus decoding, the best hypothesis is the consensus hypothesis,
which has the highest posterior probability among all possible hypotheses
taken from the confusion network (see Chapter 4).
This evaluation should fulfil two important requirements:
• correctness : for the implementation of the word graph generation
algorithm.
• improvement : for the consensus decoding.
From the theoretical descriptions of the word graph generation in Chap-
ter 3 and the word graph decoding in Chapter 4, we expect that the WER
of the 1−best word graph decoding for trigram-based word graphs should
be exactly the WER of the decoder. This property cannot hold for
the bigram-based word graphs, as approximate language model scores
have been used to separate the acoustic scores from the accumulated word
scores, given by Eq. 3.5. Moreover, the WER of consensus decoding for
trigram-based word graphs should be no higher than the WER of the
decoder.
Table 5.2 and Table 5.3 report the word error rate for the BTEC and
the IBNC datasets with respect to the three different decoding methods.
For the BTEC dataset, the three methods give the same WER of 21.8%
in the trigram case, while in the bigram case the WER is 22.2% for both
the 1−best word graph decoding and the consensus decoding methods. The
situation is slightly different for the IBNC dataset. Specifically, while
the decoder and the 1−best word graph decoding give the same WER of
18.0% in the trigram case, the WER of the consensus decoding for this type
of word graphs is smaller (17.8%). In addition, in the bigram case the
WER of the consensus decoding is also smaller than the WER of the 1−best
decoding: 19.6% compared to 19.8%. These results confirm both
the correctness and the improvement requirements.
A possible reason why consensus decoding does not improve the word error
rate on the BTEC dataset, as it does on the IBNC dataset, may be the
following: the average sentence length in IBNC is longer than in BTEC
(see Table 5.1), so the language model has more influence on IBNC than
on BTEC.
5.3.2 Impact of Beam-Width
Results reported in Table 5.2 and Table 5.3 have been measured on word
graphs of different quality obtained by three different pruning techniques.
The first pruning technique is the beam search with different beam-width.
Table 5.4: BTEC: Costs of the decoder with different threshold values.
threshold Time Active States Active Trans Active Model
1 · 10−40 158.45 1629469 2985586 1770564
1 · 10−50 173.36 4448169 8184684 3897707
1 · 10−60 242.02 12252734 21518788 7572416
1 · 10−70 335.03 29111720 48728725 13010224
1 · 10−80 612.41 60120619 95958020 20431202
The second one is based on the forward-backward pruning algorithm. Fi-
nally, the last one exploits the topology of word graphs for compressing
them into a more compact representation. In this and in the following
subsection, we will examine the effect of the pruning methods on the word
graph quality.
Threshold
With the first approach, by setting different beam-widths, the number of
pruned hypotheses changes. Specifically, when a large beam is searched,
the number of pruned hypotheses is small (see Chapter 3). This results
in larger word graphs. As a consequence, the time needed to complete
the recognition and word graph generation tasks is longer. In fact, the
recognition system uses a variable named threshold to represent the
beam-width. It is worth noting that a small value of the threshold
corresponds to a large beam-width and vice-versa; hence, the terms
threshold and beam-width are used interchangeably. The following
experiments show the impact of the beam-width on the performance of the
recognizer and on the word graph quality.
Costs of the Recognition System vs. Beam Width
Table 5.4 shows the costs of the recognition system for a part of the BTEC
dataset, with respect to different beam-widths. When the threshold value
decreases from 10−40 to 10−80, the time that the system needs to complete
the recognition task grows by a ratio of approximately 1 : 3.4, and the
number of active states by a ratio of 1 : 13.37. This means that not only
the time but also the memory usage increases as the beam-width is enlarged.
Word graphs vs. Beam Width
As noticed in the previous section, when the beam-width is large, a small
number of hypotheses is pruned and a longer time is needed to complete
the recognition task. The same result holds for the word graph generation.
Fig 5.2 shows the relationship between the time used for generating the
word graph and the threshold value. The dotted-line is used for trigram-
based word graphs and the solid line refers to bigram-based word graphs.
We also report the GER and the word graph size (in terms of number
of paths) for different threshold values, as shown in Fig 5.3 and Fig 5.4.
Note that the threshold values on the x−axis of those figures are exponents;
for example, the value −40 stands for 1 · 10−40.
The choice of the beam-width is very important: it should satisfy both
the time and the GER constraints. Keeping this in mind and looking at
the experimental results, we have chosen the threshold value of 10−50 for
all the following experiments.
5.3.3 Language Model Factor Experiments
As shown in Eq. 2.7, the language model factor plays an important role in
computing the likelihood. Actually, the likelihood is computed by combin-
ing two scores, given by the acoustic and by the language models respec-
tively, which have very different dynamic scales. If they are just multiplied
as indicated in Eq 2.6 the decision for a word sequence would be dominated
by the acoustic likelihood and the language model would have hardly any
Figure 5.2: BTEC: Time for generating word graphs vs. different threshold values. (Time on the y-axis against threshold exponents from −80 to −40; one curve for bigram-based and one for trigram-based word graphs.)
Figure 5.3: BTEC: GER vs. different threshold values. (GER on the y-axis against threshold exponents from −85 to −40; curves for bigram- and trigram-based word graphs.)
[Plot omitted: number of paths, in log scale (y-axis), vs. threshold (x-axis).]
Figure 5.4: BTEC: Number of paths in word graphs vs. different threshold values.
influence. To balance the two contributions, it is customary to use an expo-
nential weighting factor for the language model. The difference in dynamic
scales is mainly caused by the fact that the acoustic likelihoods are severely
underestimated since consecutive frames are assumed to be independent in
the HMM framework.
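The weighted combination described above can be sketched as follows; the function and the variable names are illustrative, not those of the ITC-irst decoder.

```python
def combined_score(log_acoustic: float, log_lm: float, lm_scale: float) -> float:
    """Combine acoustic and language model log-likelihoods.

    Because acoustic log-likelihoods are severely underestimated (frames are
    assumed independent), the LM log-probability is raised to an exponential
    weight, which becomes a multiplicative scale in the log domain.
    """
    return log_acoustic + lm_scale * log_lm

# With lm_scale = 7.0 (the value chosen in Section 5.3.3), the LM term
# contributes far more to the decision than with an unweighted combination.
unweighted = combined_score(-5000.0, -30.0, 1.0)
weighted = combined_score(-5000.0, -30.0, 7.0)
```

In the log domain the exponential LM weight reduces to a simple multiplier, which is why it is often called the language model scale.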
Fig 5.5 shows the WER vs. the language model factor, for a part of the
BTEC dataset. As we can see, the WER is smallest at the value 7.0. This
value of the language model scale was used for all the following
experiments.
5.3.4 Forward-Backward Based Pruning Experiments
The previous sub-sections showed the effect of the “beam search” on the
generation of word graphs by measuring the generation time, the word
graph sizes and the GERs. The following subsections present experiments
on the algorithms which work directly on the generated word graphs. For
each experiment, we report the results for both trigram-based word graphs
and bigram-based word graphs, on both the BTEC and the IBNC datasets.
Word Graph Experiments
Let us begin with the forward-backward based pruning algorithm which
was presented in Chapter 3. Table 5.5 and Table 5.6 show the impact of
this pruning algorithm on the word graph quality for the BTEC dataset.
Table 5.7 and Table 5.8 give results of the same experiments for the IBNC
dataset.
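The idea behind the Chapter 3 algorithm can be sketched on a toy word graph; the function name and the edge representation are illustrative, and Viterbi (max) forward-backward scores are assumed.

```python
import math
from collections import defaultdict

def forward_backward_prune(edges, start, end, beam):
    """Keep only edges whose best complete path through them scores within
    `beam` (log domain) of the overall best path.

    edges: list of (from_node, to_node, word, log_score) tuples, assumed to
    be sorted in topological order of their source nodes.
    """
    fwd = defaultdict(lambda: -math.inf)
    bwd = defaultdict(lambda: -math.inf)
    fwd[start] = 0.0
    bwd[end] = 0.0
    for u, v, _, s in edges:              # Viterbi forward pass
        fwd[v] = max(fwd[v], fwd[u] + s)
    for u, v, _, s in reversed(edges):    # Viterbi backward pass
        bwd[u] = max(bwd[u], s + bwd[v])
    best = fwd[end]
    return [(u, v, w, s) for u, v, w, s in edges
            if fwd[u] + s + bwd[v] >= best - beam]
```

The beam-width values in the tables below play the role of `beam`: a smaller beam discards more edges and shrinks the graph, at the cost of a higher GER.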
In the first lines of the tables, the beam-width value Inf means that
no pruning is performed. The GERs in this initial case are 15.3% for
trigram-based word graphs and 7.9% for bigram-based word graphs.
When using the threshold value of 100.0, we obtained an absolute increase
[Plot omitted: WER (y-axis) vs. LM scale (x-axis).]
Figure 5.5: BTEC: WER vs. different language model factors.
Table 5.5: BTEC: Trigram-based graph word error rate.
Beam-Width WGD NGD BGD NumPaths(log) GER
Inf 447.7 224.8 23.0 42.8 15.3
300 431.6 236.9 23.2 41.0 15.4
250 418.3 230.3 23.2 39.7 15.4
200 393.2 218.0 23.2 38.5 15.5
150 338.0 191.2 23.1 35.1 15.5
120 259.6 152.2 22.6 32.2 15.6
100 156.7 93.8 16.0 29.5 16.3
50 19.2 12.2 3.3 19.3 18.1
Table 5.6: BTEC: Bigram-based graph word error rate.
Beam-Width WGD NGD BGD NumPaths(log) GER
Inf 1558.7 390.4 38.4 126.4 7.9
300 1444.0 367.6 38.0 119.6 8.0
250 1338.3 347.2 37.4 111.9 8.1
200 1113.0 308.0 36.0 104.3 8.1
150 781.0 237.2 32.9 92.0 8.1
100 298.2 111.4 21.0 65.5 8.8
50 38.0 19.1 4.9 30.0 12.1
Table 5.7: IBNC: Trigram-based graph word error rate.
Beam-Width WGD NGD BGD NumPaths(log) GER
Inf 139.2 75.5 8.7 84.3 14.2
300 135.8 73.9 8.7 82.3 14.2
250 132.8 72.4 8.7 79.8 14.2
200 126.8 69.4 8.7 74.5 14.2
150 114.1 63.3 8.7 66.8 14.2
120 98.3 55.9 8.7 59.9 14.3
100 71.9 42.4 7.5 53.8 14.3
50 11.0 7.4 2.3 30.6 14.8
Table 5.8: IBNC: Bigram-based graph word error rate.
Beam-Width WGD NGD BGD NumPaths(log) GER
Inf 2363.5 335.0 34.7 321.2 5.1
300 2200.5 335.9 34.2 310.0 5.1
250 2014.7 316.1 33.5 308.0 5.1
200 1635.8 274.5 31.8 299.2 5.1
150 991.3 197.3 27.5 263.1 5.2
120 549.5 134.2 22.8 208.9 5.3
100 298.8 87.7 17.6 159.3 5.5
50 27.3 13.3 4.1 54.6 7.1
in GER of nearly 1%, but the reduction in WGD is significant: ratios of
1 : 2.8 and 1 : 5.2 are obtained with respect to the initial case for trigram-
and bigram-based word graphs respectively.
For the IBNC dataset, the results are even better. At the beam-width
value of 100.0, the WGD for trigram-based word graphs is reduced by
nearly 50% while the GER is increased by just 0.1%, compared to the
case where the beam-width value is set to Inf. For bigram-based word
graphs, the results are better still: at the same beam-width value of 100,
the graph size in terms of WGD is reduced by nearly a factor of 10, while
the GER is increased by just 0.4%.
Confusion Network Experiments
We applied the same pruning procedure to word graphs before passing
them to the confusion network construction. As mentioned in Chapter 4,
the confusion network construction stage takes a word graph as its input
and produces a compact representation, namely the confusion network,
by grouping edges whose words have similar pronunciations and overlap in
time.
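The intra-word grouping step can be illustrated with a minimal sketch. This is a simplification of the Chapter 4 construction (which also merges bins of different words into confusion sets); the function name and the edge representation are illustrative.

```python
def build_bins(edges):
    """Group word-graph edges that carry the same word and overlap in time.

    edges: list of (word, start_time, end_time, posterior) tuples.
    Returns mutable bins [word, start, end, accumulated_posterior].
    """
    bins = []
    for word, s, e, p in edges:
        for b in bins:
            bw, bs, be, bp = b
            if bw == word and s < be and bs < e:   # same word, overlapping span
                b[1], b[2] = min(bs, s), max(be, e)
                b[3] = bp + p                      # accumulate posterior mass
                break
        else:
            bins.append([word, s, e, p])
    return bins
```

Edges for the same word that overlap in time collapse into a single bin whose posterior is the sum of the merged edges, which is the source of the size reduction reported below.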
In order to use the same procedure developed for the word graph eval-
uation, we have converted the confusion network to the word graph-based
[Diagram omitted: a confusion network over words a, b, c, d, e and the corresponding word graph, with @BG edges.]
Figure 5.6: BTEC: Confusion network and its word graph representation.
representation, where each node is labeled with a unique word label. Fig 5.6
shows an example of a confusion network and its graph-based representation.
The @BG label stands for the empty word.
Table 5.9 and Table 5.10 show the impact of the pruning threshold
values on the quality of confusion networks for the BTEC dataset, while
Table 5.11 and Table 5.12 report the corresponding results for the IBNC
corpus. From those tables, two observations arise:
Firstly, it is important to notice that the GER of confusion networks is
always smaller than the GER of word graphs. At the beam-width value of
1400, which can be considered equivalent to the Inf value for word graphs,
the GER is 13.5% for trigram-based confusion networks and 6.6% for
bigram-based confusion networks, while the GER for word graphs, as reported
in Table 5.5 and Table 5.6, is 15.3% and 7.9% respectively.
Secondly, confusion network sizes are always smaller than word graph
sizes, at the same GER values. Fig 5.7 illustrates the relationship between
the GERs for the word graphs (the solid line) and for their confusion net-
works (the dotted-line) and the WGD. As we can see, at the same value
Table 5.9: BTEC: Trigram-based confusion network word error rate.
beam-width WGD NGD BGD NumPaths(log) GER
1400 57.6 9.47 6.6 24.4 13.5
1200 52.2 9.4 6.4 23.4 13.6
900 42.8 8.4 5.7 21.0 13.7
700 35.9 7.7 5.1 19.4 13.8
500 27.6 6.7 4.4 17.9 14.0
300 18.2 5.5 3.6 14.1 14.5
200 13.8 4.7 3.0 12.5 14.8
100 8.8 3.8 3.7 10.9 15.4
50 6.4 3.1 1.9 9.6 16.0
Table 5.10: BTEC: Bigram-based confusion network word error rate.
beam-width WGD NGD BGD NumPaths(log) GER
1400 88.7 18.1 8.8 81.6 6.6
1200 81.7 17.0 6.7 78.8 6.7
900 69.8 15.3 7.6 74.3 6.9
700 60.9 13.9 6.9 66.3 7.3
500 49.3 12.3 6.1 55.9 7.6
300 35.2 9.7 5.1 42.7 8.4
200 26.8 8.1 4.4 36.7 9.0
100 16.8 6.0 3.4 25.1 10.3
50 10.9 4.6 2.6 18.5 11.9
Table 5.11: IBNC: Trigram-based confusion network word error rate.
beam-width WGD NGD BGD NumPaths(log) GER
1400 35.97 6.0 3.7 33.3 12.1
1200 32.9 5.7 3.5 33.1 12.1
900 27.6 5.3 3.2 32.8 12.2
700 23.0 4.9 2.9 32.6 12.2
500 18.3 4.5 2.7 32.4 12.2
300 13.5 3.9 2.3 32.0 12.3
200 10.8 3.5 2.0 31.4 12.4
100 7.5 3.0 1.7 30.2 12.7
50 5.7 2.7 1.5 29.6 12.8
[Plot omitted: WGD (y-axis) vs. GER (x-axis), for word graphs and confusion networks.]
Figure 5.7: IBNC: Confusion network size vs. word graph size.
[Plot omitted: WER (y-axis) vs. pruning threshold (x-axis).]
Figure 5.8: IBNC: Consensus decoding word error rate vs. the beam-width.
Table 5.12: IBNC: Bigram-based confusion network word error rate.
beam-width WGD NGD BGD NumPaths(log) GER
1400 103.2 16.7 8.0 185.1 3.6
1200 94.7 15.7 7.6 178.3 3.7
900 80.1 14.0 6.8 159.4 3.9
700 68.1 12.6 6.1 142.9 4.0
500 53.8 10.9 5.3 127.8 4.2
300 37.0 8.7 4.3 101.8 4.6
200 28.0 7.3 3.7 87.4 5.0
100 17.2 5.4 2.8 63.0 5.7
50 10.7 4.0 2.1 53.0 6.7
of GER, the WGD of word graphs is larger than the WGD of confusion
networks.
However, also in the confusion network case a trade-off between quality
and computational cost has to be established: the smaller the confusion
network, the higher the WER. This is shown in Fig 5.8, which plots the
WER as a function of the beam-width. The dotted line refers to the
trigram case, while the solid line refers to the bigram case.
In summary, the following conclusions can be drawn for the forward-
backward based pruning algorithm.
• In the best case, the lower bound of the error is 3.6%, namely the
GER of bigram-based confusion networks for the IBNC dataset. This
is a very positive result.
• In most cases, the word graph sizes can be at least halved, with a
relatively small effect on their GERs.
• In all cases, the GERs of confusion networks are smaller than the
GERs of the corresponding word graphs, under the same pruning
condition. The same holds for the WER.
5.3.5 Node-Compression Experiments
In the previous subsection, the experimental results of the forward-backward
based pruning on word graphs and confusion networks were presented.
Here, we present experiments on the node-compression method. As
mentioned in Chapter 3, the idea of this algorithm is to combine identical
(sub)paths in the word graph so that redundant nodes and edges are
removed. In fact, if two nodes in the word graph have the same set of
incoming edges (or outgoing edges), they can be merged without changing
the language of the word graph, where the language of the word graph is
defined as the set of all word strings starting at the initial node and
ending at the final node.
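The incoming-edge case of this merging rule can be sketched as follows; the function name and the edge representation are illustrative, and the thesis algorithm additionally merges nodes with identical outgoing edge sets.

```python
from collections import defaultdict

def merge_equivalent_nodes(edges):
    """Merge word-graph nodes that have identical sets of incoming edges.

    edges: list of (from_node, word, to_node) tuples. Two nodes whose
    incoming (source, word) sets coincide accept exactly the same prefixes,
    so merging them leaves the language of the graph unchanged.
    """
    incoming = defaultdict(set)
    for u, w, v in edges:
        incoming[v].add((u, w))
    canon = {}
    for node, inc in incoming.items():
        canon.setdefault(frozenset(inc), node)   # first node seen is canonical
    rename = {node: canon[frozenset(inc)] for node, inc in incoming.items()}
    # Rewriting both endpoints collapses duplicate edges into one.
    merged = {(rename.get(u, u), w, rename.get(v, v)) for u, w, v in edges}
    return sorted(merged)
```

In the example below, two nodes reached by the same edge set are merged and one duplicate edge disappears, which is the kind of reduction measured in Table 5.13.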
Table 5.13 reports on the reductions obtained by means of the node-
compression algorithm on bigram-based and trigram-based word graphs.
As we can see, the algorithm reduces word graph sizes, in terms of
WGD, by about 44% in the bigram case and 30% in the trigram case.
Compared to the results of the forward-backward based pruning, in which
word graph sizes are halved, the reduction obtained by this algorithm
is not significant. Moreover, if we perform the forward-backward pruning
first and then apply the node compression to the pruned word graphs, the
additional gains are even smaller, as shown in Table 5.14. The first line of
Table 5.14 contains different measures of the size of the initial word graphs
and the corresponding GER, without any pruning. The second and third
lines give the word graphs resulting from the forward-backward pruning
(at the beam-width value of 120, see also Table 5.8) and from the node
compression, respectively. Finally, the last line gives the word graph sizes
and GER when the two pruning methods are applied in the order mentioned above.
Clearly, the forward-backward pruning gives the most significant reduction.
In addition, the node-merging procedure, which is required in
Table 5.13: BTEC: Node compression experiments.
Methods WGD NGD BGD NumPaths
Bigram-based WG before compression 1558.7 390.4 38.4 126.4
Bigram-based WG after compression 891.5 210.2 28.6 90.3
Trigram-based WG before compression 447.7 224.8 23.0 42.8
Trigram-based WG after compression 312.4 198.8 18.6 36.6
Table 5.14: IBNC: Forward-backward pruning and node compression.
Methods WGD NGD BGD NumPaths(log) GER
Initial word graph 2363.5 335.0 34.7 321.2 5.1
Fw-Bw based pruning 549.5 134.2 22.8 208.9 5.3
Node-compression 1456.2 274.5 28.1 281.7 5.1
Combined two pruning 493.6 114.7 18.3 188.4 5.3
the node-compression algorithm, is quite expensive.
5.3.6 Word Graph Expansion Experiments
Results reported in Table 5.5 and Table 5.8 show for sure that the GER
of bigram-based word graphs is always smaller than the GER of trigram-
based word graphs at the same pruning condition. In contrast, the WER
of trigram-based word graphs is always lower than the WER of bigram-
based word graphs (see Table 5.2 and Table 5.3). In fact, the language
model constraints, which were used for generating bigram word graphs,
are looser than the ones used for generating trigram-based word graphs.
Hence bigram word graphs contain more paths than trigram word graphs.
However, when constructing bigram word graphs, we have approximately
used the bigram language model scores instead of the real trigram language
model scores - the values that are actually used by decoder. This explains
the higher WER for bigram-based word graph rescoring, compared to the
WER for trigram-based word graph rescoring. It would be interesting to
construct the word graph using bigram constraints and then expand it to
Table 5.15: BTEC: Bigram-based word graph expansion experiments.
Method WGD NGD BGD WER
Trigram-Based Word Graph 338.0 191.2 23.0 21.8
Bigram-Based Word Graph 781.0 237.2 32.9 22.2
Simple Method 7681.2 2064.7 30.2 22.0
Back-off Method 1681.5 824.2 30.2 22.0
put the trigram language scores on its edges. The word graph expansion
algorithms were presented in Chapter 3.
Table 5.15 shows the results for the bigram word graph expansion.
The first and second lines report the word graph quality, in terms of
WGD, NGD, BGD and WER, for the original trigram-based and bigram-based
word graphs respectively. The third line contains the corresponding
results for the simple word graph expansion method. Finally, the last line
shows the results for the expansion method which exploits the back-off
language model property.
the back-off language model property. As we can see, the improvements
of WER is relatively small, 22.0% vs. 22.2% with respect to the expanded
word graph sizes, which are nearly ten times bigger, for the simple method,
and two times for the back-off based method, compared to the original
bigram-based word graph sizes. Moreover, as mentioned in Chapter 3, in
order to apply the word graph expansion all edges labeled @BG have to
be removed from word graphs.
A benefit of the word graph expansion is that the real trigram
language model scores are put on the edges of bigram word graphs, which
have a very low GER compared to the GER of trigram word graphs. If we
apply the same procedure to expand trigram word graphs to four-gram
word graphs, we obtain expanded word graphs which carry four-gram
language model scores on their edges while keeping the GERs of trigram
word graphs, and so on. This is a very efficient way to decode the word
graph with a long-span language model.
5.4 N-Best Experiments
As shown in Chapter 4, once the word graph is generated, we can extract
the N−best sentence hypotheses directly from it. This N−best list can be
used for MAP decoding [Evermann, 2000] or rescored with long-span
language models [Tran, 1996]. In our applications, we used the N−best
list for machine translation, which gave promising results. This topic
will be discussed in the next chapter. The details of the N−best decoding
algorithm were described in Chapter 4.
Similarly to the previous section, we evaluated the N−best lists on the
BTEC and IBNC datasets by using the criterion called N−best word error
rate (see also Chapter 2). The N−best word error rate is calculated by
choosing, among the top N sentence hypotheses, either the sentence with
the minimum number of errors, which is the oracle (or best) case, or the
one with the maximum number of errors, which is the anti-oracle (or
worst) case. The N−best word error rate when N = ∞ is interpreted as the GER.
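The oracle and anti-oracle computation just described can be sketched as follows; the function names are illustrative, and the sketch scores each hypothesis with a standard word-level Levenshtein distance.

```python
def word_errors(hyp, ref):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i] + [0] * len(ref)
        for j, r in enumerate(ref, 1):
            cur[j] = min(prev[j] + 1,            # deletion from reference
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (h != r)) # match or substitution
        prev = cur
    return prev[len(ref)]

def oracle_antioracle_wer(nbest, ref):
    """Return (oracle, anti-oracle) WER in percent over an N-best list.

    nbest: list of hypotheses, each a list of words; ref: reference words.
    """
    errs = [word_errors(h, ref) for h in nbest]
    return 100.0 * min(errs) / len(ref), 100.0 * max(errs) / len(ref)
```

The oracle rate picks the best-matching hypothesis in the list, the anti-oracle the worst; as N grows the oracle rate approaches the GER of the underlying word graph.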
Table 5.16 presents the experimental results on the IBNC dataset.
The leftmost column of Table 5.16 contains the values of N, which
range from 5 to 1000. For each value of N, the N−best hypotheses are
ranked according to their scores. The second column, labeled WER-best,
is the N−best word error rate in the oracle case, while the third column,
labeled WER-worst, gives the anti-oracle case. The remaining columns
have the same meaning, but for bigram-based word graphs.
The top-most line in Table 5.16 shows the WER with N = ∞, which is
actually the GER of trigram word graphs on the IBNC dataset. As we can
see, with just the top 30 best sentences, we get a WER of 15.2%, compared
to the 18.0% WER of the 1−best word graph decoding. With N = 400,
the WER of 14.4% is very close to the 14.2% GER.
Table 5.16: IBNC: The N−best experiments.
Trigram-Based Bigram-Based
N WER (best) WER (worst) WER (best) WER (worst)
∞ 14.2 - 5.1 -
1000 14.4 37.9 10.2 53.1
500 14.4 37.6 10.4 49.0
400 14.4 37.4 10.4 47.6
300 14.5 37.2 10.5 46.2
200 14.6 36.8 10.6 43.9
100 14.8 35.6 11.1 40.2
50 14.9 32.0 11.7 36.8
30 15.2 29.7 12.5 34.5
20 15.6 27.4 13.0 32.4
10 16.1 23.9 14.2 29.2
5 17.0 20.8 16.6 24.7
The N-Different Best List Experiments
In this experiment, we investigate the effect of duplicates in the N−best
list. In the N−best decoding algorithm described in Chapter 4, all the
output sentences are ranked according to their scores. Therefore, it can
happen that two or more output sentences contain the same word
sequence; we refer to them as duplicates. Among them, it is essential to
keep just the one with the highest score and eliminate all the others.
Experimentally, on the BTEC dataset, it has been observed that among
the top 1000−best sentences, the maximum number of different sentences
is 532 while the minimum number is just 57.
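The deduplication step can be sketched as follows; the function name is illustrative, and the input is assumed to be already sorted by descending score, as produced by the N−best decoder.

```python
def n_different_best(nbest, n):
    """Keep only the first (highest-scoring) occurrence of each distinct
    word sequence, up to n entries.

    nbest: list of (sentence, score) pairs sorted by descending score.
    """
    seen, out = set(), []
    for sentence, score in nbest:
        key = tuple(sentence.split())
        if key not in seen:            # later duplicates have lower scores
            seen.add(key)
            out.append((sentence, score))
            if len(out) == n:
                break
    return out
```

Because the list is score-sorted, scanning once and keeping the first occurrence of each word sequence automatically keeps the highest-scoring duplicate.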
Fig 5.9 shows the N−best experiments on the BTEC dataset. On the
x−axis, N ranges from 5 to 50 and on the y−axis the N−best word
error rate is represented.
To illustrate the relationship between the N−different best sentences and
the N−best sentences, four lines are drawn in Fig 5.9. The solid line, which
[Plot omitted: WER (y-axis) vs. number of N-best sentences (x-axis), for N-different best and generic N-best lists, bigram and trigram cases.]
Figure 5.9: BTEC: N−different best sentences and N−best sentences vs. WER.
[Plot omitted: generation time (y-axis) vs. number of N-best sentences (x-axis).]
Figure 5.10: BTEC: N−different best sentences and N−best sentences vs. time.
is located at the lowest position in the figure, represents the N−different
best word error rate for bigram-based word graphs. Conversely, the highest
dotted line is the generic N−best error rate for trigram-based word graphs.
As we can see, the lines representing the N−different best sentences go
down faster than the lines of the generic N−best sentences. At the same
value of N = 50, for bigram-based word graphs, the WER on different
hypotheses is around 11.5% while the WER of the general case is around
16.0%. Fig. 5.10 shows the corresponding time required to generate the
sets of N−best sentences. Of course, generating the N−different best list
takes longer than generating the N−best list, since the duplicate sentences
have to be removed.
Moreover, eliminating the duplicates from the N−best list plays a crucial
role in machine translation. In fact, two source sentences that contain
the same word sequence are always translated into the same target,
regardless of their scores. So, with the N−different best list, the machine
translation system does not have to repeat useless translations.
Chapter 6
Speech Translation Experiments
In this chapter, we present speech translation experiments on the BTEC
dataset. The organization of this chapter is as follows. Firstly, the
machine translation system recently developed at ITC-irst is briefly
introduced. Secondly, a survey of the state of the art in speech translation,
with emphasis on the use of N−best lists and word graphs, is presented. The
core section is about our current efforts, which exploit N−best lists and
word graphs to improve translation quality. Specifically, the N−best lists
and word graphs have been used to optimize model parameters by a
parameter tuning scheme. Finally, detailed experimental results are given
in the last section.
6.1 ITC-irst Machine Translation System
The architecture of the ITC-irst statistical machine translation system at
run time is shown in Fig 6.1. After a preprocessing step, the sentence in
the source language is given as input to the decoder, which outputs the
best hypothesis in the target language; the actual translation is obtained
by a further post-processing step.
Preprocessing and post-processing consist of a sequence of actions aim-
ing at normalizing text and are applied both for preparing training data
and for managing the text to be translated. The same steps can be applied
to both source and target sentences, according to the language. Input
strings are tokenized and lowercased. Text is labeled with a few classes
including cardinal and ordinal numbers, week-day and month names, years
and percentages.
Parameters of the statistical translation model described in Section 2.3
can be divided into two groups: parameters of the basic phrase-based mod-
els and weights of their log-linear combination. Accordingly, the training
procedure of the system, shown in Figure 6.2, consists of two separate
phases.
• In the first phase, basic phrase-based models are estimated starting
from a parallel training corpus. After preprocessing, Viterbi align-
ments from source to target words, and vice-versa, are computed by
means of the GIZA++ toolkit [Och, 2000]. Phrase pairs are then ex-
tracted taking into account both direct and inverse alignments, and
finally phrase-based models are estimated.
• In the second phase, scaling factors of the log-linear model are esti-
mated by a minimum error training procedure. An iterative method
searches for a set of factors that minimizes a given error measure on
a development corpus. The simplex method is used to explore the
space of scaling factors. A detailed description of the minimum error
training approach is reported in [Cettolo, 2004].
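The minimum error training loop of the second phase can be illustrated with a toy implementation. The real system explores the space of scaling factors with the simplex method; the exhaustive grid below is only a didactic stand-in, and all names are illustrative.

```python
from itertools import product

def tune_scaling_factors(nbest_lists, references, grid):
    """Toy minimum error training: among a grid of candidate weight vectors,
    pick the one minimizing sentence errors on a development set.

    nbest_lists: per source sentence, a list of (translation, feature_vector).
    references: the reference translation for each source sentence.
    grid: candidate values per weight, e.g. [0.0, 0.5, 1.0].
    """
    n_feats = len(nbest_lists[0][0][1])
    best_weights, best_errors = None, None
    for weights in product(grid, repeat=n_feats):
        errors = 0
        for nbest, ref in zip(nbest_lists, references):
            # Rescore the N-best list under the candidate weights.
            top = max(nbest, key=lambda c: sum(w * f for w, f in zip(weights, c[1])))
            errors += (top[0] != ref)
        if best_errors is None or errors < best_errors:
            best_weights, best_errors = weights, errors
    return best_weights, best_errors
```

The structure mirrors the training loop of Figure 6.2: decode with candidate weights, evaluate against the development references, and let the search propose new weights, iterating until the error stops improving.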
6.2 N-Best List and Word Graphs for MT
Recently, experimental results reported in [Och, 2003a], [Shen, 04],
[Cettolo, 2004], [Zhang, 2004], [Ueffing, 2002] have shown that the use of
N−best lists and word graphs in machine translation, especially in speech
Figure 6.1: The architecture of the ITC-irst SMT system.
[Diagram omitted: run-time pipeline with source-side preprocessing, decoder, and target-side postprocessing; the decoder draws on phrase tables, word alignments, and model parameters (LM, phrase, distortion, fertility, and translation distributions).]
Figure 6.2: Training of the ITC-irst SMT system.
[Diagram omitted: Phase 1, phrase-based model training (word alignment, phrase extraction, parameter estimation from the preprocessed training set); Phase 2, minimum error training (decoder, evaluator, and simplex search for the scaling factors λ1–λ4 on the preprocessed development set).]
translation, can help to improve translation quality. Most of these works
are based on log-linear models whose parameters are trained and optimized
according to some given criterion by using N−best lists. In principle, they
can be summarized as follows:
• The N−best list approach. These systems usually involve at least
one of the two following phases:
– In the first phase, the recognizer generates either a list of
N−best hypotheses or just the best hypothesis in the source
language, using the algorithms mentioned in Chapter 3 and
Chapter 4.
– In the second phase, the N−best hypothesis list is given as input
to the text MT system. The MT system then produces a list of
N × M−best hypotheses in the target language as its output.
Some additional parameters are used in an additional module,
named rescore, in order to select the best translation hypothesis
from the N × M−best list.
• The word graph approach. Similarly to the above, these systems
require at least one of the following stages:
– In the first stage, the speech recognition system outputs a word
graph of hypotheses in the source language.
– In the second stage, this word graph is used as input to the
text MT system. The MT system then produces a word graph
of hypotheses in the target language as its output.
In the following subsections, we review current work on both of the
approaches mentioned above.
6.2.1 N-Best based Speech Translation
In [Och, 2003a], the authors directly modeled the posterior probability
P(e|f) by using a log-linear model, as shown in Section 2.3.1. There
is a set of F feature functions h_m(e, f), m = 1, ..., F. For each feature
function, there exists a model parameter λ_m, m = 1, ..., F. The translation
probability is:
P(e|f) = p_{\lambda_1^F}(e|f)   (6.1)
       = \frac{\exp[\sum_{m=1}^{F} \lambda_m h_m(e,f)]}{\sum_{e'} \exp[\sum_{m=1}^{F} \lambda_m h_m(e',f)]}   (6.2)
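Eq. 6.2 can be computed directly from the feature values; a minimal sketch with illustrative names is:

```python
import math

def loglinear_posterior(features_e, features_all, lambdas):
    """P(e|f) under the log-linear model of Eq. 6.2.

    features_e: feature values h_m(e, f) for the candidate of interest;
    features_all: feature vectors for every candidate e' (including e);
    lambdas: the model parameters lambda_m.
    """
    def score(h):
        return sum(l * v for l, v in zip(lambdas, h))
    numerator = math.exp(score(features_e))
    denominator = sum(math.exp(score(h)) for h in features_all)
    return numerator / denominator
```

The denominator normalizes over all competing candidates, so in practice it is approximated over an M−best list rather than the full search space.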
The modeling problem is how to choose suitable feature functions that
capture relevant properties of the translation task. The training problem
is how to obtain suitable parameter values λ_1^F. Parameter estimation was
performed by optimizing the error rate according to one of the following
criteria:
• mWER
• mPER
• BLEU score
• NIST score
Error minimization relied on the availability of a set of M−best candidate
translations for each input sentence, produced by the search algorithm
(note that we distinguish the M−best lists generated by text MT from the
N−best lists generated by speech recognition). During training, optimal
parameters were searched for by using Powell's procedure. Since the
M−best list can change significantly when the parameters are modified,
the procedure is iterated until the M−best list remains stable.
The authors report that, for a given error criterion used in training, they
obtained in most cases the best results when the same criterion was used
as the evaluation metric on the test data.
A similar approach for training the parameters λ_1^F was proposed
in [Cettolo, 2004], with two main differences compared to [Och, 2003a]:
• The simplex algorithm was used instead of Powell's algorithm.
• All the solutions explored during the search were exploited, instead
of just the M−best candidates.
Another way of using the M−best list in machine translation is given in
[Shen, 04], namely discriminative re-ranking for machine translation. In
this work, like the works mentioned above, the authors only experimented
with text translation, but in principle the proposed algorithm can be
applied to speech translation as well. Informally, the re-ranking approach
for machine translation is defined as follows. First, for each source sentence,
a baseline system generates M−best target sentence candidates. Features
that can potentially discriminate between good and bad translations are
extracted from these M−best candidates. These features are then used
Table 6.1: Results reported in [Shen, 04] comparing minimum error training with discriminative re-ranking (BLEU%).
Algorithm Baseline BestFeat FeatComb
Minimum Error 31.6 32.6 32.9
Splitting 31.7 32.8 32.6
to determine a new ranking for the M−best list, by using the so-called
splitting algorithm. The new top-ranked candidate in this M−best list
is the new best candidate translation. Formally, the splitting algorithm
searches for a linear function f(x) = w · x that successfully splits the top
R−ranked and bottom K−ranked translations of each sentence, where
K + R ≤ N, w being the weight vector and x the feature vector. The
algorithm is in fact a perceptron-like algorithm. Its idea is as follows. For
every two translations x_{i,j} and x_{i,l}, if:
• x_{i,j} is ranked in the top R, i.e. y_{i,j} ≤ R,
• x_{i,l} is ranked in the bottom K, i.e. y_{i,l} ≥ N − K + 1,
• the weight vector w cannot successfully separate x_{i,j} and x_{i,l}
with a learning margin τ, i.e. w · x_{i,j} < w · x_{i,l} + τ,
then w is updated by adding x_{i,j} − x_{i,l}. The updating is
finished when all the inconsistent pairs have been processed.
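The perceptron-like update rule just described can be sketched as follows; the function name, the flat list of (top, bottom) pairs, and the fixed epoch cap are illustrative simplifications of the algorithm in [Shen, 04].

```python
def splitting_update(pairs, tau, epochs=10):
    """Learn a weight vector w that scores top-R translations at least
    tau above bottom-K translations.

    pairs: list of (x_top, x_bottom) feature-vector pairs, one per
    (x_{i,j}, x_{i,l}) candidate pair subject to the margin condition.
    """
    dim = len(pairs[0][0])
    w = [0.0] * dim
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    for _ in range(epochs):
        updated = False
        for x_top, x_bot in pairs:
            if dot(w, x_top) < dot(w, x_bot) + tau:      # margin violated
                w = [wi + a - b for wi, a, b in zip(w, x_top, x_bot)]
                updated = True
        if not updated:   # every pair separated with margin tau
            break
    return w
```

Each violated pair pushes w toward the top-ranked candidate's features and away from the bottom-ranked one, exactly as in the standard perceptron update.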
The authors present the experimental results of this algorithm using
four different kinds of feature combination, as shown in Table 6.1. It is
clear that the improvements are not significant. We have also implemented
this algorithm; some initial results on the BTEC corpus, which are
consistent with those of the authors, are reported in Table 6.2.
The only work that used N × M−best lists for speech translation was
proposed in [Zhang, 2004]. In this paper, after introducing log-linear
models for speech translation, the authors presented a method for training
Table 6.2: Experiments of the splitting algorithm on BTEC data
Rank NIST BLEU MWER MPER MSER
baseline 9.7352 0.5208 35.3 30.2 75.7
Splitting 9.7832 0.5309 34.5 30.0 74.4
and optimizing parameters which is very similar to that of [Och, 2003a].
The details of the method are given below.
The log-linear model used to translate an acoustic vector X (see
Section 2.3.3) gives the following optimization criterion:

e^{*} = \arg\max_{e} \sum_{i=1}^{F} \lambda_i \log P_i(X, e)   (6.3)
A total of 12 features (F = 12), including 2 features from speech
recognition, 5 features from machine translation and 5 additional features,
has been used for the experiments.
Powell's algorithm was used to optimize the model parameters λ_1^F,
based on different objective translation metrics. The authors used the four
metrics mentioned in Chapter 3:
• BLEU score
• NIST score
• mWER
• mPER
With the optimization scheme mentioned above, the authors built four
log-linear models in order to quantify the translation improvement given
by features from speech recognition and machine translation respectively.
To optimize the λ parameters of the log-linear models, they used
development data consisting of 510 speech utterances and adopted an
M−best hypothesis approach [Och, 2003a] to train λ. The experimental
results reported in this work show that:
• Models with optimized parameters performed better than models with
uniform parameters.
• Translation performance with N−best recognition is better than with
single-best recognition.
• Translation performance improves when more features are incorpo-
rated into the log-linear model.
6.2.2 Word Graph-based Speech Translation
As we have shown, the word graph offers a very compact way of
representing competing hypotheses. From the word graph, we can also
extract the exact N−best list for further experiments (see Chapter 3),
whereas it is difficult to keep an explicit M−best list for a large value of M.
In [Ueffing, 2002], the authors proposed a method for generating word
graphs in a statistical text MT decoder. With just a small modification of
the beam search, a word graph of candidate translation sentences can be
obtained, given the source sentence f. The details of this method are given
as follows.
Word Graph Structure
During the search in statistical machine translation system, a bookkeeping
tree is kept, with the following information:
• the output target word, e_i,
• the covered source sentence position, j,
• a backpointer to the preceding bookkeeping entry.
After the search has finished, the best sentence is found by tracing back
through this bookkeeping tree. If we want to generate a word graph, we
have to store both alternatives in the bookkeeping structure when two
hypotheses are recombined. Thus an entry in the bookkeeping structure may
have several backpointers to different preceding entries. For ease of
maintenance, the word graph is defined with nodes and edges containing the
following information:
• node: the last covered source sentence position j.
• edge:
– the target word ei,
– the probabilities according to the different models: the language
model and the translation sub-models,
– the backpointer to the preceding bookkeeping entry.
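Following the description above, the bookkeeping and word graph structures of [Ueffing, 2002] might be sketched as follows. This is a minimal illustration: the field names are my own, and the translation sub-model scores are collapsed into a single value:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    word: str          # target word e_i
    lm_logprob: float  # language model score
    tm_logprob: float  # translation sub-model scores (collapsed here)
    origin: "Node"     # backpointer to the preceding bookkeeping entry

@dataclass
class Node:
    position: int      # last covered source sentence position j
    incoming: list = field(default_factory=list)

def recombine(node, edge):
    """When two hypotheses are recombined, keep both alternatives:
    the node simply collects another incoming edge."""
    node.incoming.append(edge)

start = Node(position=0)
n1 = Node(position=1)
recombine(n1, Edge("hello", -1.2, -0.7, start))
recombine(n1, Edge("hi", -1.5, -0.4, start))
print(len(n1.incoming))  # prints 2: two alternatives kept for position 1
```

Keeping every recombined alternative as an extra incoming edge is exactly what turns the bookkeeping tree into a graph.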
Word Graph Pruning
After pruning in the beam search, hypotheses that are no longer active do
not have to be kept in the bookkeeping structure, which reduces its size
significantly.
The generated word graph can be further pruned by using the beam-search
concept, in a way very similar to the one mentioned in Chapter 3.
Specifically, the probability of the best sentence in the word graph is
determined first. Then, all hypotheses in the graph whose probabilities are
lower than this maximum probability multiplied by a pruning threshold t,
0 ≤ t ≤ 1, are discarded.
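The pruning rule above can be sketched as follows. This is a toy illustration over sentence probabilities; real implementations typically work with log-probabilities:

```python
def prune_wordgraph(hyp_probs, t):
    """Beam-style word graph pruning: keep only hypotheses whose
    probability is at least t times the best probability (0 <= t <= 1)."""
    assert 0.0 <= t <= 1.0
    p_max = max(hyp_probs.values())  # probability of the best sentence
    return {h: p for h, p in hyp_probs.items() if p >= t * p_max}

hyps = {"h1": 0.50, "h2": 0.20, "h3": 0.04}
print(sorted(prune_wordgraph(hyps, 0.1)))  # prints ['h1', 'h2']
```

With t = 0.1, the cutoff is 0.05, so "h3" is discarded; t = 0 keeps everything and t = 1 keeps only the best hypothesis.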
6.3 ITC-irst Works
The use of N ×M−best lists for speech translation at ITC-irst was illus-
trated in Fig 1.1. As we mentioned in Chapter 1, there are two modules
for the text translation component, namely the N−best MT and the con-
fusion network MT. Both translation modules produce word graphs, which
contain alternative translation candidates. The best translation hypothesis
can be obtained by finding the highest-scoring path in the word graph,
using the algorithm described in Chapter 3.
6.3.1 System Parameter Tuning
Similarly to [Och, 2003a], [Shen, 04], [Cettolo, 2004] and [Zhang, 2004],
the objective here is to find the optimal parameter values {λi} on the
development set. These values are then used to evaluate the performance of the
speech translation system on the test set by using the criteria mentioned
in Section 2.3.4.
Figure 6.3: The estimation of parameters for speech recognition (the first stage).
Figures 6.3, 6.4 and 6.5 illustrate the scheme of our parameter tuning
method, which includes the following three stages:
• Estimation of parameters for speech recognition, λASR.
Figure 6.4: The estimation of parameters for machine translation (the second stage).
• Estimation of parameters for machine translation, λMT.
• Estimation of parameters for the whole system, λSYS.
In total, 8 parameters are used by the whole system. The first two,
{λ1, λ2}, are for the ASR component, corresponding to the acoustic model
and the source language model weights respectively. The remaining six
parameters, {λ3, ..., λ8}, are for the MT component, corresponding to the
lexicon (1), fertility (2), distortion (2) and target language model (1)
scores.
As shown in Fig 6.3 and Fig 6.4, the first two stages use a similar scheme.
Specifically, their objective is to find the optimal parameters for the
speech recognition and text machine translation components on the
development sets. Those values are then used as the initial parameters for
the last stage, which searches for the optimum parameters of the whole
system.

Figure 6.5: The whole system parameter tuning (the third stage).

The details of the second stage, which optimizes the {λ3, ..., λ8}
parameters of the log-linear models on the development set, are given in
the MT-tuning procedure. Using the initial values {λ3, ..., λ8} (at line
1), the text MT module translates each source sentence seni of the
development set Ds (see line 3) into its best translation hypothesis hypi
in the target language. The optimize step uses the simplex method to
optimize the parameters according to the BLEU score, given the best
translation hypotheses and the reference translations in the target
language. The procedure is iterated until the optimal values of the
parameters, λ*MT, are found (see also Fig 6.4).
MT-tuning
1  Initialize λ3, ..., λ8
2  repeat
3      for each seni ∈ Ds
4          do hypi ← Translate(seni, {λ3, ..., λ8});
5      λ*MT ← optimize({λ3, ..., λ8}, refs);
6  until convergence;
7  return λ*MT

In general, we could apply the above procedure to the first stage as well,
to estimate the speech recognition parameters {λ1, λ2}, with the
optimization subject to the WER. For simplicity, however, we chose the
values of {λ1, λ2} empirically, by running the speech recognizer with
different language model weights and taking the values (denoted as λ*ASR
in Fig 6.3) for which it achieves the lowest WER.
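The MT-tuning loop can be sketched in code. For self-containment, the simplex optimizer is replaced here by an exhaustive search over a small weight grid, and Translate and the evaluator are toy stand-ins; none of this is the ITC-irst implementation:

```python
import itertools

def tune_lambdas(translate, evaluate, dev_set, refs,
                 grid=(0.5, 1.0, 2.0), n_params=2):
    """Simplified stand-in for the MT-tuning loop: instead of a simplex
    step, try every weight combination from a small grid and keep the one
    whose 1-best translations score highest on the development set."""
    best_lams, best_score = None, float("-inf")
    for lams in itertools.product(grid, repeat=n_params):
        hyps = [translate(sen, lams) for sen in dev_set]  # lines 3-4
        score = evaluate(hyps, refs)
        if score > best_score:                            # "optimize" step
            best_lams, best_score = lams, score
    return best_lams, best_score

# Toy components: each dev "sentence" is an N-best list of
# (translation, [log feature scores]); evaluate counts exact matches.
def translate(nbest, lams):
    return max(nbest, key=lambda e: sum(l * h for l, h in zip(lams, e[1])))[0]

def evaluate(hyps, refs):
    return sum(h == r for h, r in zip(hyps, refs))

dev_set = [[("good", [-1.0, -3.0]), ("bad", [-3.0, -1.0])]]
refs = ["good"]
lams, score = tune_lambdas(translate, evaluate, dev_set, refs)
print(score)  # prints 1: the reference translation is recovered
```

The simplex method used in the thesis explores the weight space far more efficiently than this grid, but the loop structure, translate with current weights, evaluate, update, is the same.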
Finally, in the last stage the whole-system parameter tuning is applied.
Concretely, this stage estimates the {λ1, ..., λ8} parameters on the
development set and uses the estimated values to rescore the best
translations on the test set. The detailed implementation of this stage is
given in the sys-para-tuning procedure.
sys-para-tuning
1  N ×M−list ← text-MT(N−best list, λ*MT);
2  repeat
3      {hypi} ← re-ranking(N ×M−list, λ*ASR, λ*MT);
4      BLEU ← evaluator({hypi}, ref. trans.);
5      λ*SYS ← simplex-step(BLEU);
6  until convergence;
7  return λ*SYS
As also illustrated in Fig 6.5, the re-ranking takes the N ×M−best list
and the previously estimated values {λ1, ..., λ8} as its inputs and
rescores the translation hypotheses. Next, the evaluator computes the
BLEU score of the best translation hypotheses against the reference
translations. Finally, the simplex-step returns a new set of parameters
{λ′}. The loop finishes when the final optimal values of the parameters,
{λ*}, are found.
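The re-ranking step can be sketched as follows, assuming each candidate in the N ×M−best list carries the ASR feature scores of its source hypothesis together with its own MT feature scores. The feature layout and values here are a toy illustration, not the ITC-irst implementation:

```python
def rerank(nxm_list, lam_asr, lam_mt):
    """Re-rank an N x M-best list: each candidate carries the ASR feature
    scores of the recognition hypothesis it was translated from plus its
    own MT feature scores; the winner maximizes the joint log-linear score."""
    def score(cand):
        _, asr_feats, mt_feats = cand
        return (sum(l * h for l, h in zip(lam_asr, asr_feats)) +
                sum(l * h for l, h in zip(lam_mt, mt_feats)))
    return max(nxm_list, key=score)

# Toy 2 x 2 list: (translation, [AM, source LM], [TM, target LM]) log scores;
# the first two candidates come from ASR hypothesis a, the others from b.
candidates = [
    ("trans a1", [-5.0, -2.0], [-4.0, -1.0]),
    ("trans a2", [-5.0, -2.0], [-3.0, -2.5]),
    ("trans b1", [-6.0, -1.5], [-2.0, -1.5]),
    ("trans b2", [-6.0, -1.5], [-5.0, -0.5]),
]
best = rerank(candidates, lam_asr=[1.0, 1.0], lam_mt=[1.0, 1.0])
print(best[0])  # prints "trans b1"
```

Note that the winning translation does not come from the best ASR hypothesis: combining recognition and translation scores is precisely what the N ×M−best interface makes possible.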
In the following subsections, we present the experimental results of our
work following the scheme described above. Specifically, the first
subsection describes the data sets, including both the development and test
sets used for tuning and testing the whole system. The next subsection
presents the results of the first two stages on the development sets.
Finally, the results of parameter tuning on the development and test sets
are given in the last subsection.
6.3.2 New BTEC Test and Development Sets
In Chapter 5, we described the BTEC test set, which contains 506 sentences
in Italian, recorded by 10 speakers (see Table 5.1). In these experiments,
a larger test set was used:
• 3006 sentences with one reference each;
• 10 speakers;
• in addition, 50 sentences from each speaker for the development set.
Table 6.3 shows statistics of the new data sets.

              #sent.  W      |V|   #spk        speech (min)
dev set       500     3961   953   10 (5f+5m)  34.0
old test set  506     2985   940   10 (5f+5m)  28.7
added         2500    20527  2410  10 (5f+5m)  176.5
new test set  3006    23512  2768  17 (8f+9m)  205.2

Table 6.3: The new BTEC test and development sets
6.3.3 The First Stage Results (for ASR)
As mentioned in the previous subsection, the first stage was carried out
empirically. Concretely, the development set was recognized using
different pre-defined language model weights.
Figure 6.6: WER vs. LM weight on the development set.
Fig 6.6 plots the WER on the development set as a function of the language
model weight. As we can see, the lowest WER, 20.93%, is obtained at a
weight value of 9.25. Using this weight to recognize the test set, we
obtain the WER reported in Table 6.4.
Test Set WER 95% Conf. Interval
3006 22.09 21.14 - 23.03
Table 6.4: WER of the speech recognition on the 3006−test set when the LM weight is
9.25.
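The empirical first stage then reduces to picking the weight with the lowest development-set WER. A trivial sketch, with numbers loosely read off Fig 6.6 for illustration only:

```python
# WERs (%) measured on the development set for several LM weights
# (illustrative values approximating Fig 6.6, not the exact measurements).
wer_by_weight = {7.0: 22.6, 8.0: 21.6, 9.25: 20.93, 10.0: 21.2, 11.0: 22.1}

# Pick the weight that minimizes WER on the development set.
best_weight = min(wer_by_weight, key=wer_by_weight.get)
print(best_weight, wer_by_weight[best_weight])  # prints 9.25 20.93
```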
6.3.4 The Second Stage Results (for Pure Text MT)
Experiments for the second stage were carried out by the following steps:
1. translation of the manual transcriptions of the test set using uniform
parameters (all λ's = 1).
2. parameter optimization on the manual transcriptions of the develop-
ment set.
3. translation of the manual transcriptions of the test set with optimized
parameters.
Method BLEU 95% Conf. Interval
Baseline 52.63 51.45 - 53.75
MT tuning 53.15 52.00 - 54.30
Table 6.5: The pure text translation results of the second stage (BLEU score)
The results of these steps are given in Table 6.5. As we can see, the
improvement is not statistically significant: the BLEU score of the MT
tuning, 53.15, lies inside the confidence interval of the baseline's BLEU
score, 51.45 − 53.75.
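The thesis does not state how these confidence intervals were computed; one common way to obtain such intervals for corpus-level scores is bootstrap resampling over per-sentence scores, sketched here with made-up data:

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=7):
    """Percentile bootstrap confidence interval for the mean of
    per-sentence scores (one common way, though not necessarily the one
    used for Table 6.5, to obtain such intervals)."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Made-up per-sentence scores, for illustration only.
per_sentence = [0.52, 0.61, 0.48, 0.55, 0.50, 0.58, 0.47, 0.60, 0.53, 0.49]
lo, hi = bootstrap_ci(per_sentence)
print(round(lo, 3), round(hi, 3))
```

If a competing system's score falls inside this interval, as happens for the MT tuning result above, the difference cannot be called statistically significant.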
6.3.5 The Third Stage Results (for the Speech Translation)
In this final subsection, the results of the system parameter tuning, the
third stage, are presented. Concretely, it was carried out in the
following steps:
1. translate the test set output by the speech recognizer, using uniform
parameters. The translation results of this step are taken as the
baseline.
2. estimate the system parameters λ*SYS on the development set by using
the sys-para-tuning procedure.
3. produce the 100×100−best translation list for the test set and rescore
the best translations by using the estimated values from the previous
step.
Methods BLEU Conf. Interval
1asrbest (baseline) 39.66 38.49 - 40.79
sys.para.tuning 41.22 40.03 - 42.42
Table 6.6: The speech translation results of the third stage (BLEU score).
Table 6.6 reports the results of the third stage. The following
observations can be drawn:
• There is a large gap between the translation performance on the manual
transcripts and on the transcripts produced by the speech recognizer:
the BLEU scores of the two tasks are 52.63 and 39.66 respectively.
• System tuning gives a statistically significant improvement to speech
translation performance.
Chapter 7
Conclusions and Future Work
As discussed in the previous chapters, the main goal of this work was to
provide a new interface for speech translation in terms of word graphs,
N−best lists and confusion networks. With this new interface, machine
translation can exploit deeper and more extensive knowledge sources to
improve translation quality. In particular, the following aspects were
studied:
7.1 Efficiency and Quality of Word Graph Genera-
tion
The word graph construction algorithm, which was implemented in Chapter 3,
can be fully integrated in a general m−gram language model speech decoder.
Moreover, by using the m−gram language model state constraints for
optimizing the word boundaries, the algorithm yields better word boundaries
and enhanced capabilities for pruning the word graphs. Concretely, bigram
and trigram word graphs were generated and their results were reported in
Chapter 5. Furthermore, we have also implemented and evaluated various
pruning methods, namely the beam search, the forward-backward pruning, and
the node compression algorithm, in order to obtain quality word graphs. An
important capability is the word graph expansion, which can be used to
rescore the word graph with a higher-order language model.

For comparison with other work, Table 7.1 reports the word graph results of
[Ortmanns, 1997] on the NAB’94 task.

Beam-Width  WGD     NGD    BGD   GER
300         1476.2  181.8  19.7  4.2
150         1415.9  175.9  19.3  4.2
100         684.5   104.2  14.5  4.2
90          460.4   77.4   12.3  4.3
80          269.2   51.8   9.8   4.5
70          137.4   31.4   7.4   4.8
50          25.2    9.0    3.8   5.8

Table 7.1: Word graph results of [Ortmanns, 1997] on the NAB’94 task.
As our word graph experiments were carried out on different test sets (the
IBNC and BTEC test sets), we can only compare the relationship between
graph sizes, GERs and WERs. At a WGD value of 684.5, [Ortmanns, 1997]
reported a GER of 4.2 and a WER of 14.3. In our case, on the IBNC test
set, the WGD is 991.3 at a GER of 5.2 and a WER of 19.8, as shown in Table
5.6. This means that, even though the experiments were run on different
data sets and the WER of our speech recognizer was higher, the graph
qualities are comparable. Moreover, as shown in Table 5.12, our GERs for
the confusion networks are even lower than the GERs of [Ortmanns, 1997] on
the two data sets mentioned above: at a WGD value of 103.2, we achieve a
GER of 3.6, while the smallest GER reported in [Ortmanns, 1997] is 4.2, at
a WGD value of 684.5.
Finally, evaluating the algorithms on two different data sets, a large
vocabulary speech corpus (IBNC) and a spontaneous speech corpus (BTEC),
confirms that our algorithms work properly and efficiently.

7.2 Efficiency and Quality of Word Graph Decoding
Three word graph decoding algorithms, namely 1−best decoding, N−best
decoding and consensus decoding, have been developed and evaluated. Table
7.2 reports the word graph decoding results of [Mangu, 1999] on the DARPA
Hub-4 task.

Method     WER
1-best     33.1
Consensus  32.5

Table 7.2: Word graph decoding results (WER) of [Mangu, 1999] on the DARPA
Hub-4 task.

As noted in the previous section, it is difficult to compare results across
different tasks, except for the relationship between quantities.
Clearly, the WERs reported by [Mangu, 1999] on the DARPA Hub-4 task are
considerably higher than the WERs on the IBNC test set that we reported in
Table 5.3. However, consensus decoding consistently achieves a lower WER
than 1−best decoding on the same task: on the IBNC test set, we obtain
WERs of 17.8% and 18.0% for consensus decoding and 1−best decoding
respectively, while the corresponding results of Mangu's work are 32.5%
and 33.1%.
Moreover, the capability of generating N−best lists and confusion net-
works from word graphs has played a crucial role in our speech translation
system. As presented in Chapter 6, the use of word graphs for speech
translation yields significant improvements in translation quality. The
N−best decoding algorithms, which were presented in Chapter 4 and
evaluated on both the BTEC and IBNC datasets, are robust and efficient.
They run much faster than real time and produce N−best lists in a general
format that can be used by other groups.
7.3 New Results on the System Parameter Tuning
Finally, in Chapter 6 we reported the application of word graphs to
improving speech translation quality. This is an extension of our previous
work [Cettolo, 2004]. Specifically, by exploiting word graph generation
and word graph decoding, we were able to produce the final N ×M-best
translation candidates as the output of the speech translation system. The
N ×M-best list was used in a parameter tuning scheme, which we proposed,
to optimize the system parameters. To evaluate the new tuning scheme, an
extension of the BTEC corpus was recorded and transcribed, comprising the
500-sentence development set and the 3006-sentence test set. The
experiments reported in Chapter 6 showed significant improvements in
translation quality compared to the baseline.
7.4 Future Work
There are several directions that should be investigated more carefully in
the future.
• Efficient word graph rescoring with higher-order language models. This
can be easily achieved in two steps:
– first, the word graph expansion algorithm is applied to expand bigram
or trigram word graphs into four-gram or even higher-order word graphs;
– second, the word graph decoding algorithms are used to rescore the
expanded word graphs. The algorithms that we proposed and implemented
can be naturally adopted for this work.
• Estimation and use of confidence measures for speech recognition. The
quantities derived from confusion networks have very nice characteristics.
Specifically, the confusion network defines a total ordering among words
and makes it possible to associate posterior probabilities with single
words. Within the most probable path, these probabilities can be
interpreted as confidence scores. The availability of reliable confidence
scores is of paramount importance for many applications that make use of
automatic transcripts, e.g. content-based indexing, text summarization,
text classification, information extraction, etc.
• Experimenting with the system parameter tuning scheme and additional
features. Currently, our tuning scheme only works with the 8 features used
inside the whole system, as described in Chapter 6. In addition to these 8
features, several other features could be integrated into the log-linear
models, such as part-of-speech language models, the length model, the jump
weight, the maximum entropy alignment model, the example matching score,
the dynamic example matching score, etc. We expect that adding these
features to the log-linear models and then applying the system parameter
tuning to them can further improve speech translation quality.
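As a sketch of the confidence-measure direction above, assume a confusion network is represented as a list of slots, each a word-to-posterior dictionary (a toy encoding of my own, not the thesis data structure). The consensus hypothesis and its word-level confidence scores could then be extracted as:

```python
def consensus_with_confidence(confusion_network):
    """A confusion network is a sequence of slots, each mapping candidate
    words to posterior probabilities. The consensus hypothesis picks the
    most probable word per slot; its posterior serves as a confidence score."""
    hypothesis = []
    for slot in confusion_network:
        word = max(slot, key=slot.get)   # most probable word in this slot
        hypothesis.append((word, slot[word]))
    return hypothesis

# Toy two-slot network; "*DELETE*" marks the empty-word alternative.
cn = [
    {"the": 0.9, "a": 0.1},
    {"cat": 0.6, "hat": 0.3, "*DELETE*": 0.1},
]
print(consensus_with_confidence(cn))  # prints [('the', 0.9), ('cat', 0.6)]
```

A downstream application such as content-based indexing could then, for example, discard words whose confidence falls below a chosen threshold.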
Bibliography
[Amtrup, 1996] J. W. Amtrup, H. Heine, U. Jost, “What's in a
Word Graph: Evaluation and Enhancement of Word
Lattices”, Technical Report, University of Hamburg,
1996.
[Antoniol, 1995] G. Antoniol, F. Brugnara, M. Cettolo and M. Fed-
erico. “Language Model Representations for Beam-
Search Decoding”. Proceedings of ICASSP 95, Inter-
national Conference on Acoustics, Speech and Signal
Processing, Detroit, USA, pp.588-591, 8th-12th May
1995.
[Aubert, 1995] X. Aubert and H. Ney, “Large vocabulary continu-
ous speech recognition using word graphs”, in Pro-
ceedings IEEE International Conference on Acous-
tics, Speech, and Signal Processing 1995, Detroit,
MI, USA, May 1995, vol. 1, pp. 49-52.
[Bertoldi, 2004] Bertoldi, Nicola; Cattoni, Roldano; Cettolo, Mauro;
Federico, Marcello. “The ITC-irst Statistical Ma-
chine Translation System for IWSLT-2004”. In Pro-
ceedings of the International Workshop on Spoken
Language Translation (IWSLT). pp. 51-58. Septem-
ber, 2004. Kyoto, Japan.
[Brown, 1993] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra,
and R. L. Mercer, “The Mathematics of Statistical
Machine Translation: Parameter Estimation,” Com-
putational Linguistics, vol. 19, no. 2, pp. 263–313,
1993.
[Berger, 1996] A. Berger, S. Della Pietra, and V. Della Pietra, “A
Maximum Entropy Approach to Natural Language
Processing,” Computational Linguistics, vol. 22,
no. 1, pp. 39–71, 1996.
[Brugnara, 2000] F. Brugnara, M. Cettolo, M. Federico, D. Giuliani,
“A Baseline for the Transcription of Italian Broad-
cast News”, Proceedings IEEE International Con-
ference on Acoustics, Speech, and Signal Processing
2000, Istanbul, Turkey, June 2000.
[Chase, 1997] L. Chase, “Word and acoustic confidence annotation
for large vocabulary speech recognition”, in Proceed-
ings ISCA European Conference on Speech Com-
munication and Technology 1997, Rhodes, Greece,
September 1997, vol. 2, pp. 815-818.
[Cettolo, 2004] M. Cettolo, M.Federico, “Minimum Error Train-
ing of Log-Linear Translation Model”, In Proceed-
ings of the International Workshop on Spoken Lan-
guage Translation (IWSLT), September, 2004. Ky-
oto, Japan.
[De Mori, 1998] R. de Mori et al., “Spoken Dialogues with Comput-
ers”, Academic Press, San Diego, CA, USA, 1998.
[Darroch, 1972] J. Darroch and D. Ratcliff, “Generalized Iterative
Scaling for Log-Linear Models,” The Annals of Math-
ematical Statistics, vol. 43, no. 5, pp. 1470–1480,
1972.
[Pietra, 1997] S. Della Pietra, V. Della Pietra, and J. Lafferty, “In-
ducing features of random fields,” IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 19,
no. 4, pp. 380–393, 1997.
[Dempster, 1977] A. P. Dempster, N. M. Laird, and D. B. Rubin,
“Maximum-likelihood from incomplete data via the
EM algorithm,” Journal of the Royal Statistical So-
ciety, B, vol. 39, pp. 1–38, 1977.
[Eppstein, 1998a] David Eppstein, “K shortest paths and other
‘k best’ problems”, http://www1.ics.uci.edu/~eppstein/bibs/kpath.bib.
[Eppstein, 1998b] David Eppstein, ”Finding the k shortest paths,”
SIAM J.Computing, vol. 28, no. 2, pp. 652-673, 1998.
[Evermann, 1999] G. Evermann, “Minimum Word Error Rate Decoding”,
MPhil thesis in Computer Speech and Language
Processing, University of Cambridge, 1999.
[Evermann, 2000] G. Evermann and P. C. Woodland, “Large vocabulary
decoding and confidence estimation using word
posterior probabilities”, in Proceedings IEEE Inter-
national Conference on Acoustics, Speech, and Sig-
nal Processing 2000, Istanbul, Turkey, June 2000,
vol. 3, pp. 1655-1658.
[Fetter, 1996] P. Fetter, F. Dandurand, and P. Regel-Brietzmann,
“Word graph rescoring using confidence measures”,
in Proceedings International Conference on Spoken
Language Processing 1996, Philadelphia, PA, USA,
October 1996, vol. 1, pp. 10-13.
[Federico, 2000] M. Federico, ”A Baseline System for the Retrieval
of Italian Broadcast News”, Speech Communication,
Special Issue on ”Accessing Information in Spoken
Audio”, 32:37-47, 2000.
[Federico, 1995] M. Federico, M. Cettolo, F. Brugnara and G. An-
toniol, ”Language Modelling for Efficient Beam-
Search”, Computer Speech and Language, 9:353-379,
1995.
[Goel, 1999] V. Goel and W. Byrne, “Task dependent loss func-
tions in speech recognition: A* search over recogni-
tion lattices”, in Proceedings ISCA European Con-
ference on Speech Communication and Technology
1999, Budapest, Hungary, September 1999, vol. 3,
pp. 1243-1246.
[Jelinek, 1998] F. Jelinek, “Statistical Methods for Speech Recogni-
tion”, The MIT Press, 1998.
[Johnson, 2000] T. Johnson, “Incorporating prosodic information
and language structure into speech recognition sys-
tems”, Ph.D. Thesis, Purdue University , 2000.
[Kemp, 1997] T. Kemp and T. Schaaf, “Estimating confidence us-
ing word lattices”, in Proceedings ISCA European
Conference on Speech Communication and Technol-
ogy 1997, Rhodes, Greece, September 1997, vol. 2,
pp. 827-830.
[Lee, 1989] C. H. Lee and L. R. Rabiner, “ Frame synchronous
network search algorithm for connected word recog-
nition”, IEEE Transactions on Acoustics, Speech,
and Signal Processing, vol. 27, no. 11, pp. 1649-1658,
November 1989.
[Lee, 1995] C. H. Lee, F. K. Soong, and K. K. Paliwal, editors,
“Automatic Speech and Speaker Recognition, Ad-
vanced Topics”, pages 1-30. Kluwer Academic Pub-
lishers, 1996.
[Mangu, 1999] L. Mangu, E. Brill, and A. Stolcke, “Finding con-
sensus among words: Lattice-based word error min-
imization”, in Proceedings ISCA European Con-
ference on Speech Communication and Technology
1999, Budapest, Hungary, September 1999, vol. 1,
pp. 495-498.
[Ney, 1994] H. Ney and X. Aubert, “Word graph algorithm
for large vocabulary continuous speech recognition”,
in Proceedings International Conference on Spo-
ken Language Processing 1994, Yokohama, Japan,
September 1994, vol. 3, pp. 1355-1358.
[Ney, 1993] H. Ney and M. Oerder, “Word graphs: An efficient
interface between continuous speech recognition and
language understanding”, in Proceedings IEEE In-
ternational Conference on Acoustics, Speech, and
Signal Processing 1993, Minneapolis, MN, USA,
April 1993, vol. 2, pp. 119-122.
[Ney, 1987] H. Ney, D. Mergel, A. Noll, and A. Paeseler, “Data-
driven organization of the dynamic programming
beam search for continuous speech recognition”,
in Proceedings IEEE International Conference on
Acoustics, Speech, and Signal Processing 1987, Dal-
las, TX, USA, April 1987, pp. 833-836.
[Ney, 1997] H. Ney, S. Ortmanns, and I. Lindam, “Extensions to
the word graph method for large vocabulary continu-
ous speech recognition”, in Proceedings IEEE Inter-
national Conference on Acoustics, Speech, and Sig-
nal Processing 1997, Munich, Germany, April 1997,
vol. 4, pp. 1787-1790.
[Neukirchen,2001] C. Neukirchen, D. Klakow, X. Aubert, “Generation
and expansion of Word Graph using Long Span Con-
text Information”, ICASSP 2001, pp. 41-44.
[Noord, 2001] G. van Noord., “Robust Parsing of Word Graphs”,
Robustness in Language and Speech Processing,
Kluwer Academic Publishers, 2001.
[Och, 2000] F. J. Och and H. Ney, “Improved statistical align-
ment models,” in Proc. of the 38th Annual Meet-
ing of the Association for Computational Linguistics,
Hong Kong, China, October 2000, pp. 440–447.
[Och, 2002] F. Och and H. Ney, “Discriminative training and
maximum entropy models for statistical machine
translation,” in ACL02: Proc. of the 40th Annual
Meeting of the Association for Computational Lin-
guistics, PA, Philadelphia, 2002, pp. 295–302.
[Och, 2003] F. J. Och and H. Ney, “A systematic comparison of
various statistical alignment models,” Computational
Linguistics, vol. 29, no. 1, pp. 19–51, 2003.
[Och, 2003a] F.J. Och, ”Minimum Error Rate Training in Statis-
tical Machine Translation”, In Proc. of ACL’2003,
pages 160-167.
[Odell, 1995] J. J. Odell, “The Use of Context in Large Vocab-
ulary Speech Recognition”, Ph.D. thesis, University
of Cambridge, Cambridge, UK, 1995.
[Ortmanns, 1997] S. Ortmanns, H. Ney, and X. Aubert, “A word graph
algorithm for large vocabulary continuous speech
recognition”, Computer, Speech, and Language, vol.
11, no. 1, pp. 43-72, January 1997.
[Paul,2004] M. Paul, H.Nakaiwa, M.Federico, “Towards Innova-
tive Evaluation Methodologies for Speech Transla-
tion”. In Working notes of the NTCIR-4 2004 Meet-
ing. pp. 17-21. 2004. Tokyo.
[Rabiner, 1993] L. Rabiner and B. H. Juang, “Fundamentals of
Speech Recognition”, Prentice Hall Publishers, 1993.
[Schwartz, 1990] R. Schwartz, Y.L. Chow, “The N-Best Algorithm:
An Efficient and Exact Procedure for Finding the
N Most Likely Sentence Hypotheses” in Proceedings
IEEE International Conference on Acoustics, Speech
and Signal Processing 1990, pp. 81-84, Albuquerque,
April 1990.
[Schwartz, 1991] R. Schwartz and S. Austin, “A comparison of several
approximate algorithms for finding multiple (N-best)
sentence hypotheses”, in Proceedings IEEE Interna-
tional Conference on Acoustics, Speech, and Signal
Processing 1991, Toronto, Canada, May 1991, vol. 1,
pp. 701-704.
[Sixtus, 2001] A. Sixtus and H. Ney, “From within-word model
search to acrossword model search in large vocab-
ulary continuous speech recognition”, submitted to
Computer, Speech, and Language, 2001.
[Sixtus, 1999] A. Sixtus and S. Ortmanns, “High quality of word
graphs using forward - backward pruning”, ICASSP
1999, pp. 593-596.
[Shen, 04] L. Shen, A. Sarkar, F. J. Och, “Discriminative Rerank-
ing for Machine Translation”, in Proceedings of HLT-NAACL, 2004.
[Soong, 1991] F.K. Soong, E.F. Huang: “A Tree Trellis Based Fast
Search for Finding the N Best Sentence Hypothesis
in Continuous Speech Recognition”, in Proceedings
IEEE International Conference on Acoustics, Speech
and Signal Processing 1991, pp. 705–708, Toronto,
May 1991.
[Stolcke, 1997] A. Stolcke, Y. Konig, and M. Weintraub, “Explicit
word error rate minimization in N-best list rescor-
ing”, in Proceedings ISCA European Conference
on Speech Communication and Technology 1997,
Rhodes, Greece, September 1997, vol. 2, pp. 163-166.
[Stolcke, 2002] A. Stolcke, “SRILM - an extensible language model-
ing toolkit”, ICSLP 2002.
[Thomas, 1984] Thomas H. Byers, Michael S. Waterman, “Determin-
ing All Optimal and Near-Optimal Solutions when
Solving Shortest Path Problems by Dynamic Pro-
gramming”, Operations Research, Vol. 32, No. 6,
Nov-Dec, 1984.
[Thomas, 2001] Thomas H. Cormen, “Introduction to Algorithms”,
MIT Press, 2001.
[Tran, 1996] B. H. Tran, F. Seide, V. Steinbiss, “A word graph
based N-Best search in continuous speech recogni-
tion”, in Proceedings of ICSLP 1996.
[Valtchev, 1997] V. Valtchev, J. J. Odell, P. C. Woodland, and S. J.
Young, “MMIE training of large vocabulary recogni-
tion systems”, EURASIP/ISCA Speech Communi-
cation, vol. 22, no. 4, pp. 303-314, September 1997.
[Weng, 1999] F. Weng, A. Stolcke, A. Sankar, “Efficient Lattice
Representation and Generation”, in Proceedings of
ICSLP, Sydney, Australia, 1998.
[Wessel, 2002] F. Wessel, “Word Posterior Probabilities for Large
Vocabulary Continuous Speech Recognition”, Ph.D.
thesis, Aachen University of Technology, 2002.
[Ueffing, 2002] N. Ueffing, F.J. Och, H. Ney. ”Generation of Word
Graphs in Statistical Machine Translation”. In Proc.
Conference on Empirical Methods for Natural Lan-
guage Processing, pp. 156-163, Philadelphia, PA,
July 2002.
[Zhang, 2004] R. Zhang et al., “A Unified Approach in Speech-to-
Speech Translation: Integrating Features of Speech
Recognition and Machine Translation”, in Proceedings of COLING, 2004.