
Page 1: Spoken Language Translation

Spoken Language Translation


Page 2: Spoken Language Translation


Page 3: Spoken Language Translation


• Spoken language translation (SLT) directly translates spoken utterances into another language.

• Major components:
  Automatic Speech Recognition (ASR)
  Machine Translation (MT)
  Text-to-Speech (TTS)

Source Speech → [ASR] → Source Sentence → [MT] → Target Sentence → [TTS] → Target Speech

Example: "버스 정류장이 어디에 있나요?" → "Where is the bus stop?"
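A minimal sketch of this cascade in Python; the asr/mt/tts functions are hypothetical stubs standing in for real recognizer, translator, and synthesizer engines, and the stub outputs reuse the example above:

```python
# Minimal sketch of the SLT cascade; asr/mt/tts are hypothetical stubs
# standing in for real recognizer, translator, and synthesizer engines.

def asr(source_speech: bytes) -> str:
    """Recognize source speech into a source-language sentence (stub)."""
    return "버스 정류장이 어디에 있나요?"

def mt(source_sentence: str) -> str:
    """Translate the source sentence into the target language (stub)."""
    return "Where is the bus stop?"

def tts(target_sentence: str) -> bytes:
    """Synthesize target-language speech from text (stub waveform)."""
    return target_sentence.encode("utf-8")

def translate_speech(source_speech: bytes) -> bytes:
    """Loose cascade: each stage consumes only the previous stage's output."""
    return tts(mt(asr(source_speech)))
```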


Page 4: Spoken Language Translation


• In comparison with written language, speech, and especially spontaneous speech, poses additional difficulties for the task of automatic translation. Typically, these difficulties are caused by errors of the speech recognition step, which is carried out before the translation process. As a result, the sentence to be translated is not necessarily well-formed from a syntactic point of view.

• Why a statistical approach for machine translation?
  Even without recognition errors, the structures of spontaneous speech differ from those of written language.
  The statistical approach avoids hard decisions at any level of the translation process.
  For any source sentence, a translated sentence in the target language is guaranteed to be generated.


Page 5: Spoken Language Translation


Coupling ASR to MT

• Motivation
  ASR cannot guarantee an error-free transcription: the 1-best ASR hypothesis could be wrong, so SLT must be designed to be robust to speech recognition errors.
  MT could benefit from the wide range of supplementary information provided by ASR.
  MT quality may depend on the WER of ASR: there is a strong correlation between recognition and translation quality, and the oracle WER decreases within a set of hypotheses.
  Idea: exploit more than one transcription.

• SLT systems vary in the degree to which SMT and ASR are integrated within the overall translation process.


Page 6: Spoken Language Translation


Coupling ASR to MT

• Loose coupling
  SMT uses the ASR output (1-best, N-best, lattice, or confusion network) as its input; the modules communicate one-way.

• Tight coupling
  The whole search space of ASR and MT is integrated.

Loose coupling: Source Speech → [ASR] → 1-best / N-best / Lattice / CN → [SMT] → Target Sentence → [TTS] → Target Speech

Tight coupling: Source Speech → [ASR + SMT] → Target Sentence → [TTS] → Target Speech


Page 7: Spoken Language Translation


Coupling ASR to MT

• Statistical spoken language translation: given a speech input $x_1^T$ in the source language, find the best translation $e_1^I$.
  $F(o)$: the set of transcriptions considered (loose coupling: 1-best, N-best, lattice, or confusion network; tight coupling: the full search space)
  $\Pr(f_1^J, e_1^I \mid x_1^T)$: the speech translation model, combining acoustic and translation features

$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I \mid x_1^T) = \operatorname*{argmax}_{I,\, e_1^I} \max_{f_1^J \in F(o)} \Pr(f_1^J, e_1^I \mid x_1^T)$$


Page 8: Spoken Language Translation


Coupling ASR to MT

• Loose coupling vs. tight coupling

                                 Loose Coupling                   Tight Coupling
Modularity of knowledge sources  Each KS in a stand-alone module  All KSs integrated in a single model
Inter-module communication       Typically one-way (pipelined)    N/A
Scalability                      Easy                             Not easy
Complexity                       Feasible                         Feasible only for very small domains


Page 9: Spoken Language Translation


ASR Outputs

• Automatic speech recognition (ASR) is the process by which an acoustic speech signal is converted into a sequence of words.

• Architecture

Speech Signals → [Feature Extraction] → [Decoding] → ASR outputs (1-best, N-best, lattice, or CN)

The decoder searches a network constructed from three knowledge sources:
  Acoustic Model: HMM estimation from a speech DB
  Pronunciation Model: G2P (grapheme-to-phoneme) conversion
  Language Model: LM estimation from text corpora


Page 10: Spoken Language Translation


ASR Outputs

• Network Structure

• Decoding of HMM-based ASR: searching for the best path in a huge HMM-state lattice

Sentence HMM: a network over word HMMs (e.g. ONE TWO THREE ... in a loop)
Word HMM: a concatenation of phone HMMs (e.g. ONE = W AH N)
Phone HMM: a left-to-right chain of HMM states (1 → 2 → 3)
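To make the search concrete, here is a toy Viterbi decoder over a generic state set; this is an illustrative sketch only (a real decoder works on the composed sentence/word/phone network above and prunes heavily), with all model tables supplied by the caller:

```python
def viterbi(observations, states, log_init, log_trans, log_emit):
    """Best HMM state path for an observation sequence (toy scale).
    log_init[s] and log_trans[p][s] are log-probabilities; log_emit[s] is a
    function returning the log emission score of an observation in state s."""
    V = [{s: log_init[s] + log_emit[s](observations[0]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + log_trans[p][s])
            V[t][s] = V[t - 1][prev] + log_trans[prev][s] + log_emit[s](observations[t])
            back[t][s] = prev
    # 1-best readout: backtrack the stored pointers (it is this pointer
    # table that the next slide says is expensive to keep at state level).
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```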


Page 11: Spoken Language Translation


ASR Outputs

• 1-best
  The best path can be found by backtracking.
  Why a 1-best "word" sequence? Storing the backtracking pointer table for full state sequences takes a lot of memory, so the backtrack pointer usually stores only the previous words before the current word.

• N-best
  Traceback not only from the 1st-best path, but also from the 2nd-best, 3rd-best, etc.
  Methods:
    Directly from the search backtrack pointer table: the exact N-best algorithm, the word-pair N-best algorithm, or A* search using the Viterbi score as a heuristic
    Generate a lattice first, then generate the N-best list from the lattice


Page 12: Spoken Language Translation


ASR Outputs

• Lattice: a word-based lattice
  A compact representation of the state lattice in which only word nodes are involved
  From the decoding backtracking pointer table: record all the links between word nodes
  From an N-best list: becomes a compact representation of the N-best hypotheses


Page 13: Spoken Language Translation


ASR Outputs

• Confusion network (L. Mangu et al., 2000), also called a "sausage network" or "consensus network"
  A weighted directed graph with a start node, an end node, and word labels over its edges
  Each path from the start node to the end node goes through all the other nodes
  Built from a lattice by multiple alignment


Page 14: Spoken Language Translation


Loose Coupling : 1-best

• The best hypothesis produced by the ASR system is passed as text to the MT system.
  Baseline: simple structure, fast translation

• The speech recognition module and the translation module run rather independently, so the combination lacks joint optimality.

• No use of multiple transcriptions: supplementary information easily available from the ASR system is not exploited in the translation process.


Page 15: Spoken Language Translation


Loose Coupling : 1-best

• Structure

Source Speech → [ASR] → 1-best → [SMT] → Target Sentence → [TTS] → Target Speech

Recognition: $\hat{f}_1^{\hat{J}} = \operatorname*{argmax}_{J,\, f_1^J} P(x_1^T \mid f_1^J)\, P(f_1^J)$

Translation: $\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} P(\hat{f}_1^{\hat{J}} \mid e_1^I)\, P(e_1^I)$


Page 16: Spoken Language Translation


Loose Coupling : N-best

• N hypotheses are translated by a text MT decoder and re-ranked according to ASR & SMT scores (R. Zhang et al., 2004)

• Structure

Source Speech $x_1^T$ → [ASR] → N-best → [SMT] → N×M translations → [Rescore] → best translation

The ASR module produces the hypotheses $\hat{f}_1^{\hat{J}}[1], \ldots, \hat{f}_1^{\hat{J}}[N]$; the SMT module translates each $\hat{f}_1^{\hat{J}}[n]$ into the candidates $\hat{e}_1^{\hat{I}}[n,1], \ldots, \hat{e}_1^{\hat{I}}[n,M]$; the rescoring module selects the best translation $\hat{e}_1^{\hat{I}}$ among all N×M candidates.


Page 17: Spoken Language Translation


Loose Coupling : N-best

• ASR module
  Generates the N-best speech recognition hypotheses; $\hat{f}_1^{\hat{J}}[n]$ denotes the n-th best recognition hypothesis.

• SMT module
  Generates the M-best translation hypotheses; $\hat{e}_1^{\hat{I}}[n,m]$ denotes the m-th best translation produced from $\hat{f}_1^{\hat{J}}[n]$.

• Rescore module
  Rescores all N×M translations; this is the key component, a log-linear model.
  Features derived from ASR and SMT are combined in this module to rescore the translation candidates.
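A compact sketch of this N×M rescoring loop, assuming hypothetical hooks: asr_nbest pairs each recognition hypothesis with its ASR feature values, translate stands in for the SMT module returning M candidates with their features, and weights holds the feature weights λ_m:

```python
import math

def rescore(asr_nbest, translate, weights):
    """Pick the best of all N x M candidates under a log-linear model.

    asr_nbest : list of (source_hypothesis, asr_features) pairs (length N)
    translate : hypothetical SMT hook mapping a source hypothesis to a list
                of (translation, smt_features) pairs (length M)
    weights   : dict of feature name -> weight lambda_m"""
    best, best_score = None, -math.inf
    for source, asr_feats in asr_nbest:
        for target, smt_feats in translate(source):
            feats = {**asr_feats, **smt_feats}  # ASR and SMT features combined
            score = sum(weights[name] * value for name, value in feats.items())
            if score > best_score:
                best, best_score = target, score
    return best
```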


Page 18: Spoken Language Translation


Loose Coupling : N-best

• Rescore: log-linear model

$$\hat{E} = \operatorname*{argmax}_{E} \sum_{m=1}^{M} \lambda_m \log P_m(X, E)$$

$$P(E \mid X) = \frac{\exp\left(\sum_{m=1}^{M} \lambda_m f_m(X, E)\right)}{\sum_{E'} \exp\left(\sum_{m=1}^{M} \lambda_m f_m(X, E')\right)}$$

  $E$ ranges over all possible translation hypotheses; $f_m(X, E) = \log P_m(X, E)$ is the m-th feature in log value.
  ASR features: acoustic model, source language model
  SMT features: target language model, phrase translation model, distortion model, length model, ...
  $\lambda_m$: the weight of each feature


Page 19: Spoken Language Translation


Loose Coupling : N-best

• Parameter optimization (F.J. Och, 2003)
  Objective function: choose the feature weights $\hat{\lambda}_1^M$ that optimize the automatic quality metric $D(R_s, E_s)$ over a development corpus.
  $E_s$: the translation output after log-linear model rescoring; $R_s$: the English reference sentences; $D(R_s, E_s)$: an automatic translation quality metric
    BLEU: a weighted geometric mean of the n-gram matches between test and reference sentences, plus a short-sentence penalty
    NIST: an arithmetic mean of the n-gram matches between test and reference sentences
    mWER: multiple-reference word error rate
    mPER: multiple-reference position-independent word error rate


Page 20: Spoken Language Translation


Loose Coupling : N-best

• Parameter optimization: direction set methods
  Starting from an initial λ, perform a local optimization along the current direction, change direction, and repeat, keeping the locally best λ; restarting from different initial λ values yields the best λ under $D(R_s, E_s)$.
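A simplified coordinate-search stand-in for this procedure (a real direction set implementation, such as Powell's method with exact line minimization, is more involved; error is assumed to evaluate the metric D on the rescored development set):

```python
def direction_set_search(error, lam, step=0.1, rounds=5):
    """Toy stand-in for a direction set method: search one direction at a
    time (here the coordinate axes) and keep any lambda that lowers the
    error; 'error' evaluates D(R, E(lambda)) on the development set."""
    lam = list(lam)
    for _ in range(rounds):                  # repeated passes over all directions
        for d in range(len(lam)):            # current search direction
            for delta in (-step, step):      # local line search along it
                cand = list(lam)
                cand[d] += delta
                if error(cand) < error(lam):
                    lam = cand               # keep the locally best lambda
    return lam
```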


Page 21: Spoken Language Translation


Loose Coupling : Lattice

• Lattice-based MT
  Input: word lattices produced by the ASR system
  All models are integrated directly in the decoding process: phrase-based lexica, single-word-based lexica, recognition features.
  Problem: how to translate the word lattices?

• Approaches
  Joint probability approach: WFST (E. Matusov et al., 2005)
  Phrase-based approach: log-linear model (E. Matusov et al., 2005); WFST (L. Mathias et al., 2006)


Page 22: Spoken Language Translation


Loose Coupling : Lattice

• Structure

Source Speech $x_1^T$ → [ASR] → word lattice → [Rescore] → best translation $\hat{e}_1^{\hat{I}}$


Page 23: Spoken Language Translation


Loose Coupling : Lattice

• From the derived decision rule:
  $\Pr(x_1^T \mid f_1^J)$: standard acoustic model
  $\Pr(e_1^I)$: target language model
  $\Pr(f_1^J \mid e_1^I)$: translation model

• Source language model?
  To take into account the requirement for well-formedness of the source sentence, the translation model has to include context dependency on the previous source words.
  This dependency over the whole sentence can be approximated by including a source language model.


Page 24: Spoken Language Translation


Loose Coupling : Lattice (Joint Probability Approach : WFST)

• Joint probability approach
  The conditional probability terms $\Pr(f_1^J \mid e_1^I)$ and $\Pr(e_1^I)$ can be rewritten using a joint probability translation model $\Pr(f_1^J, e_1^I)$:

$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I \mid x_1^T) = \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I) \max_{f_1^J} \Pr(f_1^J \mid e_1^I)\, \Pr(x_1^T \mid f_1^J) = \operatorname*{argmax}_{I,\, e_1^I} \max_{f_1^J} \Pr(f_1^J, e_1^I)\, \Pr(x_1^T \mid f_1^J)$$

  This simplifies coupling the systems: the joint probability translation model can be used instead of the usual LM in ASR.


Page 25: Spoken Language Translation


Loose Coupling : Lattice (Joint Probability Approach : WFST)

• WFST-based joint probability system
  The joint probability MT system is implemented with WFSTs.
  First, the training corpus is transformed, based on a word alignment, into bilingual tuple sequences such as:

    vorrei|I’d_like del|some gelato|ice_cream per|ε favore|please

  Then a statistical m-gram model is trained on this bilingual corpus. This language model is represented as a finite-state transducer, which is the final translation model.
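A sketch of this corpus transformation and m-gram estimation, under a simplified one-word-to-one-target alignment view (the real system handles general alignments and compiles the model into a WFST):

```python
from collections import Counter

def to_tuples(src_words, alignment):
    """Turn an aligned sentence into a bilingual tuple sequence such as
    'vorrei|I'd_like' ... — here under a simplified view where 'alignment'
    maps each source position to its target word (or to 'eps')."""
    return [f"{w}|{alignment.get(i, 'eps')}" for i, w in enumerate(src_words)]

def train_bigram(tuple_corpus):
    """Relative-frequency bigram model over bilingual tuples (m = 2 for
    brevity); the real system encodes the m-gram model as a transducer."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in tuple_corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return {bg: count / unigrams[bg[0]] for bg, count in bigrams.items()}
```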


Page 26: Spoken Language Translation


Loose Coupling : Lattice (Joint Probability Approach : WFST)

• WFST-based joint probability system
  Searching for the best target sentence is done in the composition of the input, represented as a WFST, with the translation transducer.
  Coupling the FSA system with ASR is simple: the ASR output, represented as a WFST, can be used directly as input to the MT search.
  Features: only the acoustic and translation probabilities.
  The source LM scores are not included; the joint m-gram translation probability serves as a source LM.


Page 27: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)

• Probability distributions are represented as features in a log-linear model.
  The translation model probability is decomposed into several probabilities.
  Acoustic model and source language model probabilities are also included.
  For a hypothesized recognized source sentence $f_1^J$ and a hypothesized translation $e_1^I$, let $k \to (j_k, i_k)$, $k = 1, \ldots, K$ be a monotone segmentation of the sentence pair into K bilingual phrases.


Page 28: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)

• Features
  The m-gram target language model
  The phrasal lexicon models: the phrase translation probabilities are computed as a log-linear interpolation of the relative frequencies
  The single-word-based lexicon models


Page 29: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)

• Features (cont'd)
  c1, c2: word and phrase penalty features
  The recognition model: the acoustic model probability and the m-gram source language model probability

• Optimization
  All features are scaled with a set of exponents λ = λ1, ..., λ7 and μ = μ1, μ2.
  The scaling factors are optimized iteratively in a minimum error training framework by performing 100 to 200 translations of a development set.
  The criteria: WER, BLEU, mWER, mPER

Page 30: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)

• Practical aspects of lattice translation
  Generation of word lattices
    In a first step, all entities that are not spoken words are mapped onto the empty arc label ε.
    The time information is not used, so it is removed from the lattices.
    The structure is compressed by applying ε-removal, determinization, and minimization; this step significantly reduces runtime without changing the results.
  Phrase extraction (see the sketch below)
    The number of different phrase pairs is very large, and candidate phrase pairs have to be kept in main memory.
    In the case of ASR word lattice input, the lattice for each test utterance is traversed, and only phrases which match sequences of arcs in the lattice are extracted.
    Thus only phrases which can actually be used in translation are loaded.
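A sketch of that extraction-and-filtering step, assuming a toy lattice encoded as node → [(word, next_node)] arcs and an in-memory phrase table (both hypothetical formats; real lattices are weighted WFSTs):

```python
def lattice_phrases(lattice, max_len=4):
    """All word sequences (up to max_len) that follow consecutive arcs.
    lattice : dict node -> list of (word, next_node) arcs (toy format)."""
    phrases = set()

    def walk(node, prefix):
        if prefix:
            phrases.add(tuple(prefix))
        if len(prefix) == max_len:
            return
        for word, nxt in lattice.get(node, []):
            walk(nxt, prefix + [word])

    for node in lattice:
        walk(node, [])
    return phrases

def filter_phrase_table(phrase_table, lattice):
    """Keep only phrase pairs whose source side occurs in the lattice,
    so that only phrases usable in translation are loaded into memory."""
    usable = lattice_phrases(lattice)
    return {src: tgt for src, tgt in phrase_table.items() if src in usable}
```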


Page 31: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : Log-linear Model)

• Practical aspects of lattice translation (cont'd)
  Pruning
    A high-density word lattice as input leads to an enormous search space, so pruning is necessary.
    Coverage pruning and histogram pruning, based on the total cost of a hypothesis
    It may also be necessary to prune the input word lattices.

• Advantages: the utilization of multiple features, and direct optimization for an objective error measure

• Disadvantages: a less efficient search, and heavy pruning is unavoidable


Page 32: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : WFST)

• Statistical modeling for text translation
  Ω: all foreign phrase sequences that could have generated the foreign text
  The translation system effectively translates phrase sequences, rather than word sequences.
  This is done by first mapping the sentence into all of its phrase sequences.


Page 33: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : WFST)

• Phrase sequence lattice: contains the phrase sequences that can be extracted from the text
  All phrase sequences correspond to the unique foreign sentence.
  Here, a phrase is a sequence of words which can be translated.
  Different phrase sequences lead to different translations.
  The lattice is unweighted.


Page 34: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : WFST)

• Statistical modeling for speech translation
  The target phrase mapping transducer Ω is applied to the foreign-language ASR word lattice L.
  L·Ω: the likely foreign phrase sequences that could have generated the foreign speech
  The translation system still effectively translates phrase sequences, rather than word sequences.
  These are extracted from the ASR lattice, with their ASR scores, rather than from a text sentence.


Page 35: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : WFST)

• Phrase sequence lattice: contains the phrase sequences that can be extracted from the ASR lattice
  Phrase sequences correspond to the translatable word sequences in the lattice.
  The lattice contains weights from the ASR system.
  Translating this foreign phrase lattice is MAP translation of the foreign speech under the generative model.


Page 36: Spoken Language Translation


Loose Coupling : Lattice (Phrase-based Approach : WFST)

• Spoken language translation is recast as an ASR analysis problem in which the goal is to extract translatable foreign-language phrases from ASR word lattices.
  Step 1. Perform foreign-language ASR to generate a foreign-language word lattice L.
  Step 2. Analyze the foreign-language word lattice and extract the phrases to be translated.
  Step 3. Build the target-language phrase mapping transducer Ω.
  Step 4. Compose L and Ω to create the foreign-language ASR phrase lattice L·Ω.
  Step 5. Translate the foreign-language phrase lattice.

• ASR and MT must be very compatible for this approach.


Page 37: Spoken Language Translation


Loose Coupling : Confusion Network

• CN-based decoder (N. Bertoldi et al., 2005)
  Input: a confusion network, represented as a matrix
  Text vs. CN (example below)
  Problem: how to translate confusion network input?

Text: 나 는 소년 입니다 .  ("I am a boy.")

CN (one column per position; each column lists word alternatives with their posterior probabilities):
  나 1.0 | 는 0.7, 은 0.3 | 소녀 0.6, 소년 0.4 | 입니다 0.5, 입니까 0.3, 합니다 0.2 | . 0.8, ? 0.2
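The same CN as a data structure, with the consensus (per-column argmax) reading; this is a sketch of the matrix representation only, not of the decoder:

```python
# The CN above as a list of columns; each column maps a word alternative
# to its posterior probability.
cn = [
    {"나": 1.0},
    {"는": 0.7, "은": 0.3},
    {"소녀": 0.6, "소년": 0.4},
    {"입니다": 0.5, "입니까": 0.3, "합니다": 0.2},
    {".": 0.8, "?": 0.2},
]

def consensus(cn):
    """Per-column argmax reading of the CN."""
    return [max(column, key=column.get) for column in cn]

# Note the consensus picks 소녀 ("girl") although 소년 ("boy") was spoken --
# exactly the kind of recognition ambiguity the CN decoder must handle.
print(" ".join(consensus(cn)))
```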


Page 38: Spoken Language Translation


Loose Coupling : Confusion Network

• Solution: simple! A CN-based SLT decoder can be developed starting from a phrase-based SMT decoder; it is substantially the same as the phrase-based SMT decoder apart from the way the input is managed.

• Comparison with N-best methods
  N-best decoder: does not take advantage of the overlaps among the N-best hypotheses
  CN decoder: exploits the overlaps among hypotheses


Page 39: Spoken Language Translation


Loose Coupling : Confusion Network

• Phrase-based translation model
  Phrase: a sequence of consecutive words
  Alignment: a map between the CN and the target phrases, with one word per column aligned with a target phrase
  Search criterion: maximize a log-linear phrase-based model over the CN input


Page 40: Spoken Language Translation


Loose Coupling : Confusion Network

• Log-linear phrase-based translation model
  The conditional distribution is determined through suitable real-valued feature functions, and takes the log-linear parametric form introduced for rescoring above.
  Feature functions: language model, fertility models, distortion models, lexicon model, likelihood of the path within the CN, true length of the path


Page 41: Spoken Language Translation


Loose Coupling : Confusion Network

• Step-wise translation process
  Translation is performed as a step-wise process: each step translates a sub-CN and produces a target phrase.
  The process starts with an empty translation; after each step we obtain a partial translation, and a partial translation is complete once the whole input CN is translated.

• Complexity reduction (see the beam-search sketch below)
  Recombining theories, beam search, reordering constraints, lexicon pruning, confusion network pruning
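A skeleton of the step-wise process with beam pruning, assuming a hypothetical translate_span hook that proposes (target phrase, log score) candidates for a span of CN columns; recombination and the other prunings are omitted:

```python
import heapq

def cn_beam_decode(cn, translate_span, beam_size=8, max_span=3):
    """Step-wise CN translation with beam pruning (monotone coverage).

    translate_span : hypothetical hook mapping a list of CN columns to
                     candidate (target_phrase_words, log_score) pairs;
                     assumed to return at least one candidate per span."""
    beam = [(0.0, 0, ())]  # (log score, columns covered, target words)
    while any(covered < len(cn) for _, covered, _ in beam):
        expanded = []
        for score, covered, words in beam:
            if covered == len(cn):          # complete theory: carry it over
                expanded.append((score, covered, words))
                continue
            for span in range(1, min(max_span, len(cn) - covered) + 1):
                for phrase, s in translate_span(cn[covered:covered + span]):
                    expanded.append((score + s, covered + span, words + tuple(phrase)))
        beam = heapq.nlargest(beam_size, expanded)  # beam pruning
    return list(max(beam)[2])
```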


Page 42: Spoken Language Translation


Loose Coupling : Confusion Network

• Algorithms


Page 43: Spoken Language Translation


Loose Coupling : Confusion Network

• Step-wise translation process


Page 44: Spoken Language Translation


Loose Coupling

                                 1-Best   N-Best   Lattice   CN
Multiple hypotheses?               X        O        O       O
ASR features into MT decoding?     X        X        O       O
Overlaps among hypotheses?         X        O        O       X
Approximation for word lattice?    X        O        X       O


Page 45: Spoken Language Translation


Tight Coupling

• Theory (H. Ney, 1999)

Three factors:
  Pr(e): target language model
  Pr(f|e): translation model
  Pr(x|f): acoustic model

$$\begin{aligned}
\hat{e}_1^{\hat{I}} &= \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I \mid x_1^T) \\
&= \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I)\, \Pr(x_1^T \mid e_1^I) && \text{Bayes' rule} \\
&= \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I) \sum_{f_1^J} \Pr(f_1^J, x_1^T \mid e_1^I) && \text{introduce } f \text{ as a hidden variable} \\
&= \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I) \sum_{f_1^J} \Pr(f_1^J \mid e_1^I)\, \Pr(x_1^T \mid f_1^J, e_1^I) && \text{Bayes' rule} \\
&= \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I) \sum_{f_1^J} \Pr(f_1^J \mid e_1^I)\, \Pr(x_1^T \mid f_1^J) && \text{assume } x \text{ does not depend on the target language} \\
&= \operatorname*{argmax}_{I,\, e_1^I} \Pr(e_1^I) \max_{f_1^J} \Pr(f_1^J \mid e_1^I)\, \Pr(x_1^T \mid f_1^J) && \text{sum to max}
\end{aligned}$$


Page 46: Spoken Language Translation


Tight Coupling

• ASR vs. tight coupling (SLT)
  Brute-force method: instead of incorporating only the LM into the standard Viterbi algorithm, incorporate P(e) and P(f|e); this is very complicated and not feasible.

ASR: Acoustic Model + Source LM
SLT: Acoustic Model + Target LM + Translation Model


Page 47: Spoken Language Translation


Tight Coupling

• WFST-based joint probability system (full integration)
  The ASR search network is a composition of WFSTs: H, the HMM topology; C, the context-dependency; L, the lexicon; and G, the LM.
  To obtain the speech translation search network ST, we only need to replace the source LM transducer G by the translation transducer T:

$$ST = H \circ \det(C \circ \det(L \circ \det(T)))$$

  Result: a small improvement in translation quality, but very slow.
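A sketch of this construction with placeholder compose/det functions standing in for a real WFST toolkit's composition and determinization operations (e.g. OpenFst); only the shape of the recipe is shown:

```python
def compose(a, b):
    """Placeholder for WFST composition (provided by a real toolkit)."""
    raise NotImplementedError

def det(a):
    """Placeholder for WFST determinization."""
    raise NotImplementedError

def asr_network(H, C, L, G):
    """Standard ASR search network: H o det(C o det(L o G))."""
    return compose(H, det(compose(C, det(compose(L, G)))))

def st_network(H, C, L, T):
    """Speech translation network: the same recipe with the LM transducer G
    replaced by the translation transducer T."""
    return compose(H, det(compose(C, det(compose(L, det(T))))))
```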


Page 48: Spoken Language Translation


Tight Coupling

• BLEU scores against lattice density (S. Saleem et al., 2004)
  Improvements from tighter coupling may only be observed when ASR lattices are sparse, i.e. when there are only a few hypothesized words per spoken word in the lattice.
  This would mean that a fully integrated speech translation would not work at all.


Page 49: Spoken Language Translation


Tight Coupling

• Possible issues of tight coupling
  In ASR, the source n-gram LM is already very close to the best configuration.
  The complexity of the algorithm is too high, and approximation is still necessary to make it work.
  The current approaches still have not really implemented tight coupling.

• Conclusion
  The approach seems to be haunted by the very high complexity of constructing the search algorithm.


Page 50: Spoken Language Translation


Reading List

• L. Mangu, E. Brill, A. Stolcke. 2000. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language 14(4), 373-400.

• V. H. Quan, M. Federico, M. Cettolo. 2005. Integrated N-best Re-ranking for Spoken Language Translation. In Proc. EuroSpeech.

• R. Zhang, G. Kikui, H. Yamamoto, T. Watanabe, F. Soong, and W. K. Lo. 2004. A unified approach in speech-to-speech translation: Integrating features of speech recognition and machine translation. In Proc. of Coling 2004.

• F.J. Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of ACL.

• E. Matusov, S. Kanthak, and H. Ney. 2005. On the Integration of Speech Recognition and Statistical Machine Translation. In Proc. Interspeech 2005.

• E. Matusov, H. Ney, R. Schluter. 2005. Phrase-based Translation of Speech Recognizer Word Lattices Using Log-linear Model Combination. In Proc. ASRU 2005.


Page 51: Spoken Language Translation


Reading List

• E. Matusov, H. Ney, R. Schluter. 2006. Integrating Speech Recognition and Machine Translation: Where Do We Stand? In Proc. ICASSP 2006.

• L. Mathias, W. Byrne. 2006. Statistical Phrase-based Speech Translation. In Proc. ICASSP 2006.

• N. Bertoldi, M. Federico. 2005. A New Decoder for Spoken Language Translation Based on Confusion Networks. In Proc. IEEE ASRU Workshop.

• H. Ney. 1999. Speech Translation: Coupling of Recognition and Translation. In Proc. ICASSP.

• S. Saleem, S. C. Jou, S. Vogel, and T. Schultz. 2004. Using Word Lattice Information for a Tighter Coupling in Speech Translation Systems. In Proc. ICSLP.
