Stephan Vogel - Machine Translation 1
Stephan Vogel, Spring Semester 2011
Machine Translation
Minimum Error Rate Training
Stephan Vogel - Machine Translation 2
Overview
- Optimization approaches
  - Simplex
  - MER
- Avoiding local minima
- Additional considerations
  - Tuning towards different metrics
  - Tuning on different development sets
Stephan Vogel - Machine Translation 3
Tuning the SMT System
- We use different models in the SMT system
  - Models have simplifications
  - Models are trained on different amounts of data
- => Models have different levels of reliability, and their scores have different ranges
- => Give a different weight to each model: Q = c1*Q1 + c2*Q2 + ... + cn*Qn
- Find optimal scaling factors (feature weights) c1 ... cn
- Optimal means: highest score for the chosen evaluation metric M, i.e. find (c1, ..., cn) such that M(argmin_e{Q(e,f)}) is high
- Metric M is our objective function
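As a concrete illustration of the weighted combination (a minimal sketch; the model names and score values below are invented, not from the slides):

```python
# Each hypothesis carries one score per model; Q is their weighted sum,
# and the decoder picks the hypothesis minimizing Q (scores act as costs).

weights = {"lm": 1.0, "tm": 0.6, "wc": 2.5}    # scaling factors c1 ... cn

hypotheses = [
    {"text": "hyp A", "lm": 22.0, "tm": 83.0, "wc": 7.0},
    {"text": "hyp B", "lm": 29.0, "tm": 77.0, "wc": 8.0},
]

def total_score(hyp):
    """Q = c1*Q1 + ... + cn*Qn for one hypothesis."""
    return sum(weights[m] * hyp[m] for m in weights)

best = min(hypotheses, key=total_score)        # argmin_e Q(e, f)
print(best["text"], total_score(best))
```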
Stephan Vogel - Machine Translation 4
Problems
- The surface of the objective function is not nice
  - Not convex -> local minima (actually, many local minima)
  - Not differentiable -> gradient descent methods not readily applicable
- There may be dangerous areas ('boundary cliffs'), where a small change in the weights has a big effect
- Example: tune on a dev set with short reference translations
  - Optimization leads towards short translations
  - A new test set has long reference translations
  - The translations are now too short -> length penalty
Stephan Vogel - Machine Translation 5
Brute Force Approach – Manual Tuning
- Decode with different scaling factors
  - Get a feeling for the range of good values
  - Get a feeling for the importance of the models
    - The LM is typically most important
    - Sentence length (word count feature) balances the shortening effect of the LM
    - Word reordering is more or less effective depending on the language
- Narrow down the range in which the scaling factors are tested
- Essentially multi-linear optimization
- Works well for a small number of models
- Time consuming (CPU-wise) if decoding takes a long time
Stephan Vogel - Machine Translation 6
Automatic Tuning
- Many algorithms for finding (near) optimal solutions are available
  - Simplex
  - Powell (line search)
  - MIRA (Margin Infused Relaxed Algorithm)
  - Specially designed minimum error rate training (Och 2003)
  - Genetic algorithms
- Note: the models themselves are not improved, only their combination
- Note: some parameters change the performance of the decoder but are not part of Q
  - Number of alternative translations
  - Beam size
  - Word reordering restrictions
Stephan Vogel - Machine Translation 7
Automatic Tuning on N-best List
- Optimization algorithms need many iterations – too expensive to run full translations
- => Use n-best lists, e.g. 1000 translations for each of 500 source sentences
  - Changing the scaling factors results in a re-ranking of the n-best lists
  - Evaluate the new 1-best translations
- Apply any of the standard optimization techniques
- Advantage: much faster
  - The counts (e.g. n-gram matches) can be pre-calculated for each translation to speed up evaluation (see the sketch below)
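A sketch of what the inner evaluation loop can look like (the data layout, per-sentence lists of (feature_vector, error_count) pairs with the error counts pre-computed once, is an assumption for illustration):

```python
def corpus_error(nbest, weights):
    """Re-rank every n-best list under the given weights and return the
    total error of the resulting 1-best translations. nbest[i] is the
    list of (feature_vector, error_count) pairs for source sentence i."""
    total = 0.0
    for hyps in nbest:
        # New weights only change the ranking; the lists stay fixed.
        _, err = min(hyps,
                     key=lambda h: sum(w * f for w, f in zip(weights, h[0])))
        total += err
    return total

# Example with two sentences, two features, invented numbers:
nbest = [
    [((2.0, 1.0), 3.0), ((1.0, 3.0), 1.0)],
    [((0.5, 0.5), 0.0), ((2.0, 0.0), 2.0)],
]
print(corpus_error(nbest, [1.0, 1.0]))   # error of the 1-best selections
```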
Stephan Vogel - Machine Translation 8
Simplex (Nelder-Mead)
- Start with n+1 random configurations
- Get the 1-best translation for each configuration -> objective function
- Sort the points xk according to the objective function: f(x1) < f(x2) < ... < f(xn+1)
- Calculate x0 as the center of gravity of x1 ... xn
- Replace the worst point with a point reflected through the centroid: xr = x0 + r * (x0 - xn+1)
Stephan Vogel - Machine Translation 9
Demo
- Obviously, we need to change the size of the simplex to enforce convergence
- We also want to adjust the step size
  - If the new point is the best point – increase the step size
  - If the new point is worse than x1 ... xn – decrease the step size
Stephan Vogel - Machine Translation 10
Expansion and Contraction
- Reflection: calculate xr = x0 + r * (x0 - xn+1)
  - If f(x1) <= f(xr) < f(xn), replace xn+1 with xr; next iteration
  - If f(xr) < f(x1), expand; otherwise contract
- Expansion: the reflected point is better than the best, i.e. f(xr) < f(x1)
  - Calculate xe = x0 + e * (x0 - xn+1)
  - If f(xe) < f(xr), replace xn+1 with xe, else replace xn+1 with xr
  - Next iteration
- Contraction: the reflected point has f(xr) >= f(xn)
  - Calculate xc = xn+1 + c * (x0 - xn+1)
  - If f(xc) <= f(xn+1), replace xn+1 with xc; else shrink
- Shrinking: for all xk, k = 2 ... n+1: xk = x1 + s * (xk - x1); next iteration
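Putting the four rules together, a self-contained sketch of the optimizer (the coefficient defaults r = 1, e = 2, c = 0.5, s = 0.5 are the common textbook choices; in our setting f would map a weight vector to the dev-set error of the re-ranked 1-best translations):

```python
import random

def nelder_mead(f, dim, iters=200, r=1.0, e=2.0, c=0.5, s=0.5):
    """Nelder-Mead with the reflection / expansion / contraction /
    shrinking rules above."""
    # n+1 random starting configurations
    simplex = [[random.uniform(-1, 1) for _ in range(dim)]
               for _ in range(dim + 1)]
    for _ in range(iters):
        simplex.sort(key=f)                 # f(x1) <= ... <= f(xn+1)
        centroid = [sum(p[j] for p in simplex[:-1]) / dim
                    for j in range(dim)]    # x0, ignoring the worst point
        worst = simplex[-1]
        xr = [x0 + r * (x0 - w) for x0, w in zip(centroid, worst)]
        if f(simplex[0]) <= f(xr) < f(simplex[-2]):
            simplex[-1] = xr                # reflection
        elif f(xr) < f(simplex[0]):         # expansion
            xe = [x0 + e * (x0 - w) for x0, w in zip(centroid, worst)]
            simplex[-1] = xe if f(xe) < f(xr) else xr
        else:                               # contraction
            xc = [w + c * (x0 - w) for x0, w in zip(centroid, worst)]
            if f(xc) <= f(worst):
                simplex[-1] = xc
            else:                           # shrink everything towards x1
                best = simplex[0]
                simplex = [best] + [[b + s * (x - b)
                                     for b, x in zip(best, p)]
                                    for p in simplex[1:]]
    return min(simplex, key=f)

# Example: minimize a simple quadratic in 3 dimensions
print(nelder_mead(lambda x: sum((xi - 1.0) ** 2 for xi in x), dim=3))
```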
Stephan Vogel - Machine Translation 11
Changing the Simplex
[Figure: how the simplex changes – reflection, expansion, and contraction move xn+1 through the centroid x0; shrinking pulls all points towards x1.]
Stephan Vogel - Machine Translation 12
Powell Line Search
- Select directions in the search space, then:
  - Loop until convergence
    - Loop over the directions d
      - Perform a line search along direction d until convergence
- Many variants
  - Selection of directions: easiest is to use the model scores, or to combine multiple scores
  - Step size in the line search
- MER (Och 2003) is a line search along the models with a smart selection of steps (see the sketch below)
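A sketch of the outer search loop (simplified: the directions here are just the coordinate axes, one per model score, and the line search is a crude fixed-step walk; Powell proper also constructs new, conjugate directions):

```python
def coordinate_line_search(f, x, step=0.1, sweeps=20):
    """Repeatedly line-search along each direction until f stops improving."""
    x = list(x)
    for _ in range(sweeps):
        improved = False
        for d in range(len(x)):             # loop over directions
            for delta in (step, -step):     # crude bidirectional search
                while True:
                    cand = list(x)
                    cand[d] += delta
                    if f(cand) < f(x):      # keep walking while f improves
                        x, improved = cand, True
                    else:
                        break
        if not improved:                    # converged for all directions
            break
    return x

# Example: same quadratic as before
print(coordinate_line_search(lambda v: sum((vi - 1.0) ** 2 for vi in v),
                             [0.0, 0.0, 0.0]))
```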
Stephan Vogel - Machine Translation 13
Minimum Error Training
- For each hypothesis we have Q = sum_k ck * Qk
- Select one model k: Q = ck * Qk + sum_{n != k} cn * Qn = ck * Qk + QRest

[Figure: the total model score as a linear function of ck; QRest is the intercept and the individual model score Qk gives the slope. Each hypothesis line is labeled with its metric score, e.g. WER = 8.]
Stephan Vogel - Machine Translation 14
Minimum Error Training
- Source sentence 1: depending on the scaling factor ck, different hyps are in the 1-best position
- Set ck such that the metric-best hyp is also the model-best hyp

[Figure: model score lines over ck for hypotheses h11 (WER = 8), h12 (WER = 5), and h13 (WER = 4); the model-best hypothesis changes from h11 to h12 to h13 as ck increases.]
Stephan Vogel - Machine Translation 15
Minimum Error Training
- Select a minimum number of evaluation points
  - Calculate the intersection points
  - Keep an intersection point only if the hyps involved are minimal at that point
  - Choose the evaluation points between the intersection points

[Figure: the same score lines for h11 (WER = 8), h12 (WER = 5), and h13 (WER = 4); only the intersection points on the lower envelope delimit the intervals in which the 1-best hypothesis changes.]
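The figures suggest the following procedure; here is a sketch for one sentence and one model k (the data layout, hypotheses as (slope, intercept, error) tuples with slope = Qk and intercept = QRest, is assumed for illustration):

```python
# Q(ck) = intercept + ck * slope, so every hypothesis is a line over ck,
# and the model-best hypothesis at any ck lies on the lower envelope of
# these lines (we minimize Q).

def envelope(hyps):
    """Return the lower envelope as [(left_boundary, hyp), ...], where
    left_boundary is the ck from which hyp is model-best (the first
    entry starts at -infinity). The boundaries are exactly the
    intersection points that need to be considered."""
    # The largest slope is lowest as ck -> -inf, so it starts the envelope;
    # among parallel lines only the one with the smaller intercept survives.
    hyps = sorted(hyps, key=lambda h: (-h[0], h[1]))
    env, bounds = [], []
    for h in hyps:
        if env and env[-1][0] == h[0]:
            continue                        # parallel, never strictly lower
        while env:
            # ck where h crosses the last envelope line
            ck = (h[1] - env[-1][1]) / (env[-1][0] - h[0])
            if bounds and ck <= bounds[-1]:
                env.pop(); bounds.pop()     # that line is never 1-best
            else:
                bounds.append(ck)
                break
        env.append(h)
    return list(zip([float("-inf")] + bounds, env))

# Sentence 1 from the figure (slopes/intercepts invented, errors as shown):
hyps = [(2.0, 0.0, 8), (1.0, 0.5, 5), (0.5, 1.5, 4)]
for left, (_, _, err) in envelope(hyps):
    print(f"from ck = {left:.3f}: 1-best error = {err}")
```

With the invented line parameters above, the output reproduces the figure: h11 (error 8) is 1-best up to the first intersection point, then h12 (error 5), then h13 (error 4).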
Stephan Vogel - Machine Translation 16
Minimum Error Training
- Source sentence 1, now with different error scores: optimization would find a different ck
- => Different metrics lead to different scaling factors

[Figure: the same score lines, now with h11 (WER = 8), h12 (WER = 2), h13 (WER = 4); the interval in which h12 is 1-best now gives the lowest error.]
Stephan Vogel - Machine Translation 17
Minimum Error Training
- Sentence 2: the best ck lies in a different range
- No matter which ck, h22 would never be 1-best

[Figure: score lines for h21 (WER = 2), h22 (WER = 0), and h23 (WER = 5); h22 never reaches the 1-best position, so only h21 and h23 appear on the envelope.]
Stephan Vogel - Machine Translation 18
Minimum Error Training
- Multiple sentences: merge the intersection points of all sentences and sum the error counts of the interval-wise 1-best hypotheses; pick ck inside the interval with the lowest total

[Figure: the error counts of both sentences combined over ck; the corpus-level totals across the intervals are 10, 7, 10, and 9, so ck is chosen in the second interval, where h12 and h21 are 1-best.]
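Continuing the sketch from above (this reuses envelope() from the previous block), the corpus-level line search merges the per-sentence intersection points and tracks the total error across intervals:

```python
def best_ck(all_hyps):
    """all_hyps holds one hypothesis list per source sentence; returns
    (ck, corpus_error) for the interval with the lowest total error."""
    events = []                              # (boundary_ck, change in error)
    err = 0.0
    for hyps in all_hyps:
        env = envelope(hyps)                 # [(left_boundary, hyp), ...]
        err += env[0][1][2]                  # error of the leftmost 1-best
        for (b, new), (_, old) in zip(env[1:], env):
            events.append((b, new[2] - old[2]))
    events.sort()
    best_err = err
    best = (events[0][0] - 1.0) if events else 0.0   # left of all boundaries
    for i, (b, delta) in enumerate(events):
        err += delta                         # error changes at each boundary
        right = events[i + 1][0] if i + 1 < len(events) else b + 1.0
        if err < best_err:                   # evaluation point: the midpoint
            best_err, best = err, (b + right) / 2.0
    return best, best_err

# The two sentences from the figures (line parameters invented):
sent1 = [(2.0, 0.0, 8), (1.0, 0.5, 5), (0.5, 1.5, 4)]
sent2 = [(2.0, 0.5, 2), (1.0, 2.0, 0), (0.25, 2.0, 5)]
print(best_ck([sent1, sent2]))
```

With the line parameters chosen here, the interval totals come out as 10, 7, 10, and 9, matching the figure, and the midpoint of the second interval (total error 7) is returned.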
Stephan Vogel - Machine Translation 19
Iterate Decoding - Optimization
- The n-best list is a (very restricted) substitute for the search space
  - With updated feature weights we may have generated other (better) translations
  - Some of the hyps now in the n-best list would have been pruned before
- Iterate (see the sketch below):
  - Re-translate with the new feature weights
  - Merge the new translations with the old translations (increases stability)
  - Run the optimizer over the larger n-best lists
  - Repeat until no new translations are produced, or the improvement is < epsilon, or simply k times (typically 5-10 iterations)
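A sketch of this outer loop (decode and optimize stand in for whatever decoder and n-best optimizer are in use; hypotheses are assumed to be hashable, e.g. tuples, so that merging can deduplicate):

```python
def tune(decode, optimize, dev_src, weights, max_iter=8, epsilon=1e-4):
    """decode(sentence, weights) -> n-best list for one sentence;
    optimize(pools, weights) -> (new_weights, dev_error)."""
    pools = [set() for _ in dev_src]         # accumulated n-best lists
    prev_err = float("inf")
    for _ in range(max_iter):
        grew = False
        for pool, sent in zip(pools, dev_src):
            before = len(pool)
            pool.update(decode(sent, weights))   # merge new with old hyps
            grew |= len(pool) > before
        weights, err = optimize(pools, weights)
        # stop if nothing new was generated or the gain is below epsilon
        if not grew or prev_err - err < epsilon:
            break
        prev_err = err
    return weights
```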
Stephan Vogel - Machine Translation 20
Avoiding Local Minima
- Optimization can get stuck in a local minimum
- Remedies
  - Fiddle around with the parameters of your optimization algorithm
  - Larger n-best lists -> more evaluation points
  - Combine with a simulated-annealing-type approach (Smith & Eisner, 2007)
  - Restart multiple times (see the sketch below)
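Restarting is trivial to wrap around any of the optimizers above; a sketch (optimizer(objective, x0) -> x is an assumed interface):

```python
import random

def with_restarts(objective, optimizer, dim, restarts=20):
    """Run the optimizer from many random starting points and keep the
    best result; simple, but an effective way around local minima."""
    best, best_err = None, float("inf")
    for _ in range(restarts):
        x0 = [random.uniform(-1.0, 1.0) for _ in range(dim)]
        x = optimizer(objective, x0)
        if objective(x) < best_err:
            best, best_err = x, objective(x)
    return best
```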
Stephan Vogel - Machine Translation 21
Random Restarts
- Comparison of Simplex and Powell (Alok, unpublished)
- Comparison of Simplex, extended Simplex, and MER (Bing Zhao, unpublished)
- Observations:
  - Alok: Simplex is 'jumpier' than Powell
  - Bing: Simplex is better than MER
  - Both: you need many restarts
Stephan Vogel - Machine Translation 22
Optimizing NOT Towards References
- Ideally, we want system output that is identical to the reference translations
- But there is no guarantee that the system can generate the reference translations (under realistic conditions)
  - E.g. we restrict the reordering window
  - We have unknown words
  - The reference translations may contain words unknown to the system
- Instead of forcing the decoder towards the reference translations, optimize towards the best translations the system can generate
  - Find the hypotheses with the best metric score
  - Use those as pseudo references
  - Optimize towards the pseudo references (see the sketch below)
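A sketch of the pseudo-reference selection (metric_score is an assumed per-sentence scoring function where higher is better, e.g. a smoothed sentence-level BLEU against the real references):

```python
def pseudo_references(nbest, metric_score):
    """For each sentence, take the metric-best hypothesis from the
    n-best list as the tuning target instead of the real reference."""
    return [max(hyps, key=metric_score) for hyps in nbest]
```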
Stephan Vogel - Machine Translation 23
Optimizing Towards Different Metrics
- Automatic metrics have different characteristics
- Optimizing towards one metric does not mean that the other metric scores will also go up
- In particular, different metrics prefer shorter or longer translations
  - Typically: TER < BLEU < METEOR (where '<' means 'prefers shorter translations')
- Mauser et al. (2007) on the Ch-En NIST 2005 test set
  - Reasonably well behaved
  - But the resulting translation length differs by more than 15%
Stephan Vogel - Machine Translation 24
Generalization to other Test Sets
- Optimize on one set, test on multiple other sets
- Again Mauser et al., Ch-En
  - Shown is the behavior over the Simplex optimization iterations
  - Nice, nearly parallel development of the metric scores
- However, we have also observed brittle behavior
  - Esp. when the ratio src_length / ref_length is very different between the dev and eval test sets
Stephan Vogel - Machine Translation 25
Large Weight = Important Feature?
- Assume we have cLM = 1.0, cTM = 0.55, cWC = 3.2
- Which feature is most important? Cannot say!
  - We want to re-rank the n-best lists
  - The feature weights scale the feature values such that they can compete
- Example (table below): the variation in the LM and TM scores is larger than for WC
  - A large weight for WC is needed to make its small differences effective
- To know whether a feature is important, remove it and look at the drop in metric score
      QLM   QTM   QWC   Q
H1    22    83    7     112
H2    29    77    8     116
H3    26    85    9     120
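The point can be read directly off the table; a tiny sketch with the table's numbers:

```python
# The spreads of the feature scores across the n-best list show why the
# word-count feature needs the large weight.
scores = {"QLM": [22, 29, 26], "QTM": [83, 77, 85], "QWC": [7, 8, 9]}

for name, vals in scores.items():
    print(name, "spread:", max(vals) - min(vals))   # QLM 7, QTM 8, QWC 2
# With equal weights the WC differences (spread 2) would be drowned out
# by LM and TM (spreads 7 and 8); scaling WC up lets it compete.
```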
Stephan Vogel - Machine Translation 26
Open Issues
- Shouldn't all optimizers arrive at the same results, if done right?
  - The models are the same; it's just a matter of finding the right mix
  - If local minima can be avoided, similarly good optima should be found
- How to stay safe?
  - Avoid good optima close to 'cliffs'
  - If different configurations give very similar metric scores, pick the one that is more stable
- One hat fits all? Why one set of feature weights? How about different sets for:
  - Good/bad translations (tuning on the tail: mixed results so far)
  - Short/long sentences
  - Beginning/middle/end of sentence
  - ...
Stephan Vogel - Machine Translation 27
Summary
- Optimize the system by modifying the scaling factors (feature weights)
- Different optimization approaches can be used
  - Simplex and Powell are most common
  - MERT (Och) is similar to Powell, with pre-calculation of the grid points
- There are many local optima; avoid getting stuck early
  - Most effective: many restarts
- Generalization
  - To unseen test data: mostly OK, but sometimes the selection of the dev set has a big impact (length penalty!)
  - To different metrics: reasonably stable (the metrics are reasonably correlated in most cases)
- Still open questions => more research needed