June 2004 DARPA TIDES MT Workshop
Measuring Confidence Intervals for MT Evaluation Metrics
Ying Zhang, Stephan Vogel
Language Technologies Institute, Carnegie Mellon University
MT Evaluation Metrics
• Human Evaluations (LDC)
– Fluency and Adequacy
• Automatic Evaluation Metrics
– mWER: edit distance between the hypothesis and the closest reference translation
– mPER: position-independent error rate
– BLEU: $\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$
– Modified BLEU: $\mathrm{M\text{-}BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, with modified n-gram precisions $p_n$
– NIST: $\mathrm{NIST} = BP' \cdot \sum_{n=1}^{N} \frac{\sum_{\text{all } w_1 \ldots w_n \text{ that co-occur}} \mathrm{Info}(w_1 \ldots w_n)}{\sum_{\text{all } w_1 \ldots w_n \text{ in hyp}} 1}$, where $\mathrm{Info}(w_1 \ldots w_n) = \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}$
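The BLEU formula above can be written out directly. This is a minimal sketch that takes n-gram precisions and a brevity penalty as inputs; the numbers are illustrative, not taken from the slides:

```python
import math

def bleu(precisions, bp, weights=None):
    """BLEU = BP * exp(sum_n w_n * log p_n); uniform w_n = 1/N by default."""
    n = len(precisions)
    weights = weights or [1.0 / n] * n
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Illustrative 1..4-gram precisions with no length penalty (BP = 1).
# With uniform weights this is the geometric mean of the precisions.
score = bleu([0.6, 0.4, 0.3, 0.2], bp=1.0)
```

With uniform weights the exponential of the averaged logs reduces to the geometric mean, here $(0.6 \cdot 0.4 \cdot 0.3 \cdot 0.2)^{1/4}$.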
Measuring the Confidence Intervals
• One score per test set
• How accurate is this score?
• Measuring a confidence interval requires a population of scores
• Building test sets with multiple human reference translations is expensive
• Bootstrapping (Efron 1986)
– Introduced in 1979 as a computer-based method for estimating the standard errors of statistical estimates
– Resampling: creating an artificial population by sampling with replacement
– Proposed by Franz Och (2003) for measuring confidence intervals for automatic MT evaluation metrics
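The resampling recipe can be sketched as a percentile bootstrap over per-sentence scores. This is an illustration of the idea, not the authors' released script; the scores stand in for whatever per-sentence statistic the metric aggregates:

```python
import random
import statistics

def bootstrap_ci(scores, b=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score.

    Resample the test set with replacement b times, score each
    artificial test set, and read the interval off the empirical
    distribution of the b replicate scores.
    """
    rng = random.Random(seed)
    n = len(scores)
    replicates = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(b)
    )
    lo = replicates[int(b * alpha / 2)]          # 2.5th percentile
    hi = replicates[int(b * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lo, hi
```

Sampling with replacement is what makes each replicate an "artificial population" drawn from the one test set we have.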
A Schematic of the Bootstrapping Process
[Figure: schematic of the bootstrapping process; resampled test sets yield a distribution of scores around Score0]
An Efficient Implementation
• Translate and evaluate on 2,000 test sets? – No way!
• Resample the n-gram precision information for the sentences
– Most MT systems are context independent at the sentence level
– MT evaluation metrics are based on information collected for each test sentence
– E.g. for BLEU and NIST: RefLen: 61 52 56 59  ClosestRefLen: 56  1-gram: 56 46 428.41
– Similar for human judgments and other MT metrics
• Approximation for NIST information gain
• Scripts available at: http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
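One way to sketch this sentence-level trick: store per-sentence sufficient statistics (n-gram matches, totals, closest reference length) once, then re-aggregate them for each bootstrap replicate instead of re-translating. The field names and numbers below are made up for illustration, and the BLEU aggregation is the standard corpus-level formula, not the authors' exact script:

```python
import math
import random

def corpus_bleu(stats):
    """Corpus BLEU from per-sentence sufficient statistics.

    Each entry holds 1..4-gram matched/total counts and the closest
    reference length, so a replicate only needs to re-sum them.
    """
    match, total = [0] * 4, [0] * 4
    hyp_len = ref_len = 0
    for s in stats:
        for k in range(4):
            match[k] += s["match"][k]
            total[k] += s["total"][k]
        hyp_len += s["total"][0]   # 1-gram total = hypothesis length
        ref_len += s["ref_len"]
    logp = sum(math.log(match[k] / total[k]) for k in range(4)) / 4
    bp = min(1.0, math.exp(1 - ref_len / hyp_len))  # brevity penalty
    return bp * math.exp(logp)

def bootstrap_scores(stats, b=2000, seed=0):
    """Resample sentences (not translations!) and rescore each replicate."""
    rng = random.Random(seed)
    n = len(stats)
    return [corpus_bleu(rng.choices(stats, k=n)) for _ in range(b)]
```

Rescoring from cached counts is cheap, which is what makes B = 2000 replicates practical.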
Confidence Intervals
• 7 MT systems from June 2002 evaluation
• Observations:
– Relative confidence interval: NIST < M-BLEU < BLEU
– I.e. NIST scores have more discriminative power than BLEU
Are Two MT Systems Different?
• Comparing two MT systems' performance
– Using a similar method as for a single system
– E.g. Diff(Sys1-Sys2): Median = -1.7355, interval [-1.9056, -1.5453]
– If the confidence interval of the difference overlaps 0, the two systems are not significantly different
– M-BLEU and NIST have more discriminative power than BLEU
– Automatic metrics correlate highly with the human ranking
– Human judges prefer system E (syntactic system) over B (statistical system), but automatic metrics do not
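A paired comparison along these lines can be sketched by resampling the same sentence indices for both systems, so sentence difficulty cancels out. The helper name and the per-sentence scores are hypothetical:

```python
import random
import statistics

def paired_bootstrap_diff(scores1, scores2, b=2000, alpha=0.05, seed=0):
    """Bootstrap CI for the mean score difference of two systems.

    Both systems are scored on the same resampled sentences (paired
    resampling). If the resulting interval covers 0, the systems are
    not significantly different at the chosen level.
    """
    rng = random.Random(seed)
    n = len(scores1)
    diffs = []
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]  # one shared resample
        diffs.append(statistics.fmean(scores1[i] - scores2[i] for i in idx))
    diffs.sort()
    return diffs[int(b * alpha / 2)], diffs[int(b * (1 - alpha / 2)) - 1]
```

Resampling the difference directly, rather than intersecting two independent intervals, is what gives the test its power.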
How much testing data is needed?
[Figure: four panels plotting NIST scores, BLEU scores, M-BLEU scores, and F+A human judgments for systems A-G against the percentage of testing data size (10% to 100%)]
How much testing data is needed? (cont.)
• NIST scores increase steadily with the growing test set size
• The distance between the scores of the different systems remains stable when using 40% or more of the test set
• The confidence intervals become narrower for larger test sets
* System A, (Bootstrap Size B=2000)
How many reference translations are sufficient?
• Confidence intervals become narrower with more reference translations
• [100%](1-ref)~[80~90%](2-ref)~[70~80%](3-ref)~[60%~70%](4-ref)
• One additional reference translation compensates for 10~15% of the testing data
* System A, (Bootstrap Size B=2000)
Bootstrap-t interval vs. normal/t interval
• Normal distribution: assuming $Z = \frac{\hat{\theta} - \theta}{\hat{se}} \sim N(0, 1)$, the interval is $[\hat{\theta} - z^{(1-\alpha)}\hat{se},\ \hat{\theta} - z^{(\alpha)}\hat{se}]$
• Student's t-interval (when n is small): assuming $Z = \frac{\hat{\theta} - \theta}{\hat{se}} \sim t_{n-1}$, the interval is $[\hat{\theta} - t^{(1-\alpha)}_{n-1}\hat{se},\ \hat{\theta} - t^{(\alpha)}_{n-1}\hat{se}]$
• Bootstrap-t interval
– For each bootstrap sample $b$, calculate $Z^*(b) = \frac{\hat{\theta}^*(b) - \hat{\theta}}{\hat{se}^*(b)}$
– The $\alpha$-th percentile is estimated by the value $\hat{t}^{(\alpha)}$ such that $\#\{Z^*(b) \le \hat{t}^{(\alpha)}\} / B = \alpha$
– The bootstrap-t interval is $[\hat{\theta} - \hat{t}^{(1-\alpha)}\hat{se},\ \hat{\theta} - \hat{t}^{(\alpha)}\hat{se}]$
– E.g. if B = 1000, the 50th largest and the 950th largest $Z^*$ values give the bootstrap-t interval
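The bootstrap-t recipe can be sketched for the mean of hypothetical per-sentence scores. With alpha = 0.05 this uses the 5th/95th percentiles of the studentized values, matching the B = 1000, 50th/950th example on the slide:

```python
import math
import random
import statistics

def bootstrap_t_interval(scores, b=1000, alpha=0.05, seed=0):
    """Bootstrap-t confidence interval for the mean.

    For each replicate compute Z*(b) = (theta*_b - theta_hat) / se*_b,
    take the alpha and (1 - alpha) percentiles of the Z* values, and
    plug them into [theta_hat - t_hi * se, theta_hat - t_lo * se].
    """
    rng = random.Random(seed)
    n = len(scores)
    theta = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)
    z = []
    for _ in range(b):
        sample = rng.choices(scores, k=n)
        t_b = statistics.fmean(sample)
        se_b = statistics.stdev(sample) / math.sqrt(n)
        z.append((t_b - theta) / se_b)
    z.sort()
    t_lo = z[int(b * alpha)]            # alpha-th percentile of Z*
    t_hi = z[int(b * (1 - alpha)) - 1]  # (1 - alpha)-th percentile
    return theta - t_hi * se, theta - t_lo * se
```

Note the sign flip: the upper percentile of $Z^*$ sets the lower endpoint, and vice versa.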
Bootstrap-t interval vs. Normal/t interval (Cont.)
• The bootstrap-t interval assumes no particular distribution, but
– It can give erratic results
– It can be heavily influenced by a few outlying data points
• When B is large, the bootstrap sample scores are quite close to a normal distribution
• Assuming a normal distribution gives more reliable intervals, e.g. for BLEU's relative confidence interval (B=500):
– STDEV = 0.27 for the bootstrap-t interval
– STDEV = 0.14 for the normal/Student's t-interval
[Figure: histogram of 2000 BLEU scores; x-axis: BLEU score, y-axis: frequency]
The Number of Bootstrap Replications B
• Ideal bootstrap estimate of the confidence interval takes• Computational time increases linearly with B • The greater the B, the smaller of the standard deviation of the estimated confidence intervals. E.g. for BLEU’s relative
confidence interval– STDEV = 0.60 when B=100– STDEV = 0.27 when B=500
• Two rules of thumb:– Even a small B, say B=100 is usually informative– B>1000 gives quite satisfactory results
B
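The effect of B on interval stability can be probed by repeating the whole bootstrap several times and measuring the spread of an interval endpoint. A rough sketch with made-up data; this mimics, but is not, the STDEV measurement on the slide:

```python
import random
import statistics

def interval_spread(scores, b, trials=20, alpha=0.05, seed=0):
    """Repeat the full bootstrap `trials` times and report the stdev
    of the interval's upper endpoint; a smaller spread means the
    interval estimate is more stable for that B."""
    rng = random.Random(seed)
    n = len(scores)
    uppers = []
    for _ in range(trials):
        reps = sorted(statistics.fmean(rng.choices(scores, k=n))
                      for _ in range(b))
        uppers.append(reps[int(b * (1 - alpha / 2)) - 1])
    return statistics.stdev(uppers)
```

Since the endpoint is a percentile estimate, its spread shrinks roughly like $1/\sqrt{B}$, which is why B=100 is informative but B>1000 is comfortable.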
References
• B. Efron and R. Tibshirani: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.
• F. Och: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL, Sapporo, Japan.
• M. Bisani and H. Ney: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP, Montreal, Canada, Vol. 1, pp. 409-412.
• G. Leusch, N. Ueffing, H. Ney: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.
• I. Dan Melamed, Ryan Green and Joseph P. Turian: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.
• M. King, A. Popescu-Belis and E. Hovy: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA, USA.
• S. Nießen, F.J. Och, G. Leusch, H. Ney: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.
• NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf
• K. Papineni, S. Roukos et al.: 2002, 'BLEU: A Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th ACL.
• Ying Zhang, Stephan Vogel, Alex Waibel: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.