TSD2013 PPT: Automatic Machine Translation Evaluation with Part-of-Speech Information


DESCRIPTION

Authors: Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He. Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD 2013), Plzeň, Czech Republic, September 2013. LNAI Vol. 8082, pp. 121-128. Volume editors: I. Habernal and V. Matousek. Springer-Verlag Berlin Heidelberg 2013. Open-source tool: https://github.com/aaronlifenghan/aaron-project-hlepor

TRANSCRIPT

Page 1:

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, and Liangye He

Open source code: https://github.com/aaronlifenghan/aaron-project-hlepor

May 16th, 2012

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau

TSD 2013, LNAI Vol. 8082, pp. 121-128. Springer-Verlag Berlin Heidelberg 2013

Page 2:

Introduction and related work in MT evaluation

Problems and design ideas for MT evaluation

Employed linguistic features

Designed measuring formulas

How to evaluate the evaluation metric

Experiments on WMT corpora

Conclusion

References

Page 3:

• Machine translation (MT) research began as early as the 1950s (Weaver 1955)

• Big progress since the 1990s, due to the development of computers (storage capacity and computational power) and enlarged bilingual corpora (Marino et al. 2006)

• Some recent works in MT:

• (Och 2003) presented MERT (Minimum Error Rate Training) for log-linear SMT

• (Su et al. 2009) used a Thematic Role Templates model to improve translation

• (Xiong et al. 2011) employed a maximum-entropy segmentation model, etc.

• Rule-based and data-driven methods, including example-based MT (Carl and Way 2003) and statistical MT (Koehn 2010), became the main approaches in the MT literature.

Page 4:

• With the widespread development of MT systems, MT evaluation becomes more and more important, telling us how well the MT systems perform and whether they are making progress.

• However, MT evaluation is difficult:

• language variability means there is no single correct translation

• natural languages are highly ambiguous, and different languages do not always express the same content in the same way (Arnold 2003)

Page 5:

• Human evaluation criteria:

• intelligibility (measuring how understandable the sentence is)

• fidelity (measuring how much information the translated sentence retains compared to the original), both used by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll 1966)

• adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility), used by the US Defense Advanced Research Projects Agency (DARPA) (White et al. 1994)

• Problem with manual evaluation:

• time-consuming and thus too expensive to perform frequently.

Page 6:

• Automatic evaluation metrics:

• WER, word error rate (Su et al. 1992): the edit distance between the system output and the closest reference translation

• PER, position-independent word error rate (Tillmann et al. 1997): a variant of WER that disregards word ordering

• BLEU (Papineni et al. 2002): the geometric mean of the n-gram precisions of the system output with respect to the reference translations

• NIST (Doddington 2002): adds information weights to the n-grams

• GTM (Turian et al. 2003)
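To make the edit-distance metrics (WER/PER) concrete, here is a minimal WER sketch: word-level Levenshtein distance normalized by reference length. It is an illustration only, not any official implementation:

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(hyp, ref):
    """Edit distance divided by reference length (lower is better)."""
    return edit_distance(hyp, ref) / len(ref)

print(wer("there is a large bag".split(),
          "there is a big bag".split()))  # 0.2: one substitution in five words
```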

Page 7:

• Recently, many other methods:

• The METEOR metric (Banerjee and Lavie 2005) conducts flexible matching, considering stems, synonyms and paraphrases.

• The matching process involves computationally expensive word alignment, and several parameters must be tuned, such as the relative weight of recall to precision and the weights for stem or synonym matches. Meteor-1.3 (Denkowski and Lavie 2011), a modified version of Meteor, includes ranking and adequacy versions and overcomes some weaknesses of the previous version, such as noise in the paraphrase matching, lack of punctuation handling, and no discrimination between word types.

Page 8:

• Snover et al. (2006) observed that one disadvantage of the Levenshtein distance is that mismatches in word order require the deletion and re-insertion of the misplaced words.

• They proposed TER by adding an editing step that allows the movement of word sequences from one part of the output to another. This is something a human post-editor would do with the cut-and-paste function of a word processor.

• However, finding the shortest sequence of editing steps is a computationally hard problem.

Page 9:

• AMBER (Chen and Kuhn 2011), with its variants AMBER-TI and AMBER-NL, is a modified version of BLEU that attaches more kinds of penalty coefficients, combining n-gram precision and recall through the arithmetic average of the F-measure.

• Before evaluation, it offers eight kinds of corpus preparation: whether or not the words are tokenized, extraction of word stems, prefixes and suffixes, and splitting of words into several parts with different ratios.

Page 10:

• F15 and F15G3 (Bicici and Yuret 2011) perform evaluation with the F1 measure (assigning equal weight to precision and recall) over target features as a metric for translation quality.

• The target features they defined include the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) rates, etc. To consider the surrounding phrase of a missing token in the translation, they employed the gapped word sequence kernel approach (Shawe-Taylor and Cristianini 2004).

Page 11:

• Other related works:

• (Wong and Kit 2008), (Isozaki et al. 2010) and (Talbot et al. 2011) on word order

• ROSE (Song and Cohn 2011) and MPF and WMPF (Popovic 2011) on the use of POS information

• MP4IBM1 (Popovic et al. 2011), which does not rely on reference translations, etc.

Page 12:

• The previously proposed evaluation methods suffer, to varying degrees, from several main weaknesses:

• they perform well on certain language pairs but weakly on others, which we call the language-bias problem;

• they consider no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them difficult to replicate), which we call the extremism problem;

• they use an incomplete set of factors (e.g. BLEU focuses on precision only).

• What to do?

• This paper: address some of the above problems

Page 13:

• How?

• Enhanced factors

• Tunable parameters

• A principled, mathematically grounded combination of the factors

• Concise linguistic features

Page 14:

• To address the variability phenomenon, researchers have employed synonyms, paraphrasing or textual entailment as auxiliary information. All of these approaches have their advantages and weaknesses, e.g.:

• synonym resources can hardly cover all acceptable expressions.

• Instead, the designed metric performs its measuring on part-of-speech (POS) information (as also applied by ROSE (Song and Cohn 2011) and MPF and WMPF (Popovic 2011)).

• If a system output is a good translation, there is a good chance that it carries semantic information similar to the reference: the two sentences may not contain exactly the same words, but words with similar semantic meanings.

• For example, "there is a big bag" and "there is a large bag" can express the same thing, since "big" and "large" have similar meanings (both with POS adjective).
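A small sketch of this intuition, assuming NLTK's default English tagger (not necessarily the POS tool used in the paper's experiments): the two sentences disagree on one surface word but agree on the full POS sequence.

```python
import nltk
# First run only: download the tokenizer and tagger models.
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

hyp = nltk.pos_tag(nltk.word_tokenize("there is a large bag"))
ref = nltk.pos_tag(nltk.word_tokenize("there is a big bag"))
print(hyp)  # [..., ('large', 'JJ'), ('bag', 'NN')]
print(ref)  # [..., ('big', 'JJ'), ('bag', 'NN')]
# Surface words differ, but the POS sequences are identical:
print([t for _, t in hyp] == [t for _, t in ref])  # True
```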

Page 15:

How to measure?

Page 16:

• $\mathrm{Harmonic}(X_1, X_2, \ldots, X_n) = \dfrac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$  (1)

• $\mathrm{Harmonic}(w_{X_1} X_1, \ldots, w_{X_n} X_n) = \dfrac{\sum_{i=1}^{n} w_{X_i}}{\sum_{i=1}^{n} \frac{w_{X_i}}{X_i}}$  (2)

• $\mathit{hLEPOR} = \mathrm{Harmonic}(w_{LP}\,LP,\ w_{NPosPenal}\,NPosPenal,\ w_{HPR}\,HPR) = \dfrac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{Factor_i}} = \dfrac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}$  (3)
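A direct Python transcription of Eqs. (2) and (3) could look as follows; the factor and weight values in the example call are placeholders, not tuned values from the experiments:

```python
def weighted_harmonic(factors, weights):
    """Eq. (2): sum(w_i) / sum(w_i / X_i)."""
    return sum(weights) / sum(w / x for w, x in zip(weights, factors))

def hlepor(lp, npos_penal, hpr, w_lp=1.0, w_np=1.0, w_hpr=1.0):
    """Eq. (3): weighted harmonic mean of LP, NPosPenal and HPR."""
    return weighted_harmonic([lp, npos_penal, hpr], [w_lp, w_np, w_hpr])

print(hlepor(lp=0.9, npos_penal=0.8, hpr=0.7))  # ~0.79
```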

Page 17:

• The length penalty, where $c$ is the system output length and $r$ is the reference length:

$LP = \begin{cases} \exp(1 - \frac{r}{c}) & : c < r \\ 1 & : c = r \\ \exp(1 - \frac{c}{r}) & : c > r \end{cases}$  (4)

• $NPosPenal = \exp(-NPD)$  (5)

• $NPD = \dfrac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i|$  (6)
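A sketch of Eqs. (4)-(6), assuming the token matches have already been produced by the n-gram POS alignment algorithm (Fig. 1 on the next page) and that each $PD_i$ compares the length-normalized positions of a matched pair:

```python
import math

def length_penalty(c, r):
    """Eq. (4): c = system output length, r = reference length."""
    if c < r:
        return math.exp(1 - r / c)
    if c > r:
        return math.exp(1 - c / r)
    return 1.0

def npos_penal(alignment, out_len, ref_len):
    """Eqs. (5)-(6). `alignment` is a list of (output_pos, reference_pos)
    pairs (1-based) for the matched tokens; PD_i is assumed here to be
    the difference of the length-normalized positions."""
    npd = sum(abs(i / out_len - j / ref_len) for i, j in alignment) / out_len
    return math.exp(-npd)
```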

Page 18:

Fig. 1. N-gram POS alignment algorithm

Page 19:

Fig. 2. Example of n-gram POS alignment

Fig. 3. Example of NPD calculation

Page 20:

• $\mathrm{Harmonic}(\alpha R, \beta P) = \dfrac{\alpha + \beta}{\frac{\alpha}{R} + \frac{\beta}{P}}$  (7)

• $P = \dfrac{Aligned_{num}}{System_{length}}$  (8)

• $R = \dfrac{Aligned_{num}}{Reference_{length}}$  (9)

• $\mathit{hLEPOR}_A = \dfrac{1}{SentNum} \sum_{i=1}^{SentNum} \mathit{hLEPOR}_i$  (10)

• $\mathit{hLEPOR}_B = \mathrm{Harmonic}(w_{LP}\,LP,\ w_{NPosPenal}\,NPosPenal,\ w_{HPR}\,HPR)$  (11)
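Eqs. (7)-(10) as a Python sketch; the α and β defaults below are purely illustrative, not the tuned settings:

```python
def hpr(aligned_num, system_length, reference_length, alpha=1.0, beta=1.0):
    """Eqs. (7)-(9): weighted harmonic mean of recall and precision
    computed over the aligned POS tags."""
    p = aligned_num / system_length      # Eq. (8)
    r = aligned_num / reference_length   # Eq. (9)
    return (alpha + beta) / (alpha / r + beta / p)  # Eq. (7)

def hlepor_a(sentence_scores):
    """Eq. (10): hLEPOR_A is the mean of the sentence-level scores."""
    return sum(sentence_scores) / len(sentence_scores)

print(hpr(aligned_num=4, system_length=5, reference_length=5))  # 0.8
```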

Page 21:

How to evaluate the effectiveness of the algorithms?

Page 22:

• Spearman correlation coefficient:

• $\rho_{XY} = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$  (12)

• where $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_n\}$ are the two ranked score lists and $d_i$ is the difference between the ranks of item $i$.
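A self-contained sketch of Eq. (12), valid when there are no tied ranks:

```python
def spearman_rho(x, y):
    """Eq. (12): rank both score lists, then
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Metric scores vs. human scores for five systems (toy numbers):
print(spearman_rho([0.30, 0.25, 0.40, 0.22, 0.35],
                   [0.60, 0.50, 0.70, 0.45, 0.68]))  # 1.0: same ranking
```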

Page 23:

Experiments on authoritative corpora:

International Workshop on Statistical Machine Translation (WMT)

Page 24:

• Parameters tuned on WMT08 and tested on WMT11; a sketch of such a tuning loop follows:
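This is a hedged sketch of what the tuning stage could look like: grid-search the three factor weights on the WMT08 development data against human judgments, then freeze the best setting for WMT11. `score_corpus`, the weight grid, and the data handling are illustrative assumptions, not the paper's actual tuning procedure; `spearman_rho` is the function sketched on page 22.

```python
import itertools

def tune_weights(dev_outputs, human_scores, grid=(1.0, 2.0, 3.0, 9.0)):
    """Pick the (w_LP, w_NPosPenal, w_HPR) setting whose corpus-level
    scores correlate best (Spearman) with the human judgments."""
    best = None
    for w_lp, w_np, w_hpr in itertools.product(grid, repeat=3):
        scores = [score_corpus(out, w_lp, w_np, w_hpr)  # assumed helper
                  for out in dev_outputs]
        rho = spearman_rho(scores, human_scores)
        if best is None or rho > best[0]:
            best = (rho, (w_lp, w_np, w_hpr))
    return best  # (dev correlation, chosen weights)
```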

Page 25:

Page 26:

Page 27:

• An evaluation metric based on the mathematically weighted harmonic mean

• Tunable weights

• Enhanced factors

• Employs a concise linguistic feature: the POS of the word

• Better performance than similar POS-based metrics such as ROSE, MPF and WMPF.

• Performance can be further enhanced by adding POS tools and adjusting the parameter values.

• BLEU uses n-grams, and other researchers count the number of POS tags, e.g. (Avramidis et al. 2011); we combine the n-gram and POS information together.

• Evaluation methods that need no reference perform poorly, e.g. the MP4IBM1 metric ranked near the bottom in the experiments.

Page 28:

• More language pairs will be tested

• The combination of both word and POS information will be explored

• Parameter tuning will be automated

• Evaluation without golden references will be developed

Page 29:

• 1. Weaver, W.: Translation. In: Locke, W., Booth, A. D. (eds.) Machine Translation of Languages: Fourteen Essays, pp. 15-23. John Wiley and Sons, New York (1955)

• 2. Marino, J. B., Banchs, R. E., Crego, J. M., de Gispert, A., Lambert, P., Fonollosa, J. A., Costa-jussa, M. R.: N-gram based machine translation. Computational Linguistics 32(4), 527-549. MIT Press (2006)

• 3. Och, F. J.: Minimum Error Rate Training for Statistical Machine Translation. In: Proceedings of ACL 2003, pp. 160-167 (2003)

• 4. Su, H.-Y., Wu, C.-H.: Improving Structural Statistical Machine Translation for Sign Language with Small Corpus Using Thematic Role Templates as Translation Memory. IEEE Transactions on Audio, Speech, and Language Processing 17(7) (2009)

• 5. Xiong, D., Zhang, M., Li, H.: A Maximum-Entropy Segmentation Model for Statistical Machine Translation. IEEE Transactions on Audio, Speech, and Language Processing 19(8), 2494-2505 (2011)

• 6. Carl, M., Way, A. (eds.): Recent Advances in Example-Based Machine Translation. Kluwer Academic Publishers, Dordrecht, The Netherlands (2003)

Page 30:

• 7. Koehn, P.: Statistical Machine Translation. Cambridge University Press (2010)

• 8. Arnold, D.: Why translation is difficult for computers. In: Computers and Translation: A Translator's Guide. Benjamins Translation Library (2003)

• 9. Carroll, J. B.: An experiment in evaluating the quality of translation. In: Pierce, J. (chair) Languages and Machines: Computers in Translation and Linguistics. Report by the Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, Division of Behavioral Sciences, National Academy of Sciences, National Research Council, pp. 67-75 (1966)

• 10. White, J. S., O'Connell, T. A., O'Mara, F. E.: The ARPA MT evaluation methodologies: Evolution, lessons, and future approaches. In: Proceedings of AMTA 1994, pp. 193-205 (1994)

• 11. Su, K.-Y., Wu, M.-W., Chang, J.-S.: A New Quantitative Quality Measure for Machine Translation Systems. In: Proceedings of the 14th International Conference on Computational Linguistics, pp. 433-439, Nantes, France (1992)

Page 31:

• 12. Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., Sawaf, H.: Accelerated DP Based Search for Statistical Translation. In: Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH 97) (1997)

• 13. Papineni, K., Roukos, S., Ward, T., Zhu, W. J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL 2002, pp. 311-318, Philadelphia, PA, USA (2002)

• 14. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Second International Conference on Human Language Technology Research (HLT 2002), pp. 138-145, San Diego, California, USA (2002)

• 15. Turian, J. P., Shen, L., Melamed, I. D.: Evaluation of machine translation and its evaluation. In: Proceedings of MT Summit IX, pp. 386-393, New Orleans, LA, USA (2003)

• 16. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: Proceedings of ACL-WMT, pp. 65-72, Prague, Czech Republic (2005)

Page 32:

• 17. Denkowski, M., Lavie, A.: Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In: Proceedings of ACL-WMT, pp. 85-91, Edinburgh, Scotland, UK (2011)

• 18. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of AMTA, pp. 223-231, Boston, USA (2006)

• 19. Chen, B., Kuhn, R.: AMBER: A modified BLEU, enhanced ranking metric. In: Proceedings of ACL-WMT, pp. 71-77, Edinburgh, Scotland, UK (2011)

• 20. Bicici, E., Yuret, D.: RegMT system for machine translation, system combination, and evaluation. In: Proceedings of ACL-WMT, pp. 323-329, Edinburgh, Scotland, UK (2011)

• 21. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)

• 22. Wong, B. T.-M., Kit, C.: Word choice and word position for automatic MT evaluation. In: Workshop: MetricsMATR of AMTA, short paper, Waikiki, Hawaii, USA (2008)

Page 33:

• 23. Isozaki, H., Hirao, T., Duh, K., Sudoh, K., Tsukada, H.: Automatic evaluation of translation quality for distant language pairs. In: Proceedings of EMNLP 2010, pp. 944-952, Cambridge, MA (2010)

• 24. Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J., Seno, M., Och, F.: A Lightweight Evaluation Framework for Machine Translation Reordering. In: Proceedings of the Sixth ACL-WMT, pp. 12-21, Edinburgh, Scotland, UK (2011)

• 25. Song, X., Cohn, T.: Regression and ranking based optimisation for sentence level MT evaluation. In: Proceedings of ACL-WMT, pp. 123-129, Edinburgh, Scotland, UK (2011)

• 26. Popovic, M.: Morphemes and POS tags for n-gram based evaluation metrics. In: Proceedings of ACL-WMT, pp. 104-107, Edinburgh, Scotland, UK (2011)

• 27. Popovic, M., Vilar, D., Avramidis, E., Burchardt, A.: Evaluation without references: IBM1 scores as evaluation metrics. In: Proceedings of ACL-WMT, pp. 99-103, Edinburgh, Scotland, UK (2011)

• 28. Petrov, S., Barrett, L., Thibaux, R., Klein, D.: Learning accurate, compact, and interpretable tree annotation. In: Proceedings of COLING-ACL 2006, pp. 433-440, Sydney, July (2006)

Page 34:

• 29. Callison-Burch, C., Koehn, P., Monz, C., Zaidan, O. F.: Findings of the 2011 Workshop on Statistical Machine Translation. In: Proceedings of ACL-WMT, pp. 22-64, Edinburgh, Scotland, UK (2011)

• 30. Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M., Zaidan, O. F.: Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In: Proceedings of ACL-WMT, pp. 17-53, PA, USA (2010)

• 31. Callison-Burch, C., Koehn, P., Monz, C., Schroeder, J.: Findings of the 2009 Workshop on Statistical Machine Translation. In: Proceedings of ACL-WMT, pp. 1-28, Athens, Greece (2009)

• 32. Callison-Burch, C., Koehn, P., Monz, C., Schroeder, J.: Further meta-evaluation of machine translation. In: Proceedings of ACL-WMT, pp. 70-106, Columbus, Ohio, USA (2008)

• 33. Avramidis, E., Popovic, M., Vilar, D., Burchardt, A.: Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features. In: Proceedings of the Sixth Workshop on Statistical Machine Translation (ACL-WMT), pp. 65-70, Edinburgh, Scotland, UK (2011)

Page 35:

Aaron L.-F. Han, Derek F. Wong, Lidia S. Chao, and Liangye He

Open source code: https://github.com/aaronlifenghan/aaron-project-hlepor

Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory

Department of Computer and Information Science

University of Macau

TSD 2013, LNAI Vol. 8082, pp. 121-128. Springer-Verlag Berlin Heidelberg 2013