
Transliteration Extraction from Classical Chinese Buddhist Literature Using Conditional Random Fields

Yu-Chun Wang
Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
Telecommunication Laboratories, Chunghwa Telecom, [email protected]

Richard Tzong-Han Tsai∗
Department of Computer Science and Information Engineering, National Central University, Zhongli City, [email protected]

Abstract

Extracting plausible transliterations from historical literature is a key issue in historical linguistics and other research fields. In Chinese historical literature, the characters used to transliterate the same loanword may vary because of different translation eras or different Chinese language preferences among translators. To assist researchers in historical linguistics and the digital humanities, this paper proposes a transliteration extraction method based on conditional random fields, with features derived from the characteristics of the Chinese characters used in transliterations, which are well suited to identifying transliteration characters. To evaluate our method, we compiled an evaluation set from two Buddhist texts, the Samyuktagama and the Lotus Sutra. We also constructed a baseline approach that combines suffix-array-based extraction with phonetic similarity measurement. Our method outperforms the baseline by a wide margin, reaching a recall of 0.9561 on the Samyuktagama test set and a precision of 0.9444 on the Lotus Sutra test set. The results show that our method is very effective at extracting transliterations from classical Chinese texts.

1 Introduction

Cognates and loanwords play important roles in research on language origins and cultural interchange. Therefore, extracting plausible cognates or loanwords from historical literature is a key issue in historical linguistics. Loanwords are usually adopted from other languages through transliteration. In Chinese historical literature, the characters used to transliterate the same loanword may vary because of different translation eras or different Chinese language or dialect preferences among translators. For example, the translation of Buddhist scriptures from Sanskrit into classical Chinese took place mainly from the 1st century to the 10th century. In these works, the same Sanskrit word may be transliterated into different Chinese loanword forms. For instance, the surname of the Buddha, Gautama, is transliterated into several different forms such as “瞿曇” (qu-tan) or “喬答摩” (qiao-da-mo), and the name “Culapanthaka” has several different Chinese transliterations such as “朱利槃特” (zhu-li-pan-te) and “周利槃陀伽” (zhou-li-pan-tuo-qie). To assist researchers in historical linguistics and other digital humanities fields, an approach to extracting transliterations from classical Chinese texts is necessary.

Many transliteration extraction methods require a bilingual parallel corpus or text documents containing two languages. For example, Sherif and Kondrak (2007) proposed a method for learning the string distance measurement function from a sentence-aligned English-Arabic parallel corpus to extract transliteration pairs. Kuo et al. (2007) proposed a transliteration pair extraction method using a phonetic similarity model. Their approach is based on the general rule that when a new English term is transliterated into Chinese (in modern Chinese texts, e.g. newswire), the English source term usually appears alongside the transliteration. To exploit this pattern, they identify all the English terms in a Chinese text and measure the phonetic similarity between those English terms and their surrounding Chinese terms, treating the pairs with the highest similarity as the true transliteration pairs. Despite its high accuracy, this approach cannot be applied to transliteration extraction in classical Chinese literature, since the prerequisite (that the source term appears alongside the transliteration) does not hold.

PACLIC-27

260Copyright 2013 by Yu-Chun Wang and Richard Tzong-Han Tsai

27th Pacific Asia Conference on Language, Information, and Computation pages 260－266

Some researchers have tried to extract transliterations from a single-language corpus. Oh and Choi (2003) proposed a Korean transliteration identification method using a Hidden Markov Model (HMM) (Rabiner, 1989). They transformed the transliteration identification problem into a sequential tagging problem in which each Korean syllable block in a Korean sentence is tagged as either belonging to a transliteration or not. They compiled a human-tagged Korean corpus to train a hidden Markov model with predefined phonetic features to extract transliteration terms from sentences by sequential tagging. Goldberg and Elhadad (2008) proposed an unsupervised Hebrew transliteration extraction method. They adopted an English-Hebrew phoneme mapping table to convert the English terms in a named entity lexicon into all the possible Hebrew transliteration forms. The Hebrew transliterations are then used to train a Hebrew transliteration identification model. However, Korean and Hebrew use alphabetic writing systems, while Chinese is ideographic. These identification methods depend heavily on the phonetic characteristics of the writing system. Since Chinese characters do not necessarily reflect actual pronunciations, these methods are difficult to apply to the transliteration extraction problem in classical Chinese.

This paper proposes an approach to extracting transliterations automatically from classical Chinese texts, especially Buddhist scriptures, using supervised learning models based on the probability of the characters used in transliterations and on language model features of Chinese characters.

2 Method

To extract the transliterations from the classical Chinese Buddhist scriptures, we adopt a supervised learning method, the conditional random fields (CRF) model. The features we use in the CRF model are described in the following subsections.

2.1 Probability of each Chinese character in transliterations

According to our observation, the Chinese characters chosen for transliteration in classical Chinese Buddhist texts show some common characteristics. Translators tended to choose characters that do not affect the comprehension of the sentences. The number of Chinese characters is huge, but the number of possible syllables in Chinese is limited. Therefore, one Chinese character may share the same pronunciation with several other characters, and translators could choose rarely used characters for transliteration.

From our observation, the probability that a Chinese character is used in transliteration is an important feature for identifying transliterations in classical Buddhist texts. To measure this probability for every character, we collect the frequency of all the Chinese characters in the Chinese Buddhist Canon. Furthermore, we apply the suffix array method (Manzini and Ferragina, 2004) to extract terms with their counts from all the texts of the Chinese Buddhist Canon. Next, the extracted terms are filtered against a list of selected transliteration terms from the Buddhist Translation Lexicon and Ding Fubao’s Dictionary of Buddhist Studies. Only the extracted terms that appear in the list are retained, and from them the transliteration frequency of each Chinese character is calculated. Thus, the probability of a given Chinese character c in transliteration can be defined as:

\[
\mathrm{Prob}(c) = \log \frac{\mathit{freq}_{trans}(c)}{\mathit{freq}_{all}(c)}
\]

where $\mathit{freq}_{trans}(c)$ is the frequency with which c is used in transliterations, and $\mathit{freq}_{all}(c)$ is the frequency with which c appears in the whole Chinese Buddhist Canon. The logarithm in the formula is used to suit the discrete feature values of the CRF.
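The character probability defined in this subsection can be illustrated with a minimal sketch. The toy corpus, the list of transliteration occurrences, and the function names are hypothetical; the suffix-array extraction and lexicon filtering steps are assumed to have already produced the transliteration occurrences. Returning None for a character that never occurs in a transliteration is also an assumption, since the text does not say how zero counts are handled.

```python
import math
from collections import Counter


def char_counts(texts):
    """Count how often each character occurs across a list of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts


def translit_log_prob(char, freq_trans, freq_all):
    """Return log(freq_trans(c) / freq_all(c)).

    Returns None when either count is zero, because the logarithm is
    undefined there (an assumption; the paper does not cover this case).
    """
    in_trans = freq_trans.get(char, 0)
    in_all = freq_all.get(char, 0)
    if in_trans == 0 or in_all == 0:
        return None
    return math.log(in_trans / in_all)


# Hypothetical toy data standing in for the Chinese Buddhist Canon and for
# the transliteration occurrences kept after lexicon filtering.
corpus = ["於憍薩羅國", "憍薩羅國王", "於是世尊", "羅網"]
translit_occurrences = ["憍薩羅", "憍薩羅"]

freq_all = char_counts(corpus)
freq_trans = char_counts(translit_occurrences)

for ch in "憍羅於":
    print(ch, translit_log_prob(ch, freq_trans, freq_all))
```

In this toy data “憍” only ever occurs inside a transliteration, “羅” also occurs elsewhere and therefore scores lower, and “於” never occurs in a transliteration.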

2.2 Language model of the transliteration

Transliterations may appear many times in one Buddhist sutra, and the characters preceding and following a transliteration may differ from one occurrence to another. For example, for the phrase “於憍薩羅國” (yu-jiao-sa-luo-guo, in Kosala state), if we want to separate the actual transliteration, “憍薩羅” (jiao-sa-luo, Kosala), from the extra characters “於” (yu, in) and “國” (guo, state), we must first use an effective feature to identify the boundaries of the transliteration.

In order to identify the boundaries of transliterations, we propose a language-model-based feature. A language model assigns a probability to a sequence of m words, $P(w_1, w_2, \ldots, w_m)$, by means of a probability distribution. The probability of a sequence of m words can be transformed into a product of conditional probabilities:

\[
\begin{aligned}
P(w_1, w_2, \cdots, w_m) &= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, w_2, \cdots, w_{m-1}) \\
&= \prod_{i=1}^{m} P(w_i \mid w_1, w_2, \cdots, w_{i-1})
\end{aligned}
\]

In practice, we can assume that the probability of a word depends only on its previous word (bi-gram assumption). Therefore, the probability of a sequence can be approximated as:

\[
\begin{aligned}
P(w_1, w_2, \cdots, w_m) &= \prod_{i=1}^{m} P(w_i \mid w_1, w_2, \cdots, w_{i-1}) \\
&\approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})
\end{aligned}
\]

We collect person and location names from the Buddhist Authority Database¹ and the known Buddhist transliteration terms from The Buddhist Translation Lexicon (翻譯名義集)² to create a dataset of 4,301 transliterations for our bi-gram language model.

After building the bi-gram language model, we apply it as a feature for the supervised model. Following the previous example, “於憍薩羅國” (yu-jiao-sa-luo-guo, in Kosala state), for each character in the sentence we compute the probability of the bi-gram formed by the previous character and the current character. For the first character “於”, since there is no previous character, the probability is P(於). For the second character “憍”, the probability of the two characters is P(於憍) = P(於)P(憍|於). We then compute the probability of the second and third characters, P(憍薩) = P(憍)P(薩|憍), and so on. If the probability changes sharply from that of the previous bi-gram, the previous bi-gram may mark a boundary of the transliteration. Because the character “於” rarely appears in transliterations, P(於憍) is much lower than P(憍薩). We may therefore conclude that the left boundary lies between the first two characters, “於” and “憍”.
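The bi-gram feature described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four training terms are a hypothetical stand-in for the 4,301 collected transliterations, and add-one smoothing is an assumption introduced here so that unseen characters such as “於” receive a small non-zero probability.

```python
from collections import Counter


class BigramLM:
    """Character bi-gram model trained on a list of transliteration terms."""

    def __init__(self, terms, alpha=1.0):
        self.alpha = alpha              # add-alpha smoothing (assumption)
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.firsts = Counter()         # how often a character has a successor
        for term in terms:
            self.unigrams.update(term)
            for a, b in zip(term, term[1:]):
                self.bigrams[(a, b)] += 1
                self.firsts[a] += 1
        self.total = sum(self.unigrams.values())
        self.vocab = len(self.unigrams) + 1   # +1 slot for unseen characters

    def p_char(self, c):
        """Smoothed unigram probability P(c)."""
        return (self.unigrams[c] + self.alpha) / (self.total + self.alpha * self.vocab)

    def p_next(self, prev, cur):
        """Smoothed conditional probability P(cur | prev)."""
        return (self.bigrams[(prev, cur)] + self.alpha) / (
            self.firsts[prev] + self.alpha * self.vocab
        )

    def p_pair(self, prev, cur):
        """Joint probability of a bi-gram: P(prev) * P(cur | prev)."""
        return self.p_char(prev) * self.p_next(prev, cur)


def pair_scores(lm, sentence):
    """One score per character: P(c) for the first, P(prev, c) afterwards."""
    if not sentence:
        return []
    scores = [lm.p_char(sentence[0])]
    for prev, cur in zip(sentence, sentence[1:]):
        scores.append(lm.p_pair(prev, cur))
    return scores


# Hypothetical training terms standing in for the transliteration dataset.
lm = BigramLM(["憍薩羅", "拘薩羅", "摩伽陀", "迦毘羅衛"])
for ch, score in zip("於憍薩羅國", pair_scores(lm, "於憍薩羅國")):
    print(ch, round(score, 4))
```

With these toy counts the score rises sharply from the bi-gram “於憍” to “憍薩” and drops again at “羅國”, which is the kind of change the feature is meant to expose at the two boundaries.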

¹ http://authority.ddbc.edu.tw/
² http://www.cbeta.org/result/T54/T54n2131.htm

2.3 Functional Words

We take the classical Chinese functional words into consideration. These characters have special grammatical functions in classical Chinese; thus, they are seldom used to transliterate foreign names. This is a binary feature which records whether or not the character is a functional word. The functional words are listed as follows: 之 (zhi), 乎 (hu), 且 (qie), 矣 (yi), 邪 (ye), 於 (yu), 哉 (zai), 相 (xiang), 遂 (sui), 嗟 (jie), 與 (yu), and 噫 (yi).

2.4 Appellation and Quantifier Words

After observing the transliterations appearing in classical Chinese literature, we note that the characters following transliteration terms show specific patterns. Most of the characters following a transliteration are appellation or quantifier words, such as 山 (shan, mountain), 海 (hai, sea), 國 (guo, state), and 洲 (zhou, continent). Examples include 耆闍崛山 (qi-du-jue-shan, Vulture mountain), 拘薩羅國 (ju-sa-luo-guo, Kosala state), and 瞻部洲 (zhan-bu-zhou, Jambu continent). Therefore, we collect the Chinese characters that are usually used as appellations or quantifiers following transliterations and design a feature from them. This is also a binary feature, which records whether or not the character is used as an appellation or quantifier word.
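The two binary features from Sections 2.3 and 2.4 amount to set-membership tests on each character. The sketch below uses the twelve functional words listed in Section 2.3; the appellation set contains only the four examples named above, since the full collected list is not given, and the function name is hypothetical.

```python
# The twelve functional words listed in Section 2.3.
FUNCTION_WORDS = set("之乎且矣邪於哉相遂嗟與噫")

# Only the four example characters named in Section 2.4; the full list
# collected by the authors is not reproduced in the text.
APPELLATION_WORDS = set("山海國洲")


def binary_features(sentence):
    """Return (character, is_function_word, is_appellation_word) per character."""
    return [
        (ch, int(ch in FUNCTION_WORDS), int(ch in APPELLATION_WORDS))
        for ch in sentence
    ]


for row in binary_features("於憍薩羅國"):
    print(row)
```

For the running example, “於” is flagged as a functional word and “國” as an appellation word, while the three transliteration characters carry neither flag.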

2.5 CRF Model Training

We adopt a supervised learning model, the conditional random field (CRF) (Lafferty et al., 2011), to extract transliterations from classical Buddhist texts. With the CRF model, we formulate transliteration extraction as a sequential tagging problem.

2.5.1 Conditional Random Fields

Conditional random fields (CRFs) are undirected graphical models trained to maximize a conditional probability (Lafferty et al., 2011). A linear-chain CRF with parameters $\Lambda = \{\lambda_1, \lambda_2, \ldots\}$ defines the conditional probability of a state sequence $\mathbf{y} = y_1 \ldots y_T$, given an input sequence $\mathbf{x} = x_1 \ldots x_T$, as

\[
P_\Lambda(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)
\]

where $Z_{\mathbf{x}}$ is the normalization factor that makes the probabilities of all state sequences sum to one; $f_k(y_{t-1}, y_t, \mathbf{x}, t)$ is often a binary-valued feature function and $\lambda_k$ is its weight. The feature functions can measure any aspect of a state transition, $y_{t-1} \rightarrow y_t$, and of the entire observation sequence, $\mathbf{x}$, centered at the current time step, t. For example, one feature function might have the value 1 when $y_{t-1}$ is the state B, $y_t$ is the state I, and $x_t$ is the character “國” (guo). Large positive values for $\lambda_k$ indicate a preference for such an event; large negative values make the event unlikely.

The most probable label sequence for $\mathbf{x}$,

\[
\mathbf{y}^* = \arg\max_{\mathbf{y}} P_\Lambda(\mathbf{y} \mid \mathbf{x})
\]

can be efficiently determined using the Viterbi algorithm.
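Viterbi decoding for a linear-chain model can be sketched as follows. This is a generic illustration rather than the decoder of any particular toolkit. It assumes that the weighted feature sums have already been collapsed into per-position label scores (`emissions`) and label-to-label scores (`transitions`); all numbers below are hypothetical, not learned weights. Because the normalization factor is the same for every label sequence, the maximization can work on these unnormalized scores directly.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list with one dict {label: score} per position.
    transitions: dict {(prev_label, label): score}; missing pairs score 0.
    """
    if not emissions:
        return []
    best = [{y: emissions[0][y] for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            # Best previous label for reaching label y at position t.
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["B", "I", "O"]

# Hypothetical scores for the sentence 於憍薩羅國.
emissions = [
    {"B": 0.0, "I": 0.0, "O": 2.0},   # 於
    {"B": 1.5, "I": 1.0, "O": 0.0},   # 憍
    {"B": 0.5, "I": 1.5, "O": 0.0},   # 薩
    {"B": 0.5, "I": 1.5, "O": 0.0},   # 羅
    {"B": 0.0, "I": 0.2, "O": 1.0},   # 國
]
# An I tag directly after O is strongly penalized; staying inside a term is rewarded.
transitions = {("O", "I"): -10.0, ("B", "I"): 0.5, ("I", "I"): 0.5}

print(viterbi(emissions, transitions, LABELS))
```

With these scores the decoder tags “憍薩羅” as one B-I-I span and leaves “於” and “國” outside it.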

2.5.2 Sequential Tagging and Feature Template

The classical Buddhist texts are split into sentences at Chinese punctuation marks. Each character in a sentence is then taken as a data row for the CRF model. We adopt the tagging approach used in Chinese word segmentation (Tsai et al., 2006), which treats segmentation as a tagging problem. A character is tagged B if it is the first character of a transliteration, and I if it is inside a transliteration but is not the first character. Characters that do not belong to any transliteration are tagged O. We use the CRF++ open-source toolkit³. We train our CRF models with unigram and bigram features over the input Chinese character sequences. The features are as follows.

• Unigram: $s_{-2}$, $s_{-1}$, $s_0$, $s_1$, $s_2$

• Bigram: $s_{-1}s_0$, $s_0s_1$

where $s_0$ is the current character and $s_i$ is the character at offset i relative to the current character.
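The tagging scheme and the feature window above can be sketched in a few lines. The helper names and the padding symbol are hypothetical, and labeling by string search against a list of known transliterations is an assumption made for illustration; the training labels in this work come from human annotation.

```python
def bio_tags(sentence, transliterations):
    """Tag each character B, I, or O given the transliterations it contains."""
    tags = ["O"] * len(sentence)
    # Longest terms first, so a long term is not split by one of its substrings.
    for term in sorted(transliterations, key=len, reverse=True):
        if not term:
            continue
        start = sentence.find(term)
        while start != -1:
            end = start + len(term)
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B"
                tags[start + 1:end] = ["I"] * (len(term) - 1)
            start = sentence.find(term, start + 1)
    return tags


def window_features(chars, i, pad="_PAD_"):
    """Unigram (s-2..s2) and bigram (s-1 s0, s0 s1) features around position i."""

    def at(offset):
        j = i + offset
        return chars[j] if 0 <= j < len(chars) else pad

    features = {f"s[{k}]": at(k) for k in (-2, -1, 0, 1, 2)}
    features["s[-1]s[0]"] = at(-1) + at(0)
    features["s[0]s[1]"] = at(0) + at(1)
    return features


sentence = "於憍薩羅國"
for ch, tag in zip(sentence, bio_tags(sentence, ["憍薩羅"])):
    print(ch, tag)
print(window_features(list(sentence), 1))
```

In CRF++ the same window would normally be declared in a template file rather than computed by hand; the dictionary form here is only meant to make the offsets concrete.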

3 Evaluation

3.1 Data set

We choose one Buddhist scripture from the Chinese Buddhist Canon maintained by the Chinese Buddhist Electronic Text Association (CBETA) as our data set for evaluation. The scripture from which we compile the training and test sets is the Samyuktagama (雜阿含經). The Samyuktagama is one of the most important scriptures in Early Buddhism and contains many transliterations because it records in detail the speech and the lives of the Buddha and many of his disciples.

³ http://crfpp.googlecode.com

The Samyuktagama is an early Buddhist scripture collected shortly after the Buddha’s death. The term agama in Buddhism refers to a collection of discourses, and the name Samyuktagama means “connected discourses.” It is among the most important sutras in Early Buddhism. The Samyuktagama is traditionally regarded as the earliest sutra, collected by Mahakasyapa, the Buddha’s disciple, and five hundred Arhats three months after the Buddha’s death. An Indian monk, Gunabhadra, translated this sutra into classical Chinese in the Liu Song dynasty around 443 C.E. The classical Chinese Samyuktagama has 50 volumes containing about 660,000 characters. Because the Samyuktagama is so large, we take the first 20 volumes as the training set and the last 10 volumes as the test set.

In addition, we want to evaluate whether a supervised learning model trained on one Buddhist scripture can be applied to another Buddhist scripture translated in a different era. Therefore, we choose another scripture, the Lotus Sutra (妙法蓮華經), to create a second test set. The Lotus Sutra is a famous Mahayana Buddhist scripture probably written down between 100 B.C.E. and 100 C.E. The earliest known Sanskrit title for the sutra is the Saddharma Pundarika Sutra, which translates to “the Good Dharma Lotus Flower Sutra.” In English, the shortened form Lotus Sutra is common. The Lotus Sutra has also been highly regarded in a number of Asian countries where Mahayana Buddhism has been traditionally practiced, such as China, Japan, and Korea. The Lotus Sutra has several classical Chinese translations. The most widely used version was translated by Kumarajiva (“鳩摩羅什” in Chinese) in 406 C.E. It has eight volumes and 28 chapters containing more than 25,000 characters. We select the first 5 chapters as a different test set to evaluate our method.

3.2 Baseline Method

Few studies have focused on transliteration extraction from classical Chinese literature. Nevertheless, in order to compare and show the benefits of our method, we construct a baseline system with widely used information extraction methods. Because many previous studies on transliteration extraction are based on phonetic similarity or phoneme mapping approaches, we also use these methods to construct the baseline system. First, the baseline system uses the suffix array method to extract all the possible terms from the classical Chinese Buddhist scriptures. Then, the extracted terms are converted into Pinyin sequences using a modern Chinese pronunciation dictionary. We also take the collected transliteration list used in Section 2.1 and convert its transliterations into Pinyin sequences. Next, for each extracted term, the baseline system measures the Levenshtein distance between the Pinyin sequence of the extracted term and those of all the collected transliterations as the phonetic similarity. If the extracted term is within the distance threshold (distance ≤ 3 in our baseline) of at least one collected transliteration, it is regarded as a transliteration; otherwise, the term is dropped.

Table 1: Evaluation Results of Transliteration Extraction

                                           Precision   Recall   F1-score
Our Approach   The Samyuktagama test set   0.8810      0.9561   0.9170
               The Lotus Sutra test set    0.9444      0.9474   0.9459
Baseline       The Samyuktagama test set   0.0399      0.7771   0.0759
               The Lotus Sutra test set    0.0146      0.5789   0.2848
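The phonetic filter of the baseline described in this section can be sketched as follows. The small Pinyin dictionary and the list of known transliterations are hypothetical toy data, and computing the edit distance over the letters of the concatenated Pinyin string (rather than over whole syllables) is an assumption, since the text does not specify the unit.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,              # delete x
                cur[j - 1] + 1,           # insert y
                prev[j - 1] + (x != y),   # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def to_pinyin(term, pinyin_dict):
    """Concatenate the Pinyin reading of each character in the term."""
    return "".join(pinyin_dict[ch] for ch in term)


def is_transliteration(candidate, known_terms, pinyin_dict, threshold=3):
    """Keep a candidate if it is within the threshold of any known term."""
    cand = to_pinyin(candidate, pinyin_dict)
    return any(
        levenshtein(cand, to_pinyin(term, pinyin_dict)) <= threshold
        for term in known_terms
    )


# Hypothetical toy pronunciation dictionary and transliteration list.
PINYIN = {"憍": "jiao", "薩": "sa", "羅": "luo", "拘": "ju",
          "於": "yu", "是": "shi", "世": "shi", "尊": "zun"}
KNOWN = ["拘薩羅"]

print(is_transliteration("憍薩羅", KNOWN, PINYIN))
print(is_transliteration("於是世尊", KNOWN, PINYIN))
```

Here the variant “憍薩羅” is accepted because its Pinyin differs from that of “拘薩羅” by three edits, while “於是世尊” is rejected. A threshold of 3 is large relative to the length of short Pinyin strings, which is one plausible reason a filter of this kind admits many non-transliterations.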

3.3 Evaluation Metrics

We use two evaluation metrics, recall and precision, to estimate the performance of our system. Recall and precision are widely used measurements in many research fields, such as information retrieval and information extraction (Manning et al., 2008). In the digital humanities, a key issue is the coverage of the extraction method. To maximize usefulness to researchers, a method should be able to extract as many potential transliterations from the literature as possible. Therefore, in our evaluation, we use recall, defined as follows:

\[
\mathrm{Recall} = \frac{|\text{Correctly extracted transliterations}|}{|\text{Transliterations in the data set}|}
\]

In addition, the correctness of the extracted transliterations is also important. To avoid wasting researchers' time on useless information, a method should extract as few incorrect terms as possible. Thus, we also use precision, defined as follows:

\[
\mathrm{Precision} = \frac{|\text{Correctly extracted transliterations}|}{|\text{All extracted transliterations}|}
\]

Along with precision and recall, we also adopt the F-score, a weighted harmonic mean of precision and recall. The F1-score, which weights the two equally, is defined as follows:

\[
F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]
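The three metrics can be computed directly from the set of extracted terms and the gold-standard set. The sketch below is illustrative only: the function names and toy terms are hypothetical, and scoring over sets of distinct terms (rather than over individual occurrences) is an assumption, since the text does not state which is used.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(extracted, gold):
    """Score a collection of extracted terms against the gold standard."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical toy example: two of three extracted terms are correct,
# and two of the four gold transliterations are found.
extracted = ["憍薩羅", "摩伽陀", "於是"]
gold = ["憍薩羅", "摩伽陀", "迦毘羅衛", "舍利"]
print(precision_recall_f1(extracted, gold))
```

Because the harmonic mean is dominated by the smaller of the two values, a system with high recall but very low precision still receives a low F1-score.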

3.4 Evaluation Results

Table 1 shows the results of our method and the baseline system on the two test sets. The gold standards of the test sets were compiled by human experts, who examined all the sentences in the test sets and identified each transliteration. The results show that our method extracts 95.61% of the transliterations in the Samyuktagama test set and 94.74% of those in the Lotus Sutra test set. Our method also achieves good precision (0.8810 and 0.9444, respectively), which shows that most of the terms it extracts are actual transliterations. Our method outperforms the baseline system, whose precision is very poor (below 0.04 on both test sets). The baseline system also misses many transliterations because the suffix array method only extracts terms that appear at least twice in the text. In addition, phonetic similarity is not an effective filter for transliterations, which causes the low precision. These results demonstrate that our method can save historians and humanities researchers much of the labor-intensive work of examining transliterations.

4 Discussion

4.1 Effectiveness of transliteration extraction

Our method can extract many transliterations from the Samyuktagama, such as “迦毘羅衛” (jia-pi-luo-wei, Kapilavastu, the name of an ancient kingdom where the Buddha was born and grew up), “尼拘律” (ni-ju-lu, Nyagro, the name of a forest in the Kapilavastu kingdom), and “摩伽陀” (mo-qie-tuo, Magadha, the name of an ancient Indian kingdom). These transliterations do not appear in the training set, but our method can still identify them. In addition, our method also finds many transliterations in the Lotus Sutra that are unseen in the Samyuktagama, such as “娑伽羅” (suo-qie-luo, Sagara, the name of the king of the sea world in ancient Indian mythology), “鳩槃茶/鳩槃荼” (jiu-pan-cha/jiu-pan-tu, Kumbhanda, one of a group of dwarfish, misshapen spirits among the lesser deities of Buddhist mythology), and “阿鞞跋致” (a-pi-ba-zhi, Avaivart, “not turn back” in Sanskrit). Since the Lotus Sutra differs from the Samyuktagama in many respects, this shows that a supervised learning model trained on one Buddhist scripture may apply to other Buddhist scriptures translated in different eras by different translators.

We also discovered that transliterations may vary even within the same scripture. In the Samyuktagama, the Sanskrit term “Chandala” (a member of a lower Hindu caste that deals with the disposal of corpses, formerly considered untouchable) has two different transliterations: “旃陀羅” (zhan-tuo-luo) and “栴陀羅” (zhan-tuo-luo). The Sanskrit term “Magadha” (the name of an ancient Indian kingdom) has three different transliterations: “摩竭陀” (mo-jie-tuo), “摩竭提” (mo-jie-ti), and “摩伽陀” (mo-qie-tuo). Such variations in the transliteration of the same word offer clues about the translators and the translation process, and may help the study of historical Chinese phonology and philology.

4.2 Error cases

Although our method can extract and identify most transliterations, some cannot be identified. The error cases can be divided into several categories. In the first, a few terms are not extracted at all, such as “闍維” (she-wei, Jhapita, cremation, a monk’s funeral pyre). This transliteration is rarely used and appears only three times, in the final part of the Samyuktagama. The widely used transliteration of the term “Jhapita” is “荼毘” (tu-pi). Such rarity makes it difficult for the supervised learning model to identify these terms.

The second category is incorrect transliteration boundaries. Sometimes our method extracts terms that are too short, such as “韋提” (wei-ti; the correct transliteration is “韋提希”, wei-ti-xi, Vaidehi, a female personal name), “波羅” (po-luo; the correct transliteration is “波羅柰”, po-luo-nai, Varanasi, a location in northern India), and “瞿利摩羅” (qu-li-mo-luo; the correct transliteration is “央瞿利摩羅”, yang-qu-li-mo-luo, Angulimala, one of the Buddha’s disciples). This problem is due to the probability generated by the language model. For example, the probability of the first two characters of the transliteration “央瞿利摩羅”, P(央瞿), is very low. This causes the CRF model to predict that the first character “央” (yang) does not belong to the transliteration. If more transliterations can be collected to build a better language model, this problem can be overcome.

In some cases, our method extracts terms that are too long, such as “阿那律陀夜” (a-na-lu-tuo-ye; the correct transliteration is “阿那律陀”, a-na-lu-tuo, Aniruddha, one of the Buddha’s closest disciples) and “兒富那婆藪” (er-fu-na-po-sou; the correct transliteration is “富那婆藪”, fu-na-po-sou, Punabbasu, a kind of ghost in Buddhist mythology). In these cases, the preceding or following characters are themselves often used in transliterations, so it is very difficult to distinguish the boundary of the actual transliteration. In addition, there are cases in which one transliteration is immediately followed by another. For example, our method extracts the term “闡陀舍利” (chan-tuo-she-li), which comprises two transliteration terms, “闡陀” (chan-tuo, Chanda, one of the Buddha’s disciples) and “舍利” (she-li, Sarira, Buddhist relics). It is also difficult to separate them without additional semantic clues. Although our method sometimes extracts transliterations with incorrect boundaries, checking the boundary of a transliteration is not difficult for a human expert. Therefore, even the incorrectly extracted transliterations help humanities researchers quickly find and check plausible transliterations.

5 Conclusion

The extraction of transliterated foreign loanwords is an important task in research fields such as historical linguistics and digital humanities. We propose an approach that extracts transliterations automatically from classical Chinese Buddhist scriptures. Our approach combines the conditional random fields method with features designed to identify transliteration characters. The first feature is the probability of each Chinese character being used in transliterations. The second feature is the probability of sequential character bigrams measured by a language model. In addition, functional words and appellation and quantifier words are used as binary features. The transliteration extraction problem is then formulated as a sequential tagging problem, and the CRF method is used to train a model that extracts transliterations from input classical Chinese sentences. To evaluate our method, we constructed an evaluation set from two Buddhist texts, the Samyuktagama and the Lotus Sutra, which were translated into Chinese in different eras. We also constructed a baseline system that combines suffix-array-based extraction with phonetic similarity measurement for comparison. Our method reaches a recall of 0.9561 on the Samyuktagama test set and a precision of 0.9444 on the Lotus Sutra test set. The results show that our method outperforms the baseline system by a wide margin and is effective at extracting transliterations from classical Chinese texts. Our method can find transliterations in the immense body of classical literature to help research fields such as historical linguistics and philology.

References

Y. Goldberg and M. Elhadad. 2008. Identification of transliterated foreign words in Hebrew script. Computational Linguistics and Intelligent Text Processing.

J-S. Kuo, H. Li, and Y-K. Yang. 2007. A phonetic similarity model for automatic extraction of transliteration pairs. ACM Trans. Asian Language Information Processing, 6(2).

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2011. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. of ICML, pages 282–289.

Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

G. Manzini and P. Ferragina. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40(1):33–50.

J. Oh and K. Choi. 2003. A statistical model for automatic extraction of Korean transliterated foreign words. International Journal of Computer Processing of Oriental Languages, 16(1):41–62.

L. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, volume 77.

T. Sherif and G. Kondrak. 2007. Bootstrapping a stochastic transducer for Arabic-English transliteration extraction. Proceedings of Annual Meeting - Association for Computational Linguistics.

Richard Tzong-Han Tsai, Hsieh-Chuan Hung, Cheng-Lung Sung, Hong-Jie Dai, and Wen-Lian Hsu. 2006. On closed task of Chinese word segmentation: An improved CRF model coupled with character clustering and automatically generated template matching. Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pages 134–137.


Effects of Parsing Errors on Pre-reordering Performance for Chinese-to-Japanese SMT

Dan Han¹,², Pascual Martínez-Gómez²,³, Yusuke Miyao¹,², Katsuhito Sudoh⁴, Masaaki Nagata⁴

¹The Graduate University For Advanced Studies
²National Institute of Informatics
³The University of Tokyo
⁴NTT Communication Science Laboratories, NTT Corporation

{handan,pascual,yusuke}@nii.ac.jp
{sudoh.katsuhito,nagata.masaaki}@lab.ntt.co.jp

Abstract

Linguistically motivated reordering methods have been developed to improve word alignment, especially for Statistical Machine Translation (SMT) on distant language pairs. However, since they rely heavily on parsing accuracy, it is useful to explore the relationship between parsing and reordering. For Chinese-to-Japanese SMT, we carry out a three-stage incremental comparative analysis to observe the effects of different parsing errors on reordering performance by combining empirical and descriptive approaches. For the empirical approach, we quantify the distribution of general parsing errors along with reordering quality, whereas for the descriptive approach, we extract seven influential error patterns and examine their correlation with reordering errors.

1 Introduction

Statistical machine translation is a challenging and well-established task in the computational linguistics community. One of the key components of statistical machine translation systems is word alignment, where the words of sentences in a source language are mapped to words of sentences in a target language. When estimating the most appropriate word alignments, it is infeasible to explore every possible word correspondence due to the combinatorial complexity. Considering local permutations of words might be effective for translating between languages with similar sentence structure, but these methods perform poorly when translating between languages with different syntactic structures.

An effective technique for translating sentences between distant language pairs is pre-reordering, where words in source-language sentences are rearranged to resemble the word order of the target language. Rearranging rules are either automatically extracted (Xia and McCord, 2004; Genzel, 2010) or linguistically motivated (Xu et al., 2009; Isozaki et al., 2010; Han et al., 2012; Han et al., 2013). We follow the latter strategy, where the source sentence is parsed to find its syntactic structure, and linguistically motivated rules are used in combination with the structure of the sentence to guide the word reordering. The language pair under consideration is Chinese-to-Japanese, a pair that, despite common roots, is well known for its differing sentence structures.

However, syntax-based pre-reordering techniques are sensitive to parsing errors, and insight into the relationship between the two has been elusive. The contribution of this work is twofold. First, we provide an empirical analysis in which we quantify the aggregated impact of parsing errors on pre-reordering performance. Second, we define seven patterns of the most common and influential parsing errors and carry out a descriptive analysis to examine their relationship with reordering errors. We combine the empirical and descriptive approaches into a three-stage incremental comparative analysis to observe the effect of different parsing errors on reordering performance.

In Section 2, after a brief description of the pre-reordering method that we use for our experiments, we introduce related work on parsing error analysis and on the relation between parsing and machine translation. We describe our analysis methods from a general perspective in Section 3. Then, we carry out the analysis and present the results in Section 4 and Section 5. The last two sections discuss the results, outline future directions, and summarize our findings.


267Copyright 2013 by Dan Han, Pascual Martínez-Gómez, Yusuke Miyao, Katsuhito Sudoh, and Masaaki Nagata

27th Pacific Asia Conference on Language, Information, and Computation pages 267－276

Vb-H    VV VE VC VA P
BEI     LB SB
RM-D    NN NR NT PN OD CD M FW CC ETC LC DEV DT JJ SP IJ ON

Table 1: Lists of POS tags for identifying words as Vb-H, RM-D, and BEI (Han et al., 2013).

2 Background

2.1 Reordering Model

Since the local reordering models integrated in phrase-based SMT systems do not perform well for distant language pairs because of their different syntactic structures, pre-reordering methods have been proposed to improve word alignment. Han et al. (2013) described one of the latest pre-reordering methods (DPC), which is based on dependency parsing. The authors used an unlabeled dependency parser to extract the syntactic information of Chinese sentences and, by combining it with part-of-speech (POS) tags¹, defined a set of heuristic reordering rules to guide the reordering. The essential idea of DPC is to move the so-called verbal block (Vb)² to the right-hand side of its right-most dependent (RM-D), so that a Subject-Verb-Object (SVO) language resembles the word order of a Subject-Object-Verb (SOV) language. Table 1 shows the POS tags from Han et al. (2013) that are used to identify words in a sentence as Vb-H, RM-D, or BEI (a Vb-H involved in a bei-construction).
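The core DPC move described above can be illustrated with a minimal Python sketch. It is not the rule set of Han et al. (2013): the function name `reorder`, the treatment of a directly following aspect marker (AS) as part of the verbal block, and the head indices of the example sentence are assumptions made for illustration, and the rules for sentence roots and bei-constructions are omitted. The tag sets follow our reading of Table 1.

```python
# Simplified sketch of the DPC idea; NOT the full rule set of Han et al. (2013).
VB_H_TAGS = {"VV", "VE", "VC", "VA", "P"}  # verbal-block heads (our reading of Table 1)
RM_D_TAGS = {"NN", "NR", "NT", "PN", "OD", "CD", "M", "FW", "CC",
             "ETC", "LC", "DEV", "DT", "JJ", "SP", "IJ", "ON"}


def reorder(tokens, tags, heads):
    """Move each verbal block to the right of its right-most dependent.

    tokens, tags: parallel lists of words and POS tags.
    heads: heads[i] is the index of the head of token i, or -1 for the root.
    """
    n = len(tokens)
    order = list(range(n))  # current order, expressed as original indices
    for v in range(n):
        if tags[v] not in VB_H_TAGS:
            continue
        # Dependents of v that lie to its right and carry an RM-D tag.
        right_deps = [d for d in range(v + 1, n)
                      if heads[d] == v and tags[d] in RM_D_TAGS]
        if not right_deps:
            continue
        rm_d = max(right_deps)
        # Assumption: aspect markers directly after the verb travel with it.
        block = [v]
        j = v + 1
        while j < n and tags[j] == "AS" and heads[j] == v:
            block.append(j)
            j += 1
        for b in block:
            order.remove(b)
        pos = order.index(rm_d) + 1
        order[pos:pos] = block
    return [tokens[i] for i in order]


# Example sentence of Figure 1; the head indices are a hypothetical gold tree.
tokens = ["他", "去", "书店", "买", "了", "一", "本", "书", "。"]
tags = ["PN", "VV", "NN", "VV", "AS", "CD", "M", "NN", "PU"]
heads = [3, 3, 1, -1, 3, 6, 7, 3, 3]
print(" ".join(reorder(tokens, tags, heads)))
```

Under these assumptions the sketch yields the order “He bookstore went a book buy -ed .”, which matches the SOV-like order described for this example.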

Figure 1 shows an example of an unlabeled dependency parse tree of a Chinese sentence aligned with its Japanese translation. According to the reordering method, “went” will be reordered behind “bookstore” while “buy -ed” will be reordered to the right-hand side of “book”, and thus the sentence will follow an SOV word order as in Japanese. However, if “book” were wrongly recognized as the dependent of “went” in the dependency structure, “went” would be wrongly reordered to the right-hand side of “book”. Therefore, reordering methods based on syntactic structure rely heavily on parsing accuracy. In order to further improve word alignments or refine existing reordering models, it is important to observe the effects of parsing errors on reordering performance.

PinYin:   ta1 qu4 shu1dian4 mai3 le5 yi1 ben3 shu1
Chinese:  他 去 书店 买 了 一 本 书 。
English:  He went (to) bookstore buy -ed a book .
Japanese: 彼 (は) 本屋 (に) 行って 本 (を) 買っ た 。

Figure 1: Example of an unlabeled dependency parse tree of a Chinese sentence (SVO) with words aligned to its Japanese counterpart (SOV). Arrows point from heads to dependents.

1 In this work, POS tag definitions follow the POS tag guidelines of the Penn Chinese Treebank v3.0.

2 According to Han et al. (2013), a Vb includes the head of the Vb (Vb-H) and an optional component (Vb-D).

In this analysis, we use this state-of-the-art pre-reordering model for our experiments because it is a rule-based pre-reordering method for a distant language pair, it is based on dependency parsing, and it is extensible to other language pairs.

2.2 Related Work

Although there are studies that analyze parsing errors and reordering errors, as far as we know, no work has examined the relationship between these two types of errors.

The work most relevant to ours is Quirk and Corston-Oliver (2006), who observed the impact of parsing accuracy on an SMT system. They showed the general idea that syntax-based SMT models are sensitive to syntactic analysis. However, they did not further analyze the concrete parsing error types that affect task accuracy.

Green (2011) explored the effects of noun phrase bracketing in dependency parsing in English, and further on English-to-Czech machine translation, but the work focused on using noun phrase structure to improve a machine translation framework. Katz-Brown et al. (2011) proposed a training method that improves a parser’s performance by using reordering quality to assess parse quality, but they did not study the relationship between reordering quality and parse quality.

There is more work on parsing error analysis. For instance, Hara et al. (2009) defined several types of parsing error patterns on predicate-argument relations and tested them with a Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994) parser (Miyao and Tsujii, 2008). McDonald and Nivre (2007) explored parsing errors for data-driven dependency parsing by comparing a graph-based parser with a transition-based parser, which represent two dominant parsing models. Dredze et al. (2007) provided a comparative analysis of differences in annotation guidelines among treebanks, which were suspected to be responsible for dependency parsing errors in domain adaptation tasks. Rather than analyzing parsing errors, Yu et al. (2011) focused on the difficulties in Chinese deep parsing by comparing the linguistic properties of Chinese and English.

There is also work on reordering error analysis, such as Han et al. (2012), which examined an existing reordering method and refined it after a detailed linguistic analysis of reordering issues. Although they discovered that parsing errors affect reordering quality, they did not observe the concrete relationship. On the other hand, Gimenez and Marquez (2008) proposed an automatic error analysis method for machine translation output by compiling a set of metric variants. However, they did not provide insight into which SMT component caused low translation performance.

3 Analysis Method

We combine an empirical approach with a descriptive approach to observe the effects of parsing errors on pre-reordering performance in three stages: a preliminary experiment stage, a POS tag level stage, and a dependency type level stage. First, we give a general idea of how sensitive the reordering method is to parsing errors. Then, we use POS tags to identify parsing errors and quantify their aggregate impact on reordering performance. Finally, we define several concrete error patterns and examine their effects on reordering quality.

To test for an upper bound on reordering performance and to examine the specific parsing errors that affect reordering, one way is to contrast reordering based on error-free parse trees with reordering based on auto-parse trees. Error-free parse trees are considered gold trees.

In the preliminary experiment stage, we set up two benchmarks in two scenarios. In scenario 1, the benchmark is the set of Chinese sentences manually reordered on the basis of the Japanese references. By separately measuring the word order similarity between the benchmark and the gold-tree based reordered sentences, and between the benchmark and the auto-parse tree based reordered sentences, we quantify the extent to which parsing errors influence reordering. The former measurement additionally gives a general figure for the upper bound of the reordering method. However, since setting up the benchmark in scenario 1 is both time-consuming and labor-intensive, we use the Japanese references as the benchmark in scenario 2 and follow the same strategy as in scenario 1 to calculate the word order similarities. A more detailed description of the preliminary experiment is given in Section 4.

In the POS tag level stage, we compare gold trees with auto-parse trees, along with reordering quality, to explore the relationship between general parsing errors and reordering from two aspects: the percentages of the three most frequent POS tags of dependents that point to wrong heads, and the percentages of the two most frequent POS tags of tokens that are wrongly recognized as heads. The percentages of other POS tags are not provided because they are negligible. Our objective is to profile the distribution of general parsing errors. However, this does not imply that those errors are the cause of the reordering errors. Section 5.1 includes more concrete analysis results.

In the dependency type level stage, we classify the parsing errors most influential on reordering into three superclasses and seven subclasses according to the methodology of the reordering method. We then plot the distribution of these parsing errors for various reordering qualities. In Section 5.2, we illustrate these parsing errors with examples.

4 Preliminary Experiment

4.1 Gold Data

To build gold parse tree sets for comparison, we used the annotated sentences from Chinese Penn Treebank ver. 7.0 (CTB-7), a well-known corpus that consists of parsed text in five genres: Chinese newswire (NS), magazine news (NM), broadcast news (BN), broadcast conversation programs (BC), and web newsgroups and weblogs (NW).

We first randomly selected 517 unique sentences (hereinafter set-1) from all five genres in the development set of CTB-7, which is split according to Wang et al. (2011). However, we found that sentences in BC and NW are mainly from spoken language and tend to have faults such as repetitions, incomplete sentences, corrections, or incorrect sentence segmentation. Therefore, we randomly selected another 2,126 unique sentences (hereinafter set-2) limited to three genres: NS, NM, and BN. Table 2 shows the statistics of all selected sentences in the five genres.

        BN     BC     NM     NS     NW     Total
set-1   100    100    100    117    100    517
set-2   797    -      578    751    -      2,126
Total   897    100    678    868    100    2,643
AL      29.8   20.0   33.5   28.4   25.9   29.8
Voc.    5.5K   690    5K     5.1K   972    9.5K

Table 2: Statistics of selected sentences in five genres of CTB-7. AL stands for the average length of sentences, and Voc. for vocabulary.

To convert the CTB-7 parsed text to dependency parse trees, we used the open utility Penn2Malt³, which converts Penn Treebank data into MaltTab format containing dependency information. Since the head rules that Penn2Malt recommends on its website do not cover three new annotation types in CTB-7, we added three new rules for them as follows: FLR (Fillers) and DFL (Disfluency) are headed by the right-hand branch; INC (Incomplete sentences) follows the same head rule as FRAG (Fragment).

Meanwhile, professional human translators translated all Chinese sentences in both set-1 and set-2 into Japanese. Then, following the Japanese references, a bilingual speaker of Chinese and Japanese manually reordered the Chinese sentences in set-1 into the same word order as their Japanese counterparts for the experiments in scenario 1. For example, the Chinese sentence in Figure 1 follows the word order “He bookstore went (to) a book buy (-ed) .” in the handcrafted reordered set, since this resembles the Japanese word order.

4.2 Evaluation

We use Kendall’s tau (τ) rank correlation coefficient (Isozaki et al., 2010) to measure word order similarities between sentences in two different scenarios. In the first scenario, we use the set of manually reordered Chinese sentences from set-1 as the benchmark and compare it with the set of automatically reordered Chinese sentences. In the second scenario, we combine set-1 and set-2 to obtain a larger data set. The set of Japanese references plays the role of benchmark and is compared with the set of automatically reordered Chinese sentences. Word alignments are produced by MGIZA++ (Gao and Vogel, 2008).

              Baseline   Gold-DPC   Auto-DPC
M-reordered   0.82       0.90       0.88
Gold-DPC      -          -          0.95

Table 3: The average value of Kendall’s tau (τ) over 517 Chinese sentences, comparing manually reordered sentences, unreordered sentences, and automatically reordered sentences. M-reordered is short for manually reordered.

3 http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html
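As an illustration of the word order similarity measure used in this section, the following Python sketch computes the standard Kendall’s tau of a permutation. This is a hedged illustration only: the function name `kendall_tau` is ours, and the exact variant used in the experiments (for example, a normalized form following Isozaki et al., 2010) may differ from this textbook definition, which ranges from -1 to 1.

```python
from itertools import combinations


def kendall_tau(order):
    """Standard Kendall's tau of a sequence of benchmark positions.

    order[i] is the position, in the benchmark word order, of the i-th word
    of the sentence being evaluated. Returns 1.0 for identical orders and
    -1.0 for fully reversed orders.
    """
    n = len(order)
    if n < 2:
        return 1.0
    concordant = discordant = 0
    for a, b in combinations(order, 2):  # pairs in sentence order
        if a < b:
            concordant += 1
        elif a > b:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Unreordered sentence of Figure 1 measured against its manual reordering
# "He bookstore went (to) a book buy (-ed) ." (positions worked out by hand).
baseline_vs_manual = [0, 2, 1, 6, 7, 3, 4, 5, 8]
print(round(kendall_tau(baseline_vs_manual), 3))
```

In this example 7 of the 36 word pairs are in a different relative order, so the value is (29 - 7) / 36.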

In both scenarios, we apply the reordering method DPC (see Section 2.1). Auto-parse trees are generated by an unlabeled Chinese dependency parser, Corbit⁴ (Hatori et al., 2011). Gold trees⁵ are converted from the CTB-7 parsed text, which was created by human annotators. More specifically, we refer to the auto-parse tree based reordering system as Auto-DPC and to the gold-tree based reordering system as Gold-DPC. The baseline system uses unreordered Chinese sentences.

Scenario 1 Our preliminary observation of the effects of parsing errors on reordering performance compares the word order similarity between manually reordered Chinese sentences and automatically reordered Chinese sentences from set-1. Table 3 shows the average τ values.

For the baseline system, the average τ value shows how similar the 517 unreordered Chinese sentences are to their manually reordered versions. Compared against the manually reordered Chinese, both Auto-DPC and Gold-DPC achieved higher average τ values than the baseline, which implies that the reordering method DPC reordered the Chinese sentences in the right direction and improved the word alignment. Nevertheless, the slightly lower average τ value of Auto-DPC shows that DPC is sensitive to parsing errors. This is also confirmed by the average τ value between Auto-DPC and Gold-DPC. However, the differences in τ values are small. We hence enlarge the test data by adding set-2 for further experiments in scenario 2.

Scenario 2 Since we do not have manually reordered Chinese sentences as a benchmark for set-2, we calculate Kendall’s tau between the Chinese sentences and their Japanese counterparts for both data sets by using the MGIZA++ alignment file, ch-ja.A3.final. The comparison indicates how monotonically the Chinese sentences have been reordered to align with the Japanese. We use MeCab⁶ (Kudo and Matsumoto, 2000) to segment the Japanese sentences and filter out sentences with more than 64 tokens. There are 2,236 valid Chinese-Japanese bilingual sentences in total. Figure 2 shows the distribution of Kendall’s tau for the three systems, where the baseline uses the original, unreordered Chinese.

Figure 2: The distribution of Kendall’s tau values for 2,236 bilingual sentences (Chinese-Japanese) in which the Chinese is from three systems: baseline, Auto-DPC, and Gold-DPC. (Plot of percent of sentences, 0% to 40%, against Kendall’s tau (τ), from 1 down to 0.5, for Baseline, Gold-DPC, and Auto-DPC.)

4 http://triplet.cc/software/corbit

5 Note that Corbit was tuned with the development set of CTB-7.
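To make the alignment-based measurement more concrete, the sketch below reads one alignment line in the layout commonly written by GIZA++-style tools to `*.A3.final` files (`NULL ({ }) word ({ i j }) ...`) and scores how monotone it is. Several points are assumptions rather than details taken from the experiments: the line layout, the choice of the smallest aligned target index for a word with several links, the skipping of unaligned words, and the helper names `target_positions` and `monotonicity_tau`. The sample line is hand-written from the example in Figure 1, not real MGIZA++ output.

```python
import re
from itertools import combinations

# One "word ({ i j ... })" group of an A3-style alignment line (assumed layout).
ALIGN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def target_positions(a3_alignment_line):
    """Return, for each aligned word in line order, its first target index."""
    seq = []
    for _word, idxs in ALIGN_RE.findall(a3_alignment_line)[1:]:  # skip NULL
        positions = [int(t) for t in idxs.split()]
        if positions:  # unaligned words are skipped (assumption)
            seq.append(min(positions))
    return seq


def monotonicity_tau(seq):
    """Standard Kendall's tau of the target-position sequence (ties ignored)."""
    n = len(seq)
    if n < 2:
        return 1.0
    score = 0
    for a, b in combinations(seq, 2):
        score += (a < b) - (a > b)
    return score / (n * (n - 1) / 2)


# Hand-written line for the unreordered Chinese sentence of Figure 1.
line = ("NULL ({ }) 他 ({ 1 }) 去 ({ 3 }) 书店 ({ 2 }) 买 ({ 5 }) "
        "了 ({ 6 }) 一 ({ }) 本 ({ }) 书 ({ 4 }) 。 ({ 7 })")
seq = target_positions(line)
print(seq, round(monotonicity_tau(seq), 3))
```

A fully monotone alignment gives 1.0, so a successful pre-reordering should push the value for each sentence toward 1.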

In Figure 2, the baseline system contains a large number of non-monotonically aligned sentences, whereas both Auto-DPC and Gold-DPC increased the number of sentences that achieved high τ values. Reordering based on gold trees reduced the percentage of low-τ sentences more than reordering based on automatically parsed trees. In particular, the difference in the number of sentences with 0.9 < τ ≤ 1 between Gold-DPC and Auto-DPC shows that the reordering method DPC is highly sensitive to parsing errors, which reinforces the conclusion from the preliminary observation in scenario 1. Furthermore, the performance of Gold-DPC sketches the upper bound of the reordering method.

5 Analysis on Causes of Reordering Errors

The preliminary experiments in Section 4 provide a general idea of the effects of parsing errors on reordering. To obtain a more explicit relationship between specific parsing errors and reordering issues, we first identify concrete parsing errors by comparing gold trees with auto-parse trees. Since the syntactic information that guides reordering in DPC is limited to dependency structure and POS tags, we examine parsing errors from these two linguistic categories when analyzing the causes of reordering errors. In this section, the value of Kendall’s tau measures the word order similarity between Gold-DPC and Auto-DPC.

Chinese: 他 去 书店 买 了 一 本 书 。
English: He went (to) bookstore buy -ed a book .
POS tag: PN VV NN VV AS CD M NN PU

Figure 3: A possible wrong dependency parse tree of the example in Figure 1.

6 http://mecab.googlecode.com

5.1 Part-of-Speech Tag Error

A token in a dependency parse tree can exhibit two types of parsing errors. In the first, the token points to a wrong head (a dependent-error); in the second, the token is wrongly recognized as the head of other tokens (a head-error). For example, Figure 3 presents a possible wrong parse tree of the example shown in Figure 1. Compared with the gold tree in Figure 1, the tokens (POS tags) “he (PN)”, “went (VV)”, “bookstore (NN)”, “buy (VV)”, “a (CD)”, and “. (PU)” in the dependency tree in Figure 3 all point to wrong heads, which are dependent-errors. At the same time, the tokens “went (VV)”, “buy (VV)”, and “book (NN)” are wrongly recognized as heads of other tokens (e.g., “he”, “bookstore”, “a”), which are head-errors. By definition, every head-error has at least one corresponding dependent-error. However, when a token is not the root in the gold tree but is the root in the wrong tree, it is a dependent-error with no corresponding head-error. An example is the dependent-error “went (VV)” in Figure 3.
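The two definitions above can be sketched in a few lines of Python. The function names and the two head arrays are assumptions for illustration: the arcs of Figures 1 and 3 are not reproduced here, so the hypothetical gold and auto trees below were chosen only so that they produce the same dependent-errors and head-errors as described in the text.

```python
from collections import Counter


def parse_errors(gold_heads, auto_heads):
    """Return (dependent_errors, head_errors) as lists of token indices.

    A dependent-error is a token whose head differs from the gold head.
    A head-error is a token that wrongly governs at least one such token
    in the auto-parse tree. Heads are token indices, -1 marks the root.
    """
    dependent_errors = [d for d, (g, a) in enumerate(zip(gold_heads, auto_heads))
                        if g != a]
    head_errors = sorted({auto_heads[d] for d in dependent_errors
                          if auto_heads[d] != -1})
    return dependent_errors, head_errors


def count_by_tag(indices, tags):
    """Count error tokens per POS tag."""
    return Counter(tags[i] for i in indices)


#        他    去    书店  买    了    一    本   书    。
tags = ["PN", "VV", "NN", "VV", "AS", "CD", "M", "NN", "PU"]
gold = [3, 3, 1, -1, 3, 6, 7, 3, 3]   # hypothetical gold tree
auto = [1, -1, 3, 1, 3, 7, 7, 3, 1]   # hypothetical wrong tree

dep_err, head_err = parse_errors(gold, auto)
print(count_by_tag(dep_err, tags))
print(count_by_tag(head_err, tags))
```

Note that the token wrongly parsed as the root (“went”) contributes a dependent-error but adds no head-error of its own, because the artificial root is not a token.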

We count the number of POS tag mis-recognitions separately for dependent-errors and head-errors. In the example of Figure 3, the dependent-error counts are 2 for VV and 1 each for PN, NN, CD, and PU. The head-error counts are 2 for VV and 1 for NN. In our analysis, we compute these counts for all POS tags in every sentence in our data set. However, our reordering method performed differently on each sentence, and the reordering quality varied from sentence to sentence. To observe the correlation between reordering quality and each type of error, we first group sentences according to their Kendall’s τ values. Then, we compute the proportion of each type of POS tag error at each τ value.

Figure 4: The distribution of the top three dependent-error POS tags and their tendency lines. (Plot of percent of errors, 0% to 80%, against Kendall’s tau (τ), 0.1 to 1, for VV, PU, and NN, each with a 2-period moving average.)
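The grouping step can be sketched as follows. The bucketing rule (rounding τ to one decimal), the function name `error_proportions`, and the toy input are assumptions made for illustration; the exact grouping used for the figures is not specified in the text.

```python
from collections import Counter, defaultdict


def error_proportions(sentences, ndigits=1):
    """Group sentences by rounded tau and return per-group tag proportions.

    sentences: iterable of (tau, Counter mapping POS tag -> error count).
    Returns {tau_group: {tag: share of all errors in that group}}.
    """
    buckets = defaultdict(Counter)
    for tau, tag_errors in sentences:
        buckets[round(tau, ndigits)].update(tag_errors)  # adds the counts
    result = {}
    for group, counts in sorted(buckets.items()):
        total = sum(counts.values())
        if total:  # skip groups without any error
            result[group] = {tag: c / total for tag, c in counts.items()}
    return result


# Toy data: (Kendall's tau of the sentence, dependent-error counts per tag).
toy = [
    (0.52, Counter(VV=3, NN=1)),
    (0.48, Counter(VV=1)),
    (0.91, Counter(NN=2, PU=2)),
]
for group, shares in error_proportions(toy).items():
    print(group, shares)
```

Normalizing within each group, rather than over the whole data set, is what lets groups with very different numbers of sentences be compared on one plot.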

Figure 4 shows the distribution of the top three dependent-error POS tags, that is, the three POS tags that most frequently point to a wrong head in auto-parse trees. VV represents all verbs except predicative adjectives (VA), the copula (VC), and you3⁷ as the main verb (VE). PU represents punctuation, and NN represents all nouns except proper nouns (NR), temporal nouns (NT), and nouns for locations that cannot modify verb phrases with or without de0⁸. Dependent-errors on VV account for a larger proportion in sentences with low reordering accuracy, whereas more NN dependent-errors occur in sentences with high reordering accuracy. The proportion of PU dependent-errors, on the other hand, is more consistent.

Figure 5 shows the distribution of the top two head-error POS tags, that is, the two POS tags that are most frequently recognized wrongly as heads in auto-parse trees. Compared with Figure 4, the tendencies of both VV and NN are the same but more pronounced.

The proportion distributions of dependent-error POS tags and head-error POS tags across sentence groups of different reordering quality show that sentences with low reordering accuracy contain more parsing errors on verbs than on nouns, and thus parsing errors on verbs have more influence on reordering performance. However, it is still difficult to reveal the effects of more concrete parsing errors on reordering, considering that not all verb parsing errors influence the reordering. As an illustration, in Figure 3, if the head of “bookstore” were “went”, the VV head-error of “went” would not cause any reordering error, since it would be reordered consistently to the right-hand side of its RM-D “bookstore”. Consequently, in the next section we use a descriptive approach to analyze dependency types and explore the effects of more concrete parsing errors.

Figure 5: The distribution of the top two head-error POS tags and their tendency lines. (Plot of percent of errors, 0% to 100%, against Kendall’s tau (τ), 0.1 to 1, for VV and NN, each with a 2-period moving average.)

7 A Chinese character that expresses possession and existence.

8 A Chinese character that is specially used to connect a verb phrase and its modifier.

5.2 Dependency Type Error

As introduced in Section 2.1, DPC first identifies the Vb and its RM-D, and then reorders the necessary words. Thus, DPC reorders not only the Vb-H but also the Vb-D in a Vb, which means that a failure to identify Vbs may also cause unexpected reordering of particles, such as aspect markers. However, in this work, we focus only on reordering issues of Vb-H candidates⁹. To discover the effects of more concrete parsing errors on reordering, we distinguish three categories of dependency types: ROOT, RM-D, and BEI. ROOT denotes whether the Vb-H candidate is the root of the sentence, RM-D is the right-most object dependent of the Vb-H candidate if it has one, and BEI denotes whether the Vb-H candidate is involved in a bei-construction.

Following the methodology of the reordering method DPC, we define seven patterns of parsing error phenomena and classify them into three types by comparing the gold tree (GT) with the auto-parse tree (Corbit-tree, CT). Table 4 lists all parsing error patterns in the three error types (ROOT error, RM-D error, and BEI error), defined in terms of the three dependency types ROOT, RM-D, and BEI. The symbols “√”, “×”, and “?” represent the status of a certain dependency type in the gold tree or Corbit-tree. For every Vb-H candidate, the six statuses are the conditions for matching an error pattern. For example, to match the Root-C error pattern, the Vb-H candidate needs to satisfy the following conditions: in the gold tree, it is not the root and does not have any RM-D or bei dependent; in the Corbit-tree, it does not have any RM-D or bei dependent, but it is the root.

Error type   Pattern   BEI       ROOT      RM-D
                       GT  CT    GT  CT    GT  CT
ROOT error   Root-C    ×   ×     ×   √     ×   ×
             Root-G    ×   ×     √   ×     ×   ×
RM-D error   RM D-C    ×   ×     ×   ×     ×   √
                       ×   ×     ×   √     ×   √
                       ×   ×     √   ×     ×   √
                       ×   ×     √   √     ×   √
             RM D-G    ×   ×     ×   ×     √   ×
                       ×   ×     ×   √     √   ×
                       ×   ×     √   ×     √   ×
                       ×   ×     √   √     √   ×
             RM D-D    ×   ×     ×   ×     √   diff.
                       ×   ×     ×   √     √   diff.
                       ×   ×     √   ×     √   diff.
                       ×   ×     √   √     √   diff.
BEI error    BEI-C     ×   √     √   ?     ×   ?
                       ×   √     ×   ?     √   ?
                       ×   √     √   ?     √   ?
             BEI-G     √   ×     ?   √     ?   ×
                       √   ×     ?   ×     ?   √
                       √   ×     ?   √     ?   √

Table 4: Seven error patterns (Root-C, Root-G, RM D-C, RM D-G, RM D-D, BEI-C, BEI-G) that cause three types of reordering issues (ROOT error, RM-D error, and BEI error). GT stands for gold-tree, and CT stands for Corbit-tree. The symbols “√”, “×”, and “?” represent the status of True, False, and Unknown, respectively. “diff.” means that the RM-Ds exist in both GT and CT but are different.

9 We use “Vb-H candidate” in this work because if the Vb-H is involved in a bei-construction, then it cannot be a Vb-H according to Han et al. (2013).
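The matching of a Vb-H candidate against the seven patterns can be written as a small decision function. This is a sketch based on our reading of the conditions in Table 4, not the analysis code itself: the dictionary keys, the function name `classify`, and the token indices in the usage lines are hypothetical.

```python
def classify(gt, ct):
    """Match the six statuses of one Vb-H candidate against the error patterns.

    gt and ct describe the candidate in the gold tree and the Corbit-tree:
      "bei":  True if it is involved in a bei-construction,
      "root": True if it is the root of the sentence,
      "rmd":  index of its RM-D token, or None if it has none.
    Returns the pattern name, or None if no pattern matches.
    """
    if gt["bei"] != ct["bei"]:
        # A BEI error matters only if the bei-free tree triggers reordering.
        free = gt if ct["bei"] else ct
        if free["root"] or free["rmd"] is not None:
            return "BEI-C" if ct["bei"] else "BEI-G"
        return None
    if gt["bei"]:
        return None  # bei-construction in both trees: no pattern listed
    g, c = gt["rmd"], ct["rmd"]
    if g is None and c is not None:
        return "RM D-C"
    if g is not None and c is None:
        return "RM D-G"
    if g is not None and c is not None:
        return "RM D-D" if g != c else None
    # No RM-D in either tree: only the root status can differ.
    if not gt["root"] and ct["root"]:
        return "Root-C"
    if gt["root"] and not ct["root"]:
        return "Root-G"
    return None


# Root-C: not root in GT, root in CT, no RM-D and no bei in either tree.
print(classify({"bei": False, "root": False, "rmd": None},
               {"bei": False, "root": True, "rmd": None}))
# RM D-G: the RM-D found in GT (hypothetical token 2) is missed in CT.
print(classify({"bei": False, "root": False, "rmd": 2},
               {"bei": False, "root": True, "rmd": None}))
```

Checking the RM-D conditions before the root conditions mirrors the table, where the ROOT patterns apply only when neither tree has an RM-D.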

Root-C is the case where a Vb-H candidate has been wrongly parsed as the root of the sentence. However, it only affects the reordering under two constraints, namely that the RM-D of the Vb-H candidate does not exist and that the Vb-H is not involved in a bei-construction. For instance, the Vb-H “should” in the example of Figure 6 was recognized as the root in the auto-parse tree in Figure 6b, whereas the actual root is the Vb-H “is” in the gold tree of Figure 6a. Since “should” has no BEI or RM-D dependent in either GT or CT, it will be incorrectly reordered to the end of the sentence according to the CT, whereas according to the GT it will not be reordered, as it is already in the same position as its Japanese counterpart.

Chinese:  应该 说 军舰 加入 海军 对 战力 是 有 提升 。
English:  should say warship(s) join navy combat power is improve(d) .
POS tag:  VV VV NN VV NN P NN VC VV NN PU
Japanese: 結論的 (に) 言う (と) 軍艦 (が) 海軍 (に) 加わって 戦力 (は) 向上 する でしょう 。

(a) Gold tree. (b) A possible wrong parse tree.

Figure 6: An example for parsing error patterns of Root-C and RM D-D. English translation: One should say that the additions of warships will help to improve the navy’s combat power.

Root-G is the opposite of Root-C: a Vb-H candidate is the root of the sentence but was not parsed as the root in the CT. This affects the reordering under the same two constraints as Root-C. Figure 7b shows an example of Root-G. In Figure 7a, the word alignment shows that the Vb-H “agree” should be reordered to the end of the sentence. However, it will not be reordered under the wrong parse tree shown in Figure 7b.

RM D-C is the case where the RM-D of a Vb-H candidate exists in the CT but not in the GT. In other words, an RM-D candidate was attached to the wrong head. There are four combinations of the ROOT and BEI statuses of the Vb-H candidate that lead to incorrect reorderings. The Vb-H “agree” in Figure 7c matches the last combination of RM D-C and will be reordered right after “journalist” instead of to the end of the sentence.

RM D-G is the opposite of RM D-C: the RM-D of a Vb-H candidate was missed in the CT. There are also four cases of reordering errors according to the statuses of BEI, ROOT, and RM-D. The Vb-H “went” in Figure 3 matches the second combination of RM D-G, so it cannot be reordered after “bookstore”.

RM D-D is the case where a bei-construction-free Vb-H candidate obtains two different RM-D candidates in the CT and the GT, which causes a reordering issue. In Figure 6, the Vb-H “join” received different RM-Ds in the two trees. According to the word alignment, it should be reordered next to “navy” instead of “combat power”.

Chinese:  他 同意 记者 为 他 拍照 。
English:  He agree journalist for him photo .
POS tag:  PN VV NN P PN VV PU
Japanese: 彼 (は) 記者 (に) 写真 (を) 取る (こと) 許可 した 。

(a) Gold-tree. (b) A possible erroneous parse tree. (c) Another possible erroneous parse tree.

Figure 7: An example for parsing error patterns of Root-G and RM D-C. English translation: He agreed to the journalist to take a picture of him.

BEI-C is the case where a Vb-H candidate received a wrong BEI dependent in the CT. This prevents reordering regardless of whether the Vb-H candidate has an RM-D or is the root.

BEI-G is the opposite of BEI-C: the Vb-H will not be reordered in the GT, but it will be in the CT.

After defining the seven patterns of parsing errors and classifying them into three types, we calculate the average frequency proportion of each type in sentence groups of different τ values.

Figure 8 shows the distribution of the three types of parsing errors and their tendencies. ROOT errors account for higher proportions in low-τ sentences and relatively lower proportions in high-τ sentences. RM-D errors follow the opposite tendency. This implies that the effects of ROOT errors on reordering are stronger than those of RM-D errors. The reason could be that ROOT errors cause long-distance reordering failures while RM-D errors lead to more local reordering errors. Since there are very few BEI errors, it was difficult to capture their trends.

Figure 9 and Figure 10 show the correlations between parsing error patterns and reordering accuracy. Among ROOT errors, Root-C had a larger percentage than Root-G in sentences with low reordering accuracy, which shows that a Vb-H candidate without any object dependent tends to be recognized as the root by the parser. This is consistent with the distribution shown in Figure 10. The RM D-G error pattern had a larger percentage than the other two patterns, which also implies that a Vb-H candidate in a CT tends to have fewer or no object dependents.

Figure 8: Distribution of three types of parsing errors in different τ groups and their trend curves. (Plot of percent of errors, 0% to 100%, against Kendall’s tau (τ), 0.1 to 1, for ROOT, RM-D, and BEI, each with a 2-period moving average.)

Figure 9: Distribution of patterns of ROOT error in different τ groups and their trend curves. (Plot of percent of errors, 0% to 100%, against Kendall’s tau (τ), 0.1 to 1, for ROOT-G and ROOT-C, each with a 2-period moving average.)

5.3 Further Analysis Possibilities

Because of time limitations, we focused only on parsing errors that cause reordering issues for Vb-H candidates when defining the error patterns. However, DPC reorders not only Vb-H candidates but also other words, such as Vb-D candidates and particles. It would also be meaningful to explore the parsing error patterns that cause unexpected reordering of these words, and the correlation between them.

The current study of influential parsing errors is not exhaustive. Another possibility would be to explore which types of parsing errors do not affect reordering, so that parsers can sacrifice performance on those types in order to improve on the influential ones.

Figure 10: The distribution of different patterns of RM-D error in different τ groups. (Plot of percent of errors, 0% to 40%, against Kendall’s tau (τ), 0.1 to 1, for RMD-C, RMD-G, and RMD-diff., each with a 2-period moving average.)

6 Discussion and Future Work

Two important research directions concentrate on either improving parsers or developing linguistically motivated pre-reordering methods. We believe that analyzing the link between these directions can help refine future developments.

We observed relatively small effects of parsing errors on reordering quality. However, reordering quality affects word alignments, which in turn affect the quality of the extracted bilingual phrases. It would be interesting to extend this work to quantify the propagation of parsing and reordering errors through SMT pipelines, to observe the factored effect on overall MT quality.

We found that not all POS tagging and parsing errors correlate equally with reordering quality. In the case of the DPC reordering method, mis-recognitions of VV words correlate with low reordering performance, whereas mis-recognitions of NN words had a smaller impact. Indeed, DPC relies heavily on detecting verbal blocks that are candidates for reordering, and systems that use the same strategy should choose POS taggers that display high accuracy in VV recognition.

One of the key characteristics of DPC is its ability to correctly reorder sentences with reported speech constructions. For that purpose, it is crucial for parsers to recognize the sentence root, and our analysis demonstrated that systems following a similar strategy should rely on parsers that recognize the sentence root with high accuracy.

In general, we believe that future developments of syntax-based pre-reordering methods would benefit from a preliminary analysis of POS tagging and parsing accuracies. In the case of linguistically motivated pre-reordering methods, reordering rules could be designed to be more robust against unreliable POS tags or unreliable dependency relations. For automatically learned reordering rules, systems could be designed to make use of N-best lists of certain POS tags or dependencies that are critical but that parsers cannot reliably provide.

There are other popular syntax-based pre-reordering methods that may use different types of parsing grammars (e.g., Head-driven Phrase Structure Grammar), and a similar analysis would also be interesting in those contexts, possibly with a larger set of gold parsed and reordered sentences. Additionally, researchers interested in developing POS taggers and parsers with the objective of aiding pre-reordering could attempt to maximize the accuracy of the POS tags or dependencies that are relevant to the reordering task, possibly at the expense of lower accuracy on other elements.

7 Conclusion

In this work, we carried out a linguistically motivated analysis, combining empirical and descriptive approaches in three stages, to examine the effects of different parsing errors on pre-reordering performance. We achieved four objectives: (i) quantify the effects of parsing errors on reordering, (ii) estimate upper bounds on the performance of the reordering method, (iii) profile general parsing errors, and (iv) examine the effects of specific parsing errors on reordering.

In the first stage, we set up benchmarks in two scenarios for reordered Chinese sentences. By calculating the word order similarity between the benchmarks and the Chinese sentences reordered automatically on the basis of dependency parse trees, we quantified the correlation between parsing errors and reordering accuracy and explored the upper bound on the reordering quality of the reordering model.
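To make the notion of word order similarity concrete, the following Python sketch scores an automatically reordered sentence against a benchmark order. This section does not name the similarity measure, so Kendall's tau is assumed here as one common choice for comparing word orders; the function name `kendall_tau` and the example permutations are likewise illustrative.

```python
from itertools import combinations


def kendall_tau(reference_order, system_order):
    """Kendall's tau between two orderings of the same distinct word indices.

    Returns 1.0 for identical orders and -1.0 for fully reversed orders.
    """
    n = len(reference_order)
    if len(set(reference_order)) != n or sorted(reference_order) != sorted(system_order):
        raise ValueError("both orders must contain the same distinct items")
    if n < 2:
        return 1.0
    rank = {word: i for i, word in enumerate(reference_order)}
    ranks = [rank[word] for word in system_order]
    pairs = n * (n - 1) // 2
    concordant = sum(1 for a, b in combinations(ranks, 2) if a < b)
    discordant = pairs - concordant
    return (concordant - discordant) / pairs


# Hypothetical example: source word indices in benchmark vs. system order.
benchmark = [0, 1, 2, 3]
auto_reordered = [0, 2, 1, 3]  # one adjacent pair is swapped
tau = kendall_tau(benchmark, auto_reordered)
print(round(tau, 4))
```

Averaging such scores over a test set, once with gold parses and once with automatic parses, would give one way to estimate how much reordering quality is lost to parsing errors.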

In the second stage, we examined the effects of two types of parsing errors on reordering quality by using POS tag information. The distributions of the POS tags involved in parsing errors provide a general view of the influential parsing error types and an approximation of the cause of the effects.

In the last stage, we defined several patterns of parsing errors that assuredly cause reordering errors, using dependency types as the linguistic feature, based on a deep linguistic study of the syntactic structures and the reordering model. The analysis results help us achieve a better and more explicit understanding of the relationship between parsing errors and reordering performance. Furthermore, we captured the effects of more concrete parsing errors on reordering.

PACLIC-27


References

Mark Dredze, John Blitzer, Partha Pratim Talukdar, Kuzman Ganchev, Joao Graca, and Fernando Pereira. 2007. Frustratingly hard domain adaptation for dependency parsing. In Proc. of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 1051–1055.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57.

Dmitriy Genzel. 2010. Automatically learning source-side reordering rules for large scale machine translation. In Proc. of COLING, pages 376–384.

Jesus Gimenez and Lluis Marquez. 2008. Towards heterogeneous automatic MT error analysis. In Proc. of the 6th International Conference on Language Resources and Evaluation, pages 1894–1901.

Nathan Green. 2011. Effects of noun phrase bracketing in dependency parsing and machine translation. In Proc. of ACL-HLT, Student Session, pages 69–74.

Dan Han, Katsuhito Sudoh, Xianchao Wu, Kevin Duh, Hajime Tsukada, and Masaaki Nagata. 2012. Head finalization reordering for Chinese-to-Japanese machine translation. In Proc. of the ACL 6th Workshop on SSST, pages 57–66.

Dan Han, Pascual Martínez-Gómez, Yusuke Miyao, Katsuhito Sudoh, and Masaaki Nagata. 2013. Using unlabeled dependency parsing for pre-reordering for Chinese-to-Japanese statistical machine translation. In Proc. of the ACL Second Workshop on Hybrid Approaches to Translation, pages 25–33.

Tadayoshi Hara, Yusuke Miyao, and Jun’ichi Tsujii. 2009. Descriptive and empirical approaches to capturing underlying dependencies among parsing errors. In Proc. of EMNLP, pages 1162–1171.

Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Junichi Tsujii. 2011. Incremental joint POS tagging and dependency parsing in Chinese. In Proc. of 5th IJCNLP, pages 1216–1224.

Hideki Isozaki, Katsuhito Sudoh, Hajime Tsukada, and Kevin Duh. 2010. Head finalization: A simple reordering rule for SOV languages. In Proc. of WMTMetricsMATR, pages 244–251.

Jason Katz-Brown, Slav Petrov, Ryan McDonald, Franz Och, David Talbot, Hiroshi Ichikawa, and Masakazu Seno. 2011. Training a parser for machine translation reordering. In Proc. of the 2011 Conference on EMNLP, pages 183–192.

Taku Kudo and Yuji Matsumoto. 2000. Japanese dependency structure analysis based on support vector machines. In Proc. of the EMNLP/VLC-2000, pages 18–25.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proc. of the 2007 Joint Conference on EMNLP-CoNLL, pages 122–131.

Yusuke Miyao and Jun’ichi Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics, 34:35–80.

Carl Jesse Pollard and Ivan A. Sag. 1994. Head-driven phrase structure grammar. The University of Chicago Press and CSLI Publications.

Chris Quirk and Simon Corston-Oliver. 2006. The impact of parse quality on syntactically-informed statistical machine translation. In Proc. of EMNLP, pages 62–69.

Yiou Wang, Junichi Kazama, Yoshimasa Tsuruoka, Wenliang Chen, Yujie Zhang, and Kentaro Torisawa. 2011. Improving Chinese word segmentation and POS tagging with semi-supervised methods using large auto-analyzed data. In Proc. of 5th IJCNLP, pages 309–317.

Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proc. of COLING.

Peng Xu, Jaeho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In Proc. of NAACL, pages 245–253.

Kun Yu, Yusuke Miyao, Takuya Matsuzaki, Xiangli Wang, and Junichi Tsujii. 2011. Analysis of the difficulties in Chinese deep parsing. In Proc. of the 12th International Conference on Parsing Technologies, pages 48–57.
