Post on 04-Jan-2017
Transliteration Extraction from Classical Chinese Buddhist Literature Using Conditional Random Fields
Yu-Chun Wang
Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
Telecommunication Laboratories, Chunghwa Telecom, Taiwan
d97023@csie.ntu.edu.tw

Richard Tzong-Han Tsai
Department of Computer Science and Information Engineering, National Central University, Zhongli City, Taiwan
thtsai@csie.ncu.edu.tw
Extracting plausible transliterations from historical literature is a key issue in historical linguistics and other research fields. In Chinese historical literature, the characters used to transliterate the same loanword may vary because of different translation eras or different Chinese language preferences among translators. To assist researchers in historical linguistics and the digital humanities, this paper proposes a transliteration extraction method based on conditional random fields, with features derived from the characteristics of the Chinese characters used in transliterations, which are well suited to identifying transliteration characters. To evaluate our method, we compiled an evaluation set from two Buddhist texts, the Samyuktagama and the Lotus Sutra. We also constructed a baseline approach that combines suffix-array-based extraction with phonetic similarity measurement. Our method substantially outperforms the baseline, achieving a recall of 0.9561 and a precision of 0.9444. The results show that our method is very effective at extracting transliterations from classical Chinese texts.
Cognates and loanwords play important roles in research on language origins and cultural interchange. Therefore, extracting plausible cognates or loanwords from historical literature is a key issue in historical linguistics. Loanwords are usually adopted from other languages through transliteration. In Chinese historical literature, the characters used to transliterate the same loanword may vary because of different translation eras or different Chinese language or dialect preferences among translators. For example, the translation of Buddhist scriptures from Sanskrit into classical Chinese occurred mainly from the 1st century to the 10th century. In these works, the same Sanskrit word may be transliterated into different Chinese loanword forms. For instance, the surname of the Buddha, Gautama, is transliterated into several different forms, romanized as qu-tan or qiao-da-mo, and the name Culapanthaka has several different Chinese transliterations, romanized as zhu-li-pan-te and zhou-li-pan-tuo-qie. To assist researchers in historical linguistics and other digital humanities fields, an approach to extracting transliterations from classical Chinese texts is necessary.
Many transliteration extraction methods require a bilingual parallel corpus or text documents containing two languages. For example, Sherif and Kondrak (2007) proposed a method for learning a string distance measurement function from a sentence-aligned English-Arabic parallel corpus to extract transliteration pairs. Kuo et al. (2007) proposed a transliteration pair extraction method using a phonetic similarity model. Their approach is based on the general rule that when a new English term is transliterated into Chinese (in modern Chinese texts, e.g. newswire), the English source term usually appears alongside the transliteration. To exploit this pattern, they identify all the English terms in a Chinese text and measure the phonetic similarity between those English terms and their surrounding Chinese terms, treating the pairs with the highest similarity as the true transliteration pairs. Despite its high accuracy, this approach cannot be applied to transliteration extraction in classical Chinese literature, since its prerequisite (that the source term appears alongside the transliteration) does not hold there.
Copyright 2013 by Yu-Chun Wang and Richard Tzong-Han Tsai
27th Pacific Asia Conference on Language, Information, and Computation, pages 260-266
Some researchers have tried to extract transliterations from a single-language corpus. Oh and Choi (2003) proposed a Korean transliteration identification method using a Hidden Markov Model (HMM) (Rabiner, 1989). They transformed the transliteration identification problem into a sequential tagging problem in which each Korean syllable block in a Korean sentence is tagged as either belonging to a transliteration or not. They compiled a human-tagged Korean corpus to train a hidden Markov model with predefined phonetic features, which then extracts transliteration terms from sentences by sequential tagging. Goldberg and Elhadad (2008) proposed an unsupervised Hebrew transliteration extraction method. They adopted an English-Hebrew phoneme mapping table to convert the English terms in a named entity lexicon into all the possible Hebrew transliteration forms. The Hebrew transliterations are then used to train a Hebrew transliteration identification model. However, Korean and Hebrew use alphabetic writing systems, while the Chinese writing system is ideographic. These identification methods depend heavily on the phonetic characteristics of the writing system. Since Chinese characters do not necessarily reflect actual pronunciations, these methods are difficult to apply to the transliteration extraction problem in classical Chinese.
This paper proposes an approach to extracting transliterations automatically from classical Chinese texts, especially Buddhist scriptures, using supervised learning models based on the probability of the characters used in transliterations and on language model features of Chinese characters.
To extract the transliterations from the classical Chinese Buddhist scriptures, we adopt a supervised learning method, the conditional random fields (CRF) model. The features we use in the CRF model are described in the following subsections.
2.1 Probability of each Chinese character in transliterations
According to our observation, the Chinese characters chosen for transliteration in classical Chinese Buddhist texts show some shared characteristics. Translators tended to choose characters that do not affect the comprehension of the sentences. The number of Chinese characters is huge, but the number of possible syllables in Chinese is limited. Therefore, one Chinese character may share its pronunciation with several other characters, and translators could choose rarely used characters for transliteration.
From our observation, the probability that a Chinese character is used in transliteration is an important feature for identifying transliterations in classical Buddhist texts. In order to measure this probability for every character, we collect the frequency of all the Chinese characters in the Chinese Buddhist Canon. Furthermore, we apply the suffix array method (Manzini and Ferragina, 2004) to extract terms with their counts from all the texts of the Chinese Buddhist Canon. Next, the extracted terms are filtered against a list of selected transliteration terms from the Buddhist Translation Lexicon and Ding Fubao's Dictionary of Buddhist Studies. The extracted terms that appear in the list are retained, and the frequency of each Chinese character in them is calculated. Thus, the probability of a given Chinese character c in transliteration can be defined as:
$$\mathrm{Prob}(c) = \log \frac{\mathrm{freq}_{\mathrm{trans}}(c)}{\mathrm{freq}_{\mathrm{all}}(c)}$$
where $\mathrm{freq}_{\mathrm{trans}}(c)$ is the frequency of $c$ in transliterations, and $\mathrm{freq}_{\mathrm{all}}(c)$ is the frequency of $c$ in the whole Chinese Buddhist Canon. The logarithm in the formula is designed for CRF discrete feature values.
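The character feature above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`char_counts`, `translit_log_prob`, `discretize`), the `floor` value for unseen characters, the bucketing width, and all counts are assumptions, and pinyin syllables stand in for Chinese characters.

```python
import math
from collections import Counter

def char_counts(term_counts):
    """Sum per-character frequencies from a {term: count} mapping."""
    counts = Counter()
    for term, n in term_counts.items():
        for ch in term:
            counts[ch] += n
    return counts

def translit_log_prob(ch, freq_trans, freq_all, floor=1e-6):
    """Return log(freq_trans(c) / freq_all(c)).

    The ratio is floored (an assumption) so that characters never seen
    in a transliteration do not produce log(0).
    """
    total = freq_all.get(ch, 0)
    ratio = freq_trans.get(ch, 0) / total if total else 0.0
    return math.log(max(ratio, floor))

def discretize(value, width=1.0):
    """Bucket a log value into an integer bin (assumed discretization)."""
    return math.floor(value / width)

# Hypothetical toy data: terms with counts, as a suffix-array pass might yield.
extracted = {
    ("jiao", "sa", "luo"): 3,
    ("qu", "tan"): 2,
    ("yu",): 10,
    ("guo",): 5,
}
# Hypothetical lexicon of known transliteration terms used as the filter.
lexicon = {("jiao", "sa", "luo"), ("qu", "tan")}

# Keep only the extracted terms that appear in the lexicon.
retained = {term: n for term, n in extracted.items() if term in lexicon}
freq_trans = char_counts(retained)

# Hypothetical whole-corpus character frequencies.
freq_all = Counter({"jiao": 4, "sa": 6, "luo": 6, "qu": 2,
                    "tan": 4, "yu": 50, "guo": 40})

for ch in ("qu", "sa", "yu"):
    value = translit_log_prob(ch, freq_trans, freq_all)
    print(ch, round(value, 3), discretize(value))
```

In this toy setting, a character that occurs only inside known transliterations gets a log ratio of 0, while a common function character such as yu falls to the floor value, which is the contrast the feature is meant to expose.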
2.2 Language model of the transliteration
Transliterations may appear many times in one Buddhist sutra, and the characters preceding and following a transliteration may differ between occurrences. For example, in the phrase yu-jiao-sa-luo-guo ("in Kosala state"), if we want to separate the actual transliteration jiao-sa-luo ("Kosala") from the extra characters yu ("in") and guo ("state"), we need an effective feature for identifying the boundaries of the transliteration.
In order to identify the boundaries of transliterations, we propose a language-model-based feature. A language model assigns a probability $P(w_1, w_2, \ldots, w_m)$ to a sequence of $m$ words by means of a probability distribution. By the chain rule, the probability of a sequence of $m$ words can be decomposed into a product of conditional probabilities:

$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, w_2, \ldots, w_{m-1}) = \prod_{i=1}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$
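In practice, we can assume that the probability of a word depends only on its previous word (the bi-gram assumption). Therefore, the probability of a sequence can be approximated as:

$$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})$$

where the factor for $i = 1$ is simply $P(w_1)$.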
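The bi-gram approximation can be illustrated with a small sketch. It assumes maximum-likelihood estimates without smoothing, which may differ from the estimator actually used; the function names and the toy term list are hypothetical, and pinyin syllables again stand in for Chinese characters.

```python
from collections import Counter

def train_bigram(terms):
    """Count unigrams, bigram contexts, and bigrams over a list of terms."""
    unigrams, contexts, bigrams = Counter(), Counter(), Counter()
    for term in terms:
        unigrams.update(term)
        for prev, cur in zip(term, term[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1  # times prev is followed by another character
    return unigrams, contexts, bigrams

def p_unigram(w, model):
    """Maximum-likelihood estimate of P(w)."""
    unigrams, _, _ = model
    total = sum(unigrams.values())
    return unigrams[w] / total if total else 0.0

def p_cond(w, prev, model):
    """Maximum-likelihood estimate of P(w | prev), unsmoothed (assumption)."""
    _, contexts, bigrams = model
    return bigrams[(prev, w)] / contexts[prev] if contexts[prev] else 0.0

def sequence_prob(seq, model):
    """P(w_1) times the product of P(w_i | w_{i-1}) for i = 2..m."""
    if not seq:
        return 1.0
    prob = p_unigram(seq[0], model)
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_cond(cur, prev, model)
    return prob

# Hypothetical training terms standing in for the transliteration dataset.
terms = [
    ("jiao", "sa", "luo"),
    ("qu", "tan"),
    ("qiao", "da", "mo"),
    ("zhu", "li", "pan", "te"),
    ("zhou", "li", "pan", "tuo", "qie"),
]
model = train_bigram(terms)

print(sequence_prob(("jiao", "sa", "luo"), model))
print(sequence_prob(("zhu", "li", "pan", "te"), model))
print(sequence_prob(("yu", "jiao"), model))
```

Because the model is trained only on transliteration terms, a sequence that starts with a character outside that vocabulary (yu here) receives probability zero, whereas sequences made of transliteration characters keep a non-zero probability.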
We collect person and location names from the Buddhist Authority Database¹ and the known Buddhist transliteration terms from The Buddhist Translation Lexicon² to create a dataset with 4,301 transliterations for our bi-gram language model.
After building the bi-gram language model, we apply it as a feature for the supervised model. Following the previous example, yu-jiao-sa-luo-guo ("in Kosala state"), for each character in the sentence we first compute the probability of the current character and its previous character. For the first character, yu, since there is no previous character, the probability is $P(\text{yu})$. For the second character, jiao, the probability of the two characters is $P(\text{yu}\,\text{jiao}) = P(\text{yu})\,P(\text{jiao} \mid \text{yu})$. We then compute the probability of the second and third characters: $P(\text{jiao}\,\text{sa}) = P(\text{jiao})\,P(\text{sa} \mid \text{jiao})$, and so on. If the probability changes sharply from that of the previous bi-gram, the previous bi-gram ma