
  • Transliteration Extraction from Classical Chinese Buddhist Literature Using Conditional Random Fields

    Yu-Chun Wang
    Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
    Telecommunication Laboratories, Chunghwa Telecom, Taiwan
    d97023@csie.ntu.edu.tw

    Richard Tzong-Han Tsai
    Department of Computer Science and Information Engineering, National Central University, Zhongli City, Taiwan
    thtsai@csie.ncu.edu.tw

    Abstract

    Extracting plausible transliterations from historical literature is a key issue in historical linguistics and other research fields. In Chinese historical literature, the characters used to transliterate the same loanword may vary because of different translation eras or different Chinese language preferences among translators. To assist researchers in historical linguistics and the digital humanities, this paper proposes a transliteration extraction method based on conditional random fields, with features derived from the characteristics of the Chinese characters used in transliterations, which are well suited to identifying transliteration characters. To evaluate our method, we compiled an evaluation set from two Buddhist texts, the Samyuktagama and the Lotus Sutra. We also construct a baseline approach that combines suffix-array-based extraction with phonetic similarity measurement. Our method substantially outperforms the baseline, achieving a recall of 0.9561 and a precision of 0.9444. The results show that our method is very effective at extracting transliterations from classical Chinese texts.

    1 Introduction

    Cognates and loanwords play important roles in research on language origins and cultural interchange. Therefore, extracting plausible cognates or loanwords from historical literature is a key issue in historical linguistics. Loanwords are usually adopted from other languages through transliteration. In Chinese historical literature, the characters used to transliterate the same loanword may vary because of different translation eras or different Chinese language/dialect preferences among translators. For example, the translation of Buddhist scriptures from Sanskrit into classical Chinese took place mainly from the 1st century to the 10th century. In these works, the same Sanskrit word may be transliterated into different Chinese loanword forms. For instance, the surname of the Buddha, Gautama, is transliterated into several different forms, romanized as qu-tan and qiao-da-mo, and the name Culapanthaka has several different Chinese transliterations, romanized as zhu-li-pan-te and zhou-li-pan-tuo-qie. To assist researchers in historical linguistics and other digital humanities fields, an approach to extracting transliterations from classical Chinese texts is necessary.

    Many transliteration extraction methods require a bilingual parallel corpus or text documents containing two languages. For example, Sherif and Kondrak (2007) proposed a method for learning a string distance measurement function from a sentence-aligned English-Arabic parallel corpus to extract transliteration pairs. Kuo et al. (2007) proposed a transliteration pair extraction method using a phonetic similarity model. Their approach is based on the general rule that when a new English term is transliterated into Chinese (in modern Chinese texts, e.g. newswire), the English source term usually appears alongside the transliteration. To exploit this pattern, they identify all the English terms in a Chinese text and measure the phonetic similarity between those English terms and their surrounding Chinese terms, treating the pairs with the highest similarity as the true transliteration pairs. Despite its high accuracy, this approach cannot be applied to transliteration extraction in classical Chinese literature, since its prerequisite (that the source term appears alongside the transliteration) does not hold.

    Copyright 2013 by Yu-Chun Wang and Richard Tzong-Han Tsai. 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC-27), pages 260-266.

  • Some researchers have tried to extract transliterations from a single-language corpus. Oh and Choi (2003) proposed a Korean transliteration identification method using a Hidden Markov Model (HMM) (Rabiner, 1989). They transformed the transliteration identification problem into a sequential tagging problem in which each Korean syllable block in a Korean sentence is tagged as either belonging to a transliteration or not. They compiled a human-tagged Korean corpus to train a hidden Markov model with predefined phonetic features to extract transliteration terms from sentences by sequential tagging. Goldberg and Elhadad (2008) proposed an unsupervised Hebrew transliteration extraction method. They adopted an English-Hebrew phoneme mapping table to convert the English terms in a named entity lexicon into all the possible Hebrew transliteration forms. The Hebrew transliterations are then used to train a Hebrew transliteration identification model. However, Korean and Hebrew use alphabetic writing systems, while Chinese is ideographic. These identification methods depend heavily on the phonetic characteristics of the writing system. Since Chinese characters do not necessarily reflect actual pronunciations, these methods are difficult to apply to the transliteration extraction problem in classical Chinese.

    This paper proposes an approach to automatically extracting transliterations from classical Chinese texts, especially Buddhist scriptures, with supervised learning models based on the probability of the characters used in transliterations and the language model features of Chinese characters.

    2 Method

    To extract transliterations from classical Chinese Buddhist scriptures, we adopt a supervised learning method, the conditional random field (CRF) model. The features we use in the CRF model are described in the following subsections.

    2.1 Probability of each Chinese character in transliterations

    According to our observation, the Chinese characters chosen for transliteration in classical Chinese Buddhist texts show some common characteristics. Translators tended to choose characters that do not affect the comprehension of the sentences. The number of Chinese characters is huge, but the number of possible syllables in Chinese is limited. Therefore, one Chinese character may share the same pronunciation with several other characters. Hence, translators may have tried to choose rarely used characters for transliteration.

    From our observation, the probability that a Chinese character is used in transliteration is an important feature for identifying transliterations in classical Buddhist texts. In order to measure this probability for every character, we collect the frequency of all the Chinese characters in the Chinese Buddhist Canon. Furthermore, we apply the suffix array method (Manzini and Ferragina, 2004) to extract terms with their counts from all the texts of the Chinese Buddhist Canon. Next, the extracted terms are filtered against a list of selected transliteration terms from the Buddhist Translation Lexicon and Ding Fubao's Dictionary of Buddhist Studies. Only the extracted terms that appear in the list are retained, and from them the frequency of each Chinese character in transliterations can be calculated. Thus, the probability of a given Chinese character c in transliteration can be defined as:

    $$\mathrm{Prob}(c) = \log \frac{\mathrm{freq}_{trans}(c)}{\mathrm{freq}_{all}(c)}$$

    where $\mathrm{freq}_{trans}(c)$ is the frequency with which $c$ is used in transliterations, and $\mathrm{freq}_{all}(c)$ is the frequency with which $c$ appears in the whole Chinese Buddhist Canon. The logarithm in the formula is designed for CRF discrete feature values.
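As an illustration of the formula above, here is a minimal Python sketch; it is not the authors' implementation. It assumes the filtered transliteration terms are already available with their counts. Romanized syllables stand in for Chinese characters, and the `floor` smoothing value and the `discretize` bin width are hypothetical choices, since the text states only that the logarithm serves CRF discrete feature values.

```python
import math
from collections import Counter


def transliteration_log_prob(translit_terms, corpus_tokens, floor=1e-6):
    """Return {token: log(freq_trans / freq_all)} for every token in the corpus.

    translit_terms: dict mapping a term (sequence of tokens) to its corpus count
    corpus_tokens:  iterable of all tokens (characters) in the corpus
    floor:          hypothetical lower bound so unseen characters avoid log(0)
    """
    freq_all = Counter(corpus_tokens)
    freq_trans = Counter()
    for term, count in translit_terms.items():
        for tok in term:
            freq_trans[tok] += count
    scores = {}
    for tok, total in freq_all.items():
        ratio = max(freq_trans[tok] / total, floor)
        scores[tok] = math.log(ratio)
    return scores


def discretize(score, step=1.0):
    """Map a log score to an integer bin (assumed binning for a discrete CRF feature)."""
    return int(math.floor(score / step))


if __name__ == "__main__":
    # Toy data: romanized syllables stand in for Chinese characters.
    corpus = ["yu", "jiao", "sa", "luo", "guo", "yu", "jiao", "sa", "luo", "guo"]
    terms = {("jiao", "sa", "luo"): 2}
    for tok, score in transliteration_log_prob(terms, corpus).items():
        print(tok, discretize(score))
```

Characters that occur almost only inside transliterations score near zero, while common characters that rarely occur in them receive strongly negative scores.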

    2.2 Language model of the transliteration

    A transliteration may appear many times in one Buddhist sutra, and the characters that precede and follow it may differ from one occurrence to the next. For example, in the phrase yu-jiao-sa-luo-guo ("in Kosala state"), to separate the actual transliteration jiao-sa-luo (Kosala) from the extra characters yu ("in") and guo ("state"), we must first use an effective feature to identify the boundaries of the transliteration.

    In order to identify the boundaries of transliterations, we propose a language-model-based feature. A language model assigns a probability to a sequence of $m$ words, $P(w_1, w_2, \ldots, w_m)$, by means of a probability distribution. The probability of a sequence of $m$ words can be transformed into a product of conditional probabilities:

    $$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, w_2, \ldots, w_{m-1}) = \prod_{i=1}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$

    In practice, we can assume that the probability of a word depends only on the previous word (the bi-gram assumption). Therefore, the probability of a sequence can be approximated as:

    $$P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})$$

    We collect person and location names from the Buddhist Authority Database [1] and the known Buddhist transliteration terms from The Buddhist Translation Lexicon [2] to create a dataset of 4,301 transliterations for our bi-gram language model.

    After building the bi-gram language model, we apply it as a feature for the supervised model. Following the previous example, yu-jiao-sa-luo-guo ("in Kosala state"), for each character in the sentence we first compute the probability of the current character together with its previous character. For the first character, yu, since there is no previous character, the probability is $P(\text{yu})$. For the second character, jiao, the probability of the two characters is $P(\text{yu jiao}) = P(\text{yu})\,P(\text{jiao} \mid \text{yu})$. We then compute the probability of the second and third characters, $P(\text{jiao sa}) = P(\text{jiao})\,P(\text{sa} \mid \text{jiao})$, and so on. If the probability changes sharply from that of the previous bi-gram, the previous bi-gram may be the boundary of the transliteration. Because the character yu rarely appears in transliterations, $P(\text{yu jiao})$ is much lower than $P(\text{jiao sa})$. We may conclude that the left boundary lies between the first two characters, yu and jiao.
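The boundary heuristic described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the add-alpha smoothing, the `alpha` and `vocab_size` values, and the toy training terms are all hypothetical, and romanized syllables stand in for Chinese characters.

```python
from collections import Counter


def train_bigram(terms, alpha=0.1, vocab_size=1000):
    """Estimate smoothed unigram and bi-gram probabilities from transliteration terms.

    terms: list of token sequences (known transliterations)
    alpha, vocab_size: assumed add-alpha smoothing parameters
    """
    uni, bi = Counter(), Counter()
    for term in terms:
        uni.update(term)
        bi.update(zip(term, term[1:]))
    total = sum(uni.values())

    def p_uni(c):
        return (uni[c] + alpha) / (total + alpha * vocab_size)

    def p_cond(c, prev):
        return (bi[(prev, c)] + alpha) / (uni[prev] + alpha * vocab_size)

    return p_uni, p_cond


def bigram_scores(sentence, p_uni, p_cond):
    """Score each position: P(c_0) for the first character,
    otherwise P(c_{i-1}) * P(c_i | c_{i-1})."""
    if not sentence:
        return []
    scores = [p_uni(sentence[0])]
    for prev, cur in zip(sentence, sentence[1:]):
        scores.append(p_uni(prev) * p_cond(cur, prev))
    return scores


if __name__ == "__main__":
    # Toy training terms and test sentence.
    terms = [("jiao", "sa", "luo")] * 3 + [("ju", "sa", "luo")]
    p_uni, p_cond = train_bigram(terms)
    sentence = ["yu", "jiao", "sa", "luo", "guo"]
    for tok, s in zip(sentence, bigram_scores(sentence, p_uni, p_cond)):
        print(f"{tok}\t{s:.2e}")
```

On this toy data the score rises sharply between the bi-grams (yu, jiao) and (jiao, sa) and drops again at (luo, guo), which is the kind of change the feature is meant to expose to the CRF.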

    2.3 Functional Words

    We take classical Chinese functional words into consideration. These characters have special grammatical functions in classical Chinese; thus, they are seldom used to transliterate foreign names. This is a binary feature that records whether or not the character is a functional word. The functional words, given here by romanization, are as follows: zhi, hu, qie, yi, ye, yu, zai, xiang, sui, jie, yu, and yi (yu and yi each appear twice because two different characters share the romanization).

    [1] http://authority.ddbc.edu.tw/
    [2] http://www.cbeta.org/result/T54/T54n2131.htm

    2.4 Appellation and Quantifier Words

    After observing the transliterations appearing in classical Chinese literature, we note that the characters that follow transliteration terms show some specific patterns. Most of the characters following a transliteration are appellation or quantifier words, such as shan (mountain), hai (sea), guo (state), and zhou (continent). For example, there are cases like qi-du-jui-shan (Vulture mountain), ju-sa-luo-guo (Kosala state), and zhan-bu-zhou (Jambu continent). Therefore, we collect the Chinese characters that are commonly used as appellations or quantifiers following transliterations and design this feature accordingly. This is also a binary feature, recording whether or not the character is used as an appellation or quantifier word.
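The two binary features of Sections 2.3 and 2.4 amount to set-membership tests. A minimal Python sketch, assuming romanized stand-ins for the characters and using only the examples listed in the text (the actual appellation and quantifier list collected by the authors is not given here):

```python
# Romanized stand-ins; a real system would store the Chinese characters themselves.
FUNCTION_WORDS = {"zhi", "hu", "qie", "yi", "ye", "yu", "zai", "xiang", "sui", "jie"}
# Illustrative subset taken from the examples in the text ("shan" is the
# standard pinyin for "mountain").
APPELLATION_WORDS = {"shan", "hai", "guo", "zhou"}


def binary_features(sentence):
    """Return (token, is_function_word, is_appellation_word) for each token."""
    return [
        (tok, int(tok in FUNCTION_WORDS), int(tok in APPELLATION_WORDS))
        for tok in sentence
    ]


if __name__ == "__main__":
    for row in binary_features(["yu", "ju", "sa", "luo", "guo"]):
        print(row)
```

Each tuple could then be written out as extra feature columns in the CRF training rows.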

    2.5 CRF Model Training

    We adopt a supervised learning model, the conditional random field (CRF) (Lafferty et al., 2011), to extract the transliterations in classical Buddhist texts. For the CRF model, we formulate the transliteration extraction problem as a sequential tagging problem.

    2.5.1 Conditional Random Fields

    Conditional random fields (CRFs) are undirected graphical models trained to maximize a conditional probability (Lafferty et al., 2011). A linear-chain CRF with parameters $\Lambda = \{\lambda_1, \lambda_2, \ldots\}$ defines the conditional probability of a state sequence $\mathbf{y} = y_1 \ldots y_T$, given an input sequence $\mathbf{x} = x_1 \ldots x_T$, as

    $$P_\Lambda(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)$$

    where $Z_{\mathbf{x}}$ is the normalization factor that makes the probabilities of all state sequences sum to one; $f_k(y_{t-1}, y_t, \mathbf{x}, t)$ is often a binary-valued feature function and $\lambda_k$ is its weight. The feature functions can measure any aspect of a state transition, $y_{t-1} \rightarrow y_t$, and the entire observation sequence, $\mathbf{x}$, centered at the current time step, $t$. For example, one feature function might have the value 1 when $y_{t-1}$ is the state B, $y_t$ is the state I, and $x_t$ is the character guo. Large positive values for $\lambda_k$ indicate a preference for such an event; large negative values make the event unlikely.

    The most probable label sequence for $\mathbf{x}$,

    $$\mathbf{y}^{*} = \arg\max_{\mathbf{y}} P_\Lambda(\mathbf{y} \mid \mathbf{x}),$$

    can be efficiently determined using the Viterbi algorithm.
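To make the decoding step concrete, the following is a small Python sketch of Viterbi decoding for a linear-chain model. Because the normalization factor does not depend on the label sequence, maximizing the conditional probability is equivalent to maximizing the summed weighted feature scores. The `toy_score` function and its weights are hypothetical stand-ins for a trained sum of weighted feature functions, and romanized syllables stand in for Chinese characters.

```python
def viterbi(n_steps, states, score):
    """Return the highest-scoring state sequence of length n_steps.

    score(prev_state, state, t) -> float is the summed weighted feature score
    at position t; prev_state is None at t == 0.
    """
    if n_steps == 0:
        return []
    best = [{s: score(None, s, 0) for s in states}]
    back = []
    for t in range(1, n_steps):
        cur, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + score(p, s, t))
            cur[s] = best[-1][prev] + score(prev, s, t)
            ptr[s] = prev
        best.append(cur)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy example with hand-set (not learned) weights.
sentence = ["yu", "jiao", "sa", "luo", "guo"]
TRANSLIT_TOKENS = {"jiao", "sa", "luo"}


def toy_score(prev, cur, t):
    in_translit = sentence[t] in TRANSLIT_TOKENS
    s = 1.0 if in_translit == (cur in ("B", "I")) else -1.0
    if cur == "I" and prev in (None, "O"):
        s -= 5.0  # I must follow B or I
    if cur == "B" and prev in ("B", "I"):
        s -= 2.0  # discourage starting a new term inside a term
    return s


if __name__ == "__main__":
    print(viterbi(len(sentence), ["B", "I", "O"], toy_score))
```

The dynamic program keeps, for each position and state, only the best-scoring path ending there, so decoding costs time proportional to the sequence length times the square of the number of states.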

    2.5.2 Sequential Tagging and Feature Template

    The classical Buddhist texts are split into sentences at Chinese punctuation marks. Each character in a sentence is then taken as a data row for the CRF model. We adopt a tagging approach motivated by Chinese word segmentation (Tsai et al., 2006), which treats segmentation as a tagging problem. A character is tagged B if it is the first character of a transliteration word, I if it is inside a transliteration word but is not the first character, and O if it does not belong to a transliteration word. We adopt the open-source CRF++ toolkit [3]. We train our CRF models with unigram and bigram features over the input Chinese character sequences. The features are as follows.

    Unigram: $s_{-2}, s_{-1}, s_0, s_1, s_2$

    Bigram: $s_{-1}s_0, s_0s_1$

    where $s_0$ is the current character and $s_i$ is the character at offset $i$ relative to the current character.
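The tagging scheme and feature template can be sketched as below. This is an assumption-laden illustration rather than the authors' configuration: the span format is hypothetical, the rows carry only the character column (the features of Section 2 would be added as further columns before the tag), and the template reads the paper's "bigram" features as character bigrams written in CRF++ unigram-template syntax.

```python
def bio_tags(sentence, spans):
    """Tag each token B, I, or O.

    spans: list of (start, end) half-open index pairs marking transliterations
    (a hypothetical annotation format).
    """
    tags = ["O"] * len(sentence)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def crfpp_rows(sentence, tags):
    """One 'token<TAB>tag' row per character; a blank line ends the sentence."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in zip(sentence, tags)) + "\n\n"


# CRF++ template: %x[row,col] refers to the token at a relative row offset.
TEMPLATE = """\
# Unigram: s-2, s-1, s0, s1, s2
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# Bigram: s-1 s0, s0 s1
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
"""

if __name__ == "__main__":
    sent = ["yu", "jiao", "sa", "luo", "guo"]
    print(crfpp_rows(sent, bio_tags(sent, [(1, 4)])), end="")
    print(TEMPLATE, end="")
```

The rows and template would be saved to files and passed to the CRF++ `crf_learn` and `crf_test` commands for training and tagging.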

    3 Evaluation

    3.1 Data set

    We choose one Buddhist scripture from the Chinese Buddhist Canon maintained by the Chinese Buddhist Electronic Text Association (CBETA) as our data set for evaluation. The scripture we choose to compile the training and test sets is the Samyuktagama. The Samyuktagama is one of the most important scriptures in Early Buddhism and contains many transliterations because it records in detail the speech and the lives of the Buddha and many of his disciples.

    [3] http://crfpp.googlecode.com

    The Samyuktagama is an early Buddhist scripture collected shortly after the Buddha's death. The term agama in Buddhism refers to a collection of discourses, and the name Samyuktagama means "connected discourses". It is among the most important sutras in Early Buddhism. The Samyuktagama is traditionally regarded as the earliest sutra collection, compiled by Mahakasyapa, the Buddha's disciple, and five hundred Arhats three months after the Buddha's death. An Indian monk, Gunabhadra, translated this sutra into classical Chinese in the Liu Song dynasty around 443 C.E. The classical Chinese Samyuktagama has 50 volumes containing about 660,000 characters. Because the Samyuktagama is so large, we take the first 20 volumes as the training set and the last 10 volumes as the test set.

    In addition, we want to evaluate whether a supervised learning model trained on one Buddhist scripture can be applied to another Buddhist scripture translated in differe...