transliteration pair extraction from classical chinese buddhist literature using phonetic similarity...

New Generation Computing, 31(2013)265-283Ohmsha, Ltd. and Springer

Transliteration Pair Extraction from Classical ChineseBuddhist Literature Using Phonetic Similarity Measure-ment

Yu-Chun WANGDepartment of Computer Science and Information Engineering,National Taiwan University, TaiwanTelecommunication Laboratories, Chunghwa Telecom, TAIWANChun-Kai WUDepartment of Computer Science and Engineering,Yuan Ze University, TAIWANRichard Tzong-Han TSAI∗Department of Computer Science and Information Engineering,National Central University, TaiwanJieh HSIANGDepartment of Computer Science and Information Engineering,National Taiwan University, [email protected], [email protected],[email protected], [email protected]∗corresponding author

Received 29 March, 2013Revised manuscript received 12 July 2013

Abstract Transliteration pair extraction, the identification of translit-erations of foreign loanwords in literature, is a challenging task in researchfields such as historical linguistics and digital humanities. In this paper, wefocus on one important type of historical literature: classical Chinese Bud-dhist texts. We propose an approach which can identify transliteration pairsautomatically in classical Chinese texts. Our approach comprises two stages:transliteration extraction and transliteration pair identification. In order toextract more possible transliterations without introducing too many false pos-itives, we adopt a hybrid method consisting of a suffix-array-based extractionstep and a language-model based filtering process. Using the ALINE algo-rithm, we then compare the extracted transliteration candidates for phoneticsimilarity based on their pronunciations in the middle Chinese rime bookGuangyun ( ). Pairs with similarity above a certain threshold are consid-

266 Y.-C. Wang, C.-K. Wu, R. T.-H. Tsai, J. Hsiang

ered transliteration pairs. To evaluate our method, we constructed an eval-uation set from several Buddhist texts such as the Samyuktagama and theMahavibhasa, which were translated into Chinese in different eras. Precisionand recall are used to measure and show the effectiveness of our method.

Keywords: Transliteartion Pair Extraction, Phonetic Similarity, ClassicalChinese Processing.

§1 IntroductionCognates and loanwords play important roles in the research of language

origins and cultural interchange. Therefore, extracting plausible cognate or loan-word pairs from historical literature is a key issues in historical linguistics. Theadoption of loanwords from other languages is usually through transliteration.In Chinese historical literature, the characters used to transliterate the sameloanword may vary because of different translation eras or different Chinese lan-guage/dialect preferences among translators. For example, in classical ChineseBuddhist scriptures, the translation process of Buddhist scriptures from Sanskritto classical Chinese occurred mainly from the 1st century to 10th century. Inthese works, the same Sanskrit words may transliterate into different Chineseloanword forms. For instance, the surname of the Buddha, Gautama, is translit-erated into several different forms such as “ ” (qu-tan) or “ ” (qiao-da-mo), and the name “Culapanthaka” has several different Chinese transliterationssuch as “ ” (zhu-li-pan-te) and “ ” (zhou-li-pan-tuo-qie). Al-though there are dictionaries for historians and Buddhist scholars to understandthe relationships among different loanwords, it is difficult for any dictionary tocover all possible loanwords due to the vast amount of classical Chinese litera-ture.

In order to assist researchers in historical linguistics and other digital hu-manity research fields, this paper proposes an approach to identify transliterationpairs automatically in classical Chinese texts.

§2 Related WorkBefore identifying loanwords, we toned to first extract plausible loanword

terms from classical Chinese literature texts. Shieh et al. (2011)1) proposedan appositional term extraction algorithm for extracting named entities fromclassical Chinese literature. Their method utilizes the appositional similarityof terms to extract potential candidates with similar appositional features inthe text. They describe “term clips” using five features (preamble, prefix, infix,suffix, and postamble) and adopt a semi-supervised procedure to learn theseappositional “term clips” from classical Chinese texts with a small initial numberof seed terms. Although their method can extract named entities with a highdegree of precision, it may not extract enough possible loanwords to achieve asufficient recall. Futhermore, for loanwords in classical Chinese, the classicalChinese texts usually do not have observable markers for loanwords. So features

Transliteration Pair Extraction from Classical Chinese Buddhist LiteratureUsing Phonetic Similarity Measurement

267

such as prefix, infix, and suffix may not be useful.Many transliteration extraction methods require a bilingual parallel cor-

pus or text documents containing two languages. For example, Sherif andKondrak (2007)2) proposed a method for learning the string distance measure-ment function from a sentence-aligned English-Arabic parallel corpus to extracttransliteration pairs. Kuo et al. (2007)3) proposed a transliteration pair extrac-tion method using a phonetic similarity model. Their approach is based on thegeneral rule that when a new English term is transliterated into Chinese (inmodern Chinese texts, e.g. newswire), the English source term usually appearsalongside the transliteration. To exploit this pattern, they identify all the En-glish terms in a Chinese text and measure the phonetic similarity between thoseEnglish terms and their surrounding Chinese terms, treating the pairs with thehighest similarity as the true transliteration pairs. Despite its high accuracy,this approach cannot be applied to transliteration extraction in classical Chi-nese since the prerequisite (of the source terms alongside the transliteration)does not apply.

Some researchers have tried to extract transliterations from a single lan-guage corpus. Oh and Choi (2003)4) proposed a Korean transliteration identi-fication method using a hidden Markov model. They transformed the translit-eration identification problem into a sequential tagging problem in which eachKorean syllable block in a Korean sentence is tagged as either belonging to atransliteration or not. They built a human-tagged Korean corpus to train a hid-den Markov model with predefined phonetic features to extract transliterationterms from sentences by sequential tagging. Goldberg and Elhadad (2008)5) pro-posed an unsupervised Hebrew transliteration extraction method. They adoptedan English-Hebrew phoneme mapping table to convert the English terms in anamed entity lexicon into all the possible Hebrew transliteration forms. TheHebrew transliterations are then used to train a Hebrew transliteration identi-fication model. However, Korean and Hebrew are alphabetical writing system,while Chinese is ideographic. These identification methods heavily depend onthe phonetic characteristics of the writing system. Since Chinese characters donot necessarily reflect actual pronunciations, these methods are difficult to applyto the transliteration extraction problem in classical Chinese.

Identifying loanwords which are mostly transliterations is a key step inthe construction of language families and language contacts. Methods employedin this step usually try to measure phonetic similarity among the potential can-didates. There are several approaches to extract possible cognates based onphonetic similarity. Covington6) proposed a cognate alignment algorithm thatestimates phonetic similarity by finding the minimum- cost alignment throughdepth-first search. The cost of each alignment was measured in terms of the costof substitution, insertion, or deletion of each segment. Kondrak7) proposed theALINE algorithm to measure the similarity between the phonemes based on thephonetic characteristics such as the place and the manner of articulation. Hedefined the salience and the value of each phonetic characteristic to measure thesimilarity of each phoneme pair. The actual similarity is the total sum of the


similarity score of the phonemes on the optimal alignment.In addition to the phonemic rule based methods, there are some that

are based on machine learning. Ristad and Yianilos10) adopted the expectationmaximization (EM) to learn the probability of each string edit action and createa stochastic transducer to output the string edit distance of the two phonemicsequences. Mackay and Kondrak11) proposed a method to measure phoneticsimilarity based on a pair hidden Markov model (Pair HMM) which is oftenused in bioinformatics. The Pair HMM can generate the two output streamsconcurrently representing the alignment results of the two sequences. The PairHMM has three states, which stand for the three based string edit actions:insertion, substitution, and deletion. They used the collected cognates as atraining set to learn the parameters of the HMM model. The machine learningmethods can improve the accuracy of word alignment and the phonetic similarity.However, the supervised learning method requires a labeled training set whichis labor-intensive and not easy to obtain in many languages. The method basedon the phonemic rules can be easily constructed with all the possible phonemicphenomena.

Identifying cognates in text automatically is similar to the task of extract-ing loanwords/transliterations. Both tasks try to identify similar forms basedon some phonetic, semantic or orthographical clues. Tiedemann8) extracted cog-nates from automatically aligned sentences in parallel corpora based on iterativesize reduction, string similarity, and co-occurrence measurement. Extracted re-sults were filtered automatically based on length, orthographical similarity, andco-occurrence frequency. To extract cognates from South-Slavonic languages,Nakov et. al9) proposed a five-step method, consisting of cognate extraction,word alignment, phrase extraction, statistical phrase filtering, and linguisticphrase filtering. They adopted IBM’s word alignment method for cognate ex-traction and word alignment of a sentence-aligned corpus. They then filteredthe results according to orthography and semantic similarity. According to oursurvey of the literature, there is no known work that identifies transliterationpairs in different texts written in the same language without comparable corporaand orthographical clues.

§3 MethodOur loanword pair identification system comprises three components: term

extraction, term filtering, and phonetic similarity estimation. An overview of oursystem is shown in Fig. 1.

3.1 Term ExtractionIn order to extract transliteration loanword from classical Chinese litera-

ture, we adopt the suffix array method12) to extract possible terms. Suffix arrayis a data structure designed for efficient searching of large bodies of text. Thedata structure is simply an array containing all indexes of the text suffixes sortedin lexicographical order. Each suffix is a string starting at a certain position inthe text and ending at the end of the text. Searching a text can be performed


269

Fig. 1 Overview of Our System

by binary search using the suffix array. The term extraction procedure of suffixarray is as follows: Given a string S with length n, for each character in thestring S, create indexes from 0 to n − 1, where S[i] represents the suffixes ofthe index i. Given S = “abracadabra,” the index result would appear, afterindexing, as follows:

Character a b r a c a d a b r aIndex 0 1 2 3 4 5 6 7 8 9 10

The string S has 12 suffixes. We can sort these suffixes in the lexicograph-ical order that is shown in the following table. The frequency is the count ofthis suffix, which appears as a prefix of other suffix strings.

Sorted Suffix Suffix S Frequencya S[10] 5abra S[7] 2abracadabra S[0] 1adabra S[5] 1acadabra S[3] 1bra S[8] 2bracadabra S[1] 1cadabra S[4] 1dabra S[6] 1ra S[9] 2racadabra S[2] 1

The suffixes whose frequency is greater than 1 are considered candidateterms. However, if a suffix is a substring of another suffix and its frequency isnot greater than that of the latter, the shorter suffix will be dropped. From theabove example, we can extract two candidate terms “a” and “abra.”


3.2 Term FilteringThe suffix array method can extract a lot of possible terms, although

most of them are not actual transliteration terms. Thus, we have to filter outfalse positives before measuring phonetic similarity to avoid wasting computa-tional resources and degrading the precision of pair identification. Sometimesthe suffix array method may extract a term composed of an actual transliter-ation and one or more other words. For example, for the extracted candidateterm “ ” (in Kosala state), if we want to separate out the actualtransliteration, “ ” (Kosala), from the extra characters“ ” (in) and “ ”(state), we must first segment the term.

In order to filter and extract transliteration terms, we propose a language-model-based filtering method. A language model assigns a probability to asequence of m words P (w1, w2, · · · , wm) by means of a probability distribution.The probability of a sequence of m words can be transformed into a conditionalprobability:

P (w1, w2, · · · , wm) = P (w1)P (w2|w1)P (w3|w1, w2) · · ·P (wm|w1, w2, · · · , wm−1)

=m∏

i=1

P (wi|w1, w2, · · · , wi−1)

In practice, we can assume the probability of a word only depends on itsprevious word (bi-gram assumption). Therefore, the probability of a sequencecan be approximated as:

P (w1, w2, · · · , wm) =m∏

i=1

P (wi|w1, w2, · · · , wi−1)

≈m∏

i=1

P (wi|wi−1)

We collect person and location names from the Buddhist Authority Data-base∗1 and the known Buddhist transliteration terms from The Buddhist Trans-lation Lexicon ( )∗2 to create a dataset with 4,301 transliterations forour bi-gram language model.

After building the bi-gram language model, we apply it to filter the termsfrom the suffix array method. We formulate transliteration term segmentationinto a boundary-checking problem. For each term from the suffix array, we checkwhere the left and right boundaries of the actual transliteration are based onthe probability of the bi-gram language model.

Continuing with the previous example, “ ” (in Kosala state),we first compute the probability of the first two characters: P ( ) = P ( )P ( | ). We then compute the probability of the second and third characters:P ( ) = P ( )P ( | ). We check the probability of each bi-gram from leftto right. If the probability changes sharply from that of the previous bi-gram,∗1 http://authority.ddbc.edu.tw/∗2 http://www.cbeta.org/result/T54/T54n2131.htm


271

the previous bi-gram may be the left boundary of the transliteration. Becausethe character “ ” rarely appears in transliterations, P ( ) is much lowerthan P ( ). We thus conclude that the left boundary is between the firsttwo characters “ .” A similar check is also applied to the right boundary bycomputing the probability of bi-grams from right to left. If the probability of allthe bi-grams is low, the term may not contain any transliterations and is filteredout. In practice, we only deal with the leftmost and rightmost three bi-gramsto check if the boundary exists because there are few terms which have morethan three characters on the left or right-hand side of an actual transliterationaccording to our observations.

3.3 Phonetic Similarity

[ 1 ] Middle Chinese pronunciationAfter extracting the set of loanword candidates, we measure the phonetic

similarity between every two candidates to get possible loanword pairs. Becausethe pronunciation of Chinese characters varies diachronically and synchronically,the same Chinese characters may have been pronounced differently in differentregions and eras of ancient China. Therefore, we cannot base our measurementsof the phonetic similarity of the loanword on modern Chinese pronunciation.However, Chinese characters are ideographic and do not necessarily reflect actualpronunciations. Thus, it is difficult to acquire the actual pronunciations in thepast. In seventh century, a new kind of pronunciation dictionary, Qieyun, waspublished by Lu Fayan in Sui dynasty, based on five earlier rime dictionariesthat are no longer extant. As a guide to the recitation of literary texts andan aid in the composition of verse, the Qieyun quickly became popular and thenational standard of pronunciation during the following Tang dynasty (618 – 907C. E.). Unfortunately, the actual content of the Qieyun did not last to modernera. In 1008, during the Northern Song Dynasty (960 – 1127 C. E.), a group ofscholars commissioned by the emperor produced an expanded revision of Qieyuncalled the Guangyun. Until the mid-20th century, the oldest complete rime bookknown was the Guangyun, though extant copies of the latter were marred bynumerous transcription errors. Thus all studies of the rime book tradition wereactually based on the Guangyun.

The period of the Buddhist literature translation from Sanskrit to classicalChinese is mainly from Tang dynasty to Song dynasty, which all belong to middleChinese era. Therefore, in order to obtain the approximate pronunciation ofChinese characters in middle Chinese, we use the rime book Guangyun, compiledduring the Northern Song dynasty, instead.

Since there is no phonological symbols or alphabetical writing systems inmiddle Chinese, rime books, such as Guangyun, record contemporary characterpronunciations with fanqie “ ” analyses. Fanqie represents a character’spronunciation with another two characters, combining the former’s “initial” andthe latter’s “rhyme” and tone. An English equivalent would be to combine theinitial of ‘peek’ /phi:k/ and the rhyme of ‘cat’ /kæt/ to get ‘pat’ /phæt/. Take the


Table 1 The Phonological Features of the Character “ ” in the Guangyun

Fanqie:

Initial:Openess: openLevel: first

Rhyme:Tone: plain

character “ ” [tuN] for example. The fanqie of the character is “ ” ([tok] and[huN]), so that we can get its pronunciation if we know the actual pronunciationof these two fanqie characters. Although the fanqie is an effective method torepresent the pronunciation of a Chinese character, it is still hard to analyze sincethe usage of two characters for the initial and rhyme is arbitrary. Fortunately, therevision of the Guangyun included the additional annotations by some scholars.They analyzed all the characters used in the fanqie and categorized the samehomophones into groups and choose an identical Chinese character standingfor the group. After the analyzing, there are 36 initials and 106 rhymes inthe Guangyun. In addition to the initial and rhyme, there are another featuresadded into the revision of the Guangyun, such as fanqie, initial, rhyme, openness(round or unround), level (different medial vowels), and tone. Table 1 shows thesix features of the character “ .” and Table 2 is the 36 initials and theirapproximate pronunciations.

To employ the Guangyun rime book data in our analyses, we must firstconvert the Chinese characters it uses to represent pronunciation into Interna-tional Phonetic Alphabet (IPA) notation. There are many former researcherstrying to reconstruct the action pronunciations of the characters in middle Chi-nese. We use the reconstruction of middle Chinese pronunciation proposed byWang Li13) for this task. Take the character “ ” for example. Its initial “ ”is converted to IPA [G] and its rhyme “ ” is converted to [uN], giving us a finalreconstructed IPA phonemic form of [GuN].

[ 2 ] Phonetic similarity measurementWe use the ALINE algorithm to measure the similarity between the pho-

netic forms of every two candidate terms according to their string editing dis-tance. Algorithm 1 shows the pseudocode for the phonetic similarity algorithmwhich we constructed using dynamic programming. Given string x and y stringsas inputs, the algorithm has three string editing actions: skip, substitute, and ex-pand. Skip skips a phoneme in either input string; substitute replaces a phonemein one input string by one in the other input string, and expand substitutes aphoneme in one input string with two phonemes in the other input string. Whenfinal consonants or consonants in consonant clusters are dropped in translitera-tions from Sanskrit to middle Chinese, we have to consider the expand action.In Algorithm 1, the functions δskip, δsub, δexp are the scoring functions of theskip, substitute and expand actions.

Sound changes in a language show tendencies.14) For example, the change


273

Table 2 The 36 Initials in the Guangyun

from dental stop [t] to palatal affricate [tC] is common in many languages; how-ever, the dental stop [t] is rarely changed to the glottal fricative [h]. The sim-ilarity between [t] and [tC] should be higher than the one between [t] and [h].Therefore, the scoring functions should take the actual phonetic features suchas place of articulation, manner, voiced or voiceless into consideration. Table 3shows a detailed definition of the scoring functions. The constants in the scoringfunctions are set to Cskip = −10, Csub = 35, Cexp = 45, and Cvwl = 10 which arethe optimal settings for cognate identification from Kondrak’s research.7) Thephonetic features of each phoneme are represented as vectors, and the functiondiff(p, q, f) is the difference value between phoneme p and phoneme q in featuref .


Algorithm 1 ALINE phonetic similarity algorithmS[0, 0] ← 0n ← |x|m ← |y|for i = 1 to n do

S[i, 0] ← S[i− 1, 0] + δ(ai,−)end forfor j = 1 to m do

S[0, j] ← S[0, j − 1] + δ(−, bj)end forfor i = 1 to n do

for j = 1 to m do

S[i, j] ← max

S[i− 1, j] + δskip(xi)S[i, j − 1] + δskip(yi)

S[i− 1, j − 1] + δsub(xi, yj)S[i− 1, j − 2] + δexp(xi, yj−1yj)S[i− 2, j − 1] + δexp(xi−1xi, yj)

0

end forend forreturn S[n, m]

Table 3 Scoring Functions

δskip(p) = Cskip

δsub(p, q) = Csub − δ(p, q)− V (p)− V (q)

δexp(p, q1q2) = Cexp − δ(p, q1)− δ(p, q2)−V (p)−max(V (q1), V (q2))

where

V (p) =

{0 if p is a consonantCvwl otherwise

δ(p, q) =∑f∈R

diff(p, q, f)× salience(f)

where

R =

{RC if p or q is a consonantRV otherwise

The phonemic features we use are divided into two sets. One is for com-paring two vowels, RV , and one is for comparing two consonants or one con-sonant and one vowel, RC . The features in RV are: Syllabic, Nasal, Retroflex,High, Back, and Round. The features in RC are: Syllabic, Manner, Voice, Nasal,Retroflex, Lateral, Aspirated, and Place. The value of each feature is in the rangeof [0.0, 1.0]. Place, Manner, High, and Back features are multi-valued. Table 4shows the values of these four features which derived from Kondrak’s settings.7)

The other features only have two values: 0.0 and 1.0. The salience(f) functiondefines the importance of the feature f . For example, the place of articulationis much more important than aspiration in phonetic similarity comparison, sothe salience of the feature Place should be higher than that of Aspirated. The


275

Table 4 Multi-valued Features and Their Values

Feature name Phonological term Value

Place

[bilabial] 1.0[labiodental] 0.95[dental] 0.9[alveolar] 0.85[retroflex] 0.8[palato-alveolar] 0.75[palatal] 0.7[velar] 0.6[uvular] 0.5[pharyngeal] 0.3[glottal] 0.1

Manner

[stop] 1.0[affricate] 0.9[fricative] 0.8[approximant] 0.6[high vowel] 0.4[mid vowel] 0.2[low vowel] 0.0

High[high] 1.0[mid] 0.5[low] 0.0

Back[front] 1.0[central] 0.5[back] 0.0

Table 5 Salience Settings of the Features

Syllabic 5 Place 40Voice 10 Nasal 10Lateral 10 Aspirated 5High 5 Back 5Manner 50 Retroflex 10Round 5

salience values of the features are listed in Table 5.After computation of Algorithm 1, the max value in S[i, j], 0 ≤ i ≤ n,

0 ≤ j ≤ m is the similarity score of these two phoneme sequences. To offsetdiscrepancy due to the length of the sequences, we normalize the similarityscore into the range [0, 1] by the following formula: Suppose AlignScore(x, y) =

max0≤i≤n,0≤j≤m

S[i, j], the final similarity

Similarity(x, y) =AlignScore(x, y)AlignScore(q, q)

where q is the longest sequence between x and y.

§4 Evaluation

4.1 Data SetWe choose two Buddhist scriptures as our data set for evaluation from


the Chinese Buddhist Canon maintained by Chinese Buddhist Electronic TextAssociation (CBETA). They are the Samyuktagama ( ) and the Abhid-harma Mahavibhasa Sastra ( ). These two sutras are suitablefor evaluation purposes because both were translated into classical Chinese indifferent eras and have different characteristic of transliterations.

[ 1 ] SamyuktagamaThe Samyuktagama is an early Buddhist scripture collected shortly after

the Buddha’s death. The term agama in Buddhism refers to a collection ofdiscourses, and the name Samyuktagama means “connected discourses.” It isamong the most important sutras in Early Buddhism. The authorship of theSamyuktagama is traditionally regarded as the most early sutra collected bythe Mahakssyapa, the Buddha’s disciple, and five hundred Arhats three monthsafter the Buddha’s death. An Indian monk, Gunabhadra, translated this sutrainto classical Chinese in Liu Song dynasty around 443 C.E. The classical ChineseSamyuktagama has 50 volumes containing about 660,000 characters.

[ 2 ] MahavibhasaThe Abhidharma Mahavibhasa Sastra, or Mahavibhasa for short, is a vital

scripture in Mahayana Buddhism. The meaning of its title is “great explanationto the higher teachings.” Its authorship is attributed to five hundred Arhatsaround 150 C.E., 600 years after the Buddha’s death. The classical Chinese textof the Mahavibhasa was translated by a Chinese monk Xuanzang from the San-skrit text he got in India at the mid-seventh century in Tang dynasty. Xuanzangwas known for his extensive but careful translations of Indian Buddhist texts toChinese. He took the phonetic corresponding in transliteration very seriously, sothe transliterations he did are different from ones translated by previous trans-lators. The classical Chinese text of the Mahavibhasa is immense. It has 200volumes and more than 1.63 billion characters.

4.2 Evaluation MetricsWe use several evaluation metrics to estimate the performance of our sys-

tem. In digital humanities research, a key issue is the coverage of the extractionmethod. To maximize usefulness to researchers, a method should be able toextract as many potential transliterations from literature as possible. In ourevaluation, we use the common statistical measure of recall, defined as follows:

Recall =|Correctly extracted transliterations||Transliterations in the data set|

Furthermore, in order to reduce effort spent examining the transliterationidentification results, a good method should be able to select the correct translit-eration with the highest rank. We adopt three measurement methods to evaluatethe amount of effort annotators would have to apply to using our system’s re-sults. The first is Top-1 accuracy (ACC), which measures the correctness of thefirst transliteration candidate in the list produced by our method. ACC = 1


277

means that all top candidates are correct transliterations i.e. they match one ofthe references, and ACC = 0 means that none of the top candidates are correct.Top-1 accuracy is defined as follows:

ACC =1N

N∑

i=1

{1 if ∃ri,j : ri.j = ci,1;0 otherwise

}

where ri,j is the j-th corresponding transliteration for the i-th transliterationin the ground truth data set, ci,k is the k-th candidate transliteration (systemoutput) for the i-th transliteration in the ground truth data set, and N is thetotal number of the transliterations in the ground truth.

The second measure we use is the Mean Reciprocal Rank (MRR), whichis the average of the reciprocal ranks of results of all lists. The reciprocal rank ofa list is the multiplicative inverse of the rank of the first correct transliteration.When MRR is closer to 1, it implies that the correct answer appears close to thetop of the n-best lists. Mean Reciprocal Rank is defined as follows.

RRi =

minj

1j

if ∃ri,j , ci,k : ri,j = ci,k;

0 otherwise

MRR =1N

N∑

i=1

RRi

The definition of ri,j , ci,k and N are the same as above.Last, we also adopt the Mean Average Precision (MAP) measure, which

is widely used in information retrieval and extraction. MAP tightly measuresthe precision of the n-best candidates for the i-th transliteration term in theground truth. If all of the correct transliterations are found, then the MAP is1. We define the number of correct candidates for the i-th transliteration termin the ground truth in the k-best list as num(i, k). MAP is then given by

MAP =1N

N∑

i

1ni

(ni∑

k=1

num(i, k)

)

N is the total number of transliteration terms in the ground truth.

4.3 Experimental ResultsBecause the texts of the two scriptures in our data set have over a million

characters, it is not plausible to get all the transliteration pairs as the groundtruth to evaluate our method. Therefore, we collect the transliteration terms inthe extraction result of the suffix array method from the Samyuktagama and thetransliteration terms recorded appearing in the Samyuktagama from the Bud-dhist Authority Database∗3 maintained by Dharma Drum Buddhist College. We

∗3 http://authority.ddbc.edu.tw/


then check to see if the terms from the Samyuktagama have the other translit-eration variations which appear in the Mahavibhasa or not. We search for thesetransliteration terms in the Fo Guang Buddhism Dictionary15) and Ding Fubao’sGreat Dictionary of Buddhism16) to get all the transliteration variations of theterm and then check if some of them appearing in the Mahavibhasa. If so, weadd this two terms as a transliteration pair into our ground truth.

As mentioned in Section 2, there are no similar works on identificationof transliteration pairs in monolingual texts without source-language corpora.It is therefore difficult to find a state-of-art approach against which to comparethe performance of our system and demonstrate the advantages of our method.Instead, we constructed a baseline system to show the improvements in preci-sion of our method. The baseline approach uses the same suffix array method toextract terms from classical Chinese. For term filtering, the baseline approachadopts a rule-based method with heuristic rules based on observation of term ex-traction results from classical Chinese Buddhist literature. The baseline methodis relatively easy to implement. The heuristic rules are described as follows:

Punctuation Rule: This rule removes candidate terms containing Chinese punc-tuation characters, such as .

Initial Word Rule: This rule removes terms that contain some initial Chinesecharacters such as verbs, functional words, and titles. For example, it deletesterms that begin with functional words, such as . . . (at), . . . (in); verbs,such as . . . (say), . . . (show), . . . (regard); and some title words,such as . . . (Honored Sir).

Final Word Rule: This rule removes terms that have some specific appellationsor titles in the suffix like . . . (Bhiksu, Buddhist monk) and . . .(senior).

Maximum Matching Rule: To filter out more irrelevant terms, we add thisdictionary-based rule. We check if each candidate term has a matching entryin the Foguang Buddhism Dictionary. If a term does, all the substrings of thisterm are removed from the term extraction results. For instance, if the term“ ” is in the Foguang Buddhism Dictionary, substrings suchas “ ,” “ ,” “ ,” and “ ” are all removedfrom the term extraction results. However, if a substring also appears indepen-dently in the Foguang Buddhism Dictionary, it is retained.

We measure phonetic similarity between pairs of remaining terms usingstring edit distance in the baseline method. Chinese terms are first convertedcharacter by character into modern Hanyu Pinyin pronunciation strings.

Table 6 shows the experimental results of our method and the baselinesystem. The measurements we use are mentioned in section 4.2. Because sometransliterations are identical in the Samyuktagama and the Mahavibhasa, wealso construct a configuration to exclude the identical ones in our ground truth.


279

Table 6 Evaluation Result

Recall ACC MRR MAP

Our methodall pairs 0.7647 0.5588 0.6353 0.6329

distinct pairs 0.6863 0.4510 0.5138 0.5105

Baselineall pairs 0.6029 0.4264 0.4972 0.4950

distinct pairs 0.5098 0.3137 0.3688 0.3658

The results show that our method outperforms the baseline system. It alsodemonstrate the necessity of a specifically designed approach to deal with theclassical Chinese transliteration pair extraction problem instead of a generalterm extraction method. The evaluation results show that our method canextract almost 75% transliteration pairs and the average precision achieves 63%.It shows our method is effective to extract the transliteration pairs in Buddhistscriptures. Since most of the transliteration pairs in the ground truth are one-to-one mapping, the values of MAP are close to MRR. The MRR results show thatour method can extract the transliteration pairs in top 2 candidates averagely.It proves our method can save a lot of labor-intensive work to examine thetransliteration for the historical and humanity researchers.

§5 Discussion

5.1 Effectiveness of Transliteration Pair ExtractionOur method can extract many transliteration pairs which transliterated

from the same Sanskrit words in the Samyuktagama and the Mahavibhasasuch as “ ” and “ ” (Bhikhu, Buddhist monks), “ ” and(Maudgalyayana, one of the Buddha’s closest disciples), “ ” and “

” (Pratimoksha, Buddhist moral discipline). The transliteration pairswhose pronunciations are not similar in modern Chinese can also be extracted.Some examples are “ ” and “ ” (Upasaka, male followers of theBuddhism who are not monks), “ ” and “ ” (Kapilavastu, thename of an ancient kingdom where the Buddha was born and grew up), and“ ” and “ ” (Katyayana, one of the Buddha’s closest disciples).Our method can find the person names, location names, and Buddhism specificterms that take important roles in historical Chinese phonology.

Our method also identify some terms that have similar pronunciationsin Sanskrit. For example, our method find the transliteration pair “ ” and“ ” for the Sanskrit term “Samadhi” (mental concentration). The term“ ” (Samatha, calm abiding) that has the similar pronunciation is alsoextracted by our method. Samatha is a mediation practice to achieve Samadhi.These terms may help the research of Buddhism and phonology.

We discovered that transliterations may vary even in the same scripture.In the Samyuktagama, the Sanskrit term “Magadha” (the name of an ancientIndian kingdom) has three different transliterations: “ ,” “ ,” and“ .” The term “Arhat” (a spiritual practitioner who has realized certainhigh stages of attainment) also has three transliterations: “ ,” “ ,”and “ .” These variations may help the study of historical Chinese phonol-


ogy and philology. It also shows the importance that the terms’ actual pronun-ciations should be considered while dealing with the Buddhist scriptures.

5.2 Error CasesAlthough our method can extract and identify most transliteration pairs,

some transliteration pairs cannot be identified. The error cases can be dividedinto several categories. The first one is the transliterations that have not beenextracted by the suffix array method. For example, the transliteration term“ ” (Mathura, an ancient city in northern India) is not extracted by thesuffix array method because it appears only once in the Mahavibhasa. It is thelimitation of the suffix array method.

The other case is rarely-used characters that are not in our data set forconstructing the bi-gram language model. Take the transliteration “ ”(Aranya, the forest) in the Mahavibhasa, for example. The middle character“ ” is seldom used in transliterations and not covered in our dataset. Thewidely-used transliteration for the Sanskrit “Aranya” is “ .”

The final case is the phonemic problem of the Chinese transliteration itself.For example, the Sanskrit term “Upasika” (female followers of the Buddhism)are transliterated into “ ” in the Samyuktagama and “ ” in theMahavibhasa. The pronunciation of the last character “ ” in the transliteration“ ” in the middle Chinese is [ji]. The semi-vower [j] is not similar to thedental fricative [s] in Sanskrit origin. Thus, the phonetic similarity score is toolow to identify by our method.

§6 ConclusionThe identification of transliterations of foreign loanwords is an important

task in research fields such as historical linguistics and digital humanities. Wepropose an approach which can identify transliteration pairs automatically inclassical Chinese texts. Our approach comprises two stages: transliteration ex-traction and transliteration pair identification. To extract potential translitera-tions, we adopt a hybrid method consisting of a suffix-array-based extraction stepand a language-model based filtering process. We then compare the extractedtransliteration candidates for phonetic similarity based on their pronunciationsin the middle Chinese rime book Guangyun with the ALINE algorithm. Pairswith similarity above a certain threshold are considered transliteration pairs.To evaluate our method, we constructed an evaluation set from the two Bud-dhist texts, the Samyuktagama and the Mahavibhasa, which were translatedinto Chinese in different eras. The recall of our method achieves 0.75 and themean average precision is 0.6329. The results show our method is effective toextract the transliteration pairs in classical Chinese texts. Our method can findthe relationship of sound correspondence among the immense classical literaturesto help many researches such as historical linguistics and philology.


281

References1) Shieh, Y.-P., “Appositional Term Clip: A Subject-oriented Appositional Term

Extraction Algorithm,” New Eyes for Discovery: Foundations and Imaginationsof Digital Humanities, National Taiwan University Press, pp. 133–162, 2011.

2) Sherif, T. and Kondrak, G., “Bootstrapping a stochastic transducer for Arabic-English transliteration extraction,” In Proc. of Annual Meeting-Association forComputational Linguistics, 2007.

3) Kuo, J-S., Li, H. and Yang, Y-K., “A Phonetic Similarity Model for AutomaticExtraction of Transliteration Pairs,” ACM Trans. Asian Language InformationProcessing, 6, 2, 2007.

4) Oh, J. and Choi, K., “A statistical model for Automatic Extraction of KoreanTransliterated Foreign words,” International Journal of Computer Processing ofOriental Languages, 16, 1, pp. 41–62, 2003.

5) Goldberg, Y. and Elhadad, M., “Identification of transliterated foreign wordsin Hebrew script,” Computational Linguistics and Intelligent Text Processing,2008.

6) Covington, M.A., “An algorithm to align words for historical comparison,” Com-putational Linguistics, 22, 4, pp. 481–496, 1996.

7) Kondrak, G., “Phonetic alignment and similarity,” Computers and the Human-ities, 37, 3, pp. 273–291, 2003.

8) Tiedemann, J., “Extraction of translation equivalents from parallel corpora,”Proc. of the 11th Nordic conference on computational linguistics, pp. 120–128,1998.

9) Nakov, P., Pacovski, V. and Paskaleva, E., “Extraction of translation equivalentsfrom parallel corpora,” Proc. of the 11th Nordic conference on computationallinguistics, pp. 120–128, 1998.

10) Ristad, E.S. and Yianilos, P.N., “Learning string-edit distance,” IEEE Transac-tions on Pattern Analysis and Machine Intelligence, 20, 5, pp. 522–532, 1998.

11) Mackay, W. and Kondrak, G., “Computing word similarity and identifying cog-nates with Pair Hidden Markov Models,” Proc. of the Ninth Conference onComputational Natural Language Learning, pp. 40–47, 2005.

12) Manzini, G. and Ferragina, P., “Engineering a lightweight suffix array construc-tion algorithm,” Algorithmica, 40, 1, pp. 33–50, 2004.

13) Wang, L., Historical Chinese Phonology, Zhonghua Book Company, 2002.

14) Cambel, L., Historical linguistics: an introduction, The MIT Press, 1987.

15) Ciyi, Fo Guang Buddhist Dictionary, Buddha’s Light Publishing, 1988.

16) Ding, F.-B., Great Dictionary of Buddhism, The Medical Press, 1922.


Yu-Chun Wang: He currently works as a researcher at Chunghwa

Telecommunication Laboratories and is also a Ph. D. student at

the Department of Computer Science and Information Engineer-

ing, National Taiwan University, Taiwan. His research interests

are the areas of developing cross-lingual techniques to extract

useful information from text resources to assist the researches on

digital humanity, linguistics and other fields. He currently also

gets involved in mobile and cloud computing researches.

Chun-Kai Wu: He received his BS in computer science from Yuan

Ze University, and now a graduate student at the Department of

Computer Science and Information Engineering, National Tsing-

Hua University, Taiwan. His research interest focuses on Machine

Learning and Natural Language Processing, and he currently is

working on the research about cross lingual entity linking.

Richard Tzong-Han Tsai, Ph.D.: He is an associate professor in

the Computer Science and Engineering Department at Yuan Ze

University. His main research interests include natural language

processing, cross-lingual information access, sentiment analysis,

and context based recommendation. He has a Ph.D. in computer

science and information engineering from National Taiwan Uni-

versity. He is a member of the Association for Computational Lin-

guistics (ACL), Association for Computational Linguistics and

Chinese Language Processing (ACLCLP), and the Taiwanese As-

sociation for Artificial Intelligence (TAAI). Contact him at tht-

[email protected].


283

Jieh Hsiang: He is a distinguished professor of Computer Science

and Information Engineering at National Taiwan University, and

the director of Research Center of Digital Humanities at National

Taiwan University. He is also a joint research fellow of Institute

of Information Science, Academia Sinica, Taiwan. His earlier re-

search had mainly been on automated reasoning, especially on

term rewriting systems, he has been working on digital libraries

and archives since late 1990s. In recent years, his main research

emphasis has been on digital humanities, with a focus on devel-

oped information technologies for historical research. Among his

other professional services are the founding Chair of IFIP WG1.6

(Working Group on Term Rewriting), the founder and first Dean

of the College of Science and Technology of the National Chi-Nan

University, the Program Director of Computer Science of the Na-

tional Science Council and Advisor of the Council for Cultural

Affairs of Taiwan.

transliteration pair extraction from classical chinese buddhist literature using phonetic similarity...

Documents