
A Machine Transliteration Model Based on Correspondence between Graphemes and Phonemes

JONG-HOON OH

National Institute of Information and Communications Technology

KEY-SUN CHOI

Korea Advanced Institute of Science and Technology

and

HITOSHI ISAHARA

National Institute of Information and Communications Technology

Machine transliteration is an automatic method for converting words in one language into phonetically equivalent ones in another language. There has been growing interest in the use of machine transliteration to assist machine translation and information retrieval. Three types of machine transliteration models have been proposed: grapheme-based, phoneme-based, and hybrid. Surprisingly, there have been few reports of efforts to utilize the correspondence between source graphemes and source phonemes, although this correspondence plays an important role in machine transliteration. Furthermore, little work has been reported on ways to dynamically handle source graphemes and phonemes. In this paper, we propose a transliteration model that dynamically uses both graphemes and phonemes, particularly the correspondence between them. With this model, we have achieved better performance than has been reported for other models: improvements of about 15 to 41% in English-to-Korean transliteration and about 16 to 44% in English-to-Japanese transliteration.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing—Machine Translation; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Experimentation, Languages, Performance

Additional Key Words and Phrases: Machine transliteration, machine translation, grapheme and phoneme, information retrieval, natural language processing

Authors' addresses: Jong-Hoon Oh and Hitoshi Isahara, the Computational Linguistics Group at the National Institute of Information and Communications Technology (NICT), 3-5 Hikaridai, Seikacho, Soraku-gun, Kyoto, 619-0289, Japan; email: {rovellia,isahara}@nict.go.jp. Key-Sun Choi, Division of Computer Science at KAIST, 373-1 Guseong-dong, Yuseong-gu, Daejeon, 305-701, Republic of Korea; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2006 ACM 1530-0226/06/0900-0185 $5.00

ACM Transactions on Asian Language Information Processing, Vol. 5, No. 3, September 2006, Pages 185–208.


1. INTRODUCTION

Machine transliteration is an automatic method for converting words in one language into phonetically equivalent ones in another language. For example, the English word data is commonly transliterated into Korean as "deiteo" and Japanese as "deta."1 Transliteration is generally used to phonetically translate proper names and technical terms, especially from languages using Roman alphabets into languages using non-Roman alphabets, such as from English to Korean, Japanese, or Chinese. There has been growing interest in the use of machine transliteration to assist machine translation (MT) [Al-Onaizan and Knight 2002; Knight and Graehl 1997], monolingual information retrieval (MLIR) [Kang and Choi 2000; Lee and Choi 1998], and cross-lingual information retrieval (CLIR) [Fujii and Tetsuya 2001; Lin and Chen 2002]. In the areas of MLIR and CLIR, machine transliteration bridges the gap between a transliterated localized form and its original form by generating transliterated forms from each original form (or by generating original forms from each transliteration). For CLIR especially, machine transliteration is useful for query translation, where proper names and technical terms frequently appear in source language queries as transliterations. In the area of MT, machine transliteration prevents translation failure when translations of proper names and technical terms are not registered in a translation dictionary. The use of a machine transliteration model should, therefore, improve the performance of MT, MLIR, and CLIR systems.

Three types of machine transliteration models have been studied: the grapheme2-based transliteration model (ψG) [Goto et al. 2003; Kang and Choi 2000; Kang and Kim 2000; Lee and Choi 1998; Lee 1999; Li et al. 2004], the phoneme3-based transliteration model (ψP) [Knight and Graehl 1997; Lee 1999; Kang 2001], and the hybrid transliteration model (ψH) [Al-Onaizan and Knight 2002; Bilac and Tanaka 2004; Lee 1999]. The first two types are classified in terms of the units to be transliterated: ψG is referred to as the direct model, because it directly transforms source language graphemes into target language graphemes without any phonetic knowledge of source language words, and ψP

is called the pivot model because it uses source phonemes as a pivot during the transliteration process. Therefore, ψP usually needs two steps:4 (1) produce source language phonemes from the source language graphemes and (2) produce target language graphemes from the source language phonemes. The

1 In this paper, target language transliterations are shown in Romanized form in quotation marks (""). In Japanese transcriptions, we use an overline to denote Japanese long vowels, like "e" in "deta".
2 Graphemes refer to the basic units (or the smallest contrastive units) of written language; for example, English has 26 graphemes (i.e., the 26 letters of the alphabet), Korean has 24, and German has 30.
3 Phonemes are the simplest significant units of sound (or the smallest contrastive units of spoken language); for example, /M/, /AE/, and /TH/ in the word math.
4 Depending on the implementation, the two steps in the phoneme-based transliteration model are either explicit, if the transliteration system produces target language transliterations after producing the pronunciation of the source language words, or implicit, if the transliteration system implicitly uses phonemes in the transliteration stage and explicitly uses them in the learning stage, as in the Bilac and Tanaka [2004] model.


last type combines ψG and ψP through linear interpolation. Hereafter, we refer to a source language grapheme as a source grapheme, a source language phoneme as a source phoneme, and a target language grapheme as a target grapheme.

Although transliteration is a phonetic process (ψP) rather than an orthographic one (ψG) [Knight and Graehl 1997], we should consider both the source graphemes and phonemes to achieve high performance in machine transliteration, because standard transliterations are not restricted to phoneme-based transliterations.5 However, many previous studies used only the source graphemes or only the source phonemes. This simplifies the machine transliteration problem into either ψG or ψP, under the assumption that one of the two can cover all transliteration behaviors; in reality, transliteration is a complex process that does not rely solely on the source graphemes or phonemes. For example, the standard Korean transliterations of amylase and data are, respectively, a grapheme-based transliteration ("amillaaje") and a phoneme-based transliteration ("deiteo"). A machine transliteration model must, therefore, reflect these dynamic transliteration behaviors in order to produce correct transliterations.

ψH has limited ability to produce correct transliterations because it simply combines ψG and ψP through linear interpolation. It does not consider the correspondence between the source graphemes and phonemes, even though this correspondence plays an important role in machine transliteration. For example, the source phoneme /AH/6 produces significant ambiguities because it can be mapped to almost every vowel in the source and target languages (the following underlined graphemes correspond to /AH/: cinema, hostel, holocaust in English; "sinema", "hostel", "hollokoseuteu" in their Korean counterparts; and "sinema", "hosuteru", "horokosuto" in their Japanese counterparts). If we know the correspondence between the source graphemes and phonemes in their original context, we can more easily infer the correct transliteration of /AH/, since a target grapheme of /AH/ usually depends on the source grapheme corresponding to /AH/.7 Korean transliterations of the source grapheme a vary among "a", "ei", "o", "eo", and so on. As shown in Table I, the correspondence makes it possible to reduce this transliteration ambiguity. In the table, the underlined source grapheme a in the example column is pronounced as the source phoneme in the source phoneme column. The correct Korean transliterations of the source grapheme a can be more easily found, as shown in the Korean grapheme column, by means of the source phonemes in the source phoneme column.

5 In a set of test English-to-Korean transliterations [Nam 1997], about 60% of them were found to be phoneme-based transliterations and about 30% were grapheme-based ones. The remainder were generated by a combination of ψG and ψP.
6 We use ARPAbet symbols to represent source phonemes. ARPAbet is a method used to code phonemes into ASCII characters (www.cs.cmu.edu/~laura/pages/arpabet.ps). In this paper, we denote source phonemes and pronunciation with two slashes, e.g., /AH/. The pronunciation represented in this paper is based on the CMU Pronunciation Dictionary and the American Heritage(r) Dictionary of the English Language.
7 Obviously, using contextual information can reduce the ambiguity. However, in this paragraph, we limit our discussion to the effects of the source graphemes, the source phonemes, and their correspondence on machine transliteration.


Table I. Korean Graphemes Derived from Source Grapheme a and Corresponding Source Phoneme

Korean Grapheme   Source Phoneme   Source Grapheme Example Usage
"a"               /AA/             adagio, safari, vivace
"ae"              /AE/             advantage, alabaster, travertine
"ei"              /EY/             chamber, champagne, chaos
"i"               /IH/             advantage, average, silage
"o"               /AO/             allspice, ball, chalk
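To make the effect of the correspondence concrete, the following sketch (illustrative only; the variable names are ours, and the mappings are copied from Table I) contrasts a lookup keyed by the source grapheme alone with one keyed by the grapheme-phoneme pair:

```python
# Illustrative sketch (not the paper's implementation): how the grapheme-phoneme
# correspondence in Table I narrows the choice of Korean graphemes for the
# English grapheme 'a'.

# Keyed by source grapheme alone: many candidate Korean graphemes remain.
by_grapheme = {
    "a": ["a", "ae", "ei", "i", "o"],
}

# Keyed by the (source grapheme, source phoneme) correspondence: one candidate.
by_correspondence = {
    ("a", "/AA/"): "a",    # adagio, safari, vivace
    ("a", "/AE/"): "ae",   # advantage, alabaster, travertine
    ("a", "/EY/"): "ei",   # chamber, champagne, chaos
    ("a", "/IH/"): "i",    # advantage, average, silage
    ("a", "/AO/"): "o",    # allspice, ball, chalk
}

print(by_grapheme["a"])                  # ['a', 'ae', 'ei', 'i', 'o']  (ambiguous)
print(by_correspondence[("a", "/EY/")])  # 'ei'                         (resolved)
```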

We propose a "correspondence-based machine transliteration model" (ψC) that dynamically uses both source graphemes and phonemes. ψC has two significant strengths compared to ψG, ψP, and ψH. First, it produces transliterations by utilizing the correspondence between the source graphemes and phonemes. As described above, this correspondence is very useful for reducing transliteration ambiguity. Thus, ψC is better at reducing ambiguity than ψG, ψP, and ψH. Second, it dynamically handles source graphemes and phonemes based on their contexts. Although ψH also utilizes the source graphemes and phonemes, it uses only a static weight for interpolating between the two probabilities related to the source graphemes and source phonemes. Because ψC dynamically handles source graphemes and phonemes based on their contexts, it can produce both grapheme- and phoneme-based transliterations depending on the context. It can also produce a transliteration in which one part is grapheme-based and the other part is phoneme-based. For example, in the Korean transliteration of neomycin, "neomaisin," "neo" is a grapheme-based transliteration and "maisin" is a phoneme-based transliteration.

This paper is organized as follows. Section 2 describes related work. Section 3 describes our proposed correspondence-based machine transliteration model, and Section 4 describes the modeling of the component functions. Section 5 describes the experiments we conducted to evaluate our model's performance and presents some of the results. Section 6 discusses the various types of errors in transliteration based on an analysis of actual transliterations. We conclude in Section 7 with a brief summary and a look at future work.

2. PREVIOUS WORK

2.1 Grapheme-Based Transliteration Models

Grapheme-based transliteration models are classified into those based on statistical translation, decision trees, transliteration networks, or joint source channels. In this section, we describe the key points of each type.

Lee [1998, 1999] proposed a grapheme-based English-to-Korean transliteration model based on statistical translation. He attempted to generate a transliterated Korean word K for a given source English word E using Eq. (1). He defined a "pronunciation unit" (PU) as the graphemes that correspond to the source phoneme. He then segmented English into PUs and attempted to find the most relevant Korean graphemes corresponding to the PUs. For example, the English word "board (/B AO R D/)" was segmented into its pronunciation


units b(/B/):oa(/AO/):r(/R/):d(/D/);8 here, b, oa, r, and d are PUs. With these units, an English word E_i is represented as E_i = epu_{i1}, ..., epu_{in}, where epu_{ij} is the jth PU of E_i. A sequence of Korean PUs, kpu_{i1}, kpu_{i2}, ..., kpu_{in}, is generated for each E_i. Lee [1998] generated all possible English PU sequences for each word and their corresponding Korean PU sequences. For example, board can be divided into the PU sequences b:oar:d, b:oa:r:d, b:o:a:r:d, and so on. All possible Korean PUs are then generated from these sequences ("b:o:deu," "b:o:reu:deu," "b:o:a:reu:deu," and so on). The best PU sequence is then selected using Eq. (1) and taken as the Korean transliteration.

argmax_K P(K|E) = argmax_K P(K) P(E|K)    (1)

P(K) ≅ p(kpu_1) ∏_{i=2}^{n} p(kpu_i | kpu_{i-1})

P(E|K) ≅ ∏_{i=1}^{n} p(epu_i | kpu_i)
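The following sketch shows one way Eq. (1) could be applied to pick the best Korean PU sequence. It is not Lee's implementation; the probability tables passed in (p_bigram for p(kpu_i | kpu_{i-1}) and p_channel for p(epu_i | kpu_i)) would have to be estimated from an aligned English-Korean corpus as in Lee [1998, 1999].

```python
import math

# A minimal sketch of Eq. (1): candidate Korean PU sequences are scored with a
# Korean PU bigram model P(K) and a channel model P(E|K), and the argmax is
# returned. The probability tables are assumed to be given.

def score(e_pus, k_pus, p_bigram, p_channel):
    """log P(K) + log P(E|K) for aligned PU sequences of equal length."""
    logp = math.log(p_bigram[("<s>", k_pus[0])])    # p(kpu_1), modeled with a start symbol
    for prev, cur in zip(k_pus, k_pus[1:]):
        logp += math.log(p_bigram[(prev, cur)])     # p(kpu_i | kpu_{i-1})
    for epu, kpu in zip(e_pus, k_pus):
        logp += math.log(p_channel[(epu, kpu)])     # p(epu_i | kpu_i)
    return logp

def best_transliteration(e_pus, candidates, p_bigram, p_channel):
    """argmax_K P(K) P(E|K) over the candidate Korean PU sequences."""
    return max(candidates, key=lambda k_pus: score(e_pus, k_pus, p_bigram, p_channel))
```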

Kim et al. [1999] expanded on Lee's work by considering more information when estimating P(E|K), as shown in Eq. (2). They used additional information (the Korean PUs kpu_{i-1} and kpu_{i+1}) to approximate P(E|K).

P(E|K) ≅ ∏_{i=1}^{n} p(epu_i | kpu_{i-1}, kpu_i, kpu_{i+1})    (2)

There are two drawbacks to this approach. First, errors can occur in the PU segmentation, leading to incorrect transliterations. Second, generating all possible PUs for each word in each language is a time-consuming process: if the total number of English PUs is N and the average number of Korean PUs generated for each English PU is M, the total number of generated Korean PU sequences will be about N × M.

Kang and Choi [2000]9 and Kang [2001] proposed an English grapheme to Korean grapheme conversion model based on decision trees. Seven contextual English graphemes (the left three, the right three, and the target one) are used to determine the Korean graphemes corresponding to the target English grapheme. For each English grapheme, a corresponding decision tree is constructed (26 in all). While this approach considers wider contextual information, it does not consider the phonetic aspect of the transliteration.

8 ':' is used as a PU boundary.
9 GDT in Section 5.
10 GTN1 in Section 5.

Fig. 1. Korean transliteration network for English word scalar.

Kang and Kim [2000]10 proposed an English–Korean transliteration model based on a transliteration network. All possible grapheme sequences are generated to make a transliteration network, as shown in Figure 1. Each node in the network is composed of more than one English grapheme (including graphemes and chunks of graphemes) and the corresponding Korean graphemes. Each arc represents a possible link between nodes. The strength of a link is represented by the weight calculated using Eq. (3), in which "Context" refers to historical information regarding English graphemes and "Output" is the generated Korean grapheme. The optimal path is the one with the highest total weight, as calculated using a Viterbi algorithm [Forney 1973] and a tree-trellis algorithm [Soong and Huang 1991].

weight(Context, Output) = C(Output) / C(Context)    (3)
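A rough sketch of how such a network could be searched is shown below; it is not Kang and Kim's code, and the layer and weight structures are simplified placeholders. The link weight follows Eq. (3), and the path search is a small Viterbi-style dynamic program.

```python
# Illustrative sketch only. link_weight mirrors Eq. (3); best_path finds the
# highest-scoring left-to-right path through a layered transliteration network.

def link_weight(count_context_output, count_context):
    # weight(Context, Output) = C(Output) / C(Context)   (Eq. 3)
    return count_context_output / count_context

def best_path(layers, weight):
    """layers[i]: candidate target graphemes at position i;
    weight(prev, cur): score of the link between consecutive choices."""
    best = [{g: (0.0, None) for g in layers[0]}]      # grapheme -> (score, backpointer)
    for i in range(1, len(layers)):
        column = {}
        for g in layers[i]:
            prev, s = max(((p, best[i - 1][p][0] + weight(p, g)) for p in layers[i - 1]),
                          key=lambda x: x[1])
            column[g] = (s, prev)
        best.append(column)
    g = max(best[-1], key=lambda x: best[-1][x][0])   # backtrace from the best end node
    path = [g]
    for i in range(len(layers) - 1, 0, -1):
        g = best[i][g][1]
        path.append(g)
    return list(reversed(path))
```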

Goto et al. [2003]11 proposed an English-to-Japanese transliteration model, similar to Kang and Kim's model [2000], in which the network is also constructed with nodes and arcs. A node represents more than one English grapheme (including graphemes and chunks of graphemes) and its corresponding katakana characters. An arc represents a possible link between nodes. Consider Eq. (4), where KU represents a katakana string, E represents an English word, ku represents a katakana unit, and z represents the chunk of English graphemes to be transliterated. For example, the transliterated katakana string for actinium is "akuchiniumu", KU = ku_1, ..., ku_6, where ku_1 = "a," ku_2 = "ku," ku_3 = "chi," ku_4 = "ni," ku_5 = "u," and ku_6 = "mu." Equation (4) is composed of a translation model, p(ku_i | ku_{i-c}^{i-1}, e_{i-c}^{i+b}), and a chunking model, p(z_j | z_{j-c}^{j-1}, e_{j-a'+1}^{j+b'}). The parameters of each model are estimated using a maximum entropy model. The most probable katakana string that maximizes P(KU|E) is calculated using a Viterbi algorithm.

KU* = argmax_{KU} P(KU|E)    (4)
    = ∏_{i=1} p(ku_i | ku_{i-c}^{i-1}, e_{i-c}^{i+b}) × ∏_{j=1}^{m-1} p(z_j | z_{j-c}^{j-1}, e_{j-a'+1}^{j+b'})

The advantages of the Kang and Kim [2000] and Goto et al. [2003] models are that (1) they consider the phonetic aspect called "chunks of graphemes" and (2) they perform a one-step procedure using a transliteration network, as opposed to the two-step pipelined procedure (grapheme chunking and then transliteration) of Lee's model [Lee 1999; Lee and Choi 1998]. However, they consider the phonetic aspect only at the grapheme level, which is inadequate. Further information, such as the source phonemes corresponding to a source grapheme (or a

11 GTN2 in Section 5.


chunk of source graphemes), is necessary to generate a correct target language transliteration.

The joint source-channel based model [Li et al. 2004] simultaneously models both source language and target language contexts (bigram and trigram) for machine transliteration. Its main advantage is its use of bilingual contexts.

2.2 Phoneme-Based Transliteration Models

Knight and Graehl [1997] proposed a Japanese-to-English transliteration model that is strictly a back-transliteration model rather than a transliteration one in Knight and Graehl's [1997] definition. They modeled Japanese-to-English transliteration with weighted finite-state transducers (WFSTs). Five parameters are considered: p(w), p(e|w), p(j|e), p(k|j), and p(o|k). They represent, respectively, the probabilities for the word sequence, word to English sound, English sound to Japanese sound, Japanese sound to katakana, and katakana to OCR. Given the katakana string o observed by OCR, their model finds the English word sequence w that maximizes the sum, over e, j, and k, of the product of these five probabilities. The main contribution of this work is that it offers a basic framework for ψP (source language word → pronunciation → target language word).

Lee [1999] proposed a phoneme-based English-to-Korean transliteration model that generates Korean transliterations through a two-step procedure. First, it converts English PUs into English phonemes using the statistical translation model described in Section 2.1. The phonemes are then transformed into Korean PUs using the "English-to-Korean Standard Conversion Rules" (EKSCRS), which describe the conversion of English phonemes into Korean graphemes using the phonemes as conditions for outputting Korean transliterations [Korea Ministry of Culture and Tourism 1995]. This approach suffers from two problems: error propagation and limitations in EKSCRS. The English-PU-to-English-phoneme conversion procedure usually produces errors, which propagate to the next step. The propagated errors make it difficult to generate correct transliterations. The other problem is that EKSCRS does not contain enough rules to generate correct Korean transliterations for all the corresponding English words, since its main focus is on mapping from one English phoneme to one Korean grapheme without considering the context of the English graphemes and phonemes. For example, the English word board and its pronunciation /B AO R D/ are incorrectly transliterated into "boreudeu" by EKSCRS; if the context is considered, the correct transliteration "bodeu" can be produced. Because of these limitations, Lee's phoneme-based model performs worse than his grapheme-based one (described in Section 2.1).

Kang [2001]12 proposed a phoneme-based English-to-Korean transliteration model based on decision trees. First, the model produces pronunciation based on a pronunciation dictionary. Then, decision trees for transforming source phonemes into target graphemes are applied. The model used for the decision trees is similar to his grapheme-based transliteration model [Kang and Choi 2000; Kang 2001] described in Section 2.1. Context information for constructing

12 PDT in Section 5.


decision trees is seven English phonemes (the left three, the right three, and the target one). However, this model depends on a pronunciation dictionary, making it difficult to produce transliterations when a given English word is not registered in the pronunciation dictionary.

2.3 Hybrid Transliteration Models

Several attempts have been made to use both source graphemes and phonemes in machine transliteration. Several researchers [Lee 1999; Al-Onaizan and Knight 2002; Bilac and Tanaka 2004] have proposed hybrid transliteration models in which ψG and ψP are modeled with WFSTs [Bilac and Tanaka 2004]13 or a source-channel model [Lee 1999; Al-Onaizan and Knight 2002]. Then, ψG and ψP are combined through linear interpolation. In ψP, several parameters are considered, such as the source grapheme to source phoneme probability, the source phoneme to target grapheme probability, and the target language word probability. ψG mainly considers the source grapheme to target grapheme probability. The main disadvantage of the hybrid models is that they do not consider the dependence between the source graphemes and phonemes in the combining process.

2.4 Summary

Much of the previous work has focused on ψG, rather than ψP, because the former offers certain advantages compared to the latter. First, unlike ψP, ψG does not require any knowledge about pronunciation, meaning that while ψG requires only the source grapheme to target grapheme transformation, ψP needs source grapheme to source phoneme transformation and source phoneme to target grapheme transformation. Second, error propagation occurs in ψP because it is composed of two steps. Errors in the first step usually make it difficult to generate correct transliterations in the second step. This is the main reason for the performance of ψP usually being lower than that of ψG, although many standard transliterations are phoneme-based transliterations. However, ψG has limitations. Although it is a relatively simple and effective model, it does not consider phonetic features such as source phonemes. This causes errors when a source phoneme rather than a source grapheme provides important clues that can be used to generate the correct transliteration.

Because of the respective natures of ψG and ψP, they cannot simultaneously consider both a source grapheme and a source phoneme. This makes it difficult to generate a correct transliteration, particularly when one model encounters source language words that should be transliterated through negotiation between the source grapheme and the source phoneme or by a different transliteration model. The hybrid transliteration model was proposed to solve this problem. However, it still does not consider the dependence between the source grapheme and the source phoneme. Instead, it integrates Pr(ψG) and Pr(ψP) by linear interpolation, assigning a static weight to each probability.

13 HWFST in Section 5.


Fig. 2. System architecture for correspondence-based transliteration model.

Our model, ψC, overcomes these problems. Although it requires pronunciation knowledge, it uses the source grapheme corresponding to each source phoneme and is therefore less susceptible to errors caused by error propagation. Because ψC can dynamically use source graphemes and source phonemes depending on context, it can produce transliterations more effectively.

3. CORRESPONDENCE-BASED MACHINE TRANSLITERATION MODEL

Our correspondence-based transliteration model (ψC) is composed of two component functions: φSP and φ(SP)T. The φSP function produces pronunciation (S → P), and the φ(SP)T function produces target graphemes (S × P → T). First, φSP produces pronunciation, and then φ(SP)T produces target graphemes corresponding to the source graphemes and phonemes produced by φSP. The goal of φSP is to produce the most probable sequence of source phonemes corresponding to the source graphemes. For example, φSP produces /B/, /AO/, /∼/14, /R/, and /D/ for each source grapheme (b, o, a, r, and d) in board (see "Results of φSP" on the right side of Figure 2). In this step, pronunciation is generated in two ways: through a pronunciation dictionary search and by pronunciation estimation. A pronunciation dictionary contains the correct pronunciation corresponding to English words. Therefore, whether an English word is registered in the dictionary is checked first; if it is not, pronunciation estimation is used. φ(SP)T produces target graphemes corresponding to the source graphemes and source phonemes. For example, φ(SP)T produces "b," "o," "∼," "∼," and "deu" using the results of φSP (b-/B/, o-/AO/, a-/∼/, r-/R/, and d-/D/) (see "Results of φ(SP)T" on

14 In this paper, /∼/ represents silence and '∼' represents a null target grapheme.


Table II. Feature Types Used for the Correspondence-Based Transliteration Model^a

Feature type   Description and Possible Values
fS,Stype       fS: Source graphemes in S (26 letters in the English alphabet)
               fStype: Source grapheme types: Consonant (C) and Vowel (V)
fP,Ptype       fP: Source phonemes in P (/AA/, /AE/, and so on)
               fPtype: Source phoneme types: Consonant (C), Vowel (V), Semi-vowel (SV), and silence (/∼/)
fT             Target graphemes in T

^a S is a set of source graphemes (e.g., English letters), P is a set of source phonemes defined in ARPABET, and T is a set of target graphemes. Note that fS,Stype indicates both fS and fStype, and fP,Ptype indicates both fP and fPtype.

the right side of Figure 2). Finally, the target language transliteration, such as the Korean transliteration "bodeu" for board, is obtained by concatenating the sequence of target graphemes produced by φ(SP)T.
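Putting the two component functions together, the overall flow of Figure 2 can be sketched as follows. This is a schematic outline, not the authors' implementation: phi_sp and phi_spt stand for trained models of the component functions, and pron_dict for the pronunciation dictionary of Section 3.1.

```python
# Schematic sketch of the correspondence-based pipeline in Figure 2.
# phi_sp.predict and phi_spt.predict are placeholders for the trained
# component functions; pron_dict maps a word to grapheme-aligned phonemes.

def transliterate(word, pron_dict, phi_sp, phi_spt):
    graphemes = list(word)                           # s_1 ... s_n, e.g. b o a r d
    # Step 1 (phi_SP): dictionary lookup first, pronunciation estimation otherwise.
    if word in pron_dict:
        phonemes = pron_dict[word]                   # e.g. /B/ /AO/ /~/ /R/ /D/
    else:
        phonemes = [phi_sp.predict(graphemes, i) for i in range(len(graphemes))]
    # Step 2 (phi_(SP)T): one target grapheme per (grapheme, phoneme) pair.
    targets = [phi_spt.predict(graphemes, phonemes, i) for i in range(len(graphemes))]
    # Concatenate, dropping the null target grapheme '~' -> e.g. "bodeu" for board.
    return "".join(t for t in targets if t != "~")
```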

Machine-learning algorithms are used to train φSP and φ(SP)T. To train each component function, we need features that represent the training instances and data. Table II shows the five feature types, fS, fP, fStype, fPtype, and fT, that our model uses. Depending on the component function, different feature types are applied: φSP uses {fS, fStype, fP}, while φ(SP)T uses {fS, fP, fStype, fPtype, fT}.

3.1 Producing Pronunciation

The component function for producing pronunciation (φSP: S → P) finds source phonemes in a set P for each source grapheme, where P is a set of source phonemes defined in ARPABET and S is a set of source graphemes (e.g., English letters). The results of this step can be represented as a sequence of correspondences between the source graphemes and source phonemes. We denote the sequence as SP = {sp_1, sp_2, ..., sp_n; sp_i = (s_i, p_i = φSP(s_i))}, where s_i is the ith source grapheme in the source language word SW (= s_1, s_2, ..., s_n). φSP is composed of two steps. The first step involves a search in the pronunciation dictionary, which contains English words and their pronunciation. Here we use The CMU Pronouncing Dictionary,15 which contains 120,000 English words and their pronunciations. The second step involves pronunciation estimation. If an English word is not registered in the pronunciation dictionary, we estimate its pronunciation.

Let SW (= s_1, s_2, ..., s_n) be an English word and P_SW (= p_1, p_2, ..., p_n) be SW's pronunciation, where s_i represents the ith grapheme and p_i = φSP(s_i). Pronunciation estimation is the task of finding the most likely source phoneme in the set of all possible source phonemes that can be derived from the source grapheme s_i. Table III shows an example of pronunciation estimation for b in board. L1–L3 and R1–R3 represent the left and right source graphemes,

15 Available at http://www.speech.cs.cmu.edu/cgi-bin/cmudict


Table III. An Example of Pronunciation Estimation for b in Board

Feature type   L3   L2   L1   C0   R1   R2   R3   φSP(C0)
fS             $    $    $    b    o    a    r    → /B/
fStype         $    $    $    C    V    V    C
fP             $    $    $

Table IV. An Example of φ(SP)T for b in Board

Feature type   L3   L2   L1   C0    R1     R2    R3   φ(SP)T(C0)
fS             $    $    $    b     o      a     r    → "b"
fStype         $    $    $    C     V      V     C
fP             $    $    $    /B/   /AO/   /∼/   /R/
fPtype         $    $    $    C     V      /∼/   C
fT             $    $    $

respectively, and C0 represents the current source grapheme (or focus). φSP(C0) represents the estimated source phoneme corresponding to fS at C0, and $ represents the start of a word. The results can be interpreted as follows. The most relevant source phoneme of b, /B/, can be produced by means of the context, fS, fStype, and fP, at L1–L3, C0, and R1–R3. The other source phonemes for o, a, r, and d in board are produced in the same manner (o → /AO/, a → /∼/, r → /R/, and d → /D/). Finally, we obtain the pronunciation of board as /B AO R D/ by concatenating the sequence of source phonemes.
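The window encoding of Table III can be written down directly; the sketch below is our own formulation (with the vowel test reduced to the five letters a, e, i, o, u) and only builds the feature values that a classifier would receive for the C0 position.

```python
# Sketch of the k = 3 context window of Table III for pronunciation estimation.
# Feature names follow Table II; the classifier that consumes them is omitted.

VOWELS = set("aeiou")

def grapheme_type(g):
    return "V" if g in VOWELS else "C"

def phoneme_window_features(graphemes, prev_phonemes, i, k=3):
    """Features for predicting the phoneme of graphemes[i]: f_S and f_Stype over
    L3..R3, plus f_P for the already-predicted left context."""
    pad = ["$"] * k
    s = pad + graphemes + pad
    p = pad + prev_phonemes              # phonemes are known only left of C0
    window = s[i:i + 2 * k + 1]          # L3 L2 L1 C0 R1 R2 R3
    return {
        "f_S": window,
        "f_Stype": [g if g == "$" else grapheme_type(g) for g in window],
        "f_P": p[i:i + k],               # phonemes at L3 L2 L1
    }

# For b in "board": f_S = $ $ $ b o a r and f_Stype = $ $ $ C V V C, as in Table III.
print(phoneme_window_features(list("board"), [], 0))
```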

3.2 Producing Target Graphemes

The component function for producing target graphemes (φ(SP)T: S × P → T) finds the target grapheme in T for each sp_i that is a result of φSP. The result of this step, SPT, is represented by a sequence of sp_i and its corresponding target graphemes generated by φ(SP)T, e.g., SPT = {spt_1, spt_2, ..., spt_n; spt_i = (sp_i, t_i = φ(SP)T(sp_i))}.

Let SW (= s_1, s_2, ..., s_n) be a source language word, P_SW (= p_1, p_2, ..., p_n) be SW's pronunciation, and T_SW (= t_1, t_2, ..., t_n) be a target language word of SW, where s_i, φSP(s_i) = p_i, and φ(SP)T(sp_i) = t_i represent the ith source grapheme, the source phoneme corresponding to s_i, and the target grapheme corresponding to sp_i, respectively. φ(SP)T finds the most probable target grapheme in the set of all possible target graphemes that can be derived from sp_i. φ(SP)T produces target graphemes with the source grapheme (fS), source phoneme (fP), source grapheme type (fStype), source phoneme type (fPtype), and φ(SP)T's previous output (fT) in the context window. Table IV shows an example of φ(SP)T for b in board. φ(SP)T produces the most probable sequence of target graphemes (e.g., Korean or Japanese), such as φ(SP)T(sp_1) = "b," φ(SP)T(sp_2) = "o," φ(SP)T(sp_3) = "∼," φ(SP)T(sp_4) = "∼," and φ(SP)T(sp_5) = "deu" for board. Finally, the target language transliteration of board, "bodeu," is acquired by concatenating the sequence of produced target graphemes.

4. MACHINE-LEARNING ALGORITHMS FOR EACH COMPONENT FUNCTION

We can model the component functions using three machine-learning algorithms (maximum entropy model, decision tree, and memory-based learning).


Because φSP and φ(SP)T share a similar framework, we limit our focus to φ(SP)T in this section.

4.1 Maximum Entropy Model

The maximum entropy model (MEM) is a widely used probability model that can incorporate heterogeneous information effectively [Berger et al. 1996; Miyao and Tsuji 2002]. In MEM, an event, ev, is usually composed of a target event (te) and a history event (he); say ev = <te, he>. Event ev is represented by a bundle of feature functions, fe_i(ev), which represent the existence of a certain characteristic in event ev. A feature function is a binary-valued function. It is activated (fe_i(ev) = 1) when it meets its activating condition; otherwise it is deactivated (fe_i(ev) = 0) [Berger et al. 1996; Miyao and Tsuji 2002]. φ(SP)T based on the maximum entropy model can be represented as

Pr_{φ(SP)T}(T_SW | SW, P_SW) = Pr_{φ(SP)T}(t_1, ..., t_n | s_1, ..., s_n, p_1, ..., p_n)    (5)

Let the source language word SW be composed of n graphemes. SW, P_SW, and T_SW can then be represented by SW = s_1, ..., s_n, P_SW = p_1, ..., p_n, and T_SW = t_1, ..., t_n, respectively. P_SW and T_SW represent the pronunciation and target language word corresponding to SW, and p_i and t_i represent the source phoneme and target grapheme corresponding to s_i. With the assumption that φ(SP)T depends on context information within a window of size k, we simplify Eq. (5):

Pr_{φ(SP)T}(T_SW | SW, P_SW) ≈ ∏_i Pr_{φ(SP)T}(t_i | t_{i-k}, ..., t_{i-1}, p_{i-k}, ..., p_{i+k}, s_{i-k}, ..., s_{i+k})    (6)

Because t_1, ..., t_n, s_1, ..., s_n, and p_1, ..., p_n can be represented by fT, fS,Stype, and fP,Ptype, respectively, we can rewrite Eq. (6) as

Pr_{φ(SP)T}(T_SW | SW, P_SW) ≈ ∏_i Pr_{φ(SP)T}(t_i | f_{T(i-k,i-1)}, f_{P,Ptype(i-k,i+k)}, f_{S,Stype(i-k,i+k)})    (7)

where i is the index of the current source grapheme and source phoneme to be transliterated, and f_{X(l,m)} represents the features of feature type fX located from position l to position m.

One important thing in designing a model based on the maximum entropy model is to identify feature functions that effectively support certain decisions of the model. Our basic philosophy of feature function design for each component function is that the context information collocated with the unit of interest is important. We thus designed the feature functions with collocated features in each feature type and in different feature types. The features used in each component function are listed as follows. All of the feature types listed below are used as activating conditions of feature functions for the maximum entropy model to estimate the most relevant output of each component function.


Table V. Feature Functions for φ(SP)T Derived from Table IV

fe_j   target (t_i)   f_{T(i-k,i-1)}   f_{S,Stype(i-k,i+k)}                f_{P,Ptype(i-k,i+k)}
fe_1   "b"            -                fS_i = b                            fP_i = /B/
fe_2   "b"            -                fS_{i-1} = $                        -
fe_3   "b"            fT_{i-1} = $     fS_{i+1} = o and fStype_{i+2} = V   fP_i = /B/
fe_4   "b"            -                -                                   fP_{i+1} = /AO/
fe_5   "b"            fT_{i-2} = $     fS_{i+3} = r                        fPtype_i = C

• Feature type and features used for φ(SP)T (k = 3)
• All possible features in fS,Stype_{(i-k,i+k)}, fP,Ptype_{(i-k,i+k)}, and fT_{(i-k,i-1)} (e.g., fS_{i-1}, fP_{i-1}, and fT_{i-1})
• All possible feature combinations between features of the same feature type (e.g., {fS_{i-2}, fS_{i-1}, fS_{i+1}}, {fP_{i-2}, fP_i, fP_{i+2}}, and {fT_{i-2}, fT_{i-1}})
• All possible feature combinations between features of different feature types (e.g., {fS_{i-1}, fP_{i-1}}, {fS_{i-1}, fT_{i-2}}, and {fPtype_{i-2}, fP_{i-3}, fT_{i-2}})
  — between fS,Stype_{(i-k,i+k)} and fP,Ptype_{(i-k,i+k)}
  — between fS,Stype_{(i-k,i+k)} and fT_{(i-k,i-1)}
  — between fP,Ptype_{(i-k,i+k)} and fT_{(i-k,i-1)}

Generally, a conditional maximum entropy model is an exponential log-linear model that gives the conditional probability of an event ev = <te, he> as described in Eq. (8) [Berger et al. 1996; Miyao and Tsuji 2002]. Note that τ(he) is the set of targets observable with history he.

Pr(te | he) = (1 / Z_he) ∏_j α_j^{fe_j(te, he)}    (8)

Z_he = ∑_{te' ∈ τ(he)} ∏_j α_j^{fe_j(te', he)}

In φ(SP)T, the target event (te) is the target grapheme to be assigned, and the history event (he) can be represented as a tuple <f_{T(i-k,i-1)}, f_{S,Stype(i-k,i+k)}, f_{P,Ptype(i-k,i+k)}>. Therefore, we can rewrite Eq. (7) as

Pr_{φ(SP)T}(t_i | f_{T(i-k,i-1)}, f_{S,Stype(i-k,i+k)}, f_{P,Ptype(i-k,i+k)})    (9)
  = Pr_{φ(SP)T}(te(SPT) | he(SPT))
  = (1 / Z_{he(SPT)}) ∏_q α_q^{fe_q(te(SPT), he(SPT))}.

Table V shows examples of feature functions for φ(SP)T; Table IV was used to derive the feature functions. For example, fe_1 represents an event where fS_i is b, fP_i is /B/, and fT_i (the target) is "b." To model each component function based on MEM, Zhang's maximum entropy modeling tool is used [Zhang 2004].
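To make the form of these feature functions concrete, the sketch below encodes fe_1 from Table V as a binary indicator. The encoding is our own; the actual model is trained with Zhang's maximum entropy toolkit rather than with hand-written functions.

```python
# A small sketch of binary feature functions in the spirit of Table V. A history
# he is a dict of positioned feature values, and a target te is the candidate
# target grapheme; fe(te, he) returns 1 when the activating condition is met.

def make_feature_function(target, conditions):
    """conditions: dict like {('f_S', 0): 'b', ('f_P', 0): '/B/'} meaning
    f_S at the current position is 'b' and f_P at the current position is '/B/'."""
    def fe(te, he):
        return int(te == target and all(he.get(key) == v for key, v in conditions.items()))
    return fe

# fe_1 from Table V: fires when fS_i = b, fP_i = /B/, and the target is "b".
fe1 = make_feature_function("b", {("f_S", 0): "b", ("f_P", 0): "/B/"})

history = {("f_S", 0): "b", ("f_P", 0): "/B/", ("f_T", -1): "$"}
assert fe1("b", history) == 1
assert fe1("o", history) == 0
```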

4.2 Decision Tree

Decision-tree learning is one of the most widely used and best-known methods for inductive inference [Quinlan 1986; Mitchell 1997]. ID3 is a greedy algorithm that constructs decision trees in a top-down manner using information gain, which measures how well a given feature (or attribute) separates training examples based on their target class [Quinlan 1993; Manning and Schutze 1999]. We use C4.5 [Quinlan 1993], a well-known tool for decision-tree learning and an implementation of Quinlan's ID3 algorithm.

Fig. 3. Decision tree for φ(SP)T.

Training data for φ(SP)T is represented by features located in L3–L1, C0, and R1–R3, as described in Table IV. C4.5 tries to construct a decision tree by looking for regularities in the training data [Mitchell 1997]. Figure 3 shows part of a decision tree constructed for φ(SP)T in English-to-Korean transliteration. To simplify our examples, we use only fS and fP. (Note that all the feature types described in Table IV are used to construct decision trees.) The set of target classes for φ(SP)T is the set of target graphemes. In Figure 3, the decision tree is represented by decision nodes (circles) and leaf nodes (rectangles), which describe the target classes. Intuitively, the most effective feature for φ(SP)T may be located in C0 among L3–L1, C0, and R1–R3, because the correct outputs of φ(SP)T strongly depend on the source grapheme or phoneme in the C0 position. As we expected, the most effective feature in the decision trees was located in the C0 position, such as C0(fP). (Note that the first feature to be tested is the most effective feature in a decision tree.) The decision tree for φ(SP)T in Figure 3 produces the target grapheme (Korean grapheme) "o" for x(SPT) by retrieving the decision nodes from C0(fP) = /AO/ to R1(fP) = /∼/, represented by "∗."
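As an illustration of this setup (not the paper's C4.5 experiments), the sketch below trains a decision tree with the entropy criterion on two toy instances in the style of Table IV, using scikit-learn as a stand-in for C4.5:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-in for the C4.5 setup: each instance is a window of f_S and
# f_P values around one (grapheme, phoneme) pair, and the class is the target
# (Korean) grapheme. The two instances below are toy examples, not the paper's
# training data, and only a couple of window positions are shown.

train = [
    ({"S_C0": "b", "P_C0": "/B/", "S_R1": "o", "P_R1": "/AO/"}, "b"),
    ({"S_C0": "o", "P_C0": "/AO/", "S_R1": "a", "P_R1": "/~/"}, "o"),
]
vec = DictVectorizer()
X = vec.fit_transform([features for features, _ in train])
y = [target for _, target in train]

tree = DecisionTreeClassifier(criterion="entropy")  # information gain, as in ID3/C4.5
tree.fit(X, y)

test = vec.transform([{"S_C0": "o", "P_C0": "/AO/", "S_R1": "a", "P_R1": "/~/"}])
print(tree.predict(test))  # -> ['o']
```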

4.3 Memory-Based Learning

Memory-based learning (MBL) is an example-based learning method that is also called instance-based learning and case-based learning. It is based on a k-nearest neighborhood algorithm [Aha et al. 1991; Aha 1997; Cover and Hart 1967; Devijver and Kittler 1982]. In the training phase, MBL places all training data as examples in memory and clusters some examples according to the k-nearest neighborhood principle. It then produces output using similarity-based reasoning between test data and the examples in memory. Let the test data be x and a set of examples in memory be Y. The similarity between x and Y is estimated using a distance function, Δ(x, Y). MBL selects an example y_i or a cluster of examples that are most similar to x and then assigns the target class of that example to x's target class. We use a memory-based learning tool called TiMBL (Tilburg memory-based learner) version 5.0 [Daelemans et al. 2003].

Fig. 4. Memory-based learning for φ(SP)T.

Training data for MBL is represented in the same form as the training data for a decision tree. Note that the target classes output by MBL are source phonemes for φSP and target graphemes for φ(SP)T. As in Figure 3, we use only fS and fP to simplify our examples. Figure 4 shows examples of φ(SP)T based on MBL for English-to-Korean transliteration. All training data are represented with their features in the context of L3–L1, C0, and R1–R3 and their target classes for φ(SP)T. They are stored in memory during the training phase. The features are weighted during the training phase to differentiate their importance. As shown in Figure 4, φ(SP)T based on MBL outputs the target grapheme "o."
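The sketch below mimics this setup with a weighted-overlap nearest-neighbour lookup. It is a rough analogue of TiMBL, not the tool itself; the stored instances (apart from the Table IV example) and the feature weights are hypothetical.

```python
# Rough analogue of memory-based classification: weighted overlap distance and
# a 1-nearest-neighbour decision over instances stored in memory at training time.
# TiMBL derives feature weights (e.g., by information gain); here they are fixed.

def overlap_distance(x, y, weights):
    # Add a feature's weight whenever its values differ between the two instances.
    return sum(w for feature, w in weights.items() if x.get(feature) != y.get(feature))

def classify(test, memory, weights):
    """memory: list of (feature_dict, target_class) pairs."""
    nearest = min(memory, key=lambda example: overlap_distance(test, example[0], weights))
    return nearest[1]

memory = [
    ({"S_C0": "o", "P_C0": "/AO/", "P_R1": "/~/"}, "o"),    # from Table IV
    ({"S_C0": "o", "P_C0": "/OW/", "P_R1": "/~/"}, "o"),    # hypothetical instance
    ({"S_C0": "a", "P_C0": "/AE/", "P_R1": "/D/"}, "ae"),   # hypothetical instance
]
weights = {"S_C0": 1.0, "P_C0": 2.0, "P_R1": 0.5}           # hypothetical weights
print(classify({"S_C0": "o", "P_C0": "/AO/", "P_R1": "/R/"}, memory, weights))  # -> 'o'
```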

5. EXPERIMENTS

We evaluated our correspondence-based transliteration model (ψC) by conducting two types of English-to-Korean and English-to-Japanese transliteration experiments: a comparison test and a context window size test. In the comparison test, we compared the performance of our model with that of Kang and Choi's English grapheme to Korean grapheme conversion model based on decision trees (GDT) [Kang and Choi 2000; Kang 2001], Kang and Kim's English–Korean transliteration model based on a transliteration network (GTN1) [Kang and Kim 2000], Goto et al.'s English-to-Japanese transliteration model (GTN2) [Goto et al. 2003], Kang's phoneme-based English-to-Korean transliteration model based on decision trees (PDT) [Kang 2001], and a hybrid transliteration model in which ψG and ψP are modeled with weighted finite-state transducers (HWFST) [Bilac and Tanaka 2004]. Note that PDT relies


Table VI. Results of Comparison Test

Model    EKSet (%)   EJSet (%)
PDT      47.5        50.7
GDT      51.4        50.3
GTN1     55.1        53.2
GTN2     55.9        56.2
HWFST    58.3        62.5
CDT      62.0        66.8
CMEM     63.3        67.0
CMBL     66.9        72.2

only on a pronunciation dictionary for producing pronunciation. We tested three versions of our model: maximum entropy (CMEM), decision tree (CDT), and memory-based learning (CMBL). To enable us to compare their performance with that of PDT regardless of the performance of producing pronunciation (φSP), we applied the same method for producing pronunciation to PDT as used by our model. In the context window size test, we estimated the effect of the context window size on the performance of our model.

The English-to-Korean test set (EKSet) [Nam 1997] consisted of 7185 English–Korean pairs: 6185 were used for training and 1000 for testing. EKSet contained no transliteration variations.16 The English-to-Japanese test set (EJSet) consisted of 10,398 English–katakana pairs from EDICT [Breen 2003]: 1000 were used for testing and the rest for training. EJSet did contain transliteration variations, like (micro, "maikuro") and (micro, "mikuro"); the average number of Japanese transliterations for an English word was 1.15. Because our EKSet and EJSet contain no or only a few transliteration variations, our evaluation focused on how well our system produced the standard transliterations in the EKSet and EJSet. Transliteration variations will be discussed in Section 6. Our evaluation measure was word accuracy (WA), as given by Eq. (10), where "CTs" denotes correct transliterations and "OTs" denotes the output transliterations.

WA = number of CTs / number of OTs    (10)
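Eq. (10) can be read off directly; a minimal sketch, assuming one output transliteration per source word:

```python
# Word accuracy (Eq. 10): fraction of output transliterations that match the gold
# standard. The example words below are taken from earlier sections of the paper.

def word_accuracy(outputs, gold):
    """outputs, gold: dicts mapping each source word to a transliteration."""
    correct = sum(1 for word, translit in outputs.items() if gold.get(word) == translit)
    return correct / len(outputs)

print(word_accuracy({"data": "deiteo", "board": "boreudeu"},
                    {"data": "deiteo", "board": "bodeu"}))  # 0.5
```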

As shown in Table VI, all three versions of our model had better performance than the other models for both the English-to-Korean and English-to-Japanese transliterations. Table VII shows the information and context size used by each of the models. The key reason for the better performance of our model is that it can use correspondence while the others cannot.

The models that used a wide context window and previous outputs tended to perform better. For example, GTN2 had more accurate results than GDT (Table VI). Because machine transliteration is sensitive to context, the wider the context, the better the performance.

16 Because of the difference in phonetic systems between two languages, there is usually ambiguity in the transliteration process from one language to the other. This ambiguity makes it difficult for translators or users to always follow the transliteration guidelines. Therefore, they may partially observe the transliteration guidelines while creating a transliteration in their own style when necessary. We call these transliteration variations. Note that standard transliterations refer to transliterations observing the transliteration guidelines.


Table VII. Information and Context Size Used by Each Model^a

Model    S   P   C   POut   Context Size
GDT      Y   N   N   N      <-3, +3>
GTN1     Y   N   N   Y      Unbounded
GTN2     Y   N   N   Y      <-3, +3>
PDT      N   Y   N   N      <-3, +3>
HWFST    Y   Y   N   -      -
Ours     Y   Y   Y   Y      <-3, +3>

^a Source grapheme (S), source phoneme (P), correspondence (C), and previous output (POut).

Table VIII. Results of Context Window Size Test

Size   EKSet (%)   EJSet (%)
1      54.5        62.7
2      63.3        70.0
3      66.9        72.2
4      63.9        70.7
5      63.8        69.3

Note that the context window size of the other models was limited to three, because a size greater than three either degrades [Kang and Choi 2000; Kang 2001] or does not improve performance [Kang and Kim 2000]. Determining an optimal context window size is thus important for machine transliteration.

For our context window size test, we used CMBL, which showed the best performance among the three versions of our model (Table VI). We varied the window size from 1 to 5. As shown in Table VIII, the best performance was obtained when the size was 3 for both the English-to-Korean and English-to-Japanese transliterations. When the size was 1, there were many cases in which correct transliterations were not produced because of a lack of information. For example, the right three graphemes are needed to produce the correct target grapheme for t in -tion. When the context window size was over three, it was difficult to generalize the training data because of the increased variety of training data. For both these reasons, our model worked best when the context window size was three. As also shown in Table VIII, the context size should be at least two to avoid significant performance deterioration because of a lack of contextual information.

In summary, our model performed significantly better than the other models: about 15–41% better for the English-to-Korean transliteration and about 16–44% better for the English-to-Japanese transliteration. Our results show that a good transliteration model should simultaneously consider the source grapheme and the source phoneme along with their correspondence, a reasonable context size, and the previous output. Our transliteration model satisfies these three conditions, enabling it to perform better than the other models.

6. DISCUSSION

We have shown experimentally that our correspondence-based transliteration model (ψC) effectively produces transliterations. The main reason ψC performed the best is its ability to effectively handle both grapheme- and phoneme-based transliteration, depending on context. Because most of the other models to which we compared ψC are grapheme-based, they are better able to produce grapheme-based transliterations than ψC; ψC failed to produce some grapheme-based transliterations correctly that were produced correctly by some of the other models, including (accent, "aksenteu"), (gel, "gel"), (ionium, "ioniumu"), and (nickel, "nikkel"). The other models sometimes failed to produce correct phoneme-based transliterations; for example, only ψC produced the correct transliteration for (paradigm, "paeleodaim"), (firmware, "peomweeo"), (gyrocopter, "jairokoputa"), and (cyrix, "sairikkusu"). In short, each transliteration model has both strong and weak points. Thus, by using a combination of models that complement each other, we can achieve a high-performance machine transliteration system.

In our evaluation, transliteration variations were not considered to be correct transliterations. In other words, only the standard transliterations in the gold standard (EKSet and EJSet) were considered to be correct, and variations were counted as errors, even in cases where our transliteration model produced a valid transliteration variation rather than the standard transliteration in the gold standard.

To analyze these errors, we first need to identify the transliteration variations among the transliterations produced by our model. We assume that a transliteration variation should have the support of people, in general, because they are the ones who produce and consume transliteration variations. Therefore, we believe that we can identify the transliteration variations by examining their usage in web data, i.e., the number of web documents in which they appear. To obtain these "web frequencies," we used a phrasal search method in which a phrase composed of a machine-generated transliteration and its source word was used as a query for a web search engine. The retrieved documents contained the query as a phrase. This enabled us to investigate whether a machine-generated transliteration and its source word are generally used as "transliteration pairs"17 in target language texts. Example web documents retrieved by a phrasal search are shown in Figure 5. The rectangles drawn in the documents mark machine-generated transliterations, with their source words shown in parentheses. Such patterns indicate that web documents retrieved by a phrasal search are useful for identifying valid transliteration pairs. We considered produced transliterations that were not in the gold standard to be transliteration variations if they could be found in web documents. In other words, we accepted transliteration variations if their web frequency was above zero. Using this strategy, we found that about 120 of the 331 errors in the English-to-Korean transliteration and about 70 of the 278 errors in the English-to-Japanese transliteration were caused by transliteration variations.

Tables IX and X show example standard transliterations and transliteration variations for several English words, along with their web frequency (WF).

17 A transliteration pair can be defined as a word pair or a translation pair composed of a transliteration and its corresponding source language word.




Fig. 5. Web documents used to obtain web frequency in Tables IX and X.

Table IX. Korean Transliterations and Their Web Frequency (WF)

English Word   Standard            WF       Variation           WF
accent         "aksenteu"          604      "aeksenteu"         783
ambulance      "aembyulleonseu"    194      "aembulleonseu"     7
canter         "kaenteo"           46       "kanteo"            5
chinook        "chinukeu"          390      "chinuk"            106
community      "keomyuniti"        52,215   "keomyuneoti"       83
gel            "gel"               9,693    "jel"               5,251
outreach       "auteulichi"        18       "autlichi"          665

Table X. Japanese Transliterations and Their Web Frequency (WF)

English Word   Standard            WF       Variation           WF
criteria       "kuraiteria"        114      "kuriteria"         1,050
bomb           "bomu"              1,596    "bon"               2,436
condominium    "kondominiamu"      340      "kondomi"           33
hashish        "hashishi"          3        "hashisshu"         21
hysteria       "hisuteri"          118      "hisuteria"         777
cyclosporine   "saikurosuporin"    14       "shikurosuporin"    17
whisker        "uisuka"            12       "wisuka"            27

From these results, we identified three characteristics that indicate it is important to accommodate both transliteration variations and the standard transliterations in natural language processing:

- The WF (i.e., the number of web documents) shows that both transliteration variations and standard transliterations are frequently used in real-world texts.

- The WF of a transliteration variation was sometimes higher than that of the standard transliteration.

- The difference between a standard transliteration and a transliteration variation can usually be found in the target language vowels rather than in the target language consonants. This is because vowels result in more transliteration ambiguity than consonants.

Table XI. Characteristics of KTSET and NTCIR

Features                       KTSET   NTCIR
Number of documents            4,414   339,501
Number of queries              50      83
Average length of documents    150     338
Average length of queries      2.84    19.69

To evaluate the coverage of our correspondence-based transliteration model (i.e., how well it produces various transliterations, including standard transliterations and their variations) and to indirectly evaluate whether the produced transliterations are useful for IR (information retrieval), we conducted a "coverage test" on Korean and Japanese IR test collections. Since term frequency (tf) and document frequency (df) are important factors in IR for determining a term's weight, we used tf and df as two of the three evaluation measures of transliteration coverage (TC). We also used ut, which represents the unique transliteration type. These three TC measures (TCut, TCtf, and TCdf) are described in Eq. (11), where TRFound represents the set of transliterations found by our method, TRGold represents the set of transliterations in the gold standard, tri represents the ith transliteration in a set TR, tf(tri) is the term frequency of tri, and df(tri) is the document frequency of tri.

\[
TC_{ut} = \frac{|TR_{Found}|}{|TR_{Gold}|}, \qquad
TC_{tf} = \frac{\sum_{tr_j \in TR_{Found}} tf(tr_j)}{\sum_{tr_i \in TR_{Gold}} tf(tr_i)}, \qquad
TC_{df} = \frac{\sum_{tr_j \in TR_{Found}} df(tr_j)}{\sum_{tr_i \in TR_{Gold}} df(tr_i)}
\tag{11}
\]
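As a concrete reading of Eq. (11), the Python sketch below computes the three coverage measures. Here TRFound is taken to be the set of produced transliterations that also appear in the gold standard, and the tf and df tables are assumed to come from the IR test collection; the function and variable names are illustrative, not from the paper.

```python
from collections import Counter

def transliteration_coverage(tr_found: set[str], tr_gold: set[str],
                             tf: Counter, df: Counter) -> dict[str, float]:
    """Compute TCut, TCtf, and TCdf as in Eq. (11).
    tf and df map each gold-standard transliteration to its term frequency
    and document frequency in the test collection."""
    found = tr_found & tr_gold  # produced transliterations that match the gold standard
    tc_ut = len(found) / len(tr_gold)
    tc_tf = sum(tf[t] for t in found) / sum(tf[t] for t in tr_gold)
    tc_df = sum(df[t] for t in found) / sum(df[t] for t in tr_gold)
    return {"TCut": tc_ut, "TCtf": tc_tf, "TCdf": tc_df}
```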

We constructed the test sets used for the coverage test from Korean and Japanese IR test collections: KTCSet (Korean transliteration coverage test set) was constructed using KTSET 2.0 [Park et al. 1996], and JTCSet (Japanese transliteration coverage test set) was constructed using the NTCIR-1 ADHOC test collection [Kando et al. 1999]. Henceforth, we refer to KTSET 2.0 as "KTSET" and the NTCIR-1 ADHOC test collection as "NTCIR." The characteristics of KTSET and NTCIR are summarized in Table XI. There are four types of queries in NTCIR (title, description, narrative, and concept, representing the title of the query, a short description of the query, a long description of the query, and the concept keywords of the query, respectively). We used the concept query type in our experiment and for the "average length of queries" in Table XI, because the concept queries usually contain some bilingual information (English concept keywords and corresponding Japanese concept keywords).

As shown in Table XI, NTCIR contains many more documents than KTSET. Moreover, the documents and queries in NTCIR are also longer than those in KTSET. Given these differences, we used different strategies in constructing KTCSet and JTCSet. We constructed KTCSet with the Korean transliterations extracted from all documents and queries in KTSET. We then manually added an English word corresponding to each Korean transliteration. We ignored one-syllable Korean transliterations because they are usually confused with pure Korean words. KTCSet contains 1,283 transliterations corresponding to 1,149 English words, so one English word corresponds to 1.12 transliterations on average. Because NTCIR contains many more documents than KTSET, it would be hard to construct JTCSet from all documents. We therefore used those documents in NTCIR containing transliterations originating from the same English words as the transliterations in the queries. First, we manually added English words corresponding to the Japanese transliterations to the NTCIR queries. We then found Japanese transliterations in documents that originated from the same English words as those in the queries. JTCSet contains 553 transliterations corresponding to 311 English words, so one English word corresponds to 1.78 transliterations on average. One of the main reasons there are many more transliteration variations in JTCSet than in KTCSet is the long vowel in Japanese transliterations (Japanese katakana). The long vowel gives rise to various transliteration variations because it can be inserted after every vowel. For example, NTCIR contains several transliterations corresponding to the English word data, including "dēta," "deta," and "dētā," where "ē" and "ā" represent the long vowels "e" and "a," respectively (a small sketch following Table XII illustrates this combinatorial effect).

Table XII. Results of Coverage Test

                         KTCSet (%)             JTCSet (%)
                     TCut   TCtf   TCdf     TCut   TCtf   TCdf
CMBL                 62.1   77.9   78.4     52.1   84.5   84.3
CMBL + CMEM          68.2   87.8   87.8     63.0   96.1   95.7
CMBL + CMEM + CDT    69.2   88.6   88.6     65.0   96.2   95.7
Upper bound          85.0   98.0   97.6     79.7   99.5   99.2
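Returning to the long-vowel example above, the following sketch shows, on a romanized form, how optionally lengthening each vowel multiplies the number of candidate variants; the ":" marker stands in for the katakana long-vowel mark, and this is purely an illustration of the combinatorics, not a component of the paper's model.

```python
from itertools import product

VOWELS = set("aeiou")

def long_vowel_variants(romanized: str) -> set[str]:
    """Enumerate romanized transliteration variants obtained by optionally
    lengthening each vowel, mirroring how the Japanese long-vowel mark
    can follow any vowel."""
    options = [(ch, ch + ":") if ch in VOWELS else (ch,) for ch in romanized]
    return {"".join(choice) for choice in product(*options)}

# e.g. long_vowel_variants("deta") -> {"deta", "de:ta", "deta:", "de:ta:"}
```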

For the coverage test, we produced transliterations for the English words in KTCSet and JTCSet using CDT, CMEM, and CMBL. We then filtered out the produced transliterations for which the WF was zero (only transliterations appearing in web documents were considered). The results are shown in Table XII. We calculated the upper bound by applying the WF filter to the gold standards of KTCSet and JTCSet; it is the TC obtained when TRFound is calculated by removing from TRGold the transliterations that do not appear in web documents. We used CMBL as the baseline because it had the best performance (Table VI). We then added the transliterations produced by CMEM and CDT in sequence. As shown in Table XII, combining the transliterations produced by the different versions of our model effectively increases TC. Because our approach focuses on producing the most probable transliteration, CMBL alone was less effective than CMBL + CMEM + CDT. In other words, more than one transliteration should be considered in order to achieve high TC, because KTCSet and JTCSet contain 1.12 and 1.78 transliterations per English word, respectively. Although some transliterations in the gold standard were not produced by CMBL + CMEM + CDT, those transliterations had a low term frequency and a low document frequency; therefore, TCtf and TCdf were relatively high compared to TCut in our results. For the same reason, the gap between CMBL + CMEM + CDT and the upper bound was smaller in terms of TCtf and TCdf than in terms of TCut. Overall, CMBL + CMEM + CDT covered about 52 to 69% of transliterations in terms of TCut, 78 to 96% in terms of TCtf, and 78 to 96% in terms of TCdf; although TCut was relatively low, TCtf and TCdf were high.




Our transliteration model should therefore be useful for IR, because its TCtf and TCdf were high, meaning that it produced transliterations that frequently appear in texts.
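The combination strategy just described (baseline output first, then candidates from the other model variants, with the WF filter applied throughout) can be sketched as follows; the `models` and `web_frequency` arguments are illustrative stand-ins, not interfaces from the paper.

```python
from typing import Callable, Iterable

def combined_candidates(source_word: str,
                        models: Iterable[Callable[[str], list[str]]],
                        web_frequency: Callable[[str, str], int]) -> list[str]:
    """Merge candidate transliterations from several model variants,
    e.g. CMBL first, then CMEM, then CDT, preserving the order in which
    candidates are added and dropping duplicates and any candidate whose
    web frequency is zero."""
    merged: list[str] = []
    for produce in models:
        for candidate in produce(source_word):
            if candidate in merged:
                continue
            if web_frequency(candidate, source_word) > 0:  # WF filter
                merged.append(candidate)
    return merged
```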

While most machine transliteration research has focused on producing standard transliterations, transliteration variations are also important for some applications, especially for information retrieval systems [Kang and Choi 2000; Lee and Choi 1998; Fujii and Tetsuya 2001]. The ability to recognize terms that are semantically equivalent but orthographically different, such as transliteration equivalents,18 is important for improving the performance of information retrieval systems. As Tables IX and X show, we may miss many chances to retrieve relevant documents if we do not consider transliteration equivalents, because many documents contain only the standard transliteration or only a transliteration variation. Consequently, there have been studies on transliteration pair (or transliteration equivalent) acquisition from corpora [Brill et al. 2001; Tsujii 2002; Collier et al. 1997; Kuo and Yang 2004]. While machine transliteration aims to automatically produce transliterations for given source words, transliteration pair acquisition aims to find phonetic cognates in two languages in bilingual corpora. Transliteration pair acquisition methods usually have three steps: candidate extraction, phonetic conversion, and comparison. In candidate extraction, an algorithm tries to find transliteration pair candidates in bilingual corpora. The source (or target) language words in the candidates are then phonetically converted into target (or source) language ones, as in machine transliteration. For example, Brill et al. [2001], Tsujii [2002], and Collier et al. [1997] transform Japanese katakana words into English words with their Japanese-to-English back-transliteration systems. Finally, the algorithm finds the relevant transliteration pairs by means of phonetic similarity, which is determined by comparing the phonetically converted word with its counterpart in the candidate pair. Therefore, machine transliteration can serve as one component of transliteration pair acquisition by providing a machine-generated transliterated form. In addition, the results of transliteration pair acquisition, pairs of source words and their transliterations, can be used as training data for machine transliteration. Thus, the two tasks should be used in a complementary fashion to address problems related to transliteration.
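A minimal sketch of this three-step acquisition pipeline is given below. The `transliterate` argument stands for any phonetic conversion component (for instance, a machine transliteration or back-transliteration system), and the similarity measure and threshold are illustrative substitutes for the phonetic comparison methods used in the cited work, not a reproduction of them.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable

def acquire_transliteration_pairs(candidates: Iterable[tuple[str, str]],
                                  transliterate: Callable[[str], str],
                                  threshold: float = 0.85) -> list[tuple[str, str]]:
    """Step 1 (candidate extraction) is assumed to have produced
    (source_word, target_word) candidates from a bilingual corpus.
    Step 2 (phonetic conversion): convert the source word into the target script.
    Step 3 (comparison): keep pairs whose converted form is sufficiently
    similar to the observed target word."""
    accepted: list[tuple[str, str]] = []
    for source_word, target_word in candidates:
        converted = transliterate(source_word)
        # Crude stand-in for phonetic similarity between the converted form
        # and the target-language word in the candidate pair.
        similarity = SequenceMatcher(None, converted, target_word).ratio()
        if similarity >= threshold:
            accepted.append((source_word, target_word))
    return accepted
```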

7. CONCLUSION

Unlike previous transliteration models, our correspondence-based machine transliteration model, ψC, uses the correspondence between source graphemes and source phonemes. This correspondence makes it possible for ψC to effectively produce both grapheme- and phoneme-based transliterations. Moreover, the correspondence helps it reduce transliteration ambiguity. Experiments showed that ψC is more effective than other transliteration models; its performance was better by about 15–41% in English-to-Korean transliteration and by about 16–44% in English-to-Japanese transliteration.

18 "Transliteration equivalent" refers to a set of transliteration pairs that originated from the same foreign word.




We plan to apply our transliteration model to English-to-Chinese transliteration. To demonstrate the usefulness of our model in NLP applications, we also plan to apply it to tasks such as automatic bilingual dictionary construction, information retrieval, and machine translation.

REFERENCES

AHA, D. W. 1997. Lazy learning: Special issue editorial. Artificial Intelligence Review 11, 7–10.
AHA, D. W., KIBLER, D., AND ALBERT, M. 1991. Instance-based learning algorithms. Machine Learning 6, 37–66.
AL-ONAIZAN, Y. AND KNIGHT, K. 2002. Translating named entities using monolingual and bilingual resources. In Proceedings of ACL 2002. 400–408.
BERGER, A. L., PIETRA, S. D., AND PIETRA, V. J. D. 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22, 1, 39–71.
BILAC, S. AND TANAKA, H. 2004. Improving back-transliteration by combining information sources. In Proceedings of IJCNLP 2004. 542–547.
BREEN, J. 2003. EDICT Japanese/English dictionary file. The Electronic Dictionary Research and Development Group, Monash University.
BRILL, E., KACMARCIK, G., AND BROCKETT, C. 2001. Automatically harvesting Katakana-English term pairs from search engine query logs. In Proceedings of the Natural Language Processing Pacific Rim Symposium 2001. 393–399.
COLLIER, N., KUMANO, A., AND HIRAKAWA, H. 1997. Acquisition of English-Japanese proper nouns from noisy-parallel newswire articles using Katakana matching. In Proceedings of the Natural Language Processing Pacific Rim Symposium 1997. 309–314.
COVER, T. M. AND HART, P. E. 1967. Nearest neighbor pattern classification. Institute of Electrical and Electronics Engineers Transactions on Information Theory 13, 1, 21–27.
DAELEMANS, W., ZAVREL, J., SLOOT, K. V. D., AND BOSCH, A. V. D. 2003. TiMBL: Tilburg Memory-Based Learner, version 4.3, reference guide.
DEVIJVER, P. A. AND KITTLER, J. 1982. Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ.
FORNEY, G. D. 1973. The Viterbi algorithm. In Proceedings of the IEEE. Vol. 61. 268–278.
FUJII, A. AND TETSUYA, I. 2001. Japanese/English cross-language information retrieval: Exploration of query translation and transliteration. Computers and the Humanities 35, 4, 389–420.
GOTO, I., KATO, N., URATANI, N., AND EHARA, T. 2003. Transliteration considering context information based on the maximum entropy method. In Proceedings of MT-Summit IX. 125–132.
KANDO, N., KURIYAMA, K., NOZUE, T., EGUCHI, K., KATO, H., AND HIDAKA, S. 1999. Overview of IR tasks. In Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition. 11–44.
KANG, B. J. 2001. A resolution of word mismatch problem caused by foreign word transliterations and English words in Korean information retrieval. Ph.D. thesis, Computer Science Dept., KAIST.
KANG, B. J. AND CHOI, K. S. 2000. Automatic transliteration and back-transliteration by decision tree learning. In Proceedings of the 2nd International Conference on Language Resources and Evaluation. 1135–1141.
KANG, I. H. AND KIM, G. C. 2000. English-to-Korean transliteration using multiple unbounded overlapping phoneme chunks. In Proceedings of the 18th International Conference on Computational Linguistics. 418–424.
KIM, J. J., LEE, J. S., AND CHOI, K. S. 1999. Pronunciation unit based automatic English-Korean transliteration model using neural network. In Proceedings of Korea Cognitive Science Association. 247–252.
KNIGHT, K. AND GRAEHL, J. 1997. Machine transliteration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. 128–135.
KOREA MINISTRY OF CULTURE & TOURISM. 1995. English to Korean standard conversion rules.
KUO, J.-S. AND YANG, Y.-K. 2004. Generating paired transliterated-cognates using multiple pronunciation characteristics from web corpora. In Proceedings of PACLIC 18. 275–282.
LEE, J. S. 1999. An English-Korean transliteration and retransliteration model for cross-lingual information retrieval. Ph.D. thesis, Computer Science Dept., KAIST.
LEE, J. S. AND CHOI, K. S. 1998. English to Korean statistical transliteration for information retrieval. Computer Processing of Oriental Languages 12, 1, 17–37.
LI, H., ZHANG, M., AND SU, J. 2004. A joint source-channel model for machine transliteration. In Proceedings of ACL 2004. 160–167.
LIN, W. H. AND CHEN, H. H. 2002. Backward machine transliteration by learning phonetic similarity. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL). 139–145.
MANNING, C. AND SCHUTZE, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
MITCHELL, T. M. 1997. Machine Learning. McGraw-Hill, New York.
MIYAO, Y. AND TSUJII, J. 2002. Maximum entropy estimation for feature forests. In Proceedings of the Human Language Technology Conference. 292–297.
NAM, Y. S. 1997. Foreign dictionary. Sung An Dang.
PARK, Y., CHOI, K.-S., KIM, J., AND KIM, Y. 1996. Development of the data collection Ver. 2.0 (KTSET 2.0) for Korean information retrieval studies. In Proceedings of the Special Interest Group on Artificial Intelligence of the Korea Information Science Society Spring Conference. 59–65.
QUINLAN, J. R. 1986. Induction of decision trees. Machine Learning 1, 81–106.
QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
SOONG, F. K. AND HUANG, E. F. 1991. A tree-trellis based fast search for finding the N best sentence hypotheses in continuous speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. 546–549.
TSUJII, K. 2002. Automatic extraction of translational Japanese-Katakana and English word pairs from bilingual corpora. International Journal of Computer Processing of Oriental Languages 15, 3, 261–279.
ZHANG, L. 2004. Maximum entropy modeling toolkit for Python and C++.

Received January 2006; revised April 2006; accepted May 2006
