

Extraction of transliteration pairs from parallel corpora using a statistical transliteration model

Chun-Jen Lee a,b,*, Jason S. Chang b, Jyh-Shing Roger Jang b

a Telecommunication Labs., Chunghwa Telecom Co., Ltd., 326 Chungli, Taiwan
b Department of Computer Science, National Tsing Hua University, 300 Hsinchu, Taiwan

Received 26 August 2003; received in revised form 8 June 2004; accepted 9 October 2004

Abstract

This paper describes a framework for modeling the machine transliteration problem. The parameters of the proposed model are automatically acquired through statistical learning from a bilingual proper name list. Unlike previous approaches, the model does not involve the use of either a pronunciation dictionary for converting source words into phonetic symbols or manually assigned phonetic similarity scores between source and target words. We also report how the model is applied to extract proper names and corresponding transliterations from parallel corpora. Experimental results show that the average rates of word and character precision are 93.8% and 97.8%, respectively.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Transliteration pair; Transliteration model; Parallel corpora; Statistical learning; Machine transliteration

0020-0255/$ - see front matter © 2004 Elsevier Inc. All rights reserved.

doi:10.1016/j.ins.2004.10.006

* Corresponding author.

E-mail addresses: [email protected] (C.-J. Lee), [email protected] (J.S. Chang),

[email protected] (J.-S.R. Jang).

Information Sciences 176 (2006) 67–90

www.elsevier.com/locate/ins


1. Introduction

Machine transliteration is very important for research in natural language processing (NLP), such as machine translation (MT), cross-language information retrieval (CLIR), question answering (QA), and bilingual lexicon construction. Proper nouns are not often found in existing bilingual dictionaries. Thus, it is difficult to handle transliteration only via simple dictionary lookup. Unfamiliar personal names, place names, and technical terms are especially difficult for human translators to transliterate correctly. In CLIR, the accuracy of transliteration greatly affects the retrieval performance.

Recently, much research has been done on machine transliteration for many language pairs, such as English/Arabic [24,1], English/Chinese [3,28,18], English/Japanese [12], and English/Korean [16,11,21]. Machine transliteration is classified into two types based on transliteration direction. Transliteration is the process that converts an original proper name in the source language into an approximate phonetic equivalent in the target language, whereas back-transliteration is the reverse process that converts the transliteration back into its original proper name. Lee and Choi [16] proposed an automatic learning procedure for English-to-Korean transliteration with limited evaluation.

Chen et al. [3] proposed a method for Chinese-to-English back-transliteration. In that heuristic approach, letters commonly shared between a Romanized Chinese word and an original English word are considered. The model is also enhanced with pronunciation rules. Knight and Graehl [12] explored a generative model for Japanese-to-English back-transliteration based on the source-channel framework. Stalls and Knight [24] extended that approach to Arabic-to-English back-transliteration. Wan and Verspoor [28] proposed a method for English-to-Chinese place name transliteration based on heuristic rules for relationships between English phonemes and the Chinese phonetic system. Kang and Choi [11] proposed a method based on decision trees to learn transliteration and back-transliteration rules between English and Korean. Lin and Chen [18] proposed a learning algorithm for Chinese-to-English back-transliteration using both a pronunciation dictionary and a speech synthesis system to generate the pronunciation of an English proper name. Oh and Choi [21] presented an English-to-Korean transliteration model using a pronunciation dictionary and contextual rules. Al-Onaizan and Knight [1] presented a spelling-based model for Arabic-to-English named entity transliteration. Most of the above approaches require a pronunciation dictionary for converting a source word into a sequence of pronunciations. However, words with unknown pronunciations may cause problems for transliteration. In addition, Chen et al. [3] and Oh and Choi [21] used a language-dependent penalty function to measure the similarity between a proper name and its corresponding transliteration. For learning the rules of transliteration and back-transliteration, Kang and Choi [11] used a language-dependent penalty function to perform phonetic alignment between pairs of English words and Korean transliterations. Wan and Verspoor [28] also used handcrafted heuristic mapping rules. This may lead to problems when porting to other language pairs.

In recent years, much research has focused on the study of automatic bilingual lexicon construction based on bilingual corpora. Proper names and corresponding transliterations can often be found in parallel corpora or topic-related bilingual comparable corpora. However, as noted by Tsuji [26], many previous methods [6,13,30,20,25] dealt with this problem based on the frequencies of words appearing in corpora, an approach which cannot be effectively applied to low-frequency words, such as transliterated words. In this paper, we present a framework for extracting English and Chinese transliterated word pairs based on the proposed statistical machine transliteration model to overcome the problem.

Compared with previous approaches, the method proposed in this paper requires no pronunciation dictionary for converting source words into phonetic symbols. Additionally, the parameters of the model are automatically learned from a bilingual proper name list using the Expectation Maximization (EM) algorithm [7]. No manually assigned phonetic similarity scores between bilingual name pairs are required. Moreover, the learning approach is unsupervised except for the use of seed constraints based on phonetic knowledge to accelerate the convergence of EM training. To capture grapheme-level string mapping more precisely, a mapping scheme based on transliteration units, instead of individual characters, is adopted in this study. Furthermore, how the model can be applied to the extraction of proper names and transliterations from parallel corpora is described.

The remainder of the paper is organized as follows: Section 2 gives an overview of machine transliteration and describes the proposed approach. Section 3 describes how the model is applied to the extraction of transliterated target words from parallel texts. The experimental setup and a quantitative assessment of performance are presented in Section 4. Concluding remarks are made in Section 5.

2. Statistical machine transliteration model

In this section, we first give an overview of machine transliteration and briefly illustrate our approach with an example. A formal description of the proposed transliteration model and a parameter estimation procedure based on the EM algorithm will be presented in Section 2.2 and Section 2.3, respectively.


2.1. Overview of the noisy channel model

One can consider machine transliteration as a noisy channel, as illustrated in Fig. 1. Briefly, the language model generates a source proper name E, and the transliteration model converts the proper name E into a target transliteration C.

Throughout the rest of the paper, we assume that E is written in English, while C is written in Chinese. Since Chinese and English are not in the same language family, there is no simple or direct way of performing mapping and comparison. One feasible solution is to adopt a Chinese Romanization system 1 to represent the pronunciation of each Chinese character. Among the many Romanization systems for Chinese, Wade-Giles and Hanyu Pinyin are the most widely used. The Wade-Giles system is commonly adopted in Taiwan today and has traditionally been popular among Western scholars. For this reason, we use the Wade-Giles system to Romanize Chinese characters. However, the proposed approach is equally applicable to other Romanization systems. The language model gives the prior probability P(E), which can be modeled using maximum likelihood estimation. As for the transliteration model P(C|E), we can approximate it by decomposing E and the Romanization of C into transliteration units (TUs). A TU is defined as a sequence of characters transliterated as a group. For English, a TU can be a monograph, a digraph, or a trigraph [29]. For Chinese, a TU can be a syllable initial, a syllable final, or a syllable [2] represented by Romanized characters.

To illustrate how the approach works, we take the TU alignment in Fig. 2 as an example. In this example, the English name ‘‘Smith’’ can be segmented into four TUs and aligned with the Romanized transliteration. Assuming that

Fig. 1. The noisy channel model in machine transliteration: the language model P(E) generates E, which the transliteration model P(C|E) maps to C.

1 Ref. sites: ‘‘http://www.romanization.com/index.html’’ and ‘‘http://www.edepot.com/taoroman.html’’.


‘‘Smith’’ is segmented into ‘‘S-m-i-th,’’ then a possible alignment with the Chinese transliteration ‘‘ (Shihmissu)’’ is depicted in Fig. 2. Intuitively, the probability of C given ‘‘Smith’’ can be approximated by the following equation based on TU decomposition:

$$P(C\mid\text{Smith}) \approx P(\text{Shihmissu}\mid\text{Smith}) \approx P(\text{Shih}\mid\text{S})\,P(\text{m}\mid\text{m})\,P(\text{i}\mid\text{i})\,P(\text{ssu}\mid\text{th}), \quad (1)$$

where ‘‘Shihmissu’’ is the Wade-Giles Romanization of the Chinese transliteration. A formal description of this approximation scheme will be given in the next subsection.
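The decomposition of Eq. (1) reduces the transliteration probability for a fixed TU alignment to a product of per-TU probabilities. A minimal Python sketch, where the probability values are illustrative only and not taken from the paper:

```python
from functools import reduce

# Hypothetical TU-mapping probabilities P(v|u); the values are illustrative.
p_tu = {("S", "Shih"): 0.3, ("m", "m"): 0.9, ("i", "i"): 0.8, ("th", "ssu"): 0.2}

def transliteration_prob(alignment):
    """Approximate P(C|E) for a fixed TU alignment as the product of
    the per-TU probabilities P(v_i|u_i), as in Eq. (1)."""
    return reduce(lambda acc, uv: acc * p_tu[uv], alignment, 1.0)

# P(Shihmissu|Smith) ~ P(Shih|S) * P(m|m) * P(i|i) * P(ssu|th)
prob = transliteration_prob([("S", "Shih"), ("m", "m"), ("i", "i"), ("th", "ssu")])
```

In practice these probabilities are not hand-assigned but estimated by the EM procedure of Section 2.3.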

2.2. Formal description: statistical transliteration model (STM)

A proper name E with l characters and a Romanized transliteration C with n characters are denoted by $e_1^l$ and $c_1^n$, respectively. Assume that the number of aligned TUs for (E, C) is N, and let $M = \{m_1, m_2, \ldots, m_N\}$ be an alignment candidate, where $m_j$ is the match type of the jth TU. The match type is defined as a pair of lengths of TUs in the two languages. For instance, in the case of (‘‘Smith,’’ ‘‘Shihmissu’’), N is 4, and M is {1-4, 1-1, 1-1, 2-3}. We write E and C as follows:

$$E = e_1^l = u_1^N = u_1, u_2, \ldots, u_N, \qquad C = c_1^n = v_1^N = v_1, v_2, \ldots, v_N, \quad (2)$$

where $u_i$ and $v_j$ are the ith TU of E and the jth TU of C, respectively. Then the probability of C given E, P(C|E), is formulated as follows:

$$P(C\mid E) = \sum_{M} P(C, M\mid E) = \sum_{M} P(C\mid M, E)\,P(M\mid E). \quad (3)$$

Theoretically, Eq. (3) is computed over all possible alignments M. To reduce the computational cost, one alternative approach is to modify the summation

Fig. 2. TU alignment between English and Chinese Romanized character sequences: S–Shih, m–m, i–i, th–ssu.


criterion through the best alignment. Therefore, the process of finding the most probable transliteration C* for a given E can be formulated as:

$$C^{*} = \arg\max_{C} \max_{M} P(C\mid M, E)\,P(M\mid E) = \arg\max_{C} \max_{M} P(C\mid M, E)\,P(M). \quad (4)$$

We can approximate P(C|M,E)P(M) as follows:

$$P(C\mid M, E)\,P(M) = P(v_1^N\mid u_1^N)\,P(m_1, m_2, \ldots, m_N) \approx \prod_{i=1}^{N} P(v_i\mid u_i)\,P(m_i). \quad (5)$$

Therefore, we have

$$C^{*} = \arg\max_{C} \max_{M} \prod_{i=1}^{N} P(v_i\mid u_i)\,P(m_i). \quad (6)$$

Then, the transliteration score function for C, given E, is formulated as

$$\mathrm{Score}_{\mathrm{STM}}(C) \equiv \max_{M} \log\!\left(\prod_{i=1}^{N} P(v_i\mid u_i)\,P(m_i)\right) = \max_{M} \sum_{i=1}^{N} \big(\log P(v_i\mid u_i) + \log P(m_i)\big). \quad (7)$$

Let S(i, j) be the maximum accumulated log score between the first i characters of E and the first j characters of C. Then Score_STM(C) = S(l, n), the maximum accumulated log score among all possible alignment paths of E with length l and of C with length n, can be computed using a dynamic programming strategy, as shown in the following:

Step 1 (Initialization)

$$S(0, 0) = 0, \quad (8)$$

Step 2 (Recursion)

$$S(i, j) = \max_{h,k}\,\big[S(i-h,\, j-k) + \log P(c_{j-k+1}^{j}\mid e_{i-h+1}^{i}) + \log P(h, k)\big], \quad 0 \le i \le l,\; 0 \le j \le n, \quad (9)$$

where log P(h,k) is defined as the log score of the match type ‘‘h-k’’, which corresponds to the last term in Eq. (7).

Step 3 (Termination)

$$\mathrm{Score}_{\mathrm{STM}}(C) = S(l, n). \quad (10)$$

In practice, the values of h and k are limited to a small set.
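The three steps above amount to a forward dynamic program over character positions. The following Python sketch assumes the model parameters arrive as dictionaries of log probabilities (log_p_tu for log P(v|u), log_p_match for log P(h,k), both hypothetical inputs here); TU pairs or match types absent from the dictionaries are treated as impossible:

```python
NEG_INF = float("-inf")

def score_stm(E, C, log_p_tu, log_p_match, max_h=3, max_k=5):
    """Compute Score_STM(C) = S(l, n) by the DP of Eqs. (8)-(10).

    log_p_tu:    dict (english_tu, romanized_tu) -> log P(v|u)
    log_p_match: dict (h, k) -> log P(h, k)
    max_h/max_k limit TU lengths, as the paper restricts h and k to a small set.
    """
    l, n = len(E), len(C)
    S = [[NEG_INF] * (n + 1) for _ in range(l + 1)]
    S[0][0] = 0.0                                  # Step 1: initialization
    for i in range(l + 1):
        for j in range(n + 1):
            if S[i][j] == NEG_INF:
                continue
            for h in range(max_h + 1):             # English TU length
                for k in range(max_k + 1):         # Chinese TU length
                    if (h == 0 and k == 0) or i + h > l or j + k > n:
                        continue
                    u, v = E[i:i + h], C[j:j + k]
                    if (u, v) not in log_p_tu or (h, k) not in log_p_match:
                        continue                   # impossible TU pair or match type
                    cand = S[i][j] + log_p_tu[(u, v)] + log_p_match[(h, k)]
                    if cand > S[i + h][j + k]:     # Step 2: recursion
                        S[i + h][j + k] = cand
    return S[l][n]                                 # Step 3: termination
```

With the toy parameters of the ‘‘Smith’’/‘‘Shihmissu’’ example, the best path recovers exactly the alignment {1-4, 1-1, 1-1, 2-3} discussed above; keeping back-pointers alongside S would recover the alignment itself, as needed in Section 3.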


2.3. Estimation of model parameters

In the following, we describe the iterative procedure for re-estimation of P(v_j|u_i) and P(m_i). We first define the following functions:

count(u_i, v_j) = the number of occurrences of the aligned pair (u_i, v_j) in the training set;
count(u_i) = the number of occurrences of u_i in the training set;
count(h, k) = the total number of occurrences of match type ‘‘h-k’’ in the training set.

Therefore, P(v_j|u_i) can be approximated as follows:

$$P(v_j\mid u_i) = \frac{\mathrm{count}(u_i, v_j)}{\mathrm{count}(u_i)}. \quad (11)$$

Similarly, P(h,k) can be estimated as follows:

$$P(h, k) = \frac{\mathrm{count}(h, k)}{\sum_i \sum_j \mathrm{count}(i, j)}. \quad (12)$$

Because count(u_i, v_j) is unknown initially, a reasonable approach to obtaining a rough estimate of the parameters of the translation model is to constrain the TU alignments of a word pair (E, C) within a position distance d [16]. Assume that $u_i = e_p^{p+h-1}$ and $v_j = c_q^{q+k-1}$, and that the aligned pair $(u_i, v_j)$ is constrained as follows:

$$\left| p - q\cdot\frac{l}{n} \right| < d, \quad \text{and} \quad \left| (p + h - 1) - (q + k - 1)\cdot\frac{l}{n} \right| < d, \quad (13)$$

where l and n are the lengths of the source word E and the target word C, respectively.
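Eq. (13) simply requires that both the start and the end of a candidate TU pair stay close once the target position is rescaled by l/n. A direct transcription in Python, with p and q taken as 1-based start positions (the threshold d is a tuning parameter; the paper does not fix a value here):

```python
def within_distance(p, q, h, k, l, n, d):
    """Position-distance constraint of Eq. (13) for a candidate TU pair
    u_i = E[p-1 : p+h-1], v_j = C[q-1 : q+k-1] (1-based p, q).

    l, n are the lengths of E and C; the target-side positions are scaled
    by l/n so that both words are compared on the source-length scale."""
    start_ok = abs(p - q * l / n) < d
    end_ok = abs((p + h - 1) - (q + k - 1) * l / n) < d
    return start_ok and end_ok
```

For (‘‘smith’’, ‘‘shihmissu’’) with l = 5, n = 9, the pair (s, shih) at p = q = 1 passes for d = 2, while pairing ‘‘s’’ with the final ‘‘u’’ does not, which is exactly the kind of noisy initial alignment the constraint is meant to exclude.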

To accelerate the convergence of the model training process and reduce the number of noisy TU aligned pairs (u_i, v_j), we restrict the combination of TU pairs to limited patterns in the beginning. Based on the assumption that the articulatory representations of phonemes are very similar across languages, we consider the phonemes of TUs to be independent of the underlying languages. In this approach, the similarities of phonemes of TUs are initially classified based on phonetic knowledge. Consonant TU pairs can be matched only when they have the same or similar phonemes. An English consonant can also be matched with a Chinese syllable beginning with the same or similar phonemes. An English semivowel TU can either be matched with a Chinese consonant or a vowel with the same or similar phonemes, or be matched with a Chinese syllable beginning with the same or similar phonemes.


In the initialization phase, P(h,k) is set to a uniform distribution, as shown in the following:

$$P(h, k) = \frac{1}{T}, \quad (14)$$

where T is the total number of match types allowed.

Based on the EM algorithm with Viterbi decoding [8], the iterative parameter estimation procedure is described as follows:

Step 1 (Initialization): Use Eq. (13) to generate likely TU alignment pairs. Calculate the initial model parameters, P(v_j|u_i) and P(h,k), using Eqs. (11) and (14), respectively.

Step 2 (Expectation): Based on the current model parameters, find the best Viterbi path for each E and C word pair in the training set.

Step 3 (Maximization): Based on all the TU alignment pairs obtained in Step 2, calculate the new model parameters using Eqs. (11) and (12). Replace the model parameters with the new model parameters. If a stopping criterion or a predefined number of iterations is reached, then stop the training procedure. Otherwise, go back to Step 2.
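One iteration of this Viterbi-EM loop can be sketched as below. The `viterbi_align` callback is an assumed stand-in for the Viterbi search of Section 2.2, returning the single best alignment of a pair as a list of (u, v) TUs; Steps 2 and 3 then reduce to counting over those best alignments:

```python
from collections import defaultdict

def em_step(pairs, viterbi_align, p_tu, p_match):
    """One Viterbi-EM iteration (Steps 2-3 of the procedure).

    pairs:         list of (E, C) training word pairs
    viterbi_align: assumed callback (E, C, p_tu, p_match) -> best alignment
                   as a list of (u, v) TU pairs
    Returns re-estimated parameters via Eqs. (11) and (12)."""
    tu_count = defaultdict(float)      # count(u_i, v_j)
    u_count = defaultdict(float)       # count(u_i)
    match_count = defaultdict(float)   # count(h, k)
    total = 0.0
    for E, C in pairs:                                     # Expectation
        for u, v in viterbi_align(E, C, p_tu, p_match):
            tu_count[(u, v)] += 1.0
            u_count[u] += 1.0
            match_count[(len(u), len(v))] += 1.0
            total += 1.0
    # Maximization: relative-frequency re-estimates
    new_p_tu = {uv: c / u_count[uv[0]] for uv, c in tu_count.items()}   # Eq. (11)
    new_p_match = {hk: c / total for hk, c in match_count.items()}      # Eq. (12)
    return new_p_tu, new_p_match
```

The outer training loop would call `em_step` repeatedly, starting from the seed-constrained parameters of Eqs. (13) and (14), until the stopping criterion is met.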

In the first iteration, TUs in English and Chinese are constrained based on phonetic knowledge. However, in the subsequent iterations, the whole training process is run in a totally unsupervised manner. Therefore, some new TUs are automatically discovered from the training data within the constraints of match types, as demonstrated in Section 4.

3. Extraction of transliterations from parallel corpora

The proposed transliteration model can be applied to the tasks of the extraction of bilingual name and transliteration pairs [14] and back-transliteration [15]. These tasks become more challenging for language pairs with different sound systems, such as Chinese/English, Japanese/English, and Arabic/English. For clarity of the paper, we focus on the extraction of English–Chinese name and transliteration pairs. However, the proposed framework is easily extendable to other language pairs.

3.1. Overall process

For the purpose of extracting name and transliteration pairs from parallel corpora, a sentence alignment procedure is applied first to align parallel texts at the sentence level. Then, we use a part-of-speech tagger to identify proper nouns in the source text. After that, the machine transliteration model is applied to isolate the transliteration in the target text. In general, the proposed transliteration model can be further augmented by linguistic processing, which will be described in more detail in the next subsection. The overall process is summarized in Fig. 3.

An excerpt from the magazine Scientific American [5] is given in the following:

Source language sentence: ‘‘Rudolf Jaenisch, a cloning expert at the Whitehead Institute for Biomedical Research at the Massachusetts Institute of Technology, concurred:’’

Target language sentence: ‘‘ :’’.

In the above excerpt, three English proper nouns, ‘‘Jaenisch,’’ ‘‘Whitehead,’’ and ‘‘Massachusetts,’’ were identified from the results of tagging. Utilizing Eq. (7) and Viterbi decoding, we found that the target word ‘‘ (huaihaite)’’ most likely corresponded to ‘‘Whitehead.’’ The other word pair (Jaenisch, ‘‘chiehnihsi’’) can also be extracted through a similar process. However, the third word pair (Massachusetts, ‘‘masheng’’) failed to be extracted by the proposed approach. The reason is that ‘‘ ’’ is an abbreviation of ‘‘ (masachusaichou),’’ which is a well established

Fig. 3. The overall process for extracting name and transliteration pairs from parallel corpora (pre-process: bilingual corpus and sentence alignment; main process: proper-name word extraction, transliteration model, and linguistic processing).


popular translated name of ‘‘Massachusetts.’’ Therefore, the proposed model is incapable of resolving the abbreviation mentioned above.

In order to retrieve the transliteration for a given proper noun, we need to keep track of the optimal TU decoding sequence associated with the given Chinese term for each word pair under the proposed method. It can be easily obtained by backtracking the best Viterbi path [19]. For the name-transliteration pair (Whitehead, ) mentioned above, the alignments of the TU matching pairs via the Viterbi path are illustrated in Figs. 4 and 5. In this example, the word ‘‘Whitehead’’ is decomposed into seven TUs, ‘‘Wh-i-t-e-h-ea-d,’’ and aligned with the Romanization ‘‘huaihaite’’ of the transliteration ‘‘ .’’

3.2. Linguistic processing

Some language-dependent knowledge can be integrated to further improve the performance, especially when we focus on specific language pairs.

3.2.1. Linguistic processing rule 1 (R1)

Some source words have both transliterations and translations, which are equally acceptable and can be used interchangeably. For example, the translation and the transliteration of the source word ‘‘England’’ are ‘‘ (Yingkou)’’ and ‘‘ (Yingkolan),’’ respectively, as shown in Fig. 6. Since the

Match type, TU pair: 0-1 (–, y); 0-1 (–, u); 0-1 (–, a); 0-1 (–, n); 2-2 (Wh, hu); 1-1 (i, a); 1-0 (t, –); 1-1 (e, i); 1-1 (h, h); 0-1 (–, a); 2-1 (ea, i); 1-2 (d, te); 0-1 (–, s); 0-1 (–, h); 0-1 (–, e); 0-1 (–, n); 0-1 (–, g).

Fig. 4. The alignments of the TU matching pairs via the Viterbi path.


proposed model is designed specifically for transliteration, such cases may cause problems. One way to overcome this limitation is to handle these cases by using a list of commonly used proper names and translations. A portion of the list is shown in Table 1.

3.2.2. Linguistic processing rule 2 (R2)

From error analysis of the aligned results of the training set, we have found that the proposed approach suffers from fluid TUs, such as ‘‘t,’’ ‘‘d,’’ ‘‘tt,’’ ‘‘dd,’’ ‘‘te,’’ and ‘‘de.’’ Sometimes they are omitted during transliteration, and sometimes they are transliterated as Chinese characters. For instance,

Fig. 5. The Viterbi alignment path.

Fig. 6. Examples of mixed usages of translation and transliteration.


‘‘d’’ is usually transliterated as ‘‘ ,’’ ‘‘ ,’’ or ‘‘ ,’’ corresponding to the Chinese TU of ‘‘te.’’ The English TU ‘‘d’’ is transliterated as ‘‘ ’’ in (Clifford, ), but left out in (Radford, ). This phenomenon causes problems; in the example shown in Fig. 7, the TU ‘‘d’’ in ‘‘David’’ is mistakenly matched up with ‘‘ .’’

Similarly, the English TU ‘‘s’’ or ‘‘se’’ is likely to misalign with ‘‘ ’’ (TU ‘‘shih’’), as in ‘‘ (Athens was one of the most powerful city-states of ancient Greece.).’’ See Fig. 8 for more details. However, the problem caused by fluid TUs can be partly overcome by adding more linguistic constraints in the post-processing phase. We calculate the Chinese character distributions of proper nouns from the corpus. A small set of Chinese characters is often used for transliteration. Therefore, it is possible to improve the performance by pruning extra tailing characters, which do not belong to the transliterated character set, from the transliteration candidates. For instance, the probability of ‘‘ ’’ being used in transliteration is very low. Therefore, the correct transliteration ‘‘ ’’ for the source word ‘‘David’’ can be extracted by removing the character ‘‘ .’’ We denote this strategy as Rule 2 (R2).

Table 1

A portion of the list for translation

Source word Target word Source word Target word

Afghanistan England

America France

Asia Greece

Canada India

China Spanish

Christ Yugoslavia

Fig. 7. Example of transliterated word extraction for ‘‘David.’’


3.3. Work flow of integrating linguistic and statistical information

Combining the linguistic processing and the transliteration model, we present the algorithm for transliteration extraction as follows:

Step 1: Look up the translation list as stated in R1. If the translation of a source word appears in both the entry of the translation list and the aligned target sentence (or paragraph), then pick the translation as the target word. Otherwise, go to Step 2.

Step 2: Pass the source word and its aligned target sentence (or paragraph) through the proposed model to extract the target word. Once this is done, go to Step 3.

Step 3: Apply linguistic processing rule R2 to remove superfluous tailing characters from the extracted transliterations.

After the above steps are completed, the performance of source-target word extraction is significantly improved.
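The three-step decision flow above can be sketched as follows. The scoring callback stands in for Score_STM of Eq. (7), the candidate enumeration is deliberately brute-force over substrings, and the input names (translation_list, transliteration_chars) are illustrative assumptions rather than the paper's implementation:

```python
def extract_target_word(source_word, target_sentence, translation_list,
                        transliterate_score, transliteration_chars):
    """Extraction combining R1, the STM, and R2 (a sketch).

    translation_list:      dict source word -> list of known translations (R1)
    transliterate_score:   callable scoring a candidate target substring (STM)
    transliteration_chars: set of characters commonly used in transliterations (R2)
    """
    # Step 1 (R1): prefer a listed translation that occurs in the target sentence.
    for translation in translation_list.get(source_word, []):
        if translation in target_sentence:
            return translation
    # Step 2 (STM): pick the target substring with the best transliteration score.
    candidates = [target_sentence[i:j]
                  for i in range(len(target_sentence))
                  for j in range(i + 1, len(target_sentence) + 1)]
    best = max(candidates, key=transliterate_score)
    # Step 3 (R2): prune tailing characters outside the transliteration set.
    while best and best[-1] not in transliteration_chars:
        best = best[:-1]
    return best
```

In a real system, Step 2 would reuse the dynamic program of Section 2.2 with back-pointers rather than scoring every substring independently; the brute-force enumeration here only keeps the sketch short.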

4. Experiments

In this section, we focus on the setup for the experiments and a performance evaluation of the proposed model applied to extract bilingual word pairs from parallel corpora.

4.1. Experimental setup

Several corpora were collected to estimate the parameters of the proposed models and to evaluate the performance of the proposed approach. The corpus

Fig. 8. Example of transliterated word extraction for ‘‘Athens.’’


T0 for training consisted of 2,430 pairs of English names and transliterations in Chinese. The training corpus, composed of a bilingual proper name list, was collected from ‘‘Handbook of English Name Knowledge,’’ edited by Huai [10]. The bilingual proper name list consists of first names, last names, and nicknames. For example, (Adolf, ‘‘Ataofu’’) and (Adelaide, ‘‘Atelaite’’) are first names, (Abbey, ‘‘Api’’) and (Adela, ‘‘Atela’’) are last names, and (Archie, ‘‘Aerhchi’’) and (Allie, ‘‘Ali’’) are nicknames, for males and females, respectively. Some first names are also used as last names. For instance, ‘‘Abel’’ can be either a first name or a last name. Table 2 shows some samples of the training corpus.

In the experiment, three sets of parallel-aligned texts [4], P1, P2, and P3, were prepared to evaluate the performance of the proposed methods. P1 consisted of 500 bilingual examples from the English–Chinese version of the Longman Dictionary of Contemporary English (LDOCE) [22]. P2 consisted of 300 aligned sentences from Scientific American, USA and Taiwan Editions. 2 P3 consisted of 300 aligned sentences from the Sinorama Corpus [23].

In the experiment, we dealt with personal and place names as well as their transliterations from the parallel corpora. The performance of transliteration extraction was evaluated based on the precision rates of transliteration words or characters. For simplicity, we considered each proper name in the source sentence in turn and determined its corresponding transliteration independently. Table 3 shows some examples from the testing set P1.

4.2. TUs for English and Chinese

The proposed model is based on TUs, which are more linguistically motivated than individual characters. Table 4 lists some of the most frequently occurring English TUs of length 1 to 3. Table 5 lists some of the most frequently occurring Chinese TUs. Table 6 shows some English–Chinese TU-mapping probabilities automatically estimated from all of the training data. The automatic learning process resulted in mostly regular monographs and digraphs found in pronunciation dictionaries, such as the Longman Pronunciation Dictionary (LPD) [29], including ‘‘rh’’ and ‘‘au.’’ However, it also learned additional TUs, such as ‘‘cq’’ in the personal names ‘‘Jacqueline’’ and ‘‘Jacquetta.’’ For example, after the second iteration of EM training, the most likely TU alignment sequence of the name pair (Jacqueline, Chiehkueilin ‘‘ ’’) is shown in Fig. 9.

It should be noted that an original word may have more than one transliteration. For instance, the English name ‘‘Beaufort’’ has several possible Chinese terms: ‘‘ ’’ (Paofu), ‘‘ ’’ (Paofo), ‘‘ ’’ (Pufu), ‘‘ ’’

2 Scientific American: ‘‘http://www.sciam.com’’ (USA edition) and ‘‘http://www.sciam.com.tw’’ (Taiwan edition).


(Paofote). The TUs of the word ‘‘Beaufort’’ were automatically and dynamically constructed and aligned with their corresponding transliteration TUs via the proposed model. The results are shown in Fig. 10.

Although Knight and Graehl [12] applied EM to automatically learn similarities of English–Japanese name pairs, English words and Japanese katakana words have to be converted into English sounds and Japanese sounds, respectively, via pronunciation dictionaries. Each English sound can map to one or more Japanese sounds. Compared with their study, one of the advantages of our approach is that we do not have to find the exact pronunciations via dictionary lookup or various grapheme-to-phoneme rules. To be more specific, a set of often-used Chinese characters for transliteration was selected from the collected corpora. Although many Chinese characters have more than one pronunciation, we found that almost all the characters used for transliteration have unique pronunciations. For those Chinese characters not used for transliteration, we choose the most frequently used pronunciation instead. Since we focus on transliterated words, we do not apply any Chinese pronunciation

Table 2

Some samples from the training set T0

Source word Target word Source word Target word

Abe Agatha

Abbey Acton

Abbot Arkwright

Archer Arabella

Adolf Alaric

Adolphus Alasdair

Adela Alastair

Adelaide Alethea

Arden Alonzo

Albert Ariadne

Alfonso Allegra

Alfie Alister

Alf Allie

Algy Arlene

Algernon Alan

Alma Aloys

Almeric Aloysius

Archie Amadeus

Alva Amabel

Alphonsus Amanda

Alphonso Amelia

Afra Arms

Avril Armstrong

Agnes Anastasia

Argus Arno


disambiguation algorithm to decide the exact pronunciation for each character. Thus, the Romanization of Chinese characters can be conducted directly via table lookup instead of using a pronunciation dictionary. Moreover, to accelerate the convergence of EM training and reduce noisy TU pairs at grapheme-level string mapping, we adopt a many-to-many mapping under the constraints

Table 3

Some bilingual examples from the testing set P1

He is a (second) Caesar in speech and leadership

Hamlet kills the king in Act 5 Scene 2

Can you adduce any reason at all for his strange behaviour, Holmes?

To see George, of all people, in the Ritz Hotel!

He has 2 caps for playing cricket for England

They appointed him to catch all the rats in Hamelin

Burlington Arcade is a famous shopping passage in London

The architecture of ancient Greece

Drink Rossignol, the aristocrat of table wines!

Cleopatra was bitten by an asp

I shall soon be leaving for an assignment in India

Our plane stopped at London (airport) on its way to New York

Schoenberg used atonality in the music of his middle period

This tune is usually attributed to J. S. Bach

Now that this painting has been authenticated as a Rembrandt, it's worth 10 times as much as I paid for it!

Byron awoke one morning to find himself famous


of a limited set of matched types based on phonetic knowledge. The maximum lengths of English and Chinese TUs are 3 and 5, respectively. Table 7 shows the match types and the English and Chinese TUs obtained in our experiments.
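As a rough illustration of this constrained many-to-many mapping, the sketch below enumerates candidate TU pairs at a given position of a name pair. The set of match types is modeled on Table 7, and the helper names are our own, not the paper's:

```python
# Sketch: enumerate candidate TU (transliteration-unit) pairs under
# the length constraints above: English TUs up to 3 letters, romanized
# Chinese TUs up to 5.  MATCH_TYPES mirrors Table 7; a 0 means the TU
# on that side maps to nothing.
MATCH_TYPES = {
    (0, 1), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4),
    (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5),
    (3, 2), (3, 3), (3, 4),
}

def candidate_tu_pairs(eng, rom, i, j):
    """Yield (English TU, romanized Chinese TU) pairs that may start
    at position i of `eng` and position j of `rom`."""
    for le, lc in MATCH_TYPES:
        if i + le <= len(eng) and j + lc <= len(rom):
            yield eng[i:i + le], rom[j:j + lc]

pairs = list(candidate_tu_pairs("jane", "chen", 0, 0))
```

Restricting the search to these match types keeps the number of candidate pairs per position small, which is what speeds up EM training.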

Table 4

Some high frequency English TUs

Length of English TU u High frequency TUs

1 a, e, i, n, l, s, o, r, d, t

2 er, ie, ar, ll, th, or, ch, tt, ck, ph

3 lle, sch

Table 5

Some high frequency Chinese TUs

Length of Chinese TU v High frequency TUs

1 i, a, l, n, o, t, e, p, m, u

2 te, ei, ai, ch, ko, hs, ng, ao, pu, fu

3 ssu, erh, ieh, chi, hsi

4 shih

5 chieh

Table 6

English–Chinese TU-mapping probabilities

u    v    P(v|u)      u    v    P(v|u)
h         0.272       ei   i    0.900
ae   ei   0.571       eu   yu   0.785
ae   i    0.214       ew   u    0.500
ai   a    0.500       ey   i    0.998
ai   e    0.250       f    f    0.586
ar   a    0.794       ff   f    0.733
au   o    0.772       ff   fu   0.266
aw   ao   0.545       g    ko   0.350
aw   o    0.454       g    ch   0.345
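Probabilities such as those in Table 6 are relative frequencies of TU pairs accumulated during EM training. Below is a minimal sketch of that M-step update; the counts are invented for illustration (chosen only to echo Table 6's order of magnitude) and are not the paper's data:

```python
from collections import defaultdict

# Sketch of the M-step update in EM training: P(v|u) is the relative
# frequency of the TU pair (u, v) among all pairs containing u.
# The counts below are illustrative, not from the paper.
counts = {
    ("ar", "a"): 794, ("ar", "o"): 206,
    ("f", "f"): 586, ("f", "fu"): 414,
}

def tu_probabilities(counts):
    totals = defaultdict(float)
    for (u, _), c in counts.items():
        totals[u] += c
    return {(u, v): c / totals[u] for (u, v), c in counts.items()}

probs = tu_probabilities(counts)
```

In full EM the counts themselves are expected counts computed from the current model, but the normalization step is the same.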

Fig. 9. TU alignment of the name pair (Jacqueline, Chiehkueilin ‘‘ ’’).


4.3. Evaluation metric

In the experiment, the performance of transliteration extraction was evaluated based on precision and recall rates at the word and character levels. Since we considered exactly one proper name in the source language and one transliteration in the target language at a time, the word recall rates were the same as the word precision rates:

Fig. 10. TU alignment of ‘‘Beaufort’’ and corresponding transliterations.

Table 7

Examples for each match type

Match type TU pair

0–1 ( ,h), ( ,i), ( ,n), ( ,u)

1–0 (h, ), (k, ), (d, ), (t, )

1–1 (r, l), (y, i), (m, m)

1–2 (j, ch), (f, fu), (d, te)

1–3 (s, ssu), (l, erh), (r, erh)

1–4 (s, shih)

2–0 (gh, )

2–1 (bb, p), (ey, i), (mm, m)

2–2 (dg, ch), (wh, hu), (ck, ko)

2–3 (le, erh), (re, erh), (ce, ssu)

2–4 (ce, shih)

2–5 (ge, chieh)

3–2 (sch, hs)

3–3 (lle, erh)

3–4 (sch, shih)
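Alignments like those in Figs. 9 and 10 can be recovered with a Viterbi-style dynamic program over match types such as those in Table 7. The sketch below is a simplified version under assumed match types and a toy probability table; it is not the paper's implementation:

```python
import math
from functools import lru_cache

# Sketch: Viterbi-style DP that segments an English name and its
# romanized Chinese transliteration into TU pairs maximizing the
# product of P(v|u).  MATCH_TYPES and the table P are toy assumptions.
MATCH_TYPES = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 0), (0, 1)]
P = {("j", "ch"): 0.8, ("a", "a"): 0.9, ("n", "n"): 0.9,
     ("e", ""): 0.5, ("ne", "n"): 0.4}
FLOOR = 1e-6  # smoothing probability for unseen TU pairs

def align(eng, rom):
    @lru_cache(maxsize=None)
    def best(i, j):
        # Best log-probability and TU path covering eng[i:] and rom[j:].
        if i == len(eng) and j == len(rom):
            return 0.0, ()
        score, path = -math.inf, ()
        for le, lc in MATCH_TYPES:
            if i + le <= len(eng) and j + lc <= len(rom):
                u, v = eng[i:i + le], rom[j:j + lc]
                s, rest = best(i + le, j + lc)
                cand = math.log(P.get((u, v), FLOOR)) + s
                if cand > score:
                    score, path = cand, ((u, v),) + rest
        return score, path
    return best(0, 0)[1]

alignment = align("jane", "chan")
```

Because every match type consumes at least one symbol, the recursion always terminates, and memoization over (i, j) keeps the search quadratic in the name lengths.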


Word precision (WP) = number of correctly extracted words / number of correct words.

The character-level recall and precision rates were defined as follows:

Character precision (CP) = number of correctly extracted characters / number of extracted characters,

Character recall (CR) = number of correctly extracted characters / number of correct characters.
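A minimal sketch of these rates follows; the pair lists and romanizations are hypothetical. Since exactly one name is considered per sentence pair, word recall coincides with word precision:

```python
from collections import Counter

# Sketch of the word- and character-level rates defined above.
# `extracted` holds system outputs and `gold` the reference
# transliterations, one per sentence pair (hypothetical data).
def word_precision(extracted, gold):
    correct = sum(1 for e, g in zip(extracted, gold) if e == g)
    return correct / len(gold)

def char_scores(extracted, gold):
    # Character overlap per pair, counted as a multiset intersection.
    correct = sum(sum((Counter(e) & Counter(g)).values())
                  for e, g in zip(extracted, gold))
    cp = correct / sum(len(e) for e in extracted)
    cr = correct / sum(len(g) for g in gold)
    return cp, cr

wp = word_precision(["kaisa", "tang"], ["kaisa", "tangmu"])
cp, cr = char_scores(["kaisa", "tang"], ["kaisa", "tangmu"])
```

Here the truncated output "tang" for "tangmu" lowers word precision and character recall but not character precision, which matches the behavior of these definitions.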

4.4. Experimental results and discussion

In the experiment of extracting transliterations from the data set P1, the STM model achieved, on average, a word precision rate of 86%, a character precision rate of 94.4%, and a character recall rate of 96.3%, as shown in Table 8. The performance could be further improved by means of simple statistical and linguistic processing, also shown in Table 8.

Table 9 shows some examples of Chinese transliterated words correctly extracted from P1 using the STM model. Although the STM model failed in some cases, most of these problems could be overcome through the addition of simple linguistic processing, as shown in Table 10. The error in the case of "Quirk" occurred because "Quirk" is much closer to " (kohoko)" than to " (KoKo)," based on phonetic similarity. In this case, the Chinese transliteration plainly cannot be correctly extracted. Similar problems, due to similarities at the grapheme level, occurred with the name pairs (Tom, "tangmu") and (John, "yuehhan"), as shown in Table 10. It is obvious that

Table 8

The experimental results of transliterated word extraction for P1

Test set Methods WP (%) CP (%) CR (%)

P1 (LDOCE) STM 86.0 94.4 96.3

STM + R1 88.6 95.4 97.7

STM + R2 90.8 97.4 95.9

STM + R1 + R2 94.2 98.3 97.7

P2 (Scientific American) STM 90.7 96.9 97.3

STM + R1 92.7 97.6 97.9

STM + R2 92.0 97.8 97.3

STM + R1 + R2 94.0 98.3 97.9

P3 (Sinorama) STM 86.7 94.2 96.1

STM + R1 89.0 94.9 96.8

STM + R2 87.7 95.8 94.9

STM + R1 + R2 93.0 96.5 96.7


a collection of commonly used or highly varying transliterations can be incrementally added to a lookup list to further improve the system performance.
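Such a lookup list can be implemented as a simple override checked ahead of the statistical extractor. The entries and the `stm_extract` callback below are illustrative assumptions, not part of the paper's system:

```python
# Sketch of an incremental lookup list for known irregular
# transliterations.  Entries and romanizations are illustrative.
LOOKUP = {
    "John": "yuehhan",
    "Tom": "tangmu",
}

def extract(name, sentence_romanization, stm_extract):
    """Prefer the lookup list; fall back to the STM extractor."""
    known = LOOKUP.get(name)
    if known and known in sentence_romanization:
        return known
    return stm_extract(name, sentence_romanization)

result = extract("John", "... yuehhan ...", lambda n, s: "chen")
```

New problem cases observed in production can simply be appended to the table, which is what makes the improvement incremental.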

We have also performed the same experiments on the data sets P2 and P3; the results are shown in Table 8. Although the performance of the STM approach on the data sets P1 and P3 is worse than on P2, the integrated scheme (STM + R1 + R2) clearly exhibits considerable robustness in extracting transliterated words from data sets in various domains. The results in Table 11 show that the average word and character precision rates over the test sets are around 93.8% and 97.8%, respectively.

Table 9

Some examples of Chinese transliterations, correctly extracted by the STM model, from P1

Bilingual sentence Chinese transliteration

He is a second Caesar in speech and leadership (kaisa)

In this case I'm acting for my friend Mr. Smith (shihmissu)

What's your alibi for being late this time, Jones? (chungssu)

Can you adduce any reason at all for his strange behaviour, Holmes? (fuerhmossu)

They appointed him to catch all the rats in Hamelin (hanmulin)

Drink Rossignol, the aristocrat of table wines! (lohsino)

Cleopatra was bitten by an asp (koliaopeitela)

Schoenberg used atonality in the music of his middle period (sangpoko)

If you have to change trains in London, you may be able to book through to your last station (luntun)

This tune is usually attributed to J.S. Bach (paha)

Byron awoke one morning to find himself famous (pailun)

You must have kissed the Blarney Stone to be able to talk like that! (pulani)

Quirk and Greenbaum collaborated on the new grammar (kolinpang)


Table 10

Some examples of possible Chinese transliterations extracted by the proposed approaches

Bilingual Sentence STM STM + R1 STM + R2 STM + R1 + R2

David, as you know, writes dictionaries (taweite) (tawei)

The Mediterranean Sea bathes the sunny shores of Italy (teitalihaian) (tichunghai)

You have borne yourself bravely in this battle, Lord Faulconbridge (fokenpolichueh) (fokenpoli)

Ancient Rome and Greece (chihsi) (hsila)

Jane is blossoming out into a beautiful girl (cheni) (chen)

Tom likes to boss younger children about (tang)

Quirk and Greenbaum collaborated on the new grammar (kohoko)

John seems to have made a real conquest of Janet. They're always together (chen)

"*" means that the Chinese transliterated word was not correctly extracted.


Compared with previous work, the proposed approach has three advantages. First, the proposed method learns the parameters of the model automatically from a list of bilingual name pairs, without using a pronunciation dictionary or grapheme-to-phoneme rules for the source words. Second, the proposed framework is easy to port to other language pairs, as long as some transliteration training data is available. Third, the proposed approach matches TUs in the two languages directly, which accelerates the matching process by skipping the grapheme-to-phoneme phase.

5. Conclusions

A new statistical modeling approach to the machine transliteration problem has been presented in this paper. The parameters of the model are automatically learned from a bilingual proper name list using the EM algorithm. Moreover, the model is applicable to the extraction of proper names and transliterations from parallel corpora. The proposed method can be easily extended to other language pairs with different sound systems, without the assistance of pronunciation dictionaries. Experimental results indicate that high precision and recall rates can be achieved by the proposed method.
