Extraction of transliteration pairs from parallel corpora using a statistical transliteration model
Chun-Jen Lee a,b,*, Jason S. Chang b, Jyh-Shing Roger Jang b
a Telecommunication Labs., Chunghwa Telecom Co., Ltd., 326 Chungli, Taiwan
b Department of Computer Science, National Tsing Hua University, 300 Hsinchu, Taiwan
Received 26 August 2003; received in revised form 8 June 2004; accepted 9 October 2004
Abstract
This paper describes a framework for modeling the machine transliteration problem.
The parameters of the proposed model are automatically acquired through statistical
learning from a bilingual proper name list. Unlike previous approaches, the model does
not involve the use of either a pronunciation dictionary for converting source words into
phonetic symbols or manually assigned phonetic similarity scores between source and
target words. We also report how the model is applied to extract proper names and cor-
responding transliterations from parallel corpora. Experimental results show that the
average rates of word and character precision are 93.8% and 97.8%, respectively.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Transliteration pair; Transliteration model; Parallel corpora; Statistical learning;
Machine transliteration
0020-0255/$ - see front matter © 2004 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2004.10.006
* Corresponding author.
E-mail addresses: [email protected] (C.-J. Lee), [email protected] (J.S. Chang),
[email protected] (J.-S.R. Jang).
Information Sciences 176 (2006) 67–90
www.elsevier.com/locate/ins
1. Introduction
Machine transliteration is very important for research in natural language
processing (NLP), such as machine translation (MT), cross-language informa-
tion retrieval (CLIR), question answering (QA), and bilingual lexicon con-
struction. Proper nouns are not often found in existing bilingual dictionaries.Thus, it is difficult to handle transliteration only via simple dictionary lookup.
Unfamiliar personal names, place names, and technical terms are especially dif-
ficult for human translators to transliterate correctly. In CLIR, the accuracy of
transliteration greatly affects the retrieval performance.
Recently, much research has been done on machine transliteration for many
language pairs, such as English/Arabic [24,1], English/Chinese [3,28,18], Eng-
lish/Japanese [12], and English/Korean [16,11,21]. Machine transliteration is
classified into two types based on transliteration direction. Transliteration is the process that converts an original proper name in the source language into
an approximate phonetic equivalent in the target language, whereas back-
transliteration is the reverse process that converts the transliteration back into
its original proper name. Lee and Choi [16] proposed an automatic learning
procedure for English-to-Korean transliteration with limited evaluation.
Chen et al. [3] proposed a method for Chinese-to-English back-transliteration. In that heuristic approach, letters commonly shared between a Romanized Chinese word and an original English word are considered. The model is also enhanced with pronunciation rules. Knight and Graehl [12] explored a generative model for Japanese-to-English back-transliteration based
on the source-channel framework. Stalls and Knight [24] extended that ap-
proach to Arabic-to-English back-transliteration. Wan and Verspoor [28] pro-
posed a method for English-to-Chinese place name transliteration based on
heuristic rules for relationships between English phonemes and the Chinese
phonetic system. Kang and Choi [11] proposed a method based on decision
trees to learn transliteration and back-transliteration rules between English and Korean. Lin and Chen [18] proposed a learning algorithm for Chinese-
to-English back-transliteration using both a pronunciation dictionary and a
speech synthesis system to generate the pronunciation of an English proper
name. Oh and Choi [21] presented an English-to-Korean transliteration model
using a pronunciation dictionary and contextual rules. Al-Onaizan and Knight
[1] presented a spelling-based model for Arabic-to-English named entity trans-
literation. Most of the above approaches require a pronunciation dictionary
for converting a source word into a sequence of pronunciations. However, words with unknown pronunciations may cause problems for transliteration.
In addition, Chen et al. [3] and Oh and Choi [21] used a language-dependent
penalty function to measure the similarity between a proper name and its cor-
responding transliteration. For learning the rules of transliteration and back-
transliteration, Kang and Choi [11] used a language-dependent penalty func-
tion to perform phonetic alignment between pairs of English words and Kor-
ean transliterations. Wan and Verspoor [28] also used handcrafted heuristic
mapping rules. This may lead to problems when porting to other language
pairs.
In recent years, much research has focused on the study of automatic bilingual lexicon construction based on bilingual corpora. Proper names and
corresponding transliterations can often be found in parallel corpora or
topic-related bilingual comparable corpora. However, as noted by Tsuji [26],
many previous methods [6,13,30,20,25] dealt with this problem based on the
frequencies of words appearing in corpora, an approach which cannot be effec-
tively applied to low-frequency words, such as transliterated words. In this
paper, we present a framework for extracting English and Chinese transliterated word pairs based on the proposed statistical machine transliteration model to overcome the problem.
Compared with previous approaches, the method proposed in this paper re-
quires no pronunciation dictionary for converting source words into phonetic
symbols. Additionally, the parameters of the model are automatically learned
from a bilingual proper name list using the Expectation Maximization (EM)
algorithm [7]. No manually assigned phonetic similarity scores between bilin-
gual name pairs are required. Moreover, the learning approach is unsupervised
except for the use of seed constraints based on phonetic knowledge to accelerate the convergence of EM training. To capture grapheme-level string mapping
more precisely, a mapping scheme based on transliteration units, instead of
individual characters, is adopted in this study. Furthermore, how the model
can be applied to the extraction of proper names and transliterations from par-
allel corpora is described.
The remainder of the paper is organized as follows: Section 2 gives an over-
view of machine transliteration and describes the proposed approach. Section 3
describes how the model is applied to the extraction of transliterated target words from parallel texts. The experimental setup and a quantitative assess-
ment of performance are presented in Section 4. Concluding remarks are made
in Section 5.
2. Statistical machine transliteration model
In this section, we first give an overview of machine transliteration and briefly illustrate our approach with an example. A formal description of the
proposed transliteration model and a parameter estimation procedure based
on the EM algorithm will be presented in Section 2.2 and Section 2.3,
respectively.
2.1. Overview of the noisy channel model
One can consider machine transliteration as a noisy channel, as illustrated in
Fig. 1. Briefly, the language model generates a source proper name E, and the
transliteration model converts the proper name E into a target transliteration
C.
Throughout the rest of the paper, we assume that E is written in English,
while C is written in Chinese. Since Chinese and English are not in the same language family, there is no simple or direct way of performing mapping and
comparison. One feasible solution is to adopt a Chinese Romanization sys-
tem 1 to represent the pronunciation of each Chinese character. Among the
many Romanization systems for Chinese, Wade-Giles and Hanyu Pinyin are
the most widely used. The Wade-Giles system is commonly adopted in Taiwan
today and has traditionally been popular among Western scholars. For this
reason, we use the Wade-Giles system to Romanize Chinese characters.
However, the proposed approach is equally applicable to other Romanization systems. The language model gives the prior probability P(E), which can be
modeled using maximum likelihood estimation. As for the transliteration
model P(C|E), we can approximate it by decomposing E and the Romanization
of C into transliteration units (TUs). A TU is defined as a sequence of characters
transliterated as a group. For English, a TU can be a monograph, a digraph, or
a trigraph [29]. For Chinese, a TU can be a syllable initial, a syllable final, or a
syllable [2] represented by Romanized characters.
To illustrate how the approach works, we take the TU alignment in Fig. 2 as an example. In this example, the English name ‘‘Smith’’ can be segmented into
four TUs and aligned with the Romanized transliteration. Assuming that
Fig. 1. The noisy channel model in machine transliteration.
1 Ref. sites: ‘‘http://www.romanization.com/index.html’’ and ‘‘http://www.edepot.com/taoroman.html’’.
‘‘Smith’’ is segmented into ‘‘S-m-i-th,’’ then a possible alignment with the Chinese transliteration ‘‘ (Shihmissu)’’ is depicted in Fig. 2. Intuitively, the probability P(C|Smith) can be approximated by the following equation based on TU decomposition:

P(C|Smith) ≅ P(Shihmissu|Smith) ≅ P(Shih|S) P(m|m) P(i|i) P(ssu|th),    (1)

where ‘‘Shihmissu’’ is the Wade-Giles Romanization of ‘‘ .’’ A formal description of this approximation scheme will be given in the next subsection.
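The TU-product approximation of Eq. (1) can be sketched in a few lines of Python. The probability values below are illustrative assumptions for the ‘‘Smith’’ example, not parameters of the trained model.

```python
# Toy TU translation probabilities P(v|u); the values are illustrative
# assumptions, not trained parameters.
tu_prob = {("S", "Shih"): 0.4, ("m", "m"): 0.9, ("i", "i"): 0.8, ("th", "ssu"): 0.3}

def transliteration_prob(alignment):
    """Approximate P(C|E) as a product over aligned TU pairs, as in Eq. (1)."""
    p = 1.0
    for u, v in alignment:
        p *= tu_prob.get((u, v), 0.0)  # unseen TU pairs get probability 0
    return p

# "Smith" segmented into S-m-i-th, aligned with Shih-m-i-ssu.
print(transliteration_prob([("S", "Shih"), ("m", "m"), ("i", "i"), ("th", "ssu")]))
```

In practice the product is computed in log space, as in the score function of Section 2.2, to avoid underflow for long names.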
2.2. Formal description: statistical transliteration model (STM)
A proper name E with l characters and a Romanized transliteration C with n characters are denoted by e_1^l and c_1^n, respectively. Assume that the number of aligned TUs for (E, C) is N, and let M = {m_1, m_2, ..., m_N} be an alignment candidate, where m_j is the match type of the jth TU. The match type is defined as a pair of lengths of TUs in the two languages. For instance, in the case of (‘‘Smith,’’ ‘‘Shihmissu’’), N is 4, and M is {1-4, 1-1, 1-1, 2-3}. We write E and C as follows:

E = e_1^l = u_1^N = u_1, u_2, ..., u_N,
C = c_1^n = v_1^N = v_1, v_2, ..., v_N,    (2)

where u_i and v_j are the ith TU of E and the jth TU of C, respectively. Then the probability of C given E, P(C|E), is formulated as follows:

P(C|E) = Σ_M P(C, M|E) = Σ_M P(C|M, E) P(M|E).    (3)
Theoretically, Eq. (3) is computed over all possible alignments M. To reduce
the computational cost, one alternative approach is to modify the summation
S m i th
Shih m i ssu
Fig. 2. TU alignment between English and Chinese Romanized character sequences.
criterion through the best alignment. Therefore, the process of finding the most probable transliteration C* for a given E can be formulated as:

C* = argmax_C max_M P(C|M, E) P(M|E)
   = argmax_C max_M P(C|M, E) P(M).    (4)

We can approximate P(C|M, E) P(M) as follows:

P(C|M, E) P(M) = P(v_1^N | u_1^N) P(m_1, m_2, ..., m_N) ≈ Π_{i=1}^{N} P(v_i|u_i) P(m_i).    (5)
Therefore, we have

C* = argmax_C max_M Π_{i=1}^{N} P(v_i|u_i) P(m_i).    (6)

Then, the transliteration score function for C, given E, is formulated as

Score_STM(C) ≡ max_M log ( Π_{i=1}^{N} P(v_i|u_i) P(m_i) )
            = max_M Σ_{i=1}^{N} ( log P(v_i|u_i) + log P(m_i) ).    (7)
Let S(i, j) be the maximum accumulated log score between the first i characters of E and the first j characters of C. Then ScoreSTM(C) = S(l, n), the maximum accumulated log score among all possible alignment paths of E with length l and of C with length n, can be computed using a dynamic programming strategy, as shown in the following:

Step 1 (Initialization)

S(0, 0) = 0.    (8)

Step 2 (Recursion)

S(i, j) = max_{h,k} [ S(i−h, j−k) + log P(c_{j−k+1}^{j} | e_{i−h+1}^{i}) + log P(h, k) ],
0 ≤ i ≤ l, 0 ≤ j ≤ n,    (9)

where log P(h, k) is defined as the log score of the match type ‘‘h-k’’, which corresponds to the last term in Eq. (7).

Step 3 (Termination)

Score_STM(C) = S(l, n).    (10)
In practice, the values of h and k are limited to a small set.
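The dynamic-programming recursion above can be sketched as follows, assuming the TU and match-type probabilities are supplied as callables; the helper names and toy probabilities in the usage are hypothetical, not the authors' trained model.

```python
import math

def score_stm(e, c, log_p_tu, log_p_match, max_h=3, max_k=5):
    """Compute Score_STM(C) = S(l, n) by dynamic programming (Eqs. (8)-(10)).

    log_p_tu(u, v)    -> log P(v|u) for a candidate TU pair (-inf if disallowed)
    log_p_match(h, k) -> log P(h, k) for match type h-k (-inf if disallowed)
    h and k, the TU lengths, are limited to small maxima as in the paper.
    """
    l, n = len(e), len(c)
    NEG = float("-inf")
    S = [[NEG] * (n + 1) for _ in range(l + 1)]
    S[0][0] = 0.0  # Step 1: initialization, Eq. (8)
    for i in range(l + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            # Step 2: recursion over match types h-k, Eq. (9)
            for h in range(0, min(i, max_h) + 1):
                for k in range(0, min(j, max_k) + 1):
                    if (h == 0 and k == 0) or S[i - h][j - k] == NEG:
                        continue
                    cand = (S[i - h][j - k]
                            + log_p_tu(e[i - h:i], c[j - k:j])
                            + log_p_match(h, k))
                    if cand > best:
                        best = cand
            S[i][j] = best
    return S[l][n]  # Step 3: termination, Eq. (10)
```

Keeping a backpointer at each (i, j) alongside the score recovers the best TU alignment by backtracking, which is how the Viterbi path of Section 3.1 would be obtained.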
2.3. Estimation of model parameters
In the following, we describe the iterative procedure for re-estimation of P(v_j|u_i) and P(m_i). We first define the following functions:

count(u_i, v_j) = the number of occurrences of the aligned pair u_i and v_j in the training set;
count(u_i) = the number of occurrences of u_i in the training set;
count(h, k) = the total number of occurrences of match type ‘‘h-k’’ in the training set.

Therefore, P(v_j|u_i) can be approximated as follows:

P(v_j|u_i) = count(u_i, v_j) / count(u_i).    (11)

Similarly, P(h, k) can be estimated as follows:

P(h, k) = count(h, k) / Σ_i Σ_j count(i, j).    (12)
Because count(u_i, v_j) is unknown initially, a reasonable approach to obtaining a rough estimate of the parameters of the translation model is to constrain the TU alignments of a word pair (E, C) within a position distance d [16]. Assume that u_i = e_p^{p+h−1} and v_j = c_q^{q+k−1}; then the aligned pair (u_i, v_j) is constrained as follows:

| p − q·l/n | < d, and
| (p + h − 1) − (q + k − 1)·l/n | < d,    (13)

where l and n are the lengths of the source word E and the target word C, respectively.
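As a sketch, the constraint of Eq. (13) reduces to two scaled-position checks; the function name below is ours, not the paper's.

```python
def within_distance(p, q, h, k, l, n, d):
    """Seed-alignment constraint of Eq. (13): keep the TU pair
    u_i = e_p..e_{p+h-1}, v_j = c_q..c_{q+k-1} only when its start and
    end positions, with the Chinese position scaled by l/n, differ by
    less than the distance threshold d."""
    return (abs(p - q * l / n) < d
            and abs((p + h - 1) - (q + k - 1) * l / n) < d)
```

For ("Smith", "Shihmissu") with l = 5 and n = 9, the pair ("S", "Shih") starting at p = 1, q = 1 satisfies the constraint for d = 3, while aligning "S" to the final Chinese character does not.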
To accelerate the convergence of the model training process and reduce the
number of noisy TU aligned pairs (ui,vj), we restrict the combination of TU
pairs to limited patterns in the beginning. Based on the assumption that the
articulatory representations of phonemes are very similar across languages,
we consider the phonemes of TUs to be independent of the underlying languages. In this approach, the similarities of TU phonemes are initially classified based on phonetic knowledge. Only consonant TU pairs with the same or similar phonemes can be matched. An English consonant can also
be matched with a Chinese syllable beginning with the same or similar
phonemes. An English semivowel TU can either be matched with a Chinese
consonant or with a vowel with the same or similar phonemes, or can be
matched with a Chinese syllable beginning with the same or similar phonemes.
In the initialization phase, P(h, k) is set to a uniform distribution, as shown in the following:

P(h, k) = 1/T,    (14)

where T is the total number of match types allowed.
Based on the EM algorithm with Viterbi decoding [8], the iterative parameter estimation procedure is described as follows:
Step 1 (Initialization): Use Eq. (13) to generate likely TU alignment pairs. Calculate the initial model parameters, P(v_j|u_i) and P(h, k), using Eqs. (11) and (14), respectively.
Step 2 (Expectation): Based on the current model parameters, find the best Viterbi path for each E and C word pair in the training set.
Step 3 (Maximization): Based on all the TU alignment pairs obtained in Step 2, calculate the new model parameters using Eqs. (11) and (12). Replace the model parameters with the new model parameters. If a stopping criterion or a predefined number of iterations is reached, then stop the training procedure. Otherwise, go back to Step 2.
In the first iteration, TUs in English and Chinese are constrained based on
phonetic knowledge. However, in the subsequent iterations, the whole training process is run in a totally unsupervised manner. Therefore, some new TUs are
automatically discovered from the training data within the constraints of
match types, as demonstrated in Section 4.
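The training procedure can be sketched as an EM loop with Viterbi decoding; here `viterbi_align` stands in for the best-path search of Section 2.2 and is assumed to be supplied by the caller.

```python
from collections import Counter

def reestimate(alignments):
    """Maximization step: recompute P(v|u) (Eq. (11)) and P(h, k)
    (Eq. (12)) from the TU pairs on the current Viterbi paths."""
    pair_count, u_count, match_count = Counter(), Counter(), Counter()
    for path in alignments:              # one TU-pair path per (E, C) pair
        for u, v in path:
            pair_count[(u, v)] += 1
            u_count[u] += 1
            match_count[(len(u), len(v))] += 1
    total = sum(match_count.values())
    p_tu = {uv: cnt / u_count[uv[0]] for uv, cnt in pair_count.items()}
    p_match = {hk: cnt / total for hk, cnt in match_count.items()}
    return p_tu, p_match

def em_train(pairs, viterbi_align, iterations=5):
    """EM with Viterbi decoding: alternate the expectation step (best
    alignment under the current parameters) and the maximization step."""
    params = None          # first pass uses the phonetically seeded alignments
    for _ in range(iterations):
        paths = [viterbi_align(e, c, params) for e, c in pairs]   # E-step
        params = reestimate(paths)                                # M-step
    return params
```

A fixed iteration count is used here for simplicity; the paper's stopping criterion (or convergence of the parameters) would replace it in practice.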
3. Extraction of transliterations from parallel corpora
The proposed transliteration model can be applied to the tasks of the extraction of bilingual name and transliteration pairs [14], and back-transliteration
[15]. These tasks become more challenging for language pairs with different
sound systems, such as Chinese/English, Japanese/English, and Arabic/English.
For clarity of the paper, we focus on the extraction of English–Chinese name
and transliteration pairs. However, the proposed framework is easily extend-
able to other language pairs.
3.1. Overall process
For the purpose of extracting name and transliteration pairs from parallel
corpora, a sentence alignment procedure is applied first to align parallel texts
at the sentence level. Then, we use a part-of-speech tagger to identify proper
nouns in the source text. After that, the machine transliteration model is
applied to isolate the transliteration in the target text. In general, the proposed
transliteration model can be further augmented by linguistic processing, which
will be described in more detail in the next subsection. The overall process is
summarized in Fig. 3.
An excerpt from the magazine Scientific American [5] is given in the
following:
Source language sentence: ‘‘Rudolf Jaenisch, a cloning expert at the Whitehead
Institute for Biomedical Research at the Massachusetts Institute of Technol-
ogy, concurred:’’
Target language sentence:
‘‘ :’’.
In the above excerpt, three English proper nouns, ‘‘Jaenisch,’’ ‘‘Whitehead,’’ and ‘‘Massachusetts,’’ were identified from the results of tagging. Utilizing Eq.
(7) and Viterbi decoding, we found that the target word ‘‘ (huaihaite)’’
most likely corresponded to ‘‘Whitehead.’’ The other word pair (Jaenisch,
‘‘chiehnihsi’’) can also be extracted through a similar process. How-
ever, the third word pair (Massachusetts, ‘‘masheng’’) failed to be ex-
tracted by the proposed approach. The reason is that ‘‘ ’’ is an
abbreviation of ‘‘ (masachusaichou)’’ which is a well established
Fig. 3. The overall process for extracting name and transliteration pairs from parallel corpora.
translated name of ‘‘Massachusetts.’’ Therefore, the proposed model is incapable of resolving abbreviations such as this one.
In order to retrieve the transliteration for a given proper noun, we need to
keep track of the optimal TU decoding sequence associated with the given Chi-
nese term for each word pair under the proposed method. It can be easily obtained by backtracking the best Viterbi path [19]. For the name-transliteration pair (Whitehead, ) mentioned above, the alignments of the TU matching pairs via the Viterbi path are illustrated in Figs. 4 and 5.
In this example, the word ‘‘Whitehead’’ is decomposed into seven TUs, ‘‘Wh-i-t-e-h-ea-d,’’ and aligned with the Romanization ‘‘huaihaite’’ of the
transliteration ‘‘ .’’
3.2. Linguistic processing
Some language-dependent knowledge can be integrated to further improve
the performance, especially when we focus on specific language pairs.
3.2.1. Linguistic processing rule 1 (R1)
Some source words have both transliterations and translations, which are
equally acceptable and can be used interchangeably. For example, the translation and the transliteration of the source word ‘‘England’’ are ‘‘ (Yingkou)’’ and ‘‘ (Yingkolan),’’ respectively, as shown in Fig. 6. Since the
Match type, TU pair:
0-1: ( , y); 0-1: ( , u); 0-1: ( , a); 0-1: ( , n); 2-2: (Wh, hu); 1-1: (i, a); 1-0: (t, ); 1-1: (e, i); 1-1: (h, h); 0-1: ( , a); 2-1: (ea, i); 1-2: (d, te); 0-1: ( , s); 0-1: ( , h); 0-1: ( , e); 0-1: ( , n); 0-1: ( , g)
Fig. 4. The alignments of the TU matching pairs via the Viterbi path.
proposed model is designed specifically for transliteration, such cases may
cause problems. One way to overcome this limitation is to handle these cases
by using a list of commonly used proper names and translations. A portion
of the list is shown in Table 1.
3.2.2. Linguistic processing rule 2 (R2)
From error analysis of the aligned results of the training set, we have found
that the proposed approach suffers from fluid TUs, such as ‘‘t,’’ ‘‘d,’’ ‘‘tt,’’ ‘‘dd,’’ ‘‘te,’’ and ‘‘de.’’ Sometimes they are omitted during transliteration,
and sometimes they are transliterated as Chinese characters. For instance,
Fig. 5. The Viterbi alignment path.
Fig. 6. Examples of mixed usages of translation and transliteration.
‘‘d’’ is usually transliterated as ‘‘ ,’’ ‘‘ ,’’ or ‘‘ ’’ corresponding to the Chi-
nese TU of ‘‘te.’’ The English TU ‘‘d’’ is transliterated as ‘‘ ’’ in (Clifford,
), but left out in (Radford, ). This phenomenon causes prob-
lems; in the example shown in Fig. 7, the TU ‘‘d’’ in ‘‘David’’ is mistakenly
matched up with ‘‘ .’’
Similarly, the English TU ‘‘s’’ or ‘‘se’’ is likely to misalign with ‘‘ ’’ (TU
‘‘shih’’) as in ‘‘ (Athens was one of the
most powerful city-states of ancient Greece.).’’ See Fig. 8 for more details. However, the problem caused by fluid TUs can be partly overcome by adding more linguistic constraints in the post-processing phase. We calculate the
Chinese character distributions of proper nouns from the corpus. A small set
of Chinese characters is often used for transliteration. Therefore, it is possible
to improve the performance by pruning extra tailing characters, which do not
belong to the transliterated character set, from the transliteration candidates.
For instance, the probability of ‘‘ ’’ being used in transliteration is very low. Therefore, the correct transliteration ‘‘ ’’ for the source word ‘‘David’’ can be extracted by removing the character ‘‘ .’’ We denote this strategy as Rule 2 (R2).
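Rule R2 amounts to stripping tailing characters that fall outside the transliterated-character set. A minimal sketch follows, with a hypothetical character set given in Romanized form; in the paper the set is estimated from the Chinese character distributions of proper nouns in the corpus.

```python
# Hypothetical transliterated-character set (Romanized); illustrative only.
TRANSLITERATION_CHARS = {"ta", "wei", "huai", "hai", "te"}

def prune_tail(candidate):
    """Rule R2: strip tailing characters that are not in the
    transliterated-character set from an extracted candidate,
    given as a list of Romanized Chinese characters."""
    while candidate and candidate[-1] not in TRANSLITERATION_CHARS:
        candidate = candidate[:-1]
    return candidate
```

For example, a candidate ["ta", "wei", "wang"] for ‘‘David’’ would be pruned to ["ta", "wei"], assuming ‘‘wang’’ is not in the transliterated-character set.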
Table 1
A portion of the list for translation
Source word Target word Source word Target word
Afghanistan England
America France
Asia Greece
Canada India
China Spanish
Christ Yugoslavia
Fig. 7. Example of transliterated word extraction for ‘‘David.’’
3.3. Work flow of integrating linguistic and statistical information
Combining the linguistic processing and transliteration model, we present
the algorithm for transliteration extraction as follows:
Step 1: Look up the translation list as stated in R1. If the translation of a source word appears in both the entry of the translation list and the aligned target sentence (or paragraph), then pick the translation as the target word. Otherwise, go to Step 2.
Step 2: Pass the source word and its aligned target sentence (or paragraph) through the proposed model to extract the target word. Once this is done, go to Step 3.
Step 3: Apply linguistic processing rule R2 to remove superfluous tailing characters from the extracted transliterations.
After the above steps are completed, the performance of source-target word extraction is significantly improved.
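The three steps can be sketched as follows; here `model_extract` stands in for the transliteration model of Section 2 and `prune_tail` for Rule R2, and the dictionary entries in the usage are illustrative.

```python
def extract_target(source_word, target_sentence, translation_list,
                   model_extract, prune_tail):
    """Three-step extraction combining R1, the transliteration model, and R2."""
    # Step 1 (R1): if a listed translation of the source word appears in
    # the aligned target sentence, take it directly.
    for translation in translation_list.get(source_word, []):
        if translation in target_sentence:
            return translation
    # Step 2: otherwise extract the best-scoring candidate with the model.
    candidate = model_extract(source_word, target_sentence)
    # Step 3 (R2): remove superfluous tailing characters.
    return prune_tail(candidate)
```

Note that R1 short-circuits the statistical model entirely, which is what handles mixed translation/transliteration cases such as ‘‘England.’’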
4. Experiments
In this section, we focus on the setup for the experiments and a performance
evaluation of the proposed model applied to extract bilingual word pairs from
parallel corpora.
4.1. Experimental setup
Several corpora were collected to estimate the parameters of the proposed
models and to evaluate the performance of the proposed approach. The corpus
Fig. 8. Example of transliterated word extraction for ‘‘Athens.’’
T0 for training consisted of 2,430 pairs of English names and transliterations in
Chinese. The training corpus, composed of a bilingual proper name list, was
collected from ‘‘Handbook of English Name Knowledge’’ edited by Huai
[10]. The bilingual proper name list consists of first names, last names, and
nicknames. For example, (Adolf, ‘‘Ataofu’’) and (Adelaide, ‘‘Atelaite’’) are first names, (Abbey, ‘‘Api’’) and (Adela, ‘‘Atela’’) are last names, and (Archie, ‘‘Aerhchi’’) and (Allie, ‘‘Ali’’) are nicknames, for males and females, respectively. Some first
names are also used as last names. For instance, ‘‘Abel’’ can be either a first
name or a last name. Table 2 shows some samples of the training corpus.
In the experiment, three sets of parallel-aligned texts [4], P1, P2, and P3,
were prepared to evaluate the performance of the proposed methods. P1 consisted of 500 bilingual examples from the English–Chinese version of the Longman Dictionary of Contemporary English (LDOCE) [22]. P2 consisted of 300 aligned sentences from Scientific American, USA and Taiwan Editions. 2 P3 consisted
of 300 aligned sentences from the Sinorama Corpus [23].
In the experiment, we dealt with personal and place names as well as their
transliterations from the parallel corpora. The performance of transliteration
extraction was evaluated based on the precision rates of transliteration words
or characters. For simplicity, we considered each proper name in the source
sentence in turn and determined its corresponding transliteration indepen-
dently. Table 3 shows some examples from the testing set P1.
4.2. TUs for English and Chinese
The proposed model is based on TUs, which are more linguistically moti-
vated than individual characters. Table 4 lists some of the most frequently
occurring English TUs of length 1 to 3. Table 5 lists some of the most fre-
quently occurring Chinese TUs. Table 6 shows some English–Chinese TU-
mapping probabilities automatically estimated from all of the training data. The automatic learning process resulted in mostly regular monographs and
digraphs found in pronunciation dictionaries, such as the Longman Pronunci-
ation Dictionary (LPD) [29], including ‘‘rh’’ and ‘‘au.’’ However, it also learned
additional TUs, such as ‘‘cq’’ in the personal names ‘‘Jacqueline’’ and ‘‘Jacque-
tta.’’ For example, after the second iteration of EM training, the most likely
TU alignment sequence of the name pair (Jacqueline, Chiehkueilin
‘‘ ’’) is shown in Fig. 9.
It should be noted that an original word may have more than one transliteration. For instance, the English name ‘‘Beaufort’’ has several possible
Chinese terms ‘‘ ’’ (Paofu), ‘‘ ’’ (Paofo), ‘‘ ’’ (Pufu), ‘‘ ’’
2 Scientific American: ‘‘http://www.sciam.com’’ (USA edition) and ‘‘http://www.sciam.com.tw’’ (Taiwan edition).
(Paofote). The TUs of the word ‘‘Beaufort’’ were automatically and dynami-
cally constructed and aligned with their corresponding transliteration TUs
via the proposed model. The results are shown in Fig. 10.
Although Knight and Graehl [12] applied EM to automatically learn similarities of English–Japanese name pairs, English words and Japanese katakana
words have to be converted into English sounds and Japanese sounds, respec-
tively, via pronunciation dictionaries. Each English sound can map to one or
more Japanese sounds. Compared with their study, one of the advantages of
our approach is that we do not have to find the exact pronunciations via dic-
tionary lookup or various grapheme-to-phoneme rules. To be more specific, a
set of often-used Chinese characters for transliteration was selected from the
collected corpora. Although many Chinese characters have more than one pro-nunciation, we found that almost all the characters used for transliteration
have unique pronunciations. For those Chinese characters not used for trans-
literation, we choose the most frequently used pronunciation instead. Since we
focus on transliterated words, we do not apply any Chinese pronunciation
Table 2
Some samples from the training set T0
Source word Target word Source word Target word
Abe Agatha
Abbey Acton
Abbot Arkwright
Archer Arabella
Adolf Alaric
Adolphus Alasdair
Adela Alastair
Adelaide Alethea
Arden Alonzo
Albert Ariadne
Alfonso Allegra
Alfie Alister
Alf Allie
Algy Arlene
Algernon Alan
Alma Aloys
Almeric Aloysius
Archie Amadeus
Alva Amabel
Alphonsus Amanda
Alphonso Amelia
Afra Arms
Avril Armstrong
Agnes Anastasia
Argus Arno
disambiguation algorithm to decide the exact pronunciation for each character.
Thus, the Romanization of Chinese characters can be conducted directly via
table lookup instead of using a pronunciation dictionary. Moreover, to accel-
erate the convergence of EM training and reduce noisy TU pairs at grapheme-
level string mapping, we adopt a many-to-many mapping under the constraints
Table 3
Some bilingual examples from the testing set P1
He is a (second) Caesar in speech and leadership
Hamlet kills the king in Act 5 Scene 2
Can you adduce any reason at all for his strange behaviour, Holmes?
To see George, of all people, in the Ritz Hotel!
He has 2 caps for playing cricket for England
They appointed him to catch all the rats in Hamelin
Burlington Arcade is a famous shopping passage in London
The architecture of ancient Greece
Drink Rossignol, the aristocrat of table wines!
Cleopatra was bitten by an asp
I shall soon be leaving for an assignment in India
Our plane stopped at London (airport) on its way to New York
Schoenberg used atonality in the music of his middle period
This tune is usually attributed to J. S. Bach
Now that this painting has been authenticated as a Rembrandt, it's worth 10 times as much as I paid for it!
Byron awoke one morning to find himself famous
of a limited set of match types based on phonetic knowledge. The maximum lengths of English and Chinese TUs are 3 and 5, respectively. Table 7 shows the
match types and English and Chinese TUs obtained in our experiments.
Table 4
Some high frequency English TUs
Length of English TU u High frequency TUs
1 a, e, i, n, l, s, o, r, d, t
2 er, ie, ar, ll, th, or, ch, tt, ck, ph
3 lle, sch
Table 5
Some high frequency Chinese TUs
Length of Chinese TU v High frequency TUs
1 i, a, l, n, o, t, e, p, m, u
2 te, ei, ai, ch, ko, hs, ng, ao, pu, fu
3 ssu, erh, ieh, chi, hsi
4 shih
5 chieh
Table 6
English–Chinese TU-mapping probabilities
u v P(v|u) u v P(v|u)
h 0.272 ei i 0.900
ae ei 0.571 eu yu 0.785
ae i 0.214 ew u 0.500
ai a 0.500 ey i 0.998
ai e 0.250 f f 0.586
ar a 0.794 ff f 0.733
au o 0.772 ff fu 0.266
aw ao 0.545 g ko 0.350
aw o 0.454 g ch 0.345
Fig. 9. TU alignment of the name pair (Jacqueline, Chiehkueilin ‘‘ ’’).
4.3. Evaluation metric
In the experiment, the performance of transliteration extraction was evalu-
ated based on precision and recall rates at the word and character levels. Since
we considered exactly one proper name in the source language and one transliteration in the target language at a time, the word recall rates were the same as the word precision rates:
Fig. 10. TU alignment of ‘‘Beaufort’’ and corresponding transliterations.
Table 7
Examples for each match type
Match type TU pair
0–1 ( ,h), ( ,i), ( ,n), ( ,u)
1–0 (h, ), (k, ), (d, ), (t, )
1–1 (r, l), (y, i), (m, m)
1–2 (j, ch), (f, fu), (d, te)
1–3 (s, ssu), (l, erh), (r, erh)
1–4 (s, shih)
2–0 (gh, )
2–1 (bb, p), (ey, i), (mm, m)
2–2 (dg, ch), (wh, hu), (ck, ko)
2–3 (le, erh), (re, erh), (ce, ssu)
2–4 (ce, shih)
2–5 (ge, chieh)
3–2 (sch, hs)
3–3 (lle, erh)
3–4 (sch, shih)
Word precision (WP) = number of correctly extracted words / number of correct words.

The character-level recall and precision rates were defined as follows:

Character precision (CP) = number of correctly extracted characters / number of extracted characters;
Character recall (CR) = number of correctly extracted characters / number of correct characters.
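A simple sketch of these metrics, assuming the extracted and reference transliterations come as paired lists; character hits are counted position-by-position here, which is a simplification of the paper's character-level matching.

```python
def evaluate(extracted, correct):
    """Word precision (WP) plus character precision/recall (CP, CR) for
    paired lists of extracted and reference transliterations. Character
    hits are counted position-by-position, a simplifying assumption."""
    wp = sum(e == c for e, c in zip(extracted, correct)) / len(correct)
    hits = sum(sum(a == b for a, b in zip(e, c))
               for e, c in zip(extracted, correct))
    cp = hits / sum(len(e) for e in extracted)
    cr = hits / sum(len(c) for c in correct)
    return wp, cp, cr
```

CP and CR differ only in their denominators (extracted versus reference characters), which is why over-long candidates lower CP while truncated candidates lower CR.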
4.4. Experimental results and discussion
In the experiment of extracting transliterations from the data set P1, the STM
model achieved, on average, a word precision rate of 86%, a character precision
rate of 94.4%, and a character recall rate of 96.3%, as shown in Table 8. The
performance could be further improved by means of simple statistical and lin-
guistic processing, as also shown in Table 8.

Table 9 shows some examples of Chinese transliterated words correctly ex-
tracted from P1 using the STM model. Although the STM model failed in
some cases, most of these problems could be overcome through the addition
of simple linguistic processing, as shown in Table 10. The error in the case
of "Quirk" occurred because, based on phonetic similarity, "Quirk" is much
closer to "kohoko" than to "KoKo"; in this case, the Chinese transliteration
plainly cannot be correctly extracted. Similar problems, due to similarities at
the grapheme level, occurred with the name pairs (Tom, "tangmu") and
(John, "yuehhan"), as shown in Table 10. It is obvious that
Table 8
The experimental results of transliterated word extraction for P1
Test set Methods WP (%) CP (%) CR (%)
P1 (LDOCE) STM 86.0 94.4 96.3
STM + R1 88.6 95.4 97.7
STM + R2 90.8 97.4 95.9
STM + R1 + R2 94.2 98.3 97.7
P2 (Scientific American) STM 90.7 96.9 97.3
STM + R1 92.7 97.6 97.9
STM + R2 92.0 97.8 97.3
STM + R1 + R2 94.0 98.3 97.9
P3 (Sinorama) STM 86.7 94.2 96.1
STM + R1 89.0 94.9 96.8
STM + R2 87.7 95.8 94.9
STM + R1 + R2 93.0 96.5 96.7
a collection of commonly used or highly varying transliterations can be incre-
mentally added to a lookup list to further improve the system performance.
We have also performed the same experiments on the data sets P2 and P3;
the results are shown in Table 8. Although the performance of the STM
approach on the data sets P1 and P3 is worse than that on P2, the integrated
scheme (STM + R1 + R2) clearly exhibits considerable robustness in
extracting transliterated words from data sets in various domains. The
results in Table 11 show that the average rates of word and character precision
for the test sets are around 93.8% and 97.8%, respectively.
Table 9
Some examples of Chinese transliterations correctly extracted from P1 by the STM model
Bilingual sentence    Chinese transliteration
He is a second Caesar in speech and leadership    (kaisa)
In this case I'm acting for my friend Mr. Smith    (shihmissu)
What's your alibi for being late this time, Jones?    (chungssu)
Can you adduce any reason at all for his strange behaviour, Holmes?    (fuerhmossu)
They appointed him to catch all the rats in Hamelin    (hanmulin)
Drink Rossignol, the aristocrat of table wines!    (lohsino)
Cleopatra was bitten by an asp    (koliaopeitela)
Schoenberg used atonality in the music of his middle period    (sangpoko)
If you have to change trains in London, you may be able to book through to your last station    (luntun)
This tune is usually attributed to J.S. Bach    (paha)
Byron awoke one morning to find himself famous    (pailun)
You must have kissed the Blarney Stone to be able to talk like that!    (pulani)
Quirk and Greenbaum collaborated on the new grammar    (kolinpang)
Table 10
Some examples of possible Chinese transliterations extracted by the proposed approaches
Bilingual sentence    STM    STM + R1    STM + R2    STM + R1 + R2
David, as you know, writes dictionaries    (taweite)    (tawei)
The Mediterranean Sea bathes the sunny shores of Italy    (teitalihaian)    (tichunghai)
You have borne yourself bravely in this battle, Lord Faulconbridge    (fokenpolichueh)    (fokenpoli)
Ancient Rome and Greece    (chihsi)    (hsila)
Jane is blossoming out into a beautiful girl    (cheni)    (chen)
Tom likes to boss younger children about    (tang)
Quirk and Greenbaum collaborated on the new grammar    (kohoko)
John seems to have made a real conquest of Janet. They're always together    (chen)
"*" means the Chinese transliterated word was not correctly extracted.
Compared with previous work, the proposed approach has three advan-
tages. First, the proposed method learns the parameters of the model automat-
ically from a list of bilingual name pairs, without using a pronunciation
dictionary or grapheme-to-phoneme rules for the source words. Second, the
proposed framework is easier to port to other language pairs, as long as
some transliteration training data is available. Third, the proposed approach
matches TUs in the two languages directly, thereby accelerating the matching
process by skipping the grapheme-to-phoneme phase.
5. Conclusions
A new statistical modeling approach to the machine transliteration problem
has been presented in this paper. The parameters of the model are automati-
cally learned from a bilingual proper name list using the EM algorithm. More-
over, the model is applicable to the extraction of proper names and
transliterations. The proposed method can be easily extended to other language
pairs that have different sound systems, without the assistance of pronunciation
dictionaries. Experimental results indicate that high precision and recall rates
can be achieved by the proposed method.
Table 11
The average rates of transliterated word extraction for overall corpora
Methods WP (%) CP (%) CR (%)
STM 87.5 95.1 96.6
STM + R1 89.8 95.9 97.5
STM + R2 90.3 97.1 96.1
STM + R1 + R2 93.8 97.8 97.5