Machine Transliteration
Bhargava Reddy, 110050078
B.Tech 4th year UG
Contents
• Fundamental definition of Machine Transliteration
• History of Machine Transliteration
• Modelling of the transliteration
• Bridge transliteration system
• Syllabification and the use of CRFs
• Substring alignment and re-ranking methods
• Using hybrid models
Definition of Machine Transliteration
• Conversion of a given name in the source language to a name in the target language such that the target name:
1. Is phonemically equivalent to the source name
2. Conforms to the phonology of the target language
3. Matches the user intuition of the equivalent of the source language name in the target language, considering the culture and orthographic character usage in the target language
• Note that each of these criteria matters in its own right
Ref: Report of NEWS 2012 Machine Transliteration Shared Task
Brief History of the work carried out
• Early models for MT:
1. Grapheme-based transliteration model (ψG)
2. Phoneme-based transliteration model (ψP)
3. Hybrid transliteration model (ψH)
4. Correspondence-based transliteration model (ψC)
• ψG is known as the direct method because it directly transforms source language graphemes into target language graphemes
• ψP is called the pivot method because it uses source language phonemes as a pivot when producing target language graphemes
Ref: A comparison of Different Machine Transliteration Models. 2006
Hybrid and Correspondence Models
• Combining both models resulted in ψH and ψC
• ψH directly combines the phoneme-based transliteration probability Pr(ψP) and the grapheme-based transliteration probability Pr(ψG) using linear interpolation (the dependence between them is not considered)
• ψC makes use of the correspondence between a source grapheme and a source phoneme when producing target language graphemes
Ref: 1. Improving back-transliteration by combining information sources 2. An English-Korean transliteration model using pronunciation and contextual rules
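The linear interpolation used by ψH can be sketched in a few lines. The candidate strings, their model probabilities, and the mixing weight below are purely illustrative, not values from the papers:

```python
def hybrid_prob(p_phoneme, p_grapheme, lam=0.5):
    """psi_H: linear interpolation of the phoneme-based (psi_P) and
    grapheme-based (psi_G) transliteration probabilities.
    lam is a mixing weight, tuned on held-out data in practice."""
    return lam * p_phoneme + (1.0 - lam) * p_grapheme

# Illustrative candidate target strings with made-up model probabilities
# as (Pr under psi_P, Pr under psi_G)
candidates = {"khan": (0.6, 0.3), "kaan": (0.2, 0.5)}
scores = {t: hybrid_prob(pp, pg, lam=0.7) for t, (pp, pg) in candidates.items()}
best = max(scores, key=scores.get)
```

Because the interpolation is a simple weighted sum, the dependence between the two component models is indeed ignored, which is exactly the limitation noted above.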
Graphical Representation
Ref: A comparison of Different Machine Transliteration Models. 2006
Modelling the Components
• Maximum entropy model (MEM) is a widely used probability model that can incorporate heterogeneous information effectively; it is therefore used in the hybrid model
• Decision-tree learning is used for creating the training set for the models
• Memory-based learning (MBL), also known as “instance-based learning” and “case-based learning”, is an example-based learning method, useful for the computation of φ(SP)T
Ref: A comparison of Different Machine Transliteration Models. 2006
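Memory-based learning simply stores training instances and classifies a query by its nearest stored neighbour. A minimal sketch, with hypothetical (prev, current, next) grapheme-context features standing in for the real feature set:

```python
def hamming(a, b):
    """Count of mismatched positions between two equal-length feature tuples."""
    return sum(x != y for x, y in zip(a, b))

class MemoryBasedLearner:
    """Memory-based ("instance-based", "case-based") learning: store every
    training instance verbatim; classify by 1-nearest-neighbour lookup."""

    def __init__(self):
        self.instances = []  # list of (feature_tuple, label)

    def train(self, features, label):
        self.instances.append((features, label))

    def classify(self, features):
        # 1-nearest-neighbour lookup over the stored memory
        return min(self.instances, key=lambda inst: hamming(inst[0], features))[1]

# Hypothetical grapheme contexts ('#' = word boundary) -> phoneme labels
mbl = MemoryBasedLearner()
mbl.train(("#", "p", "h"), "f")   # "ph" at word start maps to /f/
mbl.train(("p", "h", "o"), "")    # the "h" is consumed by the digraph
mbl.train(("#", "p", "a"), "p")   # a plain "p" maps to /p/
```

There is no training phase beyond memorisation, which is what distinguishes MBL from the decision-tree and maximum-entropy components above.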
Study of MT through Bridging Languages
• Data is available between a language pair due to one of the following three reasons:
1. Politically related languages: due to the political dominance of English, it is easy to obtain parallel name data between English and most languages
2. Genealogically related languages: languages sharing the same origin; they might have significant overlap between their phonemes and graphemes
3. Demographically related languages: e.g. Hindi and Telugu; they might not have the same origin, but due to shared culture and demographics there will be similarities
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Bridge Transliteration Methodology
• Assume X is the source language, Y is the target language and Z is the pivot language. Then:
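The expression on this slide did not survive extraction. Given the independence assumption stated next, it is presumably the pivot marginalisation over the top-k outputs of the X → Z system, along the lines of:

```latex
P(y \mid x) \;=\; \sum_{z} P(y \mid z, x)\, P(z \mid x)
          \;\approx\; \sum_{z \,\in\, \text{top-}k} P(y \mid z)\, P(z \mid x)
```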
• In the above expression we assume that Y is independent of X given Z
• The linguistic intuition behind this assumption is that the top-k outputs of the X → Z system corresponding to a string in X capture all the transliteration information necessary for transliterating to Y
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Results for the Bridge System
• We must remember that machine transliteration is a lossy conversion
• In the bridge system we can therefore expect a loss of information, and thus a drop in accuracy
• The results showed a drop in accuracy of about 8–9% (ACC-1) and about 1–3% (mean F-score)
• The NEWS 2009 dataset was used for training and evaluating these results
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Stepping through an intermediate language
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Syllabification?
• There is no gold standard for syllable segmentation
• Yang et al. (2009) applied an N-gram joint source channel model and the EM algorithm
• Aramaki and Abekawa (2009) made use of the word alignment tool GIZA++ to obtain a syllable segmentation and alignment corpus from the given training data
• Yang et al. (2010) proposed a joint optimization method to reduce the propagation of alignment errors
• These papers performed syllabification of Chinese words
Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs
• The transliteration is implemented as a two-phase CRF
• The first CRF splits the word into chunks (similar to syllabification)
• The second CRF labels which target characters each chunk is transliterated to
• The final transliteration is the sequence of all the target characters
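The two-phase pipeline can be illustrated with B/I chunk tags. The sketch below hard-codes a hypothetical phase-1 tagging and a lookup table in place of the two trained CRFs, just to show how the phases compose:

```python
def chunks_from_bio(word, tags):
    """Group letters into chunks from B/I tags, as the first (chunking) CRF would:
    'B' starts a new chunk, 'I' extends the current one."""
    out = []
    for ch, tag in zip(word, tags):
        if tag == "B" or not out:
            out.append(ch)
        else:
            out[-1] += ch
    return out

# Hypothetical phase-1 output for "london": chunks "lon" and "don"
word, tags = "london", ["B", "I", "I", "B", "I", "I"]
chunks = chunks_from_bio(word, tags)

# Hypothetical lookup standing in for the second (labelling) CRF
chunk_to_cn = {"lon": "伦", "don": "敦"}
translit = "".join(chunk_to_cn[c] for c in chunks)
```

In the actual system both phases are learned sequence models; only the composition of chunking followed by per-chunk labelling is shown here.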
Using CRF models for MT
• A Hindi to English machine transliteration model using CRFs was proposed by Manikrao, Shantanu and Tushar in the paper “Hindi to English Machine Transliteration of Named Entities using CRFs”, International Journal of Computer Applications (0975-8887), June 2012
• The model is described as follows:
Model Flow
Results of the CRF model proposed
English-Korean Transliteration using substring alignment and re-ranking methods
• Chun-Kai Wu, Yu-Chun Wang and Richard Tzong-Han Tsai described their approach to MT in their paper. It consists of 4 parts:
1. Pre-processing
2. Letter-to-phoneme alignment
3. DirecTL-p training
4. Re-ranking results
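The final re-ranking step rescores the n-best list produced by the base model and re-sorts it. The feature names and weights below are hypothetical (in the paper they are learned from data); only the mechanics of re-ranking are shown:

```python
def rerank(nbest, weights):
    """nbest: list of (candidate, feature_dict) pairs from the base model.
    Returns the candidates sorted by a weighted linear combination of features."""
    def score(feats):
        return sum(weights[name] * value for name, value in feats.items())
    return sorted(nbest, key=lambda item: score(item[1]), reverse=True)

# Illustrative 2-best list: the base model prefers candidate_a,
# but an extra feature can flip the ranking
nbest = [
    ("candidate_a", {"model_score": 0.8, "extra_feature": 0.1}),
    ("candidate_b", {"model_score": 0.6, "extra_feature": 0.9}),
]
weights = {"model_score": 1.0, "extra_feature": 0.5}
reranked = rerank(nbest, weights)
```

The point of re-ranking is exactly this: evidence not available to the base model can promote a lower-ranked candidate to the top.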
Hybrid models
• Dong Yang, Paul Dixon, Yi-Cheng Pan, Tasuku Oonishi, Masanobu Nakamura and Sadaoki Furui of the Computer Science Department at the Tokyo Institute of Technology combined the two-step CRF model with a joint source channel model for machine transliteration
References
• Report of NEWS 2012 Machine Transliteration Shared Task (2012). Min Zhang, Haizhou Li, A Kumaran and Ming Liu. ACL 2012
• A Comparison of Different Machine Transliteration Models (2006)
• Improving back-transliteration by combining information sources (2004). Bilac, S., & Tanaka, H. In Proceedings of IJCNLP 2004, pp. 542–547
• An English-Korean transliteration model using pronunciation and contextual rules (2002). Oh, J. H., & Choi, K. S. In Proceedings of COLING 2002, pp. 758–764
• Everybody loves a rich cousin: An empirical study of transliteration through bridge languages (2010). Mitesh M. Khapra, A Kumaran, Pushpak Bhattacharyya. NAACL 2010
References
• Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs (2011). Ying Qin, Guohua Chen
• Hindi to English Machine Transliteration of Named Entities using CRFs (2012). Manikrao, Shantanu and Tushar. International Journal of Computer Applications (0975-8887), June 2012
• English-Korean named entity transliteration using substring alignment and re-ranking methods (2012). Chun-Kai Wu, Yu-Chun Wang, and Richard Tzong-Han Tsai. In Proc. Named Entities Workshop at ACL 2012
• Combining a two-step CRF and a joint source channel model for machine transliteration (2009). D Yang, P Dixon, YC Pan, T Oonishi, M Nakamura. In NEWS ’09, Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration