Machine Transliteration
Bhargava Reddy, 110050078
B.Tech 4th year UG
Contents
• Fundamental definition of Machine Transliteration
• History of Machine Transliteration
• Modelling of the transliteration
• Bridge transliteration system
• Syllabification and the use of CRFs
• Substring alignment and re-ranking methods
• Using hybrid models
Definition of Machine Transliteration
• Conversion of a given name in the source language to a name in the target language such that the target name:
1. Is phonemically equivalent to the source name
2. Conforms to the phonology of the target language
3. Matches the user intuition of the equivalent of the source language name in the target language, considering the culture and orthographic character usage in the target language
• Note that each of these criteria matters in its own right
Ref: Report of NEWS 2012 Machine Transliteration Shared Task
Brief History of the work carried out
• Early models for MT:
1. Grapheme-based transliteration model (ψG)
2. Phoneme-based transliteration model (ψP)
3. Hybrid transliteration model (ψH)
4. Correspondence-based transliteration model (ψC)
• ψG is known as the direct method because it directly transforms source language graphemes into target language graphemes
• ψP is called the pivot method because it uses source language phonemes as a pivot when producing target language graphemes
Ref: A comparison of Different Machine Transliteration Models. 2006
Hybrid and Correspondence Models
• Combining both models resulted in ψH and ψC
• ψH directly combines the phoneme-based transliteration probability Pr(ψP) and the grapheme-based transliteration probability Pr(ψG) using linear interpolation (the dependence between them is not considered)
• ψC makes use of the correspondence between a source grapheme and a source phoneme when producing target language graphemes
Ref: 1. Improving back-transliteration by combining information sources 2. An English-Korean transliteration model using pronunciation and contextual rules
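The linear interpolation used by ψH can be sketched in a few lines. The candidate strings, their model probabilities, and the mixing weight below are purely illustrative, not values from the papers:

```python
def hybrid_prob(p_phoneme, p_grapheme, lam=0.5):
    """psi_H: linear interpolation of the phoneme-based (psi_P) and
    grapheme-based (psi_G) transliteration probabilities.
    lam is a mixing weight, tuned on held-out data in practice."""
    return lam * p_phoneme + (1.0 - lam) * p_grapheme

# Illustrative candidate target strings with made-up model probabilities
# as (Pr under psi_P, Pr under psi_G)
candidates = {"khan": (0.6, 0.3), "kaan": (0.2, 0.5)}
scores = {t: hybrid_prob(pp, pg, lam=0.7) for t, (pp, pg) in candidates.items()}
best = max(scores, key=scores.get)
```

Because the interpolation is a simple weighted sum, the dependence between the two component models is indeed ignored, which is exactly the limitation noted above.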
Graphical Representation
Ref: A comparison of Different Machine Transliteration Models. 2006
Modelling the Components
• Maximum entropy model (MEM) is a widely used probability model that can incorporate heterogeneous information effectively; it is therefore used in the hybrid model
• Decision-tree learning is used for creating the training set for the models
• Memory-based learning (MBL), also known as “instance-based learning” and “case-based learning”, is an example-based learning method, useful for the computation of φ(SP)T
Ref: A comparison of Different Machine Transliteration Models. 2006
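Memory-based learning simply stores training instances and classifies a query by its nearest stored neighbour. A minimal sketch, with hypothetical (prev, current, next) grapheme-context features standing in for the real feature set:

```python
def hamming(a, b):
    """Count of mismatched positions between two equal-length feature tuples."""
    return sum(x != y for x, y in zip(a, b))

class MemoryBasedLearner:
    """Memory-based ("instance-based", "case-based") learning: store every
    training instance verbatim; classify by 1-nearest-neighbour lookup."""

    def __init__(self):
        self.instances = []  # list of (feature_tuple, label)

    def train(self, features, label):
        self.instances.append((features, label))

    def classify(self, features):
        # 1-nearest-neighbour lookup over the stored memory
        return min(self.instances, key=lambda inst: hamming(inst[0], features))[1]

# Hypothetical grapheme contexts ('#' = word boundary) -> phoneme labels
mbl = MemoryBasedLearner()
mbl.train(("#", "p", "h"), "f")   # "ph" at word start maps to /f/
mbl.train(("p", "h", "o"), "")    # the "h" is consumed by the digraph
mbl.train(("#", "p", "a"), "p")   # a plain "p" maps to /p/
```

There is no training phase beyond memorisation, which is what distinguishes MBL from the decision-tree and maximum-entropy components above.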
Study of MT through Bridging Languages
• Data is available between a language pair due to one of the following three reasons:
1. Politically related languages: due to the political dominance of English, it is easy to obtain parallel name data between English and most languages
2. Genealogically related languages: languages sharing the same origin; they might have significant overlap between their phonemes and graphemes
3. Demographically related languages: e.g. Hindi and Telugu; they might not have the same origin, but due to shared culture and demographics there will be similarities
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Bridge Transliteration Methodology
• Assume X is the source language, Y is the target language and Z is the pivot language. Then:
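The expression on this slide did not survive extraction. Given the independence assumption stated next, it is presumably the pivot marginalisation over the top-k outputs of the X → Z system, along the lines of:

```latex
P(y \mid x) \;=\; \sum_{z} P(y \mid z, x)\, P(z \mid x)
          \;\approx\; \sum_{z \,\in\, \text{top-}k} P(y \mid z)\, P(z \mid x)
```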
• In the above expression we assume that Y is independent of X given Z
• The linguistic intuition behind this assumption is that the top-k outputs of the X → Z system corresponding to a string in X capture all the transliteration information necessary for transliterating to Y
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Results for the Bridge System
• We must remember that machine transliteration is a lossy conversion
• In the bridge system we can therefore expect a loss of information, and thus a drop in accuracy
• The results showed a drop in accuracy of about 8–9% (ACC-1) and about 1–3% (mean F-score)
• The NEWS 2009 dataset was used for training and evaluating these results
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Stepping through an intermediate language
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Syllabification?
• There is no gold standard for syllable segmentation
• Yang et al. (2009) applied an N-gram joint source channel model and the EM algorithm
• Aramaki and Abekawa (2009) made use of the word alignment tool GIZA++ to obtain a syllable segmentation and alignment corpus from the given training data
• Yang et al. (2010) proposed a joint optimization method to reduce the propagation of alignment errors
• These papers performed syllabification of Chinese words
Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs
• The transliteration is implemented as a two-phase CRF
• The first CRF splits the word into chunks (similar to syllabification)
• The second CRF labels which target characters each chunk is transliterated to
• The final transliteration is the sequence of all the target characters
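The two-phase pipeline can be illustrated with B/I chunk tags. The sketch below hard-codes a hypothetical phase-1 tagging and a lookup table in place of the two trained CRFs, just to show how the phases compose:

```python
def chunks_from_bio(word, tags):
    """Group letters into chunks from B/I tags, as the first (chunking) CRF would:
    'B' starts a new chunk, 'I' extends the current one."""
    out = []
    for ch, tag in zip(word, tags):
        if tag == "B" or not out:
            out.append(ch)
        else:
            out[-1] += ch
    return out

# Hypothetical phase-1 output for "london": chunks "lon" and "don"
word, tags = "london", ["B", "I", "I", "B", "I", "I"]
chunks = chunks_from_bio(word, tags)

# Hypothetical lookup standing in for the second (labelling) CRF
chunk_to_cn = {"lon": "伦", "don": "敦"}
translit = "".join(chunk_to_cn[c] for c in chunks)
```

In the actual system both phases are learned sequence models; only the composition of chunking followed by per-chunk labelling is shown here.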
Using CRF models for MT
• A Hindi to English machine transliteration model using CRFs was proposed by Manikrao, Shantanu and Tushar in the paper “Hindi to English Machine Transliteration of Named Entities using CRFs”, International Journal of Computer Applications (0975-8887), June 2012
• The model is described as follows:
Model Flow
Results of the CRF model proposed
English-Korean Transliteration using substring alignment and re-ranking methods
• Chun-Kai Wu, Yu-Chun Wang and Richard Tzong-Han Tsai described their approach to MT in their paper. It consists of 4 parts:
1. Pre-processing
2. Letter-to-phoneme alignment
3. DirecTL-p training
4. Re-ranking results
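The final re-ranking step rescores the n-best list produced by the base model and re-sorts it. The feature names and weights below are hypothetical (in the paper they are learned from data); only the mechanics of re-ranking are shown:

```python
def rerank(nbest, weights):
    """nbest: list of (candidate, feature_dict) pairs from the base model.
    Returns the candidates sorted by a weighted linear combination of features."""
    def score(feats):
        return sum(weights[name] * value for name, value in feats.items())
    return sorted(nbest, key=lambda item: score(item[1]), reverse=True)

# Illustrative 2-best list: the base model prefers candidate_a,
# but an extra feature can flip the ranking
nbest = [
    ("candidate_a", {"model_score": 0.8, "extra_feature": 0.1}),
    ("candidate_b", {"model_score": 0.6, "extra_feature": 0.9}),
]
weights = {"model_score": 1.0, "extra_feature": 0.5}
reranked = rerank(nbest, weights)
```

The point of re-ranking is exactly this: evidence not available to the base model can promote a lower-ranked candidate to the top.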
Hybrid models
• Dong Yang, Paul Dixon, Yi-Cheng Pan, Tasuku Oonishi, Masanobu Nakamura and Sadaoki Furui of the Computer Science Department at the Tokyo Institute of Technology combined the two-step CRF model with a joint source channel model for machine transliteration
References
• Report of NEWS 2012 Machine Transliteration Shared Task (2012). Min Zhang, Haizhou Li, A Kumaran and Ming Liu. ACL 2012
• A Comparison of Different Machine Transliteration Models (2006)
• Improving back-transliteration by combining information sources (2004). Bilac, S., & Tanaka, H. In Proceedings of IJCNLP 2004, pp. 542–547
• An English-Korean transliteration model using pronunciation and contextual rules (2002). Oh, J. H., & Choi, K. S. In Proceedings of COLING 2002, pp. 758–764
• Everybody loves a rich cousin: An empirical study of transliteration through bridge languages (2010). Mitesh M. Khapra, A Kumaran, Pushpak Bhattacharyya. NAACL 2010
References
• Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs (2011). Ying Qin, Guohua Chen
• Hindi to English Machine Transliteration of Named Entities using CRFs (2012). Manikrao, Shantanu and Tushar. International Journal of Computer Applications (0975-8887), June 2012
• English-Korean named entity transliteration using substring alignment and re-ranking methods (2012). Chun-Kai Wu, Yu-Chun Wang, and Richard Tzong-Han Tsai. In Proc. Named Entities Workshop at ACL 2012
• Combining a two-step CRF and a joint source channel model for machine transliteration (2009). D Yang, P Dixon, YC Pan, T Oonishi, M Nakamura. In NEWS ’09, Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration