BTP Stage 1: Machine Transliteration & Entropy
Final Presentation
Bhargava Reddy
110050078
Contents
• What is Machine Transliteration?
• Classical Methods of Transliteration
• CRF and its use in Transliteration
• Study of Transliteration through Bridging Languages
• Study of Entropy in Information Theory
• Entropy of a Language
• Transliterability and Transliteration Performance
• Conclusion and Future Works
Definition of Machine Transliteration
• Conversion of a given name in the source language to a name in the target language such that the target name:
1. Is phonemically equivalent to the source name
2. Conforms to the phonology of the target language
3. Matches the user intuition of the equivalent of the source language name in the target language, considering the culture and orthographic character usage in the target language
• Note that each of these criteria captures equivalence in its own way
Ref: Report of NEWS 2012 Machine Transliteration Shared Task
Classical Machine Transliteration Models
• Work on machine transliteration started between 1995 and 2000.
• Four classical machine transliteration models have been proposed:
1. Grapheme-based transliteration model (ψG)
2. Phoneme-based transliteration model (ψP)
3. Hybrid transliteration model (ψH)
4. Correspondence-based transliteration model (ψC)
Ref: A Comparison of Different Machine Transliteration Models (2006)
Grapheme and Phoneme
Phoneme: the smallest contrastive linguistic unit which may bring about a change of meaning. "Kiss" and "kill" are two completely contrasting words; the phonemes /s/ and /l/ make the difference.
Grapheme: the smallest semantically distinguishing unit in a written language, analogous to the phoneme of spoken language. A grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme.
Phoneme-based Transliteration Model
• This model is basically a source grapheme → source phoneme and source phoneme → target grapheme transformation
• It was first proposed by Knight and Graehl in 1997
• They modelled it for English-Japanese and Japanese-English transliteration
• In these methods the main transliteration key is pronunciation (the source phoneme) rather than spelling (the source grapheme)
Katakana Words and the Japanese Language
• Katakana words are words imported from other languages (primarily English)
• Katakana raises a number of issues where pronunciation is concerned
• In Japanese, L and R are pronounced the same (as something in between)
• The same goes for H and F
Katakana Words
• Golf bag is pronounced as go-ru-hu-ba-ggu (ゴルフバッグ)
• Johnson is pronounced as jyo-n-s-o-n (ジョンソン)
• Ice cream is pronounced as a-i-su-ku-ri-i-mu (アイスクリーム)
• What have we observed in the transliteration?
• There has been a lot of information loss in the process of conversion from English to Japanese
• So when we do the back-transliteration we may run into trouble
The steps to convert from English to Katakana
We do the following steps for the conversion, each modelled as a probability in Knight and Graehl's formulation:
1. P(w): generates written English word sequences
2. P(e|w): pronounces English word sequences
3. P(j|e): converts English sounds into Japanese sounds
4. P(k|j): converts Japanese sounds to katakana writing
5. P(o|k): introduces misspellings caused by Optical Character Recognition
Fixing Back-Transliteration
• We apply the five steps above for converting an English word to Japanese
• Knight and Graehl used Bayes' theorem to do the reverse
• For a given katakana word $o$, they maximized the following expression over English word sequences $w$:
$\arg\max_{w} P(w) \sum_{e,j,k} P(e \mid w)\, P(j \mid e)\, P(k \mid j)\, P(o \mid k)$
• P(w) is implemented using a WFSA (Weighted Finite State Acceptor), with weights and symbols on the transitions making some output sequences more likely than others
• The other probabilities are implemented using WFSTs (Weighted Finite State Transducers), with a pair of symbols on each transition, one input and one output. Inputs and outputs may include the empty symbol ϵ
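To make the decomposition concrete, here is a minimal Python sketch that scores candidate English words for an observed katakana romanization under the factored noisy-channel model. The probability tables are toy values, and the intermediate stages are collapsed into a single hypothetical channel table; a real system would compose WFSTs rather than enumerate candidates.

```python
import math

# Toy prior over English words: the P(w) component (the WFSA)
P_w = {"golf bag": 0.6, "gulf bug": 0.4}

# Hypothetical collapsed channel: P(o|w) stands in for the chained sum
# over pronunciations e, Japanese sounds j and katakana spellings k:
# sum_{e,j,k} P(e|w) P(j|e) P(k|j) P(o|k)
P_o_given_w = {
    ("go-ru-hu-ba-ggu", "golf bag"): 0.05,
    ("go-ru-hu-ba-ggu", "gulf bug"): 0.01,
}

def back_transliterate(observed, candidates):
    """Pick the candidate w maximizing P(w) * P(o|w); P(o) is constant."""
    def log_score(w):
        return math.log(P_w[w]) + math.log(P_o_given_w.get((observed, w), 1e-12))
    return max(candidates, key=log_score)

print(back_transliterate("go-ru-hu-ba-ggu", ["golf bag", "gulf bug"]))
# -> golf bag
```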
Example for Back Transliteration
• Consider the Japanese word マスターズトーナメント
• This is first identified by OCR, which may introduce errors
• It is then passed to the P(k|j) model, which converts it to the Japanese sound sequence:
m a s u t a a z u t o o ch i m e n t o
• Next comes the P(j|e) model, which converts it to English sounds:
M AE S T AE AR DH UH T AO CH IH M EH N T AO
• Then the P(e|w) model converts the sounds to word candidates such as:
Masters tone am ent awe
• Finally, using the P(w) model, the word becomes: Masters Tournament
Study of MT through Bridging Languages
Data is available between a language pair due to one of the following three reasons:
1. Politically related languages: due to the political dominance of English, it is easy to obtain parallel names data between English and most languages
2. Genealogically related languages: languages sharing the same origin, which might have significant overlap between their phonemes and graphemes
3. Demographically related languages: e.g. Hindi and Telugu, which might not share the same origin, but due to shared culture and demographics there will be similarities
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Bridge Transliteration Methodology
• Assume X is the source language, Y the target language and Z the pivot language. Then:
$P(y \mid x) = \sum_{z} P(y \mid z, x)\, P(z \mid x) \approx \sum_{z} P(y \mid z)\, P(z \mid x)$
• In the above expression we assume that Y is independent of X given Z
• The linguistic intuition behind this assumption is that the top-k outputs of the X → Z system corresponding to a string in X capture all the transliteration information necessary for transliterating to Y
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
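A minimal Python sketch of this composition, assuming hypothetical `x_to_z` and `z_to_y` functions that each return ranked (candidate, probability) lists from already-trained systems:

```python
from collections import defaultdict

def bridge_transliterate(x, x_to_z, z_to_y, k=10):
    """Compose two systems through a pivot language Z:
    P(y|x) ~ sum over top-k pivots z of P(y|z) * P(z|x),
    using the assumption that Y is independent of X given Z."""
    scores = defaultdict(float)
    for z, p_z in x_to_z(x)[:k]:       # top-k candidates in the pivot
        for y, p_y in z_to_y(z)[:k]:   # top-k candidates in the target
            scores[y] += p_y * p_z
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy stand-ins for the trained X->Z and Z->Y systems
x_to_z = lambda x: [("raam", 0.7), ("ram", 0.3)]
z_to_y = lambda z: [(z.upper(), 0.9), (z.title(), 0.1)]
print(bridge_transliterate("राम", x_to_z, z_to_y))
```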
Results for the Bridge System
• We must remember that machine transliteration is a lossy conversion
• In the bridge system we can therefore expect a loss of information, and thus the accuracy scores will drop
• The results have shown a drop in accuracy of about 8-9% (ACC-1) and about 1-3% (mean F-score)
• The NEWS 2009 dataset was used for the training and evaluation of these results
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Stepping through an intermediate language
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
What is Entropy?
• Entropy is the amount of information obtained in each message received
• It characterizes our uncertainty about our source of information (its randomness)
• It is the expected value of the information content of a random variable
Based on Shannon's: A Mathematical Theory of Communication
Properties and Mathematical Formulation
• Assume a set of events whose probabilities of occurrence are $p_1, p_2, p_3, \ldots, p_n$
• $H(p_1, \ldots, p_n)$ is the entropy and satisfies the following properties:
1. H should be continuous in the $p_i$
2. If all the $p_i$ are equal, $p_i = 1/n$, then H should be a monotonically increasing function of n
3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H
Based on Shannon's: A Mathematical Theory of Communication
The Formula for Entropy
$H = -K \sum_{i=1}^{n} p_i \log p_i$
where K is a positive constant
Note: up to the choice of K, this is the only formula satisfying the properties mentioned
Based on Shannon's: A Mathematical Theory of Communication
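A minimal Python sketch of this formula, taking K = 1 and log base 2 so the entropy comes out in bits:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum(p * log p) of a discrete distribution."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # 1.0 bit: a fair coin
print(entropy([1/2, 1/3, 1/6]))   # ~1.459 bits: the example in the extra slides
```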
Motivation for Entropy in English
• Aoccdrnig to rseearch at an Elingsh uinervtisy, it deosn't mttaer in waht odrer the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe.
Entropy of Language
• If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language
• The redundancy measures the amount of constraint imposed on a text in the language due to its statistical structure
• E.g. in English: the high frequency of the letter E, the strong tendency of H to follow T, or of U to follow Q
Based on Shannon's: Prediction and Entropy of Printed English
Entropy calculation from the statistics of English
• Entropy H can be calculated by a series of approximations $F_0, F_1, F_2, \ldots$
• Each value successively takes more and more of the statistics of the language into account and approaches H as a limit
• $F_N$ is called the N-gram entropy: it measures the amount of information or entropy due to statistics extending over N adjacent letters of text
Based on Shannon's: Prediction and Entropy of Printed English
Entropy of English
$F_N = -\sum_{i,j} p(b_i, j) \log_2 p(j \mid b_i)$
Here: $b_i$ is a block of N-1 letters
$j$ is an arbitrary letter following $b_i$
$p(b_i, j)$ is the probability of the N-gram $b_i, j$
$p(j \mid b_i)$ is the conditional probability of the letter $j$ after the block $b_i$, which is given by $p(b_i, j) / p(b_i)$
Based on Shannon's: Prediction and Entropy of Printed English
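A minimal sketch that estimates F_N from raw text by counting N-grams; the toy corpus below stands in for the large frequency tables Shannon worked from:

```python
import math
from collections import Counter

def ngram_entropy(text, n):
    """Estimate F_N = -sum p(b,j) log2 p(j|b), with b a block of n-1 letters."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    blocks = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    total = sum(ngrams.values())
    f_n = 0.0
    for gram, count in ngrams.items():
        p_ngram = count / total                 # p(b, j)
        p_cond = count / blocks[gram[:-1]]      # p(j | b) = p(b, j) / p(b)
        f_n -= p_ngram * math.log2(p_cond)
    return f_n

text = "thequickbrownfoxjumpsoverthelazydog" * 100
print(ngram_entropy(text, 1), ngram_entropy(text, 2))  # F1 and F2 estimates
```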
Interpretation of the equation
• $F_N$ is the average uncertainty (conditional entropy) of the next letter j when the preceding N-1 letters are known
• As N is increased, $F_N$ includes longer- and longer-range statistics, and the entropy H is given by the limiting value of $F_N$ as $N \to \infty$
Based on Shannon's: Prediction and Entropy of Printed English
Calculations of the $F_N$
• If we ignore spaces and punctuation and treat all 26 letters as equally probable, then $F_0 = \log_2 26 \approx 4.7$ bits per letter
• $F_1$ involves just the letter frequencies and can thus be calculated as $F_1 = -\sum_i p(i) \log_2 p(i) \approx 4.14$ bits per letter
• Under the di-gram assumption, $F_2 = -\sum_{i,j} p(i,j) \log_2 p(j \mid i) \approx 3.56$ bits per letter
Based on Shannon's: Prediction and Entropy of Printed English
Letter Frequencies in English
Source: Wikipedia’s article on letter frequency in English
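As a quick check of the F1 figure, here is a sketch that computes the letter-frequency entropy from the approximate percentages in Wikipedia's table (the exact frequencies Shannon used differ slightly):

```python
import math

# Approximate English letter frequencies (percent), from Wikipedia's table
freq = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
        's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
        'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
        'p': 1.9, 'b': 1.5, 'v': 0.98, 'k': 0.77, 'j': 0.15, 'x': 0.15,
        'q': 0.095, 'z': 0.074}
total = sum(freq.values())
f1 = -sum((v / total) * math.log2(v / total) for v in freq.values())
print(f1)  # ~4.18 bits per letter, close to Shannon's F1 = 4.14
```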
Calculation of higher $F_N$
• A similar calculation for $F_3$ gives the value 3.3 bits per letter
• Tables of N-gram frequencies are not available for N > 3, so $F_4, F_5, F_6$ cannot be calculated the same way
• Word frequencies are used to assist in such situations
• Let us look at a log-log plot of word probability against frequency rank
Based on Shannon's: Prediction and Entropy of Printed English
Word Frequencies
• The most frequent word in English, "the", has probability .071 and is plotted against rank 1
• The next one is "of", with .034, plotted against rank 2, and so on
• Using log scales for both the x and y axes, we observe a straight line
• Thus, if $p_n$ is the frequency of the nth most frequent word, we have roughly $p_n \approx 0.1/n$
• The above equation is termed Zipf's law
Based on Shannon's: Prediction and Entropy of Printed English
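A small sketch for checking Zipf's approximation against a corpus; `corpus.txt` is a hypothetical plain-text English corpus you would supply:

```python
import re
from collections import Counter

def zipf_check(text, ranks=(1, 2, 10, 100)):
    """Compare empirical word probabilities against Zipf's approximation 0.1/n."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    total = len(words)
    for n in ranks:
        word, c = counts[n - 1]
        print(f"rank {n}: {word!r} empirical {c / total:.4f} vs Zipf {0.1 / n:.4f}")

# On a large English corpus, rank 1 ('the') comes out near .071 and
# rank 2 ('of') near .034, matching the slide.
zipf_check(open("corpus.txt").read())
```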
Inferences
• The above equation definitely cannot hold indefinitely, since we know that $\sum_n 1/n$ diverges
• We will assume that the equation holds up to the rank at which the total probability reaches 1, and that the remaining probabilities are 0
• Calculations have determined that this occurs at word rank 8,727
• The entropy is then: $H = -\sum_{n=1}^{8727} p_n \log_2 p_n = 11.82$ bits per word
• This is 11.82/4.5 = 2.62 bits per letter, since the average word in English is 4.5 letters long
Entropy for various world languages
• From the data we can infer that English has the lowest entropy and Finnish the highest
• But all the languages have comparable entropy when we take Shannon's experiment into consideration
Finnish (fi), German (de), Swedish (sv), Dutch (nl), English (en), Italian (it), French (fr), Spanish (es), Portuguese (pt) and Greek (el)
Based on Word-length entropies and correlations of natural language written texts. 2014
Zipf-like plots for various world languages
Based on Word-length entropies and correlations of natural language written texts. 2014
Entropy of Telugu Language
• Indian languages are highly phonetic, which makes the computation of their entropy a rather difficult task
• Thus the entropy of Telugu has been calculated by converting it into English letters and using Shannon's experiment. The entropy is calculated in 2 ways:
1. Converting into English and then treating the result as English letters
2. Converting into English and then treating the result as Telugu letters
Based on Entropy of Telugu. Venkata Ravinder Paruchuri. 2011
Telugu Language Entropy
Based on Entropy of Telugu. Venkata Ravinder Paruchuri. 2011
Inferences
• The entropy of Telugu is higher than that of English, which means that Telugu is more succinct than English: each syllable in Telugu (as in other Indian languages) contains more information than one in English
Indus Script
• Very little is known about the script from ancient times
• No firm inference has been made about whether it is a linguistic script or not
• But from the adjacent diagram we can see that the Indus script lies near most of the world languages
• We can thus infer that the Indus script may well be a linguistic script, but we have no solid proof of it
Based on Entropy, the Indus Script and Language: A Reply to R. Sproat
Transliterability and Transliteration Performance
• We need to keep in mind that transliteration between a pair of languages is a non-trivial task:
1. The phonemic sets of the two languages are rarely the same
2. The mapping between phonemes and graphemes in the respective languages is rarely one-to-one
• Some language pairs have near-equal phoneme sets and an almost one-to-one mapping between their character sets
• Some might share similar, but unequal, phoneme sets yet similar orthography, possibly due to common evolution, such as Hindi and Kannada (many phonological features borrowed from Sanskrit)
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
Transliterability Measure
• A measure of the ease of transliterability among languages should have the following desirable qualities:
1. Rely purely on orthographic features of the language (easily calculated from parallel names corpora)
2. Capture and weigh the inherent ambiguity in transliteration at the character level (i.e., the average number of character mappings)
3. Weigh the ambiguous transitions for a given character according to the transition frequencies, since highly ambiguous mappings may occur only rarely
• The transliterability measure Weighted Average Entropy (WAVE) does this work
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
WAVE
• WAVE depends on the n-gram used as the unit of source and target language names
• We denote WAVE1, WAVE2, WAVE3 for unigrams, bigrams and trigrams:
$\mathrm{WAVE} = \sum_{i \in \text{alphabet}} \frac{\text{frequency}(i)}{\sum_{j \in \text{alphabet}} \text{frequency}(j)} \times \mathrm{Entropy}(i)$
$\mathrm{Entropy}(i) = -\sum_{k \in \text{Mappings}(i)} P(k \mid i) \log P(k \mid i)$
where alphabet = set of uni-, bi- or tri-grams
i, j = source language units (uni-, bi- or tri-grams)
k = target language unit (uni-, bi- or tri-grams)
Mappings(i) = set of target language uni-, bi- or tri-grams that i can map to
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
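A minimal Python sketch of WAVE computed from character-aligned name pairs; the alignment format and the toy data are assumptions for illustration:

```python
import math
from collections import Counter, defaultdict

def wave(aligned_pairs):
    """WAVE = sum_i [freq(i) / sum_j freq(j)] * Entropy(i), where Entropy(i)
    is the entropy of the target units that source unit i maps to.
    aligned_pairs: iterable of (source_unit, target_unit) alignments."""
    freq = Counter(src for src, _ in aligned_pairs)
    mappings = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        mappings[src][tgt] += 1
    total = sum(freq.values())
    score = 0.0
    for i, f_i in freq.items():
        n_i = sum(mappings[i].values())
        entropy_i = -sum((c / n_i) * math.log2(c / n_i)
                         for c in mappings[i].values())
        score += (f_i / total) * entropy_i
    return score

# Toy English->Hindi unigram alignments: 'c' maps ambiguously, 'p' does not
pairs = [("c", "स"), ("c", "क"), ("c", "क"), ("p", "प"), ("p", "प")]
print(wave(pairs))  # only the ambiguous 'c' contributes entropy
```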
Motivation
• From the adjacent table we can conclude that the unigram 'a' occurs nearly 150 times more frequently than the unigram 'x'
• This implies that capturing the ambiguities of 'a' will be more beneficial than capturing those of 'x'
• The term frequency(i) captures this effect
• Table IV shows the mappings from the source to the target language
• We can observe that the unigram c maps to 2 characters, स and क
• Whereas p has only one mapping, प
• The term Entropy(i) captures this information and ensures that c is weighted more than p
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
Plot Between WAVE and Transliteration Quality
• The following plots are drawn between log(WAVE) and the accuracy measure (for approximately 15k training corpora) for the language pairs En-Hi, En-Ka, En-Ma, Hi-En, Ka-En, Hi-Ka, Ma-Ka
• We can see that as the value of WAVE increases, the accuracy decreases exponentially
• The top-left points in each of the plots belong to the Hindi-Marathi pair, two languages that share the same orthography and have largely one-to-one character mappings between them
• We can observe that the different n-grams give almost similar results, which means we can choose the unigram model as the general model
• Based on these observations we can term two languages with a small WAVE1 measure as more easily transliterable
Conclusions
• In this presentation we introduced the concept of Machine Transliteration.
• We looked over the classical methods of transliteration, from the phoneme-based transliteration models to combined CRF models.
• We introduced the concept of entropy and its usefulness in transliteration models.
• We studied how phonology and syllabification help in creating chunks which are useful for transliteration.
• We introduced the concept of WAVE, which determines the ease of transliteration between a pair of languages.
Future Work
• As part of future work, I would evaluate transliteration performance between languages based on entropy measures between them.
• Find a measure like WAVE which could be used to estimate transliterability performance without performing actual transliteration.
References
• Report of NEWS 2012 Machine Transliteration Shared Task (2012), Min Zhang, Haizhou Li, A Kumaran and Ming Liu. ACL 2012
• A Comparison of Different Machine Transliteration Models (2006), Jong-Hoon Oh, Key-Sun Choi and Hitoshi Isahara. Journal of Artificial Intelligence Research
• Machine Transliteration (1997), Kevin Knight and Jonathan Graehl. Computational Linguistics
• Improving back-transliteration by combining information sources (2004), Bilac, S., & Tanaka, H. In Proceedings of IJCNLP 2004, pp. 542-547
• An English-Korean transliteration model using pronunciation and contextual rules (2002), Oh, J. H., & Choi, K. S. In Proceedings of COLING 2002, pp. 758-764
• Everybody loves a rich cousin: An empirical study of transliteration through bridge languages (2010), Mitesh M. Khapra, A Kumaran, Pushpak Bhattacharyya
• Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs (2011), Ying Qin, Guohua Chen
• Hindi to English Machine transliteration model of named entities using CRFs (2012), Manikrao, Shantanu and Tushar. International Journal of Computer Applications (0975-8887), June 2012
References (2)
• Linguistics: An Introduction to Language and Communication (5th Edition), Adrian Akmajian, Richard A. Demers, Ann K. Farmer, Robert M. Harnish. MIT Press, 2001
• A Mathematical Theory of Communication (1948), C. E. Shannon. The Bell System Technical Journal, July 1948
• Prediction and Entropy of Printed English, C. E. Shannon. The Bell System Technical Journal, January 1951
• Word-length entropies and correlations of natural language written texts, Maria Kalimeri, Vassilios Constantoudis, Constantinos Papadimitriou. arXiv, 2014
• Entropy of Telugu, Venkata Ravinder Paruchuri. 2011
• Entropy, the Indus Script and Language: A Reply to R. Sproat, Rajesh P. N. Rao, Nisha Yadav, Mayank Vahia, Hrishikesh Joglekar, et al. Computational Linguistics 36(4), 2010
• Compositional Machine Transliteration, A. Kumaran, Mitesh M. Khapra and Pushpak Bhattacharyya. Transactions on Asian Language Information Processing (TALIP Journal), September 2010
• Wikipedia articles
Extra Slides
Explanation of property 3
• Assume an event with 3 outcomes whose probabilities are 1/2, 1/3 and 1/6. We can measure its entropy directly, or decompose the choice into two successive choices: first a choice between two options with probabilities 1/2 and 1/2, then, in the second case, a further choice with probabilities 2/3 and 1/3 (giving overall probabilities 1/2, 1/3 and 1/6 again).
[Diagram: a single three-way branch (1/2, 1/3, 1/6) redrawn as a two-way branch (1/2, 1/2) followed by a second two-way branch (2/3, 1/3)]
$H(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}) = H(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{1}{2} H(\tfrac{2}{3}, \tfrac{1}{3})$
Based on Shannon's: A Mathematical Theory of Communication
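A quick numeric check of this decomposition:

```latex
\begin{align*}
H(\tfrac12,\tfrac13,\tfrac16)
  &= \tfrac12\log_2 2 + \tfrac13\log_2 3 + \tfrac16\log_2 6 \approx 1.459 \\
H(\tfrac12,\tfrac12) + \tfrac12\,H(\tfrac23,\tfrac13)
  &= 1 + \tfrac12\left(\tfrac23\log_2\tfrac32 + \tfrac13\log_2 3\right)
   \approx 1 + \tfrac12(0.918) \approx 1.459
\end{align*}
```

Both sides agree, as property 3 requires.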