BTP Stage 1: Machine Transliteration & Entropy
Final Presentation
Bhargava Reddy
110050078
Contents
• What is Machine Transliteration?
• Classical Methods of Transliteration
• CRF and its use in Transliteration
• Study of Transliteration through Bridging Languages
• Study of Entropy in Information Theory
• Entropy of a Language
• Transliterability and Transliteration Performance
• Conclusion and Future Works
Definition of Machine Transliteration
• Conversion of a given name in the source language to a name in the target language such that the target name:
1. Is phonemically equivalent to the source name
2. Conforms to the phonology of the target language
3. Matches the user intuition of the equivalent of the source language name in the target language, considering the culture and orthographic character usage in the target language
• Note that each of these criteria captures equivalence in its own way
Ref: Report of NEWS 2012 Machine Transliteration Shared Task
Classical Machine Transliteration Models
• Work on machine transliteration started between 1995 and 2000.
• Four classical machine transliteration models have been proposed:
1. Grapheme-based transliteration model (ψG)
2. Phoneme-based transliteration model (ψP)
3. Hybrid transliteration model (ψH)
4. Correspondence-based transliteration model (ψC)
Ref: A Comparison of Different Machine Transliteration Models (2006)
Grapheme and Phoneme
Phoneme: the smallest contrastive linguistic unit which may bring about a change of meaning. "Kiss" and "kill" are two completely contrasting words; the phonemes /s/ and /l/ make the difference.
Grapheme: the smallest semantically distinguishing unit in a written language, analogous to the phoneme of spoken language. A grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme.
Phoneme-based Transliteration Model
• This model is basically a source grapheme → source phoneme and source phoneme → target grapheme transformation
• It was first proposed by Knight and Graehl in 1997
• They modelled it for English-Japanese and Japanese-English transliteration
• In these methods the main transliteration key is pronunciation (the source phoneme) rather than spelling (the source grapheme)
Katakana Words and the Japanese Language
• Katakana words are words imported from other languages (primarily English)
• Katakana raises a number of issues where pronunciation is concerned
• In Japanese, L and R are pronounced the same (as something in between)
• The same goes for H and F
Katakana Words
• Golf bag is pronounced as go-ru-hu-ba-ggu (ゴルフバッグ)
• Johnson is pronounced as jyo-n-s-o-n (ジョンソン)
• Ice cream is pronounced as a-i-su-ku-ri-i-mu (アイスクリーム)
• What have we observed in the transliteration?
• There has been a lot of information loss in the process of conversion from English to Japanese
• So when we do the back-transliteration we may run into trouble
The steps to convert from English to Katakana
We do the following steps for the conversion, each modelled as a probability in Knight and Graehl's formulation:
1. P(w): generates written English word sequences
2. P(e|w): pronounces English word sequences
3. P(j|e): converts English sounds into Japanese sounds
4. P(k|j): converts Japanese sounds to katakana writing
5. P(o|k): introduces misspellings caused by Optical Character Recognition
Fixing Back-Transliteration
• We apply the five steps above for converting an English word to Japanese
• Knight and Graehl used Bayes' theorem to do the reverse
• For a given katakana word $o$, they maximized the following expression over English word sequences $w$:
$\arg\max_{w} P(w) \sum_{e,j,k} P(e \mid w)\, P(j \mid e)\, P(k \mid j)\, P(o \mid k)$
• P(w) is implemented using a WFSA (Weighted Finite State Acceptor), with weights and symbols on the transitions making some output sequences more likely than others
• The other probabilities are implemented using WFSTs (Weighted Finite State Transducers), with a pair of symbols on each transition, one input and one output. Inputs and outputs may include the empty symbol ϵ
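To make the decomposition concrete, here is a minimal Python sketch that scores candidate English words for an observed katakana romanization under the factored noisy-channel model. The probability tables are toy values, and the intermediate stages are collapsed into a single hypothetical channel table; a real system would compose WFSTs rather than enumerate candidates.

```python
import math

# Toy prior over English words: the P(w) component (the WFSA)
P_w = {"golf bag": 0.6, "gulf bug": 0.4}

# Hypothetical collapsed channel: P(o|w) stands in for the chained sum
# over pronunciations e, Japanese sounds j and katakana spellings k:
# sum_{e,j,k} P(e|w) P(j|e) P(k|j) P(o|k)
P_o_given_w = {
    ("go-ru-hu-ba-ggu", "golf bag"): 0.05,
    ("go-ru-hu-ba-ggu", "gulf bug"): 0.01,
}

def back_transliterate(observed, candidates):
    """Pick the candidate w maximizing P(w) * P(o|w); P(o) is constant."""
    def log_score(w):
        return math.log(P_w[w]) + math.log(P_o_given_w.get((observed, w), 1e-12))
    return max(candidates, key=log_score)

print(back_transliterate("go-ru-hu-ba-ggu", ["golf bag", "gulf bug"]))
# -> golf bag
```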
Example for Back Transliteration
• Consider the Japanese word マスターズトーナメント
• This is first identified by OCR, which may introduce errors
• It is then passed to the P(k|j) model, which converts it to the Japanese sound sequence:
m a s u t a a z u t o o ch i m e n t o
• Next comes the P(j|e) model, which converts it to English sounds:
M AE S T AE AR DH UH T AO CH IH M EH N T AO
• Then the P(e|w) model converts the sounds to word candidates such as:
Masters tone am ent awe
• Finally, using the P(w) model, the word becomes: Masters Tournament
Study of MT through Bridging Languages
Data is available between a language pair due to one of the following three reasons:
1. Politically related languages: due to the political dominance of English, it is easy to obtain parallel names data between English and most languages
2. Genealogically related languages: languages sharing the same origin, which might have significant overlap between their phonemes and graphemes
3. Demographically related languages: e.g. Hindi and Telugu, which might not share the same origin, but due to shared culture and demographics there will be similarities
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Bridge Transliteration Methodology
• Assume X is the source language, Y the target language and Z the pivot language. Then:
$P(y \mid x) = \sum_{z} P(y \mid z, x)\, P(z \mid x) \approx \sum_{z} P(y \mid z)\, P(z \mid x)$
• In the above expression we assume that Y is independent of X given Z
• The linguistic intuition behind this assumption is that the top-k outputs of the X → Z system corresponding to a string in X capture all the transliteration information necessary for transliterating to Y
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
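A minimal Python sketch of this composition, assuming hypothetical `x_to_z` and `z_to_y` functions that each return ranked (candidate, probability) lists from already-trained systems:

```python
from collections import defaultdict

def bridge_transliterate(x, x_to_z, z_to_y, k=10):
    """Compose two systems through a pivot language Z:
    P(y|x) ~ sum over top-k pivots z of P(y|z) * P(z|x),
    using the assumption that Y is independent of X given Z."""
    scores = defaultdict(float)
    for z, p_z in x_to_z(x)[:k]:       # top-k candidates in the pivot
        for y, p_y in z_to_y(z)[:k]:   # top-k candidates in the target
            scores[y] += p_y * p_z
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy stand-ins for the trained X->Z and Z->Y systems
x_to_z = lambda x: [("raam", 0.7), ("ram", 0.3)]
z_to_y = lambda z: [(z.upper(), 0.9), (z.title(), 0.1)]
print(bridge_transliterate("राम", x_to_z, z_to_y))
```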
Results for the Bridge System
• We must remember that machine transliteration is a lossy conversion
• In the bridge system we can therefore expect a loss of information, and thus the accuracy scores will drop
• The results have shown a drop in accuracy of about 8-9% (ACC-1) and about 1-3% (mean F-score)
• The NEWS 2009 dataset was used for the training and evaluation of these results
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
Stepping through an intermediate language
Ref: Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. NAACL 2010
What is Entropy?
• Entropy is the amount of information obtained in each message received
• It characterizes our uncertainty about our source of information (its randomness)
• It is the expected value of the information content of a random variable
Based on Shannon's: A Mathematical Theory of Communication
Properties and Mathematical Formulation
• Assume a set of events whose probabilities of occurrence are $p_1, p_2, p_3, \ldots, p_n$
• $H(p_1, \ldots, p_n)$ is the entropy and satisfies the following properties:
1. H should be continuous in the $p_i$
2. If all the $p_i$ are equal, $p_i = 1/n$, then H should be a monotonically increasing function of n
3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H
Based on Shannon's: A Mathematical Theory of Communication
The Formula for Entropy
$H = -K \sum_{i=1}^{n} p_i \log p_i$
where K is a positive constant
Note: up to the choice of K, this is the only formula satisfying the properties mentioned
Based on Shannon's: A Mathematical Theory of Communication
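A minimal Python sketch of this formula, taking K = 1 and log base 2 so the entropy comes out in bits:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum(p * log p) of a discrete distribution."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))        # 1.0 bit: a fair coin
print(entropy([1/2, 1/3, 1/6]))   # ~1.459 bits: the example in the extra slides
```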
Motivation for Entropy in English
• Aoccdrnig to rseearch at an Elingsh uinervtisy, it deosn't mttaer in waht odrer the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by it slef but the wrod as a wlohe.
Entropy of Language
• If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average number of binary digits required per letter of the original language
• The redundancy measures the amount of constraint imposed on a text in the language due to its statistical structure
• E.g. in English: the high frequency of the letter E, the strong tendency of H to follow T, or of U to follow Q
Based on Shannon's: Prediction and Entropy of Printed English
Entropy calculation from the statistics of English
• Entropy H can be calculated by a series of approximations $F_0, F_1, F_2, \ldots$
• Each value successively takes more and more of the statistics of the language into account and approaches H as a limit
• $F_N$ is called the N-gram entropy: it measures the amount of information or entropy due to statistics extending over N adjacent letters of text
Based on Shannon's: Prediction and Entropy of Printed English
Entropy of English
$F_N = -\sum_{i,j} p(b_i, j) \log_2 p(j \mid b_i)$
Here: $b_i$ is a block of N-1 letters
$j$ is an arbitrary letter following $b_i$
$p(b_i, j)$ is the probability of the N-gram $b_i, j$
$p(j \mid b_i)$ is the conditional probability of the letter $j$ after the block $b_i$, which is given by $p(b_i, j) / p(b_i)$
Based on Shannon's: Prediction and Entropy of Printed English
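A minimal sketch that estimates F_N from raw text by counting N-grams; the toy corpus below stands in for the large frequency tables Shannon worked from:

```python
import math
from collections import Counter

def ngram_entropy(text, n):
    """Estimate F_N = -sum p(b,j) log2 p(j|b), with b a block of n-1 letters."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    blocks = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    total = sum(ngrams.values())
    f_n = 0.0
    for gram, count in ngrams.items():
        p_ngram = count / total                 # p(b, j)
        p_cond = count / blocks[gram[:-1]]      # p(j | b) = p(b, j) / p(b)
        f_n -= p_ngram * math.log2(p_cond)
    return f_n

text = "thequickbrownfoxjumpsoverthelazydog" * 100
print(ngram_entropy(text, 1), ngram_entropy(text, 2))  # F1 and F2 estimates
```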
Interpretation of the equation
• $F_N$ is the average uncertainty (conditional entropy) of the next letter j when the preceding N-1 letters are known
• As N is increased, $F_N$ includes longer- and longer-range statistics, and the entropy H is given by the limiting value of $F_N$ as $N \to \infty$
Based on Shannon's: Prediction and Entropy of Printed English
Calculations of the $F_N$
• If we ignore spaces and punctuation and treat all 26 letters as equally probable, then $F_0 = \log_2 26 \approx 4.7$ bits per letter
• $F_1$ involves just the letter frequencies and can thus be calculated as $F_1 = -\sum_i p(i) \log_2 p(i) \approx 4.14$ bits per letter
• Under the di-gram assumption, $F_2 = -\sum_{i,j} p(i,j) \log_2 p(j \mid i) \approx 3.56$ bits per letter
Based on Shannon's: Prediction and Entropy of Printed English
Letter Frequencies in English
Source: Wikipedia’s article on letter frequency in English
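As a quick check of the F1 figure, here is a sketch that computes the letter-frequency entropy from the approximate percentages in Wikipedia's table (the exact frequencies Shannon used differ slightly):

```python
import math

# Approximate English letter frequencies (percent), from Wikipedia's table
freq = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
        's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
        'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
        'p': 1.9, 'b': 1.5, 'v': 0.98, 'k': 0.77, 'j': 0.15, 'x': 0.15,
        'q': 0.095, 'z': 0.074}
total = sum(freq.values())
f1 = -sum((v / total) * math.log2(v / total) for v in freq.values())
print(f1)  # ~4.18 bits per letter, close to Shannon's F1 = 4.14
```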
Calculation of higher $F_N$
• A similar calculation for $F_3$ gives the value 3.3 bits per letter
• Tables of N-gram frequencies are not available for N > 3, so $F_4, F_5, F_6$ cannot be calculated the same way
• Word frequencies are used to assist in such situations
• Let us look at a log-log plot of word probability against frequency rank
Based on Shannon's: Prediction and Entropy of Printed English
Word Frequencies
• The most frequent word in English, "the", has probability .071 and is plotted against rank 1
• The next one is "of", with .034, plotted against rank 2, and so on
• Using log scales for both the x and y axes, we observe a straight line
• Thus, if $p_n$ is the frequency of the nth most frequent word, we have roughly $p_n \approx 0.1/n$
• The above equation is termed Zipf's law
Based on Shannon's: Prediction and Entropy of Printed English
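A small sketch for checking Zipf's approximation against a corpus; `corpus.txt` is a hypothetical plain-text English corpus you would supply:

```python
import re
from collections import Counter

def zipf_check(text, ranks=(1, 2, 10, 100)):
    """Compare empirical word probabilities against Zipf's approximation 0.1/n."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    total = len(words)
    for n in ranks:
        word, c = counts[n - 1]
        print(f"rank {n}: {word!r} empirical {c / total:.4f} vs Zipf {0.1 / n:.4f}")

# On a large English corpus, rank 1 ('the') comes out near .071 and
# rank 2 ('of') near .034, matching the slide.
zipf_check(open("corpus.txt").read())
```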
Inferences
• The above equation definitely cannot hold indefinitely, since we know that $\sum_n 1/n$ diverges
• We will assume that the equation holds up to the rank at which the total probability reaches 1, and that the remaining probabilities are 0
• Calculations have determined that this occurs at word rank 8,727
• The entropy is then: $H = -\sum_{n=1}^{8727} p_n \log_2 p_n = 11.82$ bits per word
• This is 11.82/4.5 = 2.62 bits per letter, since the average word in English is 4.5 letters long
Entropy for various world languages
• From the data we can infer that English has the lowest entropy and Finnish the highest
• But all the languages have comparable entropy when we take Shannon's experiment into consideration
Finnish (fi), German (de), Swedish (sv), Dutch (nl), English (en), Italian (it), French (fr), Spanish (es), Portuguese (pt) and Greek (el)
Based on Word-length entropies and correlations of natural language written texts. 2014
Zipf-like plots for various world languages
Based on Word-length entropies and correlations of natural language written texts. 2014
Entropy of Telugu Language
• Indian languages are highly phonetic, which makes the computation of their entropy a rather difficult task
• Thus the entropy of Telugu has been calculated by converting it into English letters and using Shannon's experiment. The entropy is calculated in 2 ways:
1. Converting into English and then treating the result as English letters
2. Converting into English and then treating the result as Telugu letters
Based on Entropy of Telugu. Venkata Ravinder Paruchuri. 2011
Telugu Language Entropy
Based on Entropy of Telugu. Venkata Ravinder Paruchuri. 2011
Inferences
• The entropy of Telugu is higher than that of English, which means that Telugu is more succinct than English: each syllable in Telugu (as in other Indian languages) contains more information than one in English
Indus Script
• Very little is known about the script from ancient times
• No firm inference has been made about whether it is a linguistic script or not
• But from the adjacent diagram we can see that the Indus script lies near most of the world languages
• We can thus infer that the Indus script may well be a linguistic script, but we have no solid proof of it
Based on Entropy, the Indus Script and Language: A Reply to R. Sproat
Transliterability and Transliteration Performance
• We need to keep in mind that transliteration between a pair of languages is a non-trivial task:
1. The phonemic sets of the two languages are rarely the same
2. The mapping between phonemes and graphemes in the respective languages is rarely one-to-one
• Some language pairs have near-equal phoneme sets and an almost one-to-one mapping between their character sets
• Some might share similar, but unequal, phoneme sets yet similar orthography, possibly due to common evolution, such as Hindi and Kannada (many phonological features borrowed from Sanskrit)
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
Transliterability Measure
• A measure of the ease of transliterability among languages should have the following desirable qualities:
1. Rely purely on orthographic features of the language (easily calculated from parallel names corpora)
2. Capture and weigh the inherent ambiguity in transliteration at the character level (i.e., the average number of character mappings)
3. Weigh the ambiguous transitions for a given character according to the transition frequencies, since highly ambiguous mappings may occur only rarely
• The transliterability measure Weighted Average Entropy (WAVE) does this work
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
WAVE
• WAVE depends on the n-gram used as the unit of source and target language names
• We denote WAVE1, WAVE2, WAVE3 for unigrams, bigrams and trigrams:
$\mathrm{WAVE} = \sum_{i \in \text{alphabet}} \frac{\text{frequency}(i)}{\sum_{j \in \text{alphabet}} \text{frequency}(j)} \times \mathrm{Entropy}(i)$
$\mathrm{Entropy}(i) = -\sum_{k \in \text{Mappings}(i)} P(k \mid i) \log P(k \mid i)$
where alphabet = set of uni-, bi- or tri-grams
i, j = source language units (uni-, bi- or tri-grams)
k = target language unit (uni-, bi- or tri-grams)
Mappings(i) = set of target language uni-, bi- or tri-grams that i can map to
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
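A minimal Python sketch of WAVE computed from character-aligned name pairs; the alignment format and the toy data are assumptions for illustration:

```python
import math
from collections import Counter, defaultdict

def wave(aligned_pairs):
    """WAVE = sum_i [freq(i) / sum_j freq(j)] * Entropy(i), where Entropy(i)
    is the entropy of the target units that source unit i maps to.
    aligned_pairs: iterable of (source_unit, target_unit) alignments."""
    freq = Counter(src for src, _ in aligned_pairs)
    mappings = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        mappings[src][tgt] += 1
    total = sum(freq.values())
    score = 0.0
    for i, f_i in freq.items():
        n_i = sum(mappings[i].values())
        entropy_i = -sum((c / n_i) * math.log2(c / n_i)
                         for c in mappings[i].values())
        score += (f_i / total) * entropy_i
    return score

# Toy English->Hindi unigram alignments: 'c' maps ambiguously, 'p' does not
pairs = [("c", "स"), ("c", "क"), ("c", "क"), ("p", "प"), ("p", "प")]
print(wave(pairs))  # only the ambiguous 'c' contributes entropy
```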
Motivation
• From the adjacent table we can conclude that the unigram 'a' occurs nearly 150 times more frequently than the unigram 'x'
• This implies that capturing the ambiguities of 'a' will be more beneficial than capturing those of 'x'
• The term frequency(i) captures this effect
• Table IV shows the mappings from the source to the target language
• We can observe that the unigram c maps to 2 characters, स and क
• Whereas p has only one mapping, प
• The term Entropy(i) captures this information and ensures that c is weighted more than p
Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharyya
Plot Between WAVE and Transliteration Quality
• The following plots are drawn between log(WAVE) and the accuracy measure (for approximately 15k training corpora) for the language pairs En-Hi, En-Ka, En-Ma, Hi-En, Ka-En, Hi-Ka, Ma-Ka
• We can see that as the value of WAVE increases, the accuracy decreases exponentially
• The top-left points in each of the plots belong to the Hindi-Marathi pair, two languages that share the same orthography and have largely one-to-one character mappings between them
• We can observe that the different n-grams give almost similar results, which means we can choose the unigram model as the general model
• Based on these observations we can term two languages with a small WAVE1 measure as more easily transliterable
Conclusions
• In this presentation we introduced the concept of Machine Transliteration.
• We looked over the classical methods of transliteration, from the phoneme-based transliteration models to combined CRF models.
• We introduced the concept of entropy and its usefulness in transliteration models.
• We studied how phonology and syllabification help in creating chunks which are useful for transliteration.
• We introduced the concept of WAVE, which determines the ease of transliteration between a pair of languages.
Future Work
• As part of future work, I would evaluate transliteration performance between languages based on entropy measures between them.
• Find a measure like WAVE which could be used to estimate transliterability performance without performing actual transliteration.
References
• Report of NEWS 2012 Machine Transliteration Shared Task (2012), Min Zhang, Haizhou Li, A Kumaran and Ming Liu. ACL 2012
• A Comparison of Different Machine Transliteration Models (2006), Jong-Hoon Oh, Key-Sun Choi and Hitoshi Isahara. Journal of Artificial Intelligence Research
• Machine Transliteration (1997), Kevin Knight and Jonathan Graehl. Computational Linguistics
• Improving back-transliteration by combining information sources (2004), Bilac, S., & Tanaka, H. In Proceedings of IJCNLP 2004, pp. 542-547
• An English-Korean transliteration model using pronunciation and contextual rules (2002), Oh, J. H., & Choi, K. S. In Proceedings of COLING 2002, pp. 758-764
• Everybody loves a rich cousin: An empirical study of transliteration through bridge languages (2010), Mitesh M. Khapra, A Kumaran, Pushpak Bhattacharyya
• Forward-Backward Machine Transliteration between English and Chinese based on Combined CRFs (2011), Ying Qin, Guohua Chen
• Hindi to English Machine transliteration model of named entities using CRFs (2012), Manikrao, Shantanu and Tushar. International Journal of Computer Applications (0975-8887), June 2012
References (2)
• Linguistics: An Introduction to Language and Communication (5th Edition), Adrian Akmajian, Richard A. Demers, Ann K. Farmer, Robert M. Harnish. MIT Press, 2001
• A Mathematical Theory of Communication (1948), C. E. Shannon. The Bell System Technical Journal, July 1948
• Prediction and Entropy of Printed English, C. E. Shannon. The Bell System Technical Journal, January 1951
• Word-length entropies and correlations of natural language written texts, Maria Kalimeri, Vassilios Constantoudis, Constantinos Papadimitriou. arXiv, 2014
• Entropy of Telugu, Venkata Ravinder Paruchuri. 2011
• Entropy, the Indus Script and Language: A Reply to R. Sproat, Rajesh P. N. Rao, Nisha Yadav, Mayank Vahia, Hrishikesh Joglekar, et al. Computational Linguistics 36(4), 2010
• Compositional Machine Transliteration, A. Kumaran, Mitesh M. Khapra and Pushpak Bhattacharyya. Transactions on Asian Language Information Processing (TALIP Journal), September 2010
• Wikipedia articles
Extra Slides
Explanation of property 3
• Assume an event with 3 outcomes whose probabilities are 1/2, 1/3 and 1/6. We can measure its entropy directly, or decompose the choice into two successive choices: first a choice between two options with probabilities 1/2 and 1/2, then, in the second case, a further choice with probabilities 2/3 and 1/3 (giving overall probabilities 1/2, 1/3 and 1/6 again).
[Diagram: a single three-way branch (1/2, 1/3, 1/6) redrawn as a two-way branch (1/2, 1/2) followed by a second two-way branch (2/3, 1/3)]
$H(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}) = H(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{1}{2} H(\tfrac{2}{3}, \tfrac{1}{3})$
Based on Shannon's: A Mathematical Theory of Communication
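A quick numeric check of this decomposition:

```latex
\begin{align*}
H(\tfrac12,\tfrac13,\tfrac16)
  &= \tfrac12\log_2 2 + \tfrac13\log_2 3 + \tfrac16\log_2 6 \approx 1.459 \\
H(\tfrac12,\tfrac12) + \tfrac12\,H(\tfrac23,\tfrac13)
  &= 1 + \tfrac12\left(\tfrac23\log_2\tfrac32 + \tfrac13\log_2 3\right)
   \approx 1 + \tfrac12(0.918) \approx 1.459
\end{align*}
```

Both sides agree, as property 3 requires.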