Machine Translation: Breaking the Communication Barrier
By Dr. Gurpreet S. Josan, Punjabi University, Patiala


Post on 24-Feb-2016


Communication

Communication is the activity of conveying meaningful information.

Communication requires a sender, a message, and an intended recipient.

The communication process is complete once the receiver has understood the sender.

Mode of Communication

Nonverbal communication: includes gesture, body language or posture, and facial expression.
Visual communication: includes signs, typography, drawing, colours, etc.
Oral communication: spoken verbal communication.
Written communication: includes alphabets, symbols, grammar, etc.

Barrier in Communication

The Effect

Language is a barrier to information dissemination.
All the major sources of information and discoveries are in English.
We are unable to reach the masses in rural areas who do not know English.

You're a scientist who has just hit upon a revolutionary new idea. How do you find out whether a scientist anywhere in the world has already filed a patent on a similar idea in their native language?

Who Can Help?

A translator (manual): too slow, limited availability, costly, accurate.

A machine: fast, economical, not accurate.

Issues

kJfmmfj mmmvvv nnnffn333Uj iheale eleee mnster vensi credurBaboi oi cestnitze Coovoel2^ ekk; ldsllk lkdf vnnjfj?Fgmflmllk mlfm kfre xnnn!

Issues

Computers lack knowledge!
Computers see text in English the same way you saw the previous text!

People have no trouble understanding language because they have:
Common-sense knowledge
Reasoning capacity
Experience

Computers have:
No common-sense knowledge
No reasoning capacity

Issues

Which ones are going to be difficult for computers to deal with? Grammar or Lexicon?

Grammar (rules for putting words together into sentences)
How many rules are there? 100, 1000, 10000, more?
Do we have all the rules written down somewhere?

Lexicon (dictionary)
How many words do we need to know? 1000, 10000, 100000?

Issues

"The dog ate my homework": who did what to whom?
Identify the part of speech (POS): dog = noun; ate = verb; homework = noun.
English POS tagging: 95%.
Try to tag this text manually: "I can, can the can."

2. Identify collocations: mother-in-law, hot dog.

Issues

Seemingly similar sentences may differ radically in meaning:
The CEO was fired up about his new role.
The CEO was fired from his new role.

Seemingly different sentences can have the same meaning:
IBM's PC division was acquired by Lenovo.
Lenovo bought the PC division of IBM.

Issues

Ambiguity

Structural ambiguity: "I saw the man with the telescope."

Word-level ambiguity

Ambiguity

Various meanings of a word in Punjabi

Ambiguity

If more than one ambiguous word is present in a sentence, the number of potential interpretations of the sentence explodes: it is the product of the numbers of possible meanings of the individual words.

Consider a sentence in which only {va} and {pa} are ambiguous, and assume that each has 4 senses. This brings the number of possible interpretations to 4 × 4 = 16.

Ambiguity

Imagine what happens if there are more senses to be taken into account or if the sentence gets longer.
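The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `interpretation_count` and the sense counts in the example are assumptions, not part of the original slides.

```python
from math import prod


def interpretation_count(senses_per_word):
    """Number of readings of a sentence: the product of each word's sense count."""
    return prod(senses_per_word)


# Hypothetical sentence: one unambiguous word and two words with 4 senses each.
print(interpretation_count([1, 4, 4]))     # 16
# One more 4-sense word multiplies the total by 4 again.
print(interpretation_count([1, 4, 4, 4]))  # 64
```

Each additional ambiguous word multiplies the total, which is why longer sentences quickly become unmanageable without disambiguation.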

(Sukhbir) (has) (twine)

(thigh) (crease) (footpath) (sultriness) (destroy) (door-leaf) (silk)

Issues

Anaphora resolution:

The dog ate the bird and it died.

Gender Conversion

Idioms & Phrases

Issues

Named entity recognition: Dr. Plant Singh vs. Dr. Buta Singh

Foreign words

Spelling variation

Issues

Rhyming reduplication

Other issues

In Indian languages there is no fixed font. For example, a single word can be written in several different manners (the Gurmukhi examples are not preserved in this transcript).

Three MT Approaches: Direct, Transfer, Interlingual

[Figure: translation pyramid from source text to target text. Analysis side: morphological analysis, syntactic analysis, semantic analysis, semantic composition. Generation side: semantic decomposition, semantic generation, syntactic generation, morphological generation. Levels, from bottom to top: word structure (direct), syntactic structure (syntactic transfer), semantic structure (semantic transfer), interlingua.]

Direct MT: Pros and Cons

Pros: fast; simple; inexpensive; robust; no translation rules hidden in the lexicon.

Cons: unreliable; not powerful; rule proliferation; requires too much context; major restructuring after lexical substitution.

Transfer MT: Pros and Cons

Pros: no need to find a language-neutral representation; relatively fast.

Cons: large number of transfer rules, so difficult to extend; proliferation of language-specific rules in lexicon and syntax.

Interlingual MT: Pros and Cons

Pros: portable; lexical rules and structural transformations are stated more simply on a normalized representation; explanatory adequacy.

Cons: difficult to deal with terms; deciding what should be added is difficult; what will be the universal knowledge format, and how do we encode it?; concepts must be decomposed and reassembled.

Current Techniques

Corpus-based approaches

Statistical Machine Translation (SMT): every target-language string $t$ is a possible translation of a source string $s$. Every string is given a number called its probability, and we select the string that has maximum probability:

$$\hat{t} = \arg\max_{t} \left[ \Pr(t)\,\Pr(s \mid t) \right]$$

where $s$ is a source-language string and $t$ is a target-language string.

Estimating $\Pr(t)$, estimating $\Pr(s \mid t)$, and finding the maximizing $t$ are known as the language modeling problem, the translation modeling problem, and the search problem, respectively.

Current Techniques

Corpus-based approaches

Example-Based Machine Translation: translation by analogy.
The system is given a set of sentences in the source language and their corresponding translations in the target language.
The system uses those examples to translate other, similar source-language sentences into the target language.

Hybrid methods: a combination of rule-based and statistical methods.

Punjabi to Hindi Machine Translation

The Punjabi to Hindi machine translation system is a direct translation system based on various lexical resources and a rule base.

The system is modular, with clear separation of data from process.

The central idea is to select words from the source language and do the minimal analysis required, such as extracting the root word, the lexical category, and contextual information, i.e. the tokens to the left and right of the current token.

The word sense disambiguation module is called for ambiguous words.

Equivalents of each source token in the target language are looked up in the lexicon and substituted to produce the target-language text.

Rules are then applied to the output to make it appropriate for the target language.
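The word-by-word flow described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual code: the function name `translate_tokens`, the dictionary-based lexicons, the two callbacks, and the romanized placeholder words are all assumptions made for the example.

```python
AMB = "amb"  # marker assumed to sit in the target field of ambiguous entries


def translate_tokens(tokens, root_db, infl_db, disambiguate, transliterate):
    """Word-by-word direct translation of an already tokenized sentence."""
    output = []
    for i, token in enumerate(tokens):
        # Minimal context: the one or two tokens to the left of the current token.
        prev1 = tokens[i - 1] if i >= 1 else None
        prev2 = tokens[i - 2] if i >= 2 else None
        target = root_db.get(token)           # try the root-word table first
        if target is None:
            target = infl_db.get(token)       # then the inflected-form table
        if target is None:
            target = transliterate(token)     # out-of-vocabulary word
        elif target == AMB:
            target = disambiguate(prev2, prev1, token)
        output.append(target)
    return output


# Toy resources with romanized placeholder words (not real lexicon entries).
root_db = {"kamra": "kamra_hi", "de": AMB}
infl_db = {"kamre": "kamre_hi"}
result = translate_tokens(
    ["kamre", "de", "xyz"],
    root_db,
    infl_db,
    disambiguate=lambda prev2, prev1, tok: tok + "_default",
    transliterate=lambda tok: tok.upper(),
)
print(result)  # ['kamre_hi', 'de_default', 'XYZ']
```

Passing the disambiguation and transliteration steps in as callbacks mirrors the stated separation of data from process: the loop itself knows nothing about how either step works.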

Punjabi to Hindi Machine Translation: System Architecture

[Figure: system architecture. Pre-processing: normalized source text, tokenization, named entity recognition, repetitive construct handling. Translation engine: lexicon lookup (root word and inflectional form DB), with the decisions "Hit?" and "Ambiguous?"; ambiguity resolution (ambiguous word DB, bigram and trigram DB); transliteration; append to output and retrieve the next token if a token is present. Post-processing: post editing, target text.]

Why Direct System?

For a given language pair and text type, what kind of system is required is largely an empirical and practical question.

General requirements on MT systems, such as modularity, separation of data from processes, reusability of resources and modules, robustness, and corpus-based derivation of data, do not provide conclusive arguments for any one of the models.

The available resources are one of the key factors in deciding the approach.

Why Direct System?

In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis seems less convincing.

In view of the similarity between Punjabi and Hindi, a simpler, direct model is the obvious choice for a Punjabi to Hindi machine translation system.

The Lexicon

The lexicon contains information about the primary component of languages, i.e. words.

Most NLP applications use dictionaries. For example, morphological analyzers use a lexicon containing morphemes, tagging systems use probability data, parsers use lexical/semantic or co-occurrence information, and MT systems use translation memory and a transfer dictionary.

The Lexicon

The bilingual dictionary prepared by the LTRC department of IIIT Hyderabad, in ISCII format and containing about 22,000 entries, was adopted and extended for our system.

It was converted into Unicode format.

The entries were extended to about 33,000, covering almost all the root words of the Punjabi language.

Root Table
Field name    Field type
PW            Text
gnp           Text
cat           Text
HW            Text

[Sample rows of the Root Table (pw, gnp, cat, hw) are not preserved; the hw column holds Amb for an ambiguous word.]

The Lexicon

Inflectional Form Table: contains all the inflected forms of Punjabi root words along with their roots. The corresponding Hindi words are entered manually. It comprises about 65,000 entries.

Inflectional Form Table
Field name    Field type
PW            Text
ROOT          Text
HW            Text

where ROOT is one of the entries of the Root Table.

The Lexicon

For all ambiguous words in the root table and in the inflectional form table, the target-word entry contains the symbol amb. It triggers the disambiguation process for the given word. A table of ambiguous words is prepared for this purpose; it contains the most frequent meaning followed by all other possible meanings of a given word.

Ambiguous Word Table
Field name    Field type
PW            Text
s1            Text
s2            Text

[Sample rows (pw, s1, s2) are not preserved.]

The Lexicon

To help the disambiguation module, bigram and trigram tables are created.

They contain the context of ambiguous words along with their meaning in that context and the frequency obtained from a corpus of about 30 lakh (3 million) words.

Bigram Table
Field name    Field type
PREV1         Text
PW            Text
HW            Text
COUNT         Number

Trigram Table
Field name    Field type
PREV2         Text
PREV1         Text
PW            Text
HW            Text
COUNT         Number

[Sample rows: the word columns are not preserved; the bigram counts shown were 15, 18, 25, 22 and 7, and the trigram counts were 2, 3, 2, 19, 10 and 54.]

The Lexicon

The lexicon also contains a rule base.

It contains all the rules that handle grammatical dissimilarities between the two languages at the post-processing stage.

Replacement Table
Field name    Field type
orgtext       Text
reptxt        Text

Text Normalization

The text should be normalized, i.e. there should be only one way to represent a syllable.

Having several identical pieces of text represented by differing underlying byte sequences makes analysis of the text much more difficult.

For example, under the AnmolLipi font the Latin character A appears as one Gurmukhi glyph, whereas under the DrChatrikWeb font it appears as a different one.

This causes a problem while scanning a text.

Text Normalization

So the source text is normalized by converting it into Unicode format.

This gives a threefold advantage: first, it reduces the complexity of scanning the text; second, it helps in internationalizing the system; third, it eases the transliteration task.

Text Normalization

Spelling normalization: a word may be present in the database but with a different spelling, like {prkhia} [examination]. Only one variant may appear in the database while the others do not. The purpose of spelling normalization is to find the missing variant. The Soundex technique is used for spelling normalization.

Soundex

In this technique, a code is assigned to each character of the alphabet, and similar-sounding letters get the same code. Codes are then generated for each string. All strings with the same code are spelling variants of one another.

[Table: Soundex code assigned to each Gurmukhi character (codes c1 to c41; the halant receives no code). The character column is not preserved.]

Soundex

With this table, the code for {prkhia} comes out as c31c37sc13sc4, which enables the system to detect the variant present in the database. For example, if the database contains {prkhia} as a Punjabi word, the code c31c37sc13sc4 is stored against it. If a user enters a variant spelling that is not present in the database, its code is generated on the fly and looked up in the database. If the code appears there, the corresponding Punjabi word is selected as the spelling variant.
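The lookup just described might be sketched as below. Because the Gurmukhi code table is not preserved, the sketch substitutes a small, hypothetical grouping of similar-sounding Latin letters; `SOUND_CLASSES`, `soundex_code`, `build_index`, `find_variants` and the romanized words are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical grouping of similar-sounding Latin letters. The deck's real table
# covers Gurmukhi characters and is not reproduced here.
SOUND_CLASSES = ["bp", "dt", "gkq", "sz", "iy", "uw"]
CODE = {ch: f"c{n}" for n, group in enumerate(SOUND_CLASSES, start=1) for ch in group}


def soundex_code(word):
    """Join per-character codes; characters outside the table stand for themselves."""
    return ".".join(CODE.get(ch, ch) for ch in word.lower())


def build_index(words):
    """Store every database word under its code."""
    index = defaultdict(list)
    for word in words:
        index[soundex_code(word)].append(word)
    return index


def find_variants(word, index):
    """Return the stored words that share the query's code (empty list if none)."""
    return index.get(soundex_code(word), [])


index = build_index(["pariksha", "kamra"])
print(find_variants("parikzha", index))  # ['pariksha']
print(find_variants("gamra", index))     # ['kamra']
print(find_variants("xyz", index))       # []
```

Codes are computed once for the stored words and on the fly for the query, as in the description above.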

Word Sense Disambiguation

To resolve ambiguity, we make use of the information contained in the context, similar to what humans do.

The word sense disambiguation module is standalone: it performs its work without any help from outside.

To start with, all we have is a raw corpus of Punjabi text, so a statistical approach is the obvious choice for us.

Word Sense Disambiguation

We use the words surrounding the ambiguous word to build a statistical language model. This model is then used to determine the meaning of occurrences of that ambiguous word in new contexts.

The basic idea of statistical methods is that, given a sentence with ambiguous words, it is possible to determine the most likely sense of each word.

One such statistical model is the n-gram model.

Word Sense Disambiguation

An n-gram is simply a sequence of n successive words along with its count, i.e. the number of occurrences in the training data.

An n-gram of size 2 is a bigram; size 3 is a trigram; size 4 or more is simply called an n-gram. An n-gram model is an (n-1)-order Markov model.

N-grams are used as probability estimators that estimate the likelihood of a word (or words) following a certain point in a document.

What is the optimum value of n?
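As a small illustration of the definitions above, n-gram counts and a simple maximum-likelihood bigram estimate can be computed as follows. The helper names and the English toy sentence are assumptions for the example; a real model would be trained on the Punjabi corpus.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n successive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_probability(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev) = count(prev, word) / count(prev)."""
    history = unigrams[(prev,)]
    return bigrams[(prev, word)] / history if history else 0.0


tokens = "i can can the can".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
print(unigrams[("can",)])       # 3
print(bigrams[("can", "can")])  # 1
print(bigram_probability("can", "the", unigrams, bigrams))
```

An unseen n-gram simply has count 0 here, which is the sparseness problem in miniature: the longer the n-gram, the more often that happens.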

Word Sense Disambiguation

Consider predicting a word from three sentences (the Punjabi example sentences are not preserved in this transcript).

In (1), the prediction can be done with a bigram (2-gram) language model (n = 2), but (2) requires n = 4 and (3) requires n > 9.

Word Sense Disambiguation

The number of words to be considered, i.e. the value of n, is important.

Factors of concern are:

The larger the value of n, the higher the probability of getting the correct word sense, and for the general domain more training data will always improve the result. On the other hand, most higher-order n-grams do not occur in the training data. This is the problem of data sparseness.

Word Sense Disambiguation

As the training data size increases, the size of the model also increases, which can lead to models that are too large for practical use. The total number of potential n-grams scales exponentially with n, so a large n requires a huge amount of memory and time.

Does the model get much better if we use a longer word history for modeling an n-gram?

Do we have enough data to estimate the probabilities for the longer history?

Word Sense Disambiguation

An experiment was performed to find the optimum value of n for the Punjabi language.

Different n-gram models were generated, with n ranging from 1 to 6.

It was observed that as the value of n increases, the model's ability to disambiguate a word decreases.

This is due to sparseness of data.

Word Sense Disambiguation

Another interesting observation is that, instead of building and using higher-order n-gram models, we can improve the system considerably by using lower-order models jointly.

We first use the trigram model to disambiguate a word. If it fails, we move to the lower-order bigram model. If that also fails, we use the unigram model.

With this technique only 7.96% of words are incorrectly disambiguated.

This approach is adopted for the word sense disambiguation module.
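The back-off strategy just described can be sketched as follows. The table layout (a context key mapped to a dictionary of sense counts), the function name `disambiguate`, and the romanized toy entries are assumptions for illustration, not the system's actual data structures.

```python
def disambiguate(prev2, prev1, word, trigrams, bigrams, unigrams):
    """Pick a sense by backing off: trigram, then bigram, then unigram.

    Each table maps a context key to a dict of {sense: count}.
    Returns None when the word is unknown to all three tables.
    """
    for table, key in (
        (trigrams, (prev2, prev1, word)),
        (bigrams, (prev1, word)),
        (unigrams, (word,)),
    ):
        senses = table.get(key)
        if senses:
            # Within the first model that knows this context, take the most frequent sense.
            return max(senses, key=senses.get)
    return None


# Toy tables with romanized placeholders: "de" is usually a postposition, rarely a verb.
unigrams = {("de",): {"postposition": 90, "verb": 10}}
bigrams = {("mainu", "de"): {"verb": 7}}
trigrams = {}

print(disambiguate("kitab", "mainu", "de", trigrams, bigrams, unigrams))  # verb
print(disambiguate(None, "ghar", "de", trigrams, bigrams, unigrams))      # postposition
```

The higher-order tables are consulted first because a longer matching context is more specific; the unigram table plays the role of the default, most frequent meaning.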

Word Sense Disambiguation

Three models of the ambiguous words (unigram, bigram and trigram) are created to capture the words in the context of any ambiguous word. They are built from a corpus of about 3 million words that includes different types of text such as essays, stories, editorials, news, novels, office letters and court orders.

To reduce the size of the n-gram tables, we retain only those contexts that lead to the less frequent meanings of ambiguous words.

Word Sense Disambiguation

The idea is to check the contextual information for the less frequent meaning. If this fails to disambiguate, we use the most frequent meaning by default.

For example, {d} in Punjabi can be used as a postposition as well as a verb, but its use as a verb is much less frequent. So we place in the database all those bigrams and trigrams that lead to the less frequent meaning, i.e. {d} as a verb.

Word Sense Disambiguation

The bigram table contains all those entries for which the word has its less frequent meaning. All such meanings are entered manually in the trigram and bigram models. If the word cannot be disambiguated by the trigram or bigram, the most frequent meaning is selected by default. Sometimes the previous words in the context point to one sense while the following words point to another. In such cases the sense with the higher probability is selected.

[Table: sample bigram rows (prev1, pw, hw, count) with counts 4, 2, 12, 2 and 2; the word columns are not preserved.]

Transliteration

Transliteration is a solution for out-of-vocabulary words.

Transliteration is a process in which an input string in one alphabet is converted to a string in another alphabet, usually based on the phonetics of the original word.

If the target language contains all the phonemes used in the source language, transliteration is straightforward; e.g. the Hindi transliteration of the Punjabi word for "room" is pronounced in essentially the same way.

Transliteration

Sounds that are missing or extra in the target language are generally mapped to the most phonetically similar letter. For example, Hindi has letters with a double sound associated with them, each a combination of the sounds of two other letters. In Punjabi, a single letter is generally used to denote such sounds, as in the word for "alphabet".

A single foreign word can have many different transliterations; for example, the word "Mehfooz" can be transliterated in several ways.

Transliteration: Approaches

Direct mapping
Rule based
Soundex based

Transliteration

Vowel mapping: Punjabi contains 10 vowel symbols and nine dependent vowel sounds. Hindi has one-to-one representations of all Punjabi vowel symbols and sounds.

[Table: Gurmukhi to Devanagari vowel mapping; characters not preserved.]

Transliteration

Consonant mapping

[Table: Gurmukhi to Devanagari consonant mapping; characters not preserved.]

Some Hindi letters have no counterpart in Punjabi, so they can never be produced by a letter-to-letter approach. The same holds for some double-sound letters.

Transliteration

Sub-join mapping: there are three sub-joins (PAIREEN) in Gurmukhi: Haahaa, Vaavaa and Raaraa.

[Table: PAIREEN examples with columns Punjabi, Hindi, English; the English glosses are "lips", "sequentially" and "self-respect".]

In Punjabi they are represented by the virama (or halant) character before the consonant. A similar virama character is present in Hindi, where it indicates that the inherent vowel is omitted (or "killed"). PAIREEN Haahaa and Vaavaa are replaced with full consonants while the preceding consonant is shown in half form, whereas PAIREEN Raaraa takes its position below the preceding consonant in Hindi, as in Punjabi.

Transliteration

Other symbols

Adhak is used to duplicate the sound of a consonant in Punjabi. No such character is present in Hindi, where sound duplication is represented by the half form of consonants. Punctuation marks and digits are the same in both scripts.

A special character called visarga is present in Hindi but not in Punjabi, so it is never produced by a letter-to-letter scheme.

Besides this, Gurmukhi has two separate nasal characters, bindi and tippi. Hindi also has two nasal characters, bindi and chandrabindu. Both nasal characters of Punjabi are mapped to the single nasal character bindi in Hindi.
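A letter-to-letter baseline of the kind described above can be sketched using the fact that the Gurmukhi (U+0A00 to U+0A7F) and Devanagari (U+0900 to U+097F) Unicode blocks are laid out in parallel. This is an illustrative shortcut, not the mapping tables from the slides: the names `direct_map` and `SPECIAL`, the choice to leave adhak untouched for a later rule-based stage, and the example word (the word for "room") are assumptions.

```python
GURMUKHI_START, GURMUKHI_END = 0x0A00, 0x0A7F
BLOCK_OFFSET = 0x0A00 - 0x0900  # distance between the two parallel Unicode blocks

# Characters handled explicitly instead of by the block offset.
SPECIAL = {
    "\u0A70": "\u0902",  # tippi -> Devanagari bindi (anusvara)
    "\u0A71": "\u0A71",  # adhak has no Devanagari counterpart; left for rule-based handling
}


def direct_map(text):
    """Map each Gurmukhi character to the Devanagari character at the same block position."""
    out = []
    for ch in text:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
        elif GURMUKHI_START <= ord(ch) <= GURMUKHI_END:
            out.append(chr(ord(ch) - BLOCK_OFFSET))
        else:
            out.append(ch)  # punctuation, spaces and Latin text pass through unchanged
    return "".join(out)


# KA + MA + RA + vowel sign AA, i.e. the Punjabi word for "room".
print(direct_map("\u0A15\u0A2E\u0A30\u0A3E"))  # कमरा
```

The Gurmukhi bindi (U+0A02) falls out of the offset rule as the Devanagari bindi (U+0902), so both Punjabi nasal signs end up as the same Hindi character, as stated above.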

Transliteration

Letter-to-letter mapping produces quite good results, but the results can be brought nearer to the target language, in terms of spelling and choice of letters, by applying a set of rules.

Rule-Based Approach

Letters with no mapping in Hindi: two Punjabi letters have no Hindi counterpart. They are replaced by their most phonetically equivalent characters.

Adhak: the adhak is present in Punjabi and is used to show stress on the next character. Hindi has no letter for it. The purpose is served by placing a half character before the stressed character. There are exceptions to this rule: for certain following characters, the half form of a different character is placed instead.

Transliteration

Rule for tippi: depending on the character that follows, tippi is replaced by the half form of one of two nasal consonants.

Rule for a character that is transliterated differently depending on a neighbouring character: if it is followed by that character, it is omitted from the transliterated text; if it is preceded by it, it is mapped to a different letter in the transliterated text.

Miscellaneous rules: certain letters and letter combinations are mapped differently when they appear in the last or second-last position of a word.

Soundex

The Soundex concept is extended to search for the correct spelling variant of a given transliteration. The transliterations are produced by the methods discussed earlier.

We have developed a unigram table from a corpus of about 10 million words.
A code is generated for each word in the unigram table.
For the comparison, the letters of the input word are converted into a phonetic code.

Soundex

This code is then looked up in the unigram table. The entry with the maximum frequency is selected as the correct variant of the given input.

For example, consider the word {arpha} (draft) written in Punjabi. The string {arpha} is produced by the baseline module, and the code 2541483623 is generated for it. This code is looked up in the unigram database, which contains two entries against it, with frequencies 12 and 8. The string with the higher frequency is selected as the correct output for this input.
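The selection step in this example can be sketched as below. The index layout (phonetic code mapped to a dictionary of word frequencies), the function name `pick_variant`, the stand-in encoder and the placeholder words are assumptions; only the code 2541483623 and the frequencies 12 and 8 come from the example above.

```python
def pick_variant(candidate, unigram_index, encode):
    """Return the most frequent corpus word that shares the candidate's phonetic code.

    unigram_index maps a phonetic code to {word: frequency}.
    Falls back to the candidate itself when the code is not in the index.
    """
    matches = unigram_index.get(encode(candidate), {})
    if not matches:
        return candidate
    return max(matches, key=matches.get)


# Placeholder words stand in for the two Devanagari spellings, which are not preserved.
unigram_index = {"2541483623": {"variant_a": 12, "variant_b": 8}}


def encode(word):
    """Stand-in encoder for this demo: known spellings share the example code."""
    return "2541483623" if word.startswith("variant") or word == "baseline_output" else "0"


print(pick_variant("baseline_output", unigram_index, encode))  # variant_a
print(pick_variant("unseen_word", unigram_index, encode))      # unseen_word
```

Falling back to the baseline output when no corpus word shares the code is a design choice of this sketch; the slides do not say what the system does in that case.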

Soundex

[Example: the font-encoded input frweIvr as rendered after direct mapping, after rule-based enhancement, and after Soundex-based enhancement; the three outputs are not preserved.]

Post Processing

Rules are applied to cover the minor grammatical differences between the languages.

The general structure of the rules is context-dependent replacement.

A phrase or word in a given context is replaced by the corresponding phrase or word.

The ordering of rules does not matter.
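One way to apply such replacement rules so that their order really does not matter is to match all of them in a single pass over the text. The sketch below assumes the rules are plain string pairs, as in the Replacement Table (orgtext, reptxt); the function name `apply_replacements` and the romanized sample rules are hypothetical.

```python
import re


def apply_replacements(text, rules):
    """Apply every orgtext -> reptxt rule in one left-to-right pass over the text."""
    if not rules:
        return text
    # Longer patterns are listed first so that a phrase wins over a word it contains.
    alternatives = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(org) for org in alternatives))
    return pattern.sub(lambda match: rules[match.group(0)], text)


# Romanized placeholder rule, not taken from the actual rule base.
print(apply_replacements("ram ne ko kitab di", {"ne ko": "ko"}))  # ram ko kitab di

# Because replaced text is never rescanned, swapping rules do not interfere.
print(apply_replacements("ab", {"a": "b", "b": "a"}))  # ba
```

Applying the rules one after another with repeated string replacement would make the result depend on rule order whenever one rule's output matches another rule's input; the single-pass form avoids that.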


An Example

[Input and output sentences are not preserved.]

Working of Target Word Generator

[Flowchart, shown step by step with the selected word in red. Nodes: Select Next Word; Look into Root Data Base; Look into Inflectional DB; Found?; Amb?; Look into Trigram; Look into Bigram; Use Default Meaning; Check vicinity for direct transliteration; Need Translit.?; Transliterate; Put Hindi Word into Output; Sentence Complete?; Stop.]

Three trigrams are possible for the ambiguous word in the example.

Only the 2nd and 3rd trigrams can resolve the ambiguity. The meaning with the higher count is selected. In this case both trigrams produce the same result.

Punjabi to Hindi MT system: http://translate.jgmatrix.in

MT System Breaking Language Barrier


Ultimate Goal of MT

The ultimate goal of the MT fraternity is to develop MT systems and integrate them in order to break the language barrier.

Any person should be able to get the required information in his or her own native language and share his or her knowledge across the world without learning other languages.

A Glimpse of Future Possibilities

THANKS !!