Machine Translation: Breaking the Communication Barrier
By Dr. Gurpreet S. Josan, Punjabi University, Patiala

Communication
Communication is the activity of conveying meaningful information.
Communication requires a sender, a message, and an intended recipient.
The communication process is complete once the receiver has understood the sender.
Machine Translation-Breaking the language Barrier
Mode of Communication
Nonverbal communication: includes gestures, body language or posture, and facial expression
Visual communication: includes signs, typography, drawing, colours, etc.
Oral communication: spoken verbal communication
Written communication: includes alphabets, symbols, grammar, etc.

Barrier in Communication
The Effect
Language is a barrier to information dissemination. All the major sources of information and discoveries are in English.
We are unable to reach the masses in rural areas who do not know English.
You're a scientist who has just hit upon a revolutionary new idea. How do you find out whether a scientist anywhere in the world has already filed a patent on a similar idea in their native language?
Who Can Help?
A translator
Manual: too slow, limited availability, costly, accurate
Machine: fast, economical, not accurate
Issues
kJfmmfj mmmvvv nnnffn333Uj iheale eleee mnster vensi credurBaboi oi cestnitze Coovoel2^ ekk; ldsllk lkdf vnnjfj?Fgmflmllk mlfm kfre xnnn!

Issues
Computers lack knowledge!
Computers see English text the same way you saw the previous text!
People have no trouble understanding language. They have:
Common sense knowledge
Reasoning capacity
Experience
Computers have:
No common sense knowledge
No reasoning capacity

Issues
Which ones are going to be difficult for computers to deal with? Grammar or Lexicon?
Grammar (rules for putting words together into sentences)
How many rules are there? 100, 1,000, 10,000, more?
Do we have all the rules written down somewhere?
Lexicon (dictionary)
How many words do we need to know? 1,000, 10,000, 100,000?

Issues
"The dog ate my homework": who did what to whom?
1. Identify the part of speech (POS)
Dog = noun; ate = verb; homework = noun
English POS tagging accuracy: 95%
Try to tag this text manually: "I can, can the can."
2. Identify collocations
mother-in-law, hot dog
Issues
Seemingly similar sentences may differ radically in meaning:
The CEO was fired up about his new role.
The CEO was fired from his new role.
Seemingly different sentences can have the same meaning:
IBM's PC division was acquired by Lenovo.
Lenovo bought the PC division of IBM.

Issues
Ambiguity
Structural ambiguity: I saw the man with the telescope.
Word-level ambiguity

Ambiguity
Various meanings of a word in Punjabi
Ambiguity
If more than one ambiguous word is present in a sentence, the number of potential interpretations of the sentence explodes: it is the product of the numbers of possible meanings of the individual words.
Consider the sentence
and assume that only {va} and {pa} are ambiguous in this sentence, and that they both have 4 senses. This brings the number of possible interpretations to 4 × 4 = 16.
Ambiguity
Imagine what happens if there are more senses to be taken into account or if the sentence gets longer.
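The growth in readings can be checked with a few lines of Python. This is only a sketch: the sense counts are hypothetical and the function name `interpretation_count` is an illustrative choice, not part of the system described here.

```python
from math import prod

def interpretation_count(senses_per_word):
    """Number of sentence readings = product of the sense counts of its words."""
    return prod(senses_per_word)

# Hypothetical five-word sentence: two ambiguous words with 4 senses each,
# every other word unambiguous (1 sense).
print(interpretation_count([1, 4, 1, 4, 1]))

# If all five words had 4 senses, the count grows to 4 ** 5.
print(interpretation_count([4] * 5))
```

Each additional ambiguous word multiplies the count, which is why longer sentences quickly become expensive to enumerate.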
(Sukhbir) (has) (twine) (thigh) (crease) (FootPath) (sultriness) (destroy) (door-leaf) (silk)

Issues
Anaphora Resolution:
The dog ate the bird and it died.
Gender Conversion
Idioms & Phrases
Issues
Named Entity Recognition: "Dr. Plant Singh" vs "Dr. Buta Singh"
Foreign words
Spelling variation

Issues
Rhyming reduplication
Other issues
In Indian languages there is no fixed font encoding. For example, the same word can be written as several different sequences of key codes.
Three MT Approaches: Direct, Transfer, Interlingual
[Figure: MT pyramid between Source Text and Target Text. Levels: Word Structure (Direct), Syntactic Structure (Syntactic Transfer), Semantic Structure (Semantic Transfer), Interlingua. Analysis steps: Morphological Analysis, Syntactic Analysis, Semantic Analysis, Semantic Composition. Generation steps: Semantic Decomposition, Semantic Generation, Syntactic Generation, Morphological Generation.]

Direct MT: Pros and Cons
Pros: fast; simple; inexpensive; robust; no translation rules hidden in lexicon
Cons: unreliable; not powerful; rule proliferation; requires too much context; major restructuring after lexical substitution

Transfer MT: Pros and Cons
Pros: no need to find a language-neutral representation; relatively fast
Cons: large number of transfer rules, so difficult to extend; proliferation of language-specific rules in lexicon and syntax

Interlingual MT: Pros and Cons
Pros: portable; lexical rules and structural transformations stated more simply on a normalized representation; explanatory adequacy
Cons: difficult to deal with terms; deciding what should be added is difficult; what will be the universal knowledge format, and how do we encode it?; must decompose and reassemble concepts
Current Techniques
Corpus-based approaches
Statistics-based Machine Translation (SMT): every target language string T is a possible translation of a source string S. Every string is given a number called its probability. We select the string which has the maximum probability:

$$\hat{T} = \arg\max_{T} \left[ \Pr(T) \, \Pr(S \mid T) \right]$$

where S is a source language string and T is a target language string.
Estimating Pr(T), estimating Pr(S | T), and finding the maximizing T are known as the language modeling problem, the translation modeling problem, and the search problem, respectively.
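To make the three sub-problems concrete, here is a minimal sketch in Python, writing S for the source string and T for a candidate target string. All probabilities are made-up toy values, and the names `language_model`, `translation_model` and `best_translation` are illustrative assumptions, not part of any real SMT toolkit.

```python
import math

# Toy, made-up scores for three candidate target strings T of one fixed source S.
language_model = {"t1": 0.020, "t2": 0.005, "t3": 0.010}   # Pr(T)
translation_model = {"t1": 0.10, "t2": 0.60, "t3": 0.25}   # Pr(S | T)

def best_translation(candidates):
    """Search step: return the T that maximizes log Pr(T) + log Pr(S | T)."""
    return max(
        candidates,
        key=lambda t: math.log(language_model[t]) + math.log(translation_model[t]),
    )

print(best_translation(["t1", "t2", "t3"]))
```

Working in log space is a common design choice because products of many small probabilities underflow quickly; the maximizing candidate is the same either way.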
Current Techniques
Corpus-based approaches
Example-Based Machine Translation: translation by analogy. The system is given a set of sentences in the source language and their corresponding translations in the target language. It uses those examples to translate other, similar source-language sentences into the target language.
Hybrid methods: a combination of rule-based and statistical methods
Punjabi to Hindi Machine Translation
The Punjabi to Hindi machine translation system is a direct translation system based on various lexical resources and a rule base.
The system is modular, with a clear separation of data from process.
The central idea is to select words from the source language and do the minimal analysis required, such as extracting the root word, the lexical category, and contextual information, i.e. the tokens to the left and right of the current token.
The word sense disambiguation module is called for ambiguous words.
Equivalents of each source token in the target language are looked up in the lexicon and substituted to produce the target-language text.
Rules are then applied to the output to make it appropriate for the target language.
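The word-by-word flow described above can be sketched as follows. This is a simplified illustration under stated assumptions: the table contents are placeholder strings rather than real Punjabi or Hindi words, and the `disambiguate` and `transliterate` functions are stubs standing in for the modules described later.

```python
# Hypothetical miniature lexicon: keys stand in for Punjabi words,
# values for Hindi words; "amb" marks an ambiguous entry.
ROOT_TABLE = {"pw_house": "hw_house", "pw_work": "amb"}
INFLECTION_TABLE = {"pw_houses": "hw_houses"}
DEFAULT_SENSE = {"pw_work": "hw_work"}

def disambiguate(token, left, right):
    """Stub WSD: ignores the context and returns the most frequent sense."""
    return DEFAULT_SENSE[token]

def transliterate(token):
    """Stub transliteration: tags the token instead of converting scripts."""
    return f"<{token}>"

def translate(sentence):
    tokens = sentence.split()
    output = []
    for i, token in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None
        # Try the root table first, then the inflectional form table.
        target = ROOT_TABLE.get(token) or INFLECTION_TABLE.get(token)
        if target is None:
            target = transliterate(token)              # out-of-vocabulary word
        elif target == "amb":
            target = disambiguate(token, left, right)  # ambiguous word
        output.append(target)
    return " ".join(output)

print(translate("pw_house pw_work pw_houses xyz"))
```

Keeping the tables as plain data and the steps as small functions mirrors the separation of data from process that the design calls for.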
Punjabi to Hindi Machine Translation
System Architecture
[Figure: system architecture. Pre-processing: Normalized Source Text, Tokenization, Named Entity Recognition, Repetitive Construct Handling. Translation Engine: Lexicon Look-up (Root Word & Inflectional Form DB), Hit?, Ambiguous?, Ambiguity Resolution (Ambiguous Word DB, Bigram & Trigram DB), Transliteration, append to output and retrieve the next token if one is present. Post-processing: Post Editing, Target Text.]

Why Direct System?
For a given language pair and text type, what kind of system is required is largely an empirical and practical question.
General requirements on MT systems, such as modularity, separation of data from processes, reusability of resources and modules, robustness, corpus-based derivation of data, and so on, do not provide conclusive arguments for any one of the models.
The available resources are one of the key factors in deciding the approach.

Why Direct System?
In general, if the two languages are structurally similar, in particular as regards lexical correspondences, morphology and word order, the case for abstract syntactic analysis seems less convincing.
Given the similarity of the Punjabi and Hindi language pair, a simpler, direct model is our obvious choice for the Punjabi to Hindi machine translation system.
The Lexicon
The lexicon contains information about the primary component of languages, i.e. words.
Most NLP applications use dictionaries. For example, morphological analyzers use a lexicon containing morphemes, tagging systems use probability data, parsers use lexical/semantic information or co-occurrence information, and MT systems use translation memory and a transfer dictionary.
The Lexicon
The bilingual dictionary prepared by LTRC, IIIT Hyderabad, in ISCII format and containing about 22,000 entries, was adopted and extended for our system.
It was converted into Unicode format.
The entries were extended to about 33,000, covering almost all the root words of the Punjabi language.

Root Table
Field name    Field type
PW            Text
gnp           Text
cat           Text
HW            Text

Inflectional Form Table
This table contains all the inflectional forms of Punjabi root words along with their roots. The corresponding Hindi words were entered manually. It comprises about 65,000 entries.
Field name    Field type
PW            Text
ROOT          Text
HW            Text
where ROOT is one of the entries from the Root Table.

The Lexicon
For all the ambiguous words in the root table as well as the inflectional form table, the entry for the target word contains the symbol amb. It triggers the disambiguation process for the given word. A table of ambiguous words is prepared for this purpose; it contains the most frequent meaning followed by all other possible meanings of a given word.

Ambiguous Word Table
Field name    Field type
PW            Text
s1            Text
s2            Text

The Lexicon
To help the disambiguation module, bigram and trigram tables are created.
They contain the context of ambiguous words along with their meaning in that context and the frequency obtained from a corpus of about 30 lakh (3 million) words.

Bigram Table
Field name    Field type
PREV1         Text
PW            Text
HW            Text
COUNT         Number

Trigram Table
Field name    Field type
PREV2         Text
PREV1         Text
PW            Text
HW            Text
COUNT         Number

The Lexicon
The lexicon also contains a rule base.
It contains all the rules that handle grammatical dissimilarities between the two languages at the post-processing stage.

Replacement Table
Field name    Field type
orgtext       Text
reptxt        Text
Text Normalization
The text should be normalized, i.e. there should be only one way to represent a syllable.
Having several identical pieces of text represented by differing underlying byte sequences makes analysis of the text much more difficult.
For example, the Latin character code "A" is displayed as one Gurmukhi glyph under the AnmolLipi font and as a different glyph under the DrChatrikWeb font.
This causes a problem while scanning a text.

Text Normalization
So the source text is normalized by converting it into Unicode format.
This gives a threefold advantage: first, it reduces the complexity of text scanning; second, it helps in internationalizing the system; third, it eases the transliteration task.
Text Normalization
Spelling normalization
A word may be present in the database but with a different spelling, as with the variants of {prkhia} [examination]. The database may contain only one variant and not the others. The purpose of spelling normalization is to find the missing variant. The Soundex technique is used for spelling normalization.

Soundex
In this technique, a number is assigned to each character of the alphabet, and similar-sounding letters get the same number. Codes are then generated for each string. All strings with the same code are spelling variations of one string.
[Table: Soundex codes c1 to c41 assigned to the Gurmukhi characters; HALANT has no code.]

Soundex
With this table, the code for {prkhia} comes out as c31c37sc13sc4, enabling the system to detect the variant present in the database. For example, if the database contains {prkhia} as a Punjabi word, the code c31c37sc13sc4 is stored against it. If a user enters a spelling variant that is not present in the database, its code is generated on the fly by the system and looked up in the database. If the code appears in the database, the corresponding Punjabi word is selected as the spelling variant.
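The lookup described above can be sketched in Python. The sound groups below are hypothetical and defined over romanized letters purely for illustration; the actual system assigns its codes to Gurmukhi characters, and the words used here are placeholders.

```python
# Hypothetical sound groups: letters in the same group share one code.
SOUND_GROUPS = ["bp", "dt", "gk", "sz", "iy", "ae", "ou"]
CODE = {ch: f"c{n}" for n, group in enumerate(SOUND_GROUPS, start=1) for ch in group}

def soundex_code(word):
    """Concatenate per-letter codes; letters outside every group keep themselves."""
    return "".join(CODE.get(ch, ch) for ch in word)

def build_index(words):
    """Store each database word against its code."""
    index = {}
    for word in words:
        index.setdefault(soundex_code(word), []).append(word)
    return index

DATABASE = build_index(["pariksa", "kitab"])  # placeholder entries

def find_variants(word):
    """Return database words whose code matches the code of the input."""
    return DATABASE.get(soundex_code(word), [])

print(soundex_code("pariksa"))
print(find_variants("parikza"))
```

Because the code is computed once per database word and stored, a lookup for a new spelling costs one code computation and one dictionary access.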
Word Sense Disambiguation
To disambiguate a word, we make use of the information contained in its context, much as humans do.
The word sense disambiguation module is standalone: it is capable of performing its work without any outside help.
To start with, all we have is a raw corpus of Punjabi text, so the statistical approach is the obvious choice for us.
Word Sense Disambiguation
We use the words surrounding the ambiguous word to build a statistical language model. This model is then used to determine the meaning of occurrences of that ambiguous word in new contexts.
The basic idea of statistical methodologies is that, given a sentence with ambiguous words, it is possible to determine the most likely sense for each word.
One such statistical model is the n-gram model.
Word Sense Disambiguation
An n-gram is simply a sequence of n successive words along with its count, i.e. the number of occurrences in the training data.
An n-gram of size 2 is a bigram, size 3 is a trigram, and size 4 or more is simply called an n-gram. An n-gram model corresponds to an (n−1)-order Markov model.
N-grams are used as probability estimators, which estimate the likelihood of a word or words following a certain point in a document.
What is the optimum value of n?
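As an illustration of n-grams as counts and as probability estimators, the following sketch builds bigram counts from a toy English token list and derives a maximum-likelihood estimate. The corpus and the function names are assumptions made for the example only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n successive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the dog ate the bird and the dog slept".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

def next_word_probability(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[(prev,)]

print(bigrams[("the", "dog")])
print(next_word_probability("the", "dog"))
```

A real model would also need smoothing for unseen n-grams; this sketch omits it to keep the counting step visible.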
Word Sense Disambiguation
Consider predicting the same word in three different sentences, (1), (2) and (3).
In (1), the prediction can be done with a bigram (2-gram) language model (n = 2), but (2) requires n = 4 and (3) requires n > 9.
Word Sense Disambiguation
The number of words to be considered, i.e. the value of n, is important.
Factors of concern are:
The larger the value of n, the higher the probability of getting the correct word sense, provided the n-gram has been seen in training; for the general domain, more training data will always improve the result. On the other hand, most higher-order n-grams do not occur in the training data. This is the problem of data sparseness.
Word Sense Disambiguation
As the training data size increases, the size of the model also increases, which can lead to models that are too large for practical use. The total number of potential n-grams scales exponentially with n. A large n requires a huge amount of memory and time.
Does the model get much better if we use a longer word history for modeling an n-gram?
Do we have enough data to estimate the probabilities for the longer history?
Word Sense Disambiguation
An experiment was performed to find the optimum value of n for the Punjabi language.
Different n-gram models were generated, with n ranging from 1 to 6.
It was observed that as the value of n increases, the model's ability to disambiguate a word decreases.
This is due to the sparseness of data.
Word Sense Disambiguation
Another interesting observation is that, instead of building and using higher-order n-gram models, we can improve the efficiency of the system tremendously by using lower-order models jointly.
We first use the trigram model to disambiguate a word. If it fails, we move to a lower-order model, i.e. the bigram model. If that also fails, we use the unigram model.
With this technique, only 7.96% of words are incorrectly disambiguated.
This approach is adopted for the word sense disambiguation module.
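The trigram, then bigram, then unigram fallback can be sketched as below. The tables, context tokens and sense labels are hypothetical placeholders; only the order in which the models are consulted follows the description above.

```python
# Hypothetical n-gram tables: context key -> {sense: count}.
TRIGRAMS = {("p2", "p1", "w"): {"sense_b": 3}}
BIGRAMS = {("q1", "w"): {"sense_b": 5, "sense_a": 1}}
UNIGRAMS = {"w": {"sense_a": 40, "sense_b": 9}}

def disambiguate(prev2, prev1, word):
    """Try the trigram, then the bigram, then the unigram model."""
    lookups = (
        (TRIGRAMS, (prev2, prev1, word)),
        (BIGRAMS, (prev1, word)),
        (UNIGRAMS, word),
    )
    for table, key in lookups:
        senses = table.get(key)
        if senses:
            # Pick the sense with the highest count in the first model that fires.
            return max(senses, key=senses.get)
    return None  # word is not in any table

print(disambiguate("p2", "p1", "w"))   # resolved by the trigram table
print(disambiguate("x", "q1", "w"))    # falls back to the bigram table
print(disambiguate("x", "y", "w"))     # falls back to the unigram table
```

The unigram fallback is equivalent to choosing the most frequent meaning by default.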
Word Sense Disambiguation
Three models, viz. unigram, bigram and trigram, that capture the words in the context of each ambiguous word are created from a corpus of about 3 million words. The corpus includes different types of text such as essays, stories, editorials, news, novels, office letters, and court orders.
To reduce the size of the n-gram tables, we retain only those contexts that lead to the less frequent meaning of an ambiguous word.

Word Sense Disambiguation
The idea is to check the contextual information for the less frequent meaning. If that fails to disambiguate the word, we use the most frequent meaning by default.
For example, {d} in Punjabi can be used as a postposition as well as a verb, but its use as a verb is much less frequent. So we place in the database all those bigrams and trigrams that lead to the less frequent meaning, i.e. {d} as a verb.

Word Sense Disambiguation
The bigram table contains all those entries in which the ambiguous word takes its less frequent meaning. All such meanings are entered manually in the trigram and bigram models. If the word cannot be disambiguated by the bigram and trigram models, the most frequent meaning is selected by default. Sometimes the previous words in the context lead to one sense while the following words suggest another. In such cases, the sense with the higher probability is again selected.
Transliteration
Transliteration is a solution for out-of-vocabulary words.
Transliteration is a process wherein an input string in some alphabet is converted to a string in another alphabet, usually based on the phonetics of the original word.
If the target language contains all the phonemes used in the source language, the transliteration is straightforward, e.g. the Hindi transliteration of the Punjabi word for "room" is pronounced in essentially the same way.

Transliteration
Sounds that are missing or extra in the target language are generally mapped to the most phonetically similar letter. For example, Hindi has letters with a double sound associated with them, each a combination of the sounds of two other letters. In Punjabi, a single letter is generally used to denote such sounds, as in the Punjabi word for "alphabet" and its Hindi transliteration.
A single foreign word can have many different transliterations, e.g. "Mehfooz" can be transliterated in several ways.
Transliteration: Approaches
Direct mapping
Rule based
Soundex based

Transliteration
Vowel Mapping
[Table: Gurmukhi to Devanagari vowel mapping.]
Punjabi contains 10 vowel symbols and nine dependent vowel sounds. Hindi has one-to-one representations of all Punjabi vowel symbols and sounds.

Transliteration
Consonant Mapping
[Table: Gurmukhi to Devanagari consonant mapping.]
Some Hindi letters have no corresponding letter in Punjabi, so they can never be produced by a letter-to-letter approach. The same is true of some letters that produce a double sound.
Transliteration
Sub Joins Mapping
There are three sub joins (PAIREEN) in Gurmukhi: Haahaa, Vaavaa and Raaraa.
[Table: example words for the PAIREEN in Punjabi and Hindi, with English glosses: Lips, Sequentially, Self-respect.]
In Punjabi they are represented by the virama (or halant) character before the consonant. A similar virama character is also present in Hindi, where it indicates that the inherent vowel is omitted (or "killed"). PAIREEN Haahaa and Vaavaa are replaced with full consonants while the preceding consonant is shown in half form, whereas PAIREEN Raaraa takes the position below the preceding consonant in Hindi, as it does in Punjabi.
Transliteration
Other Symbols
Adhak is used to duplicate the sound of a consonant in Punjabi. No such character is present in Hindi, where sound duplication is represented by the half form of consonants. Punctuation marks and digits are the same in both scripts.
A special character called visarga is present in Hindi but not in Punjabi, so it will never be produced by a letter-to-letter scheme.
Besides this, Gurmukhi has two separate nasal characters, Bindi and Tippi. Hindi also has two nasal characters, Bindi and chander bindu. Both nasal characters of Punjabi are mapped to a single nasal character in Hindi.
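A letter-to-letter mapping of this kind can be sketched in Python by exploiting the parallel layout of the Unicode Gurmukhi (U+0A00 to U+0A7F) and Devanagari (U+0900 to U+097F) blocks. This is a shortcut chosen for illustration rather than the system's own mapping table, and mapping Tippi to the Devanagari Bindi (U+0902) is an assumption about which nasal character is the target.

```python
GURMUKHI_START, GURMUKHI_END = 0x0A00, 0x0A7F
BLOCK_OFFSET = 0x0A00 - 0x0900  # distance between the two Unicode blocks

TIPPI = "\u0A70"
ADHAK = "\u0A71"

# Characters handled explicitly instead of by the block offset.
SPECIAL = {
    TIPPI: "\u0902",  # assumed target: Devanagari bindi (anusvara)
    ADHAK: ADHAK,     # left as-is here; it needs a rule, not a letter mapping
}

def direct_map(text):
    """Map Gurmukhi characters to the parallel Devanagari code points."""
    out = []
    for ch in text:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
        elif GURMUKHI_START <= ord(ch) <= GURMUKHI_END:
            out.append(chr(ord(ch) - BLOCK_OFFSET))
        else:
            out.append(ch)  # spaces, punctuation and Latin text pass through
    return "".join(out)

# Gurmukhi KA, MA, RA and the AA vowel sign.
print(direct_map("\u0A15\u0A2E\u0A30\u0A3E"))
```

The offset trick only covers characters that exist at matching positions in both blocks; Gurmukhi-only signs such as Adhak still need the rules discussed for the rule-based approach.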
Transliteration
Letter-to-letter mapping produces quite good results, but the results can be brought closer to the target language in spelling and choice of letters by applying a set of rules.
Rule-Based Approach
Letters whose mapping is not available in Hindi: there are two such letters. They are replaced by their most phonetically equivalent characters.
The adhak character is present in Punjabi and is used to show stress on the next character. Hindi has no letter to represent it. The purpose is served by placing a half form of the stressed character before that character. There are exceptions to this rule: for three specific following characters, the half form of a different character is placed instead.
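The adhak rule can be sketched as follows. The sketch assumes the half form is written as consonant plus Devanagari virama, reuses the Unicode block offset for the base mapping, and fills the exception table with a single assumed pair (KHA takes half KA) purely for illustration; the actual exception letters of the system are not reproduced here.

```python
VIRAMA = "\u094D"  # Devanagari halant, used to write a half consonant
ADHAK = "\u0A71"   # Gurmukhi adhak
OFFSET = 0x100     # Gurmukhi code point minus parallel Devanagari code point

# Hypothetical exception table: stressed consonant -> consonant whose half
# form is placed before it. One assumed pair: KHA -> KA.
HALF_FORM_EXCEPTIONS = {"\u0916": "\u0915"}

def to_devanagari(ch):
    """Base letter mapping via the block offset; other characters unchanged."""
    return chr(ord(ch) - OFFSET) if 0x0A00 <= ord(ch) <= 0x0A7F else ch

def apply_adhak_rule(text):
    """Replace each adhak with a half consonant taken from the next character."""
    out = []
    for i, ch in enumerate(text):
        if ch == ADHAK:
            if i + 1 < len(text):
                stressed = to_devanagari(text[i + 1])
                half = HALF_FORM_EXCEPTIONS.get(stressed, stressed)
                out.append(half + VIRAMA)
            # a trailing adhak has nothing to stress and is dropped
        else:
            out.append(to_devanagari(ch))
    return "".join(out)

# Gurmukhi PA, adhak, KA, AA sign.
print(apply_adhak_rule("\u0A2A\u0A71\u0A15\u0A3E"))
```

Keeping the exceptions in a table means new cases can be added as data without touching the rule itself.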
Transliteration
Rule for tippi: if tippi is followed by either of two specific characters, it is replaced by the half form of one letter; if it is followed by a third specific character, it is replaced by the half form of a different letter.
Rule for a character adjacent to a particular neighbour: one character is transliterated differently depending on whether it is followed or preceded by a particular neighbouring character. If it is followed by that neighbour, it is omitted from the transliterated text. If it is preceded by that neighbour, it is mapped to a different letter in the transliterated text.
Miscellaneous Rules
If a particular combination of letters appears at the last position in a word, its final letter is mapped to a different letter than usual.
Certain letters are replaced by another letter when they occur at the last position, or at the last or second-to-last position, of a word.
Soundex
The Soundex concept is extended to search for the correct spelling variant of a given transliteration. The transliterations are produced by the methods discussed earlier.
We have developed a unigram table from a corpus of about 10 million words.
A code is generated for each word in the unigram table.
For the comparison, the letters of the input word are converted into a phonetic code.

Soundex
This code is then looked up in the unigram table. The entry with the maximum frequency is selected as the correct variant of the given input.
For example, consider the word {arpha} (Draft) written in Punjabi. The string {arpha} is produced by the baseline module, and the code 2541483623 is generated for it. This code is looked up in the unigram database, which contains two entries against this code, with frequencies 12 and 8. The string with the higher frequency is selected as the correct output for this input.
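The selection step can be sketched as a lookup by code followed by a maximum over frequencies. The words below are placeholders, and only the code 2541483623 and the frequencies 12 and 8 come from the example above; everything else is assumed for illustration.

```python
from collections import defaultdict

# Placeholder unigram table: word -> (Soundex code, corpus frequency).
UNIGRAM_TABLE = {
    "variant_a": ("2541483623", 12),
    "variant_b": ("2541483623", 8),
    "other_word": ("1111", 30),
}

# Index the words by code so that all variants of a code are found together.
words_by_code = defaultdict(list)
for word, (code, _freq) in UNIGRAM_TABLE.items():
    words_by_code[code].append(word)

def pick_variant(code):
    """Return the most frequent word stored against the code, if any."""
    candidates = words_by_code.get(code)
    if not candidates:
        return None
    return max(candidates, key=lambda w: UNIGRAM_TABLE[w][1])

print(pick_variant("2541483623"))
```

Using corpus frequency as the tie-breaker favours the spelling that writers actually use most often.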
Soundex
[Example: the input frweIvr after Direct Mapping, Rule Base Enhancement, and Soundex Based Enhancement.]
Post Processing
Rules are applied to cover the minor grammatical differences between the languages.
The general structure of the rules is context-dependent replacement.
A phrase or word in a given context is replaced by its corresponding phrase or word.
The ordering of rules does not matter.
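A replacement table of this kind can be applied with a few lines of Python. The rules below are placeholder strings, not real grammar rules, and matching on whole space-delimited phrases is an assumption about how context is delimited.

```python
import re

# Hypothetical replacement table: orgtxt -> reptxt.
REPLACEMENTS = {
    "aa bb": "cc",
    "dd": "ee",
}

def post_process(text):
    """Replace each listed phrase when it appears as whole words."""
    for orgtxt, reptxt in REPLACEMENTS.items():
        # Match only when bounded by whitespace or the ends of the text.
        pattern = r"(?<!\S)" + re.escape(orgtxt) + r"(?!\S)"
        text = re.sub(pattern, lambda match, r=reptxt: r, text)
    return text

print(post_process("x aa bb y dd"))
```

Because these placeholder rules do not overlap, applying them in any order gives the same result, in line with the statement that rule ordering does not matter.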
An Example
[Figure: Working of Target Word Generator, with the selected word in red. Flowchart steps: Select Next Word; Look into Root Data Base; Look into Inflectional DB; Found?; Amb?; Look into Trigram; Look into Bigram; Use Default Meaning; Need Translit.?; Check vicinity for direct transliteration; Transliterate; Put Hindi Word into Output; Sentence Complete?; Stop.]
Three trigrams are possible for the ambiguous word.
Only the 2nd and 3rd trigrams can resolve the ambiguity. The meaning with the higher count is selected. In this case both trigrams produce the same result.
Punjabi to Hindi MT system: http://translate.jgmatrix.in
MT System Breaking Language Barrier
Ultimate Goal of MT
The ultimate goal of the MT fraternity is to develop MT systems and integrate them in order to break the language barrier.
Any person should be able to get the information they need in their own native language and share their knowledge across the world without learning other languages.
A Glimpse of Future Possibilities
Thanks!