Linguistic Enrichment of Statistical Transliteration: MTP Final Stage Presentation


Post on 28-Dec-2015


  • Linguistic Enrichment of Statistical Transliteration: MTP Final Stage Presentation

    Guided by: Prof. Pushpak Bhattacharyya. Presented by: Abhijeet Padhye (06305902)

    Department of Computer Science & Engineering, IIT Bombay

  • Presentation Pathway

    - Problem Statement
    - Motivation
    - What is Transliteration?
    - Syllables and their Structure
    - Sonority Theory
    - Concept of Schwa
    - Proposed Transliteration Model
    - Experiments and Results
    - Discussions
    - Conclusion and Future Work
    - References

  • Problem Statement

    To exploit the phonological similarities of Roman and Devanagari in order to linguistically aid the process of statistical transliteration.

  • Motivation

    - Transliteration is an important component of Machine Translation
    - When you cannot translate, transliterate
    - Critical in tackling the problem of out-of-vocabulary (OOV) words and proper nouns
    - Especially important for translating named entities in cross-lingual information retrieval (CLIR)
    - Transliteration is a phonetic translation process, so it is well suited to exploiting phonetic and phonological properties

  • What is Transliteration?

    The process of phonetically translating words, such as named entities or technical terms, from the source-language alphabet to the target-language alphabet.

    Examples:
    - Named entities such as Gandhiji
    - OOV words such as Namaskar

  • Humans translate/transliterate frequently for different reasons

    An example of how transliteration comes to the rescue when no translation exists

  • Overview of Transliteration

    [Diagram: Source Word, Transliteration Units, Target Word, Transliteration Units. Candidate transliteration units: character n-grams or syllables]

  • Basics of syllables

    A syllable is a unit of spoken language consisting of a single uninterrupted sound, generally formed by a vowel that may be preceded or followed by one or more consonants.

    - Vowels are the heart of a syllable (its most sonorous element)
    - Consonants act as sounds attached to vowels

  • Syllable Structure

    - Simple syllables: Baba = Ba + ba
    - Complex syllables: Andrew = VC? CVC?

    Alert: the basic structure doesn't suffice.

  • Possible syllable structures

    - The Nucleus is always present
    - Onset and Coda may be absent
    - Possible structures: V, CV, VC, CVC
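The four structures above can be illustrated with a minimal sketch that splits a Roman-script syllable into onset, nucleus and coda and reports its pattern. The function names and the five-letter vowel set are assumptions made for this illustration, and a consonant cluster is collapsed into a single C slot.

```python
VOWELS = set("aeiou")  # assumption: Roman vowel letters only


def split_syllable(syllable):
    """Return (onset, nucleus, coda) for one Roman-script syllable."""
    s = syllable.lower()
    vowel_positions = [i for i, ch in enumerate(s) if ch in VOWELS]
    if not vowel_positions:
        raise ValueError(f"no nucleus found in {syllable!r}")
    first, last = vowel_positions[0], vowel_positions[-1]
    # Everything before the first vowel is the onset, everything after the
    # last vowel is the coda; the span in between is taken as the nucleus.
    return s[:first], s[first:last + 1], s[last + 1:]


def cv_pattern(syllable):
    """Map a syllable to one of V, CV, VC, CVC (clusters count as one C)."""
    onset, _nucleus, coda = split_syllable(syllable)
    return ("C" if onset else "") + "V" + ("C" if coda else "")


patterns = {s: cv_pattern(s) for s in ["a", "ba", "an", "drew"]}
```

Here `patterns` maps "a" to V, "ba" to CV, "an" to VC and "drew" to CVC, which shows why a cluster such as "dr" needs the more detailed structure discussed later.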

  • Introduction to sonority theory

    The sonority of a sound is its loudness relative to other sounds with the same length, stress and pitch.

    - Some sounds are more sonorous than others
    - Words in a language can be divided into syllables
    - Sonority theory distinguishes syllables on the basis of sounds

  • Sonority Hierarchy

    Obstruents can be further classified into:
    - Fricatives
    - Affricates
    - Stops

  • Sonority sequencing principle

    The sonority profile of a syllable must rise until its peak (the Nucleus), and then fall.

    [Diagram: sonority rises through the Onset to the Peak (Nucleus) and falls through the Coda]

  • Example: ABHIJEET

    [Figures: Sonority Profile 1 and Sonority Profile 2 of ABHIJEET, with the vowels A, I, E, E at the peaks and the consonants B, H, J, T below them]

  • The concept of schwa

    - First letter of the IAL alphabet: {a}
    - An unstressed, toneless, neutral vowel
    - Some schwas are deleted and some are not
    - Schwa deletion is an important issue for grapheme-to-phoneme conversion
    - Handled using a well-established schwa deletion algorithm
    - Example: Priyatama, where the last "a" changes the gender

  • Proposed Transliteration Model

    [Diagram components: Syllabification Modules, SRILM, Moses Training, Moses Decoder. Data flowing between them: Source Language Words, Target Language Words, Source Language Syllables, Target Language Syllables, Target Language Model, Phrase translation tables, Transliterated output]

  • Transliteration system workflow

    - Syllabification of a parallel list of names in Roman and Devanagari
    - Using this parallel list for:
      - Alignment of syllables
      - Training the Moses translation toolkit
    - Language model generation using SRILM
    - Decoding using the trained phrase-translation tables and language model
    - Comparing results to analyze performance

  • Experiments and Results

    Syllabification of Roman and Devanagari words

    Fig: Syllabification Algorithm

  • Syllabification results

    A few examples

    [Table columns: Language, Correct, Incorrect. English examples: Soloman, Akbarkhan, Venkatachalam. Hindi examples: missing from this transcript]

  • Transliteration Process

    - Syllabifying a list of 10000 parallel names written in Roman and Devanagari and preparing a parallel aligned list of syllables
    - Training language models for the target language using the SRILM toolkit
    - Training Moses with an aligned corpus of 7500 names and the target language model as input
    - Testing on a list of 2500 proper names using the trained model for transliteration
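The data preparation implied by these steps can be sketched as follows. This is a sketch under stated assumptions, not the presenter's pipeline: the helper names and file paths are hypothetical, the random shuffle is an assumption (the slides do not say how the 7500/2500 split was made), and the only format fact relied on is that Moses reads parallel text files with one space-separated token sequence per line, so each syllable becomes one token.

```python
import random


def to_token_line(syllables):
    """Join the syllables of one name so each syllable is a separate token."""
    return " ".join(syllables)


def split_corpus(pairs, train_size, seed=0):
    """Shuffle a copy of the pairs and split it into (train, test)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # assumption: random split
    return shuffled[:train_size], shuffled[train_size:]


def write_parallel(pairs, source_path, target_path):
    """Write line-aligned source and target files, one name per line."""
    with open(source_path, "w", encoding="utf-8") as src, \
         open(target_path, "w", encoding="utf-8") as tgt:
        for source_syllables, target_syllables in pairs:
            src.write(to_token_line(source_syllables) + "\n")
            tgt.write(to_token_line(target_syllables) + "\n")


# Illustrative pair only; not taken from the presentation's data.
example_pairs = [(["san", "deep"], ["सं", "दीप"])]
```

With the full list of 10000 syllabified pairs, `split_corpus(pairs, 7500)` would yield the 7500 training and 2500 test names, and `write_parallel` would produce the two line-aligned files that training expects.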

  • Roman to Devanagari Transliteration

    Fig: Result for Roman to Devanagari Transliteration

    Fig: Top-n Inclusion results

  • Devanagari to Roman Transliteration

    Fig: Result for Devanagari to Roman Transliteration

    Fig: Top-n translation results
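The slides do not define the top-n inclusion measure. A minimal sketch follows, assuming it is the fraction of test names whose reference transliteration appears among the decoder's first n candidates. The function name and the toy data are hypothetical.

```python
def top_n_inclusion(candidates, references, n):
    """Fraction of test items whose reference is among the first n candidates.

    candidates: one ranked list of output strings per test item (best first)
    references: one gold transliteration per test item
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(candidates) != len(references):
        raise ValueError("candidates and references must be the same length")
    if not references:
        return 0.0
    hits = sum(ref in ranked[:n] for ranked, ref in zip(candidates, references))
    return hits / len(references)


# Toy data for illustration only; not results from the presentation.
toy_candidates = [["sandeep", "sandip"], ["abhijit", "abhijeet"], ["mukundan"]]
toy_references = ["sandeep", "abhijeet", "mukund"]
top1 = top_n_inclusion(toy_candidates, toy_references, 1)
top2 = top_n_inclusion(toy_candidates, toy_references, 2)
```

On the toy data one reference of three is ranked first and two of three fall within the top two, so the measure can only stay level or grow as n increases.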

  • Comparison with Character n-gram based model

    - Same experimental setup; transliteration units changed to n-grams
    - Bigrams (Sandeep: Sa, an, nd, de, ee, ep)
    - Trigrams (Sandeep: San, and, nde, dee, eep)
    - Quadrigrams (Sandeep: Sand, ande, ndee, deep)
    - Observations suggest that using syllables as transliteration units improves performance
    - n-gram based models are blind to phonological properties such as unstressed vowels

    Fig: Comparison with N-gram based model
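The n-gram units listed for "Sandeep" are overlapping character windows. A minimal sketch of how such units can be generated is given below; the function name is an assumption for this illustration.

```python
def char_ngrams(word, n):
    """Return the overlapping character n-grams of `word`, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A word shorter than n yields an empty list.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


bigrams = char_ngrams("Sandeep", 2)
trigrams = char_ngrams("Sandeep", 3)
quadrigrams = char_ngrams("Sandeep", 4)
```

The windows are cut at fixed character offsets regardless of where vowels fall, which is the sense in which these units ignore phonological structure.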

  • Comparison with State-of-the-art Systems

    - Google's transliteration engine and Quillpad used as benchmarks for comparison
    - A list of 1000 words written in the Roman alphabet used as test input
    - Our system outperforms Quillpad and falls just short of Google's results
    - More extensive training with a larger training set might improve system performance

    Fig: Comparison with State-of-the-art transliteration systems

  • Discussions

    - Accents: Thoda or thora?
    - Mapping of sounds: Mahaan, Kahaan
    - Silent letters: Psychiatrist

  • Discussions (contd.)

    - Improper schwa deletion: Venkatachalam
    - Improper placement of a consonant (Onset or Coda)
    - Similar phonological structure but different pronunciation

  • Conclusion and Future Work

    - Transliteration can prove critical in supporting Machine Translation
    - Phonologically aware transliteration units such as syllables show strong signs of performance improvement
    - Syllable-based transliteration performs at a level comparable to the state-of-the-art systems
    - Syllabification algorithms need further improvement
    - The developed system should be given a larger and more accurate training set
    - Some of the linguistic issues discussed above are very challenging cases for future work on transliteration

  • References

    - Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K. 2003. Fuzzy Translation of Cross-Lingual Spelling Variants. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
    - Gao W., Lam W., and Wang K. 2004. Phoneme-based Transliteration of Foreign Names for OOV Problem. International Joint Conference on Natural Language Processing.
    - Fujimura O. 1975. Syllable as a Unit of Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing.
    - Koehn P. et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), Demonstration Session, Prague, Czech Republic.
    - Laver J. 1994. Principles of Phonetics. Cambridge University Press. p. 114.
    - Knight K. and Graehl J. 1997. Machine Transliteration. Proceedings of ACL 1997. pp. 128-135.
    - Stolcke A. 2002. SRILM: An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing.
    - Choudhury M. and Basu A. 2002. A Rule Based Schwa Deletion Algorithm for Hindi. Technical Report. Dept. of Comp. Sci. & Engg., Indian Institute of Technology, Kharagpur.

  • Background Theory

  • Approaches towards Transliteration

  • Complex Syllable Structure

    Fig: Detailed syllable structure

    Fig: Complex syllables fitting in the above structure

  • Sonority theory & syllables

    A syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower-sonority elements.

    It is represented as a wave of sonority, the Sonority Profile of that syllable, with its Onset, Nucleus and Coda.

  • Sonority Hierarchy for English and Hindi

    Fig: Sonority hierarchy for English

    Sound Segment    Letters contained
    Vowels           {"a", "e", "i", "o", "u"}
    Liquids          {"y", "r", "l", "v", "w"}
    Nasals           {"n", "m"}
    Fricatives       {"s", "z", "f", "th", "h", "sh", "x"}
    Affricates       {"ch", "j"}
    Stops            {"b", "d", "g", "p", "t", "k", "q", "c"}

    Fig: Sonority hierarchy for Hindi

    Sound segments: Vowels, Matras, Liquids, Nasals, Fricatives, Affricates, Stops (the Devanagari letters of each class are missing from this transcript)
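A minimal sketch of how the English table could be used to compute a sonority profile and check the sonority sequencing principle. The letter classes are copied from the table; the numeric ranks, the function names and the tolerance for plateaus such as "ee" are assumptions made for this illustration, not the presenter's implementation.

```python
# Letter classes from the English table. The numeric ranks (higher means more
# sonorous) are an assumption; the table only gives the ordering of classes.
SONORITY_CLASSES = [
    ("vowel", 6, ["a", "e", "i", "o", "u"]),
    ("liquid", 5, ["y", "r", "l", "v", "w"]),
    ("nasal", 4, ["n", "m"]),
    ("fricative", 3, ["s", "z", "f", "th", "h", "sh", "x"]),
    ("affricate", 2, ["ch", "j"]),
    ("stop", 1, ["b", "d", "g", "p", "t", "k", "q", "c"]),
]
RANK = {unit: rank for _, rank, units in SONORITY_CLASSES for unit in units}


def segment(word):
    """Split a word into table units, preferring the digraphs th, sh, ch."""
    word = word.lower()
    units, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            units.append(pair)
            i += 2
        elif word[i] in RANK:
            units.append(word[i])
            i += 1
        else:
            raise ValueError(f"no sonority class for {word[i]!r}")
    return units


def sonority_profile(word):
    """Return the sonority rank of each unit in the word."""
    return [RANK[unit] for unit in segment(word)]


def obeys_ssp(syllable):
    """True if sonority never falls before the peak nor rises after it."""
    profile = sonority_profile(syllable)
    peak = profile.index(max(profile))
    rising = all(a <= b for a, b in zip(profile[:peak], profile[1:peak + 1]))
    falling = all(a >= b for a, b in zip(profile[peak:], profile[peak + 1:]))
    return rising and falling


abhijeet_profile = sonority_profile("abhijeet")
```

Under these ranks the whole word "abhijeet" fails the check because its profile has several peaks, while a single syllable such as "jeet" passes.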

  • Maximal Onset Principle

    Intervocalic consonants are maximally assigned to the Onsets of syllables, in conformity with universal and language-specific conditions.

    When a word has two valid syllabifications, the one with the longer onset is preferred.

    Example: Diploma is syllabified as Di + plo + ma rather than Dip + lo + ma.
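A minimal sketch of the principle for Roman-script words follows. Each cluster of consonants between two vowel groups is split so that the longest legal onset goes to the following syllable. The set of legal onsets is a small illustrative assumption, not the language-specific conditions used in the presentation, and each run of adjacent vowels is treated as one nucleus for simplicity.

```python
VOWELS = set("aeiou")
# Assumption: a tiny, incomplete sample of legal two- and three-letter onsets.
LEGAL_ONSETS = {"pl", "pr", "bl", "br", "tr", "dr", "kl", "kr",
                "gl", "gr", "fl", "fr", "sp", "st", "sk", "str"}


def is_legal_onset(cluster):
    """Empty and single-consonant onsets are always allowed in this sketch."""
    return len(cluster) <= 1 or cluster in LEGAL_ONSETS


def syllabify(word):
    """Split a word into syllables using the maximal onset principle."""
    word = word.lower()
    # Locate the nuclei: maximal runs of vowel letters, as (start, end) spans.
    nuclei, i = [], 0
    while i < len(word):
        if word[i] in VOWELS:
            j = i
            while j < len(word) and word[j] in VOWELS:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    if not nuclei:
        return [word]
    starts = [0]
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        cluster = word[prev_end:next_start]
        # The smallest k gives the longest suffix that is a legal onset;
        # the first k consonants stay behind as the previous syllable's coda.
        k = next(k for k in range(len(cluster) + 1)
                 if is_legal_onset(cluster[k:]))
        starts.append(prev_end + k)
    return [word[a:b] for a, b in zip(starts, starts[1:] + [len(word)])]


diploma = syllabify("diploma")
```

For "diploma" the cluster "pl" is a legal onset, so the sketch yields di + plo + ma; for "andrew" only "dr" is legal, giving an + drew.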

  • Schwa deletion algorithm

    Procedure delete_schwa (DS)
    Input: word (string of alphabets)
    Output: input word with some schwas deleted

    1. Mark all the full vowels, the consonants followed by vowels other than the inherent schwa (i.e. consonants with matras) and all the h's in the word as F, unless explicitly marked as half by use of a halant.
    2. Mark all the consonants immediately followed by consonants or halants (i.e. consonants of conjugate syllables) as H.
    3. Mark all the remaining consonants, which are followed by implicit schwas, as U.
    4. If y is marked U and preceded by i, I, ri, u or U, mark it F.
    5. If y, r, l or v are marked U and preceded by consonants marked H, mark them F.
    6. If a consonant marked U is followed by a full vowel, mark that consonant F.
    7. While traversing the word from left to right, if a consonant marked U is encountered before any consonant or vowel marked F, mark that consonant F.
    8. If the last consonant is marked U, mark it H.
    9. If any consonant marked U is immediately followed by a consonant marked H, mark it F.
    10. While traversing the word from left to right, for every consonant marked U, mark it H if it is preceded by F and followed by F or U; otherwise mark it F.
    11. For every consonant marked H, if it is followed by a schwa in the original word, delete that schwa from the word. The resulting word is the required output.

    End procedure delete_schwa

  • Example of Schwa deletion

    Fig: Application of Schwa deletion Algorithm

  • Examples

    Correct Transliterations

    Source Language    Target Language    Examples
    English            Hindi              REYMOND, MUKUNDAN
    Hindi              English            SALEEMABEGAM, RAJGURU

    Incorrect Transliterations

    Source Language    Target Language    Examples
    English            Hindi              VENKATACHALAM, DHRUVA (UNK)
    Hindi              English            COMLATA