cs460/626 : natural language processing/speech …pb/cs626-460-2011/cs626-460...thus, only a limited...

44
CS460/626 : Natural Language Processing/Speech NLP and the Web Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) transliteration) Pushpak Bhattacharyya Pushpak Bhattacharyya CSE Dept., IIT Bombay 11 th A il 2011 11 th April, 2011

Upload: hoangkiet

Post on 29-May-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

  • CS460/626 : Natural Language Processing/Speech NLP and the WebProcessing/Speech, NLP and the Web

    (Lecture 36syllabification and transliteration)transliteration)

    Pushpak BhattacharyyaPushpak BhattacharyyaCSE Dept., IIT Bombay

    11th A il 201111th April, 2011

  • Access from one language to another

    Language L1 Language L2

    T l ti T lit tiTranslation Transliteration

    Gandhiji Boy

  • Cross Lingual Query

    Query Translatio

    n

    Search

    Birthplace of Gandhi Search

    Engine

    Ga d

    English Document

    Translate Document

    Output

  • Sound based transliterationSound based transliteration

    TransliterationScript Script ()(Gandhiji)

    Transliteration using pronunciationConvert script to soundConvert sound to scriptConvert sound to script

    Similar to interlingua based MT

    MeaningSentence in source Language

    Sentence in target LanguageLanguage Language

  • Two approaches to transliterationTwo approaches to transliterationSyllabification

    Rule Machine Based Learning Based

    Word SyllabifiedApple ap pleFundamental fun da men tal

    Use GIZA++ Moses on parallel corpus

  • Phoneme-based approachPhoneme-based approach

    Word in Word in P(w )Source language Target language

    P( ps | ws) P ( wt | pt )

    P(wt)

    Pronunciationin

    Source language

    PronunciationIn

    target languageP ( pt | ps )

    Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) )

    Note: Phoneme is the smallest linguistically distinctive unit of sound.

  • Basics of syllablesBasics of syllablesSyllable is a unit of spoken languageSyllable is a unit of spoken languageconsisting of a single uninterrupted soundformed generally by a Vowel and preceded orfollowed by one or more consonants.

    Vowels are the heart of a syllable (MostSonorous Element) (svayam raajate itisvaraH)Consonants act as sounds attached to

    lvowels.

  • Syllable structure

    A syllable consists of 3 major parts:-Onset (C)( )Nucleus (V)Coda (C)( )

    Vowels sit in the Nucleus of a syllableConsonants may get attached as OnsetConsonants may get attached as Onset or Coda.Basic structure - CVBasic structure CV

  • Possible syllable structuresThe Nucleus isThe Nucleus is always presentOnset and CodaOnset and Coda may be absentPossiblePossible structures

    VVCVVCCVC

  • Introduction to sonority theoryThe Sonority of a sound is its loudness

    relative to other sounds with the same length stress and speech length, stress and speech.

    Some sounds are more sonorousSome sounds are more sonorousWords in a language can be divided into syllablesySonority theory distinguishes syllables on the basis of sounds.

  • Sonority hierarchyDefined on the basis of amount of sound associatedThe sonority hierarchy is as follows:-

    Vowels (a, e, i, o, u)Li id ( l )Liquids (y, r, l, v)Nasals (n, m)Fricatives (s z f sh th etc )Fricatives (s, z, f,..sh, th etc.)Affricates (ch, j)Stops (b, d, g, p, t, k)p ( , , g, p, , )

  • Sonority scaleObstruents canObstruents can be further classified into:-classified into:

    FricativesAffricatesAffricatesStops

  • Sonority theory & syllablesA Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements the surrounding lower sonority elements.

    Represented as waves of sonority orRepresented as waves of sonority or Sonority Profile of that syllable

    Nucleus

    Onset Coda

  • Sonority sequencing principle

    The Sonority Profile of a syllable must rise until its Peak(Nucleus), and then fall.( ),

    PeakPeak(Nucleus)

    Onset Coda

  • Maximal onset principleThe Intervocalic consonants are maximally

    assigned to the Onsets of syllables inf it ith U i l d Lconformity with Universal and Language-

    Specific Conditions.Determines underlying syllable divisionDetermines underlying syllable divisionExample

    DIPLOMADIPLOMADIP LO MA & DI PLOMAMA

  • Syllable Structure: a more detailed look

    Count of no. of syllables in a word is roughly/intuitively the no. of vocalic segments in a word.Thus, presence of a vowel is an obligatory element in the structure of a syllable. This vowel is called nucleus.Basic Configuration: (C)V(C).Part of syllable preceding the nucleus is called the onset.Elements coming after the nucleus are called the codaElements coming after the nucleus are called the coda.Nucleus and coda together are referred to as the rhyme.

    S Syllable, O OnsetR Rhyme, N NucleusCo Coda

  • Syllable Structure: ExamplesSyllable Structure: Examples

    word

    sprint

  • Syllable Structure: ExamplesSyllable Structure: Examples

    may

    No Coda.

    opt

    No Onset.No Onset.

    air

    No Coda, No Onset.

  • Syllable StructureSyllable StructureOpen Syllable: ends in vowelClosed syllable: ends in consonant or consonant cluster

    Light Syllable: A syllable which is open and ends in a shortLight Syllable: A syllable which is open and ends in a short vowel

    General Description CV.Example, air.Example, air .

    Heavy Syllable: Closed syllables or syllables ending in diphthong

    Example: optExample: optExample, may

  • Syllabification: Determining Syllable Boundaries

    Given a string of syllables (word), what is the coda of one and the onset of another?

    In a sequence such as VCV, where V is any vowel and C is any consonant, is the medial C the coda of the first syllable (VC.V) or the onset of the second syllable (V.CV)? ( ) y ( )

    To determine the correct groupings, there are some rules, two of them being the most important and significant:two of them being the most important and significant:

    Maximal Onset Principle, Sonority Hierarchy

  • Rule Based SyllabificationRule Based Syllabification

    (This and statistical syllabification and application to Transliteration: DDP work by

    Ankit Aggrwal 2009)Ankit Aggrwal, 2009)

  • PhonotacticsPhonotactics

    Guidesconstructionofonsetsandcodas

    Restrictscombinationofphonemes.

    Ingeneral,rulesoperatearoundthesonorityhierarchy.E.g.,Fricative/s/isloweronthesonorityhierarchythanthelateral/l/,sothecombination /sl/ is permitted in onsets and /ls/ is permitted in codas Oppositecombination/sl/ispermittedinonsetsand/ls/ispermittedincodas.Oppositeisnotallowed.

    Thus,slipsandpulsearepossibleEnglishwords.

    lsipsandpuslarenotpossible.

  • Constraints on Onsets1

    Oneconsonant:Only//cannotoccurinsyllableinitialposition.

    Twoconsonant:Werefertothescaleofsonority.Sequencernisruledoutsincethereisadecreaseofsonority.

    MinimalSonorityDistance:Distanceinsonoritybetweenthefirstandthesecondelementintheonsetmustbeofatleast2degrees.

    Thus,onlyalimitedno.ofpossibletwoconsonantclusters.

    Threeconsonant:Restrictedtolicensedtwoconsonantonsetsprecededby/s/.

    Also,/s/canonlybefollowedbyavoicelesssound.

    Therefore,only/spl/,/spr/,/str/,/skr/,/spj/,/stj/,/skj/,/skw/,/skl/,/ j/ ill b ll d ( li )/smj/willbeallowed.(splinter,spray,strongetc.)

    While/sbl/,/sbr/,/sdr/,/sgr/,/sr/willberuledout.

  • Constraints on Onsets2

    Possible2consonatclustersinanOnset

  • Constraints on Coda1Constraints on Coda

  • Constraints on Coda2Constraints on Coda

  • Other ConstraintsOther Constraints

    Nucleus:Thefollowingcanoccurasnucleus:Allvowelsounds(monophthongs aswellasdiphthongs).

    Syllabic:yBoththeonsetandthecodaareoptional(asseenpreviously).

    /j/attheendofanonset(/pj/,/bj/,/tj/,/dj/,/kj/,/fj/,/vj/,/j/,/sj/,/zj/,/hj/,/mj/,/nj/,/lj/,/spj/,/stj/,/skj/)mustbefollowedby/u/or//.

    Longvowelsanddiphthongsarenotfollowedby//.//israreinsyllableinitialposition.

    Stop+/w/before/u,,,a/areexcluded.

  • Implementation1p

    1. Identify the first nucleus (single or multiple consecutive vowels) in y ( g p )the word.

    2. Parse all the preceding consonants as the onset of the first syllable.3. Find the next nucleus in the word. If unsuccessful in finding another

    l 'll i l th t t th i ht f thnucleus, we'll simply parse the consonants to the right of the current nucleus as the coda of the first syllable, else we will move to the next step. Now the consonant cluster between these two nuclei have to be divided in two parts, one as coda of 1st syllable and the other onset of 2nd syllable.

    4. If the no. of consonants in the cluster is one, it will be the onset of the second nucleus as per the Maximal Onset Principle and Constrains on OnsetConstrains on Onset.

    5. If it is two, check whether both of these can go to the onset of the second syllable as per the allowable onsets.

  • Implementation2p6. If not , the first consonant will be the coda of the first syllable and

    the second consonant will be the onset of the second syllable.the second consonant will be the onset of the second syllable.

    7. If the no. of consonants in the cluster is three, check if they can b th t f th d ll bl if t h k f th l tbe the onset of the second syllable, if not, check for the last two, if not, parse only the last consonant as the onset of the second syllable.

    8. If the no. of consonants in the cluster is more than three, except the last three consonants, parse all the consonants as the coda of the first syllable. With the remaining 3 consonants, apply step 7step 7.

    9. Continue the procedure for the rest of the string.

  • AccuracyAccuracy=(no.ofwordscorrectlysyllabifiedx100) %

    no.ofwords

    10,000easywordswerechosentoestablishproofofconcept

    91wordswerefoundtobeincorrectlysyllabified.

    Accuracy:99.09%.

    Highprecision,lowrecall(acharacteristicofknowledgebasedAI)

  • Statistical syllabificationStatistical syllabification

  • Statistical Machine Syllabification

    DataSourcesofdata

    1. ElectionCommissionofIndia(ECI)NameList:ThiswebsourceprovidesnativeIndiannameswritteninbothEnglishandHindi.

    ( )2. DelhiUniversity(DU)StudentList:ThiswebsourcesprovidesnativeIndiannameswritteninEnglishonly.Thesenamesweremanuallytransliteratedforthepurposesoftrainingdata.

    3. IndianInstituteofTechnology,Bombay(IITB)StudentList:TheAcademicd a s u e o ec o ogy, o bay ( ) S ude sOfficeofIITBprovidedthisdataofstudentswhograduatedintheyear2007.

    4. NamedEntitiesWorkshop(NEWS)2009EnglishHindiparallelnames:Ali t f i d b t E li h d Hi di f i 11k i id dlistofpairednamesbetweenEnglishandHindiofsize11kisprovided.

    Westartwith8,000randomlychosennamesfromtheECINameli d ll bif h lllistandsyllabifythemmanually.

  • Statistical Machine Syllabification

    EffectofTrainingFormatSyllableseparatedFormat

    Source Targetsudakar sudakarchhagan chhaganj i t h ji t hjitesh jiteshnarayan narayanshiv shivmadhav madhavmohammad mohammadj a y a n t e e d e v i ja yan tee de vi

    SyllablemarkedFormat

    jayanteedevi jayanteedevi

    Source Targetsudakar su_da_karchhagan chha_ganjitesh ji_teshnarayan na_ra_yanshiv shivmadhav ma_dhavmohammad mo_ham_madjayanteedevi ja_yan_tee_de_vi

  • Statistical Machine Syllabification

    Syllableseparated Syllablemarked

    80%85%90%95%100%

    veAccuracy

    y p y

    60%65%70%75%80%

    1 2 3 4 5

    Cumulativ

    AccuracyLevel

    Top1Accuracy:71.8%(Syllableseparated);80.5%(Syllablemarked)

    T 5 A 83 4% (S ll bl d) 90 4% (S ll blTop5Accuracy:83.4%(Syllableseparated);90.4%(Syllablemarked)

  • Statistical Machine Syllabification

    EffectofDataSize4differentexperimentswereperformed:

    1. 8k:NamesfromtheECINamelist.

    2. 12k:Anadditional4knamesweresyllabifiedtoincreasethedatasize.

    3. 18k:IITBStudentListandtheDUStudentListwasincludedandsyllabified.

    4. 23k:SomemorenamesfromECINameListandDUStudentListweresyllabifiedandthisdataactsasthefinaldata forus.

    93.8%97.5% 98.3% 98.5% 98.6%

    90.0%

    95.0%

    100.0%

    ccuracy

    8k 12k 18k 23k

    70.0%

    75.0%

    80.0%

    85.0%

    CumulativeA

    1 2 3 4 5

    AccuracyLevel

  • Statistical Machine Syllabification

    EffectofLanguageModelngramOrder

    99.0%

    3gram 4gram 5gram 6gram 7gram

    95.0%

    97.0%

    ccuracy

    89.0%

    91.0%

    93.0%

    CumulativeAc

    85.0%

    87.0%

    1 2 3 4 5

    AccuracyLevel

  • Eff t f S ll bifi tiEffect of Syllabification on TransliterationTransliteration

  • Transliteration using syllabification

    Input String

    Moses

    Syllabified data

    (sourceside)Automatic S llabifieMoses Syllabifier

    S ll bifi d St i

    Learnt syllabifier

    Syllabified StringLearnt

    transliterator

    Moses Automatic TransliteratorSyllabified

    data(parallel)(parallel)

    Transliterated String

  • Statistical Machine Transliteration

    100%

    Syllableseparated Syllablemarked

    65%70%75%80%85%90%95%100%

    tive

    Accuracy

    45%50%55%60%65%

    1 2 3 4 5 6

    Cumulat

    A L l

    Top1Accuracy:60.1%(Syllableseparated);50.2%(Syllablemarked)

    Top 5 Accuracy: 87 2% (Syllable separated); 79 3% (Syllable marked)

    AccuracyLevel

    Top5Accuracy:87.2%(Syllableseparated);79.3%(Syllablemarked)

  • No syllabification: Baseline Performance

    Topn CorrectCorrect%age

    Cumulative%age

    Additional Additional

    1 1,868 41.5% 41.5%2 520 11.6% 53.1%3 246 5.5% 58.5%4 119 2 6% 61 2%4 119 2.6% 61.2%5 81 1.8% 63.0%

    Below5 1,666 37.0% 100.0%4,500

    TheTop5Accuracyisjust63.0%whichismuchlowerthanrequired.

    40

  • Manual, i.e., ideal syllabification: , , yskyline Performance

    Syllableseparated Syllablemarked

    80%85%90%95%100%

    veAccuracy

    y p y

    60%65%70%75%80%

    1 2 3 4 5

    Cumulativ

    AccuracyLevel

    Top1Accuracy:71.8%(Syllableseparated);80.5%(Syllablemarked)

    T 5 A 87 4% (S ll bl d) 90 4% (S ll blTop5Accuracy:87.4%(Syllableseparated);90.4%(Syllablemarked)

    41

  • Obsevations

    Transliteration likely to be helped by syllabificationsyllabificationRule based syllabification: high precision low recallprecision low recallStatistical syllabification: reasonable accuracy at top 5accuracy at top-5

  • ReferencesReferences1. NasreenAbdulJaleelandLeahS.Larkey.Statisticaltransliterationforenglish

    arabiccrosslanguageinformationretrieval.InConferenceonInformationandKnowledgeManagement,pages139146,2003.

    2. AnnK.FarmerAndrianAkmajian,RichardM.DemersandRobertM.Harnish.Linguistics:AnIntroductiontoLanguageandCommunication.MITPress,5thgu st cs t oduct o to a guage a d Co u cat o ess, 5tedition,2001.

    3. AssociationforComputerLinguistics.CollapsedConsonantandVowelModels:NewApproachesforEnglishPersianTransliterationandBackTransliteration,20072007.

    4. SlavenBilacandHozumiTanaka.Directcombinationofspellingandpronunciationinformationforrobustbacktransliteration.InConferencesonComputationalLinguisticsandIntelligentTextProcessing,pages413424,2005.

    5. IanLaneBingZhao,NguyenBachandStephanVogel.Aloglinearblocktransliterationmodelbasedonbistreamhmms.HLT/NAACL2007,2007.

  • References6. H.L.JinandK.F.Wong.AChinesedictionaryconstructionalgorithmfor

    informationretrieval.InACMTransactionsonAsianLanguageInformationProcessing,pages281296,December2002.

    7. K.KnightandJ.Graehl.Machinetransliteration.InComputationalLinguistics,pages24(4):599612,Dec.1998.pages ( ) 599 6 , ec 998

    8. LeeFengChienLongJiang,MingZhouandChenNiu.Namedentitytranslationwithwebminingandtransliteration.InInternationalJointConferenceonArtificialIntelligence(IJCAL07),pages16291634,2007.

    9. DanMateescu.EnglishPhoneticsandPhonologicalTheory.2003.

    10. DellaPietraP.BrownandR.Mercer.Themathematicsofstatisticalmachinetranslation:Parameterestimation.InComputationalLinguistics,page19(2):263U 311, 1990.19(2):263U 311,1990.

    11. GanapathirajuMadhaviPrahalladLavanyaandPrahalladKishore.AsimpleapproachforbuildingtransliterationeditorsforIndianlanguages.ZhejiangUniversitySCIENCE2005,2005.