Presenter Gurpreet Singh Lehal, Department of Computer Science, Punjabi University, Patiala

Post on 28-Dec-2015

TRANSCRIPT

  • Presenter
    Gurpreet Singh Lehal, Department of Computer Science, Punjabi University, Patiala

  • Introduction
    Punjabi is written in two mutually incomprehensible scripts: Gurmukhi and Shahmukhi.
    Nearly 2 crore people in India use the Gurmukhi script, while 6 crore people in Pakistan use the Shahmukhi script.
    Most medieval Punjabi literature is in Shahmukhi, while modern literature is in Gurmukhi.
    It is necessary to break this script wedge, as very few Punjabi readers can read both scripts.

  • Issues in Shahmukhi-Gurmukhi Transliteration
    Multiple mappings for some Gurmukhi characters in the Shahmukhi character set
    No exact equivalent mappings in Gurmukhi for some Shahmukhi characters
    Missing nukta symbols in Gurmukhi text
    Difference between pronunciation and orthography
    Transliteration of proper nouns

  • Multiple Mappings for Gurmukhi Characters

  • Multiple Mappings for Gurmukhi Characters
    The less frequently occurring, similar-sounding Shahmukhi characters have a frequency of occurrence of 1.45%.
    The average length of a Shahmukhi word is 3.55 characters.
    Any rule-based system that does not take care of these characters will therefore have a word-level error rate of about 1.45 × 3.55 ≈ 5.15% due to multiple mappings.
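The word-level estimate on this slide multiplies the per-character frequency by the average word length. A minimal Python sketch of that arithmetic follows. The variable names are illustrative, and the second figure rests on an added assumption, not made in the slides, that the ambiguous characters occur independently of one another.

```python
# Figures taken from the slide.
char_rate = 0.0145      # share of characters with an ambiguous mapping (1.45%)
avg_word_len = 3.55     # average Shahmukhi word length in characters

# Slide's estimate: expected number of ambiguous characters per word.
linear_estimate = char_rate * avg_word_len

# Assumption (not in the slides): ambiguous characters occur independently,
# so a word is affected when at least one of its characters is ambiguous.
independent_estimate = 1 - (1 - char_rate) ** avg_word_len

print(f"linear estimate:      {linear_estimate:.2%}")
print(f"independent estimate: {independent_estimate:.2%}")
```

Because the per-character rate is small, the two estimates differ only slightly, so the simple product is a reasonable upper-end approximation.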

  • Zero Mapping for Some Shahmukhi Characters
    There are certain Shahmukhi characters, such as ع (ain) and ء (hamza), for which there are no exact equivalent Gurmukhi characters. Hamza can be generated using character combination rules, but there are no specific rules for ain.
    For example, consider the following cases: -> -> -> ->
    It is not possible to specify rules for generating ain in the Shahmukhi words above; it depends on the individual case.
    Corpus statistics show that ain has a frequency of occurrence of 0.42%.

  • Missing Nukta Symbols in Gurmukhi Text
    Five consonants ( ) were added to the original 35 characters of the Gurmukhi script to accommodate sounds from Arabic and Persian.
    Punjabi speakers now do not make a distinction between , and and drop the nukta symbol. As a result, these characters have a combined frequency of occurrence of only 0.17%, compared to 1.35% for their counterparts in Shahmukhi.
    We found that the word occurs 69 times in the Gurmukhi corpus, while the word occurs 147 times.
    So will be transliterated to in Shahmukhi, which is wrong; the actual transliteration is , which is obtained if the correct Gurmukhi spelling is used.

  • Difference Between Pronunciation and Orthography
    In certain cases, Gurmukhi words are written with short vowels but pronounced with long vowels. The equivalent words in Shahmukhi are also written with long vowels, so a rule-based system that maps the Gurmukhi short vowels directly to Shahmukhi gives wrong results in such cases.
    Some examples of such words are , and . They are pronounced as but written with short vowels, while the corresponding words in Shahmukhi are written with long vowels as , and respectively.

  • Transliteration of Proper Nouns
    Proper nouns, such as names of persons and places, frequently have typical spellings in Shahmukhi, and it is not possible to formulate transliteration rules that generate such spellings.
    For example: -> -> ->

  • System Architecture

  • Lexical Resources Used
    Gurmukhi spell checker: 41,253 root words
    Gurmukhi normalised spellings: 1,193 words
    Gurmukhi-Shahmukhi dictionary: 10,254 terms
    Shahmukhi corpus: 97,63,294 total words; 1,93,679 unique words

  • Pre-Processing Stage
    The Gurmukhi word is cleaned and prepared for transliteration by passing it through the Gurmukhi spell checker and normalising it according to Shahmukhi spellings and pronunciation.
    -> -> -> ->
    Resources used: Gurmukhi spell checker, Gurmukhi normalised spellings
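The two pre-processing steps can be sketched as two table lookups applied in sequence. This is only an illustration under stated assumptions: the function name `preprocess`, the table names, and both table entries are hypothetical, and the real spell checker is dictionary-based rather than a fixed correction table. The normalisation entry restores a dropped nukta, one of the issues listed earlier.

```python
# Code points from the Gurmukhi block, written as escapes for clarity.
JA = "\u0a1c"      # letter ਜ
RA = "\u0a30"      # letter ਰ
UU = "\u0a42"      # vowel sign ੂ
NUKTA = "\u0a3c"   # nukta sign

# Hypothetical spell-checker correction: a doubled vowel sign (typing slip).
SPELL_FIXES = {JA + RA + UU + UU + RA: JA + RA + UU + RA}

# Hypothetical normalisation entry: restore the nukta that writers often drop.
NORMALISED = {JA + RA + UU + RA: JA + NUKTA + RA + UU + RA}


def preprocess(word, spell_fixes=SPELL_FIXES, normalised=NORMALISED):
    """Spell-correct a Gurmukhi word, then normalise its spelling."""
    corrected = spell_fixes.get(word, word)
    return normalised.get(corrected, corrected)


# The misspelt word is first corrected, then given its nukta.
print(preprocess(JA + RA + UU + UU + RA))
```

Words found in neither table pass through unchanged, so the step is safe to apply to every token.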

  • Processing Stage
    The normalised Gurmukhi word is transliterated to Shahmukhi using:
    Dictionary lookup
    Mapping rules
    The Gurmukhi-Shahmukhi dictionary is used for directly transliterating frequently occurring Gurmukhi words and Gurmukhi words with typical Shahmukhi spellings.
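The order of the two mechanisms, dictionary first and rules as the fallback, can be sketched as follows. The function names, the single dictionary entry (a proper noun with a conventional Shahmukhi spelling), and the stand-in rule function are all assumptions made for illustration; none of them comes from the slides.

```python
# Hypothetical dictionary entry: the name "Muhammad",
# Gurmukhi spelling mapped to its conventional Shahmukhi spelling.
GURMUKHI_NAME = "\u0a2e\u0a41\u0a39\u0a70\u0a2e\u0a26"
SHAHMUKHI_NAME = "\u0645\u062d\u0645\u062f"
DICTIONARY = {GURMUKHI_NAME: SHAHMUKHI_NAME}


def transliterate_word(word, rule_based, dictionary=DICTIONARY):
    """Return the dictionary spelling if present, else apply the mapping rules."""
    if word in dictionary:
        return dictionary[word]
    return rule_based(word)


def tag_as_rule_output(word):
    """Stand-in for the rule-based mapper; it only marks the fallback path."""
    return "<rules:" + word + ">"


print(transliterate_word(GURMUKHI_NAME, tag_as_rule_output))
print(transliterate_word("xyz", tag_as_rule_output))
```

Passing the rule-based mapper in as a parameter keeps the lookup logic independent of whichever mapping rules are used.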

  • Processing Stage: Mapping Rules
    Gurmukhi letters are directly mapped to similar-sounding Shahmukhi characters. For example: ->, -> -> ->, ->, -> -> , -> , -> , ->
    Where there are multiple equivalent Shahmukhi characters, the most frequently occurring Shahmukhi character is selected. Thus is mapped to and is mapped to
    For example, the word will be converted to Shahmukhi as follows: + + + -> + + + =
    If two vowels in Gurmukhi come together, the character hamza is placed between them in Shahmukhi. + -> + + -> +
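A minimal sketch of these three rules (direct mapping, choosing the most frequent candidate, and hamza insertion between adjacent vowels) is given below. The mapping table is a small illustrative subset chosen for this example, not the table from the slides; the ordering of candidates and the use of U+0626 as the hamza form are likewise assumptions.

```python
# Illustrative subset: each Gurmukhi letter lists its Shahmukhi candidates,
# most frequent first (the ordering is an assumption).
CANDIDATES = {
    "\u0a15": ["\u06a9"],                      # ka  -> kaf
    "\u0a2e": ["\u0645"],                      # ma  -> meem
    "\u0a32": ["\u0644"],                      # la  -> laam
    "\u0a38": ["\u0633", "\u0635", "\u062b"],  # sa  -> seen, suad, se
    "\u0a24": ["\u062a", "\u0637"],            # ta  -> te, toe
    "\u0a06": ["\u0622"],                      # aa  -> alef with madda
    "\u0a08": ["\u06cc"],                      # ii  -> ye
}
VOWELS = {"\u0a06", "\u0a08"}   # independent vowels covered by this sketch
HAMZA = "\u0626"                # assumed hamza form placed between vowels


def map_word(word):
    """Map a Gurmukhi word to Shahmukhi character by character."""
    out = []
    prev_was_vowel = False
    for ch in word:
        is_vowel = ch in VOWELS
        if is_vowel and prev_was_vowel:
            out.append(HAMZA)                     # hamza between two vowels
        out.append(CANDIDATES.get(ch, [ch])[0])   # most frequent candidate
        prev_was_vowel = is_vowel
    return "".join(out)


print(map_word("\u0a15\u0a2e\u0a32"))  # three consonants, mapped one to one
print(map_word("\u0a06\u0a08"))        # two vowels, hamza inserted between them
```

Characters missing from the table are passed through unchanged, which makes gaps in the table easy to spot in the output.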

  • Processing Stage: Mapping Rules
    Besides these simple mapping rules, some special pronunciation-based rules have also been developed. Some of these rules are: + -> ( -> ) + -> (-> ) + -> ( -> ) + -> ( -> )

  • Post-Processing Stage
    The spelling of the Shahmukhi word generated in the processing stage is checked and corrected. The major source of spelling errors in the transliterated Shahmukhi words is the multiple character mappings in Shahmukhi.
    For example, the words , , and will be transliterated as , and while the actual spellings are , , and respectively.
    Resources used: Shahmukhi word frequency list

  • Post-Processing Stage
    For Gurmukhi characters with multiple Shahmukhi mappings, word forms using all the possible mappings are generated, and the form with the highest frequency of occurrence in the Shahmukhi word frequency list is selected.
    For example, consider the word . Both the characters and have multiple Shahmukhi mappings. The Shahmukhi word generated in the processing stage is . From this word, all its forms are generated. A search for each of these forms in the Shahmukhi corpus reveals that the word has 2045 occurrences, while none of the other forms occurs even once. Thus the word is transliterated to .

  • Occurrences of each generated form in the Shahmukhi corpus: 0, 0, 0, 0, 0, 0, 0, 2045
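The post-processing step described above, generating every spelling variant and keeping the one most frequent in the corpus, can be sketched as follows. The alternatives table, the example word, and the frequency count are all assumptions chosen for illustration; the example happens to yield eight forms, but it is not claimed to be the word shown on the slide.

```python
from itertools import product

# Shahmukhi code points used in this sketch.
HEH, HAH = "\u06c1", "\u062d"     # two letters for the "h" sound
ZAIN, ZAAL, ZAAD, ZOE = "\u0632", "\u0630", "\u0636", "\u0638"  # four for "z"
ALEF, REH = "\u0627", "\u0631"

# Hypothetical table: default character -> all similar-sounding alternatives.
ALTERNATIVES = {
    HEH: [HEH, HAH],
    ZAIN: [ZAIN, ZAAL, ZAAD, ZOE],
}


def candidate_forms(word, alternatives=ALTERNATIVES):
    """Generate every spelling obtained by swapping in alternative characters."""
    options = [alternatives.get(ch, [ch]) for ch in word]
    return ["".join(combo) for combo in product(*options)]


def best_spelling(word, freq, alternatives=ALTERNATIVES):
    """Pick the most frequent attested form; keep the input if none is attested."""
    forms = candidate_forms(word, alternatives)
    best = max(forms, key=lambda form: freq.get(form, 0))
    return best if freq.get(best, 0) > 0 else word


rule_output = HEH + ALEF + ZAIN + REH       # spelling produced by the rules
toy_freq = {HAH + ALEF + ZAAD + REH: 120}   # made-up corpus count
print(len(candidate_forms(rule_output)))
print(best_spelling(rule_output, toy_freq))
```

Falling back to the rule output when no form is attested is a design choice of this sketch; it corresponds to the failure case of words absent from the corpus.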

  • An Example
    To illustrate how a Gurmukhi word is transformed in each of the three stages, we take the following sample sentence in Gurmukhi.
    After Gurmukhi spell checking in the pre-processing stage, the sentence becomes
    The text after normalisation becomes

  • An Example
    Since the word has a typical spelling, the output after dictionary lookup in the processing stage is:
    The output from the processing stage after rule-based transliteration is:
    The final output after running the spell checker in the post-processing stage is:
    The output we got from another existing system is:

  • Failure Cases
    The system fails in the following cases:
    Multiple spellings for Gurmukhi words. For example, can get mapped to both and ; similarly, the word gets mapped to both and . The correct spelling can be selected only after context analysis.
    Words with typical spellings that are not present in the Gurmukhi-Shahmukhi dictionary and the Shahmukhi corpus.

  • Experimental Results
    We tested our system on 121 pages of text compiled from newspapers, books and poetry. The results are compared with Puran and Utrans, the two transliteration systems available on the net.
    Transliteration Accuracy

    We observed that the main sources of improvement in transliteration accuracy over the existing systems were:
    The pre-processing stage, in which wrong Gurmukhi spellings are corrected and the spellings of some words are modified according to their pronunciation.
    Development of transliteration rules for special cases.
    Usage of the Gurmukhi-Shahmukhi dictionary.
    Correction of Shahmukhi spellings with the help of the Shahmukhi corpus.

  • THANK YOU