indradhanush wordnet development for punjabi language

INDRADHANUSH WORDNET DEVELOPMENT FOR PUNJABI LANGUAGE Dr. Suman Preet Department of Linguistics and Punjabi Lexicography, Punjabi University, Patiala


Post on 11-Jan-2016


TRANSCRIPT

  • INDRADHANUSH WORDNET DEVELOPMENT FOR PUNJABI LANGUAGE
    Dr. Suman Preet, Department of Linguistics and Punjabi Lexicography, Punjabi University, Patiala

  • NATURE OF TASK
    Synset Creation for Nouns, Adjectives, Verbs and Adverbs
    Creation of Language Specific Synsets
    Sense Marking
    Validation
    New Synset Creation for Hindi WordNet

  • GOALS SET IN THE LAST PRSG
    To complete the linking of 36,534 synsets.
    Validation of 36,534 synsets.
    To create 1000 LSS.
    Creation and maintenance of individual WordNet group websites.
    To complete sense marking on 1,00,000 words.

  • Presentation Outline
    Financial Details
    Sense Marking Details
    Synset Creation Details
    Validation Details
    Problems and Suggestions

  • Financial Details

    Total grant sanctioned    Rs 22,14,000/-
    Total grant released      Rs 20,23,974/-
    1st year (released)       Rs 11,44,000/-
    2nd year (released)       Rs 8,79,974/-
    Recently released         Rs 1,86,833/-

  • Headwise Break-up of Expenditure

  • SENSE MARKING DETAILS
    Target: 1,00,000 words
    Division of target between Punjabi University and Thapar University:

                 Punjabi University   Thapar University
    Target       60,000               40,000
    Complete     60,182               33,097
    Remaining    0                    6,903

    The sense marking task was divided into two parts by mutual understanding, as shown above. The Punjabi University WordNet Group has achieved its target.
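    The "Remaining" row follows from the other two rows by subtraction. A minimal sketch of that arithmetic, using the figures from the table; the function names are illustrative and not part of the project's tooling:

```python
# Figures copied from the target table (Indian digit grouping removed).
targets = {"Punjabi University": 60_000, "Thapar University": 40_000}
completed = {"Punjabi University": 60_182, "Thapar University": 33_097}


def remaining(target: int, done: int) -> int:
    """Words still to be sense marked; zero once the target is met or exceeded."""
    return max(target - done, 0)


def percent_of_target(target: int, done: int) -> float:
    """Completed words as a percentage of the target, rounded to one decimal."""
    return round(100 * done / target, 1)


for group, target in targets.items():
    done = completed[group]
    print(group, remaining(target, done), percent_of_target(target, done))
```

    On these figures the remaining count for Thapar University is 6,903 words, matching the table.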

  • Sense Marking Status

    Sr. No.  Details                   Punjabi University  Thapar University  Total
    1        No. of files used         45                  53                 98
    2        Total words               1,38,735            78,143             2,16,878
    3        Total words sense marked  60,182              33,097             93,279
    4        Accuracy                  43.11%              42.35%             43%
    5        Target                    Complete            Incomplete

  • RECORD OF SENSE MARKING WORK BY PUNJABI UNIVERSITY
    Actions taken during sense marking:

    Action One  Action Two  Action Third  Action Fourth  Total Actions
    1102        30          388           269            1789

    Words added to the Punjabi synset file by actions one and two = 1132

    Type of Corpus     No. of Files  No. of Sentences  Total Words  Sense Marked Words  Accuracy
    News and Articles  45            6716              1,38,735     60,182              43.11

  • STATUS OF SYNSETS COMPLETED TILL 28 APRIL 2013

    Sr. No.  File Name   Total Synsets  Completed by Pbi. Uni.  Completed by TU             Complete Synsets  Remaining Synsets
    1.       Universal   7168           4084                    3084                        7168              0
    2.       Pan Indian  1347           674                     673                         1347              0
    3.       Verb        1798           807                     991                         1798              0
    4.       Adverb      209            105                     104                         209               0
    5.       Adjective   3605           1802                    1803                        3605              0
    6.       Noun        22050          11026 (complete)        5836 of 11024 (incomplete)  16862             5188 (TU's task)

    The synset creation task was divided into two parts by mutual understanding, as shown above. The Punjabi University Group has completed its synset creation task.
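    Each row of the status table can be cross-checked: the two groups' counts should add up to the completed count, and the total minus the completed count should equal the remaining count. A small sketch of that check, with the rows transcribed from the table; the `Row` type and `is_consistent` helper are illustrative names only:

```python
from typing import NamedTuple


class Row(NamedTuple):
    file_name: str
    total: int
    by_pu: int       # completed by Punjabi University
    by_tu: int       # completed by Thapar University
    completed: int
    remaining: int


ROWS = [
    Row("Universal", 7168, 4084, 3084, 7168, 0),
    Row("Pan Indian", 1347, 674, 673, 1347, 0),
    Row("Verb", 1798, 807, 991, 1798, 0),
    Row("Adverb", 209, 105, 104, 209, 0),
    Row("Adjective", 3605, 1802, 1803, 3605, 0),
    Row("Noun", 22050, 11026, 5836, 16862, 5188),
]


def is_consistent(row: Row) -> bool:
    """True when the per-group counts and the remainder agree with the totals."""
    return (row.by_pu + row.by_tu == row.completed
            and row.total - row.completed == row.remaining)


inconsistent = [r.file_name for r in ROWS if not is_consistent(r)]
print("Inconsistent rows:", inconsistent)
print("Total synsets across files:", sum(r.total for r in ROWS))
```

    All six rows pass this check, and the only outstanding work is the 5,188 noun synsets assigned to Thapar University.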

  • POS CATEGORY SYNSETS COMPLETED

    Category   Total Synsets
    Noun       19598
    Verb       2836
    Adjective  5828
    Adverb     443
    Total      28705

  • INTERNAL VALIDATION DETAILS
    The validation task is being done by the Punjabi University WordNet Group.

    Sr. No.  File Name        No. of Synsets  Validated Synsets  Words Added  Words Deleted
    1.       Universal File   7168            7168               1287         80
    2.       Adverb File      209             209                43           15
    3.       Adjective File   3605            3605               340          115
    4.       Pan Indian File  1347            450                253          103
    5.       Verb File        1798            0                  -            -
    6.       Noun File        22050           0                  -            -
             Total            36177           11432              1923         313

  • PUNJABI LANGUAGE SPECIFIC SYNSETS
    Total: 1010
    Noun: 961
    Adjective: 16
    Verb: 33

  • NEW SYNSET CREATION FOR HINDI WORDNET

    New common synsets created by Punjabi University which were not present in Hindi WordNet (Total 50):
    [Gurmukhi word list not preserved in this transcript]
    These words were taken from various Punjabi online newspapers such as Daily Ajit, Punjabi Tribune and Charhdikala.

  • PROBLEMS OCCURRING IN SENSE MARKING
    Problems related to English words
    Problems related to compound words
    Problems related to adjectives
    Problems related to proverbs
    Problems related to verbs

  • Problems Related to English Words

    Borrowed or accepted English words
    Comparative alternative present in the WordNet
    Not found in WordNet
    Proper sense not present
    Abbreviations

  • FIVE TYPES OF ENGLISH WORDS IN CORPORA

    1. Accepted English words
       Words: pen, bus, car, computer, cycle, etc.
       Problem in sense marking: No problem.
    2. Comparative alternatives present in the WordNet
       Words: history, road, city, book, popular, portal, school
       Problem in sense marking: How should these be tagged? Note: if we add these words as synonyms, there will be thousands of such words in use.
       Suggestion: English words should be selected according to their frequency of usage in Indian languages.
    3. Not found in WordNet
       Words: networking, Olympian, Nokia, Samsung
       Problem in sense marking: No tag.
       Suggestion: Creation of new synsets.
    4. Proper sense not present in WordNet
       Words: tower (mobile tower), call (phone call), server (web server), depression (psychological sense), interview
       Problem in sense marking: No tag.
       Suggestion: Creation of new synsets with proper sense.
    5. Abbreviations
       Only abbreviation in use: VAT, MRF, HMV, HIV
       Full and short forms both in use: PPSC, COAI, NABARD, PSEB, CBSE, DIG, IG, BA, BBA, BCA, IIT, DU, PU, PTU, BBC, etc.
       Problem in sense marking: No tag (both kinds).
       Suggestion: Creation of new synsets with short and full forms.
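    The suggestion to select English words according to their frequency of usage could be approached by counting candidate words in a corpus and keeping only those above a threshold. The following is a rough sketch under stated assumptions: the token list, the candidate set and the `min_count` threshold are all hypothetical, and the slides do not describe how frequency would actually be measured.

```python
from collections import Counter


def frequent_loanwords(tokens, candidates, min_count):
    """Return the candidate words that occur at least `min_count` times.

    `tokens` is a list of corpus tokens, `candidates` a set of lower-case
    English words under consideration, and `min_count` an assumed cut-off.
    """
    counts = Counter(token.lower() for token in tokens)
    return sorted(word for word in candidates if counts[word] >= min_count)


# Toy corpus in English for illustration only; a real run would use the
# sense-marked Punjabi news corpus.
corpus = "the bus and the car met a Bus near the road".split()
print(frequent_loanwords(corpus, {"bus", "car", "road", "portal"}, min_count=2))
```

    With a cut-off of this kind, only words that are demonstrably in common use would be added as synonyms, which addresses the concern that thousands of English words might otherwise qualify.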

  • PROBLEMS RELATED TO COMPOUND WORDS
    Most common compound words do not exist in the WordNet. If we mark the parts of these compounds separately, the actual sense they convey is lost. For example:

    [Punjabi compound words and their translations not preserved in this transcript]
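    One common way to keep a compound's sense intact is to look for the longest known multi-word entry before falling back to single words. The sketch below illustrates that greedy longest-match idea; it is not the project's sense-marking tool, and the compound lexicon uses English stand-ins because the original Punjabi examples are not available here.

```python
def mark_compounds(tokens, compounds, max_len=3):
    """Greedy left-to-right longest match against a set of compound entries.

    `compounds` is a set of token tuples (a hypothetical lexicon). A matched
    compound is emitted as one hyphen-joined unit so that it can receive a
    single sense tag; unmatched tokens pass through unchanged.
    """
    marked = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-token compounds.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in compounds:
                marked.append("-".join(candidate))
                i += n
                break
        else:
            # No compound starts here, so keep the single token.
            marked.append(tokens[i])
            i += 1
    return marked


lexicon = {("post", "office"), ("ice", "cream")}
print(mark_compounds(["the", "post", "office", "sells", "ice", "cream"], lexicon))
```

    This only helps once the compounds exist as entries in the WordNet, which is the gap the slide points out.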

  • PROBLEMS RELATED TO ADJECTIVES
    Feminine gender: Feminine forms of adjectives are not included in the WordNet, but they occur frequently in texts and reference materials. Some examples:

    [Punjabi examples not preserved in this transcript]

  • PROBLEMS RELATED TO PROVERBS
    Proverbs are not included in the Hindi WordNet. We are marking them word by word. How should we mark them?

    [Punjabi proverbs and their transliterations not preserved in this transcript]
    English translation:
    1. little knowledge is a dangerous thing
    2. two heads are better than one
    3. a miss by an inch is a miss by a mile

  • FIELDS OF WORDS ADDED TO THE SYNSET FILE DURING THE SENSE MARKING TASK
    Sports: [Punjabi examples not preserved in this transcript]
    Business: [Punjabi examples not preserved in this transcript]
    Politics: [Punjabi examples not preserved in this transcript]

  • SUGGESTIONS-I

    There should be a separate button on the IndoWordNet website for the common vocabulary of all the languages (words that have the same sense in all languages).
    There should be a separate button on the IndoWordNet website for the word frequency list of each language.
    There should be a separate button on the IndoWordNet website for the borrowed word list of each language.
    There should be a separate button on the IndoWordNet website for the names of great personalities in all the languages.

  • SUGGESTIONS-II
    We should prepare some parameters for entries of:
    Places
    Institutions
    Famous personalities
    Famous creations: books, films, paintings, music, etc.
    Famous incidents and dates
    Scientific vocabulary
    Words from other special fields, etc.
    These parameters will help us in creating new synsets and Language Specific Synsets (LSS).

  • TEAM COMPOSITION
    P.I. details: Dr. Suman Preet, Associate Professor & Head, Dept. of Linguistics and Pbi. Lexicography, Punjabi University, Patiala.
    Co-P.I. details: Dr. Harjeet Gill, Professor Eminence, Pbi. Uni., and Prof. Emeritus, JNU.

  • DETAILS OF THE MANPOWER ASSOCIATED WITH THE PROJECT
    Staff details

    Miss Balwinder Kaur, M.A. (Pbi.), PhD (in cont.)
    Designation: Senior Linguist
    Work details: Linking synsets, validating synsets, creating and monitoring Language Specific Synsets
    Salary: 22,000/- p.m.

    Mr. Satpal Singh, M.A. (Eng, Linguistics), Diploma in Persian, B.Ed.
    Designation: Lexicographer
    Work details: Linking synsets, validating synsets, sense marking
    Salary: 16,500/- p.m.

  • DETAILS OF THE MANPOWER ASSOCIATED WITH THE PROJECT (CONTD.)

    Mr. Vinay Hasija, B. Tech. (Computer Engg)
    Designation: Lexicographer
    Work details: Validating synsets, website creation, sense marking
    Salary: 16,500/- p.m.
