Indo WordNet A WordNet for Hindi
Centre for Technology Development for Indian Languages
Computer Science and Engineering Department, IIT Bombay
Debasri Chakrabarti, Dipak Kumar Narayan,
Prabhakar Pandey, Madhu Prasad Sharma
Introduction
WordNet – A lexical database
Searching the dictionary conceptually
A different organizing principle for each syntactic category
Synsets (synonym sets) are the basic building blocks
A lexical knowledge base is the heart of any intelligent information processing system
WordNet for Hindi
Hindi WordNet is an online lexical database for the Hindi language
Its design was inspired by the English WordNet
Unique features
  Graded antonymy and meronymy relationships
  Efficient underlying database design
  Cross part-of-speech linkage
Semantic relations in WordNet
Synonymy
Hypernymy / Hyponymy
Antonymy
Meronymy / Holonymy
Gradation
Entailment
Troponymy
Semantic Relations
Synonymy
  True synonyms are rare
  Synonymy is relative to a context
{घर, कमरा} (room)
{घर, आवास} (dwelling)
{घर, जन्मकुंडलीय स्थान} (house in a horoscope)
{घर, स्वदेश} (homeland)
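The context-dependence of synonymy can be sketched as a word that belongs to several synsets, one per sense. This is only an illustrative in-memory model: the sense ids, the `SYNSETS` dictionary and both function names are hypothetical and are not taken from the Hindi WordNet code.

```python
from collections import defaultdict

# Hypothetical sense ids mapped to their member words (one synset per sense).
SYNSETS = {
    1: ["घर", "कमरा"],               # room
    2: ["घर", "आवास"],               # dwelling
    3: ["घर", "जन्मकुंडलीय स्थान"],   # house in a horoscope
    4: ["घर", "स्वदेश"],              # homeland
}


def build_index(synsets):
    """Map every word form to the list of sense ids it appears in."""
    index = defaultdict(list)
    for sense_id, words in synsets.items():
        for word in words:
            index[word].append(sense_id)
    return index


def synonyms_in_context(word, sense_id, synsets):
    """Return the synonyms of `word` that hold for one particular sense."""
    return [w for w in synsets[sense_id] if w != word]


INDEX = build_index(SYNSETS)
```

Looking up `INDEX["घर"]` yields all four senses, while `synonyms_in_context("घर", 2, SYNSETS)` returns only the synonym valid in the "dwelling" sense.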
Semantic Relations
Hypernymy and Hyponymy
  Relation between word meanings (synsets)
  X is a hyponym of Y if X is a kind of Y
  Hyponymy is transitive and asymmetrical
  Hypernymy is the inverse of Hyponymy
lion → animal → living entity → entity
शेर → पशु → सजीव → अस्तित्व
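Transitivity and asymmetry of hyponymy can be illustrated by walking up a hypernym chain. The sketch below assumes, for simplicity, that each entry has at most one direct hypernym; the `HYPERNYM` mapping and the function names are hypothetical, and the data is just the lion example from the slide.

```python
# Direct hypernym of each entry (single-parent assumption, for illustration only).
HYPERNYM = {
    "lion": "animal",
    "animal": "living entity",
    "living entity": "entity",
}


def hypernym_chain(word, hypernym):
    """Follow direct hypernym links upward and return every ancestor in order."""
    chain = []
    seen = {word}
    while word in hypernym:
        word = hypernym[word]
        if word in seen:
            # A cycle would contradict the asymmetry of the relation.
            raise ValueError("cycle in hypernym links")
        seen.add(word)
        chain.append(word)
    return chain


def is_hyponym_of(x, y, hypernym):
    """X is a (possibly indirect) hyponym of Y if Y lies on X's hypernym chain."""
    return y in hypernym_chain(x, hypernym)
```

Here `is_hyponym_of("lion", "entity", HYPERNYM)` holds through two intermediate links, while the reverse query does not.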
Semantic Relations
Antonymy
  Oppositeness in meaning
  Relation between word forms
Meronymy and Holonymy
  Part-whole relation, e.g. a branch is a part of a tree
  X is a meronym of Y if X is a part of Y
  Meronymy is transitive and asymmetrical
  Holonymy is the inverse relation of Meronymy
Troponym and Entailment
Entailment: {खर्राटा लेना – सोना} (to snore – to sleep)
Troponym: {लँगड़ाना, कदमताल करना – चलना} (to limp, to march in step – to walk); {फुसफुसाना – बोलना} (to whisper – to speak)
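Antonymy Relation
Size: छोटा – बड़ा (small – big)
Quality: अच्छा – बुरा (good – bad)
State: रात – दिन (night – day)
Personality: राम – रावण (Rama – Ravana)
Direction: पूर्व – पश्चिम (east – west)
Action: लेना – देना (to take – to give)
Amount: कम – ज्यादा (less – more)
Place: दूर – पास (far – near)
Time: सुबह – शाम (morning – evening)
Gender: बेटा – बेटी (son – daughter)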
Meronymy Relation
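Component-object: माथा – शरीर (forehead – body)
Stuff-object: पत्थर – मूर्ति (stone – statue)
Member-collection: पेड़ – जंगल (tree – forest)
Feature-activity: भाषण – समारोह (speech – ceremony)
Place-area: दिल्ली – भारत (Delhi – India)
Phase-state: जवानी – उम्र (youth – age)
Resource-process: कलम – लेखन (pen – writing)
Position-area: चिकित्सक – चिकित्सा (physician – medicine)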
Gradation
State: बचपन, जवानी, बुढ़ापा (childhood, youth, old age)
Size: बड़ा, मँझला, छोटा (big, medium, small)
Light: उजाला, धुँधला, अँधेरा (light, dim, dark)
Gender: मर्द, नपुंसक, औरत (man, neuter, woman)
Temperature: गरम, गुनगुना, ठंडा (hot, lukewarm, cold)
Color: गोरा, साँवला, काला (fair, dusky, black)
Time: दिन, गोधूलि, रात (day, dusk, night)
Quality: अच्छा, सामान्य, खराब (good, average, bad)
Action: सोना, ऊँघना, जागना (to sleep, to doze, to be awake)
Manner: तेजी से, मध्यम गति से, धीरे-धीरे (fast, at moderate speed, slowly)
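One possible way to model gradation is as an ordered scale whose two ends are the antonym poles and whose middle entries are intermediate grades. The following is a minimal sketch under that assumption; the `GRADATIONS` dictionary and the function name are hypothetical, and only three of the scales from the slide are included.

```python
# Each scale is ordered from one pole to the other; middle items are intermediate grades.
GRADATIONS = {
    "Temperature": ("गरम", "गुनगुना", "ठंडा"),   # hot, lukewarm, cold
    "Time": ("दिन", "गोधूलि", "रात"),            # day, dusk, night
    "Quality": ("अच्छा", "सामान्य", "खराब"),      # good, average, bad
}


def graded_opposite(word, gradations):
    """Return the opposite pole of `word`, or None if it is not a pole of any scale."""
    for scale in gradations.values():
        if word == scale[0]:
            return scale[-1]
        if word == scale[-1]:
            return scale[0]
    return None
```

A word that sits on several scales would match only the first one found, so a fuller model would also take the category as an argument.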
Classification of verbs
Simple verbs (सरल क्रिया): सोना, खाना (to sleep, to eat)
Conjunct verbs (संयुक्त क्रिया)
Compound verbs (सामासिक क्रिया): खाना-पीना (eating and drinking)
Causative verbs (प्रेरणात्मक क्रिया): सुलवाना (to have someone put to sleep)
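WordNet Sub-Graph
[Figure: sub-graph around the synset {घर, गृह} (house, home). Gloss: मनुष्यों का छाया हुआ वह स्थान जो दीवारों से घेर कर बनाया जाता है (a roofed place for people, built by enclosing it with walls). Neighbouring nodes: आवास, निवास (dwelling, residence); संरचना (structure); अध्ययन कक्ष (study room); शयन कक्ष (bedroom); रसोईघर (kitchen); अतिथि गृह (guest house); बरामदा (veranda); आँगन (courtyard); आश्रम (hermitage); झोपड़ी (hut). Edges are labelled Hypernymy, Hyponymy and Meronymy.]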
Design and Implementation
Basic relations or lexical links are between synonym sets
The lexical database is stored in MySQL
Sub-tasks identified
  Database design
  Data entry interface
  Implementation of Organizer Utility
  Application programs to access and display the information in the lexical database
Data Entry Interface
GUI designed in Java/JFC
Separate screen for data entry of different categories
Automatic generation of synset IDs
Screen to view the entered data
Synset Entry Interface
Organizer Utility
Designed to preprocess the data
Reflexive pointers are generated, e.g. if A is a hypernym of B, then "B is a hyponym of A" is automatically generated
Each semantic relation is mapped to a separate table (normalized)
Font conversion: Roman to Hindi (DV-TTYogesh)
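The generation of reflexive pointers can be sketched as follows. The slides state that the actual organizer is written in Perl, so this Python version is only an illustration of the idea; the `INVERSE` table, the tuple layout `(source, relation, target)` and the function name are assumptions.

```python
# Hypothetical table of relations and their inverses.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "meronym": "holonym",
    "holonym": "meronym",
}


def add_reflexive_pointers(links):
    """Return the links plus the inverse of every link that has a known inverse.

    A link (a, "hypernym", b) is read as "a is a hypernym of b".
    """
    result = set(links)
    for source, relation, target in links:
        inverse = INVERSE.get(relation)
        if inverse is not None:
            result.add((target, inverse, source))
    return result
```

Entering only `("animal", "hypernym", "lion")` is then enough to obtain `("lion", "hyponym", "animal")` as well, and running the function a second time adds nothing new.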
Storage Structure
Relation between synsets: tblNounHypernyms
  Columns: Synset_Id, HyperSynset_Id
Relation between word forms: tblNounAntonyms
  Columns: Synset_Id, Synset_Word, Anto_Id, Anto_Word, Anto_Type
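The two tables can be tried out with a small sketch. The original system uses MySQL; here Python's built-in `sqlite3` module stands in for it so the example runs without a server. Table and column names come from the slide, while the column types, the key constraint, the sample ids and the use of an antonymy category such as "State" as the `Anto_Type` value are assumptions.

```python
import sqlite3


def create_schema(conn):
    """Create the two relation tables (column types are assumed, not from the source)."""
    conn.executescript(
        """
        CREATE TABLE tblNounHypernyms (
            Synset_Id      INTEGER NOT NULL,
            HyperSynset_Id INTEGER NOT NULL,
            PRIMARY KEY (Synset_Id, HyperSynset_Id)
        );
        CREATE TABLE tblNounAntonyms (
            Synset_Id   INTEGER NOT NULL,
            Synset_Word TEXT    NOT NULL,
            Anto_Id     INTEGER NOT NULL,
            Anto_Word   TEXT    NOT NULL,
            Anto_Type   TEXT
        );
        """
    )


def hypernyms_of(conn, synset_id):
    """Return the ids of the direct hypernym synsets of one synset."""
    rows = conn.execute(
        "SELECT HyperSynset_Id FROM tblNounHypernyms "
        "WHERE Synset_Id = ? ORDER BY HyperSynset_Id",
        (synset_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
# Made-up ids: synset 101 has synset 7 as its hypernym.
conn.execute("INSERT INTO tblNounHypernyms VALUES (?, ?)", (101, 7))
# Antonymy links word forms, so the words are stored next to the synset ids.
conn.execute(
    "INSERT INTO tblNounAntonyms VALUES (?, ?, ?, ?, ?)",
    (201, "रात", 202, "दिन", "State"),
)
```

Keeping the synset-level relation as a pair of ids and the word-level relation with explicit word columns mirrors the distinction the slide draws between relations on synsets and relations on word forms.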
System Statistics
Over 8500 synsets entered in the database
MySQL used as the back-end database server
Data entry interface designed in Java/JFC
Organizer utility written in Perl
Web based data retrieval system developed in HTML and PHP
DV-TTYogesh Font used to display Hindi Text
Application of WordNet
Word Sense Disambiguation
Interface to Internet Search Engines
Text classification
Information Retrieval system
Document Similarity
Conclusion
The structure of the Hindi language has been studied and new features have been introduced in the Hindi WordNet
Currently over 8500 synsets have been inserted into the database
The MySQL database has been found to be quite efficient
The web interface for querying the lexical database is under continuous evolution