Tools and Interfaces for Wordnet construction, linking and maintenance Abhishek G. Nanda 03005031 Under the guidance of: Prof. Pushpak Bhattacharyya

Post on 26-Dec-2015

TRANSCRIPT

  • Slide 1
  • Tools and Interfaces for Wordnet construction, linking and maintenance Abhishek G. Nanda 03005031 Under the guidance of: Prof. Pushpak Bhattacharyya
  • Slide 2
  • Wordnet Language - Means of communication using encoded information Words - Units used for communicating information Semantics - Meanings of words and word forms
  • Slide 3
  • Wordnet Dictionary - List of alphabetically arranged words with meanings Thesaurus - List of alphabetically arranged concepts with word forms What is Wordnet?
  • Slide 4
  • Wordnet Lexical database of words Arranged based on concepts Grouped based on synonymy Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase Polysemy - Property of a word having different meanings in different contexts. Eg. bank as financial institution and as river bank
  • Slide 5
  • Wordnet - Lexical Matrix. Rows are word meanings (M1, M2, M3, ..., Mm); columns are word forms (F1, F2, F3, ..., Fn). M1: (depend) E1,1, (bank) E1,2, (rely) E1,3. M2: (bank) E2,2, (embankment) E2,. M3: (bank) E3,2, E3,3. Mm: Em,n.
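A minimal Java sketch of the lexical matrix may help make the two readings concrete: reading along a row gives synonymy, reading down a column gives polysemy. The class and method names (`LexicalMatrix`, `addEntry`, `synonyms`, `meaningsOf`) are hypothetical and not taken from the slides; only the filled cells for M1 to M3 come from the matrix on the slide.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy lexical matrix: each meaning (row) maps to the word forms (columns) that express it. */
class LexicalMatrix {
    private final Map<String, Set<String>> formsByMeaning = new LinkedHashMap<>();

    /** Marks the cell (meaning, form) as filled: the form can express the meaning. */
    void addEntry(String meaning, String form) {
        formsByMeaning.computeIfAbsent(meaning, k -> new LinkedHashSet<>()).add(form);
    }

    /** Synonymy: all word forms that share one row. */
    Set<String> synonyms(String meaning) {
        return formsByMeaning.getOrDefault(meaning, Set.of());
    }

    /** Polysemy: all rows (meanings) in which a given word form appears. */
    List<String> meaningsOf(String form) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> row : formsByMeaning.entrySet()) {
            if (row.getValue().contains(form)) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    /** Fills in the cells shown on the slide for meanings M1 to M3. */
    static LexicalMatrix fromSlide() {
        LexicalMatrix m = new LexicalMatrix();
        m.addEntry("M1", "depend");
        m.addEntry("M1", "bank");
        m.addEntry("M1", "rely");
        m.addEntry("M2", "bank");
        m.addEntry("M2", "embankment");
        m.addEntry("M3", "bank");
        return m;
    }
}
```

With this data, `synonyms("M1")` holds depend, bank and rely, while `meaningsOf("bank")` lists all three meanings, which is the polysemy of bank.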
  • Slide 6
  • Wordnet - Relations Semantic Relations Hypernymy and Hyponymy Meronymy and Holonymy Entailment Troponymy Coordinate terms Lexical Relations Antonymy Gradation
  • Slide 7
  • Wordnet - Relations Hypernymy and Hyponymy is a kind of leaf is the hypernym of neem leaf neem leaf is the hyponym of leaf Meronymy and Holonymy part-whole root is the meronym of tree tree is the holonym of root
  • Slide 8
  • Wordnet - Relations Entailment implication snore entails sleep Troponymy manner elaboration roar is the troponym of speak Coordinate terms Common hypernym wolf and dog are coordinate terms
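The hypernymy and coordinate-term relations can be sketched as a small graph. This is only an illustration under simplifying assumptions: each word has a single hypernym, the class `HypernymGraph` and its methods are hypothetical, and the hypernym "canine" for wolf and dog is supplied here for the example and does not appear on the slides.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy hypernym graph in which every word has at most one hypernym. */
class HypernymGraph {
    private final Map<String, String> hypernymOf = new HashMap<>();

    /** Records that `hyponym` is a kind of `hypernym`. */
    void addHyponym(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    /** True when `hypernym` is the direct hypernym of `hyponym`. */
    boolean isHypernymOf(String hypernym, String hyponym) {
        return hypernym.equals(hypernymOf.get(hyponym));
    }

    /** Coordinate terms: two distinct words that share the same direct hypernym. */
    boolean areCoordinateTerms(String a, String b) {
        String ha = hypernymOf.get(a);
        return !a.equals(b) && ha != null && ha.equals(hypernymOf.get(b));
    }
}
```

A real wordnet allows several hypernyms per synset and works on synsets rather than bare words, so this sketch only shows the shape of the relation.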
  • Slide 9
  • Wordnet - Relations Antonymy opposites fat is the antonym of thin Gradation Intermediate concepts in antonymy morning -> noon -> evening
  • Slide 10
  • Wordnet - Wordnets PWN - Princeton WordNet for English language EuroWordNet - Wordnet for European languages HWN - Hindi Wordnet for Hindi language
  • Slide 11
  • Hindi Wordnet Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc. Defines 8 part-whole relationships Defines 3 types of antonymy relations Gradable antonym ( - ) Complementary antonym ( - ) Converse antonym ( - )
  • Slide 12
  • Hindi Wordnet Gradation Intermediate terms Pre-Intermediate terms Post-Intermediate terms Eg. - - - - 10 domains of interpretation. Eg. State, Size, Gender, etc.
  • Slide 13
  • Hindi Wordnet - Verbs Simple Verb - One root. Eg. Compound Verb - Made up of another POS. Eg. Combination Verb - Made of related two verbs. Eg. - Onomatopoeic Verb - Eg. from Conjunct Verb - Hidden sense of action. Eg.
  • Slide 14
  • Hindi Wordnet - Verbs Causative verbs First causative verb - Eg. (to make somebody sleep) Second causative verb - Eg. (to make somebody sleep through the effort of a third person)
  • Slide 15
  • Hindi Wordnet - Creation Principles for Wordnet creation Minimality - Minimal set. Eg. { , , } Coverage - Coverage of words. Eg. { , , } Replaceability - Mutual replaceability in a context. Eg. /
  • Slide 16
  • Sanskrit Wordnet Concept-based Multilingual dictionary Need Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but counterparts and are not. Number of lexicographers required - O(n^2)
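One way to read the quadratic figure, assuming it counts one lexicographer per pair of languages when bilingual dictionaries are built directly, is the number of unordered language pairs among n languages. The second line, that a shared concept inventory needs only one linking effort per language, is an interpretation added here and is not stated on the slide.

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2} = O(n^{2}),
\qquad \text{e.g. } n = 16:\ \binom{16}{2} = \frac{16 \cdot 15}{2} = 120 .
\]
\[
\text{With a concept-based pivot (assumed): one linking effort per language, } n = O(n).
\]
```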
  • Slide 17
  • Sanskrit Wordnet - Concept based Multilingual dictionary. Columns: Concepts | L1 (English) | L2 (Hindi) | L3 (Sanskrit). Row layout: Concept ID: Concept description | (W1, W2, W3,..) | (W4, W5, W6,..) | (W7, W8, W9,..). 4066: any of various long-tailed primates (excluding the prosimians) | (monkey) | ( , , , , , , ,..) | ( , , , , , , ,..). 2186: a typical star that is the source of light and heat for the planets in the solar system | (sun) | ( , , , , , , , ,..) | ( , , , , , , , ,..)
  • Slide 18
  • Sanskrit Wordnet - Challenges Observed during construction of Marathi Wordnet: Single word to synthetic expression. Eg. bankrupt -> Culture specific concepts. Eg. girlfriend. Requires transliteration such as Splitting of concepts. Eg. (tasteless) in Hindi -> (less sweet), (less salty), (less spicy) in Marathi
  • Slide 19
  • Sanskrit Wordnet - Challenges Observed during Indo Wordnet workshop at Coimbatore, June 2009: Varied usage across regions and people. Eg. In Kashmiri, separate words for drinking water and water in the Muslim community but one word in the Hindu community. Single-word and multi-word expressions in same language. Eg. In Nepali, and - both mean infatuation.
  • Slide 20
  • Sanskrit Wordnet - Sanskrit Indo-Aryan language Hinduism Buddhism Classical Sanskrit - Panini Vedic Sanskrit - pre-Classical
  • Slide 21
  • Sanskrit Wordnet - Sanskrit Etymology Etymology of Verbs - Ten classes based on how stem is generated - Three groups based on position of tense marker - 22 prepositional particles that modify a root
  • Slide 22
  • Synset Marking Grouping of synsets based on frequency of occurrence and usage in language Universal concepts who and what honesty
  • Slide 23
  • SynsetMarker - Interface
  • Slide 24
  • SynsetMarker - Features Display of synset fields Browsing Search Word ID Marking - Universal, Common, Common in Hindi and Uncommon Save/Exit Shortcuts
  • Slide 25
  • SynsetMarker - API records DefineRecord SynsetRecord operations SynsetOperator RecordReader RecordWriter gui Interface
  • Slide 26
  • SynsetMarker - Process First round divided among 6 people 31000 synsets marked Universal and Common clubbed - 15234 synsets Common in Hindi - 6771 synsets Uncommon - 10987 synsets Second round voting schema Common - 13205 synsets
  • Slide 27
  • Core Synset Selection Bharatiya Vyavahara Kosh English and 15 Indian languages 2000 concepts with domains (game), (animal), (fruit) Link synsets to words in Kosh Polysemy as pineapple fruit as pineapple plant
  • Slide 28
  • DomainClassifier - Interface
  • Slide 29
  • DomainClassifier - Features Display of synset fields Browsing through records Marking right synset for a word and a domain Save/Export
  • Slide 30
  • DomainClassifier - API records DefineRecord SynsetRecord operations SynsetOperator RecordReader RecordWriter gui Interface
  • Slide 31
  • DomainClassifier - Process Groupings Single IDs Multiple IDs No IDs Rounds of marking Common synsets Common in Hindi synsets Uncommon synsets
  • Slide 32
  • DomainClassifier - Process End of process Core - 1969 synsets Common - 11658 synsets
  • Slide 33
  • Online SynsetMarker - Interface
  • Slide 34
  • Slide 35
  • Online SynsetMarker - API Written in PHP login.php - Interface to login as a user or as an admin or to register as a new user process.php - To process login/register data and accordingly direct a user logout.php - To logout a user mainprocess.php - Processing of data to display unmarked synset main.php - Display of synset with buttons to mark as Common or Uncommon admin.php - Admin page with statistical data of number of marked synsets per user and number of users based on synset marks adminpassword.php - Password interface to login as admin adminuserprofile.php - Profile data of a particular user
  • Slide 36
  • Online SynsetMarker - Process Threshold for dropping synset as Uncommon Had to be set to 1 Common - 10312 synsets
  • Slide 37
  • Sanskrit Wordnet Interface Interface for creation of Sanskrit Wordnet Based on idea of Concept-based Multilingual dictionary
  • Slide 38
  • User Interface - Configure
  • Slide 39
  • User Interface - Main
  • Slide 40
  • User Interface - Panels Help Panel: Buttons for Commenting, Synchronizing and References tool. Search Panel: Search word or ID or perform advanced search. Font increase/decrease. Synset Panels: Synset data fields and completion status. Tool Panel: English synset, Link tool, Etymology tool. Browse Panel: Browsing through records, saving and exiting.
  • Slide 41
  • User Interface - Features - Reference tool
  • Slide 42
  • User Interface - Features - Synchronize tool
  • Slide 43
  • User Interface - Features - Advanced Search
  • Slide 44
  • User Interface - Features - English synsets tool
  • Slide 45
  • User Interface - Features - Link tool
  • Slide 46
  • User Interface - Features - Etymology tool
  • Slide 47
  • User Interface - Features - Keyboard Shortcuts Undo feature - Monitor keyboard actions and undo on Ctrl-Z Saving feature - Monitor change in field values and save on Ctrl-S Search - Ctrl-F for quick search access
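The undo and save behaviour described above can be sketched independently of Swing as a per-field history. This is a guess at the mechanism, not the tool's actual code: the class `FieldHistory` and its methods are hypothetical, and in the real interface a KeyListener would presumably call `undo()` on Ctrl-Z and `save()` on Ctrl-S.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Tracks edits to one text field so they can be undone and so unsaved changes are detectable. */
class FieldHistory {
    private final Deque<String> past = new ArrayDeque<>();
    private String value;
    private String savedValue;

    FieldHistory(String initial) {
        this.value = initial;
        this.savedValue = initial;
    }

    /** Records the previous value before applying an edit. */
    void edit(String newValue) {
        past.push(value);
        value = newValue;
    }

    /** Restores the most recent previous value; returns false when there is nothing to undo. */
    boolean undo() {
        if (past.isEmpty()) {
            return false;
        }
        value = past.pop();
        return true;
    }

    /** True when the current value differs from the last saved value. */
    boolean isDirty() {
        return !value.equals(savedValue);
    }

    /** Marks the current value as saved. */
    void save() {
        savedValue = value;
    }

    String getValue() {
        return value;
    }
}
```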
  • Slide 48
  • Interface API Problems and Requirements Huge volumes of data (eg. 30,000 synsets) Links between different data Efficient and user-friendly GUI Sufficient querying Grouping Review separation
  • Slide 49
  • Interface API
  • Slide 50
  • Graphical User Interface JButton saveButton = null; public JButton getSaveButton() { if (saveButton == null) { saveButton = new JButton(); } return saveButton; }
  • Slide 51
  • Graphical User Interface
  • Slide 52
  • Graphical User Interface - Panels
  • Slide 53
  • Graphical User Interface Panels Hierarchical structure Components (within Panels) Classes JButton, JTextField, JCheckBox, etc. Listeners ActionListener - actions performed by user KeyListener - key strokes (undo, search) and shortcuts
  • Slide 54
  • Synset Synset ID: a unique number identifying a synset Category: POS category of the words Concept: The part of the gloss that gives a brief summary of what the synset represents Example: One or more examples of the words in the synset being used in sentences Synset: The set of synonymous words comprised in the synset
  • Slide 55
  • Synset - DSF format ID :: 121 CATEGORY :: NOUN CONCEPT :: EXAMPLE :: SYNSET :: , , ,
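A short sketch of how a record in this format might be read, assuming each `FIELD :: value` pair sits on its own line (the transcript has flattened the line breaks). `DsfParser` is a hypothetical name, not one of the classes listed in the slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Reads "FIELD :: value" lines into an ordered map of field name to value. */
class DsfParser {
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int sep = line.indexOf("::");
            if (sep < 0) {
                continue; // not a field line, skip it
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }
}
```

An empty field such as `EXAMPLE ::` is kept with an empty string as its value, so a later completeness check can still see that the field exists.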
  • Slide 56
  • Data structure - SynsetRecord Class SynsetRecord Strings to hold field values Functions: equals(otherObject) isBetterThan(otherObject) isComplete()
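The slide names the three functions but not their semantics, so the following is only one plausible reading: a record is complete when all five fields are non-blank, and one record is better than another when it has more non-blank fields. Both rules are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of a synset record holding the five DSF fields as strings. */
class SynsetRecord {
    final String id, category, concept, example, synset;

    SynsetRecord(String id, String category, String concept, String example, String synset) {
        this.id = id;
        this.category = category;
        this.concept = concept;
        this.example = example;
        this.synset = synset;
    }

    private List<String> fields() {
        return Arrays.asList(id, category, concept, example, synset);
    }

    private long filledCount() {
        return fields().stream().filter(f -> f != null && !f.trim().isEmpty()).count();
    }

    /** Assumed rule: complete when every field is non-blank. */
    boolean isComplete() {
        return filledCount() == fields().size();
    }

    /** Assumed rule: better when more fields are filled in. */
    boolean isBetterThan(SynsetRecord other) {
        return filledCount() > other.filledCount();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SynsetRecord)) return false;
        return fields().equals(((SynsetRecord) o).fields());
    }

    @Override
    public int hashCode() {
        return fields().hashCode();
    }
}
```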
  • Slide 57
  • Data structure - DefineRecord
  • Slide 58
  • define-end language Example (description of a book about cricket): define book sixer length :: 700 topic :: cricket define chapter 1 length :: 300 topic :: batting end define chapter 2 length :: 400 topic :: bowling :: scientific end
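A rough parser for the define-end language can be built with a stack of open blocks. The sketch assumes one statement per line and an `end` for every `define`; the flattened slide text shows neither the line breaks nor a closing `end` for the outer block, so both are assumptions. `DefineNode` and `DefineEndParser` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One define ... end block: a header, its parameters, and its nested blocks. */
class DefineNode {
    final String header;
    final Map<String, List<String>> params = new LinkedHashMap<>();
    final List<DefineNode> children = new ArrayList<>();

    DefineNode(String header) {
        this.header = header;
    }
}

class DefineEndParser {
    /** Parses a single top-level block; throws IllegalArgumentException on malformed input. */
    static DefineNode parse(String text) {
        Deque<DefineNode> open = new ArrayDeque<>(); // blocks that have not seen their "end" yet
        DefineNode root = null;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith("define ")) {
                DefineNode node = new DefineNode(line.substring("define ".length()).trim());
                if (open.isEmpty()) {
                    if (root != null) throw new IllegalArgumentException("more than one top-level block");
                    root = node;
                } else {
                    open.peek().children.add(node);
                }
                open.push(node);
            } else if (line.equals("end")) {
                if (open.isEmpty()) throw new IllegalArgumentException("'end' without 'define'");
                open.pop();
            } else {
                if (open.isEmpty()) throw new IllegalArgumentException("parameter outside a block: " + line);
                // "name :: v1 :: v2" becomes name -> [v1, v2]
                String[] parts = line.split("\\s*::\\s*");
                open.peek().params.put(parts[0], Arrays.asList(parts).subList(1, parts.length));
            }
        }
        if (!open.isEmpty()) throw new IllegalArgumentException("missing 'end'");
        return root;
    }
}
```

Parsing the book example this way yields a root with header `book sixer`, two parameters, and two child blocks; the second chapter's `topic` parameter carries the two values bowling and scientific.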
  • Slide 59
  • Data structure - DefineRecord Example (etymology format): define etymformat verb :: dropdown :: word :: , , :: dropdown :: word :: , , :: dropdown :: synset :: , :: textfield :: word :: dropdown :: word :: , , , , , , , , , , , , , , , , , , , , , :: dropdown :: word :: , , , , end
  • Slide 60
  • Data structure - DefineRecord
  • Slide 61
  • Example (etymology data for synset ID 1476): define etymology 1476 :: finished :: true define word :: :: :: - :: :: :: - end
  • Slide 62
  • Data structure - DefineRecord Data structure to hold parametric and nested data Functions: addField(objectToAdd) - Function to add a parameter or a nested instance of DefineRecord toString() - Function to export a record in the define-end language getParameterField(parameterName) - Function to return a specific parameter field
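The three functions listed above could look roughly like the following. This is a sketch, not the author's class: the helper type `ParameterField`, the constructor taking a header string, and the two-space indentation in the exported text are all assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** A named parameter with one or more values, written as "name :: v1 :: v2". */
class ParameterField {
    final String name;
    final List<String> values;

    ParameterField(String name, String... values) {
        this.name = name;
        this.values = Arrays.asList(values);
    }

    @Override
    public String toString() {
        return name + " :: " + String.join(" :: ", values);
    }
}

/** Holds parameters and nested DefineRecord instances in insertion order. */
class DefineRecord {
    private final String header; // text after "define", e.g. "book sixer"
    private final List<Object> fields = new ArrayList<>();

    DefineRecord(String header) {
        this.header = header;
    }

    /** Adds a parameter or a nested record; anything else is rejected. */
    void addField(Object field) {
        if (!(field instanceof ParameterField) && !(field instanceof DefineRecord)) {
            throw new IllegalArgumentException("expected ParameterField or DefineRecord");
        }
        fields.add(field);
    }

    /** Returns the first parameter with the given name at this level, or null if absent. */
    ParameterField getParameterField(String parameterName) {
        for (Object f : fields) {
            if (f instanceof ParameterField && ((ParameterField) f).name.equals(parameterName)) {
                return (ParameterField) f;
            }
        }
        return null;
    }

    /** Exports the record in the define-end language. */
    @Override
    public String toString() {
        return render("");
    }

    private String render(String indent) {
        StringBuilder sb = new StringBuilder();
        sb.append(indent).append("define ").append(header).append('\n');
        for (Object f : fields) {
            if (f instanceof DefineRecord) {
                sb.append(((DefineRecord) f).render(indent + "  "));
            } else {
                sb.append(indent).append("  ").append(f).append('\n');
            }
        }
        sb.append(indent).append("end\n");
        return sb.toString();
    }
}
```

Note that `getParameterField` in this sketch looks only at the record's own level and does not descend into nested records.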
  • Slide 63
  • Data Operations
  • Slide 64
  • Data Operations - File I/O Unicode text data manipulation - UTF-8 format Classes for file parsing/writing: RecordWriter RecordReader
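For the UTF-8 requirement, the key point in Java is to name the charset explicitly instead of relying on the platform default. The sketch below uses only standard `java.nio.file.Files` calls; the class `Utf8RoundTrip` is hypothetical and stands in for whatever RecordReader and RecordWriter do internally, which the slides do not show.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Writes lines to a temporary file as UTF-8 and reads them back with the same charset. */
class Utf8RoundTrip {
    static List<String> roundTrip(List<String> lines) {
        try {
            Path file = Files.createTempFile("synsets", ".txt");
            try {
                Files.write(file, lines, StandardCharsets.UTF_8);
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Devanagari text survives the round trip unchanged when both sides use UTF-8.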
  • Slide 65
  • Data Operations - File I/O RecordReader SynsetRecord parser DefineRecord parser String converters RecordWriter SynsetRecord parser DefineRecord parser
  • Slide 66
  • Data Operations - RecordModel Interface Model to create mechanism for working with a new data structure Handles parsing, writing, querying and ID retrieval Models written as Classes: SynsetRecordModel EnglishSynsetRecordModel AbstractDefineRecordModel
  • Slide 67
  • Data Operations - RecordModel Interface int getRecordId(E record): Function to return the record ID of a record boolean isBetterThan(E a, E b): Function to return whether one record weighs better than the other boolean isFinished(E a): Function to return whether a record can be set as completed E mergeRecords(E a, E b): Function to merge the data in two separate records into one boolean searchWord(String word, E a): Function to perform a query (defined in String word) on a record E parseRecord(RecordReader fileHandle): Function to parse a record from a file void writeRecord(RecordReader fileHandle, E a): Function to write a record into a file
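As a generic Java interface, the model might be declared as follows. The sketch omits `parseRecord` and `writeRecord` because RecordReader and RecordWriter are not shown in the slides, and the toy `WordRecord` type with its `WordRecordModel` is hypothetical, included only to show how a new data structure would plug in.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Subset of the RecordModel interface; the file parsing and writing methods are left out. */
interface RecordModel<E> {
    int getRecordId(E record);
    boolean isBetterThan(E a, E b);
    boolean isFinished(E a);
    E mergeRecords(E a, E b);
    boolean searchWord(String word, E a);
}

/** Hypothetical toy record: an ID plus a set of words. */
class WordRecord {
    final int id;
    final Set<String> words;

    WordRecord(int id, Set<String> words) {
        this.id = id;
        this.words = words;
    }
}

/** Example model; the comparison and completion rules here are illustrative choices. */
class WordRecordModel implements RecordModel<WordRecord> {
    public int getRecordId(WordRecord r) {
        return r.id;
    }

    public boolean isBetterThan(WordRecord a, WordRecord b) {
        return a.words.size() > b.words.size();
    }

    public boolean isFinished(WordRecord a) {
        return !a.words.isEmpty();
    }

    /** Merges two records with the same ID by taking the union of their words. */
    public WordRecord mergeRecords(WordRecord a, WordRecord b) {
        if (a.id != b.id) {
            throw new IllegalArgumentException("records have different IDs");
        }
        Set<String> merged = new LinkedHashSet<>(a.words);
        merged.addAll(b.words);
        return new WordRecord(a.id, merged);
    }

    public boolean searchWord(String word, WordRecord a) {
        return a.words.contains(word);
    }
}
```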
  • Slide 68
  • Data Operations - RecordOperator Class Operator to provide functionality to work with records of data Load, Browse, Update, Search, Synchronize and Write Two kinds at the GUI level: Parent Operator Linker Operator
  • Slide 69
  • Data Operations - RecordOperator Class Functions for each data type (depending on the corresponding RecordModel): Constructors for ParentOperator and LinkerOperator getRecord() - Function to obtain the current record setCurrentId() and getCurrentId() - Functions to set and obtain ID to work with getFirstId(), getPreviousId(), getNextId() and getLastId() - Functions to browse through records isFinished() and isAllFinished() - Functions to obtain completion status of records searchRecords() and advancedSearch() - Functions to perform search operations on the records
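The browsing functions map naturally onto a sorted map keyed by record ID. The sketch below is an assumption about how such an operator could work, not the tool's implementation: `BrowsingOperator` and its `put` method are hypothetical, and staying on the current record at either end of the range is a design choice made here.

```java
import java.util.NoSuchElementException;
import java.util.TreeMap;

/** Keeps records sorted by ID and tracks which one is current. */
class BrowsingOperator<E> {
    private final TreeMap<Integer, E> records = new TreeMap<>();
    private int currentId;

    /** Adds a record; the first record added becomes the current one. */
    void put(int id, E record) {
        if (records.isEmpty()) {
            currentId = id;
        }
        records.put(id, record);
    }

    int getFirstId() {
        return records.firstKey();
    }

    int getLastId() {
        return records.lastKey();
    }

    /** Next higher ID, or the current ID when already at the last record. */
    int getNextId() {
        Integer next = records.higherKey(currentId);
        return next != null ? next : currentId;
    }

    /** Next lower ID, or the current ID when already at the first record. */
    int getPreviousId() {
        Integer prev = records.lowerKey(currentId);
        return prev != null ? prev : currentId;
    }

    void setCurrentId(int id) {
        if (!records.containsKey(id)) {
            throw new NoSuchElementException("no record with ID " + id);
        }
        currentId = id;
    }

    int getCurrentId() {
        return currentId;
    }

    E getRecord() {
        return records.get(currentId);
    }
}
```

A sorted map also copes with gaps in the ID sequence, which matters if some synset IDs are absent from a given file.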
  • Slide 70
  • API Overview GUI defines one ParentOperator (eg. source synsets) GUI defines many LinkerOperators (eg. target synsets, link data, etc.) Models attached to the operators Data repositories are defined GUI browses, retrieves and manipulates data using operators.
  • Slide 71
  • Version history
  • Slide 72
  • Future work Tool to generate etymology format GUI functionality to display synsets from multiple languages Advanced commenting based on reviews and completion
  • Slide 73
  • References Miller G.A., Beckwith R., Fellbaum C., Gross D., Miller K.J., "Introduction to WordNet: An On-line Lexical Database", International Journal of Lexicography, Vol. 3, No. 4, 1990, pp. 235-244. Ramanand J., Ukey A., Singh B.K., Bhattacharyya P., "Mapping and Structural Analysis of Multilingual Wordnets", IEEE Data Engineering Bulletin, Vol. 30, No. 1, 2007, pp. 30-43. Hindi Wordnet Documentation, http://www.cfilt.iitb.ac.in/wordnet/webhwn/other/hwn_docs_2.doc Chakrabarti D., Narayan D.K., Pandey P., Bhattacharyya P., "Experiences in building the Indo WordNet - A WordNet for Hindi", in First International Wordnet Conference, CIIL, Mysore, India, 2002. Mohanty R.K., Bhattacharyya P., Kalele S., Pandey P., Sharma A., Kopra M., "Synset Based Multilingual Dictionary: Insights, Applications and Challenges", in Proceedings of the Fourth Global WordNet Conference, University of Szeged, Department of Informatics, 2008. Sinha M., Reddy M., Bhattacharyya P., "An Approach towards Construction and Application of Multilingual Indo-WordNet", in Proceedings of the Third Global Wordnet Conference, Jeju Island, Korea, 2006. Staal J.F., "Sanskrit and Sanskritization", The Journal of Asian Studies, Vol. 22, No. 3, 1963, pp. 261-275.
  • Slide 74
  • References MacDonell A.A., A History Of Sanskrit Literature, Kessinger Publishing, ISBN 1417906197, 2004. Burrow T., Sanskrit language, Motilal Banarsidass, ISBN 8120817672, 2001. Goldman R.P. and Sutherland S.J., Devavanipravesika: An Introduction to the Sanskrit Language, ISBN 0-944613-40-3, 1999. Macdonell A.A., A Sanskrit Grammar for Students, ISBN 81-246-0094-5, 1997. Monier-Williams M., A Sanskrit English Dictionary, Motilal Banarsidass, (reprint) New Delhi, ISBN 81-208-3105-5, 2005. Katre S.M., Ashtadhyayi of Panini, Motilal Banarsidass, New Delhi, 1989. Indian Languages, http://www.english.emory.edu/Bahri/IndLangs.html Wierzbicka A., "Universal human concepts as a tool for exploring bilingual lives", International Journal of Bilingualism, Vol. 9, No. 1, 2005, pp. 7-26. Beckwith R., Miller G.A., Tengi R., "Design and Implementation of the WordNet Lexical Database and Searching Software", Description of WordNet, 1993. JSch - Java Secure Channel, http://www.jcraft.com/jsch
  • Slide 75
  • Thank you