tools and interfaces for wordnet construction, linking and maintenance

Tools and Interfaces for Wordnet construction, linking and maintenance Abhishek G. Nanda 03005031 Under the guidance of: Prof. Pushpak Bhattacharyya


Post on 09-Jan-2016


  • Tools and Interfaces for Wordnet construction, linking and maintenance

    Abhishek G. Nanda, 03005031

    Under the guidance of: Prof. Pushpak Bhattacharyya

  • Wordnet
    • Language - Means of communication using encoded information
    • Words - Units used for communicating information
    • Semantics - Meanings of words and word forms

  • Wordnet
    • Dictionary - List of alphabetically arranged words with meanings
    • Thesaurus - List of alphabetically arranged concepts with word forms

    What is Wordnet?

  • Wordnet
    • Lexical database of words
    • Arranged based on concepts
    • Grouped based on synonymy
    • Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase
    • Polysemy - Property of a word having different meanings in different contexts. Eg. bank as a financial institution and as a river bank

  • Wordnet - Lexical Matrix

    Word Meanings | Word Forms
                  | F1             F2            F3            ...                 Fn
    M1            | (depend) E1,1  (bank) E1,2   (rely) E1,3
    M2            |                (bank) E2,2                 (embankment) E2,
    M3            |                (bank) E3,2   E3,3
    ...           |
    Mm            |                                                                Em,n
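
The lexical matrix can be read as a many-to-many mapping between meanings and word forms. The following is a minimal sketch of that reading, not the structure used by any actual wordnet; the class and method names (`LexicalMatrixSketch`, `addEntry`, `slideExample`) are hypothetical, and only the English entries visible in the matrix are used.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of a lexical matrix: rows are meanings, columns are word forms. */
final class LexicalMatrixSketch {
    // meaning ID -> word forms that express it (one row of the matrix)
    private final Map<String, Set<String>> formsByMeaning = new TreeMap<>();
    // word form -> meaning IDs it can express (one column of the matrix)
    private final Map<String, Set<String>> meaningsByForm = new TreeMap<>();

    /** Marks one cell of the matrix as filled. */
    void addEntry(String meaningId, String wordForm) {
        formsByMeaning.computeIfAbsent(meaningId, k -> new TreeSet<>()).add(wordForm);
        meaningsByForm.computeIfAbsent(wordForm, k -> new TreeSet<>()).add(meaningId);
    }

    /** Synonymy: all word forms in the same row. */
    Set<String> synonyms(String meaningId) {
        return formsByMeaning.getOrDefault(meaningId, Collections.emptySet());
    }

    /** Polysemy: a word form whose column has more than one filled cell. */
    boolean isPolysemous(String wordForm) {
        return meaningsByForm.getOrDefault(wordForm, Collections.emptySet()).size() > 1;
    }

    /** Builds the entries shown in the matrix above. */
    static LexicalMatrixSketch slideExample() {
        LexicalMatrixSketch m = new LexicalMatrixSketch();
        m.addEntry("M1", "depend");
        m.addEntry("M1", "bank");
        m.addEntry("M1", "rely");
        m.addEntry("M2", "bank");
        m.addEntry("M2", "embankment");
        m.addEntry("M3", "bank");
        return m;
    }
}
```

Keeping both directions of the mapping makes the row lookup (synonymy) and the column lookup (polysemy) equally cheap.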

  • Wordnet - Relations
    • Semantic Relations
      • Hypernymy and Hyponymy
      • Meronymy and Holonymy
      • Entailment
      • Troponymy
      • Coordinate terms
    • Lexical Relations
      • Antonymy
      • Gradation

  • Wordnet - Relations
    • Hypernymy and Hyponymy
      • is a kind of
      • leaf is the hypernym of neem leaf
      • neem leaf is the hyponym of leaf
    • Meronymy and Holonymy
      • part-whole
      • root is the meronym of tree
      • tree is the holonym of root

  • Wordnet - Relations
    • Entailment
      • implication
      • snore entails sleep
    • Troponymy
      • manner elaboration
      • roar is the troponym of speak
    • Coordinate terms
      • Common hypernym
      • wolf and dog are coordinate terms
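
Since coordinate terms are defined through a common hypernym, the two relations can be sketched together. This is an illustrative sketch only: it assumes a single direct hypernym per term, the class name `HypernymSketch` is hypothetical, and "canine" is an assumed hypernym that does not appear in the slides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: hypernymy as a map, coordinate terms derived from it. */
final class HypernymSketch {
    // hyponym -> its direct hypernym ("is a kind of")
    private final Map<String, String> hypernymOf = new HashMap<>();

    void addIsA(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    Optional<String> hypernym(String term) {
        return Optional.ofNullable(hypernymOf.get(term));
    }

    /** Two distinct terms are coordinate when they share the same direct hypernym. */
    boolean areCoordinate(String a, String b) {
        String ha = hypernymOf.get(a);
        return !a.equals(b) && ha != null && ha.equals(hypernymOf.get(b));
    }

    static HypernymSketch example() {
        HypernymSketch g = new HypernymSketch();
        g.addIsA("neem leaf", "leaf"); // example from the slides
        g.addIsA("wolf", "canine");    // "canine" is an assumption for illustration
        g.addIsA("dog", "canine");
        return g;
    }
}
```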

  • Wordnet - Relations
    • Antonymy
      • opposites
      • fat is the antonym of thin
    • Gradation
      • Intermediate concepts in antonymy
      • morning -> noon -> evening

  • Wordnet - Wordnets
    • PWN - Princeton WordNet for the English language
    • EuroWordNet - Wordnet for European languages
    • HWN - Hindi Wordnet for the Hindi language

  • Hindi Wordnet
    • Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc.
    • Defines 8 part-whole relationships
    • Defines 3 types of antonymy relations
      • Gradable antonym (-)
      • Complementary antonym (-)
      • Converse antonym (-)

  • Hindi Wordnet
    • Gradation
      • Intermediate terms
      • Pre-Intermediate terms
      • Post-Intermediate terms
      • Eg. - - - -
    • 10 domains of interpretation. Eg. State, Size, Gender, etc.

  • Hindi Wordnet - Verbs
    • Simple Verb - One root. Eg.
    • Compound Verb - Made up of another POS. Eg.
    • Combination Verb - Made of two related verbs. Eg. -
    • Onomatopoeic Verb - Eg. from
    • Conjunct Verb - Hidden sense of action. Eg.

  • Hindi Wordnet - Verbs
    • Causative verbs
      • First causative verb - Eg. (to make somebody sleep)
      • Second causative verb - Eg. (to make somebody sleep through the effort of a third person)

  • Hindi Wordnet - Creation
    • Principles for Wordnet creation
      • Minimality - Minimal set. Eg. {, , }
      • Coverage - Coverage of words. Eg. {, , }
      • Replaceability - Mutual replaceability in a context. Eg. /

  • Sanskrit Wordnet
    • Concept-based Multilingual dictionary
    • Need
      • Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but their counterparts and are not.
      • Number of lexicographers required - O(n^2)

  • Sanskrit Wordnet - Concept based Multilingual dictionary

    Concepts | L1 (English) | L2 (Hindi) | L3 (Sanskrit)
    Concept ID: Concept description | (W1, W2, W3, ..) | (W4, W5, W6, ..) | (W7, W8, W9, ..)
    4066: any of various long-tailed primates (excluding the prosimians) | (monkey) | (, , , , , , , ..) | (, , , , , , , ..)
    2186: a typical star that is the source of light and heat for the planets in the solar system | (sun) | (,, , , , , , , ..) | (, , , , , , , , ..)
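
The table suggests a structure keyed by concept ID, with one word list per language in each row. A minimal sketch under that assumption follows; `ConceptDictionarySketch` and its methods are hypothetical names. The helper `pairwiseDictionaries` shows one possible reading of the O(n^2) figure: linking every pair of n languages directly needs n(n-1)/2 bilingual resources, whereas a concept-based dictionary needs one column per language.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a concept-based multilingual dictionary. */
final class ConceptDictionarySketch {
    // concept ID -> (language -> words); all words in one row share the concept
    private final Map<Integer, Map<String, List<String>>> entries = new TreeMap<>();

    void put(int conceptId, String language, String... words) {
        entries.computeIfAbsent(conceptId, k -> new TreeMap<>())
               .put(language, Arrays.asList(words));
    }

    /** Words for a concept in one language, or an empty list if not yet filled in. */
    List<String> words(int conceptId, String language) {
        return entries.getOrDefault(conceptId, Collections.emptyMap())
                      .getOrDefault(language, Collections.emptyList());
    }

    /** Number of language pairs when every pair of n languages is linked directly. */
    static int pairwiseDictionaries(int n) {
        return n * (n - 1) / 2;
    }
}
```

With this layout, adding a language means filling one more column; existing rows and their cross-language links are untouched.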

  • Sanskrit Wordnet - Challenges
    • Observed during construction of Marathi Wordnet:
      • Single word to synthetic expression. Eg. bankrupt ->
      • Culture specific concepts. Eg. girlfriend. Requires transliteration such as
      • Splitting of concepts. Eg. (tasteless) in Hindi -> (less sweet), (less salty), (less spicy) in Marathi

  • Sanskrit Wordnet - Challenges
    • Observed during Indo Wordnet workshop at Coimbatore, June 2009:
      • Varied usage across regions and people. Eg. In Kashmiri, there are separate words for drinking water and water in the Muslim community but one word in the Hindu community.
      • Single-word and multi-word expressions in the same language. Eg. In Nepali, and - both mean infatuation.

  • Sanskrit Wordnet - Sanskrit
    • Indo-Aryan language
    • Hinduism
    • Buddhism
    • Classical Sanskrit - Panini
    • Vedic Sanskrit - pre-Classical

  • Sanskrit Wordnet - Sanskrit Etymology
    • Etymology of Verbs
      • Ten classes based on how stem is generated
      • Three groups based on position of tense marker
      • 22 prepositional particles that modify a root

  • Synset Marking
    • Grouping of synsets based on frequency of occurrence and usage in language
    • Universal concepts
      • who and what
      • honesty

  • SynsetMarker - Interface

  • SynsetMarker - Features
    • Display of synset fields
    • Browsing
    • Search
      • Word
      • ID
    • Marking - Universal, Common, Common in Hindi and Uncommon
    • Save/Exit
    • Shortcuts

  • SynsetMarker - API
    • records
      • DefineRecord
      • SynsetRecord
    • operations
      • SynsetOperator
      • RecordReader
      • RecordWriter
    • gui
      • Interface

  • SynsetMarker - Process
    • First round divided among 6 people
      • 31000 synsets marked
      • Universal and Common clubbed - 15234 synsets
      • Common in Hindi - 6771 synsets
      • Uncommon - 10987 synsets
    • Second round voting schema
      • Common - 13205 synsets

  • Core Synset Selection
    • Bharatiya Vyavahara Kosh
      • English and 15 Indian languages
      • 2000 concepts with domains (game), (animal), (fruit)
    • Link synsets to words in Kosh
    • Polysemy
      • as pineapple fruit
      • as pineapple plant

  • DomainClassifier - Interface

  • DomainClassifier - Features
    • Display of synset fields
    • Browsing through records
    • Marking right synset for a word and a domain
    • Save/Export

  • DomainClassifier - API
    • records
      • DefineRecord
      • SynsetRecord
    • operations
      • SynsetOperator
      • RecordReader
      • RecordWriter
    • gui
      • Interface

  • DomainClassifier - Process
    • Groupings
      • Single IDs
      • Multiple IDs
      • No IDs
    • Rounds of marking
      • Common synsets
      • Common in Hindi synsets
      • Uncommon synsets

  • DomainClassifier - Process
    • End of process
      • Core - 1969 synsets
      • Common - 11658 synsets

  • Online SynsetMarker - Interface

  • Online SynsetMarker - Interface

  • Online SynsetMarker - API
    • Written in PHP
    • login.php - Interface to login as a user or as an admin or to register as a new user
    • process.php - To process login/register data and accordingly direct a user
    • logout.php - To logout a user
    • mainprocess.php - Processing of data to display unmarked synset
    • main.php - Display of synset with buttons to mark as Common or Uncommon
    • admin.php - Admin page with statistical data of number of marked synsets per user and number of users based on synset marks
    • adminpassword.php - Password interface to login as admin
    • adminuserprofile.php - Profile data of a particular user

  • Online SynsetMarker - Process
    • Threshold for dropping synset as Uncommon
      • Had to be set to 1
    • Common - 10312 synsets

  • Sanskrit Wordnet Interface
    • Interface for creation of Sanskrit Wordnet
    • Based on idea of Concept-based Multilingual dictionary

  • User Interface - Configure

  • User Interface - Main

  • User Interface - Panels
    • Help Panel: Buttons for Commenting, Synchronizing and References tool.
    • Search Panel: Search word or ID or perform advanced search. Font increase/decrease.
    • Synset Panels: Synset data fields and completion status.
    • Tool Panel: English synset, Link tool, Etymology tool.
    • Browse Panel: Browsing through records, saving and exiting.

  • User Interface - Features - Reference tool

  • User Interface - Features - Synchronize tool

  • User Interface - Features - Advanced Search

  • User Interface - Features - English synsets tool

  • User Interface - Features - Link tool

  • User Interface - Features - Etymology tool

  • User Interface - Features - Keyboard Shortcuts
    • Undo feature - Monitor keyboard actions and undo on Ctrl-Z
    • Saving feature - Monitor change in field values and save on Ctrl-S
    • Search - Ctrl-F for quick search access

  • Interface API
    • Problems and Requirements
      • Huge volumes of data (eg. 30,000 synsets)
      • Links between different data
      • Efficient and user-friendly GUI
      • Sufficient querying
      • Grouping
      • Review separation

  • Interface API

  • Graphical User Interface

    JButton saveButton = null;

    // Create the button lazily on first access and reuse it afterwards.
    public JButton getSaveButton() {
        if (saveButton == null) {
            saveButton = new JButton();
        }
        return saveButton;
    }

  • Graphical User Interface

  • Graphical User Interface - Panels

  • Graphical User Interface
    • Panels
      • Hierarchical structure
    • Components (within Panels)
      • Classes JButton, JTextField, JCheckBox, etc.
    • Listeners
      • ActionListener - actions performed by user
      • KeyListener - key strokes (undo, search) and shortcuts

  • Synset
    • Synset ID: a unique number identifying a synset
    • Category: POS category of the words
    • Concept: The part of the gloss that gives a brief summary of what the synset represents
    • Example: One or more examples of the words in the synset being used in sentences
    • Synset: The set of synonymous words comprised in the synset

  • Synset - DSF format

    ID :: 121
    CATEGORY :: NOUN
    CONCEPT ::
    EXAMPLE ::
    SYNSET :: ,,,
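
A record in this format can be read line by line by splitting each line at the first `::`. The sketch below illustrates that idea and is not the project's actual `RecordReader`; the names `DsfParserSketch`, `parse` and `synsetWords` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a parser for "KEY :: value" synset records. */
final class DsfParserSketch {
    /** Parses the lines of one record into an ordered field map; blank lines are skipped. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            // Split only at the first "::" so the value may itself contain "::".
            String[] parts = line.split("::", 2);
            if (parts.length < 2) {
                throw new IllegalArgumentException("Missing '::' in line: " + line);
            }
            fields.put(parts[0].trim(), parts[1].trim());
        }
        return fields;
    }

    /** Splits the comma-separated SYNSET field into individual words. */
    static List<String> synsetWords(Map<String, String> fields) {
        List<String> words = new ArrayList<>();
        for (String w : fields.getOrDefault("SYNSET", "").split(",")) {
            if (!w.trim().isEmpty()) {
                words.add(w.trim());
            }
        }
        return words;
    }
}
```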

  • Data structure - SynsetRecord
    • Class SynsetRecord
    • Strings to hold field values
    • Functions:
      • equals(otherObject)
      • isBetterThan(otherObject)
      • isComplete()
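
The slides name the three functions but do not define them, so the following sketch fills them in under stated assumptions: a record counts as complete when all four text fields are non-empty, and one record is better than another when it has more filled fields. The class name `SynsetRecordSketch` is hypothetical.

```java
import java.util.Objects;

/** Hypothetical sketch of a synset record; the field set follows the DSF format. */
final class SynsetRecordSketch {
    final int id;
    final String category;
    final String concept;
    final String example;
    final String synset;

    SynsetRecordSketch(int id, String category, String concept, String example, String synset) {
        this.id = id;
        this.category = category;
        this.concept = concept;
        this.example = example;
        this.synset = synset;
    }

    private static boolean filled(String s) {
        return s != null && !s.trim().isEmpty();
    }

    /** Number of non-empty text fields (0 to 4). */
    private int filledCount() {
        int n = 0;
        for (String s : new String[] {category, concept, example, synset}) {
            if (filled(s)) {
                n++;
            }
        }
        return n;
    }

    /** Assumption: complete means every text field is filled. */
    boolean isComplete() {
        return filledCount() == 4;
    }

    /** Assumption: the record with more filled fields is the better one. */
    boolean isBetterThan(SynsetRecordSketch other) {
        return filledCount() > other.filledCount();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SynsetRecordSketch)) return false;
        SynsetRecordSketch r = (SynsetRecordSketch) o;
        return id == r.id
                && Objects.equals(category, r.category)
                && Objects.equals(concept, r.concept)
                && Objects.equals(example, r.example)
                && Objects.equals(synset, r.synset);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, category, concept, example, synset);
    }
}
```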

  • Data structure - DefineRecord

  • define-end language
    • Example (description of a book about cricket):

    define book sixer
      length :: 700
      topic :: cricket
      define chapter 1
        length :: 300
        topic :: batting
      end
      define chapter 2
        length :: 400
        topic :: bowling :: scientific
      end
    end
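
Because every `define` is closed by a matching `end`, the nesting in the example can be recovered with a stack of open blocks. The sketch below shows that technique under the assumption that each `define`, parameter and `end` sits on its own line; `DefineEndParserSketch` and `Node` are hypothetical names, not the project's parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a stack-based parser for the define-end language. */
final class DefineEndParserSketch {
    static final class Node {
        final String header; // text after "define", e.g. "chapter 1"
        final Map<String, List<String>> params = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();

        Node(String header) {
            this.header = header;
        }
    }

    /** Parses exactly one top-level define ... end block. */
    static Node parse(List<String> lines) {
        Deque<Node> open = new ArrayDeque<>(); // blocks that are still waiting for "end"
        Node root = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("define ")) {
                Node n = new Node(line.substring("define ".length()).trim());
                if (open.isEmpty()) {
                    if (root != null) {
                        throw new IllegalArgumentException("More than one top-level block");
                    }
                    root = n;
                } else {
                    open.peek().children.add(n);
                }
                open.push(n);
            } else if (line.equals("end")) {
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("Unmatched end");
                }
                open.pop();
            } else {
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("Parameter outside a block: " + line);
                }
                // "name :: v1 :: v2" becomes name -> [v1, v2]
                String[] parts = line.split("::");
                List<String> values = new ArrayList<>();
                for (int i = 1; i < parts.length; i++) {
                    values.add(parts[i].trim());
                }
                open.peek().params.put(parts[0].trim(), values);
            }
        }
        if (root == null || !open.isEmpty()) {
            throw new IllegalArgumentException("Empty input or unclosed block");
        }
        return root;
    }
}
```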

  • Data structure - DefineRecord
    • Example (etymology format):

    define etymformat verb
      :: dropdown :: word :: , ,
      :: dropdown :: word :: , ,
      :: dropdown :: synset :: ,
      :: textfield :: word
      :: dropdown :: word :: , , , , , , , , , , , , , , , , , , , , ,
      :: dropdown :: word ::, , , ,
    end

  • Data structure - DefineRecord

  • Data structure - DefineRecord
    • Example (etymology data for synset ID 1476):

    define etymology 1476 :: finished :: true
      define word :: :: :: - :: :: :: -
      end
    end

  • Data structure - DefineRecord
    • Data structure to hold parametric and nested data
    • Functions:
      • addField(objectToAdd) - Function to add a parameter or a nested instance of DefineRecord
      • toString() - Function to export a record in the define-end language
      • getParameterField(parameterName) - Function to return a specific parameter field
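
The three functions listed above can be sketched as follows. This is an illustration of the described behaviour, not the original class: `DefineRecordSketch` and `ParameterField` are hypothetical names, and the export writes one field per line without indentation, which is an assumption about the output layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a record holding parameters and nested records. */
final class DefineRecordSketch {
    /** A "name :: v1 :: v2" parameter line. */
    static final class ParameterField {
        final String name;
        final List<String> values;

        ParameterField(String name, String... values) {
            this.name = name;
            this.values = Arrays.asList(values);
        }

        @Override
        public String toString() {
            return name + " :: " + String.join(" :: ", values);
        }
    }

    private final String header; // text after "define", e.g. "book sixer"
    private final List<Object> fields = new ArrayList<>(); // ParameterField or DefineRecordSketch

    DefineRecordSketch(String header) {
        this.header = header;
    }

    /** Adds either a parameter or a nested record, preserving insertion order. */
    void addField(Object field) {
        if (!(field instanceof ParameterField) && !(field instanceof DefineRecordSketch)) {
            throw new IllegalArgumentException("Unsupported field type");
        }
        fields.add(field);
    }

    /** Returns the first direct parameter with the given name, or null if absent. */
    ParameterField getParameterField(String name) {
        for (Object f : fields) {
            if (f instanceof ParameterField && ((ParameterField) f).name.equals(name)) {
                return (ParameterField) f;
            }
        }
        return null;
    }

    /** Exports the record, including nested records, in the define-end language. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("define ").append(header).append('\n');
        for (Object f : fields) {
            sb.append(f).append('\n');
        }
        return sb.append("end").toString();
    }
}
```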

  • Data Operations

  • Data Operations - File I/O
    • Unicode text data manipulation - UTF-8 format
    • Classes for file parsing/writing:
      • RecordWriter
      • RecordReader
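
Reading and writing Devanagari text safely depends on naming the UTF-8 charset explicitly instead of relying on the platform default. The sketch below shows this with the standard `java.nio.file.Files` API; it is not the project's `RecordReader` or `RecordWriter`, and `Utf8RoundTripSketch` is a hypothetical name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch: write lines as UTF-8 to a temporary file and read them back. */
final class Utf8RoundTripSketch {
    static List<String> roundTrip(List<String> lines) {
        try {
            Path file = Files.createTempFile("synsets", ".txt");
            try {
                // The charset is passed explicitly on both the write and the read.
                Files.write(file, lines, StandardCharsets.UTF_8);
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```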

  • Data Operations - File I/O
    • RecordReader
      • SynsetRecord parser
      • DefineRecord parser
      • String converters
    • RecordWriter
      • SynsetRecord parser
      • DefineRecord parser

  • Data Operations - RecordModel Interface
    • Model to create mechanism for working with a new data structure
    • Handles parsing, writing, querying and ID retrieval
    • Models written as Classes:
      • SynsetRecordModel
      • EnglishSynsetRecordModel
      • AbstractDefineRecordModel

  • Data Operations - RecordModel Interface
    • int getRecordId(E record): Function to return the record ID of a record
    • boolean isBetterThan(E a, E b): Function to return whether one record weighs better than the other
    • boolean isFinished(E a): Function to return whether a record can be set as completed
    • E mergeRecords(E a, E b): Function to merge the data in two separate records into one
    • boolean searchWord(String word, E a): Function to perform a query (defined in String word) on a record
    • E parseRecord(RecordReader fileHandle): Function to parse a record from a file
    • void writeRecord(RecordWriter fileHandle, E a): Function to write a record into a file
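
The method list above maps naturally onto a generic Java interface. The sketch below reproduces the in-memory methods only; `parseRecord` and `writeRecord` are left out because the `RecordReader` and `RecordWriter` types are not shown in the slides. `SimpleRecord` and `SimpleRecordModel` are hypothetical, and their comparison and merge rules are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the RecordModel methods that need no file handle. */
interface RecordModelSketch<E> {
    int getRecordId(E record);
    boolean isBetterThan(E a, E b);
    boolean isFinished(E a);
    E mergeRecords(E a, E b);
    boolean searchWord(String word, E a);
}

/** Hypothetical record type: an ID plus a list of words. */
final class SimpleRecord {
    final int id;
    final List<String> words;

    SimpleRecord(int id, List<String> words) {
        this.id = id;
        this.words = words;
    }
}

/** Hypothetical model; all rules below are assumptions. */
final class SimpleRecordModel implements RecordModelSketch<SimpleRecord> {
    public int getRecordId(SimpleRecord r) {
        return r.id;
    }

    // Assumption: the record with more words weighs better.
    public boolean isBetterThan(SimpleRecord a, SimpleRecord b) {
        return a.words.size() > b.words.size();
    }

    // Assumption: a record is finished once it has at least one word.
    public boolean isFinished(SimpleRecord a) {
        return !a.words.isEmpty();
    }

    // Union of both word lists, keeping first-seen order and the ID of the first record.
    public SimpleRecord mergeRecords(SimpleRecord a, SimpleRecord b) {
        Set<String> merged = new LinkedHashSet<>(a.words);
        merged.addAll(b.words);
        return new SimpleRecord(a.id, new ArrayList<>(merged));
    }

    public boolean searchWord(String word, SimpleRecord a) {
        return a.words.contains(word);
    }
}
```

Keeping the model separate from the record type lets one operator class work with any data structure that supplies such a model.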

  • Data Operations - RecordOperator Class
    • Operator to provide functionality to work with records of data
    • Load, Browse, Update, Search, Synchronize and Write
    • Two kinds at the GUI level:
      • Parent Operator
      • Linker Operator

  • Data Operations - RecordOperator Class
    • Functions for each data type (depending on the corresponding RecordModel):
      • Constructors for ParentOperator and LinkerOperator
      • getRecord() - Function to obtain the current record
      • setCurrentId() and getCurrentId() - Functions to set and obtain ID to work with
      • getFirstId(), getPreviousId(), getNextId() and getLastId() - Functions to browse through records
      • isFinished() and isAllFinished() - Functions to obtain completion status of records
      • searchRecords() and advancedSearch() - Functions to perform search operations on the records
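
The browsing functions can be sketched with a sorted map keyed by record ID, so that first, previous, next and last reduce to standard `NavigableMap` lookups. This is an assumption about how such an operator could work rather than the original implementation; `RecordBrowserSketch` and `load` are hypothetical names.

```java
import java.util.NavigableMap;
import java.util.NoSuchElementException;
import java.util.TreeMap;

/** Hypothetical sketch of ID-based browsing over loaded records. */
final class RecordBrowserSketch<E> {
    private final NavigableMap<Integer, E> records = new TreeMap<>();
    private Integer currentId; // null until setCurrentId is called

    void load(int id, E record) {
        records.put(id, record);
    }

    void setCurrentId(int id) {
        if (!records.containsKey(id)) {
            throw new NoSuchElementException("No record with ID " + id);
        }
        currentId = id;
    }

    Integer getCurrentId() {
        return currentId;
    }

    /** Current record, or null when no ID has been selected. */
    E getRecord() {
        return currentId == null ? null : records.get(currentId);
    }

    Integer getFirstId() {
        return records.isEmpty() ? null : records.firstKey();
    }

    Integer getLastId() {
        return records.isEmpty() ? null : records.lastKey();
    }

    /** Next larger ID after the current one, or null at the end. */
    Integer getNextId() {
        return currentId == null ? null : records.higherKey(currentId);
    }

    /** Next smaller ID before the current one, or null at the start. */
    Integer getPreviousId() {
        return currentId == null ? null : records.lowerKey(currentId);
    }
}
```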

  • API Overview
    • GUI defines one ParentOperator (eg. source synsets)
    • GUI defines many LinkerOperators (eg. target synsets, link data, etc.)
    • Models attached to the operators
    • Data repositories are defined
    • GUI browses, retrieves and manipulates data using operators.

  • Version history

  • Future work
    • Tool to generate etymology format
    • GUI functionality to display synsets from multiple languages
    • Advanced commenting based on reviews and completion


  • Thank you