Page 1: Wordnet Development Using a Multifunctional Tool Ivan Obradović, Ranka Stanković ivano@rgf.bg.ac.yu, ranka@rgf.bg.ac.yu University of Belgrade Faculty

Wordnet Development Using a Multifunctional Tool

Ivan Obradović, Ranka Stanković
ivano@rgf.bg.ac.yu, ranka@rgf.bg.ac.yu

University of Belgrade
Faculty of Mining and Geology

Đušina 7, 11000 Belgrade, Serbia

CALP 07 Workshop, Borovets, September 30, 2007

PWN – the Princeton WordNet

Conceived in 1985 by George Miller and his associates from the Cognitive Science Laboratory

A linguistic database that maps the way the mind stores and uses language

Formalized as a semantic network of concepts: abstract ideas that denote objects in a given category or class

Concepts represented by synsets: sets of synonymous word-sense pairs accompanied by a definition of the concept

Concepts are interconnected by various semantic relations, such as hypernym/hyponym (kind of, e.g. animal/dog) or holonym/meronym (part of, e.g. hand/finger)

Contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
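
The structure described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the actual PWN data model: the class name `Synset`, its field names, the identifiers and the glosses are all assumptions made for the example, and the hypernym links are assumed to contain no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One concept: synonymous word-sense pairs plus a definition."""
    synset_id: str                                   # hypothetical identifier
    literals: list                                   # (word, sense number) pairs
    gloss: str                                       # definition of the concept
    hypernyms: list = field(default_factory=list)    # ids of "kind of" parents
    holonyms: list = field(default_factory=list)     # ids of wholes this is a part of


def hypernym_chain(net, synset_id):
    """Return the ids reached by following the first hypernym link upward."""
    chain = []
    current = net[synset_id]
    while current.hypernyms:
        parent_id = current.hypernyms[0]
        chain.append(parent_id)
        current = net[parent_id]
    return chain


# Toy network built from the slide's two examples; glosses are illustrative.
net = {
    "animal": Synset("animal", [("animal", 1)], "a living creature"),
    "dog": Synset("dog", [("dog", 1), ("domestic dog", 1)],
                  "a domesticated canine", hypernyms=["animal"]),
    "hand": Synset("hand", [("hand", 1)], "the end part of the arm"),
    "finger": Synset("finger", [("finger", 1)],
                     "a digit of the hand", holonyms=["hand"]),
}

print(hypernym_chain(net, "dog"))   # hypernym/hyponym: dog is a kind of animal
print(net["finger"].holonyms)       # holonym/meronym: finger is part of hand
```

Storing relations as lists of identifiers rather than object references keeps the network easy to serialize, which is one plausible reason wordnet tools tend to exchange data in flat formats such as XML.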

The wordnets that followed

Developed for other languages by individual teams or through multilingual projects

EuroWordNet - wordnets for English, Dutch, Italian, Spanish, French, German, Czech, and Estonian, based on PWN and aligned by interconnecting synsets representing the same concept in different languages via an Inter-Lingual-Index (ILI)

ILI also gives access to a shared top-ontology that provides a common semantic framework for all the languages, with language-specific properties maintained in the individual wordnets

BalkaNet - wordnets for Bulgarian, Greek, Romanian, Serbian and Turkish, and an expanded Czech wordnet; followed an approach similar to EuroWordNet: wordnets were developed on the basis of PWN and the top-ontology accepted in EuroWordNet, and also aligned by using ILI
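
Alignment via the ILI amounts to joining wordnets on a shared concept identifier. The sketch below assumes each wordnet is a plain mapping from an ILI identifier to the literals of the synset; the identifiers (`ILI-0001` and so on) and the function name `align_by_ili` are hypothetical and do not reflect the real ILI record format.

```python
def align_by_ili(wordnet_a, wordnet_b):
    """Pair the synsets of two wordnets that share an ILI identifier."""
    shared = wordnet_a.keys() & wordnet_b.keys()
    return {ili: (wordnet_a[ili], wordnet_b[ili]) for ili in shared}


# Toy data: ILI id -> literals of the synset in that language.
english = {
    "ILI-0001": ["dog", "domestic dog"],
    "ILI-0002": ["finger"],
}
serbian = {
    "ILI-0001": ["pas"],
    "ILI-0003": ["ruka"],
}

aligned = align_by_ili(english, serbian)
print(aligned)
```

Concepts present in only one wordnet (here `ILI-0002` and `ILI-0003`) simply drop out of the join, which is a convenient way to list the synsets that still need to be developed in the other language.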

Wordnet development tools

A number of software tools for wordnets have been developed in the past decades

As could have been expected, the first wordnet browser was developed for PWN

Its latest version is freely distributed with version 2.1 for Windows of the Princeton wordnet, while a web application for PWN browsing is also available

Other wordnet tools have been initiated within larger projects, such as EuroWordNet and BalkaNet

There are also many other tools, developed for individual languages, such as Russian

EuroWordNet tools

Polaris - used for creating, editing and exporting wordnets
– import of wordnets, editing and adding relations and query formulation
– visualization of semantic relations as a tree-structure that can directly be edited
– trees and sub-trees can be stored as distinct sets of synsets
– matching sets of synsets across wordnets via the ILI
– licensed from Lernout and Hauspie or from ELRA

Periscope - a graphical database viewer for viewing and exporting wordnets
– a public viewer used to look at wordnets created by Polaris
– cannot be used for importing or changing wordnets
– freely distributed

Other tools, such as WEI (Web EuroWordNet Interface)

Development of all tools ceased with the termination of EuroWordNet

VisDic

Developed within the framework of the BalkaNet project and used as the main tool for building all BalkaNet wordnets

Primarily aimed at browsing and editing wordnets, but expanded into a more general tool for viewing and editing various types of dictionary databases stored in XML format

Handles simultaneously up to 10 dictionaries, which can be monolingual or translational dictionaries, but also thesauri or plain corpora

Available for both Linux and Windows platforms

The development of VisDic itself has finished, but a completely new client-server version of this tool, DEBVisDic, is now being developed, and can be obtained free of charge, subject to registration

The two lexicographer’s problems

The concept placement problem:
Where should a new concept be placed and how should links with existing concepts be established?

The synonym selection problem:
How should the concept be lexicalized, namely, how to select the set of word-sense pairs for the synset that represents the concept?

In some cases wordnet development tools can offer support to the user in solving the first problem, but are of very little use for solving the second

On the concept placement problem

Many wordnets approached this problem by relying on the conceptual network of PWN as the basis for development

If this approach is adopted, wordnet development tools can offer support in solving the concept placement problem

Using PWN as a common conceptual network is especially convenient in cases of aligned multilingual wordnets, such as EuroWordNet and BalkaNet

Open questions:
– Are concepts linguistically independent or not?
– Are the lexicalization patterns for concepts universal?
– Is the structure of PWN valid for other languages as well?
– Is the set of semantic relations built in PWN sufficient for all languages?

On the synonym selection problem

Once a concept has been accepted and placed within the conceptual framework of a particular language, the lexicographer is confronted with the problem of its lexicalization

Besides selecting the appropriate synonyms he/she also needs to provide a gloss, and preferably usage examples

As synset elements appear as word-sense pairs, the lexicographer has to assign senses to all chosen words

The use of linguistic resources, such as electronic dictionaries, bilingual word lists and corpora, can be of invaluable help to the lexicographer in accomplishing this task

WS4LR (WorkStation for Lexical Resources)

A software tool developed within the Human Language Technology group at the University of Belgrade

Enables integrated handling of electronic dictionaries, wordnets, aligned texts and transducers

Where wordnets are concerned, builds on the features developed by previous tools, especially VisDic

Differs from other wordnet tools in that handling wordnets is only one of its functionalities

Allows exploitation of other resources during wordnet development, giving the lexicographer more support in his/her task

Motivation for WS4LR

The variety of lexical resources developed for many years within different projects and different conceptual and technological frameworks

A certain level of heterogeneity despite efforts to keep the growing pool of resources coherent and standardized

The necessity to develop a tool that would facilitate the maintenance, exploitation and integration of available resources as well as their further development

A need for an integrated and easily adjustable tool that would enhance the potential of each particular resource

The idea of exploiting the synergy of various resources for different HLT tasks, including wordnet development

Structure and characteristics

Composed of several modules which perform the following functions:
– development and refinement of wordnets
– management of a system of morphological, bilingual and multilingual electronic dictionaries
– manipulation of parallel aligned texts
– conversions between different character encodings and resource formats

Developed in C# and operates on the .NET platform

Enables invoking command-line routines and external Perl, Awk, and XSLT scripts

[Diagram: WS4LR modules (from Use Case View)]

WS4LR modules
+ CONVERSION
+ DICTIONARY MANAGEMENT
+ WORDNET DEVELOPMENT
+ EXPLOITATION OF ALIGNED TEXTS

DICTIONARY MANAGEMENT
+ Simple words manipulation
+ Compound words management
+ NooJ dictionaries management

WORDNET DEVELOPMENT
+ Manipulation of one or two wordnets
+ Synset retrieval using various methods
+ Navigation by following hypernym/hyponym relations
+ Copy of synsets with translation support
+ Exchange of information with morphological dictionaries
+ Production of Intex/Unitex graphs
+ Consistency checks on wordnets

Dictionary management

The main task of this module is to enable the manipulation of a system of morphological dictionaries of canonical forms, or lemmas, for both simple and compound words

Morphological dictionaries are of great importance for highly inflective languages, such as the group of Slavic languages

The absence of morphological information in wordnets has turned out to be a serious flaw in many applications

The possibility offered by WS4LR to simultaneously exploit both resources proved to be a great advantage in wordnet development

WS4LR also manipulates bilingual word lists and a multilingual dictionary of proper names, which can also be used in wordnet development

The lemma format

The lemma in a morphological dictionary of simple words has the following format:

lemma.Knnn [+SinSem]*

where lemma is the word form used in traditional dictionaries, K represents the part of speech (noun, verb, adjective, etc.), and nnn the inflectional class code of the lemma, whose characteristics are described by a corresponding transducer labeled Knnn

+SinSem is a set of optional tags which describe the syntactic, semantic, derivational and other properties of the lemma

The format of the lemmas for compound words is more complex, but it basically relies on the same principles
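
A simple-word entry in the notation given on this slide can be split into its parts with a regular expression. The sketch below follows the slide's `lemma.Knnn [+SinSem]*` pattern literally; the sample entries, the class codes `N600` and `N2`, the tag `Conc`, and the names `Entry` and `parse_entry` are hypothetical, and real dictionary files may differ in separators and escaping.

```python
import re
from typing import NamedTuple

# lemma, a dot, uppercase part-of-speech letters, digits, then optional +tags.
ENTRY_RE = re.compile(
    r"^(?P<lemma>[^.\s]+)\.(?P<pos>[A-Z]+)(?P<code>\d+)(?P<tags>(?:\+[^+\s]+)*)$"
)


class Entry(NamedTuple):
    lemma: str
    pos: str                # K: part of speech
    inflection_class: str   # Knnn: label of the inflectional transducer
    tags: tuple             # optional +SinSem markers


def parse_entry(line):
    """Parse one 'lemma.Knnn+Tag+Tag' line into its components."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple-word entry: {line!r}")
    tags = tuple(t for t in match.group("tags").split("+") if t)
    inflection_class = match.group("pos") + match.group("code")
    return Entry(match.group("lemma"), match.group("pos"), inflection_class, tags)


print(parse_entry("kuća.N600+Conc"))
print(parse_entry("pas.N2"))
```

Keeping the full `Knnn` label as one field mirrors the slide's remark that the code names the transducer describing the lemma's inflection.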

Intex

The format used in the system of morphological dictionaries is based on the LADL format developed in the Laboratoire d'Automatique Documentaire et Linguistique under the direction of Maurice Gross

The first system developed for processing of texts using dictionaries in LADL format was a system called Intex

Intex uses dictionaries in combination with regular expressions and inflectional and morphological finite state transducers (FSTs) to locate morphological, lexical and syntactic patterns, remove ambiguities, and tag simple and compound words in texts

Text parsing possibilities offered by regular expressions and FSTs proved very useful in wordnet development

NooJ and Unitex

Although Intex has been developed for many years and used by over 80 HLT laboratories, it does not support the processing of texts in Unicode

As the usage of Unicode became more and more frequent, the development of a new tool that could handle text in Unicode became inevitable

Building on the functionalities of Intex, but allowing the processing of texts in Unicode, such a new tool has been developed under the name of NooJ

Another system, Unitex, based on the LADL format and supporting resources in Unicode, has been developed in parallel, and is also available

Integrating the three systems in WS4LR

As each of the three systems has some useful specific features, WS4LR allows the user to activate the functions of the Intex, Unitex and/or NooJ systems, and select a list of dictionaries he/she wants to use

As none of the three systems offers possibilities for managing the content of dictionaries themselves, WS4LR provides entry, editing and review of lemmas of simple and compound words, for all three solutions

Dictionaries are organized in a modular fashion - in several sub-dictionaries as separate files

Smaller files are easier to manipulate, and in text recognition by Intex/Unitex the usage of all dictionaries is not always necessary, or even recommended

Lemma management

The user can modify or delete all the information attached to a lemma, or the lemma itself, and can also add new entries

A new entry can be generated from scratch or by copying an existing lemma, which in some cases facilitates the work

The regular expression or FST graph describing the inflectional properties of the selected lemma can be inspected, and corrected if found inadequate

Subsets of lemmas can be extracted by matching the lemmas, their part of speech, inflectional class code, syntactic and semantic markers, or a Boolean combination of these

For instance, one can look for all the dictionary entries starting or ending with a search string, which is particularly useful when the inflectional class code of a new lemma is being established, since this code depends on the lemma ending
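
The kind of subset extraction described here can be pictured as composing small predicates over dictionary entries. This is only a sketch of the idea, not how WS4LR implements its search: the entry layout, the sample lemmas and codes, and the helper names (`ends_with`, `has_pos`, `has_tag`, `all_of`, `any_of`, `select`) are assumptions made for the example.

```python
# Toy dictionary; codes and tags are invented for illustration.
entries = [
    {"lemma": "kuća", "pos": "N", "code": "N600", "tags": {"Conc"}},
    {"lemma": "ruka", "pos": "N", "code": "N601", "tags": {"Conc"}},
    {"lemma": "pas", "pos": "N", "code": "N2", "tags": {"Anim"}},
    {"lemma": "pisati", "pos": "V", "code": "V100", "tags": set()},
]


def ends_with(suffix):
    return lambda entry: entry["lemma"].endswith(suffix)


def has_pos(pos):
    return lambda entry: entry["pos"] == pos


def has_tag(tag):
    return lambda entry: tag in entry["tags"]


def all_of(*predicates):
    """Boolean AND of several predicates."""
    return lambda entry: all(p(entry) for p in predicates)


def any_of(*predicates):
    """Boolean OR of several predicates."""
    return lambda entry: any(p(entry) for p in predicates)


def select(dictionary, predicate):
    """Return the lemmas of all entries that satisfy the predicate."""
    return [entry["lemma"] for entry in dictionary if predicate(entry)]


# Nouns ending in "a": candidates for guessing the class code of a new lemma.
print(select(entries, all_of(has_pos("N"), ends_with("a"))))
# Entries that are verbs or are marked as animate.
print(select(entries, any_of(has_pos("V"), has_tag("Anim"))))
```

Looking at the class codes of the existing nouns that share an ending with a new lemma is the use case the slide singles out, since the code depends on the lemma ending.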

Compound words

Dictionaries of compound words can be a valuable resource in the wordnet development task

The form for new entries in these dictionaries is more complex since more information needs to be supplied:
– information pertaining to the entry as a whole
– information associated with the compound lemma constituents

For inflected compound constituents additional information is needed: the lemma, its inflectional class code, as well as the list of grammatical categories of the form that appears in the compound lemma

Bilingual word lists

WS4LR also handles bilingual word lists, as well as multilingual dictionaries, such as Prolex, the multilingual dictionary of proper names based on an ontology built around the conceptual proper name and its relations

This adds further functionality to the integration of lexical resources offered by WS4LR in various tasks, including wordnet development

Management of aligned parallel texts

Parallel texts, which usually originate from a text in one language and its translation in another, are often aligned at a certain level (paragraph, sentence, etc.) by matching the corresponding segments of the original and its translation

Aligned parallel texts are a valuable lexical resource which can be used for many HLT tasks, including wordnet development

The WS4LR module for management of aligned parallel texts uses texts which have previously been aligned with the Xalign alignment tool

The module converts these texts to the Translation Memory eXchange (TMX) format, which is becoming the standard format for aligned texts

The module can also use texts that are already in that format
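
As an illustration of the conversion step, the sketch below wraps already aligned segment pairs in a minimal TMX 1.4 document using Python's standard library. It is not the WS4LR implementation: the function name `build_tmx`, the header values, and the assumption that the aligner output has already been reduced to (source, target) pairs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs, src_lang="en", tgt_lang="sr"):
    """Wrap aligned (source, target) segment pairs in a TMX 1.4 tree."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")      # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx

pairs = [("The dog barks.", "Pas laje.")]   # illustrative data only
tmx_text = ET.tostring(build_tmx(pairs), encoding="unicode")
```

Each aligned pair becomes one `tu` element holding one `tuv` per language, which is what lets a TMX reader treat the two segments as translations of each other.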


Conversion

Adds to the flexibility of resource exploitation

Conversion from one character encoding set to another enables the exploitation of language resources in both the Cyrillic and Latin alphabets

The transformation can be applied to only a part of the file, e.g., when a dictionary-type file is transformed, only lemmas and word forms are converted, not the part of speech and grammatical codes

The module also makes switching between resources in Intex and Unitex quick and easy

The user can also choose a Perl or awk conversion script suitable for a specific file type, or even produce his/her own script
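
The partial transformation of a dictionary file can be sketched as follows. The example assumes a DELAF-style entry of the form `form,lemma.codes` and transliterates Serbian Cyrillic to Latin only in the part before the first period. The function names and the sample entry are hypothetical, and escaped periods inside word forms are not handled.

```python
# Serbian Cyrillic -> Latin, lowercase letters; three letters map to digraphs.
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
# Add uppercase letters; a capital digraph letter becomes e.g. "Lj".
_TABLE = str.maketrans(
    {**_LOWER, **{c.upper(): lat.capitalize() for c, lat in _LOWER.items()}}
)

def to_latin(text):
    """Transliterate Serbian Cyrillic text; other characters pass through."""
    return text.translate(_TABLE)

def convert_entry(line):
    """Convert 'form,lemma' only; everything after the first '.' is kept as is."""
    words, sep, codes = line.partition(".")
    return to_latin(words) + sep + codes

print(convert_entry("куће,кућа.N:fs2"))   # sample entry with made-up codes
```

Splitting the line before transliterating is what keeps the codes intact. The same split matters even more in the opposite direction, where Latin-script codes such as `N` would otherwise be rewritten as Cyrillic letters.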


Wordnet management

The wordnet management module supports search of wordnets, their visualization, as well as their development and refinement

When this module is activated, the main form opens with two wordnet windows, allowing the user to work with one or two wordnets

In the current version of WS4LR these two wordnets are the Serbian and English wordnets, but the tool can easily be adapted for any two wordnets

If the user decides to work with both wordnets in parallel, he/she can always synchronize them via the ILI

The main form for wordnet management also opens a window with a bilingual word list


Searching wordnets

The user can choose to search just one wordnet or both of them

Synsets can be retrieved using various methods, from simple string matching to complex XPath expressions

In simple string matching the user can specify whether an exact match is required or not, and in the latter case the system will also retrieve synsets that contain words which have the specified string(s) as their part

The user can use XPath expressions to retrieve synsets on the basis of various other criteria, such as the domain synsets belong to

WS4LR offers predefined XPath expressions, but the user can also define these expressions him/herself
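
The retrieval methods described above can be sketched in Python. The XML layout used here (`SYNSET`, `ID`, `SYNONYM`, `LITERAL`, `SENSE`, `DOMAIN`) is an assumed VisDic-like schema, and the synset IDs and domains are invented for the example. The standard library supports only a subset of XPath, so the substring match is done in plain Python.

```python
import xml.etree.ElementTree as ET

# Illustrative data only; IDs and domains are made up.
SAMPLE = """<WN>
<SYNSET><ID>ENG20-00000001-n</ID><POS>n</POS>
  <SYNONYM><LITERAL>dog<SENSE>1</SENSE></LITERAL>
           <LITERAL>domestic dog<SENSE>1</SENSE></LITERAL></SYNONYM>
  <DOMAIN>zoology</DOMAIN></SYNSET>
<SYNSET><ID>ENG20-00000002-n</ID><POS>n</POS>
  <SYNONYM><LITERAL>hot dog<SENSE>1</SENSE></LITERAL></SYNONYM>
  <DOMAIN>gastronomy</DOMAIN></SYNSET>
</WN>"""

def literals(synset):
    """Return the words of a synset (text preceding each SENSE child)."""
    return [lit.text.strip() for lit in synset.iter("LITERAL") if lit.text]

def find_by_string(root, query, exact=True):
    """IDs of synsets with a word equal to (or, if not exact, containing) query."""
    hits = []
    for synset in root.iter("SYNSET"):
        if any((w == query) if exact else (query in w) for w in literals(synset)):
            hits.append(synset.findtext("ID"))
    return hits

def find_by_domain(root, domain):
    """IDs of synsets selected by an XPath predicate on the DOMAIN child."""
    return [s.findtext("ID") for s in root.findall(f".//SYNSET[DOMAIN='{domain}']")]

root = ET.fromstring(SAMPLE)
print(find_by_string(root, "dog"))
print(find_by_string(root, "dog", exact=False))
print(find_by_domain(root, "gastronomy"))
```

With an exact match only the first synset is returned. Relaxing the match also retrieves the synset containing "hot dog", mirroring the behaviour described for inexact string matching.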


Adding new concepts using its hypernym

With a particular concept in mind, the lexicographer can inspect the wordnet for the existence of its hypernym

If an appropriate hypernym is found, the new synset can be placed as its hyponym

In order to find such a hypernym the search possibilities offered by WS4LR can be used

As synsets can be visualized in various forms (as text, XML or a hypernym/hyponym tree), navigation through the hypernym/hyponym tree can also be used to locate a hypernym
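
A minimal sketch of navigating the hypernym/hyponym tree and attaching a new synset under a located hypernym is given below. The dictionary representation, the synset IDs, and the restriction to a single hypernym per synset are simplifying assumptions made for the example, not properties of WS4LR.

```python
# Hypothetical toy wordnet: synset id -> (literals, hypernym id or None)
WORDNET = {
    "s1": (["entity"], None),
    "s2": (["animal"], "s1"),
    "s3": (["dog"], "s2"),
}

def hypernym_chain(wordnet, synset_id):
    """Return the ids from synset_id up to the root, following hypernym links."""
    chain, seen = [], set()
    while synset_id is not None and synset_id not in seen:  # guard against cycles
        seen.add(synset_id)
        chain.append(synset_id)
        synset_id = wordnet[synset_id][1]
    return chain

def hyponyms(wordnet, synset_id):
    """Return the ids of the direct hyponyms of synset_id."""
    return sorted(sid for sid, (_, hyper) in wordnet.items() if hyper == synset_id)

def add_hyponym(wordnet, new_id, words, hypernym_id):
    """Place a new synset under an existing hypernym."""
    if hypernym_id not in wordnet:
        raise KeyError(f"hypernym {hypernym_id!r} not found")
    wordnet[new_id] = (list(words), hypernym_id)

add_hyponym(WORDNET, "s4", ["cat"], "s2")
print(hypernym_chain(WORDNET, "s4"))
print(hyponyms(WORDNET, "s2"))
```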


Adding new concepts using PWN

Starting with a word that denotes the concept, the user can locate the candidate PWN synsets available using the bilingual word list

Using the option “Match ID” the user can first identify the synsets in the source and target wordnet that already have a match

If the matching PWN synset for the new concept is found, the new synset can be inserted in the appropriate place in the target wordnet using the option “Create synset in the other language”

If necessary, this option also creates copies of all its missing hypernyms, to prevent the new synset from becoming a “dangling” synset

The user can then proceed with modifications
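
The idea behind copying missing hypernyms can be sketched as follows. This is an assumption-laden illustration, not the code behind “Create synset in the other language”: synsets are plain dictionaries, the two wordnets are assumed to share synset IDs, each synset has at most one hypernym, and the function name `create_in_target` is hypothetical.

```python
def create_in_target(source, target, synset_id):
    """Copy synset_id from source into target together with every hypernym
    the target lacks, so the new synset is not left dangling.
    Returns the copied ids, most general first."""
    missing = []
    current = synset_id
    # Climb the source hierarchy until a synset already present in target.
    while current is not None and current not in target:
        missing.append(current)
        current = source[current]["hypernym"]
    for sid in reversed(missing):
        # Source literals are placeholders for the user to translate later.
        target[sid] = {"literals": list(source[sid]["literals"]),
                       "hypernym": source[sid]["hypernym"]}
    return list(reversed(missing))

# Illustrative data only.
source = {
    "e1": {"literals": ["entity"], "hypernym": None},
    "e2": {"literals": ["animal"], "hypernym": "e1"},
    "e3": {"literals": ["canine"], "hypernym": "e2"},
    "e4": {"literals": ["dog"], "hypernym": "e3"},
}
target = {"e1": {"literals": ["entitet"], "hypernym": None}}

copied = create_in_target(source, target, "e4")
print(copied)
```

Here the target contains only the root, so the two intermediate synsets are copied along with the requested one, and the user would then replace the copied literals with target-language words.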


Selection of synonymous words

WS4LR also offers substantial aid in solving the synonym selection problem: the selection of synonymous words for the synset and the assignment of meanings to these words

Although it is reasonable to assume that the wordnet developer has a good idea of the candidate words for the synset of the concept he/she wants to add to the wordnet, it is also possible that he/she might overlook some of them

The simplest and most straightforward aid is the bilingual word list

Words from the source (English) synset can be matched with words in the target language as probable candidates

The multilingual dictionary Prolex could be used in a similar manner
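
Candidate lookup through a bilingual word list can be sketched as follows. The word list content and the function name are hypothetical and serve only to illustrate the matching step.

```python
# Hypothetical English -> Serbian word list.
BILINGUAL = {
    "dog": ["pas", "ker"],
    "hound": ["pas", "lovački pas"],
}

def candidate_literals(source_literals, word_list):
    """Collect target-language candidates for all words of the source synset,
    without duplicates and in order of first appearance."""
    candidates = []
    for word in source_literals:
        for translation in word_list.get(word, []):
            if translation not in candidates:
                candidates.append(translation)
    return candidates

print(candidate_literals(["dog", "hound", "canis familiaris"], BILINGUAL))
```

Source words missing from the list simply contribute nothing, so the result is a set of suggestions for the lexicographer to review rather than a finished synset.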


Using aligned texts

The synonym selection problem can be approached by combining two wordnets and aligned texts

WS4LR searches both aligned texts in parallel using selected words from both languages

All of the words found in both texts are highlighted

A lexicographer can use this option to extract possible candidate words for a synset by searching aligned texts with words from the original PWN synset and words he/she has already selected for the target synset

If a highlighted word found in the text in English does not have a highlighted match in the text in the target language, the lexicographer should inspect the sentence in the target language for a possible match, which would then be a new candidate for the synset
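
The procedure of spotting English hits without a highlighted counterpart can be sketched in a few lines. The sketch assumes the aligned text is available as a list of (English, target) sentence pairs and matches exact word forms only, so inflected forms are missed unless they are listed explicitly. The function names and sample sentences are hypothetical.

```python
import re

def found_words(words, segment):
    """Return those words that occur in the segment as whole words."""
    text = segment.lower()
    return [w for w in words
            if re.search(r"\b" + re.escape(w.lower()) + r"\b", text)]

def unmatched_segments(aligned, source_words, target_words):
    """Return aligned pairs where a source word occurs but no known
    target word does; these are the sentences worth inspecting."""
    result = []
    for src, tgt in aligned:
        src_hits = found_words(source_words, src)
        if src_hits and not found_words(target_words, tgt):
            result.append((src_hits, src, tgt))
    return result

# Illustrative data only.
aligned = [
    ("The dog barked.", "Pas je zalajao."),
    ("He fed the dog.", "Nahranio je kera."),
]
to_inspect = unmatched_segments(aligned, ["dog"], ["pas"])
print(to_inspect)
```

In the second pair the English word is found but the known target word is not, so the sentence is flagged. Reading it would reveal a further candidate for the synset.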


Checking synset words in context

Once the user has rounded up all the candidate words for the synset, he/she might be in doubt whether one or more words properly fit into the synset

In that case the user might want to observe these words within a context, which can be done by searching a corpus for these words and obtaining concordances

By seeing the occurrences of the words in context, the user will be able to make a better assessment of whether they are really appropriate or not

In WS4LR this can be done by creating a regular expression or FST graph from one or more words, and using it to search a text in the target language
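
A regular expression built from the candidate words can produce simple keyword-in-context concordances, as in the sketch below. It covers only the regular-expression route, not FST graphs, and the function name, context width, and sample text are assumptions made for the example.

```python
import re

def concordance(text, words, width=20):
    """Return (left context, keyword, right context) for every occurrence
    of any of the given words, matched as whole words, case-insensitively."""
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

text = "Pas laje na ulici. Stari ker spava."   # illustrative text only
for left, keyword, right in concordance(text, ["pas", "ker"], width=12):
    print(f"{left:>12} [{keyword}] {right}")
```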


Wordnet consistency checks

The WS4LR wordnet module also performs various consistency checks on wordnets

For example, when word senses are in question, WS4LR provides information on the senses that have already been used for a word, so the user can assign a sense tag that has not previously been assigned, thus preventing duplicate word-sense pairs

The wordnet module can also detect dangling relations, and the use of the same word in a hypernym/hyponym pair, which is not allowed
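
The three checks named above can be sketched over a toy in-memory wordnet. The data layout, the function names, and the sample entries are hypothetical, and each synset is assumed to have at most one hypernym.

```python
from collections import Counter

def check_wordnet(wordnet):
    """wordnet: id -> {"literals": [(word, sense), ...], "hypernym": id or None}.
    Returns a list of (problem, details) tuples."""
    problems = []
    counts = Counter(pair for s in wordnet.values() for pair in s["literals"])
    for pair, n in sorted(counts.items()):
        if n > 1:
            problems.append(("duplicate word-sense pair", pair))
    for sid, synset in sorted(wordnet.items()):
        hyper = synset["hypernym"]
        if hyper is None:
            continue
        if hyper not in wordnet:
            problems.append(("dangling relation", (sid, hyper)))
            continue
        own = {w for w, _ in synset["literals"]}
        shared = own & {w for w, _ in wordnet[hyper]["literals"]}
        for word in sorted(shared):
            problems.append(("same word in hypernym and hyponym", (sid, hyper, word)))
    return problems

def next_free_sense(wordnet, word):
    """Smallest positive sense number not yet used for the word."""
    used = {sense for s in wordnet.values() for w, sense in s["literals"] if w == word}
    sense = 1
    while sense in used:
        sense += 1
    return sense

# Illustrative data only.
wn = {
    "a": {"literals": [("sisar", 1)], "hypernym": None},
    "b": {"literals": [("pas", 1), ("sisar", 2)], "hypernym": "a"},
    "c": {"literals": [("pas", 1)], "hypernym": "x"},
}
for problem in check_wordnet(wn):
    print(problem)
print(next_free_sense(wn, "pas"))
```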


Using morphological information

Morphological dictionaries extend the search possibilities by enabling searches with all inflected forms of the words, which is of great importance in the case of highly inflected languages, such as Serbian

WS4LR also enables the enrichment of synsets with morphosyntactic information from morphological dictionaries

The tool can search for all synset words in morphological dictionaries of simple or compound lemmas, retrieve their inflectional class codes, and assign them to synset words using the <LNOTE> XML tag

If several lemmas of the same form exist, they are all offered to the user to choose the appropriate one

The missing morphosyntactic information can thus be added to wordnets
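
The enrichment step can be sketched as follows. Placing `LNOTE` as a child of `LITERAL`, the inflectional class codes, and the `choose` callback that stands in for the user's selection among homographs are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical lemma -> inflectional class codes (several codes = homographs).
MORPH_DICT = {"pas": ["N1"], "kosa": ["N600", "N601"]}

def annotate(synset, morph_dict, choose=lambda lemma, codes: codes[0]):
    """Add an LNOTE child with an inflectional class code to every LITERAL
    found in the dictionary; `choose` picks among several candidate codes."""
    for literal in list(synset.iter("LITERAL")):
        lemma = (literal.text or "").strip()
        codes = morph_dict.get(lemma)
        if not codes or literal.find("LNOTE") is not None:
            continue  # unknown lemma, or already annotated
        code = codes[0] if len(codes) == 1 else choose(lemma, codes)
        ET.SubElement(literal, "LNOTE").text = code

synset = ET.fromstring(
    "<SYNSET><SYNONYM>"
    "<LITERAL>pas<SENSE>1</SENSE></LITERAL>"
    "<LITERAL>kosa<SENSE>2</SENSE></LITERAL>"
    "<LITERAL>xyz<SENSE>1</SENSE></LITERAL>"
    "</SYNONYM></SYNSET>"
)
# Here the "user" always picks the last offered code.
annotate(synset, MORPH_DICT, choose=lambda lemma, codes: codes[-1])
print(ET.tostring(synset, encoding="unicode"))
```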


Concluding remarks

The desktop version of WS4LR is fully operational and is already being used as the main tool for developing resources in Serbian, including the Serbian wordnet, but its commercial applications have not yet been considered

Although a systematic evaluation of WS4LR has not been performed, there have already been several enhancements of the tool on the basis of user feedback

A full-scale web version of this tool is planned, which would enable its use in wordnet development by several lexicographers concurrently, with all the possibilities the desktop version now offers

Presently, some of the WS4LR functions are available on the web for searches based on morphological (using dictionaries), semantic (using wordnets) and multilingual (using aligned multilingual wordnets) expansions of the initial query