

Open Dutch WordNet

Marten Postma
VU Amsterdam, Amsterdam, The Netherlands
m.c.postma@vu.nl

Emiel van Miltenburg
VU Amsterdam, Amsterdam, The Netherlands
emiel.van.miltenburg@vu.nl

Roxane Segers
VU Amsterdam, Amsterdam, The Netherlands
roxane.segers@gmail.com

Anneleen Schoen
VU Amsterdam, Amsterdam, The Netherlands
a.m.schoen@vu.nl

Piek Vossen
VU Amsterdam, Amsterdam, The Netherlands
piek.vossen@vu.nl

Abstract

We describe Open Dutch WordNet, which has been derived from the Cornetto database, the Princeton WordNet and open source resources. We exploited existing equivalence relations between Cornetto synsets and WordNet synsets in order to move the open source content from Cornetto into WordNet synsets. Currently, Open Dutch WordNet contains 117,914 synsets, of which 51,588 synsets contain at least one Dutch synonym, which leaves 66,326 synsets still to obtain a Dutch synonym. The average polysemy is 1.5. The resource is currently delivered in XML under the CC BY-SA 4.0 license¹ and it has been linked to the Global Wordnet Grid. In order to use the resource, we refer to: https://github.com/MartenPostma/OpenDutchWordnet.

1 Introduction

The main goal of this project is to convert the Dutch lexical semantic database Cornetto version 2.0 (Vossen et al., 2013) into an open source version. Cornetto is currently not distributed as open source, because a large portion of the database originates from the commercial publisher Van Dale.² The main task of this project is hence to replace the proprietary content of the database with open source content. In order to create Open Dutch WordNet, we used all the synsets and relations from WordNet 3.0 (Fellbaum, 1998) as our basis. We then exploited existing equivalence relations between Cornetto synsets and WordNet synsets in order to replace WordNet synonyms by Dutch synonyms. We further added new concepts that could not be matched, attaching them to the WordNet hierarchy through hyperonym relations. Any new, manually created semantic relation from Cornetto was added to the database as well. We limited the synonyms, concepts and relations to those on which there are no copyright claims. In addition, the inter-language links in various external resources were used to add synonyms to the resource. The result is an open source wordnet that combines the merge and expand methods described by Vossen (1999).

¹ https://creativecommons.org/licenses/by-sa/4.0/

² http://www.vandale.nl/

The resource is currently delivered in XML under the CC BY-SA 4.0 license.³ In order to inspect and improve the resource, a Python module has been created. This module can be found at: https://github.com/MartenPostma/OpenDutchWordnet.

The outline of this paper is as follows. We start with the motivation to create Open Dutch WordNet in section 2, followed by the methodology to create the resource in section 3. An overview of the main components is provided in section 4. Finally, we discuss the process of making the resource and plans to improve it in section 5.

2 Background and motivation

The first version of the Dutch WordNet was developed within the EuroWordNet project, starting from a database developed by the publisher Van Dale. This database already contained synset-like structures and lexical semantic relations that could be used to efficiently derive a wordnet structure. Licenses were agreed for commercial and research usage. The Dutch WordNet and the Referentie Bestand Nederlands (RBN) (Van der Vliet, 2007) were combined in the Cornetto project (Vossen et al., 2013). RBN has detailed information on morpho-syntactic, semantic and pragmatic properties of lexical units, with a focus on the combinatorics. The Cornetto database thus provides the semantic organization of a wordnet and the details on each synonym in a synset, as can be found in lexical unit based lexicons. An important characteristic of Cornetto is that it has been developed independently from Princeton WordNet (PWN). The synsets in Cornetto were then mapped to synsets in PWN following a merge approach (Vossen, 1999). First, all possible equivalence relations were created between synonyms in synsets using bilingual dictionaries, after which the mappings were ranked on the basis of shared properties, e.g. hyperonyms and hyponyms already linked manually, similar domain labels and synset membership of multiple translations (Vossen et al., 2008). The Van Dale publisher, however, decided to stop all collaborations with the research community. This motivated us to develop Open Dutch WordNet, for which we wanted to keep as much as possible of the concepts and word meanings that are defined independently of PWN. This implies that we cannot simply follow an expand approach to translate English synonyms in PWN to Dutch words, but also need to match PWN synsets to RBN lexical units.

³ https://creativecommons.org/licenses/by-sa/4.0/

Figure 1 introduces the main components of the Dutch lexical semantic database Cornetto.

[Figure 1: diagram. The Cornetto synset {palmboom1, palm1} is linked by HAS_HYPERONYM to the Cornetto synset {boom1} and by EQ_SYNONYM to the PWN synset {palm3, palm tree1}. A lexical unit is shown as a cdb_lu element containing form (form-cat="noun", form-spelling="palm"), morphology_noun, syntax_noun, semantics_noun, examples, sem-definition and sem-synonyms.]

Figure 1: The most important components of Cornetto are visualized. The ellipses in red are examples of Cornetto synsets, which contain Lexical Units (LUs). Each LU can contain rich information about its morphology, syntax and semantics. Cornetto synsets can have Internal Semantic Relations (ISRs) to other Cornetto synsets (e.g. HAS_HYPERONYM), but also Equivalence Semantic Relations (ESRs) to PWN synsets (e.g. EQ_SYNONYM).

Figure 1 visualizes the most important components of Cornetto. Cornetto synsets, or Cornetto sets of synonyms, are shown in red. The synonyms inside the Cornetto synsets are called Lexical Units (LUs), because they can contain rich information about their morphology, syntax and semantics, especially if these LUs originate from RBN. Synonyms that originate from the Van Dale database only have part-of-speech information. Cornetto synsets can have Internal Semantic Relations (ISRs) to other Cornetto synsets (e.g. HAS_HYPERONYM), but also Equivalence Semantic Relations (ESRs) to PWN synsets (e.g. EQ_SYNONYM). ESRs are mainly used to define synonymy or near synonymy between Cornetto synsets and PWN synsets. Most ISRs originate from the Van Dale database. A small set of relations was added manually in the various projects. All synonyms and relations have provenance tags, which enable us to distinguish data from Van Dale from data that can be transferred to Open Dutch WordNet.

Table 1 presents the provenance statistics for the most important components of the database.

Component   Van Dale   RBN   Cornetto
LU          60         57    15
S           70         1     0
ISR         77         0     33
ESR         0          0     82

Table 1: The provenance information for Lexical Units (LU), Synsets (S), Internal Semantic Relations (ISR) and Equivalence Semantic Relations (ESR) is shown for each of the three sources: Van Dale, RBN and Cornetto (if the source is Cornetto, this means that the data was created manually in the Cornetto project and does not originate from Van Dale).

Table 1 clearly shows that a large part of the LUs, synsets and ISRs originate from Van Dale. The removal of this licensed content creates large gaps in the resource. The main goal is hence to replace the licensed content with open source content as much as possible. The most promising components for transferring information from Cornetto into Open Dutch WordNet are the ESRs, which were created semi-automatically during the EuroWordNet and Cornetto projects and are 100% open source.

3 Methodology

We used the following procedure to create Open Dutch WordNet (ODWN).

We use the English WordNet 3.0 (PWN) (Miller, 1995; Fellbaum, 1998) as our basis for the concept structure. This means that we copied the PWN synsets and relations to ODWN and ignored all synsets and relations from Van Dale. The next step is to transfer the LUs from RBN to the PWN-based synsets.

Before copying these LUs, we improved the quality of the ESRs. We defined a set of ESRs that are either likely to be more difficult or that play an important role in the transfer. This subset was checked manually and was also used as training data to filter the remaining ESRs using a decision tree algorithm. This process is described in subsection 3.1.

Subsequently, we make use of the ESRs between Cornetto synsets and WordNet synsets to copy the LUs that do not originate from Van Dale from a Cornetto synset into a WordNet synset, which is described in subsection 3.2.

The transfer still leaves us with many PWN synsets without a Dutch LU. We therefore use open source resources to translate the WordNet synonyms into Dutch, which is described in subsections 3.3 and 3.4. This not only gives more synsets a Dutch synonym, but also provides further evidence that transferred synonyms are correct when other sources propose them as well.

Finally, we manually checked 8,257 Dutch synonyms, which is described in subsection 3.5.

3.1 Revision of equivalence relations

Firstly, we manually filtered the ESRs, focusing on the synonymy relations. Each ESR links a Cornetto synset to a WordNet synset with a certain relation type. The mapping is many-to-many. We considered three main aspects of Cornetto synsets in deciding whether to manually check an ESR: the synset depth, the number of children and the number of ESRs. We decided to manually check the deepest and shallowest synsets, because these relations got little attention in previous projects. In addition, we checked the synsets with the most children, because they play an important role in a wordnet. Finally, the Cornetto synsets with the most ESRs were checked, because we suspect that their equivalence relations are complex and likely to contain many wrong mappings. Four students manually checked 12,966 of the total 82,285 ESRs, of which 6,575 were removed.

The manually revised relations were used to train a pruned C4.5 decision tree (Quinlan, 1993; Hall et al., 2009), which was used to filter the remaining ESRs. An ESR consists of an equivalence relation between a Cornetto synset and a WordNet synset. We used properties of the Cornetto synset and the WordNet synset, as well as of the synset relation itself, as features:

1. the number of equivalence relations in which a Cornetto synset and a WordNet synset are present;

2. the depth of the Cornetto synset and of the WordNet synset, as well as the difference between the two depths;

3. the semantic similarity: because a Cornetto synset can be present in multiple ESRs to WordNet synsets and vice versa, we average the semantic similarity scores (using the Leacock & Chodorow similarity measure (Leacock and Chodorow, 1998)) of all combinations of these ESRs.

Interestingly enough, the features in which Cornetto properties were used yielded the best results. This might be caused by the fact that the relations were also generated using Cornetto. The filtering of the ESRs using the decision tree algorithm resulted in an additional removal of 32,258 ESRs.
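The feature groups listed above can be made concrete with a small sketch. It is not the authors' implementation: the ESR pairs, synset identifiers and depth values are invented, and the helper names are hypothetical. The similarity function follows the usual statement of the Leacock & Chodorow measure, -log(path length / (2 x taxonomy depth)), with the path length counted in nodes.

```python
import math
from collections import Counter

# Hypothetical toy data: (Cornetto synset id, WordNet synset id) pairs.
esrs = [("c_palm", "w_palm_tree"), ("c_palm", "w_palm_hand"), ("c_boom", "w_tree")]
cornetto_depth = {"c_palm": 5, "c_boom": 4}
wordnet_depth = {"w_palm_tree": 7, "w_palm_hand": 6, "w_tree": 6}

# Feature group 1: in how many ESRs does each synset take part?
cornetto_count = Counter(c for c, _ in esrs)
wordnet_count = Counter(w for _, w in esrs)


def esr_features(cornetto_id, wordnet_id):
    """Feature groups 1 and 2 for a single ESR (assumed feature names)."""
    return {
        "n_esrs_cornetto": cornetto_count[cornetto_id],
        "n_esrs_wordnet": wordnet_count[wordnet_id],
        "depth_cornetto": cornetto_depth[cornetto_id],
        "depth_wordnet": wordnet_depth[wordnet_id],
        "depth_difference": cornetto_depth[cornetto_id] - wordnet_depth[wordnet_id],
    }


def lch_similarity(path_edges, taxonomy_depth):
    """Leacock & Chodorow similarity for two synsets that are
    path_edges edges apart in a taxonomy of the given depth."""
    return -math.log((path_edges + 1) / (2.0 * taxonomy_depth))


features = esr_features("c_palm", "w_palm_tree")
print(features)
print(lch_similarity(path_edges=1, taxonomy_depth=4))
```

Feature vectors of this kind, labelled with the outcome of the manual check, are what a decision tree learner would be trained on.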

3.2 Cornetto synonyms

When there exists an ESR between a Cornetto synset and a WordNet synset, and the relation type is either EQ_SYNONYM or EQ_NEAR_SYNONYM, all LUs that do not originate from Van Dale are inserted into the WordNet synset. Using figure 1 as an example, the LUs palmboom1 and palm1 would replace palm tree1 and palm3. If the ESR was checked manually, the provenance tag is cdb2.2_Manual. If the ESR was checked using the decision tree algorithm, the provenance tag is cdb2.2_Auto. The provenance tag cdb2.2_None is given to all other strategies that were used to add LUs to Open Dutch WordNet. One of the most dominant strategies of this class applies when a LU in a Cornetto synset does not have a direct ESR to a WordNet synset (it has no ESR, or one of type EQ_HAS_HYPERONYM), but the parent of the Cornetto synset does have an ESR to a WordNet synset. In that case, a new synset (not represented in WordNet) is created as a hyponym of the target of the ESR of the hyperonym. Finally, the ESRs are used to insert Cornetto synset relations into Open Dutch WordNet that do not originate from Van Dale, but were created manually in one of the projects.
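The transfer rule for synonymy ESRs can be sketched as follows. This is a minimal illustration under assumptions, not the code used to build the resource: the dictionary layout, the identifiers and the source labels ("rbn", "vandale") are hypothetical, and only the two tags for manually and automatically checked ESRs are modelled.

```python
from collections import defaultdict

SYNONYMY_TYPES = {"EQ_SYNONYM", "EQ_NEAR_SYNONYM"}


def transfer_lexical_units(esrs, cornetto_lus):
    """Copy non-Van Dale LUs from Cornetto synsets into linked WordNet synsets."""
    odwn = defaultdict(list)
    for esr in esrs:
        if esr["rel_type"] not in SYNONYMY_TYPES:
            continue  # other relation types are handled by different strategies
        tag = "cdb2.2_Manual" if esr["checked_manually"] else "cdb2.2_Auto"
        for lu in cornetto_lus[esr["cornetto_synset"]]:
            if lu["source"] == "vandale":
                continue  # proprietary content is never copied
            odwn[esr["wordnet_synset"]].append((lu["lemma"], tag))
    return dict(odwn)


# Hypothetical toy input; "voorbeeldwoord" stands in for a Van Dale-only LU.
cornetto_lus = {
    "c-palmboom": [
        {"lemma": "palmboom", "source": "rbn"},
        {"lemma": "palm", "source": "rbn"},
        {"lemma": "voorbeeldwoord", "source": "vandale"},
    ],
    "c-boom": [{"lemma": "boom", "source": "rbn"}],
}
esrs = [
    {"cornetto_synset": "c-palmboom", "wordnet_synset": "w-palm_tree",
     "rel_type": "EQ_SYNONYM", "checked_manually": True},
    {"cornetto_synset": "c-boom", "wordnet_synset": "w-plant",
     "rel_type": "EQ_HAS_HYPERONYM", "checked_manually": False},
]

result = transfer_lexical_units(esrs, cornetto_lus)
print(result)
```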

3.3 External resources

Using various external open source resources such as Wiktionary (Wikimedia Foundation, 2014b), Omegawiki⁴ and Google (Google, 2014), Oliver (2014) translated both monosemous and polysemous lemmas into Dutch for the parts of speech noun, verb and adjective. For the monosemous lemmas, the English lemmas are simply translated into Dutch. For the polysemous lemmas, the gloss overlap between examples in an external resource and the possible WordNet synsets for a lemma is used to determine the correct synset for a lemma. We used a similar procedure to add synonyms from Wikipedia (Wikipedia, 2014; Wikimedia Foundation, 2014a).
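To illustrate the idea of gloss overlap for polysemous lemmas, the sketch below picks the candidate synset whose gloss shares the most content words with an example text. It is a simplified stand-in, not the procedure of Oliver (2014): the stopword list, the tokenisation, the synset identifiers and the glosses are all assumptions made for the example.

```python
STOPWORDS = {"a", "an", "the", "of", "to", "in", "with", "by", "from", "any"}


def content_words(text):
    """Lower-cased tokens without surrounding punctuation or stopwords."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def pick_synset(example, candidate_glosses):
    """Return the candidate with the largest word overlap, or None if none overlaps."""
    example_words = content_words(example)
    scores = {
        synset_id: len(example_words & content_words(gloss))
        for synset_id, gloss in candidate_glosses.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical candidates for the English lemma "palm".
candidates = {
    "palm-hand": "the inner surface of the hand from the wrist to the base of the fingers",
    "palm-tree": "any plant of the family Palmae having an unbranched trunk crowned by large leaves",
}

chosen = pick_synset("a tall tropical plant with large leaves", candidates)
print(chosen)
```

Returning None when nothing overlaps keeps the sketch from forcing a choice where the evidence is absent.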

3.4 Adjectives extended

We created a mapping for two kinds of adjectives: monosemous adjectives that have only one sense in WordNet, and 'slightly polysemous adjectives' that have exactly one adjectival sense and one nominal sense. Adjectives of the latter kind are typically nationalities (Cameroonian), religious denominations (Buddhist) and words like purebred. To create the mapping, we translated the English word forms using Google Translate and Bing Translate. We also used the word alignments from the OPUS project (Tiedemann, 2012). These resources provide us with Dutch candidate word forms that should correspond to the original WordNet synonyms in synsets. We then checked for each word form how many senses are associated with it in RBN. If there is only one (and the word is indeed an adjective), we conclude that this Dutch sense corresponds to the original WordNet synset.

One problem with the translation-based approach is that Dutch adjectives are sometimes inflected with the suffix -e. For example, the English ontological is automatically translated by Google to ontologische. In RBN, all word forms are stored without the inflectional ending, which means that the translation does not match the lemma. To solve this issue, in the cases where we could not find a direct match, we applied an automatic stemming rule to remove the suffix and tried to find a match using the stem.

⁴ http://www.omegawiki.org/
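The matching step with its stemming fallback might look like the sketch below. The sense inventory is invented and the function name is hypothetical; the rule simply drops a final -e when the translated form has no direct match, and accepts the result only if the lemma has exactly one sense and that sense is adjectival.

```python
def match_adjective(candidate, rbn_senses):
    """Return the matching RBN lemma for a translated adjective, or None.

    rbn_senses maps a lemma to the parts of speech of its senses
    (an assumed, simplified view of the lexicon).
    """
    form = candidate
    if form not in rbn_senses and form.endswith("e"):
        form = form[:-1]  # fall back to the stem only when there is no direct match
    senses = rbn_senses.get(form, [])
    if len(senses) == 1 and senses[0] == "adj":
        return form
    return None


# Hypothetical sense inventory.
rbn_senses = {
    "ontologisch": ["adj"],
    "boeddhistisch": ["adj"],
    "licht": ["adj", "noun"],
}

print(match_adjective("ontologische", rbn_senses))
print(match_adjective("lichte", rbn_senses))
```

A rule this simple does not undo spelling changes that accompany inflection (for example grote versus groot), so such forms would remain unmatched in this sketch.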

3.5 Manual editing

Finally, we checked the resulting Dutch wordnet manually. We focused on two main editing tasks. Firstly, we inspected all synsets that had 10 or more synonyms, since such large synsets may contain false synonyms. In addition, because one Cornetto synset could have multiple ESRs, the same sense was sometimes copied into multiple WordNet synsets. This may lead to excessive polysemy. The second task therefore consisted of indicating which WordNet synset was the correct synset for a sense that occurred in more than one WordNet synset. In total, 8,257 LUs were checked in this phase.
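Selecting the material for these two editing tasks amounts to two simple groupings. The sketch below assumes the wordnet is available as (sense identifier, synset identifier) pairs; that layout, the function name and the toy identifiers are assumptions for illustration only.

```python
from collections import defaultdict


def editing_candidates(memberships, min_synonyms=10):
    """Return (large synsets, senses that occur in more than one synset)."""
    synset_members = defaultdict(set)
    sense_synsets = defaultdict(set)
    for sense_id, synset_id in memberships:
        synset_members[synset_id].add(sense_id)
        sense_synsets[sense_id].add(synset_id)
    large = sorted(s for s, members in synset_members.items()
                   if len(members) >= min_synonyms)
    repeated = {sense: sorted(synsets)
                for sense, synsets in sense_synsets.items() if len(synsets) > 1}
    return large, repeated


# Hypothetical toy data; a threshold of 3 keeps the example small.
memberships = [("a-1", "s1"), ("b-1", "s1"), ("c-1", "s1"),
               ("a-1", "s2"), ("d-1", "s2")]
large, repeated = editing_candidates(memberships, min_synonyms=3)
print(large, repeated)
```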

4 Overview and statistics

In this section, we provide an overview of Open Dutch WordNet in terms of general statistics, the format it is delivered in, an evaluation and a Python module that allows users to interact with the resource.

Open Dutch WordNet contains 117,914 synsets, the majority of which (98,049) are noun synsets. There are 18,782 verb synsets and 1,083 adjectival synsets. 51,588 synsets contain at least one Dutch synonym, which leaves 66,326 synsets still to obtain a synonym. The resource contains 92,295 synonyms, of which 75,173 are nouns, 15,979 are verbs and 1,143 are adjectives. The average polysemy is 1.5. 19,996 relations were added to the WordNet hierarchy.
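As a quick consistency check, the reported counts can be added up. The snippet only re-uses the figures given in the text; the variable names are ours.

```python
# Counts as reported in the general statistics above.
synsets_by_pos = {"noun": 98_049, "verb": 18_782, "adjective": 1_083}
synonyms_by_pos = {"noun": 75_173, "verb": 15_979, "adjective": 1_143}
synsets_with_dutch = 51_588

total_synsets = sum(synsets_by_pos.values())
total_synonyms = sum(synonyms_by_pos.values())
synsets_without_dutch = total_synsets - synsets_with_dutch
coverage = synsets_with_dutch / total_synsets

print(total_synsets, total_synonyms, synsets_without_dutch)
print(f"share of synsets with a Dutch synonym: {coverage:.1%}")
```

The per-category counts add up to the reported totals, and a little under half of all synsets currently have a Dutch synonym.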

4.1 Format

Open Dutch WordNet is stored in a type of XML called Global WordNet Grid LMF (https://github.com/globalwordnet/schemas), which is an adaptation of Wordnet-LMF (Vossen et al., 2012). The XML contains two main elements: LexicalEntry and Synset. LexicalEntry elements contain information about a specific synonym, whereas Synset elements contain information about synsets. A simplified example of a LexicalEntry element can be found in figure 2.

    <LexicalEntry id="ondernemer-n-1" partOfSpeech="noun">
        <Lemma writtenForm="ondernemer"/>
        <Sense id="r_n-25922"
               senseId="1"
               definition="iemand met eigen bedrijf"
               synset="eng-30-10060352-n"
               provenance="cdb2.2_Auto+wiktionary+google"
               annotator=""/>
    </LexicalEntry>

Figure 2: A simplified example of a LexicalEntry element is shown.

In figure 2, an example of a LexicalEntry element is shown. The attributes id and partOfSpeech of the LexicalEntry element indicate the identifier and the part of speech, respectively. In this example, the identifier is ondernemer-n-1, which refers to the first noun sense of the Dutch translation of entrepreneur in the sense of "someone who organizes a business venture and assumes the risk for it". The attribute writtenForm of the element Lemma indicates the lemma. Following the structure of Cornetto, the LexicalEntry structure represents a lexical unit and not a form unit. The motivation for this is that form properties can differ from one meaning of a lemma to another. The same form can thus appear in multiple LexicalEntry elements.

Finally, the Sense element contains six attributes:

1. senseId refers to the synonym sense number.

2. id stores the synonym sense identifier. If the identifier starts with r, the synonym originates from RBN. In this case, more information about the synonym can be found in RBN. In all other cases, this is not available.

3. definition presents the definition for the sense.

4. synset points to the synset to which this synonym belongs.

5. provenance shows which resources proposed this particular synonym for this particular synset, concatenated by '+'.

6. annotator shows the name of an annotator and marks that the synonym has been checked manually. The default value is an empty string. Currently, 6,370 LexicalEntry elements have been checked manually.
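The attributes described above can be read with the Python standard library. The sketch below is independent of the project's own Python module; the XML string is our well-formed rendering of the simplified example in figure 2, and the function name and the keys of the returned dictionary are our own.

```python
import xml.etree.ElementTree as ET

# The simplified LexicalEntry of figure 2, written as well-formed XML.
ENTRY_XML = """
<LexicalEntry id="ondernemer-n-1" partOfSpeech="noun">
  <Lemma writtenForm="ondernemer"/>
  <Sense id="r_n-25922" senseId="1"
         definition="iemand met eigen bedrijf"
         synset="eng-30-10060352-n"
         provenance="cdb2.2_Auto+wiktionary+google"
         annotator=""/>
</LexicalEntry>
"""


def describe_entry(xml_text):
    """Extract the main properties of one LexicalEntry element."""
    entry = ET.fromstring(xml_text)
    sense = entry.find("Sense")
    return {
        "lemma": entry.find("Lemma").get("writtenForm"),
        "pos": entry.get("partOfSpeech"),
        "synset": sense.get("synset"),
        # provenance lists every resource that proposed the synonym
        "provenance": sense.get("provenance").split("+"),
        # identifiers starting with "r" indicate an RBN origin
        "from_rbn": sense.get("id").startswith("r"),
        # a non-empty annotator marks a manually checked synonym
        "checked_manually": sense.get("annotator", "") != "",
    }


info = describe_entry(ENTRY_XML)
print(info)
```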

The LexicalEntry used in figure 2 belongs to the synset "eng-30-10060352-n". Figure 3 presents a simplified example of that Synset element.

    <Synset id="eng-30-10060352-n" ili="i89775">
        <Definitions>
            <Definition gloss="iemand met eigen bedrijf"
                        language="nl"
                        provenance="odwn"/>
            <Definition gloss="someone who organizes a business venture and assumes the risk for it"
                        language="en"
                        provenance="pwn"/>
        </Definitions>
        <SynsetRelations>
            <SynsetRelation provenance="pwn"
                            relType="has_hyperonym"
                            target="eng-30-09882716-n"/>
            <SynsetRelation provenance="odwn"
                            relType="role_agent"
                            target="eng-30-01651293-v"/>
        </SynsetRelations>
    </Synset>

Figure 3: A simplified example of a Synset element is shown.

In figure 3, a simplified example of a Synset element is shown. The Synset attributes id and ili provide the synset identifier and the interlingual index identifier (http://data.lider-project.eu/ili), respectively.

The elements Definitions/Definition provide information about the gloss, language and provenance of the definitions. Finally, the element SynsetRelations/SynsetRelation stores the information about the relations between synsets. Again, the provenance attribute is used to mark whether the relation originates from PWN or from Cornetto.
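A Synset element can be inspected in the same way. The sketch below groups the relations of the example synset by their provenance and looks up a gloss by language; the XML string is our well-formed rendering of the simplified example in figure 3 and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The simplified Synset of figure 3, written as well-formed XML.
SYNSET_XML = """
<Synset id="eng-30-10060352-n" ili="i89775">
  <Definitions>
    <Definition gloss="iemand met eigen bedrijf" language="nl" provenance="odwn"/>
    <Definition gloss="someone who organizes a business venture and assumes the risk for it"
                language="en" provenance="pwn"/>
  </Definitions>
  <SynsetRelations>
    <SynsetRelation provenance="pwn" relType="has_hyperonym" target="eng-30-09882716-n"/>
    <SynsetRelation provenance="odwn" relType="role_agent" target="eng-30-01651293-v"/>
  </SynsetRelations>
</Synset>
"""


def relations_by_provenance(xml_text):
    """Map each provenance value to its (relation type, target) pairs."""
    synset = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for relation in synset.iter("SynsetRelation"):
        grouped[relation.get("provenance")].append(
            (relation.get("relType"), relation.get("target")))
    return dict(grouped)


def gloss(xml_text, language):
    """Return the gloss in the requested language, or None."""
    synset = ET.fromstring(xml_text)
    for definition in synset.iter("Definition"):
        if definition.get("language") == language:
            return definition.get("gloss")
    return None


relations = relations_by_provenance(SYNSET_XML)
print(relations)
print(gloss(SYNSET_XML, "nl"))
```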

4.2 Analysis Lexical Entries

Open Dutch WordNet contains 92,295 synonyms originating from various resources. Table 2 presents information about the number of synonyms from each resource.

Provenance       # instances   % of all LE
cdb2.2_Auto      32,806        35.5
cdb2.2_None      19,073        20.7
wiktionary       17,968        19.5
cdb2.2_Manual    13,075        14.2
omegawiki        12,589        13.6
google            8,374         9.1
opus                612         0.7
bing                506         0.5
wikipedia           375         0.4

Table 2: The number of synonyms from each resource is shown. In addition, the second column indicates what percentage this number is relative to all synonyms in Open Dutch WordNet.

Note that the same synonym can be proposed by multiple resources, which is why the sum of all numbers is higher than the total number of synonyms. The vast majority of synonyms originate from the ESRs (prefixed by cdb2.2) between Cornetto synsets and WordNet synsets.
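The percentages in table 2 and the remark about overlapping proposals can be checked with a few lines of arithmetic. The figures are those of the table; the variable names are ours.

```python
TOTAL_SYNONYMS = 92_295

# Number of synonyms per provenance tag, as listed in table 2.
instances = {
    "cdb2.2_Auto": 32_806,
    "cdb2.2_None": 19_073,
    "wiktionary": 17_968,
    "cdb2.2_Manual": 13_075,
    "omegawiki": 12_589,
    "google": 8_374,
    "opus": 612,
    "bing": 506,
    "wikipedia": 375,
}

percentages = {tag: round(100 * n / TOTAL_SYNONYMS, 1) for tag, n in instances.items()}
proposals = sum(instances.values())

print(percentages)
print(proposals, proposals - TOTAL_SYNONYMS)
```

The per-resource counts sum to more than the number of synonyms, which is consistent with a synonym being proposed by several resources at once.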

In order to evaluate the quality of each resource for the creation of Open Dutch WordNet, we evaluated 50 randomly selected monosemous and polysemous instances per resource. The results can be found in table 3.

Provenance       m      p
Google           0.84   N/A
Wiktionary       0.86   0.68
Wikipedia        0.88   0.62
Omegawiki        0.90   0.86
Cdb2.2_Manual    0.88   0.74
Cdb2.2_Auto      0.80   0.80
Cdb2.2_None      0.96   0.78

Table 3: The evaluation results of 50 randomly selected monosemous (m) and polysemous (p) instances per resource are shown.

Table 3 shows that the overall precision of the resource is high, as far as the quality of a synonym that bears a certain provenance is concerned. What it does not show is a fair comparison of the quality of each resource, because not exactly the same strategy was used to extract information from each resource. For example, only monosemous words were used from the output from Google. Overall, we observe that 87% of the proposed monosemous synonyms were correct in the evaluation, whereas this was 76% for the polysemous synonyms. The most valuable external resource for Open Dutch WordNet seems to be Omegawiki, which is not only present in 13.6% of the LexicalEntry elements, but also performed well in the evaluation. For comparison, Sevens et al. (2014) performed an independent evaluation of the equivalence relations in Cornetto and reported a precision of 52.18% for a sample based on all synsets and 88.94% for a subset that was likely to have manually created links. Although it is difficult to compare both samples for evaluation, the precision for Open Dutch WordNet is thus very much in line with the precision of Cornetto as reported by them.

4.3 Depth Distribution

66,326 synsets in Open Dutch WordNet still lack a synonym. We were interested in knowing in which part of the hierarchy these synsets are located. Breadth-first search was used to calculate synset depth. Figure 4 presents the distribution of synsets with and without synonyms per depth layer.

[Figure 4: bar chart. One bar pair per depth layer, 1 to 17. Axes: % of synsets without synonyms and % of synsets with synonyms, each from 0 to 100.]

Figure 4: For each depth layer in Open Dutch WordNet, which ranges from the top level 1 to the deepest layer 17, the percentage of synsets in that layer with and without synonyms is shown.

Figure 4 shows that the top layers have relatively few synsets without synonyms, whereas the opposite is true for the deeper layers. It is likely that these lower-level synsets can be filled easily if bilingual resources extend their coverage. These words usually have a single meaning and only one translation.
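The depth computation and the per-layer percentages can be sketched as follows. This is a minimal illustration on an invented toy hierarchy, assuming that top nodes have depth 1 and that a synset with several hyperonyms receives its shortest depth; the function names are hypothetical.

```python
from collections import Counter, deque


def synset_depths(hyponyms, roots):
    """Breadth-first search from the top nodes; top nodes get depth 1."""
    depth = {root: 1 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in hyponyms.get(node, ()):
            if child not in depth:  # first visit is the shortest path
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def share_with_synonyms(depth, has_dutch):
    """Percentage of synsets per depth layer that have a Dutch synonym."""
    total = Counter(depth.values())
    filled = Counter(d for synset, d in depth.items() if synset in has_dutch)
    return {d: 100.0 * filled[d] / total[d] for d in sorted(total)}


# Hypothetical toy hierarchy: hyperonym -> list of hyponyms.
hyponyms = {"entity": ["object", "abstraction"], "object": ["plant"], "plant": ["tree"]}
depth = synset_depths(hyponyms, roots=["entity"])
shares = share_with_synonyms(depth, has_dutch={"entity", "object", "plant"})
print(depth)
print(shares)
```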

The opposite situation also occurs: we added new synsets to the hierarchy that are not in WordNet. These synsets appear to be spread over all levels of the hierarchy. It is more difficult to resolve these cases, since searching for possible matches in WordNet that could have been missed can only partially be supported through e.g. gloss comparison, and in the end needs to be verified manually. To support this process, we visualized these concepts in the hierarchy. An example can be found in Figure 5.

Figure 5: In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms and blue nodes are WordNet synsets without Dutch synonyms.

Figure 5 presents an example of a new concept that has been added to the hierarchy. We added the concept of tramhalte (tram stop) as a hyponym of the concept 'stop'. In general, we observed that we mostly added concepts that are represented in Dutch by compounds, such as polderlandschap (flat barren landscape).

4.4 Python module

A Python module has been created to use Open Dutch WordNet. The module can be found at https://github.com/MartenPostma/OpenDutchWordnet. It was written for Python 3.4. The module allows the user to inspect the LexicalEntry and Synset elements and to gather general statistics about the resource. Finally, it is possible to edit the resource using this module.

5 Discussion and future work

In this section, we discuss the process of creating Open Dutch WordNet, as well as future work to further improve the resource.

A part of Open Dutch WordNet consists of synonyms that originate from the inter-language links in external resources such as Omegawiki, Wiktionary and Wikipedia. It is interesting to observe that we obtained mostly noun synonyms from these resources. There are two main reasons why this is the case. Firstly, nouns simply have more entries in these resources. In addition, verbs are more difficult to disambiguate than nouns. In order to get a better understanding of where we added Dutch noun synonyms, we visualized the noun hyperonym hierarchy, which can be found in Figure 6.

Figure 6: This figure visualizes the noun hyperonym hierarchy in ODWN. The black center node represents the top noun node ('entity'). In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms and blue nodes are WordNet synsets without Dutch synonyms.

In Figure 6, the noun hyperonym hierarchy is visualized, focusing on which synsets contain a Dutch synonym. The lower left side shows a large blue spot, which means that no Dutch synonyms are located in that part of the hierarchy. We identified the synset genus ('taxonomic group containing one or more species') as the main hyperonym of this part. In addition, we observe pink nodes around the top node, which we identified as religious terms such as Heer (Lord) and Jaweh (Jaweh).

We strive to improve both the quality and the quantity of the resource. The quality will be improved by manually inspecting the synsets that contain 5 to 10 synonyms. The quantity will be improved by adding synonyms in the deeper parts of the resource. This can be done by using more or improved public bilingual resources, both English-Dutch and combinations of more languages, or by using parallel corpora. In addition, we plan to assess the most important parts of the hierarchy. This involves the top nodes of the hierarchies and the base concepts. Errors in these synsets are likely to propagate to other synsets in lower parts of the hierarchy. Finally, the relations imported from Cornetto are now added to the PWN relations. As a result, we obtained 115,077 hyperonym relations from PWN and 19,996 hyperonym relations from Cornetto. Additional hyperonym relations result in tangled hierarchies with more complex semantics. Whereas PWN has 559 top nodes for verbs, ODWN has 154 tops. The reduction of the tops is due to the additional relations that were created in Cornetto to provide more structure to the verb hierarchy. In Cornetto, there are only two top nodes for the verb hierarchy.
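Why additional hyperonym relations reduce the number of top nodes can be seen in a small sketch: a top node is a synset that never occurs as the hyponym in a hyperonym relation, so every added relation that gives a former top a parent removes it from the list. The verbs and edges below are invented for illustration and the function name is hypothetical.

```python
def top_nodes(synsets, hyperonym_edges):
    """Synsets that never occur as the hyponym in a (hyponym, hyperonym) pair."""
    has_parent = {hyponym for hyponym, _hyperonym in hyperonym_edges}
    return sorted(s for s in synsets if s not in has_parent)


# Hypothetical toy verb hierarchy.
verbs = ["move", "walk", "run", "think", "ponder"]
base_edges = [("run", "walk"), ("ponder", "think")]
added_edges = [("walk", "move")]

tops_before = top_nodes(verbs, base_edges)
tops_after = top_nodes(verbs, base_edges + added_edges)
print(tops_before, tops_after)

# Total number of hyperonym relations reported above.
total_hyperonym_relations = 115_077 + 19_996
print(total_hyperonym_relations)
```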

Open Dutch WordNet currently contains a limited number of monosemous adjectives. We hope to be able to map the polysemous adjective synsets to PWN synsets by translating the Dutch glosses and by making use of the synset relations in Cornetto and Princeton WordNet. Because Dutch is very close to German, another possibility is to map the Cornetto synsets to GermaNet (Hamp and Feldweg, 1997) and make use of the rich set of synset relations that it provides.

Finally, the current format of the resource is XML. We would also like to make the resource available in RDF (Klyne and Carroll, 2006).

6 Conclusion

We described Open Dutch WordNet, which is derived from the Cornetto database, Princeton WordNet and various external resources. We exploited existing equivalence relations between Cornetto synsets and WordNet synsets in order to replace WordNet synonyms by Dutch synonyms. In addition, the inter-language links in various external resources such as Wiktionary and Omegawiki were used to add synonyms to the resource. We also manually evaluated each resource and manually edited the most problematic synsets. The Princeton-based hierarchy was also extended with manually created relations that came from Cornetto.

Open Dutch WordNet contains 92,295 synonyms, which are located in 51,588 synsets. There are 75,173 nouns, 15,979 verbs and 1,143 adjectives. In total, the resource consists of 117,914 synsets, which leaves 66,326 synsets still to obtain a synonym. The average polysemy is 1.5.

The resource is currently delivered in XML under the CC BY-SA 4.0 license.⁵ In order to use and improve the resource, a Python module has been created. This module can be found at https://github.com/MartenPostma/OpenDutchWordnet.

Acknowledgments

This project has been co-funded by the Nederlandse Taalunie (http://taalunie.org/). In addition, we thank Anne Broekhuis, Anja Stoop, Marjolein Klaassen and Amber Witsenburg for their work on evaluating the ESRs manually. Moreover, we thank Isa Maks and Hennie van der Vliet for their valuable input. Finally, we would like to thank Adam Rambousek (http://www.muni.cz/fi/people/60380) for his help in creating and updating the DebVisDic editor.

5 https://creativecommons.org/licenses/by-sa/4.0/

References

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Wikimedia Foundation. 2014a. Wikipedia. http://en.wikipedia.org.

Wikimedia Foundation. 2014b. Wiktionary. http://en.wiktionary.org.

Google. 2014. Google Translate. https://translate.google.nl.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a lexical-semantic net for German. In Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15.

Graham Klyne and Jeremy J. Carroll. 2006. Resource description framework (RDF): Concepts and abstract syntax.

Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. WordNet: An electronic lexical database, 49(2):265–283.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Antoni Oliver. 2014. WN-Toolkit: Automatic generation of wordnets following the expand model. Proceedings of the 7th Global WordNet Conference, Tartu, Estonia.

Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Leen Sevens, Vincent Vandeghinste, and Frank Van Eynde. 2014. Improving the precision of synset links between Cornetto and Princeton WordNet. Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing, Coling 2014, Dublin, Ireland, pages 120–126.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, pages 2214–2218.

Hennie Van der Vliet. 2007. The Referentiebestand Nederlands as a multi-purpose lexical database. International Journal of Lexicography, 20(3):239–257.

P. Vossen, I. Maks, R. Segers, H. van der Vliet, and H. van Zutphen. 2008. The Cornetto database: the architecture and alignment issues. In Proceedings of the Fourth International GlobalWordNet Conference - GWC 2008, pages 22–25.

Piek Vossen, Claudia Soria, and Monica Monachini. 2012. Wordnet-LMF: a standard representation for multilingual wordnets. In G. Francopoulo, editor, LMF: Lexical Markup Framework, theory and practice, pages 51–66. Hermes / Lavoisier / ISTE.

Piek Vossen, Isa Maks, Roxane Segers, Hennie van der Vliet, Marie-Francine Moens, Katja Hofmann, Erik Tjong Kim Sang, and Maarten de Rijke. 2013. Cornetto: a Combinatorial Lexical Semantic Database for Dutch. In Jan Odijk, Peter Spyns, editor, Essential Speech and Language Technology for Dutch, Theory and Applications of Natural Language Processing, pages 165–184. Springer.

Piek Vossen. 1999. EuroWordNet General Document, version 3, final. University of Amsterdam, EuroWordNet LE2-4003, LE4-8328.

Wikipedia. 2014. Plagiarism - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Plagiarism&oldid=5139350.


morpho-syntactic, semantic and pragmatic properties of lexical units, with a focus on the combinatorics. The Cornetto database thus provides the semantic organization of a wordnet and the details on each synonym in a synset, as can be found in lexical-unit-based lexicons. An important characteristic of Cornetto is that it has been developed independently from Princeton WordNet (PWN). The synsets in Cornetto were then mapped to synsets in PWN following a merge approach (Vossen, 1999). First, all possible equivalence relations were created between synonyms in synsets using bilingual dictionaries, after which the mappings were ranked on the basis of shared properties, e.g. hyperonyms and hyponyms already linked manually, similar domain labels, and synset membership of multiple translations (Vossen et al., 2008). The Van Dale publisher, however, decided to stop all collaborations with the research community. This motivated us to develop Open Dutch WordNet, for which we wanted to keep as much as possible the concepts and word meanings that are defined independently of PWN. This implies that we cannot simply follow an expand approach to translate English synonyms in PWN to Dutch words, but we need to also match PWN synsets to RBN lexical units.

Figure 1 introduces the main components of the Dutch lexical semantic database Cornetto.

[Figure 1 (diagram): the Cornetto synset {palmboom1, palm1} has a HAS_HYPERONYM relation to the Cornetto synset {boom1} and an EQ_SYNONYM relation to the PWN synset {palm3, palm tree1}. A lexical unit is shown as a <cdb_lu> element with parts such as <form form-cat="noun" form-spelling="palm">, <morphology_noun>, <syntax_noun>, <semantics_noun>, <examples>, <sem-definition> and <sem-synonyms>.]

Figure 1: The most important components of Cornetto are visualized. The ellipses in red are examples of Cornetto synsets, which contain Lexical Units (LU). Each LU can contain rich information about its morphology, syntax and semantics. Cornetto synsets can have Internal Semantic Relations (ISRs) to other Cornetto synsets (e.g. HAS_HYPERONYM), but also Equivalence Semantic Relations (ESRs) to PWN synsets (e.g. EQ_SYNONYM).

Figure 1 visualizes the most important components of Cornetto. Cornetto synsets, or Cornetto sets of synonyms, are shown in red. The synonyms inside the Cornetto synsets are called Lexical Units (LU), because they can contain rich information about their morphology, syntax and semantics, especially if these LUs originate from RBN. Synonyms that originate from the Van Dale database only have part-of-speech information. Cornetto synsets can have Internal Semantic Relations (ISRs) to other Cornetto synsets (e.g. HAS_HYPERONYM), but also Equivalence Semantic Relations (ESRs) to PWN synsets (e.g. EQ_SYNONYM). ESRs are mainly used to define synonymy or near synonymy between Cornetto synsets and PWN synsets. Most ISR relations originate from the Van Dale database. A small set of relations was added manually in the various projects. All synonyms and relations have provenance tags, which enable us to trace data from Van Dale and data that can be transferred to the Open Dutch WordNet.

Table 1 presents the provenance statistics for the most important components of the database.

Component   Van Dale   RBN   Cornetto
LU          60         57    15
S           70         1     0
ISR         77         0     33
ESR         0          0     82

Table 1: The provenance information for Lexical Units (LU), Synsets (S), Internal Semantic Relations (ISR) and Equivalence Semantic Relations (ESR) is shown for each of the three sources: Van Dale, RBN and Cornetto (if the source is Cornetto, this means that the data was created manually in the Cornetto project and does not originate from Van Dale).

Table 1 clearly shows that a large part of the LUs, synsets and ISRs originate from Van Dale. The removal of this licensed content creates large gaps in the resource. The main goal is hence to use open source resources to replace the licensed content with open source content as much as possible. One of the most promising components for transferring information from Cornetto into Open Dutch WordNet is the set of ESRs, which were created semi-automatically during the EuroWordNet and Cornetto projects and are 100% open source.

3 Methodology

We used the following procedure to create Open Dutch WordNet.

We use English WordNet 3.0 (PWN) (Miller, 1995; Fellbaum, 1998) as our basis for the concept structure. This means that we copied the PWN synsets and relations to ODWN and ignored all synsets and relations from Van Dale. The next step is to transfer the LUs from RBN to the PWN-based synsets.

Before copying these LUs, we improved the quality of the ESRs. We defined a set of ESRs that are either likely to be more difficult or that play an important role in the transfer. This subset was checked manually and was also used as training data to filter the remaining ESRs using a decision tree algorithm. This process is described in subsection 3.1.

Subsequently, we make use of the ESRs between Cornetto synsets and WordNet synsets to copy the LUs that do not originate from Van Dale from a Cornetto synset into a WordNet synset, which is described in subsection 3.2.

The transfer still leaves us with many synsets from PWN without a Dutch LU. We therefore use open source resources to translate the WordNet synonyms into Dutch, which is described in subsections 3.3 and 3.4, respectively. This results on the one hand in more synsets having Dutch synonyms, and on the other hand in further evidence that transferred synonyms are correct, because other sources confirm them.

Finally, we manually checked 8,257 Dutch synonyms, which is described in subsection 3.5.

3.1 Revision of equivalence relations

Firstly, we manually filtered the ESRs, focusing on the synonymy relations. Each ESR links a Cornetto synset to a WordNet synset with a certain relation type. The mapping is many-to-many. We considered three main aspects of Cornetto synsets in deciding whether to manually check an ESR: the synset depth, the number of children, and the number of ESRs. We decided to manually check the deepest and shallowest synsets, because these relations got little attention in previous projects. In addition, we checked the synsets with most children, because they play an important role in a wordnet. Finally, the Cornetto synsets with most ESRs were checked, because we suspect that the equivalence relation is complex and likely to contain many wrong mappings. Four students manually checked 12,966 of the total 82,285 ESRs, of which 6,575 were removed.

The manually revised relations were used to train a pruned C4.5 decision tree algorithm (Quinlan, 1993; Hall et al., 2009) that was used to filter the remaining ESRs. An ESR consists of an equivalence relation between a Cornetto synset and a WordNet synset. We used properties of the Cornetto synset and the WordNet synset, as well as of the synset relation itself, as features:

1. the number of equivalence relations in which a Cornetto synset and a WordNet synset are present.

2. the depth of the Cornetto synset and the WordNet synset. The difference in depth is also used.

3. Because a Cornetto synset can be present in multiple ESRs to WordNet synsets and vice versa, we average the semantic similarity scores (using the Leacock & Chodorow similarity measure (Leacock and Chodorow, 1998)) of all combinations of these ESRs.

Interestingly enough, the features in which Cornetto properties were used yielded the best results. This might be caused by the fact that the relations were also generated using Cornetto. The filtering of the ESRs using the decision tree algorithm resulted in an additional removal of 32,258 ESRs.
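The third feature can be illustrated with a small sketch. It uses the standard Leacock & Chodorow formula, -log(len / (2 * D)), where len is the shortest path length between two synsets and D the maximum depth of the taxonomy. Which pairs are combined is not spelled out in the text, so the sketch assumes that the WordNet targets linked to one Cornetto synset are compared pairwise; the path lengths, the maximum depth of 20 and all function names are hypothetical.

```python
import math
from itertools import combinations


def lch_similarity(path_length, max_depth):
    """Leacock & Chodorow similarity: -log(path_length / (2 * max_depth))."""
    if path_length < 1 or max_depth < 1:
        raise ValueError("path_length and max_depth must be >= 1")
    return -math.log(path_length / (2.0 * max_depth))


def average_similarity(targets, path_length, max_depth):
    """Average the similarity over all unordered pairs of target synsets.

    path_length is a caller-supplied function (a, b) -> int.
    Returns 0.0 when there are fewer than two targets (an assumption).
    """
    scores = [lch_similarity(path_length(a, b), max_depth)
              for a, b in combinations(targets, 2)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical path lengths (counted in nodes) between three PWN targets.
TOY_PATHS = {
    frozenset(("w1", "w2")): 2,
    frozenset(("w1", "w3")): 4,
    frozenset(("w2", "w3")): 4,
}


def toy_path_length(a, b):
    return TOY_PATHS[frozenset((a, b))]


feature = average_similarity(["w1", "w2", "w3"], toy_path_length, max_depth=20)
```

In a real setting the path lengths would come from the wordnet hierarchy itself; the resulting average would then be one column in the feature vector given to the decision tree.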

3.2 Cornetto synonyms

When there exists an ESR between a Cornetto synset and a WordNet synset, and the relation type is either EQ_SYNONYM or EQ_NEAR_SYNONYM, all LUs that do not originate from Van Dale are inserted into the WordNet synset. Using Figure 1 as an example, the LUs palmboom1 and palm1 would replace palm tree1 and palm3. If the ESR was checked manually, the provenance tag is cdb22_Manual. If the ESR was checked using the decision tree algorithm, the provenance tag is cdb22_Auto. The provenance tag cdb22_None is given to all other strategies that were used to add LUs to Open Dutch WordNet. One of the most dominant strategies of this class applies when a LU in a Cornetto synset does not have a direct ESR (no ESR, or one of type EQ_HAS_HYPERONYM) to a WordNet synset, but the parent of the Cornetto synset does have an ESR to a WordNet synset. In that case, a new synset (not represented in WordNet) is created as a hyponym of the target of the ESR of the hyperonym. Finally, the ESRs are used to insert Cornetto synset relations into Open Dutch WordNet that do not originate from Van Dale, but were created manually in one of the projects.
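The transfer step for the two checked cases can be sketched as follows. This is a minimal illustration, not the project's code: the tuple layout of an ESR, the source labels ("rbn", "vandale", "cornetto") and the synset identifiers are all hypothetical, and the strategies tagged cdb22_None (such as creating new hyponym synsets) are deliberately left out.

```python
from collections import defaultdict

SYNONYMY_TYPES = {"EQ_SYNONYM", "EQ_NEAR_SYNONYM"}
PROVENANCE = {"manual": "cdb22_Manual", "auto": "cdb22_Auto"}


def transfer_lus(esrs, cornetto_synsets):
    """Copy non-Van Dale LUs from Cornetto synsets into PWN synsets.

    esrs: iterable of (cornetto_id, pwn_id, rel_type, check), where check
          is "manual" or "auto" (how the ESR was validated).
    cornetto_synsets: dict cornetto_id -> list of (lu, source).
    Returns a dict pwn_id -> list of (lu, provenance_tag).
    """
    odwn = defaultdict(list)
    for cornetto_id, pwn_id, rel_type, check in esrs:
        # Only synonymy and near-synonymy relations trigger a transfer.
        if rel_type not in SYNONYMY_TYPES or check not in PROVENANCE:
            continue
        for lu, source in cornetto_synsets.get(cornetto_id, []):
            if source != "vandale":  # licensed content is never copied
                odwn[pwn_id].append((lu, PROVENANCE[check]))
    return dict(odwn)


# Hypothetical toy data.
toy_esrs = [
    ("c1", "p1", "EQ_SYNONYM", "manual"),
    ("c1", "p2", "EQ_HAS_HYPERONYM", "auto"),
    ("c2", "p3", "EQ_NEAR_SYNONYM", "auto"),
]
toy_cornetto = {
    "c1": [("palmboom", "rbn"), ("palm", "vandale")],
    "c2": [("boom", "cornetto")],
}
transferred = transfer_lus(toy_esrs, toy_cornetto)
```

Because one Cornetto synset can have several ESRs, the same LU may end up in more than one PWN synset, which is what the manual editing step later has to resolve.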

3.3 External resources

Using various external open source resources, such as Wiktionary (Wikimedia Foundation, 2014b), Omegawiki⁴ and Google (Google, 2014), Oliver (2014) translated both monosemous and polysemous lemmas into Dutch for the parts of speech noun, verb and adjective. For the monosemous lemmas, the English lemmas are simply translated into Dutch. For the polysemous lemmas, the gloss overlap between examples in an external resource and the possible WordNet synsets for a lemma is used to determine the correct synset for a lemma. We used a similar procedure to add synonyms from Wikipedia (Wikipedia, 2014; Wikimedia Foundation, 2014a).

3.4 Adjectives extended

We created a mapping for two kinds of adjectives: monosemous adjectives that have only one sense in WordNet, and 'slightly polysemous adjectives' that have exactly one adjectival sense and one nominal sense. Adjectives of the latter kind are typically nationalities (Cameroonian), religious denominations (Buddhist), and words like purebred. To create the mapping, we translated the English word forms using Google Translate and Bing Translate. We also used the word alignments from the OPUS project (Tiedemann, 2012). These resources provide us with Dutch candidate word forms that should correspond to the original WordNet synonyms in synsets. We then checked for each word form how many senses are associated with it in RBN. If there is only one (and the word is indeed an adjective), we conclude that this Dutch sense corresponds with the original WordNet synset.

4 http://www.omegawiki.org

One problem with the translation-based approach is that Dutch adjectives are sometimes inflected with the suffix -e. For example, the English ontological is automatically translated by Google to ontologische. In RBN, all word forms are stored without the inflectional ending, which means that the translation does not match the lemma. To solve this issue, in the cases where we could not find a direct match, we applied an automatic stemming rule to remove the suffix and tried to find a match using the stem.
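The fallback described above can be sketched in a few lines. The function name and the small lemma set standing in for RBN are hypothetical; the rule itself is only the plain removal of a final -e, as the text describes.

```python
def match_rbn_lemma(word_form, rbn_lemmas):
    """Return the RBN lemma matching a translated word form, or None.

    A direct match is tried first; only if that fails is the
    inflectional suffix -e stripped and the stem looked up.
    """
    if word_form in rbn_lemmas:
        return word_form
    if word_form.endswith("e") and word_form[:-1] in rbn_lemmas:
        return word_form[:-1]
    return None


# Hypothetical stand-in for the set of adjective lemmas in RBN.
toy_rbn_lemmas = {"ontologisch", "rood"}

direct = match_rbn_lemma("ontologisch", toy_rbn_lemmas)
stemmed = match_rbn_lemma("ontologische", toy_rbn_lemmas)
missed = match_rbn_lemma("rode", toy_rbn_lemmas)
```

The last call shows a limitation of this simple sketch: an inflected form whose spelling changes (rode for rood) is not recovered by stripping the suffix alone.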

3.5 Manual editing

Finally, we checked the resulting Dutch wordnet manually. We focused on two main editing tasks. Firstly, we inspected all synsets that had 10 or more synonyms, since excessively large synsets may contain false synonyms. In addition, because one Cornetto synset could have multiple ESRs, it occurred that the same sense was copied into multiple WordNet synsets. This may lead to excessive polysemy. The second task therefore consisted of indicating which WordNet synset was the correct synset for a sense that occurred in more than one WordNet synset. In total, 8,257 LUs were checked in this phase.
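Selecting the candidates for these two editing tasks could look roughly like this. The data structure (a mapping from a sense identifier to the synsets it was copied into) and the function name are assumptions made for illustration; only the threshold of 10 comes from the text.

```python
from collections import defaultdict


def editing_candidates(sense_to_synsets, min_size=10):
    """Return (large_synsets, ambiguous_senses) for manual inspection.

    large_synsets: synsets holding at least min_size senses.
    ambiguous_senses: senses that were copied into more than one synset.
    """
    members = defaultdict(set)
    for sense, synsets in sense_to_synsets.items():
        for synset in synsets:
            members[synset].add(sense)
    large = {s for s, senses in members.items() if len(senses) >= min_size}
    ambiguous = {sense for sense, synsets in sense_to_synsets.items()
                 if len(set(synsets)) > 1}
    return large, ambiguous


# Hypothetical toy data: ten senses in one synset, one sense in two synsets.
toy_senses = {"sense-%d" % i: ["synset-big"] for i in range(10)}
toy_senses["sense-x"] = ["synset-a", "synset-b"]

large_synsets, ambiguous_senses = editing_candidates(toy_senses)
```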

4 Overview and statistics

In this section, we provide an overview of Open Dutch Wordnet in terms of general statistics, the format it is delivered in, evaluation, and a Python module that allows users to interact with the resource.

Open Dutch Wordnet contains 117,914 synsets, of which the majority are noun synsets (98,049). There are 18,782 verb synsets and 1,083 adjectival synsets. 51,588 synsets contain at least one Dutch synonym, which leaves 66,326 synsets still to obtain a synonym. The resource contains 92,295 synonyms, of which 75,173 are nouns, 15,979 are verbs and 1,143 are adjectives. The average polysemy is 1.5. 19,996 relations were added to the WordNet hierarchy.
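The totals above can be cross-checked with simple arithmetic. The snippet below only restates the reported counts; the variable names are ours.

```python
# Counts as reported in the text.
synsets_by_pos = {"noun": 98049, "verb": 18782, "adjective": 1083}
synonyms_by_pos = {"noun": 75173, "verb": 15979, "adjective": 1143}
synsets_with_dutch_synonym = 51588

total_synsets = sum(synsets_by_pos.values())
total_synonyms = sum(synonyms_by_pos.values())
synsets_without_synonym = total_synsets - synsets_with_dutch_synonym

print(total_synsets, total_synonyms, synsets_without_synonym)
```

The per-part-of-speech counts add up to the reported 117,914 synsets and 92,295 synonyms, and the difference between all synsets and the filled synsets gives the 66,326 synsets that still lack a Dutch synonym.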

4.1 Format

Open Dutch WordNet is stored in a type of XML called Global WordNet Grid LMF (https://github.com/globalwordnet/schemas), which is an adaptation of Wordnet-LMF (Vossen et al., 2012). The XML contains two main elements: LexicalEntry and Synset. LexicalEntry elements contain information about a specific synonym, whereas Synset elements contain information about synsets. A simplified example of a LexicalEntry element can be found in Figure 2.

<LexicalEntry id="ondernemer-n-1" partOfSpeech="noun">
  <Lemma writtenForm="ondernemer"/>
  <Sense id="r_n-25922"
         senseId="1"
         definition="iemand met eigen bedrijf"
         synset="eng-30-10060352-n"
         provenance="cdb22_Auto+wiktionary+google"
         annotator=""/>
</LexicalEntry>

Figure 2: A simplified example of a LexicalEntry element is shown.

In Figure 2, an example of a LexicalEntry element is shown. The attributes id and partOfSpeech of the LexicalEntry element indicate the identifier and the part of speech, respectively. In this example, the identifier is ondernemer-n-1, which refers to the first noun sense of the Dutch translation of entrepreneur in the sense of "someone who organizes a business venture and assumes the risk for it". The attribute writtenForm of the element Lemma indicates the lemma. Following the structure of Cornetto, the LexicalEntry structure represents a lexical unit and not a form unit. The motivation for this is that form properties can differ from one meaning to another for a lemma. The same form can thus appear in multiple LexicalEntry elements.

Finally, the Sense element contains six attributes:

1. senseId refers to the synonym sense number.

2. id stores the synonym sense identifier. If the identifier starts with r, the synonym originates from RBN. In this case, more information about the synonym can be found in RBN. In all other cases, this is not available.

3. definition presents the definition for the sense.

4. synset points to the synset to which this synonym belongs.

5. provenance shows, concatenated by '+', which resources proposed this particular synonym for this particular synset.

6. annotator shows the name of an annotator and marks that the synonym has been checked manually. The default value is an empty string. Currently, 6,370 LexicalEntry elements have been checked manually.
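To make the attribute list concrete, the following sketch reads the simplified entry of Figure 2 with Python's standard xml.etree.ElementTree. It does not use the project's own Python module, whose interface is not described here; the function name, the returned dictionary layout, and the exact quoting and self-closing tags in the XML string are our reconstruction.

```python
import xml.etree.ElementTree as ET

ENTRY_XML = """
<LexicalEntry id="ondernemer-n-1" partOfSpeech="noun">
  <Lemma writtenForm="ondernemer"/>
  <Sense id="r_n-25922" senseId="1"
         definition="iemand met eigen bedrijf"
         synset="eng-30-10060352-n"
         provenance="cdb22_Auto+wiktionary+google"
         annotator=""/>
</LexicalEntry>
"""


def read_entry(xml_text):
    """Extract the main fields of one LexicalEntry element."""
    entry = ET.fromstring(xml_text)
    sense = entry.find("Sense")
    return {
        "id": entry.get("id"),
        "pos": entry.get("partOfSpeech"),
        "lemma": entry.find("Lemma").get("writtenForm"),
        "synset": sense.get("synset"),
        # provenance values are concatenated with '+'
        "provenance": sense.get("provenance").split("+"),
        # identifiers starting with r point to RBN
        "from_rbn": sense.get("id").startswith("r"),
        # a non-empty annotator marks a manually checked synonym
        "checked": sense.get("annotator") != "",
    }


entry = read_entry(ENTRY_XML)
```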

The LexicalEntry used in Figure 2 belongs to the synset "eng-30-10060352-n". Figure 3 presents a simplified example of that Synset element.

<Synset id="eng-30-10060352-n" ili="i89775">
  <Definitions>
    <Definition gloss="iemand met eigen bedrijf"
                language="nl"
                provenance="odwn"/>
    <Definition gloss="someone who organizes a business venture and assumes the risk for it"
                language="en"
                provenance="pwn"/>
  </Definitions>
  <SynsetRelations>
    <SynsetRelation provenance="pwn"
                    relType="has_hyperonym"
                    target="eng-30-09882716-n"/>
    <SynsetRelation provenance="odwn"
                    relType="role_agent"
                    target="eng-30-01651293-v"/>
  </SynsetRelations>
</Synset>

Figure 3: A simplified example of a Synset element is shown.

In Figure 3, a simplified example of a Synset element is shown. The Synset attributes id and ili provide the synset identifier and the interlingual index identifier (http://data.lider-project.eu/ili), respectively.

The elements Definitions/Definition provide information about the gloss, language and provenance of the definitions. Finally, the element SynsetRelations/SynsetRelation stores the information about the relations between synsets. Again, the provenance attribute is used to mark whether the relation originates from PWN or from Cornetto.
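As an illustration of how the provenance attribute separates the two kinds of relations, the sketch below groups the relations of the simplified Synset element by provenance. It relies only on the standard library; the function name and the reconstructed XML string (quoting, closing tags) are assumptions.

```python
import xml.etree.ElementTree as ET

SYNSET_XML = """
<Synset id="eng-30-10060352-n" ili="i89775">
  <Definitions>
    <Definition gloss="iemand met eigen bedrijf" language="nl" provenance="odwn"/>
    <Definition gloss="someone who organizes a business venture and assumes the risk for it"
                language="en" provenance="pwn"/>
  </Definitions>
  <SynsetRelations>
    <SynsetRelation provenance="pwn" relType="has_hyperonym" target="eng-30-09882716-n"/>
    <SynsetRelation provenance="odwn" relType="role_agent" target="eng-30-01651293-v"/>
  </SynsetRelations>
</Synset>
"""


def relations_by_provenance(xml_text):
    """Group (relType, target) pairs of one Synset by their provenance."""
    synset = ET.fromstring(xml_text)
    grouped = {}
    for rel in synset.iter("SynsetRelation"):
        grouped.setdefault(rel.get("provenance"), []).append(
            (rel.get("relType"), rel.get("target")))
    return grouped


def glosses_by_language(xml_text):
    """Map each definition language to its gloss."""
    synset = ET.fromstring(xml_text)
    return {d.get("language"): d.get("gloss") for d in synset.iter("Definition")}


relations = relations_by_provenance(SYNSET_XML)
glosses = glosses_by_language(SYNSET_XML)
```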

4.2 Analysis Lexical Entries

Open Dutch WordNet contains 92,295 synonyms originating from various resources. Table 2 presents information about the number of synonyms from each resource.

Note that the same synonym can be proposed by multiple resources, which is why the sum of all numbers in Table 2 is higher than the total number of synonyms. The vast majority of synonyms originate from the ESRs (prefixed by cdb22) between Cornetto synsets and WordNet synsets.

Provenance      # instances   % of all LE
cdb22_Auto      32,806        35.5
cdb22_None      19,073        20.7
wiktionary      17,968        19.5
cdb22_Manual    13,075        14.2
omegawiki       12,589        13.6
google           8,374         9.1
opus               612         0.7
bing               506         0.5
wikipedia          375         0.4

Table 2: The number of synonyms from each resource is shown. In addition, the last column indicates what percentage this number is relative to all synonyms in Open Dutch Wordnet.
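Counts of this kind can be obtained by splitting each provenance value on '+' and counting the parts. The sketch below uses three hypothetical provenance values; it also shows why the per-resource counts add up to more than the number of synonyms.

```python
from collections import Counter


def provenance_counts(provenance_values):
    """Count how often each resource occurs across provenance attributes."""
    counts = Counter()
    for value in provenance_values:
        counts.update(value.split("+"))
    return counts


# Hypothetical provenance attributes of three synonyms.
toy_values = [
    "cdb22_Auto+wiktionary+google",
    "cdb22_Manual",
    "wiktionary+omegawiki",
]
toy_counts = provenance_counts(toy_values)

# Share of synonyms per resource, relative to the number of synonyms.
toy_share = {tag: 100.0 * n / len(toy_values) for tag, n in toy_counts.items()}

# The same computation on a reported figure: cdb22_Auto in Table 2.
cdb22_auto_share = round(100.0 * 32806 / 92295, 1)
```

In the toy data, three synonyms yield six provenance mentions, so the column of counts sums to more than the number of synonyms, just as in Table 2.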

In order to evaluate the quality of each resource for the creation of Open Dutch Wordnet, we randomly evaluated 50 monosemous and polysemous instances. The results can be found in Table 3.

Provenance      m      p
Google          0.84   N/A
Wiktionary      0.86   0.68
Wikipedia       0.88   0.62
Omegawiki       0.90   0.86
Cdb22_Manual    0.88   0.74
Cdb22_Auto      0.80   0.80
Cdb22_None      0.96   0.78

Table 3: The evaluation results of 50 randomly selected monosemous (m) and polysemous (p) instances per resource are shown.

Table 3 shows that the overall precision of the resource is high, as far as the quality of a synonym that bears a certain provenance is concerned. What it does not show is a fair comparison of the quality of each resource, because not exactly the same strategy was used to extract information from each resource. For example, only monosemous words were used from the output from Google. Overall, we observe that 87% of the proposed monosemous synonyms were correct in the evaluation, whereas this was 76% for the polysemous synonyms. The most valuable external resource for Open Dutch WordNet seems to be Omegawiki, which is not only present in 13.6% of the LexicalEntry elements, but also performed well in the evaluation. For comparison, Sevens et al. (2014) performed an independent evaluation of the equivalence relations in Cornetto and reported a precision of 52.18% for a sample based on all synsets and 88.94% for a subset that was likely to have manually created links. Although it is difficult to compare both samples for evaluation, the precision for Open Dutch Wordnet is thus very much in line with the precision of Cornetto as reported by them.

4.3 Depth Distribution

66,326 synsets in Open Dutch Wordnet still lack a synonym. We were interested in knowing in which part of the hierarchy these synsets were located. Breadth-first search was used to calculate synset depth. Figure 4 presents the distribution of synsets with and without synonyms per depth layer.
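A depth computation of this kind can be sketched as follows. The text only states that breadth-first search was used; the sketch assumes that the search starts at the top nodes (depth 1) and follows hyponym links, so that a synset with several hyperonyms receives its shortest depth. The toy hierarchy and all names are hypothetical.

```python
from collections import Counter, deque


def synset_depths(hyponyms, top_nodes):
    """Breadth-first search from the top nodes; top nodes have depth 1."""
    depth = {top: 1 for top in top_nodes}
    queue = deque(top_nodes)
    while queue:
        current = queue.popleft()
        for child in hyponyms.get(current, ()):
            if child not in depth:  # first visit gives the shortest depth
                depth[child] = depth[current] + 1
                queue.append(child)
    return depth


def coverage_per_layer(depth, with_synonym):
    """Return {layer: percentage of synsets that have a Dutch synonym}."""
    totals, filled = Counter(), Counter()
    for synset, layer in depth.items():
        totals[layer] += 1
        if synset in with_synonym:
            filled[layer] += 1
    return {layer: 100.0 * filled[layer] / totals[layer]
            for layer in sorted(totals)}


# Hypothetical toy hierarchy; 'stop' has two hyperonyms (tangled hierarchy).
toy_hyponyms = {
    "entity": ["object", "abstraction"],
    "object": ["stop"],
    "abstraction": ["stop"],
    "stop": ["tramhalte"],
}
toy_depths = synset_depths(toy_hyponyms, ["entity"])
toy_coverage = coverage_per_layer(toy_depths, {"entity", "stop", "tramhalte"})
```

The percentage of synsets without synonyms per layer is simply 100 minus the value returned by coverage_per_layer.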

[Figure 4 (bar chart): one row per depth layer, 1 to 17; the two axes give the % of synsets without synonyms and the % of synsets with synonyms, each from 0 to 100.]

Figure 4: For each depth layer in Open Dutch WordNet, which ranges from the top level 1 to the deepest layer 17, the percentage of synsets in that layer with and without synonyms is shown.

Figure 4 presents the distribution of synsets with and without synonyms per depth layer. In general, we observe that the top layers have relatively few synsets without synonyms, whereas the opposite is true for the deeper layers. It is likely that these lower-level synsets can be filled easily if bilingual resources extend their coverage. These words usually have a single meaning and only one translation.

The opposite situation also occurs: we added new synsets to the hierarchy that are not in WordNet. These synsets appear to be spread over all levels of the hierarchy. It is more difficult to resolve these cases, since searching for possible matches in WordNet that could have been missed can only partially be supported through e.g. gloss comparison, but in the end needs to be verified manually. To support this process, we visualized these concepts in the hierarchy. An example can be found in Figure 5.

Figure 5: In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms, and blue nodes are WordNet synsets without Dutch synonyms.

Figure 5 presents an example of a new concept that has been added to the hierarchy. We added the concept of tramhalte (tram stop) as a hyponym of the concept 'stop'. In general, we observed that we mostly added concepts that are represented in Dutch by compounds, such as polderlandschap (flat barren landscape).

4.4 Python module

A Python module has been created to use Open Dutch WordNet. The module can be found at https://github.com/MartenPostma/OpenDutchWordnet. It is designed in Python 3.4. The module allows the user to inspect the LexicalEntry and Synset elements and to gather general statistics about the resource. Finally, it is possible to edit the resource using this module.

5 Discussion and future work

In this section, we discuss the process of creating Open Dutch WordNet, as well as future work to further improve the resource.

A part of Open Dutch WordNet consists of synonyms that originate from the inter-language links in external resources such as Omegawiki, Wiktionary and Wikipedia. It is interesting to observe that we obtained mostly noun synonyms from these resources. There are two main reasons why this is the case. Firstly, nouns simply have more entries in these resources. In addition, it is more difficult to disambiguate verbs than nouns. In order to get a better understanding of where we added Dutch noun synonyms, we visualized the noun hyperonym hierarchy, which can be found in Figure 6.

Figure 6: This figure visualizes the noun hyperonym hierarchy in ODWN. The black center node represents the top noun node ('entity'). In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms, and blue nodes are WordNet synsets without Dutch synonyms.

In Figure 6, the noun hyperonym hierarchy is visualized, focusing on which synsets contain a Dutch synonym. The lower left side shows a large blue spot, which means that no Dutch synonyms are located in that part of the hierarchy. We identified the synset genus ('taxonomic group containing one or more species') as the main hyperonym of this part. In addition, we observe pink nodes around the top node, which we identified as religious terms such as Heer (Lord) and Jaweh (Jaweh).

In order to improve the resource, we strive to improve both its quality and its quantity. The quality will be improved by manually inspecting the synsets containing 5 to 10 synonyms. The quantity will be improved by adding synonyms in the deeper parts of the resource. This can be done by using more or improved public bilingual resources, both English-Dutch but also by combining more languages, or by using parallel corpora. In addition, we plan to assess the most important parts of the hierarchy. This involves the top nodes of the hierarchies and the base concepts. Errors in these synsets are likely to propagate to other synsets in lower parts of the hierarchy. Finally, the relations imported from Cornetto are now added to the PWN relations. As a result, we obtained 115,077 hyperonym relations from PWN and 19,996 hyperonym relations from Cornetto. Additional hyperonym relations result in tangled hierarchies with more complex semantics. Whereas PWN has 559 top nodes for verbs, ODWN has 154 tops. The reduction of the tops is due to the additional relations that were created in Cornetto to provide more structure to the verb hierarchy. In Cornetto, there are only two top nodes for the verb hierarchy.

Open Dutch WordNet currently contains alimited amount of monosemous adjectives Wehope to be able to map the polysemous adjectivesynsets to PWN synsets by translating the Dutchglosses and by making use of the synset rela-tions in Cornetto and Princeton WordNet BecauseDutch is very close to German another possibil-ity is to map the Cornetto synsets to GermaNet(Hamp and Feldweg 1997) and make use of therich set of synset relations that it provides

Finally the current format of the resource isXML We would also like to make the resourceavailable in RDF (Klyne and Carroll 2006)

6 Conclusion

We described Open Dutch WordNet which is de-rived from the Cornetto database Princeton Word-Net and various external resources We exploitedexisting equivalence relations between Cornettosynsets and WordNet synsets in order to replaceWordNet synonyms by Dutch synonyms In ad-dition the inter-language links in various exter-nal resources such as Wiktionary and Omegawikiwere used to add synonyms to the resource Inaddition we manually evaluated each resourceand manually edited the most problematic synsetsThe Princeton-based hierarchy was also extendedwith manually created relations came from Cor-netto

Open Dutch Wordnet contains 92295 syn-onyms which are located in 51588 synsets Thereare 75173 nouns 15979 verbs and 1143 adjec-tives In total the resource consists of 117914

synsets which leave 66326 synsets still to obtaina synonym The average polysemy is 15

The resource is currently delivered inXML under the CC BY-SA 40 license5 Inorder to use and improve the resource aPython module has been created This mod-ule can be found at httpsgithubcomMartenPostmaOpenDutchWordnet

Acknowledgments

This project has been co-funded by the Neder-landse Taalunie (httptaalunieorg)In addition we thank Anne Broekhuis AnjaStoop Marjolein Klaassen and Amber Wit-senburg for their work on evaluating the ESRsmanually Moreover we thank Isa Maks(httpswwwlinkedincompubisa-maks24b47) and Hennie van derVliet (httpswwwlinkedincompubhennie-van-der-vliet0869512)for their valuable input Finally we wouldlike to thank Adam Rambousek (httpwwwmuniczfipeople60380) forhis help in creating and updating the DebVisDiceditor

ReferencesChristiane Fellbaum 1998 Wordnet An Electronic

Lexical Database MIT Press Cambridge MA

Wikimedia Foundation 2014a Wikipedia httpenwikipediaorg

Wikimedia Foundation 2014b Wiktionary httpenwiktionaryorg

Google 2014 Google translate httpstranslategooglenl

Mark Hall Eibe Frank Geoffrey Holmes BernhardPfahringer Peter Reutemann and Ian H Wit-ten 2009 The weka data mining software anupdate ACM SIGKDD explorations newsletter11(1)10ndash18

Birgit Hamp and Helmut Feldweg 1997 Germanet-alexical-semantic net for german In Proceedingsof ACL workshop Automatic Information Extrac-tion and Building of Lexical Semantic Resourcesfor NLP Applications pages 9ndash15

Graham Klyne and Jeremy J Carroll 2006 Resourcedescription framework (rdf) Concepts and ab-stract syntax

5 httpscreativecommonsorglicensesby-sa40

307

Claudia Leacock and Martin Chodorow 1998 Com-bining local context and wordnet similarity forword sense identification WordNet An elec-tronic lexical database 49(2)265ndash283

George A Miller 1995 Wordnet a Lexical Databasefor English Communications of the ACM38(11)39ndash41

Antoni Oliver 2014 Wn-toolkit Automatic gener-ation of wordnets following the expand modelProceedings of the 7th Global WordNetConfer-ence Tartu Estonia

Ross Quinlan 1993 C45 Programs for MachineLearning Morgan Kaufmann Publishers SanMateo CA

Leen Sevens Vincent Vandeghinste and Frank VanEynde 2014 Improving the precision of synsetlinks between cornetto and princeton wordnetProceedings of the Workshop on Lexical andGrammatical Resources for Language Process-ingColing 2014 Dublin Ireland pages 120ndash126

Jorg Tiedemann 2012 Parallel data tools and inter-faces in opus In LREC pages 2214ndash2218

Hennie Van der Vliet 2007 The Referentiebe-stand Nederlands as a multi-purpose lexicaldatabase International Journal of Lexicography20(3)239ndash257

P Vossen I Maks R Segers and H Vliet 2008van der zutphen h van(2008) the cornettodatabase the architecture and alignment issuesIn Proceedings of the Fourth International Glob-alWordNet Conference-GWC 2008 pages 22ndash25

Piek Vossen Claudia Soria and Monica Monachini2012 Wordnet-lmf a standard representationfor multilingual wordnets In G Francopouloeditor LMF Lexical Markup Framework theoryand practice pages 51ndash66 Hermes LavoisierISTE

Piek Vossen Isa Maks Roxane Segers Hennie van derVliet Marie-Francine Moens Katja HofmannErik Tjong Kim Sang and Maarten de Rijke2013 Cornetto a Combinatorial Lexical Se-mantic Database for Dutch In Jan Odijk Pe-ter Spyns editor Essential Speech and LanguageTechnology for Dutch Theory and Applicationsof Natural Language Processing pages 165ndash184Springer

Piek Vossen 1999 Eurowordnet General documentversion 3 final University of Amsterdam Eu-roWordNet LE2-4003 LE4-8328

Wikipedia 2014 Plagiarism mdash Wikipediathe free encyclopedia httpenwikipediaorgwindexphptitle=Plagiarismampoldid=5139350

308

Page 3: Open Dutch WordNet - Emiel van Miltenburg · We describe Open Dutch WordNet, which has been derived from the Cornetto database, the Princeton WordNet and open source resources. We

automatically during the EuroWordNet and Cor-netto project and are 100 open source

3 Methodology

We used the following procedure to create Open Dutch WordNet.

We use English WordNet 3.0 (PWN) (Miller, 1995; Fellbaum, 1998) as our basis for the concept structure. This means that we copied the PWN synsets and relations to ODWN and ignored all synsets and relations from Van Dale. The next step is to transfer the LUs from RBN to the PWN-based synsets.

Before copying these LUs, we improved the quality of the ESRs. We defined a set of ESRs that are either likely to be more difficult or that play an important role in the transfer. This subset was checked manually and was also used as training data to filter the remaining ESRs using a decision tree algorithm. This process is described in subsection 3.1.

Subsequently, we make use of the ESRs between Cornetto synsets and WordNet synsets to copy the LUs that do not originate from Van Dale from a Cornetto synset into a WordNet synset, which is described in subsection 3.2.

The transfer still leaves us with many synsets from PWN without a Dutch LU. We therefore use open source resources to translate the WordNet synonyms into Dutch, which is described in subsections 3.3 and 3.4. On the one hand, this results in more synsets having Dutch synonyms; on the other hand, it provides further evidence that transferred synonyms are correct, because they are confirmed by other sources.

Finally, we manually checked 8,257 Dutch synonyms, which is described in subsection 3.5.

3.1 Revision of equivalence relations

Firstly, we manually filtered the ESRs, focusing on the synonymy relations. Each ESR links a Cornetto synset to a WordNet synset with a certain relation type. The mapping is many-to-many. We considered three main aspects of Cornetto synsets in deciding whether to manually check an ESR: the synset depth, the number of children and the number of ESRs. We decided to manually check the deepest and shallowest synsets, because these relations got little attention in previous projects. In addition, we checked the synsets with most children, because they play an important role in a wordnet. Finally, the Cornetto synsets with most ESRs were checked, because we suspect that the equivalence relation is complex and likely to contain many wrong mappings. Four students manually checked 12,966 of the total 82,285 ESRs, of which 6,575 were removed.

The manually revised relations were used to train a pruned C4.5 decision tree algorithm (Quinlan, 1993; Hall et al., 2009) that was used to filter the remaining ESRs. An ESR consists of an equivalence relation between a Cornetto synset and a WordNet synset. We used properties of the Cornetto synset and the WordNet synset, as well as of the synset relation itself, as features:

1. the number of equivalence relations in which a Cornetto synset and a WordNet synset are present;

2. the depth of the Cornetto synset and of the WordNet synset, as well as the difference between the two depths;

3. because a Cornetto synset can be present in multiple ESRs to WordNet synsets and vice versa, the average of the semantic similarity scores (using the Leacock & Chodorow similarity measure (Leacock and Chodorow, 1998)) of all combinations of these ESRs.

Interestingly enough, the features in which Cornetto properties were used yielded the best results. This might be caused by the fact that the relations were also generated using Cornetto. The filtering of the ESRs using the decision tree algorithm resulted in an additional removal of 32,258 ESRs.
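To make the first two feature groups concrete, the following is a minimal sketch of how they could be computed for a list of ESRs. It is an illustration under assumptions, not the code used in the project: the function name esr_features, the input layout (pairs of synset identifiers plus two depth dictionaries) and the toy identifiers are all hypothetical, and the similarity-based feature and the decision tree training itself are left out.

```python
from collections import Counter


def esr_features(esrs, cornetto_depth, wordnet_depth):
    """Return one feature dict per ESR (Cornetto synset id, WordNet synset id)."""
    # Feature 1: how often each synset takes part in an ESR.
    cornetto_counts = Counter(cdb for cdb, _ in esrs)
    wordnet_counts = Counter(pwn for _, pwn in esrs)
    rows = []
    for cdb, pwn in esrs:
        rows.append({
            "esr": (cdb, pwn),
            "n_esrs_cornetto": cornetto_counts[cdb],
            "n_esrs_wordnet": wordnet_counts[pwn],
            # Feature 2: both depths and their difference.
            "depth_cornetto": cornetto_depth[cdb],
            "depth_wordnet": wordnet_depth[pwn],
            "depth_difference": cornetto_depth[cdb] - wordnet_depth[pwn],
        })
    return rows


# Toy data: identifiers and depths are invented for illustration only.
esrs = [("c1", "w1"), ("c1", "w2"), ("c2", "w2")]
rows = esr_features(esrs, {"c1": 5, "c2": 7}, {"w1": 5, "w2": 6})
```

Each resulting row could then be paired with the manual keep/remove decision to form one training instance for a decision tree learner.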

3.2 Cornetto synonyms

When there exists an ESR between a Cornetto synset and a WordNet synset and the relation type is either EQ_SYNONYM or EQ_NEAR_SYNONYM, all LUs that do not originate from Van Dale are inserted into the WordNet synset. Using figure 1 as an example, the LUs palmboom1 and palm1 would replace palm tree1 and palm3. If the ESR was checked manually, the provenance tag is cdb22_Manual. If the ESR was checked using the decision tree algorithm, the provenance tag is cdb22_Auto. The provenance tag cdb22_None is given to all other strategies that were used to add LUs to Open Dutch WordNet. One of the most dominant strategies of this class applies when a LU in a Cornetto synset does not have a direct ESR (no ESR or one of type EQ_HAS_HYPERONYM) to a WordNet synset, but the parent of the Cornetto synset does have an ESR to a WordNet synset. In that case, a new synset (not represented in WordNet) is created as a hyponym of the target of the ESR of the hyperonym. Finally, the ESRs are used to insert Cornetto synset relations into Open Dutch WordNet that do not originate from Van Dale, but were created manually in one of the projects.
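The core of this transfer step can be sketched as follows. This is a simplified illustration, assuming that every ESR passed in has already survived the manual or automatic filtering; the function name transfer_lus, the tuple layouts and the toy identifiers are hypothetical, and the cdb22_None strategies are not covered.

```python
SYNONYM_RELATIONS = {"EQ_SYNONYM", "EQ_NEAR_SYNONYM"}


def transfer_lus(esrs, cornetto_lus, manually_checked):
    """Copy non-Van Dale LUs into WordNet synsets and tag their provenance."""
    transferred = {}
    for cdb_id, pwn_id, rel_type in esrs:
        if rel_type not in SYNONYM_RELATIONS:
            continue  # only (near-)synonymy relations trigger a direct transfer
        if (cdb_id, pwn_id) in manually_checked:
            tag = "cdb22_Manual"
        else:
            tag = "cdb22_Auto"
        for lemma, from_van_dale in cornetto_lus.get(cdb_id, []):
            if not from_van_dale:  # proprietary LUs are never copied
                transferred.setdefault(pwn_id, []).append((lemma, tag))
    return transferred


# Toy data: synset ids and the Van Dale flags are invented for illustration.
esrs = [("c1", "w1", "EQ_SYNONYM"), ("c2", "w2", "EQ_HAS_HYPERONYM")]
cornetto_lus = {
    "c1": [("palmboom", False), ("palm", False), ("voorbeeld", True)],
    "c2": [("tramhalte", False)],
}
result = transfer_lus(esrs, cornetto_lus, manually_checked={("c1", "w1")})
```

In this toy run only the two open LUs of the first Cornetto synset end up in the WordNet synset; the second ESR is skipped because its relation type is not a synonymy relation.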

3.3 External resources

Using various external open source resources such as Wiktionary (Foundation, 2014b), Omegawiki 4 and Google (Google, 2014), Oliver (2014) translated both monosemous and polysemous lemmas into Dutch for the parts of speech noun, verb and adjective. For the monosemous lemmas, the English lemmas are simply translated into Dutch. For the polysemous lemmas, the gloss overlap between examples in an external resource and the possible WordNet synsets for a lemma is used to determine the correct synset for a lemma. We used a similar procedure to add synonyms from Wikipedia (Wikipedia, 2014; Foundation, 2014a).
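The gloss-overlap idea for polysemous lemmas can be illustrated with a small sketch. It only shows the general technique of counting shared words; the actual scoring used by Oliver (2014) may differ, and the function names, the tokenisation and the synset labels bank-1 and bank-2 are assumptions made for this example.

```python
import re


def tokens(text):
    """Lower-cased set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def best_synset(example, candidate_glosses):
    """Pick the candidate synset whose gloss shares most words with the example."""
    example_tokens = tokens(example)
    scores = {
        synset_id: len(example_tokens & tokens(gloss))
        for synset_id, gloss in candidate_glosses.items()
    }
    return max(scores, key=scores.get)


# Toy candidates for the English lemma "bank"; the labels are placeholders.
candidates = {
    "bank-1": "a financial institution that accepts deposits",
    "bank-2": "sloping land beside a body of water",
}
chosen = best_synset("money deposited in a financial institution", candidates)
```

A realistic implementation would probably also remove stop words and handle ties, which this sketch does not attempt.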

3.4 Adjectives extended

We created a mapping for two kinds of adjectives: monosemous adjectives that have only one sense in WordNet, and 'slightly polysemous adjectives' that have exactly one adjectival sense and one nominal sense. Adjectives of the latter kind are typically nationalities (Cameroonian), religious denominations (Buddhist) and words like purebred. To create the mapping, we translated the English word forms using Google Translate and Bing Translate. We also use the word alignments from the OPUS project (Tiedemann, 2012). These resources provide us with Dutch candidate word forms that should correspond to the original WordNet synonyms in synsets. We then checked for each word form how many senses are associated with it in RBN. If there is only one (and the word is indeed an adjective), we conclude that this Dutch sense corresponds with the original WordNet synset.

One problem with the translation-based approach is that Dutch adjectives are sometimes inflected with the suffix -e. For example, the English ontological is automatically translated by Google to ontologische. In RBN, all word forms are stored without the inflectional ending, which means that the translation does not match the lemma. To solve this issue, in the cases where we could not find a direct match, we applied an automatic stemming rule to remove the suffix and tried to find a match using the stem.

4 http://www.omegawiki.org
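The fallback described above, a direct lookup followed by a lookup without the suffix -e, could look roughly like this. The function name match_adjective and the representation of RBN as a plain set of lemmas are assumptions for the sake of the example; spelling changes that accompany Dutch inflection (such as consonant doubling) are not handled here.

```python
def match_adjective(candidate, rbn_lemmas):
    """Return the RBN lemma matching a translated word form, or None."""
    if candidate in rbn_lemmas:
        return candidate  # direct match, no stemming needed
    if candidate.endswith("e"):
        stem = candidate[:-1]  # strip the inflectional suffix -e
        if stem in rbn_lemmas:
            return stem
    return None


# Toy lemma set standing in for RBN.
rbn_lemmas = {"ontologisch"}
matched = match_adjective("ontologische", rbn_lemmas)
```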

3.5 Manual editing

Finally, we checked the resulting Dutch wordnet manually. We focused on two main editing tasks. Firstly, we inspected all synsets that had 10 or more synonyms, since such large synsets may contain false synonyms. In addition, because one Cornetto synset could have multiple ESRs, it occurred that the same sense was copied into multiple WordNet synsets. This may lead to excessive polysemy. The second task therefore consisted of indicating which WordNet synset was the correct synset for a sense that occurred in more than one WordNet synset. In total, 8,257 LUs were checked in this phase.
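Selecting the material for both editing tasks amounts to two simple groupings, sketched below. The function name editing_candidates and the input format (pairs of sense identifier and synset identifier) are hypothetical; the threshold of 10 is the one mentioned in the text.

```python
from collections import defaultdict


def editing_candidates(memberships, min_size=10):
    """Find large synsets and senses that were copied into several synsets."""
    synset_members = defaultdict(set)
    sense_synsets = defaultdict(set)
    for sense_id, synset_id in memberships:
        synset_members[synset_id].add(sense_id)
        sense_synsets[sense_id].add(synset_id)
    # Task 1: synsets with at least min_size synonyms.
    large = {s for s, members in synset_members.items() if len(members) >= min_size}
    # Task 2: senses that occur in more than one synset.
    duplicated = {s for s, synsets in sense_synsets.items() if len(synsets) > 1}
    return large, duplicated


# Toy data with a lowered threshold so that the example stays small.
memberships = [("s1", "w1"), ("s2", "w1"), ("s1", "w2"), ("s3", "w3")]
large, duplicated = editing_candidates(memberships, min_size=2)
```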

4 Overview and statistics

In this section, we provide an overview of Open Dutch WordNet in terms of general statistics, the format it is delivered in, an evaluation, and a Python module that allows users to interact with the resource.

Open Dutch WordNet contains 117,914 synsets, of which the majority (98,049) are noun synsets. There are 18,782 verb synsets and 1,083 adjectival synsets. 51,588 synsets contain at least one Dutch synonym, which leaves 66,326 synsets still to obtain a synonym. The resource contains 92,295 synonyms, of which 75,173 are nouns, 15,979 are verbs and 1,143 are adjectives. The average polysemy is 1.5. 19,996 relations were added to the WordNet hierarchy.

4.1 Format

Open Dutch WordNet is stored in a type of XML called Global WordNet Grid LMF (https://github.com/globalwordnet/schemas), which is an adaptation of Wordnet-LMF (Vossen et al., 2012). The XML contains two main elements: LexicalEntry and Synset. LexicalEntry elements contain information about a specific synonym, whereas Synset elements contain information about synsets. A simplified example of a LexicalEntry element can be found in figure 2.

    <LexicalEntry id="ondernemer-n-1" partOfSpeech="noun">
        <Lemma writtenForm="ondernemer"/>
        <Sense id="r_n-25922"
               senseId="1"
               definition="iemand met eigen bedrijf"
               synset="eng-30-10060352-n"
               provenance="cdb22_Auto+wiktionary+google"
               annotator=""/>
    </LexicalEntry>

Figure 2: A simplified example of a LexicalEntry element is shown.

In figure 2, an example of a LexicalEntry element is shown. The attributes id and partOfSpeech of the LexicalEntry element indicate the identifier and the part of speech, respectively. In this example, the identifier is ondernemer-n-1, which refers to the first noun sense of the Dutch translation of entrepreneur in the sense of "someone who organizes a business venture and assumes the risk for it". The attribute writtenForm of the element Lemma indicates the lemma. Following the structure of Cornetto, the LexicalEntry structure represents a lexical unit and not a form unit. The motivation for this is that form properties can differ from one meaning to another for a lemma. The same form can thus appear in multiple LexicalEntry elements.

Finally, the Sense element contains six attributes:

1. senseId refers to the synonym sense number.

2. id stores the synonym sense identifier. If the identifier starts with r, the synonym originates from RBN. In this case, more information about the synonym can be found in RBN. In all other cases, this is not available.

3. definition presents the definition for the sense.

4. synset points to the synset to which this synonym belongs.

5. provenance shows, concatenated by '+', which resources proposed this particular synonym for this particular synset.

6. annotator shows the name of an annotator and marks that the synonym has been checked manually. The default value is an empty string. Currently, 6,370 LexicalEntry elements have been checked manually.
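As an illustration of how these attributes can be read, the sketch below parses the simplified LexicalEntry of figure 2 with Python's standard xml.etree.ElementTree module. It does not use the Open Dutch WordNet Python module; the function name read_lexical_entry and the returned dictionary layout are assumptions, and the full resource may contain further elements and attributes that the simplified figure omits.

```python
import xml.etree.ElementTree as ET

# The simplified LexicalEntry from figure 2.
LEXICAL_ENTRY_XML = """
<LexicalEntry id="ondernemer-n-1" partOfSpeech="noun">
    <Lemma writtenForm="ondernemer"/>
    <Sense id="r_n-25922" senseId="1"
           definition="iemand met eigen bedrijf"
           synset="eng-30-10060352-n"
           provenance="cdb22_Auto+wiktionary+google"
           annotator=""/>
</LexicalEntry>
"""


def read_lexical_entry(xml_text):
    """Extract the main attributes of one LexicalEntry element."""
    entry = ET.fromstring(xml_text)
    sense = entry.find("Sense")
    return {
        "id": entry.get("id"),
        "pos": entry.get("partOfSpeech"),
        "lemma": entry.find("Lemma").get("writtenForm"),
        "synset": sense.get("synset"),
        # The provenance attribute is a '+'-separated list of resources.
        "provenance": sense.get("provenance").split("+"),
        # Identifiers starting with r point to RBN.
        "from_rbn": sense.get("id").startswith("r"),
        # A non-empty annotator marks a manually checked synonym.
        "checked_manually": sense.get("annotator") != "",
    }


info = read_lexical_entry(LEXICAL_ENTRY_XML)
```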

The LexicalEntry used in Figure 2 belongs to the synset "eng-30-10060352-n". Figure 3 presents a simplified example of that Synset element.

    <Synset id="eng-30-10060352-n" ili="i89775">
        <Definitions>
            <Definition gloss="iemand met eigen bedrijf"
                        language="nl"
                        provenance="odwn"/>
            <Definition gloss="someone who organizes a business venture and assumes the risk for it"
                        language="en"
                        provenance="pwn"/>
        </Definitions>
        <SynsetRelations>
            <SynsetRelation provenance="pwn"
                            relType="has_hyperonym"
                            target="eng-30-09882716-n"/>
            <SynsetRelation provenance="odwn"
                            relType="role_agent"
                            target="eng-30-01651293-v"/>
        </SynsetRelations>
    </Synset>

Figure 3: A simplified example of a Synset element is shown.

In figure 3, a simplified example of a Synset element is shown. The Synset attributes id and ili provide information about the synset identifier and the interlingual index identifier (http://data.lider-project.eu/ili), respectively.

The elements Definitions/Definition provide information about the gloss, language and provenance of the definitions. Finally, the element SynsetRelations/SynsetRelation stores the information about the relations between synsets. Again, the provenance attribute is used to mark whether the relation originates from PWN or from Cornetto.
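The Synset element can be read in the same way. The sketch below again relies only on xml.etree.ElementTree and on the simplified structure of figure 3; the function name read_synset and the shape of its return value are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The simplified Synset from figure 3.
SYNSET_XML = """
<Synset id="eng-30-10060352-n" ili="i89775">
    <Definitions>
        <Definition gloss="iemand met eigen bedrijf" language="nl" provenance="odwn"/>
        <Definition gloss="someone who organizes a business venture and assumes the risk for it"
                    language="en" provenance="pwn"/>
    </Definitions>
    <SynsetRelations>
        <SynsetRelation provenance="pwn" relType="has_hyperonym" target="eng-30-09882716-n"/>
        <SynsetRelation provenance="odwn" relType="role_agent" target="eng-30-01651293-v"/>
    </SynsetRelations>
</Synset>
"""


def read_synset(xml_text):
    """Return identifier, ILI, glosses per language and outgoing relations."""
    synset = ET.fromstring(xml_text)
    glosses = {d.get("language"): d.get("gloss") for d in synset.iter("Definition")}
    relations = [
        (r.get("relType"), r.get("target"), r.get("provenance"))
        for r in synset.iter("SynsetRelation")
    ]
    return synset.get("id"), synset.get("ili"), glosses, relations


synset_id, ili, glosses, relations = read_synset(SYNSET_XML)
```

Filtering the relations on their provenance value makes it possible to separate the relations taken from PWN from those added in Open Dutch WordNet.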

4.2 Analysis Lexical Entries

Open Dutch WordNet contains 92,295 synonyms originating from various resources. Table 2 presents the number of synonyms proposed by each resource. Note that the same synonym can be proposed by multiple resources, which is why the sum of all numbers is higher than the total number of synonyms. The vast majority of synonyms originate from the ESRs (prefixed by cdb22) between Cornetto synsets and WordNet synsets.

    Provenance      # instances   % of all LE
    cdb22_Auto           32,806          35.5
    cdb22_None           19,073          20.7
    wiktionary           17,968          19.5
    cdb22_Manual         13,075          14.2
    omegawiki            12,589          13.6
    google                8,374           9.1
    opus                    612           0.7
    bing                    506           0.5
    wikipedia               375           0.4

Table 2: The number of synonyms from each resource is shown. In addition, the last column indicates what percentage this number is relative to all synonyms in Open Dutch WordNet.
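The percentages in Table 2 follow directly from the counts and the total of 92,295 synonyms, as the short check below illustrates. The dictionary and the helper name share are only there for the example.

```python
TOTAL_SYNONYMS = 92295

# Counts per provenance tag as reported in Table 2.
counts = {
    "cdb22_Auto": 32806,
    "cdb22_None": 19073,
    "wiktionary": 17968,
    "cdb22_Manual": 13075,
    "omegawiki": 12589,
    "google": 8374,
    "opus": 612,
    "bing": 506,
    "wikipedia": 375,
}


def share(count, total=TOTAL_SYNONYMS):
    """Percentage of all synonyms, rounded to one decimal."""
    return round(100 * count / total, 1)


shares = {name: share(count) for name, count in counts.items()}
# Synonyms proposed by several resources are counted once per resource,
# so the sum of the counts exceeds the number of distinct synonyms.
summed = sum(counts.values())
```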

In order to evaluate the quality of each resource for the creation of Open Dutch WordNet, we evaluated 50 randomly selected monosemous and 50 randomly selected polysemous instances per resource. The results can be found in table 3.

    Provenance        m      p
    Google          0.84    N/A
    Wiktionary      0.86   0.68
    Wikipedia       0.88   0.62
    Omegawiki       0.90   0.86
    Cdb22_Manual    0.88   0.74
    Cdb22_Auto      0.80   0.80
    Cdb22_None      0.96   0.78

Table 3: The evaluation results of 50 randomly selected monosemous (m) and polysemous (p) instances per resource are shown.

Table 3 shows that the overall precision of the resource is high, as far as the quality of a synonym that bears a certain provenance is concerned. What it does not show is a fair comparison of the quality of each resource, because not exactly the same strategy was used to extract information from each resource. For example, only monosemous words were used from the output from Google. Overall, we observe that 87% of the proposed monosemous synonyms were correct in the evaluation, whereas this was 76% for the polysemous synonyms. The most valuable external resource for Open Dutch WordNet seems to be Omegawiki, which is not only present in 13.6% of the LexicalEntry elements, but also performed well in the evaluation. For comparison, Sevens et al. (2014) performed an independent evaluation of the equivalence relations in Cornetto and reported a precision of 52.18% for a sample based on all synsets and 88.94% for a subset that was likely to have manually created links. Although it is difficult to compare both samples for evaluation, the precision for Open Dutch WordNet is thus very much in line with the precision of Cornetto as reported by them.

4.3 Depth Distribution

66,326 synsets in Open Dutch WordNet still lack a synonym. We were interested in knowing in which part of the hierarchy these synsets were located. Breadth-first search was used to calculate synset depth. Figure 4 presents the distribution of synsets with and without synonyms per depth layer.

[Figure 4: bar chart with one row per depth layer (1 to 17); the two horizontal axes show the % of synsets without synonyms and the % of synsets with synonyms, each ranging from 0 to 100.]

Figure 4: For each depth layer in Open Dutch WordNet, which ranges from the top level 1 to the deepest layer 17, the percentage of synsets in that layer with and without synonyms is shown.

Figure 4 shows that, in general, the top layers have relatively few synsets without synonyms, whereas the opposite is true for the deeper layers. It is likely that these lower level synsets can be filled easily if bilingual resources extend their coverage, since these words usually have a single meaning and only one translation.
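The depth computation and the per-layer percentages can be sketched as follows. This is an illustration under assumptions: depth is taken to be the shortest distance from a top node, with top nodes at level 1, and the function names, the adjacency dictionary and the toy hierarchy are hypothetical.

```python
from collections import Counter, deque


def synset_depths(hyponyms, top_nodes):
    """Breadth-first search from the top nodes; top nodes get depth 1."""
    depth = {top: 1 for top in top_nodes}
    queue = deque(top_nodes)
    while queue:
        synset = queue.popleft()
        for child in hyponyms.get(synset, []):
            if child not in depth:  # the first visit gives the shortest path
                depth[child] = depth[synset] + 1
                queue.append(child)
    return depth


def share_without_synonyms(depth, has_synonym):
    """Percentage of synsets without a Dutch synonym for each depth layer."""
    total = Counter(depth.values())
    empty = Counter(d for synset, d in depth.items() if not has_synonym[synset])
    return {layer: 100 * empty[layer] / total[layer] for layer in sorted(total)}


# Toy hierarchy: "d" has two parents, as can happen in a tangled hierarchy.
hyponyms = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
depth = synset_depths(hyponyms, ["a"])
layers = share_without_synonyms(
    depth, {"a": True, "b": True, "c": False, "d": False}
)
```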

The opposite situation also occurs: we added new synsets to the hierarchy that are not in WordNet. These synsets appear to be spread over all levels of the hierarchy. It is more difficult to resolve these cases, since searching for possible matches in WordNet that could have been missed can only partially be supported through, e.g., gloss comparison, but in the end needs to be verified manually. To support this process, we visualized these concepts in the hierarchy. An example can be found in Figure 5.

Figure 5: In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms and blue nodes are WordNet synsets without Dutch synonyms.

Figure 5 presents an example of a new concept that has been added to the hierarchy. We added the concept of tramhalte (tram stop) as a hyponym of the concept 'stop'. In general, we observed that we mostly added concepts that are represented in Dutch by compounds, such as polderlandschap (flat barren landscape).

4.4 Python module

A Python module has been created to use Open Dutch WordNet. The module can be found at https://github.com/MartenPostma/OpenDutchWordnet. It is designed in Python 3.4. The module allows the user to inspect the LexicalEntry and Synset elements and to gather general statistics about the resource. Finally, it is possible to edit the resource using this module.

5 Discussion and future work

In this section, we discuss the process of creating Open Dutch WordNet, as well as future work to further improve the resource.

A part of Open Dutch WordNet consists of synonyms that originate from the inter-language links in external resources such as Omegawiki, Wiktionary and Wikipedia. It is interesting to observe that we obtained mostly noun synonyms from these resources. There are two main reasons why this is the case. Firstly, nouns simply have more entries in these resources. In addition, it is more difficult to disambiguate verbs than nouns. In order to get a better understanding of where we added Dutch noun synonyms, we visualized the noun hyperonym hierarchy, which can be found in Figure 6.

Figure 6: This figure visualizes the noun hyperonym hierarchy in ODWN. The black center node represents the top noun node ('entity'). In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms and blue nodes are WordNet synsets without Dutch synonyms.

In Figure 6, the noun hyperonym hierarchy is visualized, focusing on which synsets contain a Dutch synonym. The lower left side shows a large blue spot, which means that no Dutch synonyms are located in that part of the hierarchy. We identified the synset genus ('taxonomic group containing one or more species') as the main hyperonym of this part. In addition, we observe pink nodes around the top node, which we identified as religious terms, such as Heer (Lord) and Jaweh (Jaweh).

In order to improve the resource, we strive to improve both its quality and its quantity. The quality will be improved by manually inspecting the synsets that contain 5 to 10 synonyms. The quantity will be improved by adding synonyms in the deeper parts of the resource. This can be done by using more or improved public bilingual resources, not only English-Dutch ones, but also by combining more languages or by using parallel corpora. In addition, we plan to assess the most important parts of the hierarchy. This involves the top nodes of the hierarchies and the base concepts. Errors in these synsets are likely to propagate to other synsets in lower parts of the hierarchy. Finally, the relations imported from Cornetto are now added to the PWN relations. As a result, we obtained 115,077 hyperonym relations from PWN and 19,996 hyperonym relations from Cornetto. Additional hyperonym relations result in tangled hierarchies with more complex semantics. Whereas PWN has 559 top nodes for verbs, ODWN has 154 tops. The reduction of the tops is due to the additional relations that were created in Cornetto to provide more structure to the verb hierarchy. In Cornetto, there are only two top nodes for the verb hierarchy.
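Why additional hyperonym relations reduce the number of top nodes can be seen in a small sketch: a top node is a synset without any hyperonym, so every imported relation that gives a former top node a parent removes it from the set of tops. The function name top_nodes and the toy verb synsets below are hypothetical.

```python
def top_nodes(synsets, hyperonym_relations):
    """Synsets that never occur as the child in a (child, parent) relation."""
    has_parent = {child for child, _parent in hyperonym_relations}
    return set(synsets) - has_parent


# Toy verb hierarchy: two tops when only the PWN relation is used ...
synsets = {"v1", "v2", "v3"}
pwn_relations = [("v3", "v1")]
tops_pwn = top_nodes(synsets, pwn_relations)

# ... and a single top once a relation from Cornetto is added.
cornetto_relations = [("v2", "v1")]
tops_combined = top_nodes(synsets, pwn_relations + cornetto_relations)
```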

Open Dutch WordNet currently contains a limited number of monosemous adjectives. We hope to be able to map the polysemous adjective synsets to PWN synsets by translating the Dutch glosses and by making use of the synset relations in Cornetto and Princeton WordNet. Because Dutch is very close to German, another possibility is to map the Cornetto synsets to GermaNet (Hamp and Feldweg, 1997) and make use of the rich set of synset relations that it provides.

Finally, the current format of the resource is XML. We would also like to make the resource available in RDF (Klyne and Carroll, 2006).

6 Conclusion

We described Open Dutch WordNet, which is derived from the Cornetto database, Princeton WordNet and various external resources. We exploited existing equivalence relations between Cornetto synsets and WordNet synsets in order to replace WordNet synonyms by Dutch synonyms. In addition, the inter-language links in various external resources such as Wiktionary and Omegawiki were used to add synonyms to the resource. Furthermore, we manually evaluated each resource and manually edited the most problematic synsets. The Princeton-based hierarchy was also extended with manually created relations that came from Cornetto.

Open Dutch WordNet contains 92,295 synonyms, which are located in 51,588 synsets. There are 75,173 nouns, 15,979 verbs and 1,143 adjectives. In total, the resource consists of 117,914 synsets, which leaves 66,326 synsets still to obtain a synonym. The average polysemy is 1.5.

The resource is currently delivered in XML under the CC BY-SA 4.0 license 5. In order to use and improve the resource, a Python module has been created. This module can be found at https://github.com/MartenPostma/OpenDutchWordnet.

Acknowledgments

This project has been co-funded by the Nederlandse Taalunie (http://taalunie.org). In addition, we thank Anne Broekhuis, Anja Stoop, Marjolein Klaassen and Amber Witsenburg for their work on evaluating the ESRs manually. Moreover, we thank Isa Maks (httpswwwlinkedincompubisa-maks24b47) and Hennie van der Vliet (httpswwwlinkedincompubhennie-van-der-vliet0869512) for their valuable input. Finally, we would like to thank Adam Rambousek (http://www.muni.cz/fi/people/60380) for his help in creating and updating the DebVisDic editor.

5 https://creativecommons.org/licenses/by-sa/4.0/

References

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Wikimedia Foundation. 2014a. Wikipedia. http://en.wikipedia.org.

Wikimedia Foundation. 2014b. Wiktionary. http://en.wiktionary.org.

Google. 2014. Google Translate. https://translate.google.nl.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a lexical-semantic net for German. In Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15.

Graham Klyne and Jeremy J. Carroll. 2006. Resource Description Framework (RDF): Concepts and abstract syntax.

Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. WordNet: An electronic lexical database, 49(2):265–283.

George A. Miller. 1995. WordNet: a Lexical Database for English. Communications of the ACM, 38(11):39–41.

Antoni Oliver. 2014. WN-Toolkit: Automatic generation of wordnets following the expand model. Proceedings of the 7th Global WordNet Conference, Tartu, Estonia.

Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Leen Sevens, Vincent Vandeghinste, and Frank Van Eynde. 2014. Improving the precision of synset links between Cornetto and Princeton WordNet. Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing, Coling 2014, Dublin, Ireland, pages 120–126.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, pages 2214–2218.

Hennie van der Vliet. 2007. The Referentiebestand Nederlands as a multi-purpose lexical database. International Journal of Lexicography, 20(3):239–257.

P. Vossen, I. Maks, R. Segers, H. van der Vliet, and H. van Zutphen. 2008. The Cornetto database: the architecture and alignment issues. In Proceedings of the Fourth International Global WordNet Conference - GWC 2008, pages 22–25.

Piek Vossen, Claudia Soria, and Monica Monachini. 2012. Wordnet-LMF: a standard representation for multilingual wordnets. In G. Francopoulo, editor, LMF: Lexical Markup Framework, theory and practice, pages 51–66. Hermes / Lavoisier / ISTE.

Piek Vossen, Isa Maks, Roxane Segers, Hennie van der Vliet, Marie-Francine Moens, Katja Hofmann, Erik Tjong Kim Sang, and Maarten de Rijke. 2013. Cornetto: a Combinatorial Lexical Semantic Database for Dutch. In Jan Odijk and Peter Spyns, editors, Essential Speech and Language Technology for Dutch, Theory and Applications of Natural Language Processing, pages 165–184. Springer.

Piek Vossen. 1999. EuroWordNet General Document, version 3, final. University of Amsterdam, EuroWordNet LE2-4003, LE4-8328.

Wikipedia. 2014. Plagiarism - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Plagiarism&oldid=5139350.

308

Page 4: Open Dutch WordNet - Emiel van Miltenburg · We describe Open Dutch WordNet, which has been derived from the Cornetto database, the Princeton WordNet and open source resources. We

Open Dutch WordNet One of the most dominantstrategies of this class is when a LU in a Cornettosynset does not have a direct ESR (no ESR orone of EQ HAS HYPERONYM) to a WordNetSynset but the parent of the Cornetto synset doeshave an ESR to a WordNet synset In that casea new synset (not represented in WordNet) iscreated as a hyponym of the target of the ESRof the hyperonym Finally the ESRs are used toinsert Cornetto synset relations into Open DutchWordNet that do not originate from Van Dale butwere created manually in one of the projects

33 External resources

Using various external open source resources suchas Wiktionary (Foundation 2014b) Omegawiki 4and Google (Google 2014) Oliver (2014) trans-lated both monosemous and polysemous lemmasinto Dutch for the part of speeches noun verband adjective For the monosemous lemmas theEnglish lemmas are simply translated into DutchFor the polysemous lemmas the gloss overlap be-tween examples in an external resource and thepossible WordNet synsets for a lemma are usedto determine the correct synset for a lemma Weused a similar procedure to add synonyms fromWikipedia (Wikipedia 2014 Foundation 2014a)

34 Adjectives extended

We created a mapping for two kinds of adjec-tives monosemous adjectives that have only onesense in WordNet and lsquoslightly polysemous ad-jectivesrsquo that have exactly one adjectival senseand one nominal sense Adjectives of the latterkind are typically nationalities (Cameroonian) re-ligious denominations (Buddhist) and words likepurebred To create the mapping we translated theEnglish word forms using Google Translate andBing Translate We also use the word alignmentsfrom the OPUS project (Tiedemann 2012) Theseresources provide us with Dutch candidate wordforms that should correspond to the original Word-Net synonyms in synsets We then checked foreach word form how many senses are associatedwith them in RBN If there is only one (and theword is indeed an adjective) we conclude that thisDutch sense corresponds with the original Word-Net synset

One problem with the translation-based ap-proach is that Dutch adjectives are sometimes in-

4 httpwwwomegawikiorg

flected with the suffix -e For example the Englishontological is automatically translated by Googleto ontologische In RBN all word forms are storedwithout the inflectional ending which means thatthe translation does not match the lemma To solvethis issue in the cases where we could not find adirect match we applied an automatic stemmingrule to remove the suffix and tried to find a matchusing the stem

35 Manual editing

Finally we checked the resulting Dutch wordnetmanually We focused on two main editing tasksFirstly we inspected all synsets that had 10 ormore synonyms since excessive synsets may con-tain false synonyms In addition because oneCornetto synset could have multiple ESRs it oc-curred that the same sense was copied into multi-ple WordNet synsets This may lead to excessivepolysemy The second task therefore consistedof indicating which WordNet synset was the cor-rect synset for a sense that occurred in more thanone WordNet synset In total 8257 LUrsquos werechecked in this phase

4 Overview and statistics

In this section we provide an overview of OpenDutch Wordnet in terms of general statistics theformat it is delivered in evaluation and a Pythonmodule which allows to interact with the resource

Open Dutch Wordnet contains 117914synsets of which the majority are noun synsets98049 There are 18782 verb synsets and 1083adjectival synsets 51588 synsets contain at leastone Dutch synonym which leaves 66326 synsetsstill to obtain a synonym The resource contains92295 synonyms of which 75173 are nouns15979 are verbs and 1143 are adjectives Theaverage polysemy is 15 19996 relations wereadded to the WordNet hierarchy

41 Format

Open Dutch WordNet is stored in a type of XMLcalled Global WordNet Grid LMF (httpsgithubcomglobalwordnetschemas)which is an adaptation of WordnetLMF (Vossenet al 2012) The XML contains two mainelements LexicalEntry and Synset LexicalEntryelements contain information about a specificsynonym whereas Synset elements containinformation about synsets A simplified example

303

of a LexicalEntry element can be found in figure2

ltLexicalEntry id=ondernemer-n-1partOfSpeech=noungt

ltLemma writtenForm=ondernemergtltSenseid=r_n-25922senseId=1definition=iemand met eigen bedrijfsynset=eng-30-10060352-nprovenance=cdb22_Auto+wiktionary+googleannotator=gt

ltLexicalEntrygt

Figure 2 A simplified example of a LexicalEntryelement is shown

In figure 2 an example of a LexicalEntry el-ement is shown The attributes id and partOf-Speech of the LexicalEntry element indicate theidentifier and the part of speech respectively Inthis example the identifier is ondernemer-n-1which refers to the first noun sense of the Dutchtranslation of entrepreneur in the sense of ldquosome-one who organizes a business venture and assumesthe risk for itrdquo The attribute writtenForm of theelement Lemma indicates the lemma Followingthe structure of Cornetto the LexicalEntry struc-ture represents a lexical unit and not a form unitThe motivation for this is that form properties candiffer from one meaning to another for a lemmaThe same form can thus appear in multiple Lexi-calEntry elements

Finally the Sense element contains five at-tributes

1 senseId refers to the synonym sense number

2 id stores the synonym sense identifier If theidentifier starts with r the synonym origi-nates from RBN In this case more informa-tion about the synonym can be found in RBNIn all other cases this is not available

3 definition presents the definition for thesense

4 synset points to the synset to which this syn-onym belongs

5 Concatenated by rsquo+rsquo the attribute prove-nance shows which resources proposed thisparticular synonym for this particular synset

6 the attribute annotator shows the name ofan annotator and marks that the synonym hasbeen checked manually The default value is

an empty string Currently 6370 LexicalEn-try elements have been checked manually

The LexicalEntry used in Figure 2 belongedto the synset ldquoeng-30-10060352-nrdquo Figure 3presents a simplified example of that Synset ele-ment

ltSynset id=eng-30-10060352-nili=i89775gt

ltDefinitionsgtltDefinitiongloss=iemand met eigen bedrijflanguage=nlprovenance=odwngt

ltDefinitiongloss=someone who organizesa business venture andassumes the risk for itlanguage=enprovenance=pwngt

ltSynsetRelationsgtltSynsetRelationprovenance=pwnrelType=has_hyperonymtarget=eng-30-09882716-ngt

ltSynsetRelationprovenance=odwnrelType=role_agenttarget=eng-30-01651293-vgtltSynsetRelationsgt

ltSynsetgt

Figure 3 A simplified example of a Synset ele-ment is shown

In figure 3 a simplified example is shownof a Synset element The Synset attributes id andili provide information about the synset identifierand the interlingual index identifier respectivelyhttpdatalider-projecteuili

The elements DefinitionsDefinition provideinformation about the gloss language and prove-nance of the definitions Finally the elementSynsetRelationsSynsetRelation stores the infor-mation about the relations between synsets Againthe provenance attribute is used to mark whetherthe relation originates from PWN or from Cor-netto

42 Analysis Lexical Entries

Open Dutch WordNet contains 92295 synonymsoriginating from various resources Table 2presents information about the number of syn-onyms from each resource

Table 2 presents the number of synonymsproposed by each resource Note that the samesynonym can be proposed by multiple resourceswhich is why the sum of all numbers is higher than

304

Provenance instances of all LE

cdb22 Auto 32806 355cdb22 None 19073 207

wiktionary 17968 195cdb22 Manual 13075 142

omegawiki 12589 136google 8374 91

opus 612 07bing 506 05

wikipedia 375 04

Table 2 The number of synonyms from each re-source is shown In addition the second columnindicates what percentage this number is relativeto all synonyms in Open Dutch Wordnet

the total number of synonyms The vast major-ity of synonyms originate from the ESRs (prefixedby cdb22) between Cornetto synsets and WordNetsynsets

In order to evaluate the quality of each re-source for the creation of Open Dutch Wordnetwe randomly evaluated 50 monosemous and pol-ysemous instances The results can be found intable 3

Provenance m p

Google 084 NAWiktionary 086 068Wikipedia 088 062

Omegawiki 090 086Cdb22 Manual 088 074

Cdb22 Auto 080 080Cdb22 None 096 078

Table 3 The evaluation results of randomly se-lected 50 monosemous (m) and polysemous (p) in-stances per resources is shown

Table 3 shows that the overall precision of the resource is high, as far as the quality of a synonym that bears a certain provenance is concerned. What it does not show is a fair comparison of the quality of each resource, because not exactly the same strategy was used to extract information from each resource. For example, only monosemous words were used from the output of Google. Overall, we observe that 87% of the proposed monosemous synonyms were correct in the evaluation, whereas this was 76% for the polysemous synonyms. The most valuable external resource for Open Dutch WordNet seems to be Omegawiki, which is not only present in 13.6% of the LexicalEntry elements, but also performed well in the evaluation. For comparison, Sevens et al. (2014) performed an independent evaluation of the equivalence relations in Cornetto and reported a precision of 52.18% for a sample based on all synsets and 88.94% for a subset that was likely to have manually created links. Although it is difficult to compare both samples for evaluation, the precision for Open Dutch WordNet is thus very much in line with the precision of Cornetto as reported by them.
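To make the per-resource figures easier to compare, the sketch below turns the precision values of Table 3 into counts of correct instances and computes an unweighted mean per column. How the overall percentages in the text were aggregated is not stated, so the means computed here are an assumption about one possible aggregation and need not coincide exactly with the reported figures.

```python
from statistics import mean

SAMPLE_SIZE = 50  # instances evaluated per resource and per category

# (monosemous precision, polysemous precision) per provenance, from Table 3.
# None marks the category that was not evaluated for that resource.
PRECISION = {
    "Google": (0.84, None),
    "Wiktionary": (0.86, 0.68),
    "Wikipedia": (0.88, 0.62),
    "Omegawiki": (0.90, 0.86),
    "Cdb2.2 Manual": (0.88, 0.74),
    "Cdb2.2 Auto": (0.80, 0.80),
    "Cdb2.2 None": (0.96, 0.78),
}

# Number of correct monosemous instances out of 50, per resource.
correct_mono = {name: round(m * SAMPLE_SIZE) for name, (m, _) in PRECISION.items()}

# Unweighted means over the resources that have a value in each column.
mean_mono = mean(m for m, _ in PRECISION.values())
mean_poly = mean(p for _, p in PRECISION.values() if p is not None)
```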

4.3 Depth Distribution

66,326 synsets in Open Dutch WordNet still lack a synonym. We were interested in knowing in which part of the hierarchy these synsets were located. Breadth-first search was used to calculate synset depth. Figure 4 presents the distribution of synsets with and without synonyms per depth layer.
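A minimal sketch of a breadth-first depth computation is given below. It assumes the hierarchy is available as a mapping from each synset to its direct hyponyms and that top nodes receive depth 1; the function name and the toy identifiers are hypothetical and do not come from the resource.

```python
from collections import deque


def synset_depths(hyponyms, tops):
    """Return the depth of every synset reachable from the top nodes.

    hyponyms maps a synset id to a list of its direct hyponym ids.
    Top nodes get depth 1. In a tangled hierarchy, a synset reachable
    along several paths receives the depth of its shortest path.
    """
    depths = {top: 1 for top in tops}
    queue = deque(tops)
    while queue:
        current = queue.popleft()
        for child in hyponyms.get(current, []):
            if child not in depths:  # first visit is the shortest path in BFS
                depths[child] = depths[current] + 1
                queue.append(child)
    return depths


# Toy hierarchy with made-up identifiers; "s3" has two hyperonyms.
TOY_HYPONYMS = {
    "top": ["s1", "s2"],
    "s1": ["s3"],
    "s2": ["s3", "s4"],
    "s3": ["s5"],
}
toy_depths = synset_depths(TOY_HYPONYMS, ["top"])
```

Because breadth-first search visits nodes in order of distance from the top, each synset is assigned its depth the first time it is reached.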

[Figure 4: bar chart with one row per depth layer (1 to 17); horizontal axes: % of synsets without synonyms and % of synsets with synonyms, each from 0 to 100]

Figure 4: For each depth layer in Open Dutch WordNet, which ranges from the top level 1 to the deepest layer 17, the percentage of synsets in that layer with and without synonyms is shown.

In general, Figure 4 shows that the top layers have relatively few synsets without synonyms, whereas the opposite is true for the deeper layers. It is likely that these lower-level synsets can be filled easily if bilingual resources extend their coverage, since the words involved usually have a single meaning and only one translation.
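The per-layer percentages behind such a distribution can be derived from a depth mapping and the set of synsets that contain at least one synonym. The following sketch assumes both are already available; the function name and the toy data are hypothetical.

```python
from collections import Counter


def coverage_per_layer(depths, with_synonym):
    """Percentage of synsets with at least one synonym, per depth layer.

    depths maps a synset id to its depth layer.
    with_synonym is the set of synset ids that have a synonym.
    """
    totals = Counter(depths.values())
    filled = Counter(layer for sid, layer in depths.items() if sid in with_synonym)
    return {layer: 100 * filled[layer] / totals[layer] for layer in sorted(totals)}


# Toy data with made-up identifiers: coverage drops in the deepest layer.
toy_layer_of = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 3, "f": 3, "g": 3}
toy_with_synonym = {"a", "b", "c", "d"}
toy_coverage = coverage_per_layer(toy_layer_of, toy_with_synonym)
```

The share of synsets without synonyms in a layer is simply 100 minus the value returned for that layer.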

The opposite situation also occurs: we added new synsets to the hierarchy that are not in WordNet. These synsets appear to be spread over all levels of the hierarchy. It is more difficult to resolve these cases, since searching for possible matches in WordNet that could have been missed can only partially be supported through, e.g., gloss comparison, and in the end needs to be verified manually. To support this process, we visualized these concepts in the hierarchy. An example can be found in Figure 5.

Figure 5: In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms, and blue nodes are WordNet synsets without Dutch synonyms.

Figure 5 presents an example of a new concept that has been added to the hierarchy. We added the concept of tramhalte (tram stop) as a hyponym of the concept 'stop'. In general, we observed that we mostly added concepts that are represented in Dutch by compounds, such as polderlandschap (flat barren landscape).

4.4 Python module

A Python module has been created to use Open Dutch WordNet. The module can be found at https://github.com/MartenPostma/OpenDutchWordnet. It is designed for Python 3.4. The module allows the user to inspect the LexicalEntry and Synset elements and to gather general statistics about the resource. Finally, it is possible to edit the resource using this module.
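The interface of the module itself is not described here, so the following sketch deliberately does not use it. It only shows, with Python's standard library, what inspecting a LexicalEntry element could look like. The sample is transcribed from the simplified LexicalEntry example; the quoting and the self-closing tags are assumptions, and the function name `describe` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modelled on the simplified LexicalEntry example.
LEXICAL_ENTRY_XML = """
<LexicalEntry id="ondernemer-n-1" partOfSpeech="noun">
  <Lemma writtenForm="ondernemer"/>
  <Sense id="r_n-25922" senseId="1" definition="iemand met eigen bedrijf"
         synset="eng-30-10060352-n" provenance="cdb2.2_Auto+wiktionary+google"
         annotator=""/>
</LexicalEntry>
"""


def describe(entry):
    """Collect the main properties of one LexicalEntry element."""
    sense = entry.find("Sense")
    return {
        "lemma": entry.find("Lemma").get("writtenForm"),
        "pos": entry.get("partOfSpeech"),
        "synset": sense.get("synset"),
        # Resources that proposed this synonym are concatenated with '+'.
        "provenance": sense.get("provenance").split("+"),
        # A non-empty annotator marks a manually checked synonym.
        "checked_manually": sense.get("annotator") != "",
        # Sense identifiers starting with 'r' originate from RBN.
        "from_rbn": sense.get("id").startswith("r"),
    }


info = describe(ET.fromstring(LEXICAL_ENTRY_XML.strip()))
```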

5 Discussion and future work

In this section, we discuss the process of creating Open Dutch WordNet, as well as future work to further improve the resource.

A part of Open Dutch WordNet consists of synonyms that originate from the inter-language links in external resources such as Omegawiki, Wiktionary, and Wikipedia. It is interesting to observe that we obtained mostly noun synonyms from these resources. There are two main reasons why this is the case. Firstly, nouns simply have more entries in these resources. In addition, it is more difficult to disambiguate verbs than nouns. In order to get a better understanding of where we added Dutch noun synonyms, we visualized the noun hyperonym hierarchy, which can be found in Figure 6.

Figure 6: This figure visualizes the noun hyperonym hierarchy in ODWN. The black center node represents the top noun node ('entity'). In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms, and blue nodes are WordNet synsets without Dutch synonyms.

In Figure 6, the noun hyperonym hierarchy is visualized, focusing on which synsets contain a Dutch synonym. The lower left side shows a large blue spot, which means that no Dutch synonyms are located in that part of the hierarchy. We identified the synset genus ('taxonomic group containing one or more species') as the main hyperonym of this part. In addition, we observe pink nodes around the top node, which we identified as religious terms, such as Heer (Lord) and Jaweh (Jaweh).

To improve the resource, we aim to increase both its quality and its quantity. The quality will be improved by manually inspecting the synsets that contain 5 to 10 synonyms. The quantity will be improved by adding synonyms in the deeper parts of the resource. This can be done by using more or improved public bilingual resources, both English-Dutch and combinations with other languages, or by using parallel corpora. In addition, we plan to assess the most important parts of the hierarchy. This involves the top nodes of the hierarchies and the base concepts. Errors in these synsets are likely to propagate to other synsets in lower parts of the hierarchy. Finally, the relations imported from Cornetto are now added to the PWN relations. As a result, we obtained 115,077 hyperonym relations from PWN and 19,996 hyperonym relations from Cornetto. Additional hyperonym relations result in tangled hierarchies with more complex semantics. Whereas PWN has 559 top nodes for verbs, ODWN has 154 tops. The reduction of the tops is due to the additional relations that were created in Cornetto to provide more structure to the verb hierarchy. In Cornetto, there are only two top nodes for the verb hierarchy.
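The effect of additional hyperonym relations on the number of top nodes can be illustrated with a small sketch. It assumes that a top node is a synset without any hyperonym relation and that relations are given as (hyponym, hyperonym) pairs; the function name and the toy verb identifiers are hypothetical.

```python
def top_nodes(synsets, hyperonym_relations):
    """Return the synsets that have no hyperonym.

    hyperonym_relations is an iterable of (hyponym, hyperonym) pairs.
    """
    with_hyperonym = {hyponym for hyponym, _hyperonym in hyperonym_relations}
    return set(synsets) - with_hyperonym


# Toy verb hierarchy with made-up identifiers.
toy_verbs = {"v1", "v2", "v3", "v4"}
relations_from_pwn = [("v2", "v1")]
relations_from_cornetto = [("v3", "v1"), ("v4", "v3")]

tops_before = top_nodes(toy_verbs, relations_from_pwn)
tops_after = top_nodes(toy_verbs, relations_from_pwn + relations_from_cornetto)
```

In this toy case, merging the second set of relations attaches two former top nodes to the rest of the hierarchy, which mirrors the reduction in verb tops described above.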

Open Dutch WordNet currently contains a limited number of monosemous adjectives. We hope to be able to map the polysemous adjective synsets to PWN synsets by translating the Dutch glosses and by making use of the synset relations in Cornetto and Princeton WordNet. Because Dutch is very close to German, another possibility is to map the Cornetto synsets to GermaNet (Hamp and Feldweg, 1997) and make use of the rich set of synset relations that it provides.

Finally, the current format of the resource is XML. We would also like to make the resource available in RDF (Klyne and Carroll, 2006).

6 Conclusion

We described Open Dutch WordNet, which is derived from the Cornetto database, Princeton WordNet, and various external resources. We exploited existing equivalence relations between Cornetto synsets and WordNet synsets in order to replace WordNet synonyms by Dutch synonyms. In addition, the inter-language links in various external resources, such as Wiktionary and Omegawiki, were used to add synonyms to the resource. We also manually evaluated each resource and manually edited the most problematic synsets. Finally, the Princeton-based hierarchy was extended with manually created relations that came from Cornetto.

Open Dutch WordNet contains 92,295 synonyms, which are located in 51,588 synsets. There are 75,173 nouns, 15,979 verbs, and 1,143 adjectives. In total, the resource consists of 117,914 synsets, which leaves 66,326 synsets still to obtain a synonym. The average polysemy is 1.5.
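The summary figures are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the derived coverage percentage is not reported in the text and is computed here purely for illustration.

```python
TOTAL_SYNSETS = 117914
SYNSETS_WITH_SYNONYM = 51588
SYNONYMS_BY_POS = {"noun": 75173, "verb": 15979, "adjective": 1143}

# Synsets that still have to obtain a Dutch synonym.
synsets_without_synonym = TOTAL_SYNSETS - SYNSETS_WITH_SYNONYM

# The per-part-of-speech counts add up to the total number of synonyms.
total_synonyms = sum(SYNONYMS_BY_POS.values())

# Share of synsets with at least one Dutch synonym (derived, not reported).
coverage_percent = 100 * SYNSETS_WITH_SYNONYM / TOTAL_SYNSETS
```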

The resource is currently delivered in XML under the CC BY-SA 4.0 license (see footnote 5). In order to use and improve the resource, a Python module has been created. This module can be found at https://github.com/MartenPostma/OpenDutchWordnet.

Acknowledgments

This project has been co-funded by the Nederlandse Taalunie (http://taalunie.org/). In addition, we thank Anne Broekhuis, Anja Stoop, Marjolein Klaassen, and Amber Witsenburg for their work on evaluating the ESRs manually. Moreover, we thank Isa Maks (httpswwwlinkedincompubisa-maks24b47) and Hennie van der Vliet (httpswwwlinkedincompubhennie-van-der-vliet0869512) for their valuable input. Finally, we would like to thank Adam Rambousek (http://www.muni.cz/fi/people/60380) for his help in creating and updating the DebVisDic editor.

5 https://creativecommons.org/licenses/by-sa/4.0/

References

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Wikimedia Foundation. 2014a. Wikipedia. http://en.wikipedia.org.

Wikimedia Foundation. 2014b. Wiktionary. http://en.wiktionary.org.

Google. 2014. Google translate. https://translate.google.nl.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a lexical-semantic net for German. In Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15.

Graham Klyne and Jeremy J. Carroll. 2006. Resource description framework (RDF): Concepts and abstract syntax.

Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. WordNet: An electronic lexical database, 49(2):265–283.

George A. Miller. 1995. WordNet: a Lexical Database for English. Communications of the ACM, 38(11):39–41.

Antoni Oliver. 2014. WN-Toolkit: Automatic generation of wordnets following the expand model. Proceedings of the 7th Global WordNet Conference, Tartu, Estonia.

Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Leen Sevens, Vincent Vandeghinste, and Frank Van Eynde. 2014. Improving the precision of synset links between Cornetto and Princeton WordNet. Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing, Coling 2014, Dublin, Ireland, pages 120–126.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, pages 2214–2218.

Hennie Van der Vliet. 2007. The Referentiebestand Nederlands as a multi-purpose lexical database. International Journal of Lexicography, 20(3):239–257.

P. Vossen, I. Maks, R. Segers, H. van der Vliet, and H. van Zutphen. 2008. The Cornetto database: the architecture and alignment issues. In Proceedings of the Fourth International Global WordNet Conference (GWC 2008), pages 22–25.

Piek Vossen, Claudia Soria, and Monica Monachini. 2012. Wordnet-LMF: a standard representation for multilingual wordnets. In G. Francopoulo, editor, LMF: Lexical Markup Framework, theory and practice, pages 51–66. Hermes / Lavoisier / ISTE.

Piek Vossen, Isa Maks, Roxane Segers, Hennie van der Vliet, Marie-Francine Moens, Katja Hofmann, Erik Tjong Kim Sang, and Maarten de Rijke. 2013. Cornetto: a Combinatorial Lexical Semantic Database for Dutch. In Jan Odijk and Peter Spyns, editors, Essential Speech and Language Technology for Dutch, Theory and Applications of Natural Language Processing, pages 165–184. Springer.

Piek Vossen. 1999. EuroWordNet General document, version 3, final. University of Amsterdam, EuroWordNet LE2-4003, LE4-8328.

Wikipedia. 2014. Plagiarism - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Plagiarism&oldid=5139350.


The elements DefinitionsDefinition provideinformation about the gloss language and prove-nance of the definitions Finally the elementSynsetRelationsSynsetRelation stores the infor-mation about the relations between synsets Againthe provenance attribute is used to mark whetherthe relation originates from PWN or from Cor-netto

42 Analysis Lexical Entries

Open Dutch WordNet contains 92295 synonymsoriginating from various resources Table 2presents information about the number of syn-onyms from each resource

Table 2 presents the number of synonymsproposed by each resource Note that the samesynonym can be proposed by multiple resourceswhich is why the sum of all numbers is higher than

304

Provenance instances of all LE

cdb22 Auto 32806 355cdb22 None 19073 207

wiktionary 17968 195cdb22 Manual 13075 142

omegawiki 12589 136google 8374 91

opus 612 07bing 506 05

wikipedia 375 04

Table 2 The number of synonyms from each re-source is shown In addition the second columnindicates what percentage this number is relativeto all synonyms in Open Dutch Wordnet

the total number of synonyms The vast major-ity of synonyms originate from the ESRs (prefixedby cdb22) between Cornetto synsets and WordNetsynsets

In order to evaluate the quality of each re-source for the creation of Open Dutch Wordnetwe randomly evaluated 50 monosemous and pol-ysemous instances The results can be found intable 3

Provenance m p

Google 084 NAWiktionary 086 068Wikipedia 088 062

Omegawiki 090 086Cdb22 Manual 088 074

Cdb22 Auto 080 080Cdb22 None 096 078

Table 3 The evaluation results of randomly se-lected 50 monosemous (m) and polysemous (p) in-stances per resources is shown

Table 3 shows that the overall precision ofthe resource is high as far as the quality of asynonym that bears a certain provenance is con-cerned What it does not show is a fair compar-ison of the quality of each resource because notexactly the same strategy was used to extract in-formation from each resource For example onlymonosemous words were used from the outputfrom Google Overall we observe that 87 ofthe proposed monosemous synonyms were correctin the evaluation whereas this was 76 for thepolysemous synonyms The most valuable exter-

nal resource for Open Dutch WordNet seems to beOmegawiki which is not only present in 136of the LexicalEntry elements but also performedwell in the evaluation For comparison Sevens(Sevens et al 2014) performed an independentevaluation of the equivalence relations in Cornettoand reported precision of 5218 for a samplebased on all synsets and 8894 for a subset thatwas likely to have manually created links Al-though it is difficult to compare both samples forevaluation the precision for Open Dutch Wordnetis thus very much in line with the precision of Cor-netto as reported by them

43 Depth Distribution

66326 synsets in Open Dutch Wordnet still lack asynonym We were interested in knowing in whichpart of the hierarchy these synsets were locatedBreadth-first search was used to calculate synsetdepth Figure 4 presents the distribution of synsetswith and without synonyms per depth layer

020406080100

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

of synsets without synonyms

0 20 40 60 80 100

of synsets with synonyms

Figure 4 For each depth layer in Open DutchWordNet which ranges from the top level 1 to themost deepest layer 17 the percentage of synsets inthat layer with and without synonyms is shown

Figure 4 presents the distribution of synsetswith and without synonyms per depth layer Ingeneral we observe that the top layers have rela-tively few synsets without synonyms whereas theopposite is true for the deeper layers It is likelythat these lower level synsets can be filled easily ifbilingual resources extend their coverage Thesewords usually have a single meaning and only onetranslation

Also the opposite situation occurs that weadded new synsets to the hierarchy that are notin WordNet These synsets appear to be spread

305

over all levels of the hierarchy It is more difficultto resolve these cases since searching for possiblematches in WordNet that could have been missedcan only partially be supported through eg gloss-comparison but in the end needs to be verifiedmanually To support this process we visualizedthese concepts in the hierarchy An example canbe found in Figure 5

Figure 5 In this visualisation pink nodes arenew concepts red nodes are WordNet synsets withDutch synonyms and blue nodes are WordNetsynsets without Dutch synonyms

Figure 5 presents an example of a new con-cept that has been added to the hierarchy Weadded the concept of tramhalte (tram stop) as ahyponym of the concept lsquostoprsquo In general we ob-served that we mostly added concepts that are rep-resented in Dutch by compounds such as polder-landschap (flat barren landscape)

44 Python moduleA Python module has been created to use OpenDutch WordNet The module can be found athttpsgithubcomMartenPostmaOpenDutchWordnet It is designed in Python34 The module allows the user to inspect theLexicalEntry and Synset elements and to gathergeneral statistics about the resource Finally it ispossible to edit the resource using this module

5 Discussion and future work

In this section we discuss the process of creatingOpen Dutch WordNet as well as future work tofurther improve the resource

A part of Open Dutch WordNet consists ofsynonyms that originate from the inter-languagelinks in external resources such as OmegawikiWiktionary and Wikipedia It is interesting toobserve that we obtained mostly noun synonyms

Figure 6 This figure visualizes the noun hy-peronym hierarchie in ODWN The black centernode represents the top noun node (lsquoentitiyrsquo) Inthis visualisation pink nodes are new conceptsred nodes are WordNet synsets with Dutch syn-onyms and blue nodes are WordNet synsets with-out Dutch synonyms

from these resources There are two main rea-sons why this is the case Firstly nouns simplyhave more entries in these resources In additionit is obviously more difficult to disambiguate verbsthan nouns In order to get a better understandingof where we added Dutch noun synonyms we vi-sualized the noun hyperonym hierarchy which canbe found in Figure 6

In Figure 6 the noun hyperonym hierarchyis visualized focusing on which synsets containa Dutch synonym The lower left side shows alarge blue spot which means that no Dutch syn-onyms are located in that part of the hierarchy Weidentified the synset genus (lsquotaxonomic group con-taining one or more species)rsquo as the main hyper-onym of this part In addition we observe pinknodes around the top node which we identified asreligious terms such as Heer (Lord) and Jaweh(Jaweh)

In order to improve the resource we striveto both improve the quality and quantity of the re-source The quality will be improved by manuallyinspecting the synsets ranging from 5 to 10 syn-onyms The quantity will be improved by addingsynonyms in the deeper parts of the resource Thiscan be done by using more or improved publicbilingual resources both English-Dutch but also

306

by combining more languages or by using par-allel corpora In addition we plan to assess themost important parts of the hierarchy This in-volves the top nodes of the hierarchies and thebase concepts Errors in these synsets are likely topropagate to other synsets in lower parts of the hi-erarchy Finally the relations imported from Cor-netto are now added to the PWN relations As aresult we obtained 115077 hyperonym relationsfrom PWN and 19996 hyperonym relations fromCornetto Additional hyperonym relations resultin tangled hierarchies with more complex seman-tics Whereas PWN has 559 top nodes for verbsODWN has 154 tops The reduction of the tops isdue to the additional relations that were created inCornetto to provide more structure to the verb hi-erarchy In Cornetto there are only two top nodesfor the verb hierarchy

Open Dutch WordNet currently contains alimited amount of monosemous adjectives Wehope to be able to map the polysemous adjectivesynsets to PWN synsets by translating the Dutchglosses and by making use of the synset rela-tions in Cornetto and Princeton WordNet BecauseDutch is very close to German another possibil-ity is to map the Cornetto synsets to GermaNet(Hamp and Feldweg 1997) and make use of therich set of synset relations that it provides

Finally the current format of the resource isXML We would also like to make the resourceavailable in RDF (Klyne and Carroll 2006)

6 Conclusion

We described Open Dutch WordNet which is de-rived from the Cornetto database Princeton Word-Net and various external resources We exploitedexisting equivalence relations between Cornettosynsets and WordNet synsets in order to replaceWordNet synonyms by Dutch synonyms In ad-dition the inter-language links in various exter-nal resources such as Wiktionary and Omegawikiwere used to add synonyms to the resource Inaddition we manually evaluated each resourceand manually edited the most problematic synsetsThe Princeton-based hierarchy was also extendedwith manually created relations came from Cor-netto

Open Dutch Wordnet contains 92295 syn-onyms which are located in 51588 synsets Thereare 75173 nouns 15979 verbs and 1143 adjec-tives In total the resource consists of 117914

synsets which leave 66326 synsets still to obtaina synonym The average polysemy is 15

The resource is currently delivered inXML under the CC BY-SA 40 license5 Inorder to use and improve the resource aPython module has been created This mod-ule can be found at httpsgithubcomMartenPostmaOpenDutchWordnet

Acknowledgments

This project has been co-funded by the Neder-landse Taalunie (httptaalunieorg)In addition we thank Anne Broekhuis AnjaStoop Marjolein Klaassen and Amber Wit-senburg for their work on evaluating the ESRsmanually Moreover we thank Isa Maks(httpswwwlinkedincompubisa-maks24b47) and Hennie van derVliet (httpswwwlinkedincompubhennie-van-der-vliet0869512)for their valuable input Finally we wouldlike to thank Adam Rambousek (httpwwwmuniczfipeople60380) forhis help in creating and updating the DebVisDiceditor

ReferencesChristiane Fellbaum 1998 Wordnet An Electronic

Lexical Database MIT Press Cambridge MA

Wikimedia Foundation 2014a Wikipedia httpenwikipediaorg

Wikimedia Foundation 2014b Wiktionary httpenwiktionaryorg

Google 2014 Google translate httpstranslategooglenl

Mark Hall Eibe Frank Geoffrey Holmes BernhardPfahringer Peter Reutemann and Ian H Wit-ten 2009 The weka data mining software anupdate ACM SIGKDD explorations newsletter11(1)10ndash18

Birgit Hamp and Helmut Feldweg 1997 Germanet-alexical-semantic net for german In Proceedingsof ACL workshop Automatic Information Extrac-tion and Building of Lexical Semantic Resourcesfor NLP Applications pages 9ndash15

Graham Klyne and Jeremy J Carroll 2006 Resourcedescription framework (rdf) Concepts and ab-stract syntax

5 httpscreativecommonsorglicensesby-sa40

307

Claudia Leacock and Martin Chodorow 1998 Com-bining local context and wordnet similarity forword sense identification WordNet An elec-tronic lexical database 49(2)265ndash283

George A Miller 1995 Wordnet a Lexical Databasefor English Communications of the ACM38(11)39ndash41

Antoni Oliver 2014 Wn-toolkit Automatic gener-ation of wordnets following the expand modelProceedings of the 7th Global WordNetConfer-ence Tartu Estonia

Ross Quinlan 1993 C45 Programs for MachineLearning Morgan Kaufmann Publishers SanMateo CA

Leen Sevens Vincent Vandeghinste and Frank VanEynde 2014 Improving the precision of synsetlinks between cornetto and princeton wordnetProceedings of the Workshop on Lexical andGrammatical Resources for Language Process-ingColing 2014 Dublin Ireland pages 120ndash126

Jorg Tiedemann 2012 Parallel data tools and inter-faces in opus In LREC pages 2214ndash2218

Hennie Van der Vliet 2007 The Referentiebe-stand Nederlands as a multi-purpose lexicaldatabase International Journal of Lexicography20(3)239ndash257

P Vossen I Maks R Segers and H Vliet 2008van der zutphen h van(2008) the cornettodatabase the architecture and alignment issuesIn Proceedings of the Fourth International Glob-alWordNet Conference-GWC 2008 pages 22ndash25

Piek Vossen Claudia Soria and Monica Monachini2012 Wordnet-lmf a standard representationfor multilingual wordnets In G Francopouloeditor LMF Lexical Markup Framework theoryand practice pages 51ndash66 Hermes LavoisierISTE

Piek Vossen Isa Maks Roxane Segers Hennie van derVliet Marie-Francine Moens Katja HofmannErik Tjong Kim Sang and Maarten de Rijke2013 Cornetto a Combinatorial Lexical Se-mantic Database for Dutch In Jan Odijk Pe-ter Spyns editor Essential Speech and LanguageTechnology for Dutch Theory and Applicationsof Natural Language Processing pages 165ndash184Springer

Piek Vossen 1999 Eurowordnet General documentversion 3 final University of Amsterdam Eu-roWordNet LE2-4003 LE4-8328

Wikipedia 2014 Plagiarism mdash Wikipediathe free encyclopedia httpenwikipediaorgwindexphptitle=Plagiarismampoldid=5139350

308

Page 6: Open Dutch WordNet - Emiel van Miltenburg · We describe Open Dutch WordNet, which has been derived from the Cornetto database, the Princeton WordNet and open source resources. We

Provenance instances of all LE

cdb22 Auto 32806 355cdb22 None 19073 207

wiktionary 17968 195cdb22 Manual 13075 142

omegawiki 12589 136google 8374 91

opus 612 07bing 506 05

wikipedia 375 04

Table 2 The number of synonyms from each re-source is shown In addition the second columnindicates what percentage this number is relativeto all synonyms in Open Dutch Wordnet

the total number of synonyms The vast major-ity of synonyms originate from the ESRs (prefixedby cdb22) between Cornetto synsets and WordNetsynsets

In order to evaluate the quality of each re-source for the creation of Open Dutch Wordnetwe randomly evaluated 50 monosemous and pol-ysemous instances The results can be found intable 3

Provenance m p

Google 084 NAWiktionary 086 068Wikipedia 088 062

Omegawiki 090 086Cdb22 Manual 088 074

Cdb22 Auto 080 080Cdb22 None 096 078

Table 3 The evaluation results of randomly se-lected 50 monosemous (m) and polysemous (p) in-stances per resources is shown

Table 3 shows that the overall precision ofthe resource is high as far as the quality of asynonym that bears a certain provenance is con-cerned What it does not show is a fair compar-ison of the quality of each resource because notexactly the same strategy was used to extract in-formation from each resource For example onlymonosemous words were used from the outputfrom Google Overall we observe that 87 ofthe proposed monosemous synonyms were correctin the evaluation whereas this was 76 for thepolysemous synonyms The most valuable exter-

nal resource for Open Dutch WordNet seems to beOmegawiki which is not only present in 136of the LexicalEntry elements but also performedwell in the evaluation For comparison Sevens(Sevens et al 2014) performed an independentevaluation of the equivalence relations in Cornettoand reported precision of 5218 for a samplebased on all synsets and 8894 for a subset thatwas likely to have manually created links Al-though it is difficult to compare both samples forevaluation the precision for Open Dutch Wordnetis thus very much in line with the precision of Cor-netto as reported by them

43 Depth Distribution

66326 synsets in Open Dutch Wordnet still lack asynonym We were interested in knowing in whichpart of the hierarchy these synsets were locatedBreadth-first search was used to calculate synsetdepth Figure 4 presents the distribution of synsetswith and without synonyms per depth layer

020406080100

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

of synsets without synonyms

0 20 40 60 80 100

of synsets with synonyms

Figure 4 For each depth layer in Open DutchWordNet which ranges from the top level 1 to themost deepest layer 17 the percentage of synsets inthat layer with and without synonyms is shown

Figure 4 presents the distribution of synsetswith and without synonyms per depth layer Ingeneral we observe that the top layers have rela-tively few synsets without synonyms whereas theopposite is true for the deeper layers It is likelythat these lower level synsets can be filled easily ifbilingual resources extend their coverage Thesewords usually have a single meaning and only onetranslation

Also the opposite situation occurs that weadded new synsets to the hierarchy that are notin WordNet These synsets appear to be spread

305

over all levels of the hierarchy It is more difficultto resolve these cases since searching for possiblematches in WordNet that could have been missedcan only partially be supported through eg gloss-comparison but in the end needs to be verifiedmanually To support this process we visualizedthese concepts in the hierarchy An example canbe found in Figure 5

Figure 5 In this visualisation pink nodes arenew concepts red nodes are WordNet synsets withDutch synonyms and blue nodes are WordNetsynsets without Dutch synonyms

Figure 5 presents an example of a new con-cept that has been added to the hierarchy Weadded the concept of tramhalte (tram stop) as ahyponym of the concept lsquostoprsquo In general we ob-served that we mostly added concepts that are rep-resented in Dutch by compounds such as polder-landschap (flat barren landscape)

44 Python moduleA Python module has been created to use OpenDutch WordNet The module can be found athttpsgithubcomMartenPostmaOpenDutchWordnet It is designed in Python34 The module allows the user to inspect theLexicalEntry and Synset elements and to gathergeneral statistics about the resource Finally it ispossible to edit the resource using this module

5 Discussion and future work

In this section we discuss the process of creatingOpen Dutch WordNet as well as future work tofurther improve the resource

A part of Open Dutch WordNet consists of synonyms that originate from the inter-language links in external resources such as Omegawiki, Wiktionary and Wikipedia. It is interesting to observe that we obtained mostly noun synonyms from these resources. There are two main reasons why this is the case. Firstly, nouns simply have more entries in these resources. In addition, it is obviously more difficult to disambiguate verbs than nouns. In order to get a better understanding of where we added Dutch noun synonyms, we visualized the noun hyperonym hierarchy, which can be found in Figure 6.

Figure 6: This figure visualizes the noun hyperonym hierarchy in ODWN. The black center node represents the top noun node ('entity'). In this visualisation, pink nodes are new concepts, red nodes are WordNet synsets with Dutch synonyms, and blue nodes are WordNet synsets without Dutch synonyms.

In Figure 6, the noun hyperonym hierarchy is visualized, focusing on which synsets contain a Dutch synonym. The lower left side shows a large blue spot, which means that no Dutch synonyms are located in that part of the hierarchy. We identified the synset genus ('taxonomic group containing one or more species') as the main hyperonym of this part. In addition, we observe pink nodes around the top node, which we identified as religious terms such as Heer (Lord) and Jaweh (Jaweh).

In order to improve the resource, we strive to improve both its quality and its quantity. The quality will be improved by manually inspecting the synsets that contain between 5 and 10 synonyms. The quantity will be improved by adding synonyms in the deeper parts of the resource. This can be done by using more or improved public bilingual resources, both English-Dutch but also by combining more languages, or by using parallel corpora. In addition, we plan to assess the most important parts of the hierarchy. This involves the top nodes of the hierarchies and the base concepts. Errors in these synsets are likely to propagate to other synsets in lower parts of the hierarchy. Finally, the relations imported from Cornetto are now added to the PWN relations. As a result, we obtained 115,077 hyperonym relations from PWN and 19,996 hyperonym relations from Cornetto. Additional hyperonym relations result in tangled hierarchies with more complex semantics. Whereas PWN has 559 top nodes for verbs, ODWN has 154 tops. The reduction of the tops is due to the additional relations that were created in Cornetto to provide more structure to the verb hierarchy. In Cornetto, there are only two top nodes for the verb hierarchy.
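The effect of additional hyperonym relations on the number of top nodes can be illustrated with a toy example. The sketch below treats a top node as a synset without any hyperonym and compares the count before and after a second set of relations is merged in. The synset names and relations are invented for this illustration and are not taken from PWN or Cornetto.

```python
def top_nodes(synsets, hyperonym_of):
    """Return the synsets that have no hyperonym (the tops of the hierarchy).

    hyperonym_of is an iterable of (hyponym, hyperonym) pairs.
    """
    has_parent = {child for child, _parent in hyperonym_of}
    return set(synsets) - has_parent


# Hypothetical toy verb hierarchy.
synsets = {"act", "move", "walk", "change", "grow"}
first_relations = {("walk", "move"), ("grow", "change")}
extra_relations = {("move", "act"), ("change", "act")}

tops_before = top_nodes(synsets, first_relations)
tops_after = top_nodes(synsets, first_relations | extra_relations)

print(sorted(tops_before))  # three tops: act, change, move
print(sorted(tops_after))   # one top: act
```

Every relation that gives a former top node a hyperonym removes it from the set of tops, which is why merging in a more structured verb hierarchy lowers the count.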

Open Dutch WordNet currently contains a limited number of monosemous adjectives. We hope to be able to map the polysemous adjective synsets to PWN synsets by translating the Dutch glosses and by making use of the synset relations in Cornetto and Princeton WordNet. Because Dutch is very close to German, another possibility is to map the Cornetto synsets to GermaNet (Hamp and Feldweg, 1997) and make use of the rich set of synset relations that it provides.

Finally, the current format of the resource is XML. We would also like to make the resource available in RDF (Klyne and Carroll, 2006).

6 Conclusion

We described Open Dutch WordNet, which is derived from the Cornetto database, Princeton WordNet and various external resources. We exploited existing equivalence relations between Cornetto synsets and WordNet synsets in order to replace WordNet synonyms by Dutch synonyms. In addition, the inter-language links in various external resources such as Wiktionary and Omegawiki were used to add synonyms to the resource. Furthermore, we manually evaluated each resource and manually edited the most problematic synsets. The Princeton-based hierarchy was also extended with manually created relations that came from Cornetto.

Open Dutch WordNet contains 92,295 synonyms, which are located in 51,588 synsets. There are 75,173 nouns, 15,979 verbs and 1,143 adjectives. In total, the resource consists of 117,914 synsets, which leaves 66,326 synsets still to obtain a synonym. The average polysemy is 1.5.
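The counts reported above are mutually consistent, which the following short check makes explicit. The variable names are chosen for this illustration; the figures are the ones stated in the text. The coverage ratio at the end is a derived value, not a number reported by the authors.

```python
# Figures as reported in the text.
nouns, verbs, adjectives = 75173, 15979, 1143
synonyms = 92295
synsets_total = 117914
synsets_with_dutch = 51588
synsets_without_dutch = 66326

# The per-category counts add up to the total number of synonyms.
assert nouns + verbs + adjectives == synonyms

# Filled and unfilled synsets add up to the total number of synsets.
assert synsets_with_dutch + synsets_without_dutch == synsets_total

# Derived value: share of synsets with at least one Dutch synonym.
coverage = synsets_with_dutch / synsets_total
print(f"coverage: {coverage:.1%}")
```

In other words, somewhat under half of the synsets currently contain a Dutch synonym.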

The resource is currently delivered in XML under the CC BY-SA 4.0 license (footnote 5). In order to use and improve the resource, a Python module has been created. This module can be found at https://github.com/MartenPostma/OpenDutchWordnet.

Acknowledgments

This project has been co-funded by the Nederlandse Taalunie (http://taalunie.org). In addition, we thank Anne Broekhuis, Anja Stoop, Marjolein Klaassen and Amber Witsenburg for their work on evaluating the ESRs manually. Moreover, we thank Isa Maks (https://www.linkedin.com/pub/isa-maks24b47) and Hennie van der Vliet (https://www.linkedin.com/pub/hennie-van-der-vliet0869512) for their valuable input. Finally, we would like to thank Adam Rambousek (http://www.muni.cz/fi/people/60380) for his help in creating and updating the DebVisDic editor.

References

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Wikimedia Foundation. 2014a. Wikipedia. http://en.wikipedia.org.

Wikimedia Foundation. 2014b. Wiktionary. http://en.wiktionary.org.

Google. 2014. Google translate. https://translate.google.nl.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a lexical-semantic net for German. In Proceedings of ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15.

Graham Klyne and Jeremy J. Carroll. 2006. Resource description framework (RDF): Concepts and abstract syntax.

5 https://creativecommons.org/licenses/by-sa/4.0/

Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet similarity for word sense identification. WordNet: An electronic lexical database, 49(2):265–283.

George A. Miller. 1995. WordNet: a Lexical Database for English. Communications of the ACM, 38(11):39–41.

Antoni Oliver. 2014. WN-Toolkit: Automatic generation of wordnets following the expand model. Proceedings of the 7th Global WordNet Conference, Tartu, Estonia.

Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Leen Sevens, Vincent Vandeghinste, and Frank Van Eynde. 2014. Improving the precision of synset links between Cornetto and Princeton WordNet. Proceedings of the Workshop on Lexical and Grammatical Resources for Language Processing, Coling 2014, Dublin, Ireland, pages 120–126.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, pages 2214–2218.

Hennie Van der Vliet. 2007. The Referentiebestand Nederlands as a multi-purpose lexical database. International Journal of Lexicography, 20(3):239–257.

P. Vossen, I. Maks, R. Segers, H. van der Vliet, and H. van Zutphen. 2008. The Cornetto database: the architecture and alignment issues. In Proceedings of the Fourth International Global WordNet Conference - GWC 2008, pages 22–25.

Piek Vossen, Claudia Soria, and Monica Monachini. 2012. Wordnet-LMF: a standard representation for multilingual wordnets. In G. Francopoulo, editor, LMF: Lexical Markup Framework, theory and practice, pages 51–66. Hermes / Lavoisier / ISTE.

Piek Vossen, Isa Maks, Roxane Segers, Hennie van der Vliet, Marie-Francine Moens, Katja Hofmann, Erik Tjong Kim Sang, and Maarten de Rijke. 2013. Cornetto: a Combinatorial Lexical Semantic Database for Dutch. In Jan Odijk and Peter Spyns, editors, Essential Speech and Language Technology for Dutch, Theory and Applications of Natural Language Processing, pages 165–184. Springer.

Piek Vossen. 1999. EuroWordNet General Document, version 3, final. University of Amsterdam, EuroWordNet LE2-4003, LE4-8328.

Wikipedia. 2014. Plagiarism - Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Plagiarism&oldid=5139350.
