Ewa Rudnicka, Marek Maziarz, Maciej Piasecki

G4.19 Research Group, Institute of Informatics,

Wrocław University of Technology

nlp.pwr.wroc.pl

plwordnet.pwr.wroc.pl

What is a wordnet?

Princeton WordNet (Fellbaum 1998)

- a huge electronic lexical database: a kind of thesaurus, yet with a much more advanced structure

- words grouped into synonym sets called synsets

- synsets linked via different lexico-semantic relations (such as synonymy, near-synonymy, hypernymy/hyponymy, meronymy/holonymy, antonymy, fuzzynymy)

- the integration of lexical data gathered from existing resources such as traditional and electronic dictionaries, as well as from corpora

- psycholinguistic principles: the structure of human lexical memory (cf. Miller 1998)

- taxonomic hierarchies for nouns, entailment relations for verbs

WordNet – a lexico-semantic database

multi-lingual databases consisting of inter-linked 'national'/mono-lingual wordnets:

EuroWordNet - transfer method – translation from Princeton WordNet

Dutch, Spanish, Italian, French, German, Czech and Estonian (cf. Vossen 2002)

MultiWordNet - semi-automatic acquisition method from the Princeton WordNet

Italian, Spanish, Portuguese, Romanian and Latin (Bentivogli et al.)

IndoWordNet (Sinha et al. 2006, Bhattacharyya 2010)

expansion approach from the Hindi wordnet; 16 out of 22 languages of India

Multi-lingual wordnets

plWordNet (Słowosieć)

plWordNet – developed fairly independently of Princeton WordNet by applying a unique corpus-based method

one of the biggest existing wordnets

the emphasis on relations between lexical units, not between synsets

many more relations, some of them specially designed to cover the peculiarities of the morphosyntactic structure of Polish

(cf. Piasecki et al. 2009, Maziarz et al. 2012)

Basic common concepts:

Lemma – the base form that represents a word's different inflectional forms and different meanings

Lexical unit – a lemma paired with a sense (in wordnets the sense is marked with a number)

Synset – a set of synonymous lexical units
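The three concepts above can be sketched as a small data model. This is only an illustration: the class names `LexicalUnit` and `Synset` and the helper `make_synset` are assumptions made for this sketch, not the schema of plWordNet or Princeton WordNet. The example synset is the one used later in the mapping procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma paired with a sense number, e.g. ('zagranica', 1)."""
    lemma: str
    sense: int

    def __str__(self) -> str:
        return f"{self.lemma} {self.sense}"


@dataclass(frozen=True)
class Synset:
    """A set of lexical units treated as synonymous."""
    units: frozenset

    def lemmas(self) -> set:
        # The same lemma may occur in other synsets with another sense number.
        return {unit.lemma for unit in self.units}

    def __str__(self) -> str:
        return "{" + ", ".join(sorted(str(u) for u in self.units)) + "}"


def make_synset(*pairs) -> Synset:
    """Hypothetical helper: build a synset from (lemma, sense) pairs."""
    return Synset(frozenset(LexicalUnit(lemma, sense) for lemma, sense in pairs))


abroad = make_synset(("zagranica", 1), ("obczyzna", 1), ("obce terytorium", 1))
print(abroad)  # {obce terytorium 1, obczyzna 1, zagranica 1}
```

Frozen dataclasses are used so that lexical units and synsets are hashable and can serve as nodes in a relation graph.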

Differences:

plWN – synsets built of lexical units sharing the same constitutive relations (such as hyponymy, hypernymy, meronymy, holonymy)

PWN – a synset represents a 'lexicalised concept' (cf. Miller 1998); synsets built of lexical units linked by the synonymy relation, understood as a conceptual relation established on the basis of linguists' intuitions and dictionary definitions
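The plWN view of a synset can be illustrated with a toy grouping: lexical units that share exactly the same constitutive relations end up in one synset. The sketch below is a simplification under stated assumptions. The function name `group_by_relations` is hypothetical, the relation profiles of the three Polish units follow the example synset from the mapping procedure, and the unit `other 1` with its target `X` is invented purely for contrast.

```python
from collections import defaultdict

# Toy data: each lexical unit maps to its constitutive relations,
# written as (relation name, target) pairs. "other 1" and "X" are made up.
unit_relations = {
    "zagranica 1": {("hyponymy", "obszar 1"), ("meronymy", "świat 3")},
    "obczyzna 1": {("hyponymy", "obszar 1"), ("meronymy", "świat 3")},
    "obce terytorium 1": {("hyponymy", "obszar 1"), ("meronymy", "świat 3")},
    "other 1": {("hyponymy", "X")},
}


def group_by_relations(unit_relations):
    """Group lexical units whose sets of constitutive relations are identical."""
    groups = defaultdict(set)
    for unit, relations in unit_relations.items():
        groups[frozenset(relations)].add(unit)
    return [frozenset(units) for units in groups.values()]


for synset in group_by_relations(unit_relations):
    print(sorted(synset))
```

In PWN, by contrast, membership would be decided by a judgement of conceptual synonymy, which cannot be reduced to a comparison of relation sets like this.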

plWordNet vs. Princeton WordNet

Mapping plWordNet on Princeton WordNet

Linking plWordNet synsets with Princeton Wordnet synsets

Defining a set of inter-lingual relations

Setting a hierarchy of inter-lingual relations

Designing mapping procedure

Mapping direction: plWordNet > Princeton WordNet

Domains selected for mapping:

person, artefact, location, family relationships, food, time, vocabulary connected with thinking and communication

a novel perspective – linking two independent systems

the main challenge – different philosophical, theoretical and methodological assumptions

Inter-lingual relations hierarchy

A set of inter-lingual relations inspired by:

- inter-lingual relations from EuroWordNet (Vossen 2002)

- intra-lingual relations from plWordNet (Maziarz et al. 2011)

1. Synonymy

2. Partial synonymy

3. Inter-register synonymy

4. Hyponymy

5. Hypernymy

6. Meronymy

7. Holonymy
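One way to read the numbered hierarchy is as a priority order: a relation lower on the list is used only when no higher one applies. The sketch below encodes that reading. It is an assumption about how the hierarchy is applied, and the names `InterlingualRelation` and `choose_relation` are hypothetical.

```python
from enum import IntEnum


class InterlingualRelation(IntEnum):
    """Inter-lingual relations; a lower value means a higher priority."""
    SYNONYMY = 1
    PARTIAL_SYNONYMY = 2
    INTER_REGISTER_SYNONYMY = 3
    HYPONYMY = 4
    HYPERNYMY = 5
    MERONYMY = 6
    HOLONYMY = 7


def choose_relation(applicable):
    """Return the highest-priority relation among those judged applicable.

    `applicable` is the set of relations a lexicographer accepted for a
    source/target pair; None is returned when the set is empty.
    """
    return min(applicable, default=None)


picked = choose_relation(
    {InterlingualRelation.MERONYMY, InterlingualRelation.HYPONYMY}
)
print(picked.name)  # HYPONYMY
```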

Inter-lingual relations (1)

Synonymy (at most one per synset) - for a close correspondence in sense and in position in the wordnet structure, combined with many indirect inter-lingual links between the source and target synsets

Inter-register synonymy - for I-synonyms as defined above, but differing in stylistic register

Partial synonymy - in the case of partial correspondence of meanings and/or structures

Partial synonymy

Inter-lingual relations (2)

Inter-lingual hyponymy - defined in terms of inclusion of denotation sets: the objects a hyponym refers to are included in the denotation set of its hypernym

Inter-lingual hypernymy - defined in terms of inclusion of denotation sets: the denotation set of a hypernym includes the objects its hyponyms refer to

Inter-lingual meronymy - for parts, elements or materials of bigger wholes

Inter-lingual holonymy - for a whole made of smaller parts, elements or materials
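The set-inclusion definitions of inter-lingual hyponymy and hypernymy can be made concrete with ordinary sets. This is a minimal sketch: the function names are hypothetical, the denotation sets are invented toy data, and requiring a proper subset is an assumption (equal denotation sets would point to synonymy instead). Meronymy and holonymy are part-whole relations and are not modelled by set inclusion here.

```python
def is_interlingual_hyponym(narrow: set, broad: set) -> bool:
    """True if every object denoted by `narrow` is also denoted by `broad`.

    A proper subset is required (an assumption of this sketch).
    """
    return narrow < broad


def is_interlingual_hypernym(broad: set, narrow: set) -> bool:
    """Hypernymy is the converse of hyponymy."""
    return is_interlingual_hyponym(narrow, broad)


# Toy denotation sets, invented for illustration.
state = {"Poland", "France", "Japan"}
foreign_country = {"France", "Japan"}  # from a Polish citizen's point of view

print(is_interlingual_hyponym(foreign_country, state))   # True
print(is_interlingual_hypernym(state, foreign_country))  # True
print(is_interlingual_hyponym(state, foreign_country))   # False
```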

Mapping procedure (1)

Recognizing the sense of a source synset:

- checking its position in the network structure (all existing relations, with an emphasis on hypernyms and hyponyms), its definitions and commentaries, and comparing other synsets containing the given lemma

Example:

{zagranica 1, obczyzna 1, obce terytorium 1} ('abroad, foreign land, foreign territory'):

- is a hyponym of {obszar 1, terytorium 1, obręb 1, strefa 1, zona 1, rejon 3}; commentary: 'ograniczona część przestrzeni, zwykle dużych rozmiarów, określona powierzchnia czegoś (np. obszar państwa)' - 'a limited part of space, usually of large size, a defined surface of something (e.g. the territory of a state)'

- is a meronym of {świat 3, nieznane 1} – 'world, unknown territory'

- is a fuzzynym of {granica państwa 1} – 'state border'

Mapping procedure (2)

Searching for a target synset:

- choosing candidates for a target synset with the help of intuitions, automatic prompts and dictionaries:

e.g. {foreign country 1} - 'any state of which one is not a citizen' – is a hyponym of {state 1, nation 1, country 1, land 9, commonwealth 2, res publica 1, body politic 1} - 'a politically organized body of people under a single government'

- verifying candidates for a target synset: comparing their hypernymic and hyponymic structures (and others, if such exist) with those of the source synset; checking the existing and/or potential inter-lingual relations, definitions, commentaries and dictionaries

{state 1, ..} is an inter-lingual hyponym of {państwo 1, kraj 1} -

'zorganizowana politycznie społeczność, zamieszkująca określone terytorium, z niepodległą formą rządów' – 'a politically organised community, inhabiting a certain territory, with an independent form of government'
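The verification step compares the neighbourhood of the source synset with that of a candidate. One simple proxy, assumed here for illustration and not taken from the slides, is to count how many neighbours of the source synset are already linked to a neighbour of the candidate. The function name `indirect_support` is hypothetical, synsets are abbreviated to their first lexical unit, and the single existing link reflects the statement that {state 1, ..} is an inter-lingual hyponym of {państwo 1, kraj 1}.

```python
def indirect_support(source_neighbours, target_neighbours, existing_links):
    """Count source neighbours already linked to some neighbour of the candidate.

    `existing_links` holds (plWordNet synset, Princeton WordNet synset) pairs,
    regardless of which inter-lingual relation connects them.
    """
    return sum(
        1
        for s in source_neighbours
        if any((s, t) in existing_links for t in target_neighbours)
    )


# Links established earlier in the mapping (abbreviated synset labels).
existing_links = {("państwo 1", "state 1")}

# Neighbours of {zagranica 1, ...}: hypernym, holonym and fuzzynym.
source_neighbours = {"obszar 1", "świat 3", "granica państwa 1"}
# Neighbours of the candidate {foreign country 1}: its hypernym.
target_neighbours = {"state 1"}

print(indirect_support(source_neighbours, target_neighbours, existing_links))
```

A count of zero, as in this toy run, would be one signal that the two synsets occupy different positions in their networks, which fits the later rejection of synonymy for this pair.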

Mapping procedure (3)

Choosing a target synset and an inter-lingual relation: {foreign country 1}

Synonymy – no (different meaning, structures and relations)

Hyponymy – no (meaning, structures and relations do not qualify as a subtype)

Meronymy – yes (meaning, structures and relations qualify as a part)

Linking the source synset with the target synset:
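The linking step can be sketched as storing a (source, relation, target) record. The `LinkStore` class below is a hypothetical illustration, not the tool used by the authors. It also enforces the rule stated earlier that a synset takes only one inter-lingual synonymy link. The example link follows the verdicts listed on this slide.

```python
class LinkStore:
    """Keeps inter-lingual links as (source, relation, target) triples."""

    def __init__(self):
        self.links = []

    def add(self, source, relation, target):
        # At most one inter-lingual synonym per source synset.
        if relation == "synonymy" and any(
            s == source and r == "synonymy" for s, r, _ in self.links
        ):
            raise ValueError(f"{source} already has an inter-lingual synonym")
        self.links.append((source, relation, target))


store = LinkStore()
store.add(
    "{zagranica 1, obczyzna 1, obce terytorium 1}",
    "meronymy",
    "{foreign country 1}",
)
print(store.links)
```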

Results of inter-lingual mapping

About 46 500 inter-lingual relations (links) between synsets, which amounts to about 50 000 relations between lexical units

• Synonymy - 15268

• Partial synonymy - 971

• Inter-register synonymy - 676

• Hyponymy - 23677

• Hypernymy - 3526

• Meronymy - 1898

• Holonymy - 555

• Mapped branches: people, artefacts, places, food, time units, communication (partly), states and processes (partly), body parts (partly), group names (partly)
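As a quick arithmetic check, the per-relation counts above can be summed and turned into shares. The figures are copied from the slide; only the variable names are introduced here. The counts sum to 46 571, consistent with the "about 46 500" total.

```python
# Inter-lingual relation counts as reported on the slide.
counts = {
    "Synonymy": 15268,
    "Partial synonymy": 971,
    "Inter-register synonymy": 676,
    "Hyponymy": 23677,
    "Hypernymy": 3526,
    "Meronymy": 1898,
    "Holonymy": 555,
}

total = sum(counts.values())
print(total)  # 46571

# Share of each relation in the total, largest first.
for name, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<25}{n:>6}  {100 * n / total:5.1f}%")
```

Hyponymy accounts for roughly half of all links and synonymy for roughly a third.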

Mapping direction: plWordNet – Princeton WordNet

Bottom-up approach – starting from the lowest levels in the hierarchy

Inter-lingual lexico-grammatical differences:

- marked forms (diminutives, augmentatives)

- lexicalised gender

- lexical gaps

Differences in the definition of synonymy and synset:

- 'Mixed' PWN synsets: marked and unmarked forms, feminine and masculine, countable and uncountable, hypernym and hyponym

- hypernymy: 'and' (plWN) vs. 'and/or' (PWN)

Other differences:

- synset definitions incompatible with relations (PWN)

- different relations used for coding the same conceptual dependencies

- more fine-grained meaning differentiation

- differences boiling down to the content and size of resources

Types of differences between plWN and PWN

Marked forms

Differences in lexicalisation

Hyponymy

Different relations for coding the same conceptual dependencies

References

Fellbaum, Ch. (ed). 1998. WordNet: An Electronic Lexical Database. MIT Press: Cambridge, Massachusetts.

Maziarz, M., Piasecki, M. and S. Szpakowicz. 2012. Approaching plWordNet 2.0. Proceedings of the 6th Global Wordnet Conference, Matsue. pp. 189-196. accepted for publication.

Piasecki, M., Szpakowicz, S. and B. Broda. 2009. A Wordnet from the Ground Up. Oficyna Wydawnicza Politechniki Wrocławskiej: Wrocław.

Princeton WordNet http://wordnet.princeton.edu/wordnet/

Słowosieć http://plwordnet.pwr.wroc.pl/wordnet/

Vossen, P. (ed). 2002. EuroWordNet. General Document. Amsterdam.