Automatic Lexicon Generation through WordNet

by Nitin Verma and Pushpak Bhattacharyya, CSE Department, I.I.T. Bombay, Jan 21, 2004

Page 1: Automatic Lexicon Generation through WordNet

CSE Department, I.I.T. Bombay

Automatic Lexicon Generation through WordNet

by

Nitin Verma and Pushpak Bhattacharyya

Jan 21, 2004

Page 2: Automatic Lexicon Generation through WordNet

Introduction

– A lexicon is the heart of any natural language processing system.
– It is difficult to construct, requiring an enormous amount of time and manpower.
– Document-specific dictionary generation:
  – Given a document D and a word W therein, which sense S of W should be picked for the document?
  – Can one construct a document-specific dictionary in which a single sense of each word is stored?

Page 3: Automatic Lexicon Generation through WordNet

Introduction

UW Dictionary
– An important machine-readable lexical resource used by the enconverter and deconverter software.

[Diagram: Natural Language -> Enconverter -> UNL. The Enconverter uses the UW Dictionary and the Analysis Rules.]

Page 4: Automatic Lexicon Generation through WordNet

Introduction (UW dictionary)

Format of dictionary entries:

[crane] “crane(icl>bird)” (N, ANIMT, FAUNA, BIRD);

– [crane] is the head word (HW).
– “crane(icl>bird)” is the universal word (UW); (icl>bird) is its restriction.
– (N, ANIMT, FAUNA, BIRD) are the attributes (both syntactic and semantic).

Attributes:
– Semantic attributes (derived from the ontology).
– Syntactic attributes (POS, person, number, tense).
– Used for the firing of appropriate analysis rules.
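The entry format above can be read mechanically. The following is a minimal sketch, assuming every entry follows the exact shape [HW] "UW" (ATTR, ...); shown on the slide. The names Entry and parse_entry are illustrative only and are not part of the enconverter or deconverter software.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    headword: str
    uw: str
    restriction: str
    attributes: list


# Assumed shape of one entry: [HW] "UW" (ATTR, ATTR, ...);
ENTRY_RE = re.compile(
    r'\[(?P<hw>[^\]]+)\]\s*"(?P<uw>[^"]+)"\s*\((?P<attrs>[^)]*)\)\s*;'
)


def parse_entry(line: str) -> Entry:
    # Slides often carry typographic quotes; normalise them first.
    line = line.replace("“", '"').replace("”", '"')
    match = ENTRY_RE.search(line)
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    # Drop the space between the word and its restriction.
    uw = match.group("uw").replace(" ", "")
    # The restriction is whatever sits inside the parentheses of the UW.
    inner = re.search(r"\((.*)\)", uw)
    restriction = inner.group(1) if inner else ""
    attrs = [a.strip() for a in match.group("attrs").split(",") if a.strip()]
    return Entry(match.group("hw").strip(), uw, restriction, attrs)


entry = parse_entry('[crane] "crane (icl>bird)" (N, ANIMT, FAUNA, BIRD);')
print(entry.headword, entry.uw, entry.restriction, entry.attributes)
```

Keeping the restriction as a separate field makes it easy to check an entry against the ontology-derived attributes later on.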

Page 5: Automatic Lexicon Generation through WordNet

Introduction

Ontology*

Animate (ANIMT)
– Flora (FLORA)
  – Shrubs (ANIMT, FLORA, SHRB), e.g. jasmine
  – Aquatic plants (ANIMT, FLORA, AQTC), e.g. lotus
  – ….
– Fauna (FAUNA)
  – Mammals (MML)
  – Reptiles (ANIMT, FAUNA, RPTL), e.g. lizard
  – Birds (ANIMT, FAUNA, BIRD)
  – Fish (ANIMT, FAUNA, FISH)
  – Insects (ANIMT, FAUNA, INSCT), e.g. butterfly
  – ……

*Dictionary group, CFILT, IIT Bombay.

Page 6: Automatic Lexicon Generation through WordNet

English-UW dictionary generation

Page 7: Automatic Lexicon Generation through WordNet

English-UW dictionary generation

– Resources used: English WordNet, a WSD* system (soft word sense disambiguation method), the UNL KB and an inferencer.
– Knowledge-based approach.

* G. Ramakrishnan and P. Bhattacharyya. Soft Word Sense Disambiguation, GWN 2004

Page 8: Automatic Lexicon Generation through WordNet

English-UW dictionary generation

Method

– Stage 1 – the input document (word1 word2 ...) is passed through the WSD* system, which produces a POS and sense tagged document (word1:N:1, word2:N:3, ...).
– Stage 2 – the tagged document is passed to the inference engine, which produces the UW dictionary.
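The tagged output of Stage 1 has the form word:POS:sense. A minimal sketch of reading it follows, assuming whitespace-separated tokens; TaggedWord and parse_tagged are hypothetical names, and "fly:V:1" is an invented token used only to show a second part of speech.

```python
from typing import NamedTuple


class TaggedWord(NamedTuple):
    word: str
    pos: str    # part of speech, e.g. "N" or "V"
    sense: int  # sense number assigned by the WSD system


def parse_tagged(text: str) -> list:
    tagged = []
    for token in text.split():
        # Split from the right so a colon inside the word itself survives.
        word, pos, sense = token.rsplit(":", 2)
        tagged.append(TaggedWord(word, pos, int(sense)))
    return tagged


tagged_doc = parse_tagged("crane:N:4 fly:V:1")
for item in tagged_doc:
    print(item.word, item.pos, item.sense)
```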

Page 9: Automatic Lexicon Generation through WordNet

English-UW dictionary generation (Method)

[Diagram: Tagged Document (word1:pos1:sense1, word2:pos2:sense2, ...) -> Inference Engine -> UW Dictionary and Explanation. The Inference Engine consults the KB (WordNet, UNL KB) and the Database of rules.]

Page 10: Automatic Lexicon Generation through WordNet

UW generation for nouns

UW generation

Page 17: Automatic Lexicon Generation through WordNet

UW generation for nouns

[Diagram: the Tagged Document (crane:N:4, word2:pos2:sense2, ...) feeds the Inference Engine, which consults the KB (WordNet and UNL KB).]

1. The entry crane:N:4 is taken from the tagged document.
2. A query to collect semantic information is issued.
3. The hypernym chain is returned: crane, bird, (fauna, animal), organism.
4. A query to collect relevant rules is issued.
5. The matching rows are returned:

   depth  word          relation  restriction
   6      bird          icl       animal
   5      animal        icl       living thing
   4      living thing  null      null

6. The UW is generated: crane(icl>bird)
7. Explanation
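The noun walkthrough can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' inference engine: the hypernym chain and the UNL KB rows are hard-coded from the slide, and the rule "use the nearest hypernym that appears in the UNL KB as the icl restriction" is an assumption that happens to reproduce crane(icl>bird).

```python
# Toy stand-in for a WordNet lookup, keyed by (word, POS, sense).
HYPERNYM_CHAIN = {
    ("crane", "N", 4): ["bird", "fauna", "animal", "organism"],
}

# UNL KB rows from the slide: (depth, word, relation, restriction).
UNL_KB = [
    (6, "bird", "icl", "animal"),
    (5, "animal", "icl", "living thing"),
    (4, "living thing", None, None),
]


def noun_uw(word: str, pos: str, sense: int) -> str:
    kb_words = {row[1] for row in UNL_KB}
    # Walk from the most specific hypernym upwards (assumed strategy).
    for hypernym in HYPERNYM_CHAIN.get((word, pos, sense), []):
        if hypernym in kb_words:
            return f"{word}(icl>{hypernym})"
    # No hypernym is known to the KB: return the bare word.
    return word


print(noun_uw("crane", "N", 4))
```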

Page 18: Automatic Lexicon Generation through WordNet

UW generation for verbs

UW generation

Page 19: Automatic Lexicon Generation through WordNet

UW generation for verbs

Input word

1. Is {hypernyms(word)} ∩ {‘be’, ‘continue’, etc.} ≠ ∅ ?
   – true: (icl>be), e.g. exist(icl>be)
   – false: go to 2.
2. Is {hypernyms(nominal word)} ∩ {‘phenomenon’, ‘natural event’, etc.} ≠ ∅ ?
   – true: (icl>occur), e.g. rain(icl>occur)
   – false: (icl>do), e.g. make(icl>do)
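The verb flowchart reduces to two set-intersection tests. The sketch below reads each test as "the intersection is non-empty". The two marker sets contain only the words named on the slide (the slide's "etc." is not filled in), and the hypernym lists passed in are illustrative values, not real WordNet output.

```python
# Marker sets named on the slide; the full sets are not given there.
BE_MARKERS = {"be", "continue"}
OCCUR_MARKERS = {"phenomenon", "natural event"}


def verb_uw(verb: str, verb_hypernyms, nominal_hypernyms) -> str:
    # Test 1: hypernyms of the verb itself.
    if set(verb_hypernyms) & BE_MARKERS:
        return f"{verb}(icl>be)"
    # Test 2: hypernyms of the verb's nominal form.
    if set(nominal_hypernyms) & OCCUR_MARKERS:
        return f"{verb}(icl>occur)"
    # Default: an action verb.
    return f"{verb}(icl>do)"


# Illustrative hypernym lists (assumed for the example).
print(verb_uw("exist", ["be"], []))
print(verb_uw("rain", ["fall"], ["precipitation", "phenomenon"]))
print(verb_uw("make", ["create"], []))
```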

Page 20: Automatic Lexicon Generation through WordNet

UW generation for adjectives

Input word

1. Is the UW present in the UNL KB?
   – Yes: pick the UW, e.g. broad(aoj>thing)
   – No: go to 2.
2. Is IS_DEFINED (is_a_value_of relation) true on the input word?
   – Yes: (aoj>thing), e.g. good(aoj>thing)
   – No: (mod>thing), e.g. green(mod>thing)
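The adjective decision procedure is a two-level fallback. A minimal sketch follows, with two small hypothetical lookup tables standing in for the UNL KB and for the is_a_value_of relation; their contents are taken from the slide's three examples only.

```python
# Hypothetical stand-ins for the real resources.
UNL_KB_UWS = {"broad": "broad(aoj>thing)"}  # adjectives already in the UNL KB
HAS_IS_A_VALUE_OF = {"good"}                # adjectives with is_a_value_of defined


def adjective_uw(adjective: str) -> str:
    # Step 1: reuse the UW when the UNL KB already has one.
    if adjective in UNL_KB_UWS:
        return UNL_KB_UWS[adjective]
    # Step 2: fall back on the is_a_value_of relation.
    if adjective in HAS_IS_A_VALUE_OF:
        return f"{adjective}(aoj>thing)"
    # Otherwise treat the adjective as a modifier.
    return f"{adjective}(mod>thing)"


for word in ("broad", "good", "green"):
    print(adjective_uw(word))
```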

Page 21: Automatic Lexicon Generation through WordNet

Semantic attribute generation

English-UW dictionary generation (Method)

Page 27: Automatic Lexicon Generation through WordNet

Semantic attribute generation

[Diagram: the Tagged Document (crane:N:4, word2:pos2:sense2, ...) feeds the Inference Engine, which consults the KB (WordNet) and the Database of rules.]

1. The entry crane:N:4 is taken from the tagged document.
2. A query to collect semantic information is issued.
3. The hypernym chain is returned: crane, bird, (fauna, animal), organism.
4. A query to collect relevant rules is issued.
5. The matching rules are returned:

   IF hypernym=‘organism’ THEN generate ‘ANIMT’ ELSE generate ‘INANI’;
   IF hypernym=‘fauna’ THEN generate ‘FAUNA’;
   IF hypernym=‘bird’ THEN generate ‘BIRD’;
   ...

6. The attributes are generated: (N, ANIMT, FAUNA, BIRD)
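The rule firing for crane can be sketched as a table lookup over the hypernym chain. This is an illustration only: the rule table holds just the noun rules visible on the slides, and emitting attributes from the most general hypernym to the most specific is an assumption chosen to match the slide's output order.

```python
# Noun rules shown on the slides: hypernym -> attribute.
NOUN_RULES = {
    "organism": "ANIMT",
    "flora": "FLORA",
    "fauna": "FAUNA",
    "bird": "BIRD",
}


def noun_attributes(hypernyms) -> list:
    attrs = ["N"]
    # The 'organism' rule is the only one with an ELSE branch.
    attrs.append(NOUN_RULES["organism"] if "organism" in hypernyms else "INANI")
    # Fire the remaining rules from general to specific (assumed order).
    for hypernym in reversed(hypernyms):
        if hypernym != "organism" and hypernym in NOUN_RULES:
            attrs.append(NOUN_RULES[hypernym])
    return attrs


crane_chain = ["bird", "fauna", "animal", "organism"]
print(noun_attributes(crane_chain))
```

A noun whose chain never reaches 'organism' receives INANI and nothing else from this small table.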

Page 28: Automatic Lexicon Generation through WordNet

Semantic attribute generation

Database of rules. No of such rules: 4344

Table 1. Rules for nouns (96)

HYPERNYM       ATTRIBUTE
organism       ANIMT
flora          FLORA
fauna          FAUNA
bird           BIRD

Table 2. Rules for verbs (405)

HYPERNYM       ATTRIBUTE
change         VOA,CHNG
communicate    VOA,COMM
move           VOA,MOTN
complete       VOA,CMPLT

Table 3.1. Rules for adjectives (29)

IS_A_VALUE_OF  ATTRIBUTE
weight         DES,WT
strength       DES,STRNGTH
qual           DES,QUAL

Table 3.2. Rules for adjectives (3258)

SYNONYMY OR ANTONYMY  ATTRIBUTE
bright                DES,APPR
deep                  DES,DPTH
shallow               DES,DPTH

Table 4. Rules for adverbs (556)

SYNONYMY       ATTRIBUTE
backward       DRCTN
always         FREQ
frequent       FREQ
beautifully    MAN

Page 29: Automatic Lexicon Generation through WordNet

Experiments and Results

Precision = (No of correct entries in the dictionary) / (Total no of entries in the dictionary)

[Two plots: precision (%) against document no (1 to 10), one for nouns and one for verbs.]

Precision for nouns – 93.9%
Precision for verbs – 84.4%

Page 30: Automatic Lexicon Generation through WordNet

Experiments and results

Precision = (No of correct entries in the dictionary) / (Total no of entries in the dictionary)

[Two plots: precision (%) against document no (1 to 10), one for adjectives and one for adverbs.]

Precision for adjectives – 90.06%
Precision for adverbs – 86%
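The precision measure used in these results is the ratio of correct entries to all generated entries, expressed as a percentage. A minimal sketch follows; the counts in the example are invented for illustration, since the slides report percentages only.

```python
def precision(correct_entries: int, total_entries: int) -> float:
    """Return precision as a percentage of generated dictionary entries."""
    if total_entries <= 0:
        raise ValueError("the dictionary must contain at least one entry")
    if not 0 <= correct_entries <= total_entries:
        raise ValueError("correct entries must lie between 0 and the total")
    return 100.0 * correct_entries / total_entries


# Hypothetical counts for one document (not taken from the experiments).
print(f"{precision(46, 50):.1f}%")
```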

Page 31: Automatic Lexicon Generation through WordNet

Implementation details

Subtasks identified:

– MySQL database is used for storing the rules and the UNL KB.
  – 7540 entries in the UNL KB.
  – 4344 entries in the rule base.
– Inference engine in C++.
– Web interface of the DDG in CGI & PHP.
– Other utilities, like the UNL KB organizer, rule entry interface and WSD integrator, are implemented in Perl.
– LOC: 4761

Page 32: Automatic Lexicon Generation through WordNet

Demo

Page 33: Automatic Lexicon Generation through WordNet

Hindi-UW dictionary generation

Method

Page 34: Automatic Lexicon Generation through WordNet

Hindi-UW dictionary generation

1. WordNet API is used to obtain all possible parts-of-speech and all possible senses for every word.

2. Hindi WN is queried (by using Hindi WN API) to obtain the semantic attributes.

Page 35: Automatic Lexicon Generation through WordNet

Hindi-UW dictionary generation

3. The Hindi UW dictionary database is queried (on the basis of the input word and its POS) to obtain an appropriate UW.

4. The irrelevant entries are disabled and the incorrect ones are corrected manually by the lexicographer.

Page 36: Automatic Lexicon Generation through WordNet

Demo

Page 37: Automatic Lexicon Generation through WordNet

Conclusion and future work

– The burden of lexicography has been reduced considerably.
– The system is being routinely used in our work on machine translation in a tri-language setting (English, Hindi and Marathi).
– Future work will be directed towards the implementation of a part-of-speech tagger and a word sense disambiguator for Hindi and Marathi.