RANLP 2013: DutchSemCor in Quest of the Ideal Corpus

DutchSemCor: in Quest of the Ideal Sense Tagged Corpus
Piek Vossen [email protected]
Rubén Izquierdo ruben.izquierdobevia@vu.nl
Attila Görög [email protected]

Upload: ruben-izquierdo-bevia

Post on 22-Jun-2015


Page 1: RANLP 2013: DutchSemcor in quest of the ideal corpus

DutchSemCor: in Quest of the Ideal Sense Tagged Corpus

Piek Vossen [email protected]

Rubén Izquierdo [email protected]

Attila Görög [email protected]

Page 2: RANLP 2013: DutchSemcor in quest of the ideal corpus

Outline

• Main goal of our project
• WSD and annotated corpus
• Our approach
• Balanced-sense corpus and evaluation
• Balanced-context corpus and evaluation
• Sense distributions, all-words corpus and evaluation
• Numbers…

Page 3: RANLP 2013: DutchSemcor in quest of the ideal corpus

Main goal of DSC

Deliver a Dutch corpus enriched with semantic information:
• Senses of the most frequent and most polysemous words
• Domains
• Named Entities linked with Wikipedia

1 million sense-tagged tokens:
• 250K tagged manually by 2 annotators
• 750K tagged by 1 annotator / automatically through Active Learning

Page 4: RANLP 2013: DutchSemcor in quest of the ideal corpus

Current WSD: Insights on Word Sense Disambiguation

1. Evaluation tasks depend on the corpus / lexicon
   • It seems that the results depend more on the evaluation data than on the WSD systems
   • Are the evaluation corpora diverse enough?
2. The most frequent sense from SemCor is difficult to beat
   • Are evaluation tasks neglecting low-frequency senses?
3. Predominant senses in specific domains give the best results
4. Supervised systems beat unsupervised systems
   • Which are the best corpora for WSD?

What should the ideal corpus for WSD look like? We:
• Define criteria for the ideal sense-tagged corpus
• Describe a novel approach for building a large-scale sense-tagged corpus that meets these criteria (with as little manual effort as possible)

Page 5: RANLP 2013: DutchSemcor in quest of the ideal corpus

Criteria for a corpus

A good corpus for WSD should:
• Be balanced for different senses: an equal number of examples for each meaning
• Be balanced for different contexts: different usages of the words
• Provide information on sense frequencies (across domains and genres): how frequent each meaning of a word is in representative text

Page 6: RANLP 2013: DutchSemcor in quest of the ideal corpus

Annotating a corpus

Sequential tagging (all-words corpus)
• Whole text
• Reconsider meanings
• Small numbers of texts, genres, domains and senses
• Sense distributions
• SemCor

Targeted tagging (lexical sample corpus: balanced sense, balanced context)
• KWIC
• Repeated contexts
• Usually large number of contexts and senses
• Line-hard-serve, DSO

Page 7: RANLP 2013: DutchSemcor in quest of the ideal corpus

Annotating a corpus

                    Sense distribution   Sense coverage   Context diversity
All words                   ✔                  ✖                  ✖
Balanced-sense              ✖                  ✔                  ✖
Balanced-context            ✖                  ✖                  ✔

Page 8: RANLP 2013: DutchSemcor in quest of the ideal corpus

Our main approach

1. Annotated corpus that represents ALL the meanings of an existing lexicon
   • Balanced-sense
   • Manual
2. Train WSD systems using the annotated corpus
   • Will be trained for all the senses
3. Extend this annotated corpus to acquire a wider representation of contexts
   • Balanced-context
   • Manual + WSD
4. Annotate the full raw corpus
   • Sense distributions
   • WSD
5. Evaluation of the annotations for the 3 criteria

Page 9: RANLP 2013: DutchSemcor in quest of the ideal corpus

Resources

Cornetto database
• Lexical semantic database for Dutch
• Structure and content of WN + FrameNet-like data

SoNaR (500M tokens)
• Dutch, wide range of genres and topics
• 34 categories (discussion lists, books, chats, autocues…)

CGN (9M tokens)
• Transcribed spontaneous Dutch adult speech

Internet

Page 10: RANLP 2013: DutchSemcor in quest of the ideal corpus

WSD systems

DSC-timbl
• Memory-based learning classifier
• Supervised k-nearest neighbor

DSC-SVM
• Linear classifier / Support Vector Machines
• Binary classifiers, 1 vs all

DSC-UKB
• Knowledge-based system
• Personalized PageRank algorithm
• Synsets are nodes, relations are edges
• Context words inject mass into word senses
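
The knowledge-based idea on this slide (context words inject mass into a graph of synsets, and the best-ranked sense wins) can be illustrated with a minimal Personalized PageRank sketch. This is not the UKB implementation: the function name, the damping factor, the iteration count and the toy graph with its node labels are all assumptions made for illustration.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Rank nodes of `graph` with random restarts biased towards `seeds`.

    graph: dict mapping every node to a list of its neighbours.
    seeds: dict mapping context-word nodes to the mass they inject.
    """
    total = float(sum(seeds.values()))
    restart = {node: seeds.get(node, 0.0) / total for node in graph}
    rank = dict(restart)
    for _ in range(iterations):
        # Every node first receives its share of the restart mass.
        updated = {node: (1.0 - damping) * restart[node] for node in graph}
        for node, neighbours in graph.items():
            if not neighbours:
                continue  # dangling node: its mass is dropped in this sketch
            share = damping * rank[node] / len(neighbours)
            for neighbour in neighbours:
                updated[neighbour] += share
        rank = updated
    return rank


# Hypothetical toy graph: the context word "geld" (money) is linked to its
# synset, which is related to one of two invented senses of "bank".
toy_graph = {
    "word:geld": ["geld#1"],
    "geld#1": ["word:geld", "bank#financial"],
    "bank#financial": ["geld#1"],
    "bank#furniture": ["zitten#1"],
    "zitten#1": ["bank#furniture"],
}
ranks = personalized_pagerank(toy_graph, seeds={"word:geld": 1.0})
best_sense = max(["bank#financial", "bank#furniture"], key=ranks.get)
```

In this toy setup only the sense reachable from the context word receives any mass, so it is the one selected.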

Page 11: RANLP 2013: DutchSemcor in quest of the ideal corpus

Balanced-sense corpus

• 2,870 most polysemous and frequent words (11,982 meanings, avg. polysemy 3)
• Student assistants, 2 years
• SAT tool and Web-snippets tool
• 80% agreement, 25 examples per sense
• 282,503 tokens double annotated
  • 80% of senses with more than 25 examples
  • 90% of lemmas with 25 examples for each sense
• Distribution: 67% SoNaR, 5% CGN, 28% web
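
The two coverage figures on this slide (share of senses and share of lemmas that reach the target number of examples) could be computed along the following lines. This is a sketch only: the data layout, the function name and the use of "at least `min_examples`" as the cut-off are assumptions, and the toy data uses a threshold of 2 instead of 25 to stay small.

```python
from collections import Counter


def sense_coverage(annotations, lexicon, min_examples=25):
    """Return (share of senses, share of lemmas) that reach `min_examples`.

    annotations: iterable of (lemma, sense) pairs, one per annotated token.
    lexicon: dict mapping each lemma to the list of its senses.
    """
    counts = Counter(annotations)
    senses_total = senses_ok = lemmas_ok = 0
    for lemma, senses in lexicon.items():
        covered = [counts[(lemma, sense)] >= min_examples for sense in senses]
        senses_total += len(covered)
        senses_ok += sum(covered)
        lemmas_ok += all(covered)  # lemma counts only if every sense is covered
    return senses_ok / senses_total, lemmas_ok / len(lexicon)


# Invented toy lexicon and annotations.
toy_lexicon = {"bank": ["bank#1", "bank#2"], "blad": ["blad#1"]}
toy_annotations = (
    [("bank", "bank#1")] * 3 + [("bank", "bank#2")] * 1 + [("blad", "blad#1")] * 2
)
sense_share, lemma_share = sense_coverage(toy_annotations, toy_lexicon, min_examples=2)
```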

Page 14: RANLP 2013: DutchSemcor in quest of the ideal corpus

WSD from balanced sense

• 5-fold cross-validation (5-FCV) at sense level, with a focus on nouns
• Optimized for annotating SoNaR
  • Specific features (word_id)
• Overall result for nouns: 82.76
• Results used to further annotate weakly performing senses
• Active Learning approach
  • Select 82 lemmas performing under 80%
  • 3 rounds of annotation until reaching 81.62%
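
The selection step of the Active Learning loop (find the lemmas whose cross-validated accuracy is under 80%) can be sketched as below. The triple format and the function name are assumptions; the slide does not describe how the per-lemma scores were stored.

```python
from collections import defaultdict


def lemmas_below(predictions, threshold=0.80):
    """Return {lemma: accuracy} for lemmas whose accuracy is under `threshold`.

    predictions: iterable of (lemma, gold_sense, predicted_sense) triples,
    e.g. collected from the held-out folds of a cross-validation run.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lemma, gold, predicted in predictions:
        total[lemma] += 1
        correct[lemma] += int(gold == predicted)
    accuracy = {lemma: correct[lemma] / total[lemma] for lemma in total}
    return {lemma: acc for lemma, acc in accuracy.items() if acc < threshold}


# Invented toy predictions: "bank" scores 4/5, "blad" scores 1/2.
toy_predictions = (
    [("bank", "bank#1", "bank#1")] * 4
    + [("bank", "bank#2", "bank#1")]
    + [("blad", "blad#1", "blad#1"), ("blad", "blad#2", "blad#1")]
)
weak_lemmas = lemmas_below(toy_predictions)
```

Lemmas returned by such a function would then receive additional manual annotation before the next round.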

Page 17: RANLP 2013: DutchSemcor in quest of the ideal corpus

Balanced context

• Try to annotate the whole corpus: as many contexts as the whole corpus has
• A good WSD system is needed: improve problematic cases
• Select all words performing under 80%
• Annotate the whole corpus with the Timbl-wsd system (optimized)
• 50 new tokens for senses of words under 80%, in different contexts
  • High confidence
  • Low distance / high distance to the nearest neighbor
• Manually annotate these 50
  • Completely different from the first phase, where annotators could choose
  • Lemmatization errors, PoS errors, figurative, idiomatic, unknown senses
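
One way to picture the token selection on this slide (confident automatic tags, chosen by low or high distance to the nearest training neighbor) is the sketch below. The dictionary keys, the confidence cut-off of 0.9 and the function name are hypothetical; the slide gives only the target of 50 tokens per sense.

```python
def select_candidates(candidates, per_sense=50, min_confidence=0.9, low_distance=True):
    """Pick up to `per_sense` automatically tagged tokens for each sense.

    candidates: list of dicts with the hypothetical keys "token_id", "sense",
    "confidence" (classifier confidence) and "distance" (distance to the
    nearest training neighbour).
    """
    by_sense = {}
    for cand in candidates:
        if cand["confidence"] >= min_confidence:
            by_sense.setdefault(cand["sense"], []).append(cand)
    selected = {}
    for sense, group in by_sense.items():
        # Ascending order gives the low-distance set, descending the high one.
        group.sort(key=lambda c: c["distance"], reverse=not low_distance)
        selected[sense] = [c["token_id"] for c in group[:per_sense]]
    return selected


# Invented toy candidates for one sense.
toy_candidates = [
    {"token_id": "t1", "sense": "bank#1", "confidence": 0.95, "distance": 0.1},
    {"token_id": "t2", "sense": "bank#1", "confidence": 0.97, "distance": 0.7},
    {"token_id": "t3", "sense": "bank#1", "confidence": 0.50, "distance": 0.0},
    {"token_id": "t4", "sense": "bank#1", "confidence": 0.99, "distance": 0.4},
]
low_set = select_candidates(toy_candidates, per_sense=2, low_distance=True)
high_set = select_candidates(toy_candidates, per_sense=2, low_distance=False)
```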

Page 18: RANLP 2013: DutchSemcor in quest of the ideal corpus

Evaluating the Balanced-sense and new annotations

Type                               Accuracy   # examples
Balanced Sense (BS)                81.62      8641
BS + LowD                          78.81      13266
BS + LowD_agreed                   85.02      11405
BS + HighD                         76.24      19055
BS + HighD_agreed                  83.77      13359
BS + LowD_agreed + HighD_agreed    85.33      16123

• Timbl-DSC, 5-FCV (folds incremented with new data), 82 lemmas
• Better results when using agreed data
• High/low distance does not make a big difference

Page 19: RANLP 2013: DutchSemcor in quest of the ideal corpus

Evaluation balanced-context

• 5-FCV using agreed new instances
• Best is majority voting

System      Nouns   Verbs   Adjs
DSC-timbl   83.97   83.44   78.64
DSC-svm     82.69   84.93   79.03
DSC-ukb     73.04   55.84   56.36
Voting      88.65   87.60   83.06
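
The "Voting" row combines the three systems by majority vote. A minimal sketch of such a combiner follows; the slides do not say how ties were resolved, so falling back to one named system is an assumption of this example, as are the function name and the sense labels.

```python
from collections import Counter


def vote(predictions, tie_breaker):
    """Combine sense predictions from several systems by majority vote.

    predictions: dict mapping a system name to the sense it predicted.
    tie_breaker: name of the system whose answer wins when the top senses
    are tied (an assumption of this sketch).
    """
    counts = Counter(predictions.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return predictions[tie_breaker]
    return counts[0][0]


# Two systems agree, so their sense wins.
majority = vote(
    {"DSC-timbl": "bank#1", "DSC-svm": "bank#1", "DSC-ukb": "bank#2"},
    tie_breaker="DSC-svm",
)
# All three disagree, so the tie breaker decides.
tied = vote(
    {"DSC-timbl": "bank#1", "DSC-svm": "bank#3", "DSC-ukb": "bank#2"},
    tie_breaker="DSC-svm",
)
```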

Page 20: RANLP 2013: DutchSemcor in quest of the ideal corpus

Evaluating representativeness

• Our manually annotated corpus is probably skewed towards balanced-sense
• Required to test the performance of our WSD on the rest of SoNaR
• Random evaluation
  • Ranges of accuracy (90-100, 80-90, 70-80, 60-70)
  • 5 nouns, 5 verbs and 3 adjs per range: 52 lemmas
  • 100 tokens for each lemma, automatically tagged and manually validated
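
The size of the random evaluation set follows from the design: 4 accuracy ranges times (5 nouns + 5 verbs + 3 adjectives) gives 52 lemmas, and 100 tokens each gives 5,200 tokens. A sketch of such a stratified draw is shown below; the function, the range boundaries (upper bound exclusive, with 101 used so that 100 falls in the top range) and the synthetic lemma names are assumptions.

```python
import random


def sample_lemmas(lemma_info, quotas, ranges, seed=0):
    """Draw a fixed number of lemmas per accuracy range and part of speech.

    lemma_info: dict lemma -> (pos, accuracy in percent).
    quotas: dict pos -> number of lemmas to draw in each range.
    ranges: list of (low, high) accuracy bounds; low inclusive, high exclusive.
    """
    rng = random.Random(seed)
    sample = []
    for low, high in ranges:
        for pos, quota in quotas.items():
            pool = sorted(
                lemma
                for lemma, (lemma_pos, acc) in lemma_info.items()
                if lemma_pos == pos and low <= acc < high
            )
            # random.sample raises ValueError if the pool is too small.
            sample.extend(rng.sample(pool, quota))
    return sample


ranges = [(90, 101), (80, 90), (70, 80), (60, 70)]
quotas = {"noun": 5, "verb": 5, "adj": 3}
# Synthetic lemmas: six per part of speech in every accuracy range.
lemma_info = {
    f"{pos}_{low}_{i}": (pos, low + 5.0)
    for low, _ in ranges
    for pos in quotas
    for i in range(6)
}
sample = sample_lemmas(lemma_info, quotas, ranges)
tokens_needed = len(sample) * 100
```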

Page 21: RANLP 2013: DutchSemcor in quest of the ideal corpus

Evaluating representativeness

• Results lower than previous evaluations
• Difference between an approach representing the lexicon (sense) and one representing the corpus
• Results comparable to state-of-the-art English Senseval/SemEval

System      Nouns   Verbs   Adjs
DSC-timbl   54.25   48.25   46.50
DSC-svm     64.10   52.20   52.00
DSC-ukb     49.37   44.15   38.13
Voting      60.70   53.95   50.83

Page 22: RANLP 2013: DutchSemcor in quest of the ideal corpus

Obtaining sense distributions

Approach
• Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
• Assume that automatic annotation still reflects the real distribution
• Evaluate this frequency distribution (Most Frequent Sense)

How can this MFS approach be evaluated?
• Manual annotations: 25 examples per sense, no sense distribution
• Random evaluation corpus: only a small selection of words (52 lemmas)
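
Deriving a Most Frequent Sense per lemma from automatically tagged tokens, and scoring it against gold annotations, can be sketched as follows. The pair format and function names are assumptions; ties between equally frequent senses go to whichever sense was seen first, which is simply how `Counter.most_common` orders equal counts.

```python
from collections import Counter, defaultdict


def most_frequent_senses(tagged_tokens):
    """Map each lemma to the sense assigned to it most often.

    tagged_tokens: iterable of (lemma, sense) pairs produced by a WSD system.
    """
    counts = defaultdict(Counter)
    for lemma, sense in tagged_tokens:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}


def mfs_accuracy(mfs, gold_tokens):
    """Accuracy of always predicting the MFS on (lemma, gold_sense) pairs.

    Lemmas without an MFS entry count as errors.
    """
    gold_tokens = list(gold_tokens)
    hits = sum(mfs.get(lemma) == gold for lemma, gold in gold_tokens)
    return hits / len(gold_tokens)


# Invented automatic tags and a small invented gold sample.
auto_tags = [("bank", "bank#1")] * 3 + [("bank", "bank#2")] + [("blad", "blad#2")]
mfs = most_frequent_senses(auto_tags)
gold = [("bank", "bank#1"), ("bank", "bank#2"), ("blad", "blad#2"), ("boom", "boom#1")]
accuracy = mfs_accuracy(mfs, gold)
```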

Page 23: RANLP 2013: DutchSemcor in quest of the ideal corpus

Obtaining sense distributions

All-words corpus was created
• Completely independent texts from Lassy
• Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
• 23,907 tokens, covering 1,527 of our set of lemmas (53%)

Evaluation of
• 3 WSD systems
• First sense baseline according to Cornetto
• Random sense baseline
• Most frequent sense: sense distributions obtained from automatic annotation

Page 24: RANLP 2013: DutchSemcor in quest of the ideal corpus

Obtaining sense distributions

• MFS in Dutch similar to English MFS
• MFS better than 1st and random sense baselines
• MFS automatically derived is a good predictor

System         Nouns   Verbs   Adjs
1st sense      53.17   32.84   52.17
Random sense   29.52   24.99   32.16
MFS            61.20   50.76   54.62
DSC-timbl      55.76   37.96   49.00
DSC-svm        64.58   45.81   55.70
DSC-ukb        56.81   31.37   35.93
Voting         66.09   45.68   52.24

Page 25: RANLP 2013: DutchSemcor in quest of the ideal corpus

Numbers of DSC

Balanced-sense annotated corpus
• 274,344 tokens
• 2,874 lemmas
• Annotated by 2 annotators, 90% IAA

Balanced-context annotated corpus
• 132,666 tokens
• 1,133 lemmas
• Manually annotated by 1 annotator, agreeing with WSD in 44%

Random evaluation corpus
• 5,200 tokens
• 52 lemmas

All-words corpus
• 23,907 tokens
• 1,527 lemmas

3 WSD systems for Dutch
• DSC-timbl
• DSC-svm
• DSC-ukb

Automatic annotations by the 3 WSD systems
• Sense distributions
• 48 million tokens with confidence

… and more…
• 800,000 semantic relations between senses extracted from manual annotations
• 28,080 sense groups
• Improved version of Cornetto
• SAT annotation tool
• Web search tool
• Statistics on figurative, idiomatic and collocational usage of words
• …

Page 26: RANLP 2013: DutchSemcor in quest of the ideal corpus

DutchSemCor: in Quest of the Ideal Sense Tagged Corpus

Piek Vossen [email protected]

Rubén Izquierdo [email protected]

Attila Görög [email protected]

Thanks for your attention