RANLP 2013: DutchSemCor in Quest of the Ideal Corpus


DutchSemCor: in Quest of the Ideal Sense Tagged Corpus

Piek Vossen piek.vossen@vu.nl

Rubén Izquierdo ruben.izquierdobevia@vu.nl

Attila Görög a.gorog@vu.nl

Outline

Main goal of our project

WSD and annotated corpus

Our approach

Balanced-sense corpus and evaluation

Balanced-context corpus and evaluation

Sense distributions, all words corpus and evaluation

Numbers…

Main goal of DSC

Deliver a Dutch corpus enriched with semantic information:
- Senses of the most frequent and most polysemous words
- Domains
- Named Entities linked to Wikipedia

1 million sense-tagged tokens:
- 250K tagged manually by 2 annotators
- 750K tagged by 1 annotator or automatically through Active Learning

Current Insights on Word Sense Disambiguation

1. Evaluation tasks depend on the corpus and lexicon. Results seem to depend more on the evaluation data than on the WSD systems. Are the evaluation corpora diverse enough?
2. The most frequent sense from SemCor is difficult to beat. Are evaluation tasks neglecting low-frequency senses?
3. Predominant senses in specific domains give the best results.
4. Supervised systems beat unsupervised systems. Which are the best corpora for WSD?

What should the ideal corpus for WSD look like? We:
- Define criteria for the ideal sense-tagged corpus
- Describe a novel approach for building a large-scale sense-tagged corpus that meets these criteria, with as little manual effort as possible

Criteria for a corpus

A good corpus for WSD should:
- Be balanced for different senses: an equal number of examples for each meaning
- Be balanced for different contexts: different usages of the words
- Provide information on sense frequencies across domains and genres: how often a word occurs in each of its meanings

Annotating a corpus

Sequential tagging -> All Words corpus:
- Whole text; meanings reconsidered as they occur
- Small numbers of texts, genres, domains and senses
- Yields sense distributions
- Example: SemCor

Targeted tagging -> Lexical Sample corpus:
- Balanced sense or balanced context
- KWIC; repeated contexts
- Usually a large number of contexts and senses
- Examples: line-hard-serve, DSO

Annotating a corpus

Corpus type      | Sense distribution | Sense coverage | Context diversity
All words        | ✔                  | ✖              | ✖
Balanced-sense   | ✖                  | ✔              | ✖
Balanced-context | ✖                  | ✖              | ✔

Our main approach

1. Annotate a corpus that represents ALL the meanings of an existing lexicon (balanced sense; manual)
2. Train WSD systems on the annotated corpus, so that they are trained for all the senses
3. Extend the annotated corpus to acquire a wider representation of contexts (balanced context; manual + WSD)
4. Annotate the full raw corpus (sense distributions; WSD)
5. Evaluate the annotations against the three criteria

Resources

- Cornetto database: a lexical semantic database for Dutch, with the structure and content of WordNet plus FrameNet-like data
- SoNaR (500M tokens): Dutch corpus with a wide range of genres and topics; 34 categories (discussion lists, books, chats, autocues, ...)
- CGN (9M tokens): transcribed spontaneous Dutch adult speech
- Internet

WSD systems

- DSC-timbl: memory-based learning classifier; supervised k-nearest neighbor
- DSC-svm: linear classifier (Support Vector Machines); binary one-vs-all classifiers
- DSC-ukb: knowledge-based system using the Personalized PageRank algorithm; synsets are nodes, relations are edges, and context words inject mass into word senses
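The DSC-ukb idea can be illustrated with a toy personalized PageRank run. This is a minimal sketch, not the UKB implementation: the graph, sense labels and teleport weights below are invented for illustration.

```python
# Illustrative sketch: personalized PageRank over a toy sense graph.
# Synsets are nodes, semantic relations are (undirected) edges, and the
# context words inject restart mass into the senses they relate to.

def personalized_pagerank(edges, teleport, damping=0.85, iterations=50):
    """edges: dict node -> list of neighbour nodes.
    teleport: dict node -> restart mass for context-related nodes."""
    nodes = list(edges)
    rank = {n: teleport.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            # mass flowing in from every node that links to n
            incoming = sum(rank[m] / len(edges[m]) for m in edges if n in edges[m])
            new[n] = (1 - damping) * teleport.get(n, 0.0) + damping * incoming
        rank = new
    return rank

# Toy graph: two senses of "bank" plus related synsets (invented).
graph = {
    "bank#finance": ["money", "institution"],
    "bank#river": ["river", "slope"],
    "money": ["bank#finance", "institution"],
    "institution": ["bank#finance", "money"],
    "river": ["bank#river", "slope"],
    "slope": ["bank#river", "river"],
}
# Context words "river" and "slope" inject mass, pulling rank to bank#river.
teleport = {"river": 0.5, "slope": 0.5}
ranks = personalized_pagerank(graph, teleport)
best = max(["bank#finance", "bank#river"], key=lambda s: ranks[s])
print(best)  # bank#river
```

The actual UKB system works over the full Cornetto graph; the mechanism shown here, restart mass flowing from context nodes to candidate senses, is the same.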

Balanced-sense corpus

- 2,870 most polysemous and frequent words (11,982 meanings; average polysemy 3)
- Annotated by student assistants over 2 years
- SAT tool and Web-snippets tool
- Target: 80% agreement and 25 examples per sense
- 282,503 tokens double annotated; 80% of senses have more than 25 examples; 90% of lemmas have 25 examples for each sense
- Distribution: 67% SoNaR, 5% CGN, 28% web
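Since every token is double annotated, agreement can be measured directly. A minimal sketch of raw percent agreement (the project reports agreement per lemma; the token/sense data below are invented for illustration):

```python
# Minimal sketch: observed inter-annotator agreement on double-annotated
# tokens, i.e. the fraction of tokens given the same sense by both annotators.

def percent_agreement(annotations_a, annotations_b):
    """annotations_a/b: parallel lists of sense labels for the same tokens."""
    assert len(annotations_a) == len(annotations_b)
    matches = sum(a == b for a, b in zip(annotations_a, annotations_b))
    return matches / len(annotations_a)

# Invented example: two annotators disagree on one of five tokens.
a = ["band#1", "band#1", "band#2", "band#3", "band#1"]
b = ["band#1", "band#2", "band#2", "band#3", "band#1"]
print(percent_agreement(a, b))  # 0.8
```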


WSD from the balanced-sense corpus

- 5-fold cross-validation at sense level, with a focus on nouns
- Optimized for annotating SoNaR; specific features (word_id)
- Overall result for nouns: 82.76
- Results used to further annotate weakly performing senses
- Active Learning approach: select the 82 lemmas performing under 80%; 3 rounds of annotation until reaching 81.62%
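The active-learning rounds can be sketched as a small loop. This is an illustration, not the project's code: `cross_validated_accuracy` and `request_annotations` are hypothetical stand-ins for the 5-fold evaluation and the manual annotation step.

```python
# Sketch of the active-learning loop: find lemmas whose WSD accuracy is
# below the threshold, obtain new annotated examples for them, repeat.

def active_learning(corpus, lemmas, cross_validated_accuracy,
                    request_annotations, threshold=0.80, rounds=3):
    """corpus: dict lemma -> list of annotated examples."""
    for _ in range(rounds):
        weak = [l for l in lemmas
                if cross_validated_accuracy(corpus, l) < threshold]
        if not weak:
            break
        for lemma in weak:
            corpus[lemma].extend(request_annotations(lemma))
    return corpus

# Toy demonstration with fake evaluator/annotator functions:
# accuracy is modeled as growing with the number of examples.
corpus = {"bank": ["ex"] * 10, "bat": ["ex"] * 50}
acc = lambda c, l: min(1.0, len(c[l]) / 50)   # fake 5-fold evaluation
annotate = lambda l: ["new_ex"] * 25          # fake annotation round
corpus = active_learning(corpus, ["bank", "bat"], acc, annotate)
print(len(corpus["bank"]))  # 60: two rounds of 25 new examples were needed
```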


Balanced-context corpus

- Goal: represent as many contexts as the whole corpus, obtain good WSD, and improve problematic cases
- Select all words performing under 80%
- Annotate the whole corpus with the (optimized) Timbl-wsd system
- Select 50 new tokens for the senses of these words, each in a different context: high confidence, with either low or high distance to the nearest neighbor
- Manually annotate these 50 tokens; completely different from the first phase, where annotators could choose their examples
- This step surfaced lemmatization errors, PoS errors, and figurative, idiomatic or unknown senses
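The distance-based selection above can be sketched as follows. This is our illustration, not the Timbl-wsd code: feature vectors and thresholds are invented, and real features would come from the WSD system's context representation.

```python
# Sketch: for each automatically tagged token, compute the distance to its
# nearest manually annotated neighbour, then pick either low-distance
# (similar) or high-distance (novel) contexts for manual checking.

def nearest_distance(vector, annotated_vectors):
    """Euclidean distance from `vector` to the closest annotated example."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
    return min(dist(vector, v) for v in annotated_vectors)

def select_candidates(candidates, annotated, n, novel=True):
    """Rank candidates by nearest-neighbour distance; `novel=True` returns
    the most distant (new) contexts, otherwise the closest ones."""
    ranked = sorted(candidates,
                    key=lambda v: nearest_distance(v, annotated),
                    reverse=novel)
    return ranked[:n]

annotated = [(1.0, 0.0), (0.9, 0.1)]                 # toy context vectors
candidates = [(1.0, 0.1), (0.0, 1.0), (0.5, 0.5)]
print(select_candidates(candidates, annotated, 1, novel=True))  # [(0.0, 1.0)]
```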

Evaluating the balanced-sense and new annotations

Type                            | Accuracy | # examples
Balanced Sense (BS)             | 81.62    | 8,641
BS + LowD                       | 78.81    | 13,266
BS + LowD_agreed                | 85.02    | 11,405
BS + HighD                      | 76.24    | 19,055
BS + HighD_agreed               | 83.77    | 13,359
BS + LowD_agreed + HighD_agreed | 85.33    | 16,123

• Timbl-DSC, 5-fold cross-validation (folds incremented with the new data), 82 lemmas
• Better results when using agreed data
• High vs. low distance does not make a big difference

Evaluation of the balanced-context corpus

5-fold cross-validation using the agreed new instances; majority voting performs best.

System    | Nouns | Verbs | Adjs
DSC-timbl | 83.97 | 83.44 | 78.64
DSC-svm   | 82.69 | 84.93 | 79.03
DSC-ukb   | 73.04 | 55.84 | 56.36
Voting    | 88.65 | 87.60 | 83.06
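Majority voting over the three systems can be sketched in a few lines. The tie-breaking rule (fall back to the first system's answer) is our assumption for illustration; the sense labels are invented.

```python
# Minimal sketch of majority voting over per-token predictions from the
# three WSD systems. Ties fall back to the first system (an assumption).
from collections import Counter

def vote(predictions):
    """predictions: list of sense labels, one per system, for one token."""
    top, freq = Counter(predictions).most_common(1)[0]
    return top if freq > 1 else predictions[0]

print(vote(["bank#2", "bank#2", "bank#1"]))  # bank#2
print(vote(["bank#1", "bank#2", "bank#3"]))  # bank#1 (tie -> first system)
```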

Evaluating representativeness

- Our manually annotated corpus is probably skewed towards the balanced-sense design
- We therefore needed to test the performance of our WSD systems on the rest of SoNaR
- Random evaluation: accuracy ranges (90-100, 80-90, 70-80, 60-70); 5 nouns, 5 verbs and 3 adjectives per range, 52 lemmas in total; 100 tokens for each lemma, automatically tagged and manually validated

Evaluating representativeness

- Results are lower than in the previous evaluations
- This reflects the difference between representing the lexicon (senses) and representing the corpus
- Results are comparable to the state of the art for English Senseval/SemEval

System    | Nouns | Verbs | Adjs
DSC-timbl | 54.25 | 48.25 | 46.50
DSC-svm   | 64.10 | 52.20 | 52.00
DSC-ukb   | 49.37 | 44.15 | 38.13
Voting    | 60.70 | 53.95 | 50.83

Obtaining sense distributions

Approach:
- Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
- Assume that the automatic annotation still reflects the real distribution
- Evaluate this frequency distribution (Most Frequent Sense)

How can this MFS approach be evaluated?
- Not with the manual annotations: 25 examples per sense, so no sense distribution
- Not with the random evaluation corpus: only a small selection of words (52 lemmas)
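Deriving an MFS predictor from the automatic annotation amounts to counting senses per lemma and always predicting the majority sense. A minimal sketch (the lemma/sense data are invented):

```python
# Sketch: build a Most Frequent Sense predictor from automatically tagged
# tokens by counting how often each sense of a lemma was assigned.
from collections import Counter, defaultdict

def most_frequent_senses(tagged_tokens):
    """tagged_tokens: iterable of (lemma, sense) pairs from the WSD output."""
    counts = defaultdict(Counter)
    for lemma, sense in tagged_tokens:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

# Invented WSD output for two lemmas.
tokens = [("bank", "bank#river"), ("bank", "bank#finance"),
          ("bank", "bank#finance"), ("blad", "blad#leaf")]
mfs = most_frequent_senses(tokens)
print(mfs["bank"])  # bank#finance
```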

Obtaining sense distributions

- An all-words corpus was created from completely independent texts from Lassy: medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
- 23,907 tokens, covering 1,527 of our set of lemmas (53%)
- Evaluation of the 3 WSD systems against: the first-sense baseline according to Cornetto, a random-sense baseline, and the Most Frequent Sense
- Sense distributions obtained from the automatic annotation

Obtaining sense distributions

- MFS in Dutch behaves similarly to the English MFS
- MFS is better than the first-sense and random-sense baselines
- The automatically derived MFS is a good predictor

System       | Nouns | Verbs | Adjs
1st sense    | 53.17 | 32.84 | 52.17
Random sense | 29.52 | 24.99 | 32.16
MFS          | 61.20 | 50.76 | 54.62
DSC-timbl    | 55.76 | 37.96 | 49.00
DSC-svm      | 64.58 | 45.81 | 55.70
DSC-ukb      | 56.81 | 31.37 | 35.93
Voting       | 66.09 | 45.68 | 52.24

Numbers of DSC

- Balanced-sense annotated corpus: 274,344 tokens, 2,874 lemmas, annotated by 2 annotators, 90% IAA
- Balanced-context annotated corpus: 132,666 tokens, 1,133 lemmas, manually annotated by 1 annotator, agreeing with the WSD output in 44% of cases
- Random evaluation corpus: 5,200 tokens, 52 lemmas
- All-words corpus: 23,907 tokens, 1,527 lemmas
- 3 WSD systems for Dutch: DSC-timbl, DSC-svm, DSC-ukb
- Automatic annotations by the 3 WSD systems: sense distributions for 48 million tokens, with confidence scores
- ... and more: 800,000 semantic relations between senses extracted from the manual annotations; 28,080 sense groups; an improved version of Cornetto; the SAT annotation tool; a web search tool; statistics on figurative, idiomatic and collocational usage of words; ...


Thanks for your attention
