RANLP 2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus


DutchSemCor: in Quest of the Ideal Sense Tagged Corpus

Piek Vossen [email protected]

Rubén Izquierdo [email protected]

Attila Görög [email protected]

Outline
• Main goal of our project
• WSD and annotated corpora
• Our approach
• Balanced-sense corpus and evaluation
• Balanced-context corpus and evaluation
• Sense distributions, all-words corpus and evaluation
• Numbers…


Main goal of DSC
• Deliver a Dutch corpus enriched with semantic information:
  • Senses of the most frequent and most polysemous words
  • Domains
  • Named Entities linked with Wikipedia
• 1 million sense-tagged tokens:
  • 250K tagged manually by 2 annotators
  • 750K tagged by 1 annotator / automatically through Active Learning


Current WSD
• Insights on Word Sense Disambiguation:
  1. Evaluation tasks depend on the corpus / lexicon
     • Results seem to depend more on the evaluation data than on the WSD systems
     • Are the evaluation corpora diverse enough?
  2. The most frequent sense from SemCor is difficult to beat
     • Are evaluation tasks neglecting low-frequency senses?
  3. Predominant senses in specific domains give the best results
  4. Supervised systems beat unsupervised systems
• Which are the best corpora for WSD?
• What should the ideal corpus for WSD look like? (our work)
  • Define criteria for the ideal sense-tagged corpus
  • Describe a novel approach for building a large-scale sense-tagged corpus that meets these criteria (with as little manual effort as possible)


Criteria for a corpus
• A good corpus for WSD should:
  • Be balanced across different senses
    • An equal number of examples for each meaning
  • Be balanced across different contexts
    • Different usages of the words
  • Provide information on sense frequencies (across domains and genres)
    • Frequency of each word in a representative meaning


Annotating a corpus
• Sequential tagging → All-words corpus
  → Whole text
  → Reconsider meanings
  → Small numbers of texts, genres, domains and senses
  → Sense distributions → SemCor
• Targeted tagging → Lexical sample corpus (balanced sense / balanced context)
  → KWIC
  → Repeated contexts
  → Usually a large number of contexts and senses
  → Line-hard-serve, DSO

Annotating a corpus

Corpus type       Sense distribution  Sense coverage  Context diversity
All-words         ✔                   ✖               ✖
Balanced-sense    ✖                   ✔               ✖
Balanced-context  ✖                   ✖               ✔


Our main approach
1. Annotate a corpus that represents ALL the meanings of an existing lexicon
   • Balanced sense
   • Manual
2. Train WSD systems on the annotated corpus
   • They will be trained for all the senses
3. Extend the annotated corpus to acquire a wider representation of contexts
   • Balanced context
   • Manual + WSD
4. Annotate the full raw corpus
   • Sense distributions
   • WSD
5. Evaluate the annotations against the 3 criteria


Resources
• Cornetto database
  • Lexical-semantic database for Dutch
  • Structure and content of WordNet + FrameNet-like data
• SoNaR (500M tokens)
  • Dutch, wide range of genres and topics
  • 34 categories: discussion lists, books, chats, autocues…
• CGN (9M tokens)
  • Transcribed spontaneous Dutch adult speech
• Internet


WSD systems
• DSC-timbl
  • Memory-based learning classifier
  • Supervised k-nearest-neighbour
• DSC-svm
  • Linear classifier / Support Vector Machines
  • Binary one-vs-all classifiers
• DSC-ukb
  • Knowledge-based system
  • Personalized PageRank algorithm
  • Synsets → nodes, relations → edges
  • Context words inject mass into word senses
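The memory-based (k-NN) idea behind DSC-timbl can be illustrated with a toy word expert: store every annotated context and label a new occurrence by majority vote among its nearest neighbours. This is a minimal sketch with invented Dutch examples and word-overlap as the similarity measure; the real DSC-timbl system uses TiMBL with richer features and distance metrics.

```python
from collections import Counter

def knn_sense(train, context, k=3):
    """Toy memory-based (k-NN) word expert: rank stored training
    examples by context-word overlap and take a majority vote among
    the k nearest neighbours."""
    scored = sorted(
        train,
        key=lambda ex: len(set(ex["context"]) & set(context)),
        reverse=True,
    )
    votes = Counter(ex["sense"] for ex in scored[:k])
    return votes.most_common(1)[0][0]

# Invented examples for the ambiguous Dutch noun "bank"
# (park bench vs. financial institution)
train = [
    {"context": ["zitten", "park", "hout"], "sense": "bank/seat"},
    {"context": ["park", "groen", "zon"], "sense": "bank/seat"},
    {"context": ["geld", "rekening", "lening"], "sense": "bank/finance"},
    {"context": ["lening", "rente", "geld"], "sense": "bank/finance"},
]
print(knn_sense(train, ["geld", "rente", "sparen"]))  # bank/finance
```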


Balanced-sense corpus
• 2,870 most polysemous and frequent words (11,982 meanings, average polysemy 3)
• Student assistants, 2 years
• SAT tool and Web-snippets tool
• 80% agreement, 25 examples per sense
• 282,503 tokens double annotated
  • 80% of senses with more than 25 examples
  • 90% of lemmas with 25 examples for each sense
• Distribution → 67% SoNaR, 5% CGN, 28% web


WSD from balanced sense
• 5-fold cross-validation at sense level, focus on nouns
• Optimized for annotating SoNaR
  • Specific features (word_id)
• Overall result for nouns → 82.76
• Results used to further annotate weakly performing senses
• Active Learning approach
  • Select the 82 lemmas performing under 80%
  • 3 rounds of annotation until reaching 81.62%
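The active-learning selection step above amounts to a simple threshold filter over per-lemma cross-validation scores; the lemma names and accuracies below are invented for illustration.

```python
def select_for_annotation(accuracy_by_lemma, threshold=0.80):
    """Pick the lemmas whose cross-validated WSD accuracy falls below
    the threshold; these receive extra annotated examples in the next
    active-learning round."""
    return sorted(l for l, acc in accuracy_by_lemma.items() if acc < threshold)

# Invented per-lemma accuracies
scores = {"bank": 0.91, "blad": 0.74, "slag": 0.78, "licht": 0.85}
print(select_for_annotation(scores))  # ['blad', 'slag']
```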


Balanced context
• Try to annotate the whole corpus → as many contexts as the whole corpus → requires a good WSD system → improve the problematic cases
• Select all words performing under 80%
• Annotate the whole corpus with the (optimized) DSC-timbl system
• 50 new tokens for the senses of words under 80%, each from a different context
  • High confidence
  • Low distance / high distance to the nearest neighbour
• Manually annotate these 50 tokens
  • Completely different from the first phase, where annotators could choose the examples
  • Reveals lemmatization errors, PoS errors, figurative, idiomatic and unknown senses
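The selection of new tokens described above can be sketched as follows: keep only high-confidence automatic annotations, then rank them by distance to the nearest training neighbour so that both very similar (LowD) and very different (HighD) contexts can be sampled. Field names and thresholds here are illustrative, not the project's actual parameters.

```python
def pick_new_examples(candidates, n=50, min_confidence=0.9):
    """Sketch of the balanced-context selection step: filter candidate
    tokens of one sense by classifier confidence, sort by distance to
    the nearest training neighbour, and return the n closest (LowD)
    and n farthest (HighD) for manual annotation."""
    confident = [c for c in candidates if c["confidence"] >= min_confidence]
    by_distance = sorted(confident, key=lambda c: c["distance"])
    return by_distance[:n], by_distance[-n:]  # (LowD, HighD)

# Invented candidate tokens
cands = [
    {"id": 1, "confidence": 0.95, "distance": 0.1},
    {"id": 2, "confidence": 0.99, "distance": 0.9},
    {"id": 3, "confidence": 0.50, "distance": 0.2},  # filtered out
]
low, high = pick_new_examples(cands, n=1)
print(low[0]["id"], high[0]["id"])  # 1 2
```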


Evaluating the balanced-sense and new annotations

Type                             Accuracy  # examples
Balanced Sense (BS)              81.62     8641
BS + LowD                        78.81     13266
BS + LowD_agreed                 85.02     11405
BS + HighD                       76.24     19055
BS + HighD_agreed                83.77     13359
BS + LowD_agreed + HighD_agreed  85.33     16123

• DSC-timbl, 5-fold cross-validation (folds incremented with the new data), 82 lemmas
• Better results when using agreed data
• High/low distance does not make a big difference


Evaluation of the balanced-context corpus
• 5-fold cross-validation using the agreed new instances
• Best is majority voting

System     Nouns  Verbs  Adjs
DSC-timbl  83.97  83.44  78.64
DSC-svm    82.69  84.93  79.03
DSC-ukb    73.04  55.84  56.36
Voting     88.65  87.60  83.06
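The majority-voting combination that wins here can be sketched in a few lines: each system casts one vote per token and the most frequent sense label wins. This is a plain-majority sketch; the deck does not specify the project's exact tie-breaking scheme, so ties here simply fall to the first-listed system.

```python
from collections import Counter

def vote(predictions):
    """Combine the sense predictions of several WSD systems for one
    token by simple majority voting; on a tie, the prediction listed
    first wins (Counter preserves insertion order)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three systems for one token
print(vote(["bank/finance", "bank/finance", "bank/seat"]))  # bank/finance
```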


Evaluating representativeness
• Our manually annotated corpus is probably skewed towards the balanced senses
• We needed to test the performance of our WSD systems on the rest of SoNaR
• Random evaluation
  • Ranges of accuracy (90-100, 80-90, 70-80, 60-70)
  • 5 nouns, 5 verbs and 3 adjectives per range → 52 lemmas
• 100 tokens per lemma, automatically tagged and manually validated


Evaluating representativeness
• Results lower than in the previous evaluations
• Shows the difference between representing the lexicon (senses) and representing the corpus
• Results comparable to state-of-the-art English Senseval/SemEval results

System     Nouns  Verbs  Adjs
DSC-timbl  54.25  48.25  46.50
DSC-svm    64.10  52.20  52.00
DSC-ukb    49.37  44.15  38.13
Voting     60.70  53.95  50.83


Obtaining sense distributions
• Approach
  • Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
  • Assume that the automatic annotation still reflects the real distribution
  • Evaluate this frequency distribution (Most Frequent Sense)
• How can this MFS approach be evaluated?
  • Manual annotations: 25 examples per sense, so no sense distribution
  • Random evaluation corpus: only a small selection of words (52 lemmas)
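Deriving an MFS predictor from automatic annotations reduces to counting how often each lemma receives each sense and keeping the top one per lemma. A minimal sketch with invented data follows.

```python
from collections import Counter

def most_frequent_sense(tagged_tokens):
    """Derive a Most Frequent Sense predictor per lemma from a list of
    (lemma, sense) pairs produced by automatic tagging: the sense a
    lemma receives most often becomes its default sense."""
    counts = {}
    for lemma, sense in tagged_tokens:
        counts.setdefault(lemma, Counter())[sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

# Invented automatically tagged tokens
tokens = [("bank", "bank/finance"), ("bank", "bank/seat"),
          ("bank", "bank/finance"), ("blad", "blad/leaf")]
print(most_frequent_sense(tokens))
# {'bank': 'bank/finance', 'blad': 'blad/leaf'}
```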


Obtaining sense distributions
• An all-words corpus was created
  • Completely independent texts from Lassy
  • Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
  • 23,907 tokens; covers 1,527 of our set of lemmas (53%)
• Evaluation of
  • The 3 WSD systems
  • First-sense baseline according to Cornetto
  • Random-sense baseline
  • Most frequent sense
    • Sense distributions obtained from the automatic annotation


Obtaining sense distributions
• MFS in Dutch behaves similarly to MFS in English
• MFS beats the first-sense and random-sense baselines
• The automatically derived MFS is a good predictor

System        Nouns  Verbs  Adjs
1st sense     53.17  32.84  52.17
Random sense  29.52  24.99  32.16
MFS           61.20  50.76  54.62
DSC-timbl     55.76  37.96  49.00
DSC-svm       64.58  45.81  55.70
DSC-ukb       56.81  31.37  35.93
Voting        66.09  45.68  52.24


Numbers of DSC
• Balanced-sense annotated corpus
  • 274,344 tokens
  • 2,874 lemmas
  • Annotated by 2 annotators, 90% IAA
• Balanced-context annotated corpus
  • 132,666 tokens
  • 1,133 lemmas
  • Manually annotated by 1 annotator, agreeing with the WSD output in 44% of cases
• Random evaluation corpus
  • 5,200 tokens
  • 52 lemmas
• All-words corpus
  • 23,907 tokens
  • 1,527 lemmas
• 3 WSD systems for Dutch
  • DSC-timbl
  • DSC-svm
  • DSC-ukb
• Automatic annotations by the 3 WSD systems
  • Sense distributions
  • 48 million tokens with confidence scores
• … and more…
  • 800,000 semantic relations between senses extracted from the manual annotations
  • 28,080 sense groups
  • Improved version of Cornetto
  • SAT annotation tool
  • Web search tool
  • Statistics on figurative, idiomatic and collocational usage of words
  • …

Piek Vossen [email protected]

Rubén Izquierdo [email protected]

Attila Görög [email protected]

Thanks for your attention