RANLP 2013: DutchSemCor in Quest of the Ideal Corpus
TRANSCRIPT
DutchSemCor: in Quest of the Ideal Sense Tagged Corpus
Piek Vossen [email protected]
Rubén Izquierdo
Attila Görög [email protected]
2
Outline
- Main goal of our project
- WSD and annotated corpora
- Our approach
- Balanced-sense corpus and evaluation
- Balanced-context corpus and evaluation
- Sense distributions, all-words corpus and evaluation
- Numbers…
3
Main goal of DSC
Deliver a Dutch corpus enriched with semantic information:
- Senses of the most frequent and most polysemous words
- Domains
- Named Entities linked with Wikipedia
1 million sense-tagged tokens:
- 250K tagged manually by 2 annotators
- 750K tagged by 1 annotator / automatically through Active Learning
4
Current Insights on Word Sense Disambiguation
1. Evaluation tasks depend on the corpus / lexicon: results seem to depend more on the evaluation data than on the WSD systems. Are the evaluation corpora diverse enough?
2. The most frequent sense from SemCor is difficult to beat. Are evaluation tasks neglecting low-frequency senses?
3. Predominant senses in specific domains give the best results.
4. Supervised systems beat unsupervised systems. Which are the best corpora for WSD?
What should the ideal corpus for WSD look like? We:
- Define criteria for the ideal sense-tagged corpus
- Describe a novel approach for building a large-scale sense-tagged corpus that meets these criteria (with as little manual effort as possible)
5
Criteria for a corpus
A good corpus for WSD should:
- Be balanced for different senses: an equal number of examples for each meaning
- Be balanced for different contexts: different usages of the words
- Provide information on sense frequencies (across domains and genres): how often a word occurs in each of its meanings
6
Annotating a corpus

Sequential tagging (All-Words corpus):
- Whole text; annotators reconsider meanings
- Small numbers of texts, genres, domains and senses
- Yields sense distributions (e.g. SemCor)

Targeted tagging (Lexical Sample corpus; balanced sense or balanced context):
- KWIC; repeated contexts
- Usually a large number of contexts and senses
- e.g. line-hard-serve, DSO
7
Annotating a corpus

                   Sense distribution   Sense coverage   Context diversity
All words          ✔                    ✖                ✖
Balanced-sense     ✖                    ✔                ✖
Balanced-context   ✖                    ✖                ✔
8
Our main approach
1. Annotate a corpus that represents ALL the meanings of an existing lexicon (balanced sense; manual)
2. Train WSD systems on the annotated corpus, so they are trained for all the senses
3. Extend the annotated corpus to acquire a wider representation of contexts (balanced context; manual + WSD)
4. Annotate the full raw corpus (sense distributions; WSD)
5. Evaluate the annotations against the 3 criteria
9
Resources
- Cornetto database: lexical-semantic database for Dutch; structure and content of WordNet plus FrameNet-like data
- SoNaR (500M tokens): Dutch texts covering a wide range of genres and topics; 34 categories (discussion lists, books, chats, autocues…)
- CGN (9M tokens): transcribed spontaneous Dutch adult speech
- Internet
10
WSD systems
- DSC-timbl: memory-based learning classifier; supervised k-nearest neighbour
- DSC-SVM: linear classifier / Support Vector Machines; binary classifiers, one-vs-all
- DSC-UKB: knowledge-based system; personalized PageRank algorithm; synsets as nodes, relations as edges; context words inject mass into word senses
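The DSC-UKB idea of context words injecting mass into sense nodes can be sketched with a minimal personalized PageRank over a toy sense graph. The graph, node names, damping factor and iteration count are illustrative assumptions, not taken from the actual UKB implementation:

```python
# Minimal sketch of personalized PageRank over a toy sense graph:
# synsets are nodes, relations are edges, and context words inject
# restart mass into their candidate sense nodes.

def personalized_pagerank(graph, teleport, damping=0.85, iters=50):
    """graph: node -> list of neighbour nodes; teleport: node -> restart mass."""
    nodes = list(graph)
    total = sum(teleport.get(n, 0.0) for n in nodes)
    restart = {n: teleport.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # every node keeps (1 - damping) of its restart mass ...
        new = {n: (1 - damping) * restart[n] for n in nodes}
        # ... and spreads the damped share of its rank over its neighbours
        for n in nodes:
            out = graph[n]
            if not out:
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                new[m] += share
        rank = new
    return rank

# Toy graph: two senses of "bank" linked to senses of context words.
graph = {
    "bank#money": ["deposit#n1"],
    "bank#river": ["water#n1"],
    "deposit#n1": ["bank#money"],
    "water#n1": ["bank#river"],
}
# The context word "deposit" injects mass into its sense node,
# which pulls the ranking towards the financial sense of "bank".
rank = personalized_pagerank(graph, {"deposit#n1": 1.0})
assert rank["bank#money"] > rank["bank#river"]
```

The highest-ranked candidate sense of the target word is then chosen as its label.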
11
Balanced-sense corpus
- 2,870 most polysemous and frequent words (11,982 meanings; avg. polysemy 3)
- Student assistants, 2 years
- SAT tool and Web-snippets tool
- 80% agreement; 25 examples per sense
- 282,503 tokens double annotated; 80% of senses with more than 25 examples; 90% of lemmas with 25 examples for each sense
- Distribution: 67% SoNaR, 5% CGN, 28% web
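The agreement threshold above can be read as observed agreement between the two annotators: the share of tokens to which both assigned the same sense. A minimal sketch, assuming this simple definition and using toy annotations:

```python
# Minimal sketch of observed inter-annotator agreement: the fraction of
# tokens where both annotators chose the same sense label.
# The annotation lists below are illustrative toy data.

def observed_agreement(ann1, ann2):
    """ann1, ann2: parallel lists of sense labels, one per token."""
    assert len(ann1) == len(ann2)
    same = sum(a == b for a, b in zip(ann1, ann2))
    return same / len(ann1)

ann1 = ["bank#1", "bank#1", "bank#2", "bank#1", "bank#2"]
ann2 = ["bank#1", "bank#2", "bank#2", "bank#1", "bank#2"]
print(observed_agreement(ann1, ann2))  # 0.8
```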
14
WSD from the balanced-sense corpus
- 5-fold cross-validation at sense level, with a focus on nouns
- Optimized for annotating SoNaR; specific features (word_id)
- Overall result for nouns: 82.76
- Results used to further annotate weakly performing senses
- Active Learning approach: select 82 lemmas performing under 80%; 3 rounds of annotation until reaching 81.62%
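The 5-fold cross-validation setup can be sketched with a toy memory-based (1-nearest-neighbour) classifier over bag-of-words contexts, loosely mirroring the k-NN character of DSC-timbl. The data, the overlap similarity, and the fold-splitting scheme are illustrative assumptions:

```python
# Minimal sketch of 5-fold cross-validation with a 1-NN classifier
# over bag-of-words contexts. All examples below are toy data.

def knn_predict(train, context):
    """Return the sense of the training example with the largest word overlap."""
    def sim(a, b):
        return len(set(a) & set(b))
    return max(train, key=lambda ex: sim(ex[0], context))[1]

def five_fold_accuracy(examples, k=5):
    correct = total = 0
    for fold in range(k):
        test = examples[fold::k]  # every k-th example held out
        train = [ex for i, ex in enumerate(examples) if i % k != fold]
        for context, sense in test:
            correct += knn_predict(train, context) == sense
            total += 1
    return correct / total

examples = [
    (["river", "water"], "bank#river"), (["money", "loan"], "bank#money"),
    (["water", "shore"], "bank#river"), (["loan", "rate"], "bank#money"),
    (["shore", "river"], "bank#river"), (["rate", "money"], "bank#money"),
    (["water", "river"], "bank#river"), (["money", "rate"], "bank#money"),
    (["shore", "water"], "bank#river"), (["loan", "money"], "bank#money"),
]
print(five_fold_accuracy(examples))  # 1.0
```

Per-sense accuracies computed this way are what flags the weakly performing senses for further annotation.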
17
Balanced context
- Goal: cover as many contexts as the whole corpus contains, have a good WSD system, and improve problematic cases
- Select all words performing under 80%
- Annotate the whole corpus with the (optimized) Timbl-WSD system
- Select 50 new tokens for each sense of the words under 80%, each from a different context, with high confidence and either a low or a high distance to the nearest neighbour
- Manually annotate these 50 tokens: completely different from the first phase, where annotators could choose their examples; reveals lemmatization errors, PoS errors, and figurative, idiomatic or unknown senses
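The token-selection step above can be sketched as follows. The record fields, the distance cutoff, and the data are illustrative assumptions; only the overall logic (per-sense selection, distinct contexts, low- vs. high-distance split) comes from the slide:

```python
# Minimal sketch of selecting automatically tagged tokens for manual
# annotation: up to n new examples per weak sense, preferring distinct
# contexts, split by distance to the nearest training neighbour.
# Field names, the 0.5 cutoff, and the data are illustrative.

def select_for_annotation(tagged, sense, n=50, cutoff=0.5):
    """tagged: list of dicts with 'sense', 'context', 'distance' keys."""
    pool = [t for t in tagged if t["sense"] == sense]
    seen, picked = set(), []
    for t in sorted(pool, key=lambda t: t["distance"]):
        ctx = tuple(t["context"])
        if ctx in seen:  # prefer tokens from different contexts
            continue
        seen.add(ctx)
        picked.append(t)
        if len(picked) == n:
            break
    low = [t for t in picked if t["distance"] <= cutoff]
    high = [t for t in picked if t["distance"] > cutoff]
    return low, high

tagged = [
    {"sense": "bank#2", "context": ["steep", "bank"], "distance": 0.1},
    {"sense": "bank#2", "context": ["steep", "bank"], "distance": 0.2},
    {"sense": "bank#2", "context": ["grassy", "bank"], "distance": 0.9},
    {"sense": "bank#1", "context": ["bank", "loan"], "distance": 0.3},
]
low, high = select_for_annotation(tagged, "bank#2")
print(len(low), len(high))  # 1 1
```

The two distance groups correspond to the LowD and HighD annotation batches evaluated on the next slide.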
18
Evaluating the balanced-sense and new annotations

Type                              Accuracy   # examples
Balanced Sense (BS)               81.62      8,641
BS + LowD                         78.81      13,266
BS + LowD_agreed                  85.02      11,405
BS + HighD                        76.24      19,055
BS + HighD_agreed                 83.77      13,359
BS + LowD_agreed + HighD_agreed   85.33      16,123

- Timbl-DSC, 5-fold CV (folds incremented with the new data), 82 lemmas
- Better results when using agreed data
- High vs. low distance does not make a big difference
19
Evaluation of the balanced-context corpus
- 5-fold CV using the agreed new instances
- The best result comes from majority voting

System      Nouns   Verbs   Adjs
DSC-timbl   83.97   83.44   78.64
DSC-svm     82.69   84.93   79.03
DSC-ukb     73.04   55.84   56.36
Voting      88.65   87.60   83.06
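The voting row combines the three systems by majority. A minimal sketch, assuming plain majority voting with ties broken by the order in which system outputs are listed (the tie-breaking rule is an illustrative assumption):

```python
# Minimal sketch of majority voting over the three WSD systems'
# predictions for one token. Ties fall to the label seen first
# in the fixed system order (an illustrative assumption).

from collections import Counter

def vote(predictions):
    """predictions: list of sense labels, one per system, in a fixed order."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# e.g. DSC-timbl and DSC-ukb agree, DSC-svm disagrees:
print(vote(["bank#1", "bank#2", "bank#1"]))  # bank#1
```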
20
Evaluating representativeness
- Our manually annotated corpus is probably skewed towards the balanced-sense design
- We therefore need to test the performance of our WSD systems on the rest of SoNaR
- Random evaluation: accuracy ranges (90-100, 80-90, 70-80, 60-70); 5 nouns, 5 verbs and 3 adjectives per range, i.e. 52 lemmas; 100 tokens per lemma, automatically tagged and manually validated
21
Evaluating representativeness
- Results are lower than in the previous evaluations
- This reflects the difference between an approach that represents the lexicon (senses) and one that represents the corpus
- Results are comparable to state-of-the-art English Senseval/SemEval results
System Nouns Verbs Adjs
DSC-timbl 54.25 48.25 46.50
DSC-svm 64.10 52.20 52.00
DSC-ukb 49.37 44.15 38.13
Voting 60.70 53.95 50.83
22
Obtaining sense distributions
- Approach: annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
- Assume that the automatic annotation still reflects the real distribution
- Evaluate this frequency distribution (Most Frequent Sense)
How can this MFS approach be evaluated?
- Manual annotations: 25 examples per sense, so no sense distribution
- Random evaluation corpus: only a small selection of words (52 lemmas)
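Deriving a most-frequent-sense baseline from the automatic annotation amounts to counting, per lemma, how often each sense is assigned and always predicting the top sense. A minimal sketch with illustrative counts:

```python
# Minimal sketch of an MFS baseline derived from automatically
# annotated data: count sense assignments per lemma, predict the
# most frequent one. The annotation pairs below are toy data.

from collections import Counter, defaultdict

def mfs_table(annotations):
    """annotations: iterable of (lemma, sense) pairs from the tagged corpus."""
    by_lemma = defaultdict(Counter)
    for lemma, sense in annotations:
        by_lemma[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in by_lemma.items()}

annotations = [("bank", "bank#money")] * 7 + [("bank", "bank#river")] * 3
mfs = mfs_table(annotations)
print(mfs["bank"])  # bank#money
```

If the automatic annotation reflects the real distribution, this table is exactly the MFS predictor evaluated on the following slides.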
23
Obtaining sense distributions
- An all-words corpus was created: completely independent texts from Lassy (medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia); 23,907 tokens, covering 1,527 of our set of lemmas (53%)
- Evaluation of the 3 WSD systems against a first-sense baseline according to Cornetto, a random-sense baseline, and the most frequent sense
- Sense distributions obtained from the automatic annotation
24
Obtaining sense distributions
- The Dutch MFS behaves similarly to the English MFS
- MFS is better than the first-sense and random-sense baselines
- The automatically derived MFS is a good predictor
System Nouns Verbs Adjs
1st sense 53.17 32.84 52.17
Random sense 29.52 24.99 32.16
MFS 61.20 50.76 54.62
DSC-timbl 55.76 37.96 49.00
DSC-svm 64.58 45.81 55.70
DSC-ukb 56.81 31.37 35.93
Voting 66.09 45.68 52.24
25
Numbers of DSC
- Balanced-sense annotated corpus: 274,344 tokens, 2,874 lemmas, annotated by 2 annotators, 90% IAA
- Balanced-context annotated corpus: 132,666 tokens, 1,133 lemmas, manually annotated by 1 annotator, agreeing with the WSD output in 44% of cases
- Random evaluation corpus: 5,200 tokens, 52 lemmas
- All-words corpus: 23,907 tokens, 1,527 lemmas
- 3 WSD systems for Dutch: DSC-timbl, DSC-svm, DSC-ukb
- Automatic annotations by the 3 WSD systems: sense distributions for 48 million tokens, with confidence scores
… and more:
- 800,000 semantic relations between senses extracted from the manual annotations
- 28,080 sense groups
- Improved version of Cornetto
- SAT annotation tool
- Web search tool
- Statistics on figurative, idiomatic and collocational usage of words
- …
DutchSemCor: in Quest of the Ideal Sense Tagged Corpus
Piek Vossen [email protected]
Rubén Izquierdo [email protected]
Attila Görög [email protected]
Thanks for your attention