RANLP 2013: DutchSemCor, in Quest of the Ideal Sense Tagged Corpus
TRANSCRIPT
DutchSemCor: in Quest of the Ideal Sense Tagged Corpus
Piek Vossen [email protected]
Rubén Izquierdo [email protected]
Attila Görög [email protected]
Outline
• Main goal of our project
• WSD and annotated corpora
• Our approach
• Balanced-sense corpus and evaluation
• Balanced-context corpus and evaluation
• Sense distributions, all-words corpus and evaluation
• Numbers…
Main goal of DSC
• Deliver a Dutch corpus enriched with semantic information:
  • Senses of the most frequent and most polysemous words
  • Domains
  • Named entities linked with Wikipedia
• 1 million sense-tagged tokens:
  • 250K tagged manually by 2 annotators
  • 750K tagged by 1 annotator / automatically through Active Learning
Current WSD
• Insights on Word Sense Disambiguation:
  1. Evaluation tasks depend on the corpus / lexicon
     • Results seem to depend more on the evaluation data than on the WSD systems
     • Are the evaluation corpora diverse enough?
  2. The most frequent sense from SemCor is difficult to beat
     • Are evaluation tasks neglecting low-frequency senses?
  3. Predominant senses in specific domains give the best results
  4. Supervised systems beat unsupervised systems
• Which are the best corpora for WSD?
• What should the ideal corpus for WSD look like? (we)
  • Define criteria for the ideal sense-tagged corpus
  • Describe a novel approach for building a large-scale sense-tagged corpus that meets these criteria (with as little manual effort as possible)
Criteria for a corpus
• A good corpus for WSD should:
  • Be balanced for different senses
    • An equal number of examples for each meaning
  • Be balanced for different contexts
    • Different usages of the words
  • Provide information on sense frequencies (across domains and genres)
    • Frequency of the words in a representative meaning
Annotating a corpus
• Sequential tagging → all-words corpus
  → Whole text
  → Reconsider meanings
  → Small numbers of texts, genres, domains and senses
  → Sense distributions
  → SemCor
• Targeted tagging → lexical sample corpus (balanced sense / balanced context)
  → KWIC
  → Repeated contexts
  → Usually large number of contexts and senses
  → line-hard-serve
  → DSO
Annotating a corpus

Corpus type       | Sense distribution | Sense coverage | Context diversity
All words         | ✔                  | ✖              | ✖
Balanced-sense    | ✖                  | ✔              | ✖
Balanced-context  | ✖                  | ✖              | ✔
Our main approach
1. Annotated corpus that represents ALL the meanings of an existing lexicon
   • Balanced sense
   • Manual
2. Train WSD systems using the annotated corpus
   • Trained for all the senses
3. Extend this annotated corpus to acquire a wider representation of contexts
   • Balanced context
   • Manual + WSD
4. Annotate the full raw corpus
   • Sense distributions
   • WSD
5. Evaluate the annotations against the 3 criteria
Resources
• Cornetto database
  • Lexical semantic database for Dutch
  • Structure and content of WordNet + FrameNet-like data
• SoNaR (500M tokens)
  • Dutch corpus with a wide range of genres and topics
  • 34 categories (discussion lists, books, chats, autocues…)
• CGN (9M tokens)
  • Transcribed spontaneous Dutch adult speech
• Internet
WSD systems
• DSC-timbl
  • Memory-based learning classifier
  • Supervised k-nearest neighbor
• DSC-SVM
  • Linear classifier / Support Vector Machines
  • Binary classifiers: 1 vs. all
• DSC-UKB
  • Knowledge-based system
  • Personalized PageRank algorithm
  • Synsets → nodes; relations → edges
  • Context words inject mass into word senses
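The DSC-UKB idea above can be sketched in a few lines: synsets become graph nodes, semantic relations become edges, and the context words of an ambiguous token inject probability mass (a non-uniform teleport distribution) into their candidate senses. The toy graph, sense ids and parameter values below are invented for illustration, not the actual Cornetto graph or UKB configuration.

```python
# Minimal personalized-PageRank sketch of a knowledge-based WSD system:
# synsets are nodes, relations are (undirected) edges, context words
# inject teleport mass into their senses. Toy data throughout.

def personalized_pagerank(edges, teleport, damping=0.85, iters=50):
    """Power iteration with a non-uniform teleport distribution."""
    nodes = sorted({n for e in edges for n in e} | set(teleport))
    neigh = {n: [] for n in nodes}
    for src, dst in edges:
        neigh[src].append(dst)
        neigh[dst].append(src)                 # relations treated as undirected
    total = sum(teleport.values())
    tp = {n: teleport.get(n, 0.0) / total for n in nodes}
    rank = dict(tp)
    for _ in range(iters):
        new = {n: (1 - damping) * tp[n] for n in nodes}
        for n in nodes:
            if neigh[n]:
                share = damping * rank[n] / len(neigh[n])
                for m in neigh[n]:
                    new[m] += share
            else:                              # dangling node: redistribute
                for m in nodes:
                    new[m] += damping * rank[n] * tp[m]
        rank = new
    return rank

# Toy synset graph: bank#1 (institution) vs. bank#2 (riverside).
edges = [("bank#1", "money#1"), ("bank#2", "river#1"),
         ("money#1", "loan#1"), ("river#1", "water#1")]
# Context words "money" and "loan" inject mass into their senses.
teleport = {"money#1": 1.0, "loan#1": 1.0}
rank = personalized_pagerank(edges, teleport)
best = max(["bank#1", "bank#2"], key=rank.get)
print(best)  # → bank#1 (the financial sense wins for this context)
```

With a financial context, mass flows from money#1 and loan#1 into bank#1, while bank#2 receives none, so the financial sense is chosen.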
Balanced-sense corpus
• 2,870 most polysemous and frequent words (11,982 meanings; avg. polysemy 3)
• Student assistants, 2 years
• SAT tool and Web-snippets tool
• 80% agreement, 25 examples per sense
• 282,503 tokens double-annotated
  • 80% of senses with more than 25 examples
  • 90% of lemmas with 25 examples for each sense
• Distribution → 67% SoNaR, 5% CGN, 28% web
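The 80%-agreement criterion mentioned above is plain observed agreement between the two annotators' sense labels. A minimal sketch (the sense ids and annotations below are invented; this computes raw agreement, not a chance-corrected measure such as kappa):

```python
# Hypothetical sketch: the project accepts a lemma's annotations once
# the two annotators agree on at least 80% of the tokens.

def observed_agreement(labels_a, labels_b):
    """Fraction of tokens on which both annotators chose the same sense."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy double annotations for one lemma (sense ids are made up).
ann1 = ["s1", "s1", "s2", "s2", "s1", "s3", "s2", "s1", "s1", "s2"]
ann2 = ["s1", "s1", "s2", "s1", "s1", "s3", "s2", "s1", "s2", "s2"]

agreement = observed_agreement(ann1, ann2)
print(f"{agreement:.0%}")  # → 80%
accepted = agreement >= 0.80
```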
WSD from balanced sense
• 5-fold cross-validation (5-FCV) at sense level, with a focus on nouns
• Optimized for annotating SoNaR
  • Specific features (word_id)
• Overall result for nouns → 82.76
• Results used to further annotate weakly performing senses
• Active Learning approach
  • Select the 82 lemmas performing under 80%
  • 3 rounds of annotation until reaching 81.62%
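The Active Learning loop above can be sketched as follows. The scores, the 5-point gain per annotation batch, and the `annotate_more` / `retrain_and_eval` callbacks are hypothetical stand-ins for the real annotation workflow and cross-validation:

```python
# Sketch of an active-learning loop: lemmas scoring under 80% in
# cross-validation receive extra annotated examples, the classifier is
# retrained, and the check repeats for a bounded number of rounds.

def active_learning(lemma_scores, annotate_more, retrain_and_eval,
                    threshold=0.80, max_rounds=3):
    """Return per-lemma accuracies after targeted re-annotation."""
    scores = dict(lemma_scores)
    for _ in range(max_rounds):
        weak = [lem for lem, acc in scores.items() if acc < threshold]
        if not weak:
            break
        for lemma in weak:
            annotate_more(lemma)               # request new tokens for this lemma
        scores.update(retrain_and_eval(weak))  # re-run cross-validation
    return scores

# Toy simulation with invented numbers.
demo = {"bank": 0.72, "blad": 0.78, "slag": 0.85}
extra = {"bank": 0, "blad": 0, "slag": 0}

def annotate_more(lemma):
    extra[lemma] += 50                         # 50 new annotated tokens per round

def retrain_and_eval(weak):
    # stand-in: pretend each batch of 50 tokens adds 5 accuracy points
    return {l: demo[l] + 0.05 * (extra[l] // 50) for l in weak}

result = active_learning(demo, annotate_more, retrain_and_eval)
```

After the simulated rounds every lemma reaches the 80% threshold, mirroring the shape (not the numbers) of the three annotation rounds on the slide.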
Balanced context
• Try to annotate the whole corpus → as many contexts as the whole corpus → have a good WSD → improve problematic cases
• Select all words performing under 80%
• Annotate the whole corpus with the Timbl-WSD system (optimized)
• 50 new tokens for senses of words under 80%, each in a different context
  • High confidence
  • Low distance / high distance to the nearest neighbor
• Manually annotate these 50
  • Completely different from the first phase, where annotators could choose
  • Lemmatization errors, PoS errors, figurative, idiomatic and unknown senses
Evaluating the balanced-sense and new annotations

Type                             | Accuracy | # examples
Balanced Sense (BS)              | 81.62    | 8,641
BS + LowD                        | 78.81    | 13,266
BS + LowD_agreed                 | 85.02    | 11,405
BS + HighD                       | 76.24    | 19,055
BS + HighD_agreed                | 83.77    | 13,359
BS + LowD_agreed + HighD_agreed  | 85.33    | 16,123

• Timbl-DSC, 5-FCV (folds incremented with new data), 82 lemmas
• Better results when using agreed data
• High/low distance does not make a big difference
Evaluation of the balanced-context corpus
• 5-FCV using agreed new instances
• Best is majority voting

System     | Nouns | Verbs | Adjs
DSC-timbl  | 83.97 | 83.44 | 78.64
DSC-svm    | 82.69 | 84.93 | 79.03
DSC-ukb    | 73.04 | 55.84 | 56.36
Voting     | 88.65 | 87.60 | 83.06
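The majority voting that wins in the table above can be sketched as follows. The slides do not specify how ties among the three systems are broken, so the fixed system-preference order used here is an assumption:

```python
# Sketch of combining the three WSD systems by majority vote.
from collections import Counter

# Preference order used only to break ties (assumption, not from the slides).
PRIORITY = ["DSC-timbl", "DSC-svm", "DSC-ukb"]

def vote(predictions):
    """predictions: {system_name: sense_id} -> winning sense_id."""
    counts = Counter(predictions.values())
    top = max(counts.values())
    tied = {s for s, c in counts.items() if c == top}
    if len(tied) == 1:
        return tied.pop()
    for system in PRIORITY:          # tie: trust the preferred system
        if predictions[system] in tied:
            return predictions[system]

sense = vote({"DSC-timbl": "bank#2", "DSC-svm": "bank#1", "DSC-ukb": "bank#1"})
print(sense)  # → bank#1 (two of three systems agree)
```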
Evaluating representativeness
• Our manually annotated corpus is probably skewed towards the balanced-sense setting
• Required to test the performance of our WSD on the rest of SoNaR
• Random evaluation
  • Ranges of accuracy (90-100, 80-90, 70-80, 60-70)
  • 5 nouns, 5 verbs and 3 adjectives per range → 52 lemmas
  • 100 tokens for each lemma, automatically tagged and manually validated
Evaluating representativeness
• Results lower than in the previous evaluations
• Difference between an approach representing the lexicon (sense) and one representing the corpus
• Results comparable to state-of-the-art English Sens/Sem-eval results

System     | Nouns | Verbs | Adjs
DSC-timbl  | 54.25 | 48.25 | 46.50
DSC-svm    | 64.10 | 52.20 | 52.00
DSC-ukb    | 49.37 | 44.15 | 38.13
Voting     | 60.70 | 53.95 | 50.83
Obtaining sense distributions
• Approach
  • Annotate the remainder of SoNaR with the WSD systems and obtain sense frequencies
  • Assume that the automatic annotation still reflects the real distribution
  • Evaluate this frequency distribution (Most Frequent Sense)
• How can this MFS approach be evaluated?
  • Manual annotations
    • 25 examples per sense, no sense distribution
  • Random evaluation corpus
    • Only a small selection of words (52 lemmas)
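Deriving a Most-Frequent-Sense predictor from the automatic annotations amounts to counting, per lemma, how often each sense was assigned and always predicting the top one. A minimal sketch (the tagged sample below is invented):

```python
# Sketch: build an MFS predictor from automatically sense-tagged tokens.
from collections import Counter, defaultdict

def build_mfs(tagged_tokens):
    """tagged_tokens: iterable of (lemma, sense) pairs -> {lemma: MFS}."""
    counts = defaultdict(Counter)
    for lemma, sense in tagged_tokens:
        counts[lemma][sense] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

# Invented automatic annotations for two Dutch lemmas.
auto_tagged = [("bank", "bank#1"), ("bank", "bank#1"), ("bank", "bank#2"),
               ("blad", "blad#3"), ("blad", "blad#3"), ("blad", "blad#1")]
mfs = build_mfs(auto_tagged)
print(mfs["bank"])  # → bank#1
```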
Obtaining sense distributions
• An all-words corpus was created
  • Completely independent texts from Lassy
  • Medical journals, manuals, newspapers, magazines, reports, websites, Wikipedia
  • 23,907 tokens, covering 1,527 of our set of lemmas (53%)
• Evaluation of:
  • The 3 WSD systems
  • First-sense baseline according to Cornetto
  • Random-sense baseline
  • Most frequent sense
    • Sense distributions obtained from the automatic annotation
Obtaining sense distributions
• MFS in Dutch behaves similarly to English MFS
• MFS beats the first-sense and random-sense baselines
• The automatically derived MFS is a good predictor

System        | Nouns | Verbs | Adjs
1st sense     | 53.17 | 32.84 | 52.17
Random sense  | 29.52 | 24.99 | 32.16
MFS           | 61.20 | 50.76 | 54.62
DSC-timbl     | 55.76 | 37.96 | 49.00
DSC-svm       | 64.58 | 45.81 | 55.70
DSC-ukb       | 56.81 | 31.37 | 35.93
Voting        | 66.09 | 45.68 | 52.24
Numbers of DSC
• Balanced-sense annotated corpus
  • 274,344 tokens
  • 2,874 lemmas
  • Annotated by 2 annotators, 90% IAA
• Balanced-context annotated corpus
  • 132,666 tokens
  • 1,133 lemmas
  • Manually annotated by 1 annotator, agreeing with the WSD in 44% of cases
• Random evaluation corpus
  • 5,200 tokens
  • 52 lemmas
• All-words corpus
  • 23,907 tokens
  • 1,527 lemmas
• 3 WSD systems for Dutch
  • DSC-timbl
  • DSC-svm
  • DSC-ukb
• Automatic annotations by the 3 WSD systems
  • Sense distributions
  • 48 million tokens with confidence scores
• … and more…
  • 800,000 semantic relations between senses extracted from the manual annotations
  • 28,080 sense groups
  • Improved version of Cornetto
  • SAT annotation tool
  • Web search tool
  • Statistics on figurative, idiomatic and collocational usage of words
  • …
Piek Vossen [email protected]
Rubén Izquierdo [email protected]
Attila Görög [email protected]
Thanks for your attention