CLIN 2012: DutchSemCor, building a semantically annotated corpus for Dutch

DutchSemCor

Building a semantically annotated corpus for Dutch

Piek Vossen, Attila Görög, VU University Amsterdam
Fons Laan, ISLA, University of Amsterdam
Rubén Izquierdo, Tilburg University
Antal van den Bosch, Maarten van Gompel, Radboud University Nijmegen

CLIN 22, Tilburg University, 20/01/2012

Overview

Project goals and planning
Current progress
Word-sense-disambiguation results
Active learning phase

Goals and planning

Funded by NWO, 2009-2012
Create a large semantically tagged corpus for Dutch:
  Sense tags from the Cornetto database (includes the Dutch wordnet)
  Domain labels from WordNet Domains
  Named entities mapped to Wikipedia

Global procedure

Phase 1:
  25 examples per meaning for the 3,000 most polysemous and frequent nouns, verbs and adjectives (average number of meanings = 3)
  Annotated by two student assistants
  Minimal inter-annotator agreement (IAA): 80%
Phase 2:
  Word-sense-disambiguation (WSD) systems trained with the data of phase 1
  Active learning: add examples for low-performing words and meanings until we reach an accuracy of 80% or no further progress
Phase 3:
  Apply WSD to the rest of the full corpus
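
The 80% agreement threshold of phase 1 can be illustrated with a small sketch. It assumes that IAA is measured as plain observed agreement (the share of tokens on which both annotators chose the same sense); the slides do not say which agreement measure was used, and the sense labels below are hypothetical.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens to which both annotators assigned the same sense."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no annotations to compare")
    same = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return same / len(tags_a)


# Hypothetical sense tags for five tokens of one lemma (not project data).
annotator_1 = ["bank_1", "bank_1", "bank_2", "bank_2", "bank_3"]
annotator_2 = ["bank_1", "bank_1", "bank_2", "bank_3", "bank_3"]

iaa = observed_agreement(annotator_1, annotator_2)
# Assumption: a lemma passes phase 1 when agreement is at least 80%.
meets_threshold = iaa >= 0.80
```

A chance-corrected measure such as Cohen's kappa would give lower values on the same data, so the threshold is only comparable across words when the same measure is used throughout.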

Corpora

SoNaR: 500M tokens of written Dutch
CGN: 1M tokens of spoken Dutch
Web snippets mediated through WebCorp (http://www.webcorp.org.uk/)
  Used when no or insufficient examples are found for particular senses in SoNaR and CGN
  Students select snippets (target word and context), which are added to the corpus in the SoNaR annotation format

Annotation tool

Current results of phase 1

PoS: nouns, verbs and adjectives
Number of annotated lemmas: 2,870
Number of word senses: 11,982
Number of overlapping annotations: 282,503 (67% SoNaR, 5% CGN, 28% snippets)
Inter-annotator agreement: 92%
Coverage of senses with 25 examples: 70%
Coverage of annotations for words: 79%

WSD Systems

UKB --> Knowledge-based WSD system that employs semantic relations

Tilburg WSD --> Supervised machine-learning based WSD system

UKB. Description

Knowledge-based (Agirre and Soroa, 2009)
WordNet considered as a graph
  Senses -> nodes
  Relations -> edges
Personalized PageRank algorithm
  Modification of traditional PageRank
  Context words act as source nodes injecting mass into word senses
  Assigns stronger probabilities to certain nodes
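
A minimal sketch of the idea, not the UKB implementation: a plain power-iteration Personalized PageRank over a toy undirected graph. The node names, the damping factor of 0.85, the iteration count and the way context words are attached to their senses are all assumptions made for illustration.

```python
def personalized_pagerank(edges, source_weights, damping=0.85, iterations=50):
    """Power iteration with a teleport vector concentrated on the source nodes.

    edges: iterable of (node, node) pairs, treated as undirected.
    source_weights: {node: weight}; every source must occur in `edges`.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    total = sum(source_weights.values())
    # Teleport mass goes only to the context-word nodes, not uniformly.
    teleport = {n: source_weights.get(n, 0.0) / total for n in neighbours}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in neighbours}
        for node, nbrs in neighbours.items():
            share = damping * rank[node] / len(nbrs)
            for nb in nbrs:
                new_rank[nb] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: two senses of "bank" and a few related senses.
edges = [
    ("bank#finance", "money#1"),
    ("bank#finance", "loan#1"),
    ("bank#river", "water#1"),
    ("bank#river", "shore#1"),
    # Context words of the sentence, linked to their own senses.
    ("w:geld", "money#1"),
    ("w:lening", "loan#1"),
]
rank = personalized_pagerank(edges, {"w:geld": 1.0, "w:lening": 1.0})
best_sense = max(["bank#finance", "bank#river"], key=rank.get)
```

In this toy graph the river sense is not reachable from the context words, so it receives no mass and the finance sense is chosen.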

UKB. Semantic relations

Dutch WordNet
English WordNet
Dutch WordNet ==> English WordNet
WordNet Domains
  tennis player, tennis ball => tennis => SPORT
  football player, football => soccer => SPORT
Annotation co-occurrence relations
  Polysemous => monosemous
  Polysemous => polysemous

UKB. Graph relations

Relation                                    Number
Dutch synset - Dutch synset                140,219
Domain - Domain                                125
Dutch synset - Domain                       86,798
Dutch synset - English synset               73,935
English synset - English synset            252,392
English synset - English gloss synset      419,387
Annotation co-occurrences, polysemous       17,152
Annotation co-occurrences, monosemous      151,598
TOTAL                                    1,266,481

Graph configurations: UKB-1, UKB-2, UKB-3
Annotation co-occurrences (AC)
UKB-4 = UKB-1 + AC
UKB-5 = UKB-3 + AC

UKB. Evaluation

        Precision  Recall  F-measure
UKB-1   0.4557     0.4491  0.4523
UKB-2   0.4557     0.4491  0.4524
UKB-3   0.4560     0.4493  0.4526
UKB-4   0.6360     0.6272  0.6316
UKB-5   0.6411     0.6322  0.6366

For comparison, SemEval-2010 task on WSD in a specific domain (all-words task):
  UKB3: 52.6 precision
  English UKB: 48.1 precision

UKB-4 and UKB-5 gained about 18 points over UKB-3 (F-measure 0.63-0.64 versus 0.45) due to the annotation co-occurrence relations
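
As a consistency check, the F-measure column can be recomputed from precision and recall. The sketch assumes the balanced F-measure (harmonic mean of precision and recall), which the slide does not state explicitly; the figures are copied from the rows for UKB-2 to UKB-5.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) as given in the evaluation table.
reported = {
    "UKB-2": (0.4557, 0.4491, 0.4524),
    "UKB-3": (0.4560, 0.4493, 0.4526),
    "UKB-4": (0.6360, 0.6272, 0.6316),
    "UKB-5": (0.6411, 0.6322, 0.6366),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _) in reported.items()}
# Largest gap between a recomputed and a reported value.
max_deviation = max(abs(recomputed[name] - f) for name, (_, _, f) in reported.items())
```

For these four rows the recomputed values agree with the reported ones to four decimal places, which supports the harmonic-mean reading.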

Tilburg WSD System

Based on TiMBL, a k-nearest-neighbour classifier (Daelemans et al., 2007)
Features:
  Local context (words in a window around the target)
  Global context (binary bag of words)
  SoNaR category (domain label)
Parameter search:
  Using the TiMBL leave-one-out feature
Evaluation:
  TEST: 10 examples per sense
  TRAIN: >= 15 examples per sense
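
The feature set can be sketched in a few lines of plain Python. This is an illustration only and does not call TiMBL: a simple overlap count stands in for TiMBL's distance metrics and feature weighting, the SoNaR category feature is left out, and the example sentences and sense labels are hypothetical.

```python
from collections import Counter


def extract_features(tokens, target_index, window=1):
    """Return (local context tuple, binary bag of words) for one target token."""
    local = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        i = target_index + offset
        # "_" pads positions that fall outside the sentence.
        local.append(tokens[i] if 0 <= i < len(tokens) else "_")
    bag = frozenset(t for i, t in enumerate(tokens) if i != target_index)
    return tuple(local), bag


def overlap_similarity(a, b):
    """Count matching local-context slots plus shared bag-of-words items."""
    (local_a, bag_a), (local_b, bag_b) = a, b
    local_matches = sum(1 for x, y in zip(local_a, local_b) if x == y)
    return local_matches + len(bag_a & bag_b)


def knn_classify(instance, training, k=1):
    """Majority vote over the k most similar training examples."""
    ranked = sorted(training, key=lambda ex: overlap_similarity(instance, ex[0]), reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training examples for the Dutch noun "bank".
raw = [
    ("ik zit op de bank in de kamer", 4, "bank_meubel"),
    ("de bank verhoogt de rente", 1, "bank_instelling"),
    ("zij ligt op de bank", 4, "bank_meubel"),
    ("de bank geeft een lening", 1, "bank_instelling"),
]
training = [(extract_features(s.split(), i), sense) for s, i, sense in raw]

test_instance = extract_features("hij slaapt op de bank".split(), 4)
predicted = knn_classify(test_instance, training, k=1)
```

With a window of 1 the local context holds only the word directly before and after the target, so most of the discriminating information in this toy example comes from the bag of words.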

Tilburg WSD System. First results

Feature set                            Token accuracy
Words[1]                               0.6462
Words[1] + Bag-of-words                0.7259
Words[1] + PoS[1] + Bag-of-words       0.7226
Words[1] + Bag-of-words + PS           0.7931

Bag-of-words: improvement of 8%
Parameter search (PS): improvement of another 7%

[1] Previous experiments suggest that the best size for the context window is 1

Tilburg WSD System. TiMBL Confidence

TiMBL confidence 0.55:
  Precision 0.84 (+0.44 compared to no filtering)
  F-score 0.78 (only 0.03 lower than with no filtering)
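
The trade-off behind confidence filtering can be sketched as follows. The sketch assumes that tokens whose classifier confidence falls below the cut-off are left untagged, that precision is computed over the tagged tokens and that recall is computed over all tokens; the predictions are hypothetical and are not TiMBL output.

```python
def filter_and_score(predictions, threshold):
    """predictions: list of (gold_sense, predicted_sense, confidence).

    Returns (precision, recall, f_score) when only predictions with
    confidence >= threshold are kept.
    """
    kept = [(gold, pred) for gold, pred, conf in predictions if conf >= threshold]
    correct = sum(1 for gold, pred in kept if gold == pred)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical classifier output for seven tokens.
predictions = [
    ("s1", "s1", 0.9),
    ("s1", "s1", 0.7),
    ("s2", "s2", 0.6),
    ("s2", "s1", 0.5),
    ("s1", "s2", 0.4),
    ("s2", "s2", 0.3),
    ("s1", "s1", 0.2),
]

p_all, r_all, f_all = filter_and_score(predictions, threshold=0.0)
p_cut, r_cut, f_cut = filter_and_score(predictions, threshold=0.55)
```

In this toy data the cut-off removes both errors but also two correct low-confidence answers, so precision rises while recall and F-score drop.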

Active Learning

1. Obtain annotated data

2. Train and evaluate the system

3. Select words with accuracy < 80%

4. Apply WSD to all not-yet-annotated tokens of the selected words

5. Select tokens of meanings with F-score < 80%

6. For each word meaning, rank all the tokens according to the combination (F-score) of:
   1) TiMBL confidence
   2) Distance to the nearest neighbour
7. Select the 50 top-ranked tokens per meaning to be manually reviewed in 2 weeks
8. Go to 1
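
The selection steps of the loop can be sketched as below. Several points are assumptions, since the slides do not spell them out: the "combination (F-score)" is read as the harmonic mean of the two signals, the distance is turned into a closeness value with 1 / (1 + distance), and tokens are ranked from highest to lowest combined score. All data and field names are hypothetical.

```python
def combined_score(token):
    """Assumed combination: harmonic mean of confidence and closeness."""
    closeness = 1.0 / (1.0 + token["distance"])  # assumption: maps distance into (0, 1]
    confidence = token["confidence"]
    if confidence + closeness == 0:
        return 0.0
    return 2 * confidence * closeness / (confidence + closeness)


def select_for_review(accuracy_per_word, fscore_per_sense, candidates,
                      threshold=0.80, top_n=50):
    """Pick the top_n ranked tokens per weak sense of each weak word."""
    weak_words = {w for w, acc in accuracy_per_word.items() if acc < threshold}
    weak_senses = {key for key, f in fscore_per_sense.items()
                   if key[0] in weak_words and f < threshold}
    per_sense = {}
    for token in candidates:
        key = (token["lemma"], token["sense"])
        if key in weak_senses:
            per_sense.setdefault(key, []).append(token)
    return {key: sorted(tokens, key=combined_score, reverse=True)[:top_n]
            for key, tokens in per_sense.items()}


accuracy_per_word = {"bank": 0.72, "huis": 0.91}
fscore_per_sense = {("bank", "bank_1"): 0.85, ("bank", "bank_2"): 0.60,
                    ("huis", "huis_1"): 0.70}
# Automatically tagged, not yet annotated tokens (hypothetical).
candidates = [
    {"token_id": "t1", "lemma": "bank", "sense": "bank_2", "confidence": 0.9, "distance": 0.0},
    {"token_id": "t2", "lemma": "bank", "sense": "bank_2", "confidence": 0.6, "distance": 1.0},
    {"token_id": "t3", "lemma": "bank", "sense": "bank_2", "confidence": 0.8, "distance": 0.25},
    {"token_id": "t4", "lemma": "bank", "sense": "bank_1", "confidence": 0.9, "distance": 0.0},
    {"token_id": "t5", "lemma": "huis", "sense": "huis_1", "confidence": 0.9, "distance": 0.0},
]

selection = select_for_review(accuracy_per_word, fscore_per_sense, candidates, top_n=2)
selected_ids = [t["token_id"] for t in selection[("bank", "bank_2")]]
```

Only the weak sense of the weak word yields candidates here; the tokens selected for it would then go to the annotators for manual review before the next training round.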

Future Work

Fine-tune the active learning
Optimize the WSD systems
Combine different WSD systems
Test on independent texts in an all-words task
Apply the optimal system to the full corpora (over 500M tokens)

Thanks to

Anneleen Schoen

Charlotte van Tongeren

Daphne van Kessel

Dieke Janssen

Elizabeth van Zutphen

Gratia Bruining

Jonica Kaagman

Laura Kipp

Lisanne Ranzijn

Marlisa Hommel

Wilma van Velzen

Milou Kerkhof

Sam Vossen

Niqee Vossen

Rosa Scheffer

Chantal van Son
