Topic Modeling and WSD on the Ancora Corpus

Ruben Izquierdo, Marten Postma, Piek Vossen. LDA & WSD. SEPLN2015, Alicante.

Upload: ruben-izquierdo-bevia

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Topic modeling and WSD on the Ancora corpus

Ruben Izquierdo. LDA & WSD. SEPLN2015, Alicante.

Topic Modeling and WSD on the Ancora Corpus

Ruben Izquierdo
Marten Postma
Piek Vossen

Page 2: Topic modeling and WSD on the Ancora corpus

Outline
1. Starting Point
2. Motivation
3. Our Approach
4. Evaluation Framework
5. Experiments and Results
6. Conclusions

Page 3: Topic modeling and WSD on the Ancora corpus

Starting point
• "Understanding languages by machines" project
• Starts from the results of DutchSemCor (WSD)
• Analyse the real problems of WSD
• Understand the WSD task

Word, Meaning, Context

Page 4: Topic modeling and WSD on the Ancora corpus

Outline: 2. Motivation

Page 5: Topic modeling and WSD on the Ancora corpus

Still WSD?
• Word Sense Disambiguation is still unsolved
• Used in high-level applications
• Recently, some unsupervised approaches and SemEval tasks
  • BabelNet, Babelfy…
• Several reasons and problems

Page 6: Topic modeling and WSD on the Ancora corpus

WSD problems I
• Context is not considered properly
• Most are/were supervised approaches
  • Moving to unsupervised, graph-based…
• WSD as a black box
  • The larger the number of features, the better the performance?
  • The best and newest machine learning algorithm
• WSD is seen as only one problem
  • All words and cases are treated in the same way

Page 7: Topic modeling and WSD on the Ancora corpus

WSD problems II
• Error analysis of Senseval/SemEval systems [Postma et al., 2014]
  • Propagation errors (monosemous)
• Most Frequent Sense (MFS) bias
  • Supervised systems are skewed towards the MFS
  • Error analysis on WSD and Senseval/SemEval
    • Performance on MFS cases is good
    • Very poor performance on non-MFS cases

Page 8: Topic modeling and WSD on the Ancora corpus


WSD problems II

Page 9: Topic modeling and WSD on the Ancora corpus

WSD problems II
• Most Frequent Sense (MFS) bias
  • Supervised systems are skewed towards the MFS
  • Error analysis on WSD and Senseval/SemEval
    • Performance on MFS cases is good
    • Very poor performance on non-MFS cases
    • Systems assign the MFS in almost every case
  • Sval2: 799 cases where the correct sense is not the MFS
    • 84% of the systems still assign the MFS
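The MFS-bias measurement on this slide can be illustrated with a small sketch. It assumes per-instance gold and system labels are available; the function name `mfs_bias` and the toy data are hypothetical, and the MFS is estimated here from the gold labels themselves, which may differ from how the original error analysis defined it.

```python
from collections import Counter

def mfs_bias(gold, predicted, lemmas):
    """Among instances whose gold sense is NOT the lemma's MFS,
    return the fraction for which the system still outputs the MFS."""
    # Estimate the most frequent sense per lemma from the gold labels (assumption).
    counts = {}
    for lemma, sense in zip(lemmas, gold):
        counts.setdefault(lemma, Counter())[sense] += 1
    mfs = {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

    # Keep only the instances where the correct sense is not the MFS.
    non_mfs = [(p, l) for g, p, l in zip(gold, predicted, lemmas) if g != mfs[l]]
    if not non_mfs:
        return 0.0
    still_mfs = sum(1 for p, l in non_mfs if p == mfs[l])
    return still_mfs / len(non_mfs)

# Toy data: one lemma, five instances, two of them non-MFS.
lemmas = ["banco"] * 5
gold = ["s1", "s1", "s1", "s2", "s3"]
predicted = ["s1", "s1", "s1", "s1", "s3"]
bias = mfs_bias(gold, predicted, lemmas)
```

In the toy data, one of the two non-MFS instances is still labelled with the MFS, so `bias` is 0.5.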

Page 10: Topic modeling and WSD on the Ancora corpus

Outline: 3. Our Approach

Page 11: Topic modeling and WSD on the Ancora corpus

Main idea
• WSD considered as two different problems
  • When the MFS applies
    • More general usages
    • Larger contexts ??
  • Rest of the senses
    • More concrete usages
    • Shorter contexts ??
• Specialized classifiers for each case
  • Different features, parameters, contexts…
• Evaluation for Spanish
  • Sense-annotated corpus Ancora

Page 12: Topic modeling and WSD on the Ancora corpus

Our approach
• TRAINING. Use Topic Modeling (LDA) to induce word expert classifiers
  • For the Most Frequent Sense (BINARY)
    • Topics for the MFS case
    • Topics for non-MFS cases
  • For the rest of senses (non-MFS) (MULTICLASS)
    • Topics for every sense
• CLASSIFICATION. Apply the 2 classifiers in cascade to decide the sense in every case

Page 13: Topic modeling and WSD on the Ancora corpus


Training

Page 14: Topic modeling and WSD on the Ancora corpus


Classification
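The cascade described for classification can be sketched as two pluggable steps: a binary decision on whether the MFS applies, followed by a multiclass decision otherwise. This is a minimal sketch; the function `cascade_classify`, the sense labels, and the keyword-based stand-in classifiers are hypothetical and only take the place of the LDA-based classifiers.

```python
from typing import Callable, Sequence

Tokens = Sequence[str]

def cascade_classify(
    context: Tokens,
    mfs_label: str,
    is_mfs: Callable[[Tokens], bool],
    pick_sense: Callable[[Tokens], str],
) -> str:
    """Step 1 (binary): does the MFS apply? Step 2 (multiclass): otherwise pick a sense."""
    if is_mfs(context):
        return mfs_label
    return pick_sense(context)

# Stand-in classifiers for illustration only (keyword rules, not LDA).
def toy_is_mfs(context: Tokens) -> bool:
    return "dinero" in context

def toy_pick_sense(context: Tokens) -> str:
    return "banco.asiento" if "parque" in context else "banco.rio"

first = cascade_classify(["banco", "dinero"], "banco.finanzas", toy_is_mfs, toy_pick_sense)
second = cascade_classify(["banco", "parque"], "banco.finanzas", toy_is_mfs, toy_pick_sense)
```

Keeping the two steps as separate callables makes it easy to give each one its own context size and number of topics.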


Page 15: Topic modeling and WSD on the Ancora corpus

Outline: 4. Evaluation Framework

Page 16: Topic modeling and WSD on the Ancora corpus

Evaluation framework
• Ancora corpus
  • News articles, Spanish part, 500K words, sense annotated (nouns)
  • Converted to NAF format
• 3-fold cross-validation
  • Keeping sense distribution
• 7119 unique lemmas annotated with nominal senses
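One simple way to build folds that keep the sense distribution is to deal the instances of each sense round-robin across the folds. The sketch below illustrates that idea for a single lemma; the function name `stratified_folds` and the toy labels are hypothetical, and the slides do not state how the folds were actually built.

```python
from collections import defaultdict

def stratified_folds(senses, n_folds=3):
    """Return a fold index per instance, dealing each sense's instances round-robin."""
    by_sense = defaultdict(list)
    for idx, sense in enumerate(senses):
        by_sense[sense].append(idx)

    fold_of = [0] * len(senses)
    for indices in by_sense.values():
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of

# Toy lemma with six instances of one sense and three of another.
senses = ["s1"] * 6 + ["s2"] * 3
folds = stratified_folds(senses)
```

With this scheme, any sense that has at least as many instances as there are folds appears in every fold.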

Page 17: Topic modeling and WSD on the Ancora corpus

Evaluation framework
• Ancora corpus
  • Spanish part, 500K words, sense annotated (nouns)
• 3-fold cross-validation
  • Keeping sense distribution
• 7119 unique lemmas annotated
  • 4907 are monosemous (69%)
  • 2212 are polysemous (31%)
    • 589 with at least 3 instances per sense (from the annotated)
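The lemma selection on this slide can be sketched as a filter over (lemma, sense) annotations. The sketch assumes a lemma counts as monosemous when only one sense is observed in the annotations, which may differ from the lexicon-based definition used for Ancora; `select_lemmas` and the toy annotations are hypothetical.

```python
from collections import Counter

def select_lemmas(annotations, min_per_sense=3):
    """annotations: iterable of (lemma, sense) pairs.
    Returns (monosemous, polysemous, selected) sets of lemmas."""
    senses = {}
    for lemma, sense in annotations:
        senses.setdefault(lemma, Counter())[sense] += 1

    monosemous = {l for l, c in senses.items() if len(c) == 1}
    polysemous = set(senses) - monosemous
    # Keep polysemous lemmas where every sense has enough annotated instances.
    selected = {l for l in polysemous if min(senses[l].values()) >= min_per_sense}
    return monosemous, polysemous, selected

annotations = (
    [("año", "s1")] * 4
    + [("mes", "s1")] * 3 + [("mes", "s2")] * 3
    + [("hora", "s1")] * 5 + [("hora", "s2")] * 1
)
mono, poly, selected = select_lemmas(annotations)

# Shares reported on the slide, recomputed from the raw counts.
share_mono = round(100 * 4907 / 7119)
share_poly = round(100 * 2212 / 7119)
```

In the toy data, "hora" is polysemous but dropped because one of its senses has a single instance.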

Page 18: Topic modeling and WSD on the Ancora corpus

Evaluation framework
• Ancora corpus
  • Spanish part, 500K words, sense annotated (nouns)
• 3-fold cross-validation
  • Keeping sense distribution
• 7119 unique lemmas annotated

[Figure: Number of lemmas vs. polysemy. x-axis: polysemy, 2 to 12; y-axis: number of lemmas, 0 to 1400]

Page 19: Topic modeling and WSD on the Ancora corpus

Baseline Results
For the 589 selected lemmas

Baseline       Accuracy
Random         40.10
MFS overall    67.68
MFS folded     68.63
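The baselines in this table can be approximated with a few lines of code. This sketch rests on an assumed reading of the labels: "Random" is the expected accuracy of picking uniformly among a lemma's observed senses, "MFS overall" takes the MFS from the whole corpus, and "MFS folded" takes it from the training folds only. The function names and toy data are hypothetical.

```python
from collections import Counter

def random_baseline(instances):
    """Expected accuracy of picking uniformly among each lemma's observed senses."""
    senses = {}
    for lemma, sense in instances:
        senses.setdefault(lemma, set()).add(sense)
    return sum(1 / len(senses[lemma]) for lemma, _ in instances) / len(instances)

def mfs_baseline(train, test):
    """Accuracy of always assigning each lemma's most frequent training sense."""
    counts = {}
    for lemma, sense in train:
        counts.setdefault(lemma, Counter())[sense] += 1
    mfs = {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}
    hits = sum(1 for lemma, sense in test if mfs.get(lemma) == sense)
    return hits / len(test)

# Toy data: one lemma, three instances of s1 and one of s2.
data = [("mes", "s1")] * 3 + [("mes", "s2")]
random_acc = random_baseline(data)
mfs_overall = mfs_baseline(data, data)          # MFS taken from all the data
mfs_folded = mfs_baseline(data[:2], data[2:])   # MFS taken from a training split only
```

Passing the same data as `train` and `test` corresponds to the "overall" setting; passing separate splits corresponds to the "folded" setting.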

Page 20: Topic modeling and WSD on the Ancora corpus

Outline: 5. Experiments and Results

Page 21: Topic modeling and WSD on the Ancora corpus

Experimentation
• Configuration of our cascade classifiers
  • Only one step, with the senseLDA classifier
  • 2 steps, mfsLDA with perfect performance + senseLDA
  • 2 steps, mfsLDA and senseLDA both induced automatically
• LDA parameters (Python gensim library)
  • Context size (number of sentences)
  • Number of topics for LDA
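The context-size parameter can be pictured as a window of sentences around the target. The sketch below assumes a symmetric window in which size 0 means the target sentence only; the slides give the sizes (0, 3, 5, 50) but not how the window is laid out, so `context_window` is a hypothetical helper.

```python
def context_window(sentences, target_index, size):
    """Tokens of the target sentence plus `size` sentences on each side.

    sentences: list of tokenised sentences (lists of strings).
    size: number of neighbouring sentences on each side; 0 keeps only the target.
    """
    start = max(0, target_index - size)
    end = min(len(sentences), target_index + size + 1)
    return [token for sentence in sentences[start:end] for token in sentence]

# Toy document of four tokenised sentences.
document = [["a"], ["b", "c"], ["d"], ["e"]]
local_only = context_window(document, 1, 0)
one_each_side = context_window(document, 1, 1)
whole_doc = context_window(document, 0, 50)
```

The resulting token list would be what is converted to a bag of words before being passed to the topic model.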

Page 22: Topic modeling and WSD on the Ancora corpus

Results I
One-step classification
Diagram: Instance Example, Sense LDA (all senses), Word Sense

Sentences      Topics   Accuracy
MFS baseline            68.63
0              3        67.54
0              10       65.56
0              100      58.34
3              3        66.30
3              10       64.62
3              100      60.07
50             3        66.04
50             10       63.42
50             100      59.06

• MFS not reached
• Most informative clues in small contexts
• More topics, lower performance

Page 23: Topic modeling and WSD on the Ancora corpus

Results II
Two steps, MFS classifier 100% performance
Diagram: Instance Example, MFS (100% accuracy), Sense LDA (all senses), Word Sense

Sentences      Topics   Accuracy
MFS baseline            68.63
0              3        92.48
0              10       92.12
0              100      90.50
3              3        92.45
3              10       92.11
3              100      91.60
50             3        92.41
50             10       92.12
50             100      91.43

• Extremely high figures
• Good performance of the senseLDA classifier (when no MFS)
• Similar behaviour w.r.t. #sents and #topics

Page 24: Topic modeling and WSD on the Ancora corpus

Results III
Two steps, MFS classifier #S=5
Diagram: Instance Example, MFS (s5), Sense LDA (all senses), Word Sense

Sents          Topics   Acc. MFS T100   Acc. MFS T1000
MFS baseline            68.63
0              3        74.53           66.73
0              10       74.00           66.41
0              100      72.61           64.91
3              3        74.30           66.61
3              10       73.87           66.36
3              100      73.39           65.76
50             3        74.26           66.48
50             10       73.90           66.24
50             100      73.53           65.75

• MFS s5 t100
• Smaller contexts for non-MFS cases (3, 50 included by 0)
• 3 topics is the best

Page 25: Topic modeling and WSD on the Ancora corpus

Results IV
Two steps, MFS classifier #S=50
Diagram: Instance Example, MFS (s50), Sense LDA (all senses), Word Sense

Sents          Topics   Acc. MFS T100   Acc. MFS T1000
MFS baseline            68.63
0              3        73.34           67.15
0              10       72.92           66.76
0              100      71.43           65.13
3              3        73.21           67.02
3              10       72.88           66.60
3              100      72.40           66.24
50             3        73.21           66.95
50             10       72.83           66.58
50             100      72.15           66.20

• Similar behaviour compared to MFS_s5
• Slightly lower results

Page 26: Topic modeling and WSD on the Ancora corpus

Lemma comparison
Most frequent lemmas

Lemma        MFS (68.63)   LDA (74.53)   Variation   Annotations
año          89.15         91.19           2.04      1275
país         72.29         83.55          11.26       695
presidente   70.31         73.94           3.63       690
partido      55.87         64.48           8.61       641
equipo       98.32         98.88           0.56       539
mes          54.29         80.00          25.71       315
hora         61.39         56.11          -5.28       305
caso         61.05         91.58          30.53       286
mundo        47.31         40.14          -7.17       279
semana       85.06         92.34           7.28       263

Page 27: Topic modeling and WSD on the Ancora corpus

Outline: 6. Conclusions

Page 28: Topic modeling and WSD on the Ancora corpus

Conclusions
• Simple approach based on LDA for WSD in Spanish
• Two-step classification approach for WSD improves the results for Spanish (about 6 points: from the 68.63 MFS baseline to 74.53)
• Different nature of both cases
  • MFS: contexts of 5 sentences, 100 topics
  • Non-MFS: context of the local sentence only, 3 topics
• All code and data publicly available on GitHub (group policy)
  http://github.com/rubenIzquierdo/lda_wsd


Page 31: Topic modeling and WSD on the Ancora corpus

Thank you

Ruben Izquierdo
Marten Postma
Piek Vossen

email: [email protected]
http://github.com/rubenIzquierdo/lda_wsd
http://rubenizquierdobevia.com