Page 1: Representing  Meaning in Unsupervised Word Sense Disambiguation

1

Representing Meaning in Unsupervised Word Sense Disambiguation

Bridget T. McInnes

5 September 2008

University of Minnesota Twin Cities

Page 2: Representing  Meaning in Unsupervised Word Sense Disambiguation

2

What is WSD?

The culture count doubled.

Culture

Laboratory Culture

Anthropological Culture

Sense Inventory

Page 3: Representing  Meaning in Unsupervised Word Sense Disambiguation

3

Approaches to WSD

Supervised

Advantages: obtains high accuracy

Disadvantages: manually annotated training data is required for each word to be disambiguated, so the approach does not scale

Unsupervised

Advantages: does not require manually annotated training data

Disadvantages: generally does not reach the accuracy of supervised approaches

Page 4: Representing  Meaning in Unsupervised Word Sense Disambiguation

4

Unsupervised Approaches

Similarity and Relatedness Based

Page 5: Representing  Meaning in Unsupervised Word Sense Disambiguation

5

Unsupervised Approaches

Similarity and Relatedness Based

Patwardhan, Banerjee and Pedersen 2005

Pedersen et al. 2006

Budanitsky and Hirst 2006

Page 6: Representing  Meaning in Unsupervised Word Sense Disambiguation

6

Unsupervised Approaches

Similarity and Relatedness based

Vector Based

Page 7: Representing  Meaning in Unsupervised Word Sense Disambiguation

7

Unsupervised Approaches

Similarity and Relatedness Based

Vector-based

Mohammad and Hirst, 2006

Patwardhan, 2003

Pedersen et al. 2006

Humphrey et al. 2006

Page 8: Representing  Meaning in Unsupervised Word Sense Disambiguation

8

Unsupervised Approaches

Similarity and Relatedness-based

Vector-based

Clustering

Page 9: Representing  Meaning in Unsupervised Word Sense Disambiguation

9

Unsupervised Approaches

Similarity and Relatedness based

Vector-based

Clustering

Pedersen and Bruce, 1997

Schütze, 1998

Pedersen and Bruce, 1998

Purandare and Pedersen, 2004

Kulkarni and Pedersen, 2005

Page 10: Representing  Meaning in Unsupervised Word Sense Disambiguation

10

Road Map

Previous Approaches

Our vector approach

Future Work

Page 11: Representing  Meaning in Unsupervised Word Sense Disambiguation

11

Previous Approaches

Similarity and Relatedness Based

SenseRelate (Banerjee and Pedersen, 2003)

Vector-based

Semantic Type Indexing (Humphrey et al 2006)

Clustering

SenseClusters (Kulkarni and Pedersen, 2005)

Page 12: Representing  Meaning in Unsupervised Word Sense Disambiguation

12

Banerjee and Pedersen 2003

Sense Relate

Page 13: Representing  Meaning in Unsupervised Word Sense Disambiguation

13

SenseRelate

Target Word: Transport

Concept 1: Biological Transport (C0005528)

Concept 2: Patient Transport (C0150390)

Transport of glutathione S-linked conjugates.

Context concepts (glutathione S-linked conjugates): C0017817, C0522529, C0301869

C0005528 = SS + SS + SS = Total SS for Concept 1

Page 14: Representing  Meaning in Unsupervised Word Sense Disambiguation

14

SenseRelate

Target Word: Transport

Concept 1: Biological Transport (C0005528)

Concept 2: Patient Transport (C0150390)

Transport of glutathione S-linked conjugates.

Context concepts (glutathione S-linked conjugates): C0017817, C0522529, C0301869

C0005528 = SS + SS + SS = Total SS for Concept 1

C0150390 = SS + SS + SS = Total SS for Concept 2
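The scoring on this slide can be sketched in a few lines of Python. This is only an illustration of the idea (sum the similarity scores between each candidate concept and the concepts found in the context, then keep the highest total); the function name `sense_relate`, the helper `toy_similarity`, and every score in `TOY_SCORES` are made up for the example and do not come from SenseRelate or from the talk.

```python
from typing import Callable, Dict, List, Tuple


def sense_relate(candidates: List[str],
                 context_concepts: List[str],
                 similarity: Callable[[str, str], float]) -> Tuple[str, Dict[str, float]]:
    """Return the candidate with the largest total similarity, plus all totals."""
    totals = {
        cand: sum(similarity(cand, ctx) for ctx in context_concepts)
        for cand in candidates
    }
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical similarity scores, invented purely for illustration.
TOY_SCORES = {
    ("C0005528", "C0017817"): 0.8,
    ("C0005528", "C0522529"): 0.5,
    ("C0005528", "C0301869"): 0.6,
    ("C0150390", "C0017817"): 0.1,
    ("C0150390", "C0522529"): 0.2,
    ("C0150390", "C0301869"): 0.1,
}


def toy_similarity(a: str, b: str) -> float:
    """Look up a made-up score; unknown pairs count as 0."""
    return TOY_SCORES.get((a, b), 0.0)


best, totals = sense_relate(
    ["C0005528", "C0150390"],
    ["C0017817", "C0522529", "C0301869"],
    toy_similarity,
)
print(best, totals)
```

In practice the `similarity` argument would be replaced by a real similarity or relatedness measure; the selection step stays the same.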

Page 15: Representing  Meaning in Unsupervised Word Sense Disambiguation

15

Humphrey et al, 2006

Semantic Type Indexing for WSD

Page 16: Representing  Meaning in Unsupervised Word Sense Disambiguation

16

Semantic Type Indexing (STI)

Target Word: Transport

Concept 1: Biological Transport, Semantic type: Cell Function

Concept 2: Patient Transport, Semantic type: Health Care Activity

Transport of glutathione S-linked conjugates.

[Diagram: JDI produces a vector for the target word (TW) and for each concept (CV1, CV2). Cosine 1 compares the Target Word Vector with the Concept 1 Vector; Cosine 2 compares it with the Concept 2 Vector.]

Page 17: Representing  Meaning in Unsupervised Word Sense Disambiguation

17

Target Word Vector

Transport of glutathione S-linked conjugates.

Contains the words surrounding the ambiguous word

Page 18: Representing  Meaning in Unsupervised Word Sense Disambiguation

18

STI - Target Word Vectors

Transport of glutathione S-linked conjugates.

Contains the words surrounding the ambiguous word

Page 19: Representing  Meaning in Unsupervised Word Sense Disambiguation

19

STI - Concept Vectors

The concept vectors are created based on their semantic type(s)

Transport:

C0005528: Biological Transport. Semantic type: Cell Function. One-word terms in the Metathesaurus associated with Cell Function

C0150390: Patient Transport. Semantic type: Health Care Activity. One-word terms in the Metathesaurus associated with Health Care Activity

Page 20: Representing  Meaning in Unsupervised Word Sense Disambiguation

20

Kulkarni and Pedersen, 2005

SenseClusters

Page 21: Representing  Meaning in Unsupervised Word Sense Disambiguation

21

SenseClusters (SC)

Target Word: Transport

Concept 1: Biological Transport

Concept 2: Patient Transport

Transport of glutathione S-linked conjugates.

[Diagram: Instance 1, Instance 2, …, Instance 13, … are grouped into two clusters, Concept 1 and Concept 2.]

Page 22: Representing  Meaning in Unsupervised Word Sense Disambiguation

22

SenseClusters (SC)

Target Word: Transport

Concept 1: Biological Transport

Concept 2: Patient Transport

Transport of glutathione S-linked conjugates.

[Diagram: Instance 1, Instance 2, …, Instance 13, … are grouped into two clusters, Concept 1 and Concept 2.]

Page 23: Representing  Meaning in Unsupervised Word Sense Disambiguation

23

Sense Clusters

Concept 2 Vector

Concept 1 Vector

Target Word Vector

Cosine 2

Cosine 1

Target Word: Transport

Concept 1: Biological Transport

Concept 2: Patient Transport

Transport of glutathione S-linked conjugates.

Page 24: Representing  Meaning in Unsupervised Word Sense Disambiguation

24

SC - Vectors

Contain the words surrounding the ambiguous word

Created using:

First order co-occurrences

Second order co-occurrences

Page 25: Representing  Meaning in Unsupervised Word Sense Disambiguation

25

First Order Co-occurrence Vectors

                 Word 1   Word 2   …   Word N
glutathione          50        6   …        5
S-linked              5        6   …        1
conjugates            5        0   …       15
Target Vector        20        4   …        7
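The numbers on this slide are consistent with the target vector being the column-wise average of the rows for the three context words (for example, (50 + 5 + 5) / 3 = 20). A minimal Python sketch of that reading follows; the assignment of values to rows and columns is reconstructed from the slide layout and is an assumption, and `average_vector` is a name chosen for this example.

```python
from typing import Dict, List


def average_vector(rows: List[List[float]]) -> List[float]:
    """Column-wise average of equally long rows."""
    if not rows:
        raise ValueError("need at least one row")
    return [sum(column) / len(rows) for column in zip(*rows)]


# One row per context word; columns stand for Word 1, Word 2, Word N.
# The row/column assignment is an assumed reading of the slide.
context_rows: Dict[str, List[float]] = {
    "glutathione": [50, 6, 5],
    "S-linked": [5, 6, 1],
    "conjugates": [5, 0, 15],
}

target_vector = average_vector(list(context_rows.values()))
print(target_vector)  # [20.0, 4.0, 7.0]
```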

Page 26: Representing  Meaning in Unsupervised Word Sense Disambiguation

26

Second Order Co-occurrence Vectors

1st order vector for glutathione: 10 30 0

Word-by-word co-occurrence matrix:

          Word 1   Word 2   …   Word N
Word 1        20       10   …        0
Word 2        10        0   …        0
…
Word N         2       50   …        2

2nd order vector for glutathione: 0 2 2 …

Page 27: Representing  Meaning in Unsupervised Word Sense Disambiguation

27

Second Order Co-occurrence Vectors

                 Word 1   Word 2   …   Word N
glutathione          10       30   …        2
S-linked              0        6   …        0
conjugates            5        0   …       13
Target Vector         5       13   …        5
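One common way to build second-order vectors is sketched below: count word-by-word co-occurrences in a training corpus, then represent a context by averaging the co-occurrence rows of the words it contains. This is a hedged illustration rather than the SenseClusters implementation; the window size, the toy corpus, and the function names are all assumptions made for the example.

```python
from typing import Dict, List


def cooccurrence_counts(sentences: List[List[str]],
                        window: int = 1) -> Dict[str, Dict[str, int]]:
    """Count how often each word appears within `window` tokens of another."""
    counts: Dict[str, Dict[str, int]] = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            row = counts.setdefault(word, {})
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    row[tokens[j]] = row.get(tokens[j], 0) + 1
    return counts


def second_order_vector(context: List[str],
                        counts: Dict[str, Dict[str, int]],
                        vocab: List[str]) -> List[float]:
    """Average the co-occurrence rows of the context words seen in training."""
    rows = [[counts[w].get(v, 0) for v in vocab] for w in context if w in counts]
    if not rows:
        return [0.0] * len(vocab)
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy training corpus, invented for illustration.
corpus = [
    ["glutathione", "conjugates", "transport"],
    ["glutathione", "transport", "cell"],
]
counts = cooccurrence_counts(corpus, window=1)
vocab = sorted(counts)  # ['cell', 'conjugates', 'glutathione', 'transport']
vector = second_order_vector(["glutathione", "conjugates"], counts, vocab)
print(vocab, vector)
```

Because each context word contributes its whole co-occurrence row, two contexts can look similar even when they share no words directly.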

Page 28: Representing  Meaning in Unsupervised Word Sense Disambiguation

28

Our unsupervised approach

Page 29: Representing  Meaning in Unsupervised Word Sense Disambiguation

29

CuiTools Approach

Our approach uses a general vector approach with SenseClusters vectors

Page 30: Representing  Meaning in Unsupervised Word Sense Disambiguation

30

CuiTools

Concept 2 Vector

Concept 1 Vector

Target Word Vector

Cosine 2

Cosine 1

Target Word: Transport

Concept 1: Biological Transport (C0005528)

Concept 2: Patient Transport (C0150390)

Transport of glutathione S-linked conjugates.
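The comparison pictured here (one cosine per candidate concept, highest cosine wins) can be sketched as follows. The vectors and the names `cosine` and `disambiguate` are invented for illustration and are not taken from CuiTools.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(target_vector: List[float],
                 concept_vectors: Dict[str, List[float]]) -> str:
    """Return the concept whose vector has the highest cosine with the target vector."""
    return max(concept_vectors,
               key=lambda cui: cosine(target_vector, concept_vectors[cui]))


# Hypothetical vectors over the same three features.
target = [20.0, 4.0, 7.0]
concepts = {
    "C0005528": [18.0, 5.0, 6.0],  # Biological Transport (made-up values)
    "C0150390": [1.0, 9.0, 2.0],   # Patient Transport (made-up values)
}
chosen = disambiguate(target, concepts)
print(chosen)
```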

Page 31: Representing  Meaning in Unsupervised Word Sense Disambiguation

31

CuiTools Approach

We explore using

First-order co-occurrence vectors

Second-order co-occurrence vectors

Our approach uses a general vector approach with SenseCluster vectors

Page 32: Representing  Meaning in Unsupervised Word Sense Disambiguation

32

Target Word Vector

Contains the words surrounding the ambiguous word

Transport of glutathione S-linked conjugates.

Page 33: Representing  Meaning in Unsupervised Word Sense Disambiguation

33

CuiTools - Concept Vectors

How can we create a vector that represents the meaning of a concept for word sense disambiguation?

Page 34: Representing  Meaning in Unsupervised Word Sense Disambiguation

34

To answer this question

We explore information in the UMLS that can be used to represent the meaning of a concept.

Page 35: Representing  Meaning in Unsupervised Word Sense Disambiguation

35

CuiTools - Concept Vectors

Adjustment

CUI: definition

Individual Adjustment: Conceptually broad term referring to a state of harmony between internal needs and external …

Adjustment Action: The act of making necessary corrections or modifications …

Psychological Adjustment: A state of harmony between internal needs and external demands and the processes used …

Page 36: Representing  Meaning in Unsupervised Word Sense Disambiguation

36

CuiTools - Concept Vectors

Blood Pressure

CUI: definition

Blood Pressure: Force exerted by the blood on the walls of the arteries and other vessels.

Blood Pressure Determination: Actions performed to measure the diastolic and systolic pressure of the blood.

Arterial Pressure: NO DEFINITION

Page 37: Representing  Meaning in Unsupervised Word Sense Disambiguation

37

CuiTools - Concept Vectors

CUI definition

If the CUI definition doesn't exist, use:

PARent definition

Semantic Type definition

SYNonymous terms

For example: C0430400: Laboratory Culture

laboratory culture, microbial culture, sample culture
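One reading of the fallback on this slide is: use the CUI's own definition when there is one, and otherwise fall back to the definitions of its parents and semantic types. The sketch below follows that reading only; the dictionaries, placeholder identifiers, and placeholder texts are hypothetical and are not real UMLS content or a real UMLS API.

```python
from typing import Dict, List


def concept_text(cui: str,
                 definitions: Dict[str, str],
                 parents: Dict[str, List[str]],
                 semantic_types: Dict[str, List[str]],
                 type_definitions: Dict[str, str]) -> str:
    """Return the CUI definition, or parent and semantic type definitions if it is missing."""
    own = definitions.get(cui)
    if own:
        return own
    # Fallback: collect whatever parent and semantic type definitions exist.
    parts = [definitions[p] for p in parents.get(cui, []) if definitions.get(p)]
    parts += [type_definitions[t] for t in semantic_types.get(cui, [])
              if t in type_definitions]
    return " ".join(parts)


# Placeholder data: identifiers and texts are invented for illustration.
definitions = {"CUI_PARENT": "placeholder parent definition"}
parents = {"CUI_NO_DEF": ["CUI_PARENT"]}
semantic_types = {"CUI_NO_DEF": ["TYPE_X"]}
type_definitions = {"TYPE_X": "placeholder semantic type definition"}

text = concept_text("CUI_NO_DEF", definitions, parents, semantic_types, type_definitions)
print(text)
```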

Page 38: Representing  Meaning in Unsupervised Word Sense Disambiguation

38

CuiTools - Concept Vectors

CUI definition

If the CUI definition doesn't exist, use:

PARent definition

Semantic Type definition

SYNonymous terms

SIBlings

For example: C0010453: Anthropological Culture

archeology, family, social groups

Page 39: Representing  Meaning in Unsupervised Word Sense Disambiguation

39

CuiTools - Concept Vectors

CUI definition

If the CUI definition doesn't exist, use:

PARent definition

Semantic Type definition

SYNonymous terms

SIBlings

TOP 50 most frequent words surrounding the terms associated with the CUI
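The last representation, the most frequent words surrounding a concept's terms, can be sketched with a simple counter. This is an illustration under stated assumptions: terms are treated as single tokens, the window size is arbitrary, and the function name and toy sentences are invented here rather than taken from CuiTools.

```python
from collections import Counter
from typing import Iterable, List


def top_context_words(terms: Iterable[str],
                      sentences: List[List[str]],
                      window: int = 2,
                      top_n: int = 50) -> List[str]:
    """Return the `top_n` most frequent words found within `window` tokens of any term."""
    term_set = {t.lower() for t in terms}
    counts: Counter = Counter()
    for tokens in sentences:
        lowered = [t.lower() for t in tokens]
        for i, token in enumerate(lowered):
            if token not in term_set:
                continue
            lo, hi = max(0, i - window), min(len(lowered), i + window + 1)
            for j in range(lo, hi):
                # Skip the term itself and any other term of the same concept.
                if j != i and lowered[j] not in term_set:
                    counts[lowered[j]] += 1
    return [word for word, _ in counts.most_common(top_n)]


# Toy sentences, invented for illustration.
toy_sentences = [
    ["the", "culture", "grew", "overnight"],
    ["culture", "grew", "slowly"],
]
words = top_context_words(["culture"], toy_sentences, window=1, top_n=1)
print(words)  # ['grew']
```

Multi-word terms such as "laboratory culture" would need phrase matching before counting, which this sketch leaves out.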

Page 40: Representing  Meaning in Unsupervised Word Sense Disambiguation

40

Dataset

National Library of Medicine's Word Sense Disambiguation (NLM-WSD) Dataset

50 words from the 1998 MEDLINE abstracts

100 instances for each of the 50 words

The target word was manually assigned a UMLS concept or None

All instances of None were removed

Average number of concepts per ambiguous word is 2.26

Page 41: Representing  Meaning in Unsupervised Word Sense Disambiguation

41

Data subsets

Humphrey subset

Humphrey, et al 2006

45 out of the 50 words in NLM-WSD

5 words were excluded because at least two of the possible concepts associated with these words have the same semantic type

Instances that were assigned “None” were removed

Page 42: Representing  Meaning in Unsupervised Word Sense Disambiguation

42

Training Data

The 1st and 2nd order co-occurrence vectors were created from the 2005 MEDLINE baseline.

Page 43: Representing  Meaning in Unsupervised Word Sense Disambiguation

43

Results

Page 44: Representing  Meaning in Unsupervised Word Sense Disambiguation

Results

Page 45: Representing  Meaning in Unsupervised Word Sense Disambiguation

45

Results of Co-occurrence Vectors

Page 46: Representing  Meaning in Unsupervised Word Sense Disambiguation

46

Results of the Representations of Meaning

Page 47: Representing  Meaning in Unsupervised Word Sense Disambiguation

47

Results of the Representations of Meaning - CUI

Adding the parent and semantic type definitions decreased the accuracy by 6 and 7 percentage points, respectively

Parent and semantic type definitions are too broad to define the meaning of a concept

Page 48: Representing  Meaning in Unsupervised Word Sense Disambiguation

48

Results of the Representations of Meaning - SYN

Using the synonymous terms associated with a concept is too narrow to represent the meaning.

Adjustment Action

Adjustment – action

Adjustments

Adjustment, NOS

Adjustment – action qualifier value

Adjustment – action procedure

Page 49: Representing  Meaning in Unsupervised Word Sense Disambiguation

49

Results of the Representations of Meaning - SIB

Using the terms associated with the siblings of a concept is too broad to represent the meaning.

Adjustment Action

Biopsy

Cauterisation

Cautery

Cold Therapy

Desiccation

Drainage procedure

Electrolysis

Page 50: Representing  Meaning in Unsupervised Word Sense Disambiguation

50

Results of the Representations of Meaning

Page 51: Representing  Meaning in Unsupervised Word Sense Disambiguation

51

Supervised versus Unsupervised

[Chart comparing: Joshi et al. 04, McInnes et al. 07, Stevenson et al. 08, SenseClusters, Humphrey et al. 06, CuiTools]

Page 52: Representing  Meaning in Unsupervised Word Sense Disambiguation

52

To recap

How can we create a vector that represents the meaning of a concept for word sense disambiguation?

Page 53: Representing  Meaning in Unsupervised Word Sense Disambiguation

53

Conclusions

To answer this we explored information in the UMLS that could be used to represent the meaning of a concept

Finding a context to represent the meaning of a concept is difficult

We found using the top 50 most frequent words surrounding the terms associated with the concept best represented the concept for the task of word sense disambiguation

Page 54: Representing  Meaning in Unsupervised Word Sense Disambiguation

54

Take away message

Unsupervised approaches are showing promise

Their main disadvantage, lower disambiguation accuracy than supervised approaches, is slowly disappearing

But we are not there yet … so there is more work to do

Page 55: Representing  Meaning in Unsupervised Word Sense Disambiguation

55

Future Work

UMLS-Similarity package

Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors

Page 56: Representing  Meaning in Unsupervised Word Sense Disambiguation

56

First Order Co-occurrence Vectors

Cell values: FREQ (glutathione, word N). Target Vector: Average

                 Word 1   Word 2   …   Word N
glutathione          50        6   …        5
S-linked              5        6   …        1
conjugates            5        0   …       15
Target Vector        20        4   …        7

Page 57: Representing  Meaning in Unsupervised Word Sense Disambiguation

57

First Order Co-occurrence Vectors

Cell values: Similarity (glutathione, word N). Target Vector: Average

                 Word 1   Word 2   …   Word N
glutathione          .5       .6   …       .5
S-linked             .5       .6   …       .1
conjugates           .5        0   …      .15
Target Vector       .75       .6   …      .25

Page 58: Representing  Meaning in Unsupervised Word Sense Disambiguation

58

First Order Co-occurrence Vectors

Cell values: Similarity (glutathione, word N). Target Vector: Sum (like SenseRelate)

                 Word 1   Word 2   …   Word N
glutathione          .5       .6   …       .5
S-linked             .5       .6   …       .1
conjugates           .5        0   …      .15
Target Vector       1.5      1.2   …      .75
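The "Sum" variant can be sketched as follows: each cell holds a similarity score between a context word and a feature word, and the target vector adds the scores per feature. The scores below are copied from the slide, but their assignment to particular context words and features is an assumed reading of the layout, and `similarity_target_vector` is a name chosen for this example.

```python
from typing import Callable, Dict, List, Tuple


def similarity_target_vector(context_words: List[str],
                             feature_words: List[str],
                             similarity: Callable[[str, str], float]) -> List[float]:
    """For each feature word, sum its similarity to every context word."""
    return [sum(similarity(c, f) for c in context_words) for f in feature_words]


# Scores as read off the slide; the row/column assignment is an assumption.
SLIDE_SCORES: Dict[Tuple[str, str], float] = {
    ("glutathione", "Word 1"): 0.5, ("glutathione", "Word 2"): 0.6, ("glutathione", "Word N"): 0.5,
    ("S-linked", "Word 1"): 0.5, ("S-linked", "Word 2"): 0.6, ("S-linked", "Word N"): 0.1,
    ("conjugates", "Word 1"): 0.5, ("conjugates", "Word 2"): 0.0, ("conjugates", "Word N"): 0.15,
}


def slide_similarity(context_word: str, feature_word: str) -> float:
    """Look up a score from the table above; missing pairs count as 0."""
    return SLIDE_SCORES.get((context_word, feature_word), 0.0)


summed = similarity_target_vector(
    ["glutathione", "S-linked", "conjugates"],
    ["Word 1", "Word 2", "Word N"],
    slide_similarity,
)
print(summed)  # approximately [1.5, 1.2, 0.75]
```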

Page 59: Representing  Meaning in Unsupervised Word Sense Disambiguation

59

First Order Co-occurrences

                 Word 1   Word 2   …   Word N
glutathione          .5       .6   …       .5

Word N concepts: C0000000, C0000001

Compared with: C0005528

Similarity = .3 + .2 = .5

Page 60: Representing  Meaning in Unsupervised Word Sense Disambiguation

60

Future Work

UMLS-Similarity package

Creating 2nd order co-occurrence matrices based on highly similar concepts rather than words in text

Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors

Page 61: Representing  Meaning in Unsupervised Word Sense Disambiguation

61

Second Order Co-occurrence Vectors

Words come from training corpus

Frequency counts

          Word 1   Word 2   …   Word N
Word 1        20       10   …        0
Word 2        10        0   …        0
…
Word N         2       50   …        2

Page 62: Representing  Meaning in Unsupervised Word Sense Disambiguation

62

Second Order Co-occurrence Vectors

Use concepts from the UMLS

Similarity scores

          CUI 1    CUI 2    …   CUI N
CUI 1       .20      .10    …       0
CUI 2       .10        0    …       0
…
CUI N       .20      .50    …     .20

Page 63: Representing  Meaning in Unsupervised Word Sense Disambiguation

63

Future Work

UMLS-Similarity package

Creating 2nd order co-occurrence matrices based on highly similar concepts rather than co-occurrences in text

Use terms associated with CUIs that have a high similarity score with the possible concept to represent the meaning of the concept

Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors

Page 64: Representing  Meaning in Unsupervised Word Sense Disambiguation

64

Similarity Scores

What is potentially gained by using the similarity or relatedness measures

May catch words/concepts that are similar but do not frequently occur together in the training data

culture and ethnology

Ethnology is a branch of anthropology

ethnology appears with culture only five times in the training data

The concepts Anthropological Culture and Ethnology would have a high similarity score, whereas Laboratory Culture and Ethnology would not

Page 65: Representing  Meaning in Unsupervised Word Sense Disambiguation

65

Software

CuiTools version 0.19

http://cuitools.sourceforge.net

Page 66: Representing  Meaning in Unsupervised Word Sense Disambiguation

66

Thank you

Lan Aronson, François Lang, Jim Mork, Aurélie Névéol, Will Rogers

Olivier Bodenreider, Allen Browne, May Chey, Dina Demner-Fushman, Guy Divita, Kin Wah Fung, Susanne Humphrey, Dwayne McCully, Tom Rindflesch, Suresh Srinivasan