cicling2005
TRANSCRIPT
1
Name Discrimination by Clustering Similar Contexts
Ted Pedersen & Anagha Kulkarni, University of Minnesota, Duluth
Amruta Purandare, now at University of Pittsburgh
Research supported by National Science Foundation Faculty Early Career Development Award (#0092784)
2
Name Discrimination
Different people have the same name: George (HW) Bush and George (W) Bush
Different places have the same name: Duluth (Minn) and Duluth (GA)
Different things have the same abbrev.: UMD (Duluth) and UMD (College Park)
3-6
[slides 3-6: images only]
7
Our goals?
Given 1000 contexts w/ “John Smith”, identify those that are similar to each other
Group similar contexts together; assume each group is associated with a single individual
Generate an identifying label from the content of the different clusters
8
Measuring Similarity of Words and Contexts w/ Large Corpora
Second-order co-occurrences
Jim drives his car fast / Jim speeds in his auto
Car -> motor, garage, gasoline, insurance
Auto -> motor, insurance, gasoline, accident
Car and Auto occur with many of the same words. They are therefore similar!
A less direct relationship, more resistant to sparsity!
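The second-order idea can be sketched in a few lines. The toy sentences below are built around the slide's example; the extra sentences are invented so that "car" and "auto" share first-order neighbors:

```python
from collections import Counter, defaultdict

# Toy corpus (illustrative, not the paper's data).
sentences = [
    "jim drives his car fast",
    "jim speeds in his auto",
    "the car needs motor insurance",
    "the auto needs motor insurance",
]

# First-order co-occurrences: which words share a sentence with w?
cooc = defaultdict(Counter)
for s in sentences:
    words = s.split()
    for w in words:
        for v in words:
            if v != w:
                cooc[w][v] += 1

# Second-order link: car and auto never co-occur directly, but they
# co-occur with many of the same words, so they count as similar.
shared = set(cooc["car"]) & set(cooc["auto"])
print(sorted(shared))   # includes 'motor', 'insurance', 'needs', ...
```

Note that `cooc["car"]` never contains "auto": the similarity is recovered entirely through the shared neighbors, which is what makes the method resistant to sparsity.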
9
Word sense discrimination
Given 1000 contexts that include a particular target word (e.g., shell)
Cluster those contexts such that similar contexts come together
Similar contexts have similar meanings
Label each cluster with something that describes content, maybe even provides a definition
10
Methodology
Feature Selection
Context Representation
Measuring Similarities
Clustering
Evaluation
11
Feature Selection
Identify features in large (separate) training corpora, or in the data to be clustered
Rely on lexical features: unigrams, bigrams, co-occurrences
12
Lexical features
Unigrams: words that occur more than X times
Bigrams: ordered pairs of words, separated by at most 2-3 intervening words, that score above a cutoff on a measure of association
Co-occurrences: same as bigrams, but unordered
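A minimal sketch of bigram feature extraction under these definitions. The `gap` and raw-frequency `cutoff` parameters are illustrative stand-ins; a real system (e.g. the Ngram Statistics Package) ranks pairs with a measure of association such as log-likelihood rather than a raw count:

```python
from collections import Counter

# Hypothetical helper: collect ordered word pairs separated by at most
# `gap` intervening words, keeping pairs at or above a frequency cutoff.
def bigrams(tokens, gap=2, cutoff=2):
    pairs = Counter()
    for i in range(len(tokens)):
        # tokens[j] lies within `gap` intervening words of tokens[i]
        for j in range(i + 1, min(i + 2 + gap, len(tokens))):
            pairs[(tokens[i], tokens[j])] += 1
    return {p: c for p, c in pairs.items() if c >= cutoff}

tokens = "new york is not like new york city".split()
print(bigrams(tokens))  # -> {('new', 'york'): 2}
```

Dropping the order of each pair (e.g. keying on `frozenset` instead of a tuple) gives the unordered co-occurrence features.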
13
Context Representation
First order
Unigrams, bigrams, and co-occurrences that occur in the training corpus and also occur in the context to be clustered
Context is represented as a vector that shows if (or how often) these features occur in the context to be clustered
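A first-order representation can be sketched as follows; the feature list here is invented for illustration:

```python
# Sketch: a first-order context vector marks which selected features
# (unigrams, chosen from training data) occur in the context.
features = ["unix", "system", "store", "sea", "shell"]   # illustrative

def first_order_vector(context, features):
    words = set(context.lower().split())
    return [1 if f in words else 0 for f in features]

print(first_order_vector("The shell store by the sea shore", features))
# -> [0, 0, 1, 1, 1]
```

Replacing the 1/0 test with a count gives the "how often" variant.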
14
Context Representation
Second order
Bigrams or co-occurrences are used to create a matrix; cells represent counts or a measure of association for the word pair
Rows serve as co-occurrence vectors for words
Represent a context by averaging the vectors of the words in that context
15
2nd Order Context Vectors: "The largest shell store by the sea shore"

            Sells     Water     North-West  Sandy     Bombs   Sales     Artillery
Sea         18.5533   3324.98   30.520      51.7812   8.7399  0         0
Shore       0         0         29.576      136.0441  0       0         0
Store       134.5102  205.5469  0           0         0       18818.55  0
O2 context  51.021    1176.84   20.032      62.6084   2.9133  6272.85   0
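The O2 context row is simply the average of the three word rows, which can be checked directly (values copied from the table above):

```python
# Co-occurrence vectors for the content words in the context.
sea   = [18.5533, 3324.98, 30.520, 51.7812, 8.7399, 0, 0]
shore = [0, 0, 29.576, 136.0441, 0, 0, 0]
store = [134.5102, 205.5469, 0, 0, 0, 18818.55, 0]

# The 2nd-order context vector is the per-column average.
context = [sum(vals) / 3 for vals in zip(sea, shore, store)]
print([round(x, 3) for x in context])
# -> [51.021, 1176.842, 20.032, 62.608, 2.913, 6272.85, 0.0]
```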
16
2nd Order Context Vectors
17
Measuring Similarities
c1: {file, unix, commands, system, store}
c2: {machine, os, unix, system, computer, dos, store}
Matching = |X ∩ Y|
  |{unix, system, store}| = 3
Cosine = |X ∩ Y| / (√|X| · √|Y|)
  3/(√5 · √7) = 3/(2.2361 · 2.6458) = 0.5071
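A quick check of both coefficients on the slide's two contexts:

```python
from math import sqrt

c1 = {"file", "unix", "commands", "system", "store"}
c2 = {"machine", "os", "unix", "system", "computer", "dos", "store"}

matching = len(c1 & c2)                              # |X ∩ Y|
cosine = matching / (sqrt(len(c1)) * sqrt(len(c2)))  # set-based cosine

print(matching)           # -> 3
print(round(cosine, 4))   # -> 0.5071
```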
18
Limitations of 1st or 2nd order

      Kill  Murder  Destroy  Fire  Shoot  Missile  Weapon
      2.53  0       1.28     0     3.24   0        28.72
      0     4.21    0        0.92  0      52.27    0

      Burn  CD    Fire  Pipe  Bomb  Command  Execute
      2.56  1.28  0     72.7  0     2.36     19.23
      34.2  0     22.1  46.2  14.6  0        17.77
19
Latent Semantic Analysis
Singular Value Decomposition
Captures Polysemy and Synonymy(?)
Conceptual Fuzzy Feature Matching
Word Space to Semantic Space
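The SVD step that LSA relies on can be sketched with numpy; the co-occurrence matrix below is illustrative, not the paper's data:

```python
import numpy as np

# Toy word-by-feature co-occurrence matrix (illustrative values).
M = np.array([
    [2.53, 0.0, 1.28, 3.24, 0.0],
    [0.0, 4.21, 0.0, 0.0, 52.27],
    [2.56, 1.28, 0.0, 2.36, 19.23],
])

# Truncated SVD: keep the k largest singular values, mapping the sparse
# word space into a denser k-dimensional "semantic" space.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation

print(M_k.shape)  # -> (3, 5)
```

The rank-k reconstruction smooths the zeros, which is the "conceptual fuzzy feature matching" effect mentioned above.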
20
After context representation…
Each context is represented by a vector of some sort
A first-order vector shows direct occurrence of features in the context
A second-order vector is an average of the word vectors that make up the context, and captures indirect relationships
Now, cluster the vectors!
21
Clustering
UPGMA: hierarchical, agglomerative
Repeated Bisections: hybrid (divisive + partitional)
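UPGMA is average-link agglomerative clustering, which scipy implements directly; a minimal sketch with illustrative vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy context vectors: the first two are near each other, as are
# the last two (illustrative, not real data).
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])

# method="average" is UPGMA; cosine distance suits context vectors.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # two clusters: rows 0,1 together and rows 2,3 together
```

Repeated Bisections (as in CLUTO) instead splits the full set top-down, 2-way at a time.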
22
Evaluation (before mapping)

      S1  S2  S3  S4
C1    10   0   3   2
C2     1   1   7   1
C3     2   1   1   6
C4     2  15   1   2
23
Evaluation (after mapping)

       S1  S2  S3  S4  Total
C1     10   3   2   0     15
C2      1   7   1   1     10
C3      2   1   6   1     10
C4      2   1   2  15     20
Total  15  12  11  17     55
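The cluster-to-sense mapping can be found by maximizing the total on the diagonal; a brute-force sketch over the slide's unmapped matrix:

```python
from itertools import permutations

# Unmapped confusion matrix from the slide: rows are clusters C1..C4,
# columns are the (not yet aligned) senses.
conf = [
    [10, 0, 3, 2],
    [1, 1, 7, 1],
    [2, 1, 1, 6],
    [2, 15, 1, 2],
]

# Try every cluster-to-sense assignment and keep the one that puts the
# most mass on the diagonal. Brute force is fine for a handful of senses;
# the Hungarian algorithm scales better.
best = max(
    permutations(range(4)),
    key=lambda p: sum(conf[i][p[i]] for i in range(4)),
)
correct = sum(conf[i][best[i]] for i in range(4))
total = sum(map(sum, conf))
print(correct, total, round(correct / total, 2))  # -> 38 55 0.69
```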
24
Majority Sense Classifier
25
Data
Line, Hard, Serve: 4000+ instances per word; 60:40 split; 3-5 senses per word
SENSEVAL-2: 73 words = 28 V + 29 N + 15 A; approx. 50-100 test and 100-200 train instances; 8-12 senses per word
26
Experimental comparison of 1st and 2nd order representations:
Pedersen & Bruce (1st order contexts) vs. Schütze (2nd order contexts)
• PB1: co-occurrences, UPGMA, similarity space
• SC1: co-occurrence matrix, SVD, RB, vector space
• PB2: PB1 except RB, vector space
• SC2: SC1 except UPGMA, similarity space
• PB3: PB1 with bigram features
• SC3: SC1 with bigram matrix
27
Experimental Conclusions
Nature of Data -> Recommendation
Smaller data (like SENSEVAL-2) -> 2nd order, RB
Large, homogeneous (like Line, Hard, Serve) -> 1st order, UPGMA
28
Software
SenseClusters – http://senseclusters.sourceforge.net/
Ngram Statistics Package – http://www.d.umn.edu/~tpederse/nsp.html
Cluto – http://www-users.cs.umn.edu/~karypis/cluto/
SVDPack – http://netlib.org/svdpack/
29
Making Free Software
Mostly Perl, all CopyLeft
SenseClusters: identify similar contexts
Ngram Statistics Package: identify interesting sequences of words
WordNet::Similarity: measure similarity among concepts
Google-Hack: find sets of related words
WordNet::SenseRelate: all-words sense disambiguation
SyntaLex and Duluth systems: supervised WSD
http://www.d.umn.edu/~tpederse/code.html