# Vector Semantics: Introduction

Klinton Bicknell
(borrowing from: Dan Jurafsky and Jim Martin)
## Why vector models of meaning? Computing the similarity between words

"fast" is similar to "rapid"
"tall" is similar to "height"

Question answering:
Q: "How tall is Mt. Everest?"
Candidate A: "The official height of Mount Everest is 29029 feet"
## Word similarity for plagiarism detection
## Word similarity for historical change: semantic change over time

[Figure: "Semantic Broadening" of *dog*, *deer*, and *hound* across the periods <1250, Middle (1350-1500), and Modern (1500-1710).]

Kulkarni, Al-Rfou, Perozzi, Skiena 2015; Sagi, Kaufmann & Clark 2013
## Problems with thesaurus-based meaning

• We don't have a thesaurus for every language
• We can't have a thesaurus for every year
  • For change detection, we need to compare word meanings in year t to year t+1
• Thesauruses have problems with recall
  • Many words and phrases are missing
  • Thesauri work less well for verbs and adjectives
## Distributional models of meaning = vector-space models of meaning = vector semantics

Intuitions: Zellig Harris (1954):
• "oculist and eye-doctor … occur in almost the same environments"
• "If A and B have almost identical environments we say that they are synonyms."

Firth (1957):
• "You shall know a word by the company it keeps!"
## Intuition of distributional word similarity

• Nida example: Suppose I asked you, what is tesgüino?
  A bottle of tesgüino is on the table
  Everybody likes tesgüino
  Tesgüino makes you drunk
  We make tesgüino out of corn.
• From context words humans can guess tesgüino means
  • an alcoholic beverage like beer
• Intuition for algorithm:
  • Two words are similar if they have similar word contexts.
## Three kinds of vector models

Sparse vector representations:
1. Mutual-information-weighted word co-occurrence matrices

Dense vector representations:
2. Singular value decomposition (and Latent Semantic Analysis)
3. Neural-network-inspired models (skip-grams, CBOW)
## Shared intuition

• Model the meaning of a word by "embedding" it in a vector space.
• The meaning of a word is a vector of numbers
  • Vector models are also called "embeddings".
• Contrast: word meaning is represented in many computational linguistic applications by a vocabulary index ("word number 545")
• Old philosophy joke:
  Q: What's the meaning of life?
  A: LIFE′
# Vector Semantics: Words and co-occurrence vectors
## Co-occurrence matrices

• We represent how often a word occurs in a document
  • Term-document matrix
• Or how often a word occurs with another word
  • Term-term matrix (or word-word co-occurrence matrix, or word-context matrix)
## Term-document matrix

• Each cell: count of word w in a document d
• Each document is a count vector in ℕ^|V|: a column below

|             | As You Like It | Twelfth Night | Julius Caesar | Henry V |
|-------------|----------------|---------------|---------------|---------|
| battle      | 1              | 1             | 8             | 15      |
| soldier     | 2              | 2             | 12            | 36      |
| fool        | 37             | 58            | 1             | 5       |
| clown       | 6              | 117           | 0             | 0       |

## Similarity in term-document matrices

• Two documents are similar if their vectors (the columns above) are similar

## The words in a term-document matrix

• Each word is a count vector in ℕ^D: a row in the matrix above
• Two words are similar if their vectors are similar
## The word-word or word-context matrix

• Instead of documents as contexts, use the words around the target word: each cell counts how often the row (target) word co-occurs with the column (context) word within a window in the corpus.
|             | aardvark | computer | data | pinch | result | sugar | … |
|-------------|----------|----------|------|-------|--------|-------|---|
| apricot     | 0        | 0        | 0    | 1     | 0      | 1     |   |
| pineapple   | 0        | 0        | 0    | 1     | 0      | 1     |   |
| digital     | 0        | 2        | 1    | 0     | 1      | 0     |   |
| information | 0        | 1        | 6    | 0     | 4      | 0     |   |
## Word-word matrix

• The full matrix is |V| × |V| and very sparse: most entries are zero, since most word pairs never co-occur.
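To make the construction concrete, here is a minimal sketch (not from the slides; the toy corpus and the ±2 window size are arbitrary choices) of counting word-word co-occurrences in Python:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each (target, context) word pair occurs
    within `window` words of each other."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, target in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][sent[j]] += 1
    return counts

corpus = [["we", "make", "tesguino", "out", "of", "corn"],
          ["everybody", "likes", "tesguino"]]
print(dict(cooccurrence_counts(corpus)["tesguino"]))
# {'we': 1, 'make': 1, 'out': 1, 'of': 1, 'everybody': 1, 'likes': 1}
```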
## 2 kinds of co-occurrence between 2 words

• First-order co-occurrence (syntagmatic association):
  • They are typically nearby each other.
  • *wrote* is a first-order associate of *book* or *poem*.
• Second-order co-occurrence (paradigmatic association):
  • They have similar neighbors.
  • *wrote* is a second-order associate of words like *said* or *remarked*.

(Schütze and Pedersen, 1993)
# Vector Semantics: Positive Pointwise Mutual Information (PPMI)
## Problem with raw counts

• Raw word frequency is not a great measure of association between words
  • It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative
• We'd rather have a measure that asks whether a context word is particularly informative about the target word.
  • Positive Pointwise Mutual Information (PPMI)
## Pointwise Mutual Information

• Do events x and y co-occur more often than if they were independent?

$$\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
## Positive Pointwise Mutual Information

• PMI ranges from −∞ to +∞, but negative values are problematic: unless corpora are enormous, we can't reliably tell that two words co-occur *less* often than chance.
• So we replace all negative PMI values with zero:

$$\text{PPMI}(x, y) = \max\left(\log_2 \frac{P(x, y)}{P(x)\,P(y)},\ 0\right)$$
## Computing PPMI on a term-context matrix

• Matrix F with W rows (words) and C columns (contexts)
• f_ij is the number of times w_i occurs in context c_j

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$$

$$\text{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}} \qquad \text{ppmi}_{ij} = \begin{cases} \text{pmi}_{ij} & \text{if } \text{pmi}_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$
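These formulas translate almost line-for-line into NumPy. A minimal sketch (not from the slides), assuming `F` is the word-word count matrix shown earlier:

```python
import numpy as np

def ppmi(F):
    """Compute the PPMI matrix from a W x C count matrix F,
    following the joint/marginal probabilities defined above."""
    total = F.sum()
    p_ij = F / total                             # joint probabilities
    p_i = F.sum(axis=1, keepdims=True) / total   # row (word) marginals
    p_j = F.sum(axis=0, keepdims=True) / total   # column (context) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0                 # cells with zero counts
    return np.maximum(pmi, 0.0)                  # clip negatives: PPMI

# The word-word counts from the slide (rows: apricot, pineapple,
# digital, information; columns: computer, data, pinch, result, sugar)
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)
print(ppmi(F).round(2))
```

Running this reproduces the PPMI table on the following slides, e.g. 2.25 for (apricot, pinch) and 0.57 for (information, data).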
Applying this to the word-word counts above, with N the total count:

$$p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$$

• p(w = information, c = data) = 6/19 = .32
• p(w = information) = 11/19 = .58
• p(c = data) = 7/19 = .37

| p(w,context) | computer | data | pinch | result | sugar | p(w) |
|--------------|----------|------|-------|--------|-------|------|
| apricot      | 0.00     | 0.00 | 0.05  | 0.00   | 0.05  | 0.11 |
| pineapple    | 0.00     | 0.00 | 0.05  | 0.00   | 0.05  | 0.11 |
| digital      | 0.11     | 0.05 | 0.00  | 0.05   | 0.00  | 0.21 |
| information  | 0.05     | 0.32 | 0.00  | 0.21   | 0.00  | 0.58 |
| p(context)   | 0.16     | 0.37 | 0.11  | 0.26   | 0.11  |      |
Then, using the probabilities above:

$$\text{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$$

• pmi(information, data) = log₂(.32 / (.37 × .58)) = .58 (.57 using full precision)

| PPMI(w,context) | computer | data | pinch | result | sugar |
|-----------------|----------|------|-------|--------|-------|
| apricot         | -        | -    | 2.25  | -      | 2.25  |
| pineapple       | -        | -    | 2.25  | -      | 2.25  |
| digital         | 1.66     | 0.00 | -     | 0.00   | -     |
| information     | 0.00     | 0.57 | -     | 0.47   | -     |
## Weighting PMI

• PMI is biased toward infrequent events
  • Very rare words have very high PMI values
• Two solutions:
  • Give rare words slightly higher probabilities
  • Use add-delta smoothing (which has a similar effect)
## Weighting PMI: Giving rare context words slightly higher probability

• Raise the context probabilities to the power of α (α = 0.75 works well): replace p(c) in the PMI denominator with p_α(c) = count(c)^α / Σ_c count(c)^α. This slightly boosts the probability of rare contexts and so lowers their PMI.
## Add-2 smoothed counts

| Add-2 smoothed count(w,context) | computer | data | pinch | result | sugar |
|---------------------------------|----------|------|-------|--------|-------|
| apricot     | 2 | 2 | 3 | 2 | 3 |
| pineapple   | 2 | 2 | 3 | 2 | 3 |
| digital     | 4 | 3 | 2 | 3 | 2 |
| information | 3 | 8 | 2 | 6 | 2 |

| p(w,context) [add-2] | computer | data | pinch | result | sugar | p(w) |
|----------------------|----------|------|-------|--------|-------|------|
| apricot     | 0.03 | 0.03 | 0.05 | 0.03 | 0.05 | 0.20 |
| pineapple   | 0.03 | 0.03 | 0.05 | 0.03 | 0.05 | 0.20 |
| digital     | 0.07 | 0.05 | 0.03 | 0.05 | 0.03 | 0.24 |
| information | 0.05 | 0.14 | 0.03 | 0.10 | 0.03 | 0.36 |
| p(context)  | 0.19 | 0.25 | 0.17 | 0.22 | 0.17 |      |
## PPMI versus add-2 smoothed PPMI

| PPMI(w,context) | computer | data | pinch | result | sugar |
|-----------------|----------|------|-------|--------|-------|
| apricot     | -    | -    | 2.25 | -    | 2.25 |
| pineapple   | -    | -    | 2.25 | -    | 2.25 |
| digital     | 1.66 | 0.00 | -    | 0.00 | -    |
| information | 0.00 | 0.57 | -    | 0.47 | -    |

| PPMI(w,context) [add-2] | computer | data | pinch | result | sugar |
|-------------------------|----------|------|-------|--------|-------|
| apricot     | 0.00 | 0.00 | 0.56 | 0.00 | 0.56 |
| pineapple   | 0.00 | 0.00 | 0.56 | 0.00 | 0.56 |
| digital     | 0.62 | 0.00 | 0.00 | 0.00 | 0.00 |
| information | 0.00 | 0.58 | 0.00 | 0.37 | 0.00 |
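A quick sketch of how the smoothing is applied, reusing the `ppmi` helper and count matrix `F` from the earlier sketch (an illustration, not the slides' code):

```python
def ppmi_add_k(F, k=2.0):
    """PPMI after add-k smoothing of the raw counts."""
    return ppmi(F + k)

print(ppmi(F).round(2))           # rare pairs like (apricot, pinch) get 2.25
print(ppmi_add_k(F, 2).round(2))  # the same pairs drop to about 0.56
```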
# Vector Semantics: Measuring similarity, the cosine
## Measuring similarity

• Given 2 target words v and w
• We'll need a way to measure their similarity.
• Most measures of vector similarity are based on the dot product (inner product) from linear algebra:
  • High when two vectors have large values in the same dimensions.
  • Low (in fact 0) for orthogonal vectors with zeros in complementary distribution
## Problem with the dot product

• The dot product is larger if the vector is longer. Vector length:

$$|\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}$$

• Vectors are longer if they have higher values in each dimension
  • That means more frequent words will have higher dot products
  • That's bad: we don't want a similarity metric to be sensitive to word frequency
## Solution: cosine

• Just divide the dot product by the lengths of the two vectors!
• This turns out to be the cosine of the angle between them!
## Cosine for computing similarity

$$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$

(The dot product over the product of the vector lengths; equivalently, the dot product of the two unit vectors.)

v_i is the PPMI value for word v in context i; w_i is the PPMI value for word w in context i.
cos(v,w) is the cosine similarity of v and w.
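A direct transcription of this formula into NumPy (a sketch, not the slides' code):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: the dot product of the two vectors
    divided by the product of their lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```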
## Cosine as a similarity metric

• −1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal

• Raw frequency and PPMI values are non-negative, so the cosine ranges from 0 to 1
## Which pair of words is more similar?

|             | large | data | computer |
|-------------|-------|------|----------|
| apricot     | 2     | 0    | 0        |
| digital     | 0     | 1    | 2        |
| information | 1     | 6    | 1        |

$$\cos(\text{apricot}, \text{information}) = \frac{2 \cdot 1 + 0 \cdot 6 + 0 \cdot 1}{\sqrt{4+0+0}\,\sqrt{1+36+1}} = \frac{2}{2\sqrt{38}} \approx .16$$

$$\cos(\text{digital}, \text{information}) = \frac{0 \cdot 1 + 1 \cdot 6 + 2 \cdot 1}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} \approx .58$$

$$\cos(\text{apricot}, \text{digital}) = \frac{0 + 0 + 0}{\sqrt{4}\,\sqrt{5}} = 0$$
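These values are easy to check mechanically; a small verification sketch using the `cosine` helper defined above:

```python
apricot = np.array([2, 0, 0])
digital = np.array([0, 1, 2])
information = np.array([1, 6, 1])

print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, digital), 2))      # 0.0
```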
## Visualizing vectors and angles

|             | large | data |
|-------------|-------|------|
| apricot     | 2     | 0    |
| digital     | 0     | 1    |
| information | 1     | 6    |

[Figure: *apricot*, *digital*, and *information* plotted as vectors, with dimension 1 ('large') on one axis and dimension 2 ('data') on the other; the angle between vectors reflects their similarity.]
## Clustering vectors to visualize similarity in co-occurrence matrices

From Rohde, Gonnerman, and Plaut, "Modeling Word Meaning Using Lexical Co-Occurrence":

[Figure 8: Multidimensional scaling for three noun classes (body parts, animals, and places).]
[Figure 9: Hierarchical clustering for three noun classes using distances based on vector correlations.]

Rohde et al. (2006)
## Other possible similarity measures
## Evaluating similarity (the same as for thesaurus-based methods)

• Intrinsic evaluation:
  • Correlation between algorithm and human word similarity ratings
• Extrinsic (task-based, end-to-end) evaluation:
  • Spelling error detection, WSD, essay grading
  • Taking TOEFL multiple-choice vocabulary tests:
    *Levied* is closest in meaning to which of these: imposed, believed, requested, correlated
## Alternative to PPMI for measuring association

• tf-idf (that's a hyphen, not a minus sign)
• The combination of two factors:
  • Term frequency (Luhn 1957): frequency of the word (can be logged)
  • Inverse document frequency (IDF) (Sparck Jones 1972):

$$\text{idf}_i = \log\left(\frac{N}{\text{df}_i}\right)$$

  • N is the total number of documents
  • df_i = "document frequency of word i" = number of documents containing word i
• The tf-idf weight of word i in document j:

$$w_{ij} = \text{tf}_{ij} \times \text{idf}_i$$
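A minimal tf-idf sketch (assumptions: raw term frequency and natural log; the slide notes that tf can also be logged):

```python
import math

def tf_idf(term_doc_counts):
    """Weight a term-document count table (dict of word -> list of
    per-document counts) by tf-idf as defined above."""
    n_docs = len(next(iter(term_doc_counts.values())))
    weighted = {}
    for word, counts in term_doc_counts.items():
        df = sum(1 for c in counts if c > 0)        # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        weighted[word] = [c * idf for c in counts]  # tf * idf
    return weighted

# The Shakespeare term-document counts from earlier
counts = {"battle": [1, 1, 8, 15], "soldier": [2, 2, 12, 36],
          "fool": [37, 58, 1, 5], "clown": [6, 117, 0, 0]}
print(tf_idf(counts)["clown"])   # idf = log(4/2) > 0 keeps "clown" weighted
print(tf_idf(counts)["battle"])  # appears in all 4 docs, so idf = 0
```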
## tf-idf is not generally used for word-word similarity

• But it is by far the most common weighting when we are considering the relationship of words to documents
# Vector Semantics: Dense Vectors
## Sparse versus dense vectors

• PPMI vectors are
  • long (length |V| = 20,000 to 50,000)
  • sparse (most elements are zero)
• Alternative: learn vectors which are
  • short (length 200 to 1000)
  • dense (most elements are non-zero)
## Sparse versus dense vectors

• Why dense vectors?
  • Short vectors may be easier to use as features in machine learning (fewer weights to tune)
  • Dense vectors may generalize better than storing explicit counts
  • They may do better at capturing synonymy:
    • *car* and *automobile* are synonyms, but are represented as distinct dimensions; this fails to capture the similarity between a word with *car* as a neighbor and a word with *automobile* as a neighbor
## Three methods for getting short dense vectors

• Singular Value Decomposition (SVD)
  • A special case of this is called LSA: Latent Semantic Analysis
• "Neural language model"-inspired predictive models
  • skip-grams and CBOW
• Brown clustering
# Vector Semantics: Dense Vectors via SVD
## Intuition

• Approximate an N-dimensional dataset using fewer dimensions
• By first rotating the axes into a new space
  • in which the highest-order dimension captures the most variance in the original dataset
  • and the next dimension captures the next most variance, etc.
• Many such (related) methods:
  • PCA: principal components analysis
  • Factor analysis
  • SVD
## Dimensionality reduction
## Singular Value Decomposition

Any rectangular w × c matrix X equals the product of 3 matrices:

• W: rows corresponding to the original rows, but its m columns each represent a dimension in a new latent space, such that
  • the m column vectors are orthogonal to each other
  • the columns are ordered by the amount of variance in the dataset each new dimension accounts for
• S: diagonal m × m matrix of singular values expressing the importance of each dimension
• C: columns corresponding to the original columns, but with m rows corresponding to the singular values
## Singular Value Decomposition

$$X = W\,S\,C$$

Landauer and Dumais 1997
## SVD applied to term-document matrix: Latent Semantic Analysis

• Instead of keeping all m dimensions, we just keep the top k singular values. Let's say 300.
• The result is a least-squares approximation to the original X.

Deerwester et al. (1988)
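A sketch of truncated SVD with NumPy (illustrative only, not the slides' code); `X` is any term-document count or tf-idf matrix and `k` the number of dimensions kept:

```python
import numpy as np

def truncated_svd(X, k):
    """Factor X = W @ diag(S) @ C, then keep only the top-k
    singular values/dimensions (a least-squares approximation)."""
    W, S, C = np.linalg.svd(X, full_matrices=False)
    return W[:, :k], S[:k], C[:k, :]

# The Shakespeare term-document matrix from earlier
X = np.array([[1, 1, 8, 15], [2, 2, 12, 36],
              [37, 58, 1, 5], [6, 117, 0, 0]], dtype=float)
W_k, S_k, C_k = truncated_svd(X, k=2)
X_hat = W_k @ np.diag(S_k) @ C_k   # rank-2 approximation of X
```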
## LSA: more details

• 300 dimensions are commonly used
• The cells are commonly weighted by a product of two weights
  • Local weight: log term frequency
  • Global weight: either idf or an entropy measure
## Let's return to PPMI word-word matrices

• Can we apply SVD to them?
## SVD applied to term-term matrix

(I'm simplifying here by assuming the matrix has rank |V|)
## Truncated SVD on term-term matrix
## Truncated SVD produces embeddings

• Each row of the W matrix is a k-dimensional representation of each word w
• k might range from 50 to 1000
• Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension or even the top 50 dimensions is helpful (Lapesa and Evert 2014).
## Embeddings versus sparse vectors

• Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity
  • Denoising: low-order dimensions may represent unimportant information
  • Truncation may help the models generalize better to unseen data.
  • Having a smaller number of dimensions may make it easier for classifiers to properly weight the dimensions for the task.
  • Dense models may do better at capturing higher-order co-occurrence.
# Vector Semantics: Embeddings inspired by neural language models: skip-grams and CBOW
## Prediction-based models: an alternative way to get dense vectors

• Skip-gram (Mikolov et al. 2013a), CBOW (Mikolov et al. 2013b)
• Learn embeddings as part of the process of word prediction.
  • Train a neural network to predict neighboring words
  • Inspired by neural net language models.
  • In so doing, learn dense embeddings for the words in the training corpus.
• Advantages:
  • Fast, easy to train (much faster than SVD)
  • Available online in the word2vec package
  • Including sets of pretrained embeddings!
## Skip-grams

• Predict each neighboring word
  • in a context window of 2C words
  • from the current word.
• So for C = 2, we are given word w_t and predict these 4 words: w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}
## Skip-grams learn 2 embeddings for each w

• Input embedding v, in the input matrix W
  • Column i of the input matrix W is the d × 1 embedding v_i for word i in the vocabulary.
• Output embedding v′, in the output matrix W′
  • Row i of the output matrix W′ is the 1 × d vector embedding v′_i for word i in the vocabulary.
## Setup

• Walking through the corpus, pointing at word w(t), whose index in the vocabulary is j, so we'll call it w_j (1 ≤ j ≤ |V|).
• Let's predict w(t+1), whose index in the vocabulary is k (1 ≤ k ≤ |V|). Hence our task is to compute P(w_k | w_j).
## One-hot vectors

• A vector of length |V|
• 1 for the target word and 0 for other words
• So if "popsicle" is vocabulary word 5, the one-hot vector is [0, 0, 0, 0, 1, 0, 0, 0, 0, …, 0]
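Why one-hot vectors matter here: multiplying the input matrix by a one-hot vector just selects that word's embedding column, which is the h = v_j step on the next slides. A toy sketch (the vocabulary size and d are made up for illustration):

```python
import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))       # input embedding matrix: column i = v_i

j = 4                             # 0-based index of vocabulary word 5 ("popsicle")
x = np.zeros(V)
x[j] = 1.0                        # one-hot vector for word j

h = W @ x                         # hidden layer: just column j of W
assert np.allclose(h, W[:, j])    # h = v_j
```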
## Skip-gram

[Figure: skip-gram network architecture.]

Computation through the network, for input word w_j and output word w_k:

• Hidden layer: h = v_j (the input embedding of w_j)
• Output scores: o = W′h
• Score for word k: o_k = v′_k · h = v′_k · v_j
## Turning outputs into probabilities

• o_k = v′_k · v_j
• We use the softmax to turn the scores into probabilities:

$$p(w_k \mid w_j) = \frac{\exp(v'_k \cdot v_j)}{\sum_{i=1}^{|V|} \exp(v'_i \cdot v_j)}$$
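A sketch of the full forward pass, reusing the toy `W`, `V`, `d`, and `rng` from the one-hot sketch (`W_out` plays the role of W′; none of this is the slides' code):

```python
def skipgram_probs(W_in, W_out, j):
    """P(w_k | w_j) for all k: softmax over the output scores o = W' v_j."""
    h = W_in[:, j]                # v_j: input embedding of word j
    o = W_out @ h                 # o_k = v'_k . v_j for every k
    e = np.exp(o - o.max())       # subtract max for numerical stability
    return e / e.sum()

W_out = rng.normal(size=(V, d))   # output matrix W': row k = v'_k
p = skipgram_probs(W, W_out, j=4)
print(p.sum())                    # 1.0: a proper probability distribution
```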
## Embeddings from W and W′

• Since we have two embeddings, v_j and v′_j, for each word w_j
• We can either:
  • Just use v_j
  • Sum them
  • Concatenate them to make a double-length embedding
## But wait; how do we learn the embeddings?
## Relation between skip-grams and PMI!

• If we multiply W W′ᵀ, we get a |V| × |V| matrix M, each entry m_ij corresponding to some association between input word i and output word j
• Levy and Goldberg (2014b) show that skip-gram reaches its optimum just when this matrix is a shifted version of PMI:

$$W W'^{\top} = M^{\text{PMI}} - \log k$$

  (here k is the number of negative samples)
• So skip-gram is implicitly factoring a shifted version of the PMI matrix into the two embedding matrices.
## CBOW (Continuous Bag of Words)

[Figure: CBOW architecture, the mirror image of the skip-gram: the context words are used to predict the current word.]
## Properties of embeddings

• Nearest words to some embeddings (Mikolov et al. 2013)
## Embeddings capture relational meaning!

• vector("king") − vector("man") + vector("woman") ≈ vector("queen")
• vector("Paris") − vector("France") + vector("Italy") ≈ vector("Rome")
# Vector Semantics: Evaluating similarity
## Evaluating similarity

• Extrinsic (task-based, end-to-end) evaluation:
  • Question answering
  • Spell checking
  • Essay grading
• Intrinsic evaluation:
  • Correlation between algorithm and human word similarity ratings
    • WordSim-353: 353 noun pairs rated 0-10. sim(plane, car) = 5.77
  • Taking TOEFL multiple-choice vocabulary tests
    • *Levied* is closest in meaning to: imposed, believed, requested, correlated
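A sketch of the intrinsic evaluation using SciPy's Spearman rank correlation, as is standard for WordSim-353-style comparisons (the human ratings below are invented for illustration; the vectors and `cosine` helper come from the earlier sketches):

```python
from scipy.stats import spearmanr

pairs = [("apricot", "information"), ("digital", "information"),
         ("apricot", "digital")]
human = [1.0, 6.5, 0.5]           # hypothetical 0-10 similarity ratings
vecs = {"apricot": apricot, "digital": digital, "information": information}
model = [cosine(vecs[a], vecs[b]) for a, b in pairs]
rho, _ = spearmanr(human, model)
print(rho)                        # rank correlation with human judgments
```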
## Summary

• Distributional (vector) models of meaning
  • Sparse (PPMI-weighted word-word co-occurrence matrices)
  • Dense:
    • Word-word SVD, 50-2000 dimensions
    • Skip-grams and CBOW (embeddings available in word2vec)
## A great semantic vector space for documents

• Words have low-dimensional embeddings, useful for many computational linguistic applications
• Documents are a weighted combination of words
  • documents as a vector in the low-dimensional space
• This allows:
  • semantic document clustering (k-means, hierarchical, etc.)
  • search for similar documents (prior art in patents, etc.)