# Vector Semantics: Introduction

Klinton Bicknell
(borrowing from: Dan Jurafsky and Jim Martin)
## Why vector models of meaning? Computing the similarity between words

"fast" is similar to "rapid"
"tall" is similar to "height"

Question answering:
Q: "How tall is Mt. Everest?"
Candidate A: "The official height of Mount Everest is 29029 feet"
## Word similarity for plagiarism detection
## Word similarity for historical change: semantic change over time

[Figure: "Semantic Broadening" of *dog*, *deer*, and *hound* across the periods <1250, Middle (1350-1500), and Modern (1500-1710).]

Kulkarni, Al-Rfou, Perozzi, Skiena 2015; Sagi, Kaufmann & Clark 2013
## Problems with thesaurus-based meaning

• We don't have a thesaurus for every language
• We can't have a thesaurus for every year
  • For change detection, we need to compare word meanings in year t to year t+1
• Thesauruses have problems with recall
  • Many words and phrases are missing
  • Thesauri work less well for verbs and adjectives
## Distributional models of meaning = vector-space models of meaning = vector semantics

Intuitions: Zellig Harris (1954):
• "oculist and eye-doctor … occur in almost the same environments"
• "If A and B have almost identical environments we say that they are synonyms."

Firth (1957):
• "You shall know a word by the company it keeps!"
## Intuition of distributional word similarity

• Nida example: Suppose I asked you, what is tesgüino?
  A bottle of tesgüino is on the table
  Everybody likes tesgüino
  Tesgüino makes you drunk
  We make tesgüino out of corn.
• From context words humans can guess tesgüino means
  • an alcoholic beverage like beer
• Intuition for algorithm:
  • Two words are similar if they have similar word contexts.
## Three kinds of vector models

Sparse vector representations:
1. Mutual-information-weighted word co-occurrence matrices

Dense vector representations:
2. Singular value decomposition (and Latent Semantic Analysis)
3. Neural-network-inspired models (skip-grams, CBOW)
## Shared intuition

• Model the meaning of a word by "embedding" it in a vector space.
• The meaning of a word is a vector of numbers
  • Vector models are also called "embeddings".
• Contrast: word meaning is represented in many computational linguistic applications by a vocabulary index ("word number 545")
• Old philosophy joke:
  Q: What's the meaning of life?
  A: LIFE′
# Vector Semantics: Words and co-occurrence vectors
## Co-occurrence matrices

• We represent how often a word occurs in a document
  • Term-document matrix
• Or how often a word occurs with another word
  • Term-term matrix (or word-word co-occurrence matrix, or word-context matrix)
## Term-document matrix

• Each cell: count of word w in a document d
• Each document is a count vector in ℕ^|V|: a column below

|             | As You Like It | Twelfth Night | Julius Caesar | Henry V |
|-------------|----------------|---------------|---------------|---------|
| battle      | 1              | 1             | 8             | 15      |
| soldier     | 2              | 2             | 12            | 36      |
| fool        | 37             | 58            | 1             | 5       |
| clown       | 6              | 117           | 0             | 0       |

## Similarity in term-document matrices

• Two documents are similar if their vectors (the columns above) are similar

## The words in a term-document matrix

• Each word is a count vector in ℕ^D: a row in the matrix above
• Two words are similar if their vectors are similar
## The word-word or word-context matrix

• Instead of documents as contexts, use the words around the target word: each cell counts how often the row (target) word co-occurs with the column (context) word within a window in the corpus.
|             | aardvark | computer | data | pinch | result | sugar | … |
|-------------|----------|----------|------|-------|--------|-------|---|
| apricot     | 0        | 0        | 0    | 1     | 0      | 1     |   |
| pineapple   | 0        | 0        | 0    | 1     | 0      | 1     |   |
| digital     | 0        | 2        | 1    | 0     | 1      | 0     |   |
| information | 0        | 1        | 6    | 0     | 4      | 0     |   |
## Word-word matrix

• The full matrix is |V| × |V| and very sparse: most entries are zero, since most word pairs never co-occur.
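To make the construction concrete, here is a minimal sketch (not from the slides; the toy corpus and the ±2 window size are arbitrary choices) of counting word-word co-occurrences in Python:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each (target, context) word pair occurs
    within `window` words of each other."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, target in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][sent[j]] += 1
    return counts

corpus = [["we", "make", "tesguino", "out", "of", "corn"],
          ["everybody", "likes", "tesguino"]]
print(dict(cooccurrence_counts(corpus)["tesguino"]))
# {'we': 1, 'make': 1, 'out': 1, 'of': 1, 'everybody': 1, 'likes': 1}
```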
## 2 kinds of co-occurrence between 2 words

• First-order co-occurrence (syntagmatic association):
  • They are typically nearby each other.
  • *wrote* is a first-order associate of *book* or *poem*.
• Second-order co-occurrence (paradigmatic association):
  • They have similar neighbors.
  • *wrote* is a second-order associate of words like *said* or *remarked*.

(Schütze and Pedersen, 1993)
# Vector Semantics: Positive Pointwise Mutual Information (PPMI)
## Problem with raw counts

• Raw word frequency is not a great measure of association between words
  • It's very skewed: "the" and "of" are very frequent, but maybe not the most discriminative
• We'd rather have a measure that asks whether a context word is particularly informative about the target word.
  • Positive Pointwise Mutual Information (PPMI)
## Pointwise Mutual Information

• Do events x and y co-occur more often than if they were independent?

$$\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
## Positive Pointwise Mutual Information

• PMI ranges from −∞ to +∞, but negative values are problematic: unless corpora are enormous, we can't reliably tell that two words co-occur *less* often than chance.
• So we replace all negative PMI values with zero:

$$\text{PPMI}(x, y) = \max\left(\log_2 \frac{P(x, y)}{P(x)\,P(y)},\ 0\right)$$
## Computing PPMI on a term-context matrix

• Matrix F with W rows (words) and C columns (contexts)
• f_ij is the number of times w_i occurs in context c_j

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$$

$$\text{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}} \qquad \text{ppmi}_{ij} = \begin{cases} \text{pmi}_{ij} & \text{if } \text{pmi}_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$
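These formulas translate almost line-for-line into NumPy. A minimal sketch (not from the slides), assuming `F` is the word-word count matrix shown earlier:

```python
import numpy as np

def ppmi(F):
    """Compute the PPMI matrix from a W x C count matrix F,
    following the joint/marginal probabilities defined above."""
    total = F.sum()
    p_ij = F / total                             # joint probabilities
    p_i = F.sum(axis=1, keepdims=True) / total   # row (word) marginals
    p_j = F.sum(axis=0, keepdims=True) / total   # column (context) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0                 # cells with zero counts
    return np.maximum(pmi, 0.0)                  # clip negatives: PPMI

# The word-word counts from the slide (rows: apricot, pineapple,
# digital, information; columns: computer, data, pinch, result, sugar)
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)
print(ppmi(F).round(2))
```

Running this reproduces the PPMI table on the following slides, e.g. 2.25 for (apricot, pinch) and 0.57 for (information, data).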
Applying this to the word-word counts above, with N the total count:

$$p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$$

• p(w = information, c = data) = 6/19 = .32
• p(w = information) = 11/19 = .58
• p(c = data) = 7/19 = .37

| p(w,context) | computer | data | pinch | result | sugar | p(w) |
|--------------|----------|------|-------|--------|-------|------|
| apricot      | 0.00     | 0.00 | 0.05  | 0.00   | 0.05  | 0.11 |
| pineapple    | 0.00     | 0.00 | 0.05  | 0.00   | 0.05  | 0.11 |
| digital      | 0.11     | 0.05 | 0.00  | 0.05   | 0.00  | 0.21 |
| information  | 0.05     | 0.32 | 0.00  | 0.21   | 0.00  | 0.58 |
| p(context)   | 0.16     | 0.37 | 0.11  | 0.26   | 0.11  |      |
Then, using the probabilities above:

$$\text{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$$

• pmi(information, data) = log₂(.32 / (.37 × .58)) = .58 (.57 using full precision)

| PPMI(w,context) | computer | data | pinch | result | sugar |
|-----------------|----------|------|-------|--------|-------|
| apricot         | -        | -    | 2.25  | -      | 2.25  |
| pineapple       | -        | -    | 2.25  | -      | 2.25  |
| digital         | 1.66     | 0.00 | -     | 0.00   | -     |
| information     | 0.00     | 0.57 | -     | 0.47   | -     |
## Weighting PMI

• PMI is biased toward infrequent events
  • Very rare words have very high PMI values
• Two solutions:
  • Give rare words slightly higher probabilities
  • Use add-delta smoothing (which has a similar effect)
## Weighting PMI: Giving rare context words slightly higher probability

• Raise the context probabilities to the power of α (α = 0.75 works well): replace p(c) in the PMI denominator with p_α(c) = count(c)^α / Σ_c count(c)^α. This slightly boosts the probability of rare contexts and so lowers their PMI.
## Add-2 smoothed counts

| Add-2 smoothed count(w,context) | computer | data | pinch | result | sugar |
|---------------------------------|----------|------|-------|--------|-------|
| apricot     | 2 | 2 | 3 | 2 | 3 |
| pineapple   | 2 | 2 | 3 | 2 | 3 |
| digital     | 4 | 3 | 2 | 3 | 2 |
| information | 3 | 8 | 2 | 6 | 2 |

| p(w,context) [add-2] | computer | data | pinch | result | sugar | p(w) |
|----------------------|----------|------|-------|--------|-------|------|
| apricot     | 0.03 | 0.03 | 0.05 | 0.03 | 0.05 | 0.20 |
| pineapple   | 0.03 | 0.03 | 0.05 | 0.03 | 0.05 | 0.20 |
| digital     | 0.07 | 0.05 | 0.03 | 0.05 | 0.03 | 0.24 |
| information | 0.05 | 0.14 | 0.03 | 0.10 | 0.03 | 0.36 |
| p(context)  | 0.19 | 0.25 | 0.17 | 0.22 | 0.17 |      |
## PPMI versus add-2 smoothed PPMI

| PPMI(w,context) | computer | data | pinch | result | sugar |
|-----------------|----------|------|-------|--------|-------|
| apricot     | -    | -    | 2.25 | -    | 2.25 |
| pineapple   | -    | -    | 2.25 | -    | 2.25 |
| digital     | 1.66 | 0.00 | -    | 0.00 | -    |
| information | 0.00 | 0.57 | -    | 0.47 | -    |

| PPMI(w,context) [add-2] | computer | data | pinch | result | sugar |
|-------------------------|----------|------|-------|--------|-------|
| apricot     | 0.00 | 0.00 | 0.56 | 0.00 | 0.56 |
| pineapple   | 0.00 | 0.00 | 0.56 | 0.00 | 0.56 |
| digital     | 0.62 | 0.00 | 0.00 | 0.00 | 0.00 |
| information | 0.00 | 0.58 | 0.00 | 0.37 | 0.00 |
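A quick sketch of how the smoothing is applied, reusing the `ppmi` helper and count matrix `F` from the earlier sketch (an illustration, not the slides' code):

```python
def ppmi_add_k(F, k=2.0):
    """PPMI after add-k smoothing of the raw counts."""
    return ppmi(F + k)

print(ppmi(F).round(2))           # rare pairs like (apricot, pinch) get 2.25
print(ppmi_add_k(F, 2).round(2))  # the same pairs drop to about 0.56
```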
# Vector Semantics: Measuring similarity, the cosine
## Measuring similarity

• Given 2 target words v and w
• We'll need a way to measure their similarity.
• Most measures of vector similarity are based on the dot product (inner product) from linear algebra:
  • High when two vectors have large values in the same dimensions.
  • Low (in fact 0) for orthogonal vectors with zeros in complementary distribution
## Problem with the dot product

• The dot product is larger if the vector is longer. Vector length:

$$|\vec{v}| = \sqrt{\sum_{i=1}^{N} v_i^2}$$

• Vectors are longer if they have higher values in each dimension
  • That means more frequent words will have higher dot products
  • That's bad: we don't want a similarity metric to be sensitive to word frequency
## Solution: cosine

• Just divide the dot product by the lengths of the two vectors!
• This turns out to be the cosine of the angle between them!
## Cosine for computing similarity

$$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$$

(The dot product over the product of the vector lengths; equivalently, the dot product of the two unit vectors.)

v_i is the PPMI value for word v in context i; w_i is the PPMI value for word w in context i.
cos(v,w) is the cosine similarity of v and w.
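A direct transcription of this formula into NumPy (a sketch, not the slides' code):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: the dot product of the two vectors
    divided by the product of their lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```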
## Cosine as a similarity metric

• −1: vectors point in opposite directions
• +1: vectors point in the same direction
• 0: vectors are orthogonal

• Raw frequency and PPMI values are non-negative, so the cosine ranges from 0 to 1
## Which pair of words is more similar?

|             | large | data | computer |
|-------------|-------|------|----------|
| apricot     | 2     | 0    | 0        |
| digital     | 0     | 1    | 2        |
| information | 1     | 6    | 1        |

$$\cos(\text{apricot}, \text{information}) = \frac{2 \cdot 1 + 0 \cdot 6 + 0 \cdot 1}{\sqrt{4+0+0}\,\sqrt{1+36+1}} = \frac{2}{2\sqrt{38}} \approx .16$$

$$\cos(\text{digital}, \text{information}) = \frac{0 \cdot 1 + 1 \cdot 6 + 2 \cdot 1}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} \approx .58$$

$$\cos(\text{apricot}, \text{digital}) = \frac{0 + 0 + 0}{\sqrt{4}\,\sqrt{5}} = 0$$
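These values are easy to check mechanically; a small verification sketch using the `cosine` helper defined above:

```python
apricot = np.array([2, 0, 0])
digital = np.array([0, 1, 2])
information = np.array([1, 6, 1])

print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, digital), 2))      # 0.0
```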
## Visualizing vectors and angles

|             | large | data |
|-------------|-------|------|
| apricot     | 2     | 0    |
| digital     | 0     | 1    |
| information | 1     | 6    |

[Figure: *apricot*, *digital*, and *information* plotted as vectors, with dimension 1 ('large') on one axis and dimension 2 ('data') on the other; the angle between vectors reflects their similarity.]
## Clustering vectors to visualize similarity in co-occurrence matrices

From Rohde, Gonnerman, and Plaut, "Modeling Word Meaning Using Lexical Co-Occurrence":

[Figure 8: Multidimensional scaling for three noun classes (body parts, animals, and places).]
[Figure 9: Hierarchical clustering for three noun classes using distances based on vector correlations.]

Rohde et al. (2006)
## Other possible similarity measures
## Evaluating similarity (the same as for thesaurus-based methods)

• Intrinsic evaluation:
  • Correlation between algorithm and human word similarity ratings
• Extrinsic (task-based, end-to-end) evaluation:
  • Spelling error detection, WSD, essay grading
  • Taking TOEFL multiple-choice vocabulary tests:
    *Levied* is closest in meaning to which of these: imposed, believed, requested, correlated
## Alternative to PPMI for measuring association

• tf-idf (that's a hyphen, not a minus sign)
• The combination of two factors:
  • Term frequency (Luhn 1957): frequency of the word (can be logged)
  • Inverse document frequency (IDF) (Sparck Jones 1972):

$$\text{idf}_i = \log\left(\frac{N}{\text{df}_i}\right)$$

  • N is the total number of documents
  • df_i = "document frequency of word i" = number of documents containing word i
• The tf-idf weight of word i in document j:

$$w_{ij} = \text{tf}_{ij} \times \text{idf}_i$$
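A minimal tf-idf sketch (assumptions: raw term frequency and natural log; the slide notes that tf can also be logged):

```python
import math

def tf_idf(term_doc_counts):
    """Weight a term-document count table (dict of word -> list of
    per-document counts) by tf-idf as defined above."""
    n_docs = len(next(iter(term_doc_counts.values())))
    weighted = {}
    for word, counts in term_doc_counts.items():
        df = sum(1 for c in counts if c > 0)        # document frequency
        idf = math.log(n_docs / df) if df else 0.0
        weighted[word] = [c * idf for c in counts]  # tf * idf
    return weighted

# The Shakespeare term-document counts from earlier
counts = {"battle": [1, 1, 8, 15], "soldier": [2, 2, 12, 36],
          "fool": [37, 58, 1, 5], "clown": [6, 117, 0, 0]}
print(tf_idf(counts)["clown"])   # idf = log(4/2) > 0 keeps "clown" weighted
print(tf_idf(counts)["battle"])  # appears in all 4 docs, so idf = 0
```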
## tf-idf is not generally used for word-word similarity

• But it is by far the most common weighting when we are considering the relationship of words to documents
# Vector Semantics: Dense Vectors
## Sparse versus dense vectors

• PPMI vectors are
  • long (length |V| = 20,000 to 50,000)
  • sparse (most elements are zero)
• Alternative: learn vectors which are
  • short (length 200 to 1000)
  • dense (most elements are non-zero)
## Sparse versus dense vectors

• Why dense vectors?
  • Short vectors may be easier to use as features in machine learning (fewer weights to tune)
  • Dense vectors may generalize better than storing explicit counts
  • They may do better at capturing synonymy:
    • *car* and *automobile* are synonyms, but are represented as distinct dimensions; this fails to capture the similarity between a word with *car* as a neighbor and a word with *automobile* as a neighbor
## Three methods for getting short dense vectors

• Singular Value Decomposition (SVD)
  • A special case of this is called LSA: Latent Semantic Analysis
• "Neural language model"-inspired predictive models
  • skip-grams and CBOW
• Brown clustering
# Vector Semantics: Dense Vectors via SVD
## Intuition

• Approximate an N-dimensional dataset using fewer dimensions
• By first rotating the axes into a new space
  • in which the highest-order dimension captures the most variance in the original dataset
  • and the next dimension captures the next most variance, etc.
• Many such (related) methods:
  • PCA: principal components analysis
  • Factor analysis
  • SVD
## Dimensionality reduction
## Singular Value Decomposition

Any rectangular w × c matrix X equals the product of 3 matrices:

• W: rows corresponding to the original rows, but its m columns each represent a dimension in a new latent space, such that
  • the m column vectors are orthogonal to each other
  • the columns are ordered by the amount of variance in the dataset each new dimension accounts for
• S: diagonal m × m matrix of singular values expressing the importance of each dimension
• C: columns corresponding to the original columns, but with m rows corresponding to the singular values
## Singular Value Decomposition

$$X = W\,S\,C$$

Landauer and Dumais 1997
## SVD applied to term-document matrix: Latent Semantic Analysis

• Instead of keeping all m dimensions, we just keep the top k singular values. Let's say 300.
• The result is a least-squares approximation to the original X.

Deerwester et al. (1988)
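A sketch of truncated SVD with NumPy (illustrative only, not the slides' code); `X` is any term-document count or tf-idf matrix and `k` the number of dimensions kept:

```python
import numpy as np

def truncated_svd(X, k):
    """Factor X = W @ diag(S) @ C, then keep only the top-k
    singular values/dimensions (a least-squares approximation)."""
    W, S, C = np.linalg.svd(X, full_matrices=False)
    return W[:, :k], S[:k], C[:k, :]

# The Shakespeare term-document matrix from earlier
X = np.array([[1, 1, 8, 15], [2, 2, 12, 36],
              [37, 58, 1, 5], [6, 117, 0, 0]], dtype=float)
W_k, S_k, C_k = truncated_svd(X, k=2)
X_hat = W_k @ np.diag(S_k) @ C_k   # rank-2 approximation of X
```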
## LSA: more details

• 300 dimensions are commonly used
• The cells are commonly weighted by a product of two weights
  • Local weight: log term frequency
  • Global weight: either idf or an entropy measure
## Let's return to PPMI word-word matrices

• Can we apply SVD to them?
## SVD applied to term-term matrix

(I'm simplifying here by assuming the matrix has rank |V|)
## Truncated SVD on term-term matrix
## Truncated SVD produces embeddings

• Each row of the W matrix is a k-dimensional representation of each word w
• k might range from 50 to 1000
• Generally we keep the top k dimensions, but some experiments suggest that getting rid of the top 1 dimension or even the top 50 dimensions is helpful (Lapesa and Evert 2014).
## Embeddings versus sparse vectors

• Dense SVD embeddings sometimes work better than sparse PPMI matrices at tasks like word similarity
  • Denoising: low-order dimensions may represent unimportant information
  • Truncation may help the models generalize better to unseen data.
  • Having a smaller number of dimensions may make it easier for classifiers to properly weight the dimensions for the task.
  • Dense models may do better at capturing higher-order co-occurrence.
# Vector Semantics: Embeddings inspired by neural language models: skip-grams and CBOW
## Prediction-based models: an alternative way to get dense vectors

• Skip-gram (Mikolov et al. 2013a), CBOW (Mikolov et al. 2013b)
• Learn embeddings as part of the process of word prediction.
  • Train a neural network to predict neighboring words
  • Inspired by neural net language models.
  • In so doing, learn dense embeddings for the words in the training corpus.
• Advantages:
  • Fast, easy to train (much faster than SVD)
  • Available online in the word2vec package
  • Including sets of pretrained embeddings!
## Skip-grams

• Predict each neighboring word
  • in a context window of 2C words
  • from the current word.
• So for C = 2, we are given word w_t and predict these 4 words: w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}
## Skip-grams learn 2 embeddings for each w

• Input embedding v, in the input matrix W
  • Column i of the input matrix W is the d × 1 embedding v_i for word i in the vocabulary.
• Output embedding v′, in the output matrix W′
  • Row i of the output matrix W′ is the 1 × d vector embedding v′_i for word i in the vocabulary.
## Setup

• Walking through the corpus, pointing at word w(t), whose index in the vocabulary is j, so we'll call it w_j (1 ≤ j ≤ |V|).
• Let's predict w(t+1), whose index in the vocabulary is k (1 ≤ k ≤ |V|). Hence our task is to compute P(w_k | w_j).
## One-hot vectors

• A vector of length |V|
• 1 for the target word and 0 for other words
• So if "popsicle" is vocabulary word 5, the one-hot vector is [0, 0, 0, 0, 1, 0, 0, 0, 0, …, 0]
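Why one-hot vectors matter here: multiplying the input matrix by a one-hot vector just selects that word's embedding column, which is the h = v_j step on the next slides. A toy sketch (the vocabulary size and d are made up for illustration):

```python
import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))       # input embedding matrix: column i = v_i

j = 4                             # 0-based index of vocabulary word 5 ("popsicle")
x = np.zeros(V)
x[j] = 1.0                        # one-hot vector for word j

h = W @ x                         # hidden layer: just column j of W
assert np.allclose(h, W[:, j])    # h = v_j
```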
## Skip-gram

[Figure: skip-gram network architecture.]

Computation through the network, for input word w_j and output word w_k:

• Hidden layer: h = v_j (the input embedding of w_j)
• Output scores: o = W′h
• Score for word k: o_k = v′_k · h = v′_k · v_j
## Turning outputs into probabilities

• o_k = v′_k · v_j
• We use the softmax to turn the scores into probabilities:

$$p(w_k \mid w_j) = \frac{\exp(v'_k \cdot v_j)}{\sum_{i=1}^{|V|} \exp(v'_i \cdot v_j)}$$
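A sketch of the full forward pass, reusing the toy `W`, `V`, `d`, and `rng` from the one-hot sketch (`W_out` plays the role of W′; none of this is the slides' code):

```python
def skipgram_probs(W_in, W_out, j):
    """P(w_k | w_j) for all k: softmax over the output scores o = W' v_j."""
    h = W_in[:, j]                # v_j: input embedding of word j
    o = W_out @ h                 # o_k = v'_k . v_j for every k
    e = np.exp(o - o.max())       # subtract max for numerical stability
    return e / e.sum()

W_out = rng.normal(size=(V, d))   # output matrix W': row k = v'_k
p = skipgram_probs(W, W_out, j=4)
print(p.sum())                    # 1.0: a proper probability distribution
```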
## Embeddings from W and W′

• Since we have two embeddings, v_j and v′_j, for each word w_j
• We can either:
  • Just use v_j
  • Sum them
  • Concatenate them to make a double-length embedding
## But wait; how do we learn the embeddings?
## Relation between skip-grams and PMI!

• If we multiply W W′ᵀ, we get a |V| × |V| matrix M, each entry m_ij corresponding to some association between input word i and output word j
• Levy and Goldberg (2014b) show that skip-gram reaches its optimum just when this matrix is a shifted version of PMI:

$$W W'^{\top} = M^{\text{PMI}} - \log k$$

  (here k is the number of negative samples)
• So skip-gram is implicitly factoring a shifted version of the PMI matrix into the two embedding matrices.
## CBOW (Continuous Bag of Words)

[Figure: CBOW architecture, the mirror image of the skip-gram: the context words are used to predict the current word.]
## Properties of embeddings

• Nearest words to some embeddings (Mikolov et al. 2013)
## Embeddings capture relational meaning!

• vector("king") − vector("man") + vector("woman") ≈ vector("queen")
• vector("Paris") − vector("France") + vector("Italy") ≈ vector("Rome")
# Vector Semantics: Evaluating similarity
## Evaluating similarity

• Extrinsic (task-based, end-to-end) evaluation:
  • Question answering
  • Spell checking
  • Essay grading
• Intrinsic evaluation:
  • Correlation between algorithm and human word similarity ratings
    • WordSim-353: 353 noun pairs rated 0-10. sim(plane, car) = 5.77
  • Taking TOEFL multiple-choice vocabulary tests
    • *Levied* is closest in meaning to: imposed, believed, requested, correlated
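A sketch of the intrinsic evaluation using SciPy's Spearman rank correlation, as is standard for WordSim-353-style comparisons (the human ratings below are invented for illustration; the vectors and `cosine` helper come from the earlier sketches):

```python
from scipy.stats import spearmanr

pairs = [("apricot", "information"), ("digital", "information"),
         ("apricot", "digital")]
human = [1.0, 6.5, 0.5]           # hypothetical 0-10 similarity ratings
vecs = {"apricot": apricot, "digital": digital, "information": information}
model = [cosine(vecs[a], vecs[b]) for a, b in pairs]
rho, _ = spearmanr(human, model)
print(rho)                        # rank correlation with human judgments
```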
## Summary

• Distributional (vector) models of meaning
  • Sparse (PPMI-weighted word-word co-occurrence matrices)
  • Dense:
    • Word-word SVD, 50-2000 dimensions
    • Skip-grams and CBOW (embeddings available in word2vec)
## A great semantic vector space for documents

• Words have low-dimensional embeddings, useful for many computational linguistic applications
• Documents are a weighted combination of words
  • documents as a vector in the low-dimensional space
• This allows:
  • semantic document clustering (k-means, hierarchical, etc.)
  • search for similar documents (prior art in patents, etc.)