
Page 1

Lecture 7: Word Embeddings

Kai-Wei Chang, CS @ University of Virginia

[email protected]

Course webpage: http://kwchang.net/teaching/NLP16

6501 Natural Language Processing 1

Page 2

This lecture

• Learning word vectors (cont.)

• Representation learning in NLP

6501 Natural Language Processing 2

Page 3

Recap: Latent Semantic Analysis

• Data representation
  • Encode single-relational data in a matrix
    • Co-occurrence (e.g., from a general corpus)
    • Synonyms (e.g., from a thesaurus)

• Factorization
  • Apply SVD to the matrix to find latent components

• Measuring degree of relation
  • Cosine of latent vectors

Page 4

Recap: Mapping to Latent Space via SVD

• SVD generalizes the original data
  • Uncovers relationships not explicit in the thesaurus
  • Term vectors projected to a k-dimensional latent space

• Word similarity: cosine of two column vectors in ΣV^T

  C   ≈   U      Σ      V^T
  (d×n)   (d×k)  (k×k)  (k×n)
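To make the factorization concrete, here is a minimal LSA sketch on made-up counts (the words, the context rows, and k = 2 are illustrative assumptions, not the lecture's data): build a small d×n matrix, take the SVD, keep the top k dimensions, and compare words by the cosine of their columns in Σ_k V_k^T.

```python
import numpy as np

# Hypothetical counts of n = 4 words under d = 3 contexts (weather, vehicles, other).
words = ["sunny", "rainy", "car", "wheel"]
C = np.array([[4., 5., 0., 1.],
              [0., 1., 5., 4.],
              [1., 0., 1., 1.]])                  # d x n = 3 x 4

U, S, Vt = np.linalg.svd(C, full_matrices=False)  # C = U @ diag(S) @ Vt
k = 2
latent = np.diag(S[:k]) @ Vt[:k, :]               # Sigma_k V_k^T, shape k x n

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(latent[:, 0], latent[:, 1]))   # sunny vs rainy: high (shared contexts)
print(cos(latent[:, 0], latent[:, 2]))   # sunny vs car: low
```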

Page 5

Low rank approximation

• Frobenius norm (for an m×n matrix C):

  ||C||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} |c_ij|^2 )

• Rank of a matrix
  • How many vectors in the matrix are linearly independent of each other

6501 Natural Language Processing 5
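Both definitions are easy to check numerically on a toy matrix (the values below are made up):

```python
import numpy as np

C = np.array([[1., 2.],
              [2., 4.],
              [0., 1.]])                 # m x n = 3 x 2

fro = np.sqrt((C ** 2).sum())            # Frobenius norm, straight from the definition
print(fro, np.linalg.norm(C, "fro"))     # same value, two ways
print(np.linalg.matrix_rank(C))          # 2: row 2 = 2 * row 1, but row 3 adds
                                         # a new independent direction
```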

Page 6

Low rank approximation

• Low-rank approximation problem:

  min_X ||C − X||_F   s.t.  rank(X) = k

• If I can only use k independent vectors to describe the points in the space, what are the best choices?

Essentially, we minimize the "reconstruction loss" under a low-rank constraint.

6501 Natural Language Processing 6


Page 8

Low rank approximation

• Assume the rank of C is r
• SVD: C = UΣV^T, with Σ = diag(σ_1, σ_2, …, σ_r, 0, 0, …, 0)

  Σ = [ σ_1  0  0
        0    ⋱  0
        0    0  0 ]     (r non-zeros on the diagonal)

• Zero out the r − k trailing values:

  Σ′ = diag(σ_1, σ_2, …, σ_k, 0, 0, …, 0)

• C_k = UΣ′V^T is the best rank-k approximation:

  C_k = argmin_X ||C − X||_F   s.t.  rank(X) = k

6501 Natural Language Processing 8
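This recipe (the Eckart-Young result) can be verified directly; the sketch below uses random data and an arbitrary k, zeroes the trailing singular values, and checks that the Frobenius error equals the energy in the discarded σ_i.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 5))                           # arbitrary data

U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
S_k = np.concatenate([S[:k], np.zeros(len(S) - k)])   # zero the trailing values
C_k = U @ np.diag(S_k) @ Vt                           # best rank-k approximation

print(np.linalg.matrix_rank(C_k))                     # k = 2
print(np.linalg.norm(C - C_k, "fro"))                 # reconstruction loss ...
print(np.sqrt((S[k:] ** 2).sum()))                    # ... equals sqrt(sum of discarded sigma_i^2)
```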

Page 9

Word2Vec

• LSA: a compact representation of the co-occurrence matrix

• Word2Vec: predict surrounding words (skip-gram)
  • Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)

• Easy to incorporate new words or sentences

6501 Natural Language Processing 9
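As a usage sketch only (the lecture does not prescribe a library): with gensim installed, a skip-gram model can be trained in a few lines. The parameter names below follow gensim 4.x, and the toy sentences are made up.

```python
from gensim.models import Word2Vec

sentences = [["the", "weather", "is", "sunny"],
             ["the", "weather", "is", "rainy"],
             ["the", "car", "has", "a", "wheel"]]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1,
                 sg=1,          # sg=1 -> skip-gram (sg=0 would be CBOW)
                 negative=5)    # negative sampling, discussed later

print(model.wv["sunny"].shape)                 # (50,)
print(model.wv.similarity("sunny", "rainy"))   # cosine similarity of the two vectors
# New sentences can be folded in later with model.build_vocab(more, update=True)
# followed by another model.train(...) call.
```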

Page 10

Word2Vec

• Similar to a language model, but predicting the next word is not the goal

• Idea: words that are semantically similar often occur near each other in text
  • Embeddings that are good at predicting neighboring words are also good at representing similarity

6501 Natural Language Processing 10

Page 11

Skip-gram vs. Continuous bag-of-words

• What are the differences?

6501 Natural Language Processing 11

Page 12

Skip-gram vs. Continuous bag-of-words

6501 Natural Language Processing 12

Page 13

Objective of Word2Vec (Skip-gram)

• Maximize the log likelihood of the context words w_{t−m}, w_{t−m+1}, …, w_{t−1}, w_{t+1}, w_{t+2}, …, w_{t+m} given the center word w_t

• m is usually 5–10

6501 Natural Language Processing 13
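The objective is easiest to see through the training pairs it induces: every word within m positions of the center word becomes a word to predict. A small sketch (the sentence and window size are illustrative only):

```python
def skipgram_pairs(tokens, m):
    """Return all (center, context) pairs with the context within m positions."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))   # (w_t, w_{t+j})
    return pairs

print(skipgram_pairs(["I", "like", "natural", "language", "processing"], m=2))
# includes ('natural', 'I'), ('natural', 'like'), ('natural', 'language'), ...
```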

Page 14

Objective of Word2Vec (Skip-gram)

• How do we model log P(w_{t+j} | w_t)?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w′} exp(u_{w′} · v_{w_t})

• The softmax function, again!

• Every word has two vectors:
  • v_w: when w is the center word
  • u_w: when w is the outside word (context word)

6501 Natural Language Processing 14
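A direct rendering of this softmax with toy sizes and random parameters (the vocabulary size, dimension, and chosen center word are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                        # vocabulary size, embedding dimension
U = rng.normal(size=(V, d))        # U[w] = u_w (outside / context vectors)
Vc = rng.normal(size=(V, d))       # Vc[w] = v_w (center vectors)

def p_context_given_center(center):
    scores = U @ Vc[center]        # u_w . v_center for every word w
    scores -= scores.max()         # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

probs = p_context_given_center(center=2)
print(probs, probs.sum())          # a proper distribution over the vocabulary
```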

Page 15

How to update?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w′} exp(u_{w′} · v_{w_t})

• How do we minimize J(θ)?
  • Gradient descent!
• How do we compute the gradient?

6501 Natural Language Processing 15

Page 16

Recap: Calculus

6501 Natural Language Processing 16

• Gradient: for x^T = [x_1  x_2  x_3],

  ∇φ(x) = [ ∂φ(x)/∂x_1
            ∂φ(x)/∂x_2
            ∂φ(x)/∂x_3 ]

• If φ(x) = a · x (also written a^T x), then ∇φ(x) = a
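A quick finite-difference sanity check of the identity ∇(a · x) = a, on arbitrary vectors:

```python
import numpy as np

a = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 3.0, -2.0])
phi = lambda x: a @ x                       # phi(x) = a . x

eps = 1e-6
grad = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)])       # central differences, one per coordinate
print(grad, a)                              # numerically identical
```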

Page 17

Recap: Calculus

6501 Natural Language Processing 17

• Chain rule: if y = f(u) and u = g(x) (i.e., y = f(g(x))), then

  dy/dx = (dy/du)(du/dx) = (df(u)/du)(dg(x)/dx)

• Exercises:
  1. y = (x^… + 6)^3
  2. y = ln(x^2 + 5)
  3. y = exp(x^… + 3x + 2)

Page 18

Other useful formulas

• y = exp(x)  ⟹  dy/dx = exp(x)

• y = log(x)  ⟹  dy/dx = 1/x

6501 Natural Language Processing 18

When I say log (in this course), usually I mean ln.

Page 19

6501 Natural Language Processing 19

Page 20

Example

• Assume the vocabulary set is W. We have one center word c and one context word o.

• What is the conditional probability p(o | c)?

  p(o | c) = exp(u_o · v_c) / Σ_{w′∈W} exp(u_{w′} · v_c)

• What is the gradient of the log likelihood w.r.t. v_c?

  ∂ log p(o | c) / ∂v_c = u_o − E_{w∼P(w|c)}[u_w]

6501 Natural Language Processing 20
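This gradient can be checked numerically: u_o − E_{w∼P(w|c)}[u_w] is just u_o minus the probability-weighted average of the outside vectors. The sketch below compares it against a central-difference estimate on random toy parameters (sizes and word indices are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 5, 3
U = rng.normal(size=(V, d))               # outside vectors u_w
v_c = rng.normal(size=d)                  # center vector of word c
o = 2                                     # index of the observed context word

def log_p(o, v_c):
    scores = U @ v_c
    return scores[o] - np.log(np.exp(scores).sum())   # log softmax

p = np.exp(U @ v_c)
p /= p.sum()                              # p(w | c) for every w
analytic = U[o] - p @ U                   # u_o - E_{w ~ p(w|c)}[u_w]

eps = 1e-6
numeric = np.array([(log_p(o, v_c + eps * e) - log_p(o, v_c - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.allclose(analytic, numeric, atol=1e-5))      # True
```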

Page 21

Gradient Descent

  min_w J(w)

  Update w:  w ← w − η ∇J(w)

6501 Natural Language Processing 21
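A minimal instance of the update rule on a one-dimensional convex objective, J(w) = (w − 3)^2, with a made-up learning rate:

```python
w, eta = 0.0, 0.1          # initial point and learning rate (both arbitrary)
for _ in range(100):
    grad = 2 * (w - 3)     # dJ/dw for J(w) = (w - 3)^2
    w = w - eta * grad     # w <- w - eta * grad J(w)
print(w)                   # very close to the minimizer w* = 3
```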

Page 22

Local minimum vs. global minimum

6501 Natural Language Processing 22

Page 23

Stochastic gradient descent

• Let J(w) = (1/n) Σ_{i=1}^{n} J_i(w)

• Gradient descent update rule:

  w ← w − (η/n) Σ_{i=1}^{n} ∇J_i(w)

• Stochastic gradient descent:
  • Approximate (1/n) Σ_{i=1}^{n} ∇J_i(w) by the gradient at a single example, ∇J_i(w) (why?)
  • At each step: randomly pick an example i, then update

    w ← w − η ∇J_i(w)

6501 Natural Language Processing 23
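A tiny SGD run, assuming (for illustration) J_i(w) = (w − x_i)^2 on made-up data, so the minimizer of J is the mean of the x_i; each step follows the gradient of one randomly chosen example:

```python
import random

random.seed(0)
x = [1.0, 2.0, 3.0, 6.0]               # n = 4 examples; their mean is 3.0
w, eta = 0.0, 0.05

for _ in range(2000):
    i = random.randrange(len(x))       # randomly pick an example i
    grad_i = 2 * (w - x[i])            # gradient of J_i alone
    w -= eta * grad_i                  # w <- w - eta * grad J_i(w)
print(w)                               # hovers around the minimizer 3.0
```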

Page 24

Negative sampling

• With a large vocabulary set, stochastic gradient descent is still not enough (why?)

  ∂ log p(o | c) / ∂v_c = u_o − E_{w∼P(w|c)}[u_w]

• Let's approximate it again!
  • Only sample a few words that do not appear in the context
  • Essentially, put more weight on positive samples

6501 Natural Language Processing 24
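A sketch of the negative-sampling surrogate for a single (center c, context o) pair, with random toy parameters: the score of the observed pair is pushed up, and only k sampled words are pushed down, instead of normalizing over the whole vocabulary. (word2vec draws negatives from a smoothed unigram distribution; the uniform sampling below is a simplification.)

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                  # vocabulary, dimension, negatives per pair
U = rng.normal(size=(V, d)) * 0.1      # outside vectors
Vc = rng.normal(size=(V, d)) * 0.1     # center vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(c, o):
    negs = rng.integers(0, V, size=k)              # sampled "negative" words
    pos = np.log(sigmoid(U[o] @ Vc[c]))            # push u_o . v_c up
    neg = np.log(sigmoid(-U[negs] @ Vc[c])).sum()  # push u_neg . v_c down
    return -(pos + neg)                            # loss to minimize

print(neg_sampling_loss(c=3, o=7))
```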

Page 25

More about Word2Vec – relation to LSA

• LSA factorizes a matrix of co-occurrence counts

• Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

• PMI(w, c) = log [ P(w|c) / P(w) ]
            = log [ P(w, c) / (P(w) P(c)) ]
            = log [ #(w, c) · |D| / (#(w) · #(c)) ]

6501 Natural Language Processing 25
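The PMI matrix is easy to compute from a table of (word, context) counts; the sketch below uses made-up counts, and the shift is log k for k negative samples (k = 5 here is just an example value).

```python
import numpy as np

counts = np.array([[10.,  2.,  0.],
                   [ 3., 12.,  1.],
                   [ 0.,  1.,  8.]])          # rows: words w, columns: contexts c

D = counts.sum()                              # |D|: total number of (w, c) pairs
w_counts = counts.sum(axis=1, keepdims=True)  # #(w)
c_counts = counts.sum(axis=0, keepdims=True)  # #(c)

with np.errstate(divide="ignore"):            # log 0 -> -inf for unseen pairs
    pmi = np.log(counts * D / (w_counts * c_counts))

k = 5                                         # number of negative samples
shifted_pmi = pmi - np.log(k)                 # the matrix skip-gram implicitly factorizes
print(shifted_pmi)
```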

Page 26

All problems solved?

6501 Natural Language Processing 26

Page 27

Continuous Semantic Representations

[Figure: words embedded in a continuous semantic space, with related words clustered together: sunny, rainy, windy, cloudy; car, wheel, cab; sad, joy, emotion, feeling]

6501 Natural Language Processing 27

Page 28

Semantics Needs More Than Similarity

Tomorrow will be rainy.

Tomorrow will be sunny.

𝑠𝑖𝑚𝑖𝑙𝑎𝑟(rainy, sunny)?

𝑎𝑛𝑡𝑜𝑛𝑦𝑚(rainy, sunny)?

6501 Natural Language Processing 28

Page 29

Polarity Inducing LSA [Yih, Zweig, Platt 2012]

• Data representation
  • Encode two opposite relations in a matrix using "polarity"
    • Synonyms & antonyms (e.g., from a thesaurus)

• Factorization
  • Apply SVD to the matrix to find latent components

• Measuring degree of relation
  • Cosine of latent vectors

Page 30

Encode Synonyms & Antonyms in a Matrix

• Joyfulness: joy, gladden; sorrow, sadden
• Sad: sorrow, sadden; joy, gladden

                        joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"    1       1        -1       -1        0
Group 2: "sad"          -1      -1         1        1        0
Group 3: "affection"     0       0         0        0        1

Inducing polarity: antonyms get negative entries

Cosine score: positive for synonyms

Target word: row-vector

Page 31

Encode Synonyms & Antonyms in a Matrix

• Joyfulness: joy, gladden; sorrow, sadden
• Sad: sorrow, sadden; joy, gladden

                        joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"    1       1        -1       -1        0
Group 2: "sad"          -1      -1         1        1        0
Group 3: "affection"     0       0         0        0        1

Inducing polarity: antonyms get negative entries

Cosine score: negative for antonyms

Target word: row-vector
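The sign pattern on these two slides is easy to reproduce: treating words as columns of the small matrix above, apply SVD and compare latent word vectors by cosine; synonyms come out positive and antonyms negative. A minimal sketch (k = 2 is the full rank of this toy matrix):

```python
import numpy as np

words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
M = np.array([[ 1,  1, -1, -1,  0],    # Group 1: "joyfulness"
              [-1, -1,  1,  1,  0],    # Group 2: "sad"
              [ 0,  0,  0,  0,  1]],   # Group 3: "affection"
             dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
latent = np.diag(S[:k]) @ Vt[:k, :]    # word vectors = columns of Sigma_k V_k^T

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(latent[:, 0], latent[:, 1]))   # joy vs gladden: +1 (synonyms)
print(cos(latent[:, 0], latent[:, 2]))   # joy vs sorrow: -1 (antonyms)
```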

Page 32

Continuous representations for entities

[Figure: entity embeddings and the relations among Michelle Obama, George W. Bush, Laura Bush, the Democratic Party, and the Republican Party, with one relation left as "?"]

6501 Natural Language Processing 32

Page 33

Continuous representations for entities

6501 Natural Language Processing 33

• Useful resources for NLP applications
  • Semantic Parsing & Question Answering
  • Information Extraction