
Page 1

Lecture 7: Word Embeddings

Kai-Wei Chang, CS @ University of Virginia

[email protected]

Course webpage: http://kwchang.net/teaching/NLP16

6501 Natural Language Processing 1

Page 2

This lecture

• Learning word vectors (cont.)

• Representation learning in NLP

6501 Natural Language Processing 2

Page 3

Recap: Latent Semantic Analysis

• Data representation
  • Encode single-relational data in a matrix
    • Co-occurrence (e.g., from a general corpus)
    • Synonyms (e.g., from a thesaurus)

• Factorization
  • Apply SVD to the matrix to find latent components

• Measuring degree of relation
  • Cosine of latent vectors

Page 4

Recap: Mapping to Latent Space via SVD

• SVD generalizes the original data
  • Uncovers relationships not explicit in the thesaurus
  • Term vectors projected to a k-dimensional latent space

• Word similarity: cosine of two column vectors in ΣV^T

  C   ≈   U      Σ      V^T
  (d×n)   (d×k)  (k×k)  (k×n)
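To make the factorization concrete, here is a minimal LSA sketch on made-up counts (the words, the context rows, and k = 2 are illustrative assumptions, not the lecture's data): build a small d×n matrix, take the SVD, keep the top k dimensions, and compare words by the cosine of their columns in Σ_k V_k^T.

```python
import numpy as np

# Hypothetical counts of n = 4 words under d = 3 contexts (weather, vehicles, other).
words = ["sunny", "rainy", "car", "wheel"]
C = np.array([[4., 5., 0., 1.],
              [0., 1., 5., 4.],
              [1., 0., 1., 1.]])                  # d x n = 3 x 4

U, S, Vt = np.linalg.svd(C, full_matrices=False)  # C = U @ diag(S) @ Vt
k = 2
latent = np.diag(S[:k]) @ Vt[:k, :]               # Sigma_k V_k^T, shape k x n

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(latent[:, 0], latent[:, 1]))   # sunny vs rainy: high (shared contexts)
print(cos(latent[:, 0], latent[:, 2]))   # sunny vs car: low
```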

Page 5

Low rank approximation

• Frobenius norm (for an m×n matrix C):

  ||C||_F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} |c_ij|^2 )

• Rank of a matrix
  • How many vectors in the matrix are linearly independent of each other

6501 Natural Language Processing 5
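Both definitions are easy to check numerically on a toy matrix (the values below are made up):

```python
import numpy as np

C = np.array([[1., 2.],
              [2., 4.],
              [0., 1.]])                 # m x n = 3 x 2

fro = np.sqrt((C ** 2).sum())            # Frobenius norm, straight from the definition
print(fro, np.linalg.norm(C, "fro"))     # same value, two ways
print(np.linalg.matrix_rank(C))          # 2: row 2 = 2 * row 1, but row 3 adds
                                         # a new independent direction
```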

Page 6

Low rank approximation

• Low-rank approximation problem:

  min_X ||C − X||_F   s.t.  rank(X) = k

• If I can only use k independent vectors to describe the points in the space, what are the best choices?

Essentially, we minimize the "reconstruction loss" under a low-rank constraint.

6501 Natural Language Processing 6


Page 8

Low rank approximation

• Assume the rank of C is r
• SVD: C = UΣV^T, with Σ = diag(σ_1, σ_2, …, σ_r, 0, 0, …, 0)

  Σ = [ σ_1  0  0
        0    ⋱  0
        0    0  0 ]     (r non-zeros on the diagonal)

• Zero out the r − k trailing values:

  Σ′ = diag(σ_1, σ_2, …, σ_k, 0, 0, …, 0)

• C_k = UΣ′V^T is the best rank-k approximation:

  C_k = argmin_X ||C − X||_F   s.t.  rank(X) = k

6501 Natural Language Processing 8
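This recipe (the Eckart-Young result) can be verified directly; the sketch below uses random data and an arbitrary k, zeroes the trailing singular values, and checks that the Frobenius error equals the energy in the discarded σ_i.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(6, 5))                           # arbitrary data

U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
S_k = np.concatenate([S[:k], np.zeros(len(S) - k)])   # zero the trailing values
C_k = U @ np.diag(S_k) @ Vt                           # best rank-k approximation

print(np.linalg.matrix_rank(C_k))                     # k = 2
print(np.linalg.norm(C - C_k, "fro"))                 # reconstruction loss ...
print(np.sqrt((S[k:] ** 2).sum()))                    # ... equals sqrt(sum of discarded sigma_i^2)
```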

Page 9

Word2Vec

• LSA: a compact representation of the co-occurrence matrix

• Word2Vec: predict surrounding words (skip-gram)
  • Similar to using co-occurrence counts (Levy & Goldberg 2014; Pennington et al. 2014)

• Easy to incorporate new words or sentences

6501 Natural Language Processing 9
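As a usage sketch only (the lecture does not prescribe a library): with gensim installed, a skip-gram model can be trained in a few lines. The parameter names below follow gensim 4.x, and the toy sentences are made up.

```python
from gensim.models import Word2Vec

sentences = [["the", "weather", "is", "sunny"],
             ["the", "weather", "is", "rainy"],
             ["the", "car", "has", "a", "wheel"]]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1,
                 sg=1,          # sg=1 -> skip-gram (sg=0 would be CBOW)
                 negative=5)    # negative sampling, discussed later

print(model.wv["sunny"].shape)                 # (50,)
print(model.wv.similarity("sunny", "rainy"))   # cosine similarity of the two vectors
# New sentences can be folded in later with model.build_vocab(more, update=True)
# followed by another model.train(...) call.
```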

Page 10

Word2Vec

• Similar to a language model, but predicting the next word is not the goal

• Idea: words that are semantically similar often occur near each other in text
  • Embeddings that are good at predicting neighboring words are also good at representing similarity

6501 Natural Language Processing 10

Page 11

Skip-gram vs. Continuous bag-of-words

• What are the differences?

6501 Natural Language Processing 11

Page 12

Skip-gram vs. Continuous bag-of-words

6501 Natural Language Processing 12

Page 13

Objective of Word2Vec (Skip-gram)

• Maximize the log likelihood of the context words w_{t−m}, w_{t−m+1}, …, w_{t−1}, w_{t+1}, w_{t+2}, …, w_{t+m} given the center word w_t

• m is usually 5–10

6501 Natural Language Processing 13
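The objective is easiest to see through the training pairs it induces: every word within m positions of the center word becomes a word to predict. A small sketch (the sentence and window size are illustrative only):

```python
def skipgram_pairs(tokens, m):
    """Return all (center, context) pairs with the context within m positions."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))   # (w_t, w_{t+j})
    return pairs

print(skipgram_pairs(["I", "like", "natural", "language", "processing"], m=2))
# includes ('natural', 'I'), ('natural', 'like'), ('natural', 'language'), ...
```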

Page 14

Objective of Word2Vec (Skip-gram)

• How do we model log P(w_{t+j} | w_t)?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w′} exp(u_{w′} · v_{w_t})

• The softmax function, again!

• Every word has two vectors:
  • v_w: when w is the center word
  • u_w: when w is the outside word (context word)

6501 Natural Language Processing 14
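A direct rendering of this softmax with toy sizes and random parameters (the vocabulary size, dimension, and chosen center word are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                        # vocabulary size, embedding dimension
U = rng.normal(size=(V, d))        # U[w] = u_w (outside / context vectors)
Vc = rng.normal(size=(V, d))       # Vc[w] = v_w (center vectors)

def p_context_given_center(center):
    scores = U @ Vc[center]        # u_w . v_center for every word w
    scores -= scores.max()         # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

probs = p_context_given_center(center=2)
print(probs, probs.sum())          # a proper distribution over the vocabulary
```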

Page 15

How to update?

  p(w_{t+j} | w_t) = exp(u_{w_{t+j}} · v_{w_t}) / Σ_{w′} exp(u_{w′} · v_{w_t})

• How do we minimize J(θ)?
  • Gradient descent!
• How do we compute the gradient?

6501 Natural Language Processing 15

Page 16

Recap: Calculus

6501 Natural Language Processing 16

• Gradient: for x^T = [x_1  x_2  x_3],

  ∇φ(x) = [ ∂φ(x)/∂x_1
            ∂φ(x)/∂x_2
            ∂φ(x)/∂x_3 ]

• If φ(x) = a · x (also written a^T x), then ∇φ(x) = a
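A quick finite-difference sanity check of the identity ∇(a · x) = a, on arbitrary vectors:

```python
import numpy as np

a = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 3.0, -2.0])
phi = lambda x: a @ x                       # phi(x) = a . x

eps = 1e-6
grad = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
                 for e in np.eye(3)])       # central differences, one per coordinate
print(grad, a)                              # numerically identical
```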

Page 17

Recap: Calculus

6501 Natural Language Processing 17

• Chain rule: if y = f(u) and u = g(x) (i.e., y = f(g(x))), then

  dy/dx = (dy/du)(du/dx) = (df(u)/du)(dg(x)/dx)

• Exercises:
  1. y = (x^… + 6)^3
  2. y = ln(x^2 + 5)
  3. y = exp(x^… + 3x + 2)

Page 18

Other useful formulas

• y = exp(x)  ⟹  dy/dx = exp(x)

• y = log(x)  ⟹  dy/dx = 1/x

6501 Natural Language Processing 18

When I say log (in this course), usually I mean ln.

Page 19

6501 Natural Language Processing 19

Page 20

Example

• Assume the vocabulary set is W. We have one center word c and one context word o.

• What is the conditional probability p(o | c)?

  p(o | c) = exp(u_o · v_c) / Σ_{w′∈W} exp(u_{w′} · v_c)

• What is the gradient of the log likelihood w.r.t. v_c?

  ∂ log p(o | c) / ∂v_c = u_o − E_{w∼P(w|c)}[u_w]

6501 Natural Language Processing 20
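This gradient can be checked numerically: u_o − E_{w∼P(w|c)}[u_w] is just u_o minus the probability-weighted average of the outside vectors. The sketch below compares it against a central-difference estimate on random toy parameters (sizes and word indices are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 5, 3
U = rng.normal(size=(V, d))               # outside vectors u_w
v_c = rng.normal(size=d)                  # center vector of word c
o = 2                                     # index of the observed context word

def log_p(o, v_c):
    scores = U @ v_c
    return scores[o] - np.log(np.exp(scores).sum())   # log softmax

p = np.exp(U @ v_c)
p /= p.sum()                              # p(w | c) for every w
analytic = U[o] - p @ U                   # u_o - E_{w ~ p(w|c)}[u_w]

eps = 1e-6
numeric = np.array([(log_p(o, v_c + eps * e) - log_p(o, v_c - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.allclose(analytic, numeric, atol=1e-5))      # True
```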

Page 21

Gradient Descent

  min_w J(w)

  Update w:  w ← w − η ∇J(w)

6501 Natural Language Processing 21
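A minimal instance of the update rule on a one-dimensional convex objective, J(w) = (w − 3)^2, with a made-up learning rate:

```python
w, eta = 0.0, 0.1          # initial point and learning rate (both arbitrary)
for _ in range(100):
    grad = 2 * (w - 3)     # dJ/dw for J(w) = (w - 3)^2
    w = w - eta * grad     # w <- w - eta * grad J(w)
print(w)                   # very close to the minimizer w* = 3
```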

Page 22

Local minimum vs. global minimum

6501 Natural Language Processing 22

Page 23

Stochastic gradient descent

• Let J(w) = (1/n) Σ_{i=1}^{n} J_i(w)

• Gradient descent update rule:

  w ← w − (η/n) Σ_{i=1}^{n} ∇J_i(w)

• Stochastic gradient descent:
  • Approximate (1/n) Σ_{i=1}^{n} ∇J_i(w) by the gradient at a single example, ∇J_i(w) (why?)
  • At each step: randomly pick an example i, then update

    w ← w − η ∇J_i(w)

6501 Natural Language Processing 23
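A tiny SGD run, assuming (for illustration) J_i(w) = (w − x_i)^2 on made-up data, so the minimizer of J is the mean of the x_i; each step follows the gradient of one randomly chosen example:

```python
import random

random.seed(0)
x = [1.0, 2.0, 3.0, 6.0]               # n = 4 examples; their mean is 3.0
w, eta = 0.0, 0.05

for _ in range(2000):
    i = random.randrange(len(x))       # randomly pick an example i
    grad_i = 2 * (w - x[i])            # gradient of J_i alone
    w -= eta * grad_i                  # w <- w - eta * grad J_i(w)
print(w)                               # hovers around the minimizer 3.0
```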

Page 24

Negative sampling

• With a large vocabulary set, stochastic gradient descent is still not enough (why?)

  ∂ log p(o | c) / ∂v_c = u_o − E_{w∼P(w|c)}[u_w]

• Let's approximate it again!
  • Only sample a few words that do not appear in the context
  • Essentially, put more weight on positive samples

6501 Natural Language Processing 24
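A sketch of the negative-sampling surrogate for a single (center c, context o) pair, with random toy parameters: the score of the observed pair is pushed up, and only k sampled words are pushed down, instead of normalizing over the whole vocabulary. (word2vec draws negatives from a smoothed unigram distribution; the uniform sampling below is a simplification.)

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                  # vocabulary, dimension, negatives per pair
U = rng.normal(size=(V, d)) * 0.1      # outside vectors
Vc = rng.normal(size=(V, d)) * 0.1     # center vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(c, o):
    negs = rng.integers(0, V, size=k)              # sampled "negative" words
    pos = np.log(sigmoid(U[o] @ Vc[c]))            # push u_o . v_c up
    neg = np.log(sigmoid(-U[negs] @ Vc[c])).sum()  # push u_neg . v_c down
    return -(pos + neg)                            # loss to minimize

print(neg_sampling_loss(c=3, o=7))
```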

Page 25

More about Word2Vec – relation to LSA

• LSA factorizes a matrix of co-occurrence counts

• Levy and Goldberg (2014) prove that the skip-gram model implicitly factorizes a (shifted) PMI matrix!

• PMI(w, c) = log [ P(w|c) / P(w) ]
            = log [ P(w, c) / (P(w) P(c)) ]
            = log [ #(w, c) · |D| / (#(w) · #(c)) ]

6501 Natural Language Processing 25
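The PMI matrix is easy to compute from a table of (word, context) counts; the sketch below uses made-up counts, and the shift is log k for k negative samples (k = 5 here is just an example value).

```python
import numpy as np

counts = np.array([[10.,  2.,  0.],
                   [ 3., 12.,  1.],
                   [ 0.,  1.,  8.]])          # rows: words w, columns: contexts c

D = counts.sum()                              # |D|: total number of (w, c) pairs
w_counts = counts.sum(axis=1, keepdims=True)  # #(w)
c_counts = counts.sum(axis=0, keepdims=True)  # #(c)

with np.errstate(divide="ignore"):            # log 0 -> -inf for unseen pairs
    pmi = np.log(counts * D / (w_counts * c_counts))

k = 5                                         # number of negative samples
shifted_pmi = pmi - np.log(k)                 # the matrix skip-gram implicitly factorizes
print(shifted_pmi)
```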

Page 26

All problems solved?

6501 Natural Language Processing 26

Page 27

Continuous Semantic Representations

[Figure: words embedded in a continuous semantic space, with related words clustered together: sunny, rainy, windy, cloudy; car, wheel, cab; sad, joy, emotion, feeling]

6501 Natural Language Processing 27

Page 28

Semantics Needs More Than Similarity

Tomorrow will be rainy.

Tomorrow will be sunny.

𝑠𝑖𝑚𝑖𝑙𝑎𝑟(rainy, sunny)?

𝑎𝑛𝑡𝑜𝑛𝑦𝑚(rainy, sunny)?

6501 Natural Language Processing 28

Page 29

Polarity Inducing LSA [Yih, Zweig, Platt 2012]

• Data representation
  • Encode two opposite relations in a matrix using "polarity"
    • Synonyms & antonyms (e.g., from a thesaurus)

• Factorization
  • Apply SVD to the matrix to find latent components

• Measuring degree of relation
  • Cosine of latent vectors

Page 30

Encode Synonyms & Antonyms in a Matrix

• Joyfulness: joy, gladden; sorrow, sadden
• Sad: sorrow, sadden; joy, gladden

                        joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"    1       1        -1       -1        0
Group 2: "sad"          -1      -1         1        1        0
Group 3: "affection"     0       0         0        0        1

Inducing polarity: antonyms get negative entries

Cosine score: positive for synonyms

Target word: row-vector

Page 31

Encode Synonyms & Antonyms in a Matrix

• Joyfulness: joy, gladden; sorrow, sadden
• Sad: sorrow, sadden; joy, gladden

                        joy   gladden   sorrow   sadden   goodwill
Group 1: "joyfulness"    1       1        -1       -1        0
Group 2: "sad"          -1      -1         1        1        0
Group 3: "affection"     0       0         0        0        1

Inducing polarity: antonyms get negative entries

Cosine score: negative for antonyms

Target word: row-vector
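The sign pattern on these two slides is easy to reproduce: treating words as columns of the small matrix above, apply SVD and compare latent word vectors by cosine; synonyms come out positive and antonyms negative. A minimal sketch (k = 2 is the full rank of this toy matrix):

```python
import numpy as np

words = ["joy", "gladden", "sorrow", "sadden", "goodwill"]
M = np.array([[ 1,  1, -1, -1,  0],    # Group 1: "joyfulness"
              [-1, -1,  1,  1,  0],    # Group 2: "sad"
              [ 0,  0,  0,  0,  1]],   # Group 3: "affection"
             dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
latent = np.diag(S[:k]) @ Vt[:k, :]    # word vectors = columns of Sigma_k V_k^T

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(latent[:, 0], latent[:, 1]))   # joy vs gladden: +1 (synonyms)
print(cos(latent[:, 0], latent[:, 2]))   # joy vs sorrow: -1 (antonyms)
```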

Page 32

Continuous representations for entities

[Figure: entity embeddings and the relations among Michelle Obama, George W. Bush, Laura Bush, the Democratic Party, and the Republican Party, with one relation left as "?"]

6501 Natural Language Processing 32

Page 33

Continuous representations for entities

6501 Natural Language Processing 33

• Useful resources for NLP applications
  • Semantic Parsing & Question Answering
  • Information Extraction