Word Embeddings
Wei Xu (many slides from Greg Durrett)
Administrivia
‣ Homework 2 due next Tuesday
‣ Change of Office Hour this week: 5:00-5:45pm Thursday
‣ Reading: Eisenstein 3.3.4, 14.5, 14.6, J+M 6
Recall: Feedforward NNs

[diagram: f(x) (n features) → V (d x n matrix) → z (d hidden units) → nonlinearity g (tanh, relu, …) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)]

P(y|x) = softmax(W g(V f(x)))
Recall: Backpropagation

[diagram: f(x) → V → z (d hidden units) → g → W → softmax → P(y|x); backward pass propagates err(root) into ∂L/∂W and err(z) into ∂L/∂V]

P(y|x) = softmax(W g(V f(x)))
This Lecture
‣ Training
‣ Word representations
‣ word2vec/GloVe
‣ Evaluating word embeddings
Training Tips

Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time
‣ Batch sizes from 1-100 often work well

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)  # [batch_size, num_classes]
    loss = torch.sum(-torch.log(probs) * gold_label)
    ...
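To make the batched computation concrete, here is a minimal NumPy sketch of a batched forward pass and summed negative log-likelihood for the feedforward model above (the shapes, sizes, and function names are illustrative assumptions, not the course's PyTorch code):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max subtraction for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def batched_forward(x, V, W):
    # x: [batch_size, n_feats], V: [d, n_feats], W: [num_classes, d]
    z = np.tanh(x @ V.T)      # [batch_size, d] hidden layer
    return softmax(z @ W.T)   # [batch_size, num_classes]

def batch_loss(probs, gold_onehot):
    # Summed negative log-likelihood over the whole batch
    return -np.sum(np.log(probs) * gold_onehot)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))    # batch of 4 examples, 10 features each
V = rng.normal(size=(8, 10))    # d = 8 hidden units
W = rng.normal(size=(3, 8))     # 3 output classes
probs = batched_forward(x, V, W)
gold = np.eye(3)[[0, 2, 1, 0]]  # one-hot gold labels for the batch
loss = batch_loss(probs, gold)
```

The whole batch goes through the network in two matrix multiplications, which is where the speedup comes from.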
Training Basics
‣ Basic formula: compute gradients on batch, use first-order optimization method (SGD, Adagrad, etc.)
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further
How does initialization affect learning?

[diagram: f(x) (n features) → V (d x n matrix) → z (d hidden units) → nonlinearity g (tanh, relu, …) → W (m x d matrix) → softmax → P(y|x)]

P(y|x) = softmax(W g(V f(x)))

‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
‣ Nonlinear model… how does this affect things?
‣ Tanh: if cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, and can break down if everything is too negative ("dead" ReLU)
How does initialization affect learning?

[figure from Krizhevsky et al. (2012)]
Initialization
1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change
2) Initialize too large and cells are saturated
‣ Can do random uniform/normal initialization with appropriate scale
‣ Xavier initializer: U[ -sqrt(6 / (fan-in + fan-out)), +sqrt(6 / (fan-in + fan-out)) ]
‣ Want variance of inputs and gradients for each layer to be the same
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
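The Xavier initializer above can be sketched in a few lines of NumPy (the function name and matrix sizes are assumptions for illustration):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # Xavier initializer: sample U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
V = xavier_uniform(fan_in=10, fan_out=8, rng=rng)  # a d x n weight matrix
```

The bound shrinks as layers get wider, keeping activation variance roughly constant across layers.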
Dropout
‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ Form of stochastic regularization
‣ One line in PyTorch/Tensorflow

Srivastava et al. (2014)
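The "one line" hides a simple mechanism; a minimal sketch of inverted dropout (a common variant, assumed here for illustration):

```python
import numpy as np

def dropout(h, p_drop, rng, train=True):
    # Inverted dropout: zero each unit with probability p_drop during training
    # and scale the survivors by 1/(1 - p_drop); identity at test time
    if not train or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((4, 8))
h_train = dropout(h, p_drop=0.5, rng=rng)              # entries become 0.0 or 2.0
h_test = dropout(h, p_drop=0.5, rng=rng, train=False)  # unchanged at test time
```

The rescaling during training means no correction is needed at test time, which is why using the whole network then "just works".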
Optimizer
‣ Adam (Kingma and Ba, ICLR 2015): very widely used. Adaptive step size + momentum
‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black)
‣ One more trick: gradient clipping (set a max value for your gradients)
Four Elements of NNs
‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework
‣ Objective: many loss functions look similar, just changes the last layer of the neural network
‣ Inference: define the network, your library of choice takes care of it (mostly…)
‣ Training: lots of choices for optimization/hyperparameters
Word Representations

Word Representations
‣ Continuous model <-> expects continuous semantics from input
‣ "You shall know a word by the company it keeps" Firth (1957)
‣ Neural networks work very well at continuous data, but words are discrete

slide credit: Dan Klein
Discrete Word Representations

[diagram: binary tree over the vocabulary, with a 0/1 decision at each branch, grouping words such as good, enjoyable, great / fish, cat, dog / is, go]

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)
‣ Maximize P(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i)
‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)
Word Embeddings

‣ Part-of-speech tagging with FFNNs: "… Fed raises interest rates in order to …"
  f(x) = [emb(previous word), emb(curr word), emb(next word), other words, feats, etc.]
  e.g. emb(raises), emb(interest), emb(rates)
‣ Word embeddings for each word form input
‣ What properties should these vectors have?

Botha et al. (2017)
Sentiment Analysis
‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input
‣ Want a vector space where similar words have similar embeddings:

  the movie was great  ~~  the movie was good

[diagram: embedding space where good, enjoyable, great cluster together, far from bad, dog, is]

Iyyer et al. (2015)
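A minimal sketch of the Deep Averaging Network idea (one softmax layer rather than the paper's stacked feedforward layers; the toy vocabulary, ids, and sizes are assumptions):

```python
import numpy as np

def dan_forward(token_ids, emb, W):
    # Deep Averaging Network sketch: average the input word embeddings,
    # then classify with a single softmax layer (the real model stacks
    # several feedforward layers; one layer here for brevity)
    avg = emb[token_ids].mean(axis=0)  # [d]
    logits = W @ avg                   # [num_classes]
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))  # toy vocabulary of 100 words, d = 16
W = rng.normal(size=(2, 16))      # two sentiment classes
probs = dan_forward([5, 17, 3, 42], emb, W)  # assumed ids for a 4-word input
```

If "great" and "good" have similar embeddings, the averaged representations of the two sentences above are close, and the classifier treats them alike.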
Word Embeddings
‣ Goal: come up with a way to produce these embeddings
‣ For each word, want "medium" dimensional vector (50-300 dims) representing it
word2vec/GloVe

Neural Probabilistic Language Model

[figure from Bengio et al. (2003)]
word2vec: Continuous Bag-of-Words
‣ Predict word from context: the dog bit the man
‣ Sum the d-dimensional context embeddings (dog + the, size d), apply W (size |V| x d), then a softmax of size |V|:

  P(w | w-1, w+1) = softmax(W (c(w-1) + c(w+1)))

‣ Parameters: d x |V| (one d-length context vector per voc word), |V| x d output parameters (W)
‣ Gold label = bit, no manual labeling required!

Mikolov et al. (2013)
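The CBOW prediction step can be sketched directly from the equation above (toy vocabulary size, dimensions, and word ids are assumptions):

```python
import numpy as np

def cbow_probs(ctx_ids, C, W):
    # CBOW sketch: sum the context vectors c(w-1) + c(w+1),
    # multiply by the |V| x d output matrix W, softmax over the vocabulary
    h = C[ctx_ids].sum(axis=0)  # size d
    logits = W @ h              # size |V|
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 50, 8                 # toy vocabulary size and embedding dimension
C = rng.normal(size=(V, d))  # one d-length context vector per vocab word
W = rng.normal(size=(V, d))  # |V| x d output parameters
probs = cbow_probs([2, 3], C, W)  # assumed ids for the context words
```

Training maximizes the probability assigned to the true middle word under this distribution.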
word2vec: Skip-Gram
‣ Predict one word of context from word: the dog bit the man
‣ Apply W (size |V| x d) to the d-dimensional embedding of bit, then a softmax of size |V|; gold label = dog:

  P(w' | w) = softmax(W e(w))

‣ Another training example: bit -> the
‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

Mikolov et al. (2013)
Hierarchical Softmax
‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

  P(w | w-1, w+1) = softmax(W (c(w-1) + c(w+1)))
  P(w' | w) = softmax(W e(w))

‣ Standard softmax: O(|V|) dot products of size d per training instance per context word
‣ Hierarchical softmax: O(log(|V|)) dot products of size d, log(|V|) binary decisions
  ‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take
  ‣ |V| x d parameters

  [diagram: binary tree with vocabulary words such as "the", "a" at the leaves]

Mikolov et al. (2013)
http://building-babylon.net/2017/08/01/hierarchical-softmax/
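The log(|V|) binary decisions can be sketched as a product of sigmoid branch probabilities along each word's path (the tree layout, sign convention, and sizes here are assumptions; real implementations use a Huffman tree):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hsoftmax_prob(h, path, node_vecs):
    # Hierarchical softmax sketch: P(word | h) is the product of binary
    # branch decisions along the word's root-to-leaf path.
    # path: list of (node_id, direction) pairs, direction = +1 (left) / -1 (right)
    p = 1.0
    for node_id, direction in path:
        p *= sigmoid(direction * (node_vecs[node_id] @ h))
    return p

rng = np.random.default_rng(0)
d = 4
node_vecs = rng.normal(size=(3, d))  # 3 internal nodes for a 4-word tree
h = rng.normal(size=d)
# complete binary tree over a toy 4-word vocabulary
paths = {
    "the": [(0, +1), (1, +1)],
    "a":   [(0, +1), (1, -1)],
    "dog": [(0, -1), (2, +1)],
    "bit": [(0, -1), (2, -1)],
}
total = sum(hsoftmax_prob(h, p, node_vecs) for p in paths.values())
```

Because sigmoid(x) + sigmoid(-x) = 1 at every branch, the leaf probabilities sum to 1 without ever computing a |V|-way softmax; each word costs only 2 dot products here instead of 4.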
Skip-Gram with Negative Sampling
‣ Take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling from the unigram distribution: the dog bit the man

  (bit, the) => +1    (bit, cat) => -1
  (bit, a) => -1      (bit, fish) => -1

‣ P(y = 1 | w, c) = e^(w·c) / (e^(w·c) + 1)

‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1..n} log P(y = 0 | w_i, c), with the w_i sampled negatives

‣ d x |V| vectors, d x |V| context vectors (same # of params as before)
‣ Words in similar contexts select for similar c vectors

Mikolov et al. (2013)
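The SGNS probability and per-pair objective can be written out directly (a NumPy sketch using log1p for numerical stability; vector sizes are assumptions):

```python
import numpy as np

def log_p_pos(w, c):
    # log P(y=1|w,c) = log [ e^(w·c) / (e^(w·c) + 1) ] = -log(1 + e^(-w·c))
    return -np.log1p(np.exp(-(w @ c)))

def log_p_neg(w, c):
    # log P(y=0|w,c) = log [ 1 / (e^(w·c) + 1) ] = -log(1 + e^(w·c))
    return -np.log1p(np.exp(w @ c))

def sgns_objective(w, c, neg_ws, k):
    # One positive (word, context) pair plus k sampled negative words
    return log_p_pos(w, c) + (1.0 / k) * sum(log_p_neg(wi, c) for wi in neg_ws)

rng = np.random.default_rng(0)
d, k = 8, 3
w, c = rng.normal(size=d), rng.normal(size=d)
neg_ws = rng.normal(size=(k, d))  # vectors of the sampled negative words
obj = sgns_objective(w, c, neg_ws, k)
```

Maximizing this pushes w·c up for real pairs and down for sampled ones, which is what makes words in similar contexts select for similar c vectors.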
Connections with Matrix Factorization
‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[diagram: |V| x |V| word pair counts matrix ≈ (|V| x d) word vecs times (d x |V|) context vecs]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Levy et al. (2014)
Skip-Gram as Matrix Factorization
‣ If we sample negative examples from the uniform distribution over words, the skip-gram objective exactly corresponds to factoring this |V| x |V| matrix:

  M_ij = PMI(w_i, c_j) - log k        (k = num negative samples)

  PMI(w_i, c_j) = log [ P(w_i, c_j) / (P(w_i) P(c_j)) ] = log [ (count(w_i, c_j)/D) / ((count(w_i)/D) (count(c_j)/D)) ]

‣ …and it's a weighted factorization problem (weighted by word freq)

Levy et al. (2014)
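Building the shifted-PMI matrix from a counts matrix is a few lines (a NumPy sketch; the toy counts are made up for illustration):

```python
import numpy as np

def shifted_pmi_matrix(counts, k=1):
    # M_ij = PMI(w_i, c_j) - log k, with PMI = log [ P(w,c) / (P(w) P(c)) ]
    D = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / D  # P(w_i)
    p_c = counts.sum(axis=0, keepdims=True) / D  # P(c_j)
    with np.errstate(divide="ignore"):           # zero counts give -inf entries
        return np.log((counts / D) / (p_w * p_c)) - np.log(k)

counts = np.array([[10.0, 2.0],
                   [1.0, 8.0]])  # toy word-context co-occurrence counts
M = shifted_pmi_matrix(counts, k=1)
```

Positive entries mark word-context pairs that co-occur more often than chance; factoring M with d-dimensional rows and columns recovers the word and context vectors.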
GloVe (Global Vectors)
‣ Also operates on the counts matrix (|V| x |V| word pair counts): weighted regression on the log co-occurrence matrix

  Loss = Σ_{i,j} f(count(w_i, c_j)) (w_i^T c_j + a_i + b_j - log count(w_i, c_j))^2

‣ Constant in the dataset size (just need counts), quadratic in voc size
‣ By far the most common word vectors used today (10000+ citations)

Pennington et al. (2014)
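The loss above can be evaluated directly from a counts matrix; a NumPy sketch (the x_max/alpha weighting function is from the GloVe paper, and the toy counts and sizes are assumptions):

```python
import numpy as np

def glove_loss(counts, Wv, Cv, a, b, x_max=100.0, alpha=0.75):
    # GloVe objective sketch: weighted least squares fitting
    # w_i . c_j + a_i + b_j to log count(w_i, c_j).
    # f(x) = min((x / x_max)^alpha, 1) is the weighting from Pennington et al. (2014)
    loss = 0.0
    for i in range(counts.shape[0]):
        for j in range(counts.shape[1]):
            x = counts[i, j]
            if x == 0:  # zero co-occurrences contribute nothing
                continue
            f = min((x / x_max) ** alpha, 1.0)
            err = Wv[i] @ Cv[j] + a[i] + b[j] - np.log(x)
            loss += f * err ** 2
    return loss

rng = np.random.default_rng(0)
counts = np.array([[10.0, 0.0],
                   [3.0, 7.0]])   # toy co-occurrence counts
Wv = rng.normal(size=(2, 4))      # word vectors
Cv = rng.normal(size=(2, 4))      # context vectors
a, b = rng.normal(size=2), rng.normal(size=2)
loss = glove_loss(counts, Wv, Cv, a, b)
```

Note that only the counts matrix is touched, never the corpus itself, which is why the cost is constant in the dataset size.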
Preview: Context-dependent Embeddings
‣ How to handle different word senses? One vector for balls:

  they hit the balls    they dance at balls

‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
‣ Context-sensitive word embeddings: depend on rest of the sentence
‣ Huge improvements across nearly all NLP tasks over word2vec & GloVe

Peters et al. (2018)
Evaluation

Evaluating Word Embeddings
‣ What properties of language should word embeddings capture?

[diagram: embedding space where good, enjoyable, great cluster together; dog, cat, wolf, tiger cluster; is, was cluster; bad sits apart]

‣ Similarity: similar words are close to each other
‣ Analogy:
  Paris is to France as Tokyo is to ???
  good is to best as smart is to ???
Similarity
‣ SVD = singular value decomposition on the PMI matrix
‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice

Levy et al. (2015)
Hypernymy Detection
‣ Hypernyms: detective is a person, dog is an animal
‣ Do word vectors encode these relationships?
‣ word2vec (SGNS) works barely better than random guessing here

Chang et al. (2017)
Analogies

  (king - man) + woman = queen
  king + (woman - man) = queen

[diagram: man → woman and king → queen offsets roughly parallel in vector space]

‣ Why would this be?
‣ woman - man captures the difference in the contexts that these occur in
‣ Dominant change: more "he" with man and "she" with woman, similar to the difference between king and queen
Analogies
‣ These methods can perform well on analogies on two different datasets using two different methods
‣ Maximizing for b2:

  Add = cos(b2, a2 - a1 + b1)
  Mul = cos(b2, a2) cos(b2, b1) / (cos(b2, a1) + ε)

Levy et al. (2015)
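The Add method is easy to sketch on toy vectors (the 2-d "royalty/gender" vectors below are made up purely for illustration):

```python
import numpy as np

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy_add(vecs, a1, a2, b1):
    # Add method: return the b2 maximizing cos(b2, a2 - a1 + b1),
    # excluding the three query words themselves
    target = vecs[a2] - vecs[a1] + vecs[b1]
    candidates = [w for w in vecs if w not in (a1, a2, b1)]
    return max(candidates, key=lambda w: cos(vecs[w], target))

# toy 2-d vectors (assumed): axis 0 ~ royalty, axis 1 ~ "he"/"she" contexts
vecs = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0, 1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.9, 1.0]),
}
answer = analogy_add(vecs, "man", "woman", "king")  # king - man + woman
```

Excluding the query words matters in practice: without it, b1 itself is often the nearest neighbor of the target.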
Using Semantic Knowledge
‣ Structure derived from a resource like WordNet

[diagram: original vector for false vs. adapted vector for false]

‣ Doesn't help most problems

Faruqui et al. (2015)
Using Word Embeddings
‣ Approach 1: learn embeddings as parameters from your data
  ‣ Often works pretty well
‣ Approach 2: initialize using GloVe/word2vec/ELMo, keep fixed
  ‣ Faster because no need to update these parameters
‣ Approach 3: initialize using GloVe, fine-tune
  ‣ Works best for some tasks, not used for ELMo, often used for BERT
Takeaways
‣ Lots to tune with neural networks
  ‣ Training: optimizer, initializer, regularization (dropout), …
  ‣ Hyperparameters: dimensionality of word embeddings, layers, …
‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)
‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties
  ‣ Even better: context-sensitive word embeddings (ELMo/BERT)
‣ Next time: RNNs and CNNs