Composition in distributional semantics
Marco Baroni and Georgiana Dinu
Center for Mind/Brain Sciences, University of Trento
ESSLLI 2014
In collaboration with: Roberto Zamparelli, Raffaella Bernardi, Marco Marelli, Denis Paperno, Jiming Li, Eva Maria Vecchi, Nghia The Pham, Germán Kruszewski, Angeliki Lazaridou, Edward Grefenstette, Fabio Zanzotto, Louise McNally, Gemma Boleda. . .
Plan for the course
Day 1 General introduction: background, motivation, simple compositional models
Day 2 Compositional models: structure and estimation
Day 3 Empirical evaluation
Day 4 Practice with the DISSECT toolkit:
http://clic.cimec.unitn.it/composes/toolkit
Day 5 Conclusion: open issues, broader implications, discussion
Outline
Distributional semantics
Towards a compositional distributional semantics
Simple composition methods
The vastness of word meaning
The distributional hypothesis
Harris, Charles and Miller, Firth, Wittgenstein? . . .

The meaning of a word is (can be approximated by, derived from) the set of contexts in which it occurs in texts

We found a little, hairy wampimuk sleeping behind the tree
See also MacDonald & Ramscar CogSci 2001
Distributional semantic models in a nutshell
I Represent words through vectors recording their co-occurrence counts with context elements in a corpus
I (Optionally) apply a re-weighting scheme to the resulting co-occurrence matrix
I (Optionally) apply dimensionality reduction techniques to the co-occurrence matrix
I Measure geometric distance of word vectors in "distributional space" as a proxy to semantic similarity/relatedness
Co-occurrence
    he curtains open and the moon shining in on the barely
    ars and the cold , close moon " . And neither of the w
    rough the night with the moon shining so brightly , it
    made in the light of the moon . It all boils down , wr
    surely under a crescent  moon , thrilled by ice-white
    sun , the seasons of the moon ? Home , alone , Jay pla
    m is dazzling snow , the moon has risen full and cold
    un and the temple of the moon , driving out of the hug
    in the dark and now the  moon rises , full and amber a
    bird on the shape of the moon over the trees in front
    But I could n't see the  moon or the stars , only the
    rning , with a sliver of moon hanging among the stars
    they love the sun , the  moon and the stars . None of
    the light of an enormous moon . The plash of flowing w
    man 's first step on the moon ; various exhibits , aer
    the inevitable piece of  moon rock . Housing The Airsh
    oud obscured part of the moon . The Allied guns behind
Extracting co-occurrence counts
Variations in the type of context features
Documents as contexts:

              Doc1   Doc2   Doc3
    stars       38     45      2

Syntactic dependencies as contexts:

            dobj←see   mod→bright   mod→shiny
    stars         38           45          44

Lexical patterns as contexts:

            "The nearest • to Earth"   "stories of • and their"
    stars                         12                         10
Extracting co-occurrence counts
Variations in the definition of co-occurrence
E.g.: Co-occurrence with words, window of size 2, scaling by distance to target:

... two [intensely bright stars in the] night sky ...

            intensely   bright   in   the
    stars         0.5        1    1   0.5
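The window-based counting scheme in this example can be sketched as follows (the 1/distance scaling is inferred from the example values above; the function name is illustrative):

```python
from collections import defaultdict

def window_counts(tokens, target, size=2):
    """Toy window-based co-occurrence counts with 1/distance scaling."""
    counts = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # look at neighbours within `size` positions on each side
        for j in range(max(0, i - size), min(len(tokens), i + size + 1)):
            if j != i:
                counts[tokens[j]] += 1.0 / abs(j - i)
    return dict(counts)

toks = "two intensely bright stars in the night sky".split()
print(window_counts(toks, "stars"))
# {'intensely': 0.5, 'bright': 1.0, 'in': 1.0, 'the': 0.5}
```

The output reproduces the table above: immediate neighbours get weight 1, neighbours at distance 2 get weight 0.5.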
From co-occurrences to vectors
            bright   in   sky
    stars        8   10     6
    sun         10   15     4
    dog          2   20     1
Weighting
Re-weight the counts using corpus-level statistics to reflect co-occurrence significance
Point-wise Mutual Information (PMI)
    PMI(target, ctxt) = log [ P(target, ctxt) / (P(target) P(ctxt)) ]
Weighting
Adjusting raw co-occurrence counts:
            bright      in
    stars      385   10788   ...   ← Counts
    stars     43.6     5.3   ...   ← PMI
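A minimal sketch of PMI re-weighting over a whole count matrix (the toy counts are illustrative, not the ones behind the table above):

```python
import numpy as np

# Toy count matrix: rows are target words, columns are contexts.
counts = np.array([[385.0, 10788.0],
                   [200.0,   9000.0]])

def pmi(counts):
    """PMI(target, ctxt) = log P(t,c) / (P(t) P(c)), estimated from counts."""
    total = counts.sum()
    p_joint = counts / total
    p_t = p_joint.sum(axis=1, keepdims=True)   # marginal P(target)
    p_c = p_joint.sum(axis=0, keepdims=True)   # marginal P(context)
    with np.errstate(divide="ignore"):         # zero counts give -inf
        return np.log(p_joint / (p_t * p_c))

weights = pmi(counts)
```

When target and context are independent, the ratio is 1 and the PMI is 0; positive values signal significant co-occurrence.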
Other weighting schemes:
I TF-IDF
I Local Mutual Information
I Dice
See Ch4 of J.R. Curran's thesis (2004) and S. Evert's thesis (2007) for great surveys
Dimensionality reduction
I Vector spaces often range from tens of thousands to millions of context dimensions
I Some of the methods to reduce dimensionality:
I Select context features based on various relevance criteria
I Random indexing
I Claimed to also have a beneficial smoothing effect
I Singular Value Decomposition
I Non-negative matrix factorization
I Probabilistic Latent Semantic Analysis
I Latent Dirichlet Allocation
From geometry to similarity in meaning
[plot: "stars" and "sun" as vectors in a 2-d space]

Vectors

    stars   2.5   2.1
    sun     2.9   3.1
Cosine similarity
    cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖)
              = Σ_{i=1}^{n} x_i y_i / ( √(Σ_{i=1}^{n} x_i²) √(Σ_{i=1}^{n} y_i²) )
Other similarity measures: Euclidean Distance, Dice, Jaccard,Lin. . .
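The cosine formula above, applied to the two toy vectors from the previous slide:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity: inner product normalized by vector lengths."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

stars = np.array([2.5, 2.1])
sun = np.array([2.9, 3.1])

print(cosine(stars, sun))  # close to 1: the vectors point in similar directions
```

Since cosine ignores vector length, frequent and rare words are compared on the direction of their context profiles only.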
Other ways to obtain vectors
"Word embeddings"

Build vectors optimizing some context prediction objective.

Bengio et al. JMLR 2003, Collobert and Weston ICML 2008, Turian et al. ACL 2010, Mikolov et al. NIPS 2013, systematic evaluation by Baroni et al. ACL 2014
Geometric neighbours ≈ semantic neighbours
    rhino        fall          good        sing
    woodpecker   rise          bad         dance
    rhinoceros   increase      excellent   whistle
    swan         fluctuation   superb      mime
    whale        drop          poor        shout
    ivory        decrease      improved    sound
    plover       reduction     perfect     listen
    elephant     logarithm     clever      recite
    bear         decline       terrific    play
    satin        cut           lucky       hear
    sweatshirt   hike          smashing    hiss
The TOEFL synonym match task
I 80 items
I Target: levied
Candidates: imposed, believed, requested, correlated
I In semantic space, measure angles between target and candidate context vectors, pick the candidate with the narrowest angle to the target
Human performance on the synonym match task
I Average foreign test taker: 64.5%
I Macquarie University staff (Rapp 2004):
I Average of 5 non-natives: 86.75%
I Average of 5 natives: 97.75%
Distributional Semantics takes the TOEFL
I Humans:
I Foreign test takers: 64.5%
I Macquarie non-natives: 86.75%
I Macquarie natives: 97.75%
I Machines:
I Classic LSA: 64.4%
I Padó and Lapata's dependency-based model: 73%
I Rapp's 2003 SVD-based model trained on lemmatized BNC: 92.5%
I Bullinaria and Levy's 2012 SVD-based model trained on ukWaC: 100% (in extensive search of parameter space)
Distributional semantics: A general-purpose representation of lexical meaning
Baroni and Lenci 2010
I Similarity (cord-string vs. cord-smile)
I Synonymy (zenith-pinnacle)
I Concept categorization (car ISA vehicle; banana ISA fruit)
I Selectional preferences (eat topinambur vs. *eat sympathy)
I Analogy (mason is to stone like carpenter is to wood)
I Relation classification (exam-anxiety are in CAUSE-EFFECT relation)
I Qualia (TELIC ROLE of novel is to entertain)
I Salient properties (car-wheels, dog-barking)
I Argument alternations (John broke the vase - the vase broke, John minces the meat - *the meat minced)
Selectional preferences in semantic space
Padó et al. EMNLP 2007

To kill. . .

    object         cosine      with         cosine
    kangaroo       0.51        hammer       0.26
    person         0.45        stone        0.25
    robot          0.15        brick        0.18
    hate           0.11        smile        0.15
    flower         0.11        flower       0.12
    stone          0.05        antibiotic   0.12
    fun            0.05        person       0.12
    book           0.04        heroin       0.12
    conversation   0.03        kindness     0.07
    sympathy       0.01        graduation   0.04
Distributional semantics: some references
I Classics:
I Schütze's 1997 CSLI book
I Landauer and Dumais PsychRev 1997
I Griffiths et al. PsychRev 2007
I Overviews:
I Turney and Pantel JAIR 2010
I Erk LLC 2012
I Clark, to appear in Handbook of Contemporary Semantics
I Evaluation:
I Sahlgren's 2006 thesis
I Bullinaria and Levy BRM 2007, 2012
I Baroni, Dinu and Kruszewski ACL 2014
I Kiela and Clark CVSC 2014
Outline
Distributional semantics
Towards a compositional distributional semantics
Simple composition methods
The infinity of sentence meaning
Compositionality
The meaning of an utterance is a function of the meaning of its parts and their composition rules (Frege 1892)
Compositionality in formal semantics
Some donkey flies

[some(donkey)](flies) = FALSE

    some : <<e,t>,<<e,t>,t>>    donkey : <e,t>    flies : <e,t>
    some(donkey) : <<e,t>,t>
    [some(donkey)](flies) : t
E.g., Heim and Kratzer 1998
A compositional distributional semantics for phrases and sentences?
Mitchell and Lapata ACL 2008, CogSci journal 2010. . .
                          planet   night   full   blood   shine
    moon                      10      22     43       3      29
    red moon                  12      21     40      20      28
    the red moon shines       11      23     21      15      45
The unavoidability of distributional representations of phrases
What can you do with distributional representations of phrases and sentences?

NOT this: [garbled figure: a logic-style translation of a sentence]
What can you do with distributional representations of phrases and sentences?
Paraphrasing

[plot of sentence vectors in 2 dimensions: "cookie dwarfs hop under the crimson planet" lies near "gingerbread gnomes dance under the red moon", and far from "red gnomes love gingerbread cookies" and "students eat cup noodles"]
What can you do with distributional representations of phrases and sentences?
Contextual disambiguation
What can you do with distributional representations of phrases and sentences?
Semantic acceptability
Compositional distributional semantics
How?
                 planet   night   space   color   blood   brown
    red            15.3     3.7     2.2    24.3    19.1    20.2
    moon           24.3    15.2    20.1     3.0     1.2     0.5
    red+moon       39.6    18.9    22.3    27.3    20.3    20.7
    red⊙moon      371.8    56.2    44.2    72.9    22.9    10.1
    red(moon)      24.6    19.3    12.4    22.6    23.9     7.1
Outline
Distributional semantics
Towards a compositional distributional semantics
Simple composition methods
(Weighted) additive model
Mitchell and Lapata 2010
~p = α~u + β~v
                                     music   solution   economy   craft   reasonable
    practical                            0          6         2      10            4
    difficulty                           1          8         4       4            0
    practical + difficulty               1         14         6      14            4
    0.4×practical + 0.6×difficulty     0.6        7.2       3.2     6.4          1.6
Multiplicative model
Mitchell and Lapata 2010
~p = ~u � ~v
pi = uivi
                                     music   solution   economy   craft   reasonable
    practical                            0          6         2      10            4
    difficulty                           1          8         4       4            0
    practical + difficulty               1         14         6      14            4
    0.4×practical + 0.6×difficulty     0.6        7.2       3.2     6.4          1.6
    practical ⊙ difficulty               0         48         8      40            0
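The weighted additive and multiplicative models can be sketched directly over the table's vectors:

```python
import numpy as np

practical = np.array([0.0, 6.0, 2.0, 10.0, 4.0])
difficulty = np.array([1.0, 8.0, 4.0, 4.0, 0.0])

def weighted_add(u, v, alpha=1.0, beta=1.0):
    """(Weighted) additive model: p = alpha*u + beta*v."""
    return alpha * u + beta * v

def mult(u, v):
    """Multiplicative model: p_i = u_i * v_i (component-wise product)."""
    return u * v

combined = weighted_add(practical, difficulty, 0.4, 0.6)
product = mult(practical, difficulty)
```

Note how multiplication acts as a feature intersection: dimensions on which either word has a zero count are zeroed out in the phrase vector.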
Composition as dimension averaging/mixing
I red moon
I red face? fake gun?? buy guns??
I the car??? of the car????
I pandas eat bamboo vs. bamboo eats pandas???
I lice and dogs vs. lice on dogs???
I some children walk happily in the valley of the moon????
Composition as (distributional) function application
Coecke+Clark+Grefenstette+Sadrzadeh et al., COMPOSES
Related: Guevara, Socher et al., Zanzotto et al.
Composition as (distributional) function application
[plots: a 2-d space with context dimensions "runs" and "barks"; the left panel shows the vectors dog, old and old + dog; the right panel shows the function old() mapping dog to old(dog)]
Outline
Composition as function application
Related models and general estimation of composition models
Neural network methods for composition
Function application in vector space
Baroni and Zamparelli’s EMNLP 2010 proposal
I Adjectives are linear functions
I Nouns are vectors
I Linear functions are matrices, function application is function-by-vector multiplication:
−→c = A−→n
Linear composition: function application as matrix-by-vector multiplication
−→c = A−→n
    | c1 |     | a11  a12  · · ·  a1m |   | n1 |
    | c2 |  =  | a21  a22  · · ·  a2m | × | n2 |
    | ·· |     | ···  ···  · · ·  ··· |   | ·· |
    | cm |     | am1  am2  · · ·  amm |   | nm |
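In code, the adjective is just a learned matrix applied to the noun vector (toy 3-d numbers, all invented for illustration):

```python
import numpy as np

# Toy 3-dimensional space: the adjective "red" as a matrix,
# the noun "moon" as a vector (all values illustrative).
A_red = np.array([[0.5, 0.1, 0.0],
                  [0.2, 1.0, 0.3],
                  [0.0, 0.4, 0.9]])
moon = np.array([24.3, 15.2, 20.1])

# Function application = matrix-by-vector multiplication
red_moon = A_red @ moon
```

Unlike addition or multiplication, the output dimensions here are learned mixtures of the input dimensions, so the adjective can reweight and redistribute the noun's features.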
Corpus-observed phrase vectors
    n and the moon shining i       f a large red moon , Campana
    with the moon shining s        , a blood red moon hung over
    rainbowed moon . And the       glorious red moon turning t
    crescent moon , thrille        The round red moon , she 's
    in a blue moon only , w        l a blood red moon emerged f
    now , the moon has risen       n rains , red moon blows , w
    d now the moon rises , f       monstrous red moon had climb
    y at full moon , get up        . A very red moon rising is
    crescent moon . Mr Angu        under the red moon a vampire

               shine   blood   Soviet
    moon         301      93        1
    red moon      11      90        0
    army           2     454       20
    red army       0      22       18
Learning to approximate corpus-observed phrase vectors
I Input:
I N: matrix of noun vectors
I C: matrix of adj-noun observed phrase vectors

    c1 = red.moon    n1 = moon
    c2 = red.army    n2 = army
    c3 = red.car     n3 = car

I Estimate:
I Matrix of the function word (A_red)

    [c1 c2 · · ·] ≈ A_red × [n1 n2 · · ·]

(each ci is an m-dimensional observed phrase vector, each ni the corresponding m-dimensional noun vector)
Learning to approximate corpus-observed phrase vectors
I Least squares regression problem
    A = arg min_A ||C − AN||²_F = arg min_A Σ_{i,j} (C_ij − (AN)_ij)²

    [c1 c2 · · ·] ≈ A × [n1 n2 · · ·]
I Closed form solution
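A minimal sketch of this estimation on synthetic data (the hidden matrix and the noise level are invented; `np.linalg.lstsq` gives the closed-form least-squares solution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: columns of N are noun vectors, columns of C the
# corpus-observed "red + noun" phrase vectors. Here C is generated
# from a hidden matrix plus noise so we can check recovery.
m, k = 4, 50                       # dimensions, training phrases
A_true = rng.normal(size=(m, m))
N = rng.normal(size=(m, k))
C = A_true @ N + 0.01 * rng.normal(size=(m, k))

# Solve A = argmin ||C - AN||_F^2.
# lstsq solves min ||N^T X - C^T||, whose solution X equals A^T.
A_hat = np.linalg.lstsq(N.T, C.T, rcond=None)[0].T
```

With enough training phrases per adjective, the estimated matrix closely recovers the generating one.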
From adjectives to higher valency relations
I How about two-place (or n-place) relations?
I Linear functions → matrices
I Multilinear functions → tensors
I Tensor of order n + 1 models functions with n arguments
I T: third-order tensor, R: binary relation

    R(a, b) = R(a)(b) = (T × ~a) × ~b

I ×: tensor contraction

    (T × ~a)_ij = Σ_k T_ijk a_k
    ((T × ~a) × ~b)_i = Σ_j (T × ~a)_ij b_j
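The two contractions above map directly onto `np.einsum` (tensor values are random placeholders, just to show the shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
V = rng.normal(size=(m, m, m))   # third-order tensor for a binary relation
a = rng.normal(size=m)           # first argument vector
b = rng.normal(size=m)           # second argument vector

# (T x a)_ij = sum_k T_ijk a_k  -> a matrix
T_a = np.einsum("ijk,k->ij", V, a)

# ((T x a) x b)_i = sum_j (T x a)_ij b_j  -> a vector
result = np.einsum("ij,j->i", T_a, b)
```

Each contraction consumes one argument and lowers the order by one: tensor → matrix → vector, mirroring curried function application.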
From adjectives to higher valency relations
I Transitive verbs:
    dogs chase cats = chase × ~cats × ~dogs

    ~c = V × ~n_obj × ~n_subj

I V: third-order tensor (cube)
I V × ~n_obj: second-order tensor (matrix)
I Composition as function application, connecting the distributional implementation to a general grammatical formalism: B. Coecke, M. Sadrzadeh and S. Clark 2010; E. Grefenstette, M. Sadrzadeh, S. Clark, B. Coecke and S. Pulman 2011
Recursive estimation of transitive verbs
Grefenstette, Dinu, Zhang, Sadrzadeh and Baroni 2013

[diagram: STEP 1 estimates VP matrices (EAT MEAT, EAT PIE) from subject vectors (dogs, cats, boys, girls; training input) and corpus-observed sentence vectors (dogs.eat.meat, cats.eat.meat, boys.eat.pie, girls.eat.pie; training output); STEP 2 estimates the EAT tensor (function to estimate) from object vectors (meat, pie) and the VP matrices obtained in STEP 1]
Evaluation
Similarity of small phrases/sentences, against human-assigned similarity scores.

I Subject-intransitive verb (Mitchell and Lapata 2008)

    SV1                SV2                  Sim.
    fire glow          fire burn               6
    fire glow          fire beam               1
    face glow          face burn               1
    face glow          face beam               1
    discussion stray   discussion digress      7
    child stray        child digress           2
Evaluation
I Subject-verb-object (Grefenstette et al 2011)

    SVO1                SVO2                   Sim.
    table show result   table express result      7
    map show location   map express location      1
Results
I Rank correlation (ρ) with human-assigned scores

    Comp. model   SV     SVO
    Humans        0.40   0.62
    Verb only     0.06   0.08
    Add           0.13   0.12
    Mult          0.19   0.23
    LexFunc       0.23   0.32

Problem
Too many parameters! Assume vectors of size 100:

I transitive verbs: 100³ parameters; adverbs: 100⁴ parameters

Paperno et al, ACL 2014

I Each word has a vector and a set of matrices. Each matrix models its functional behaviour in one argument at a time.
Outline
Composition as function application
Related models and general estimation of composition models
Neural network methods for composition
Related methods
Guevara 2010, Zanzotto et al 2010 propose similar ways to estimate a generic additive model

I Given two word vectors ~u and ~v
I Compute composed vector ~c:

~c = A~u + B~v

I Parameters: matrices A and B. They are not lexicalized: they are either common to all compositions or syntax-dependent, rather than tied to specific words.
Related methods
~c = A~u + B~v
Estimation
I Corpus-observed phrases (Guevara 2010)
I (Adj-Noun, Paraphrase) tuples (Zanzotto et al 2010), e.g. (close interaction, contact)
I Least squares regression problem

    ~c = [A, B] [~u; ~v]

    [A, B] = arg min ||C − [A, B] [U; V]||²_F
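A minimal sketch of this joint least-squares estimation (synthetic, noise-free data so the hidden matrices can be recovered exactly; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 4, 100
U = rng.normal(size=(m, k))        # first-word vectors (columns)
V = rng.normal(size=(m, k))        # second-word vectors (columns)
A_true = rng.normal(size=(m, m))
B_true = rng.normal(size=(m, m))
C = A_true @ U + B_true @ V        # observed phrase vectors (noise-free toy)

# Stack [U; V] and solve for the concatenated [A, B] by least squares
UV = np.vstack([U, V])                              # (2m, k)
AB = np.linalg.lstsq(UV.T, C.T, rcond=None)[0].T    # (m, 2m)
A_hat, B_hat = AB[:, :m], AB[:, m:]
```

Stacking the inputs turns the two-matrix problem into a single standard regression, which is why the model has a closed-form solution.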
Compositional Distributional Semantics Models (CDSMs). Big picture

From: Simple functions

    vec(very) + vec(good) + vec(movie) → vec(very good movie)
To: Complex composition operations
Socher at al. 2013
Learning CDSMs. Big picture
From: Distributional word vectors as input, no parameters

To: Various learning objectives

I Corpus-extracted phrase vectors
I Reconstruction error
I Sentiment prediction
I Cross-lingual signals
I ...
General estimation of CDSMs
Use corpus phrase vectors to learn the parameters of other composition functions:

Solve:

    θ* = arg min_θ ||P − f^comp_θ(U, V)||²_F

P / U, V: phrase / word occurrence matrices
General estimation of CDSMs
    Name      Composition function                              Parameters
    Add       w1·pink + w2·dogs                                 w1, w2 ∈ R
    Mult      pink^w1 ⊙ dogs^w2                                 w1, w2 ∈ R
    Dil       ||pink||²₂·dogs + (λ − 1)⟨pink, dogs⟩·pink        λ ∈ R
    Fulladd   W1·pink + W2·dogs                                 W1, W2 ∈ R^{m×m}
    Lexfunc   A_pink·dogs                                       A_u ∈ R^{m×m}
    Fulllex   tanh([W1, W2] [A_pink·dogs; A_dogs·pink])         W1, W2, A_u, A_v ∈ R^{m×m}

I Add, Mult, Dil: Mitchell and Lapata 2008; Fulladd: Guevara 2010, Zanzotto et al 2010; Lexfunc: Baroni and Zamparelli 2010; Fulllex: Socher et al 2012
General estimation of CDSMs: Data sets
Subj-Verb

    fire beam   fire burn   7
    skin glow   skin burn   1

Adj-Noun → Noun paraphrase

    archeological site     dig
    spousal relationship   marriage
    personal appeal        charisma
    electrical storm       thunderstorm
    indoor garden          hothouse
    double star            binary
General estimation of CDSMs: Data sets
Determiner phrase - Noun (Bernardi et al 2013)

I Nouns that are strongly related to a determiner phrase
I 23 determiners (two, various, no, many, too many, . . .)

    Noun       Correct DP      Foils
    duel       two opponents   various opponents, three opponents, two engineers, two, opponents
    homeless   no home         too few homes, one home, no incision, no, home
    polygamy   several wives   most wives, fewer wives, several negotiators, several, wives
General estimation of CDSMs: Results

[bar plots comparing Add, Dil, Mult, Fulladd, Lexfunc, Fulllex and corpus-extracted phrase vectors on three tasks: intransitive sentences, ANs, DPs]
Performance of composition models:
I Lexical function performs best and is stable across all tasks

Parameter estimation:
I For the simple composition models, observed-phrase estimation works just as well as choosing parameters through (supervised) cross-validation
General estimation of CDSMs: Examples
    Target         Method     Neighbours
    belief         Corpus     moral, dogma, worldview, religion, world-view, morality, theism
    false belief   Additive   belief, assertion, falsity, falsehood, truth, credence, dogma, supposition
                   Fulladd    pantheist, belief, agnosticism, religiosity, dogmatism, pantheism, theist
                   Lexfunc    self-deception, untruth, credulity, obfuscation, misapprehension, deceiver, disservice, falsehood
General estimation of CDSMs: Examples
DP neighbours of homeless
    Method     Neighbours
    Additive   three people, several orphanages, more people, many people, too many people
    Lexfunc    many people, more people, many victims, three people, no people, more deaths
    Corpus     those living, those homeless, more lives, those people, many families
Outline
Composition as function application
Related models and general estimation of composition models
Neural network methods for composition
Socher et al 2011
Recursive Autoencoder (RAE)

Vector representations for sentences are computed recursively through a parse tree.

Composition:

    p = g(W [c1; c2] + b)

c1, c2 ∈ R^n are inputs
W ∈ R^{n×2n}, b ∈ R^n weight matrix and bias
g a non-linear function
Socher et al 2011, Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
Socher et al 2011
Recursive Autoencoder (RAE)

Learn encoding/decoding matrices in order to be able to reconstruct the input.

Encoding:

    p = g(W_e [c1; c2] + b_e)

Decoding:

    [c1'; c2'] = g(W_d p + b_d)

Reconstruction error:

    ||[c1'; c2'] − [c1; c2]||²

E.g., W_d(W_e([red; moon])) should be close to [red; moon]
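The encode/decode step and its reconstruction error can be sketched as follows (dimensions and random initialization are placeholders; a real RAE would train these weights to minimize the error):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
We = rng.normal(scale=0.1, size=(n, 2 * n)); be = np.zeros(n)      # encoder
Wd = rng.normal(scale=0.1, size=(2 * n, n)); bd = np.zeros(2 * n)  # decoder
g = np.tanh

def encode(c1, c2):
    """Compose two child vectors into one parent vector."""
    return g(We @ np.concatenate([c1, c2]) + be)

def reconstruction_error(c1, c2):
    """How badly the parent vector reconstructs its children."""
    p = encode(c1, c2)
    rec = g(Wd @ p + bd)                  # [c1'; c2']
    target = np.concatenate([c1, c2])
    return np.sum((rec - target) ** 2)

red, moon = rng.normal(size=n), rng.normal(size=n)
err = reconstruction_error(red, moon)     # objective to minimize in training
```

Training would adjust We, be, Wd, bd by gradient descent so that this error, summed over all tree nodes, is small.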
Socher et al 2011
At sentence level
I Compute vector representations for all nodes in the parse tree (using unfolding RAE, embeddings as word input representations)
I Dynamic pooling for computing sentence similarities taking all nodes (words and phrases) into account

Evaluation
MSR Paraphrase corpus:
(1) The lies and deceptions from Saddam have been well documented for over 12 years.
(2) It has been all documented over 12 years of lies and deception from Saddam.
Dynamic pooling for paraphrase detection
I Given binary-branching parses of sentences of differing lengths n and m, first compute a matrix S of size (2n − 1) × (2m − 1) with all possible pairwise Euclidean distances between all nodes of both sentences.
I Superimpose a fixed-size n_p × n_p grid over the cells of S and pick the min of the values in each grid section as the number to be entered in the corresponding cell of the pooled S_p matrix.
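The min-pooling of a variable-sized similarity matrix into a fixed n_p × n_p grid can be sketched as follows (a simplified version: it lets `np.array_split` distribute leftover rows/columns, whereas the paper assigns the extras to the last windows):

```python
import numpy as np

def dynamic_min_pool(S, n_p):
    """Min-pool a variable-sized matrix S into a fixed n_p x n_p grid."""
    rows = np.array_split(np.arange(S.shape[0]), n_p)
    cols = np.array_split(np.arange(S.shape[1]), n_p)
    pooled = np.empty((n_p, n_p))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = S[np.ix_(r, c)].min()   # min within each window
    return pooled

# e.g. sentences of lengths 4 and 3 give a (2*4-1) x (2*3-1) = 7 x 5 matrix
S = np.arange(35, dtype=float).reshape(7, 5)
pooled = dynamic_min_pool(S, 3)
```

Whatever the sentence lengths, the pooled matrix always has the same shape, so it can feed a standard classifier.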
Dynamic pooling
2.4 Deep Recursive Autoencoder
Both types of RAE can be extended to have multiple encoding layers at each node in the tree. Instead of transforming both children directly into parent p, we can have another hidden layer h in between. While the top layer at each node has to have the same dimensionality as each child (in order for the same network to be recursively compatible), the hidden layer may have arbitrary dimensionality. For the two-layer encoding network, we would replace Eq. 1 with the following:

    h = f(W_e^(1) [c1; c2] + b_e^(1))    (7)
    p = f(W_e^(2) h + b_e^(2))           (8)

2.5 RAE Training
For training we use a set of parse trees and then minimize the sum of all nodes' reconstruction errors. We compute the gradient efficiently via backpropagation through structure [13]. Even though the objective is not convex, we found that L-BFGS run with mini-batch training works well in practice. Convergence is smooth and the algorithm typically finds a good locally optimal solution.

After the unsupervised training of the RAE, we demonstrate that the learned feature representations capture syntactic and semantic similarities and can be used for paraphrase detection.

3 An Architecture for Variable-Sized Similarity Matrices
Now that we have described the unsupervised feature learning, we explain how to use these features to classify sentence pairs as being in a paraphrase relationship or not.
Figure 3: Example of the dynamic min-pooling layer finding the smallest number in a pooling window region of the original similarity matrix S.
3.1 Computing Sentence Similarity Matrices
Our method incorporates both single word and phrase similarities in one framework. First, the RAE computes phrase vectors for the nodes in a given parse tree. We then compute Euclidean distances between all word and phrase vectors of the two sentences. These distances fill a similarity matrix S as shown in Fig. 1. For computing the similarity matrix, the rows and columns are first filled by the words in their original sentence order. We then add to each row and column the nonterminal nodes in a depth-first, right-to-left order.

Simply extracting aggregate statistics of this table such as the average distance or a histogram of distances cannot accurately capture the global structure of the similarity comparison. For instance, paraphrases often have low or zero Euclidean distances in elements close to the diagonal of the similarity matrix. This happens when similar words align well between the two sentences. However, since the matrix dimensions vary based on the sentence lengths one cannot simply feed the similarity matrix into a standard neural network or classifier.

3.2 Dynamic Pooling
Consider a similarity matrix S generated by sentences of lengths n and m. Since the parse trees are binary and we also compare all nonterminal nodes, S ∈ R^((2n−1)×(2m−1)). We would like to map S into a matrix S_pooled of fixed size, n_p × n_p. Our first step in constructing such a map is to partition the rows and columns of S into n_p roughly equal parts, producing an n_p × n_p grid.¹ We then define S_pooled to be the matrix of minimum values of each rectangular region within this grid, as shown in Fig. 3.

The matrix S_pooled loses some of the information contained in the original similarity matrix but it still captures much of its global structure. Since elements of S with small Euclidean distances show that

¹The partitions will only be of equal size if 2n − 1 and 2m − 1 are divisible by n_p. We account for this in the following way, although many alternatives are possible. Let the number of rows of S be R = 2n − 1. Each pooling window then has ⌊R/n_p⌋ many rows. Let M = R mod n_p be the number of remaining rows. We then evenly distribute these extra rows to the last M window regions, which will have ⌊R/n_p⌋ + 1 rows. The same procedure applies to the number of columns for the windows. This procedure will have a slightly finer granularity for the single word similarities, which is desired for our task since word overlap is a good indicator for paraphrases. In the rare cases when n_p > R, the pooling layer needs to first up-sample. We achieve this by simply duplicating pixels row-wise until R ≥ n_p.
I Since S_pooled has the same size for all sentence pairs, the values in its cells can be used as input to a supervised classifier
[Socher et al. 2011, Fig. 3]
Socher et al 2011. Results
Paraphrase detection results
    model                          Acc.
    all paraphrase baseline        66.5
    unfolding RAE                  72.6
    unfolding RAE + extra feats.   76.8
    interannotator agreement       83.0
Extra features
I Overlap, sentences contain same numbers, sentence length difference, ...
Matrix-Vector Recursive Neural Networks
Socher et al 2012
I Each word is represented both as a vector and a matrix(lexicalised model)
Training and evaluation
Trained to predict a distribution over sentiment of movie reviews and over semantic relations.

See the "New directions in vector space models of meaning" ACL 2014 tutorial for more on these models.
Comparison of Compositional DSM models
Blacoe and Lapata 2012

Compare

I Different types of input vector representations
I Different composition models (addition, multiplication, recursive autoencoders (RAE - simplified Socher 2011))

Evaluation
I Phrase similarity: Adj-Noun, Noun-Noun, Verb-Object
I Paraphrase detection
Comparison of Compositional DSM models
Phrase similarity results

Rank correlation

            Adj-N   N-N    V-Obj
    Add      0.37   0.38    0.28
    Mult     0.48   0.50    0.35
    RAE      0.34   0.29    0.32

I Input vectors: traditional distributional (count) vectors
I Shallow approaches are very good
Comparison of Compositional DSM models
Paraphrase detection

Classifier features: cosine of sentence vectors, sentence vectors, word overlap, ...

Classification accuracy

    Add            73.5
    Mult           73.0
    RAE            73.2
    Socher 2011*   74.2
    Socher 2011    76.8

* without dynamic pooling
I Best results for each method obtained with different types of input vectors and different combinations of features
I Shallow approaches are very good, again!
Evaluation and extensions of compositional distributional models
Outline
Scaling up to sentences: The SICK dataset and SemEval-2014 Task 1
Measuring phrase plausibility
Composition in morphology
Phrase generation
Here comes SICKness
I Sentences Involving Compositional Knowledge (or something. . . )
I Marco Marelli, Marco Baroni, Raffaella Bernardi, Roberto Zamparelli (UNITN), Stefano Menini, Luisa Bentivogli (FBK)
I http://clic.cimec.unitn.it/composes/sick.html
I Used for SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment
I http://alt.qcri.org/semeval2014/task1
I SICK slides courtesy of Marco Marelli
How to test compositional distributional semantic models at the sentence level?

I Existing datasets are not ideally suited to test linguistically-motivated semantic models
I either limited in size, variety of the represented phenomena, and complexity of the used sentences (e.g., Mitchell and Lapata, 2008):
The fire burned
The face burned
I or focusing on aspects that these models are not meant to handle (e.g., STS, RTE):
Deaths in Ukraine and Poland in freezing Europe weather
Ukraine: Europe hit by deadly cold snap
SICK: a new data set
I Not requiring to deal with "irrelevant" aspects of existing sentential data sets
I named entities:
Poland, Obama
I world knowledge:
Poland is a European country
Obama is the USA president
I "multiword expressions":
cold snap
League of Nations
I telegraphic language:
Ukraine: Europe hit by deadly cold snap
SICK
I Rich in phenomena of intrinsic linguistic interest
I contextual synonymy and other lexical variation phenomena:
The girl is bright
The sun is bright
I active/passive and other syntactic alternations:
The turtle is following the fish
The turtle is followed by the fish
I negation and other grammatical phenomena:
A boy is singing in the park
There is no boy singing in the park
The SemEval task
I Two subtasks:
I predicting the degree of semantic relatedness between two sentences (1-5 scale)
I detecting the entailment relation holding between two sentences (ENTAILMENT, CONTRADICTION, NEUTRAL)
Data set building
I Starting from the materials of two existing data sets (8K ImageFlickr and SemEval-2012 STS MSR-Video Description)
I normalization: removing unwanted phenomena
A woman is playing Mozart → A woman is playing classical music
I expansion: enriching the existing sentences with under-represented aspects
The surfer is riding a big wave → A surfer is riding a big wave
A woman is using a sewing machine → A woman is working on a machine used for sewing
I pairing: creating the final sentence pairs
Data set building
I Pairs of sentences with similar meanings
A man is driving a car
The car is being driven by a man
Data set building
I Pairs of sentences with opposed/contrasting meanings
The girl is spraying the plants with water
The boy is spraying the plants with water
Data set building
I Pairs of sentences with similar vocabulary and different meanings
Two children are lying in the snow and are making snow angels
Two angels are making snow on the lying children
Data set building
I Pairs of sentences with no apparent relation
A sea turtle is hunting for fish
A young woman is playing the guitar
Data set building
I Procedure aimed at generating a balanced distribution of possible sentence relations
I Eventually, about 10k sentence pairs
I half of them released as training set for the SemEval-2014 evaluation
I half of them used as test set
Gold standard
I Each pair was evaluated in a crowdsourcing study
Gold standard
I Double annotation task
I Final gold scores for each pair:
I Relatedness subtask: average of the obtained judgments
I Entailment subtask: majority vote of the obtainedjudgments
SemEval-2014 systems
I 21 participating systemsI Relatedness subtask: 17 submissions (66 runs)I Entailment subtask: 18 submissions (65 runs)
Results: relatedness subtask
System Pearson rECNU .8280StanfordNLP .8272The_Meaning_Factory .8268UNAL-NLP .8043Illinois-LH .7993CECL_ALL .7804...others .7802–.4795Word Overlap 0.63
Results: entailment subtask
System AccuracyIllinois-LH 84.58 %ECNU 83.64 %UNAL-NLP 83.05 %SemanticKLUE 82.32 %The_Meaning_Factory 81.59 %CECL_ALL 79.99 %...others 79.66%–48.73%Majority 56.7%Word Overlap 56.2%Chance 33.3%
Approaches
I All participating systems combine a number of differentmethods
I Distributional similarity, compositional or notI External resources (mostly WordNet, some paraphrase
corpora)I Supervised machine learning (SVMs)I Ad hoc features, exploiting properties of the dataset (e.g.,
using negative words as cue for contradictory pairs)
Outline
Scaling up to sentences: The SICK dataset and SemEval-2014Task 1
Measuring phrase plausibility
Composition in morphology
Phrase generation
Proximity, density and entropy
!"#$%"&'()"#*)+&")+
#%!',)+&")+ )+&")+
!"#$%"&'()"#*)+&")+
#%!',)+&")+
!" !# !$ !% !& !'
()*+,-./0-.
*0(1)0/+2-0(3,-./0-.
Measuring adjective-noun phrase plausibility
I Among phrases unattested in large corpus, distinguish theones that make sense (vulnerable gunman, huge joystick,academic crusade) from those that do not (academicbladder, parliamentary tomato, blind pronunciation)
I Proximity of composed vector to phrase is most stablepredictor of phrase acceptability (Vecchi et al. submitted)
I Composed-vector plausibility measures can besuccessfully used to predict right bracketing of nounphrases (miracle [home run] vs. [miracle home] run)(Lazaridou et al. EMNLP 2013)
Nearest neighbours of deviant/plausiblecomposed AN vectorsVecchi et al. submitted
*mathematical biscuit mathematical, mathematical result,mathematical system
*moral protein moral, moral system, morality*empty fungus empty tin, empty packet, empty container
continuous uprising constant warfare, constant conflict,continuous war
diverse farmland diverse wildlife, varied habitat, rare floraimportant coordinator instrumental role, integral role, significant role
Predicting the recursive behaviour of adjectivesVecchi et al. EMNLP 2013
I “Flexible” (daily national newspaper/national dailynewspaper) vs. “rigid” (rapid social change/*social rapidchange) adjective-adjective-noun phrases
I Proximity of recursively-constructed phrase vector tocomponents predicts if phrase is flexible or rigid, andcorrect order for rigid phrases
I Adjective nearest to noun has strong effect on phrasemeaning in rigid constructions
Outline
Scaling up to sentences: The SICK dataset and SemEval-2014Task 1
Measuring phrase plausibility
Composition in morphology
Phrase generation
Functional composition in morphologyLazaridou et al. ACL 2013, Marelli and Baroni submitted;see also Luong et al. CoNLL 2013
I Affixes as functions from stems to derived forms:
~redo = RE× ~do
I Affix matrices learned with least-squares techniques fromcorpus-extracted stem/derived vector pair examples(try/retry, climb/reclimb, open/reopen)
A look at composed derived forms
word nearest neighbours
carve+er potter, engraver, goldsmithbroil+er oven, stove, cooking, kebab, done
inter+ment cremate, disinter, burial, entombment, funeralequip+ment maintenance, servicing, transportable, deploy
re+issue original, expanded, long-awaitedre+touch repair, refashion, reconfigurere+sound reverberate, clangorous, echo
A look at composed derived forms
word nearest neighbourstype keyword, subtype, parsetype+ify embody, characterize, essentially
column arch, pillar, bracket, numericcolumn+ist publicist, journalist, correspondent
cello+ist flutist, virtuoso, quintetentomology+ist zoologist, biologist, botanistpropaganda+ist left-wing, agitator, dissidentrape+ist extortionist, bigamist, pornographer, pimp
A look at composed derived forms
word nearest neighbours
industry+al environmental, land-use, agricultureindustry+ous frugal, studious, hard-working
nervous anxious, excitability, panickynerve+ous bronchial, nasal, intestinal
Plausibility measures for morphology
!"#$%!"&'
!"#$%!()&&!"#$%!
!"#$%$&&
'()$%$&&
!" !# !$ !% !& !'
()*+,-.,**
*.)!/0*-
Predicting nonce derived-form acceptabilityMarelli and Baroni submitted
I Subject ratings of unattested formsI Acceptable: sketchable, hologramistI Unacceptable: happenable, windowist
I Plausibility measures applied to vectors derived byfunctional composition
I Entropy and proximity (non-linear effect) predict subjectintuitions (both p < 0.001 in mixed model regression)
Outline
Scaling up to sentences: The SICK dataset and SemEval-2014Task 1
Measuring phrase plausibility
Composition in morphology
Phrase generation
Phrase generation in distributional semanticsDinu and Baroni, ACL 2014
What for?
I Paraphrase generation, machine translationI From different modalities to language:
I Caption generationI “Read" brain activation patterns
Generation through decomposition functions
1. Decomposition[~u;~v ] = fdecompR(~p)
fdecompR : Rd → Rd × Rd
~p ∈ Rd phrase vector~u, ~v ∈ Rd output vectors
2. Nearest neighbor queryword = NNLex(~u)
Learning decomposition functions
Learn from corpus observed phrase vectorsExample: fdecompAN
−−−−→red.car [
−−→red ; −→car ]−−−−−−→
large.man → [−−−→large ;−−→man ]−−−−−−→
red.dress [−−→red ;
−−−→dress ]
Why corpus-observed phrase vectors?
I Polysemous behaviour of words: green jacket vs. greenpolitician
Learning decomposition functions
SpecificallyLearn linear transformations WR ∈ R2d×d
fdecompR(~p) = WR~p
arg minWR∈R2d×d
‖[U;V ]−WRP‖2F
Experiment 1: Paraphrasing
Noun to Adj-Noun paraphrasing
fallacy false beliefcharisma personal appealbinary double star
Adj-Noun to Noun-Prep-Noun dataset
pre-election promises promises before electioninexperienced user user without experience
Experiment 1: Noun to Adj-Noun
Decompose ANfdecompAN fallacy→[false ; belief]
Experiment 1: Adj-Noun to Noun-Prep-Nounparaphrasing
Compose AN - decompose NPN
fcompAN [pre-election ; promises] → ~pfdecompAN
~p → [~u ; promises]fdecompPN
~u → [before ; election]
Results
N-AN
Adjective NounNN Gen NN Gen
Median rank 168 67 60 29
I NN: Nearest Adj and Noun neighbours of input Noun
AN-NPN
Noun Prep NounPrec@1 0.98 0.08 0.13Median rank 1 5.5 20.5
I Search space: 20,000 Adjs and 20,000 Nouns, 14prepositions
Experiment 1: Examples
Noun → Adj Noun Gold
3 thunderstorm thundery storm electrical storm3 reasoning deductive thinking abstract thought3 jurisdiction legal authority legal power3 folk local music common people
7 superstition old-fashioned religion superstitious notion7 vitriol political bitterness sulfuric acid7 zoom fantastic camera rapid growth
I Better evaluation setting?I Most errors: related phrases instead of paraphrases
Experiment 1: Examples
Noun → Adj Noun Gold
3 thunderstorm thundery storm electrical storm3 reasoning deductive thinking abstract thought3 jurisdiction legal authority legal power3 folk local music common people7 superstition old-fashioned religion superstitious notion7 vitriol political bitterness sulfuric acid7 zoom fantastic camera rapid growth
I Better evaluation setting?I Most errors: related phrases instead of paraphrases
Experiment 1: Examples
Adj Noun Noun Prep Noun Gold
3 mountainous region region in highlands region with mountains3 undersea cable cable through ocean cable under sea? inter-war years years during 1930s years between wars7 post-operative pain pain through patient pain after operation7 superficial level level between levels level on surface
Cross-lingually
Map vectors from one language to another
I Use seed lexicon to learn a Lang1 → Lang2 mappingI Map arbitrary vectors
Experiment 2: Italian to English Adj-Noun translation
Compose AN - Translate - Decompose AN
fcompAN [spectacular ; woman] → ~pEnfEn→It ~pEn → ~pItfdecompAN
~pIt → [donna ; affascinante]
Experiment 2: Italian to English Adj-Noun translation
Compose AN - Translate - Decompose AN
I 1000 pairs from a movie subtitles phrase tableI It AN - En AN datasetI dictionary of size 5000 for learning a fEn→It map
spectacular woman donna affascinantevicious killer killer pericoloso
Experiment 2: Results
En→It It→EnThr. Accuracy Coverage Accuracy Coverage0.00 0.21 100% 0.32 100%0.55 0.25 70% 0.40 63%0.60 0.31 32% 0.45 37%0.65 0.45 9% 0.52 16%
I Accuracy when using generation confidence (cosine tonearest neighbour) as a threshold.
I Chance accuracy: 120,0002
Experiment 2: Examples
English → Italian
3 black tie cravatta nera3 indissoluble tie alleanza indissolubile3 vicious killer assasino feroce (killer pericoloso)3 rough neighborhood zona malfamata (ill-repute zone)7 mortal sin pecato eterno (eternal sin)7 canine star stella stellare (stellar star)
I Nice disambiguation effectI Some translations judged as more natural than the gold
standard
Object recognition in computer vision
Leverage language data to improve object recognitionZero-shot learning (Frome at al 2013, Socher et al 2013, Lazaridou et al 2014)
I Train a cross-modal map from visual to linguistic spaceI Tag an image of a previously unseen objects
How about attributes?Lazaridou, Dinu, Liška, and Baroni - in preparation
ObservationMapped vectors are closer to Adj-Noun phrase vectors
ObservationMapped vectors are closer to Adj-Noun phrase vectors
−200 −150 −100 −50 0 50 100 150 200−200
−150
−100
−50
0
50
100
150
cold butter
fresh strawberry
vintage champagne
dark rum
liqueur
herbal liqueur
yellow
white wine
mixed fruit tall glass
single bottle
only fruitcold milk
lemon
natural yogurt
smooth paste
purple
new orange
orange zestorange
irish cream
small cup dark beer
iced coffee
sparkling wineyoung wine
citrus
regional wine
dark yellow
clear honey
light orange
brandy
black institution
mixed herb
rich chocolate
cherry juice
other orange
more orange
vibrant red
mobile visit
first orange
sweet wine
orange liqueur
crushed ice
white rum
Image as visual phrase
I Gen - Generate Adj-Noun phrase from image projectionI NN - Nearest neighbour baseline
Image Rank Nearest NeighboursNN - crispy hyraxA:448 crispy, edible, tasty, delicious, crunchyN: 50 hyrax, fish, pangolin, dugong, bovidGen - wild fishA:340 wild, dark, edible, wet, saltyN: 27 fish, meat, animal, rabbit, bovidNN - yummy sauteA: 43 yummy, fluffy,cuddly, tasty, furryN:200 saute, food coloring, ramekin, tiramisuGen - fluffy rabbitA: 3 fluffy, white, pink, cuddly,crunchyN:144 rabbit, chinchilla, meat, cat, fish