Composition in distributional semantics
Marco Baroni and Georgiana Dinu
Center for Mind/Brain Sciences, University of Trento
ESSLLI 2014
In collaboration with: Roberto Zamparelli, Raffaella Bernardi, Marco Marelli, Denis Paperno, Jiming Li, Eva Maria Vecchi, Nghia The Pham, Germán Kruszewski, Angeliki Lazaridou, Edward Grefenstette, Fabio Zanzotto, Louise McNally, Gemma Boleda. . .
Plan for the course
Day 1 General introduction: background, motivation, simple compositional models
Day 2 Compositional models: structure and estimation
Day 3 Empirical evaluation
Day 4 Practice with the DISSECT toolkit:
http://clic.cimec.unitn.it/composes/toolkit
Day 5 Conclusion: open issues, broader implications, discussion
Outline
Distributional semantics
Towards a compositional distributional semantics
Simple composition methods
The vastness of word meaning
The distributional hypothesis
Harris, Charles and Miller, Firth, Wittgenstein? . . .

The meaning of a word is (can be approximated by, derived from) the set of contexts in which it occurs in texts

We found a little, hairy wampimuk sleeping behind the tree
See also MacDonald & Ramscar CogSci 2001
Distributional semantic models in a nutshell
I Represent words through vectors recording their co-occurrence counts with context elements in a corpus
I (Optionally) apply a re-weighting scheme to the resulting co-occurrence matrix
I (Optionally) apply dimensionality reduction techniques to the co-occurrence matrix
I Measure geometric distance of word vectors in "distributional space" as a proxy to semantic similarity/relatedness
Co-occurrence
    he curtains open and the moon shining in on the barely
    ars and the cold , close moon " . And neither of the w
    rough the night with the moon shining so brightly , it
    made in the light of the moon . It all boils down , wr
    surely under a crescent  moon , thrilled by ice-white
    sun , the seasons of the moon ? Home , alone , Jay pla
    m is dazzling snow , the moon has risen full and cold
    un and the temple of the moon , driving out of the hug
    in the dark and now the  moon rises , full and amber a
    bird on the shape of the moon over the trees in front
    But I could n't see the  moon or the stars , only the
    rning , with a sliver of moon hanging among the stars
    they love the sun , the  moon and the stars . None of
    the light of an enormous moon . The plash of flowing w
    man 's first step on the moon ; various exhibits , aer
    the inevitable piece of  moon rock . Housing The Airsh
    oud obscured part of the moon . The Allied guns behind
Extracting co-occurrence counts
Variations in the type of context features
Documents as contexts:

              Doc1   Doc2   Doc3
    stars       38     45      2

Syntactic dependencies as contexts:

            dobj←see   mod→bright   mod→shiny
    stars         38           45          44

Lexical patterns as contexts:

            "The nearest • to Earth"   "stories of • and their"
    stars                         12                         10
Extracting co-occurrence counts
Variations in the definition of co-occurrence
E.g.: Co-occurrence with words, window of size 2, scaling by distance to target:

... two [intensely bright stars in the] night sky ...

            intensely   bright   in   the
    stars         0.5        1    1   0.5
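The window-based counting scheme in this example can be sketched as follows (the 1/distance scaling is inferred from the example values above; the function name is illustrative):

```python
from collections import defaultdict

def window_counts(tokens, target, size=2):
    """Toy window-based co-occurrence counts with 1/distance scaling."""
    counts = defaultdict(float)
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # look at neighbours within `size` positions on each side
        for j in range(max(0, i - size), min(len(tokens), i + size + 1)):
            if j != i:
                counts[tokens[j]] += 1.0 / abs(j - i)
    return dict(counts)

toks = "two intensely bright stars in the night sky".split()
print(window_counts(toks, "stars"))
# {'intensely': 0.5, 'bright': 1.0, 'in': 1.0, 'the': 0.5}
```

The output reproduces the table above: immediate neighbours get weight 1, neighbours at distance 2 get weight 0.5.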
From co-occurrences to vectors
            bright   in   sky
    stars        8   10     6
    sun         10   15     4
    dog          2   20     1
Weighting
Re-weight the counts using corpus-level statistics to reflect co-occurrence significance
Point-wise Mutual Information (PMI)
    PMI(target, ctxt) = log [ P(target, ctxt) / (P(target) P(ctxt)) ]
Weighting
Adjusting raw co-occurrence counts:
            bright      in
    stars      385   10788   ...   ← Counts
    stars     43.6     5.3   ...   ← PMI
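A minimal sketch of PMI re-weighting over a whole count matrix (the toy counts are illustrative, not the ones behind the table above):

```python
import numpy as np

# Toy count matrix: rows are target words, columns are contexts.
counts = np.array([[385.0, 10788.0],
                   [200.0,   9000.0]])

def pmi(counts):
    """PMI(target, ctxt) = log P(t,c) / (P(t) P(c)), estimated from counts."""
    total = counts.sum()
    p_joint = counts / total
    p_t = p_joint.sum(axis=1, keepdims=True)   # marginal P(target)
    p_c = p_joint.sum(axis=0, keepdims=True)   # marginal P(context)
    with np.errstate(divide="ignore"):         # zero counts give -inf
        return np.log(p_joint / (p_t * p_c))

weights = pmi(counts)
```

When target and context are independent, the ratio is 1 and the PMI is 0; positive values signal significant co-occurrence.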
Other weighting schemes:
I TF-IDF
I Local Mutual Information
I Dice
See Ch4 of J.R. Curran's thesis (2004) and S. Evert's thesis (2007) for great surveys
Dimensionality reduction
I Vector spaces often range from tens of thousands to millions of context dimensions
I Some of the methods to reduce dimensionality:
I Select context features based on various relevance criteria
I Random indexing
I Claimed to also have a beneficial smoothing effect
I Singular Value Decomposition
I Non-negative matrix factorization
I Probabilistic Latent Semantic Analysis
I Latent Dirichlet Allocation
From geometry to similarity in meaning
[plot: "stars" and "sun" as vectors in a 2-d space]

Vectors

    stars   2.5   2.1
    sun     2.9   3.1
Cosine similarity
    cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖)
              = Σ_{i=1}^{n} x_i y_i / ( √(Σ_{i=1}^{n} x_i²) √(Σ_{i=1}^{n} y_i²) )
Other similarity measures: Euclidean Distance, Dice, Jaccard,Lin. . .
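The cosine formula above, applied to the two toy vectors from the previous slide:

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity: inner product normalized by vector lengths."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

stars = np.array([2.5, 2.1])
sun = np.array([2.9, 3.1])

print(cosine(stars, sun))  # close to 1: the vectors point in similar directions
```

Since cosine ignores vector length, frequent and rare words are compared on the direction of their context profiles only.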
Other ways to obtain vectors
"Word embeddings"

Build vectors optimizing some context prediction objective.

Bengio et al. JMLR 2003, Collobert and Weston ICML 2008, Turian et al. ACL 2010, Mikolov et al. NIPS 2013, systematic evaluation by Baroni et al. ACL 2014
Geometric neighbours ≈ semantic neighbours
    rhino        fall          good        sing
    woodpecker   rise          bad         dance
    rhinoceros   increase      excellent   whistle
    swan         fluctuation   superb      mime
    whale        drop          poor        shout
    ivory        decrease      improved    sound
    plover       reduction     perfect     listen
    elephant     logarithm     clever      recite
    bear         decline       terrific    play
    satin        cut           lucky       hear
    sweatshirt   hike          smashing    hiss
The TOEFL synonym match task
I 80 items
I Target: levied
Candidates: imposed, believed, requested, correlated
I In semantic space, measure angles between target and candidate context vectors, pick the candidate with the narrowest angle to the target
Human performance on the synonym match task
I Average foreign test taker: 64.5%
I Macquarie University staff (Rapp 2004):
I Average of 5 non-natives: 86.75%
I Average of 5 natives: 97.75%
Distributional Semantics takes the TOEFL
I Humans:
I Foreign test takers: 64.5%
I Macquarie non-natives: 86.75%
I Macquarie natives: 97.75%
I Machines:
I Classic LSA: 64.4%
I Padó and Lapata's dependency-based model: 73%
I Rapp's 2003 SVD-based model trained on lemmatized BNC: 92.5%
I Bullinaria and Levy's 2012 SVD-based model trained on ukWaC: 100% (in extensive search of parameter space)
Distributional semantics: A general-purpose representation of lexical meaning
Baroni and Lenci 2010
I Similarity (cord-string vs. cord-smile)
I Synonymy (zenith-pinnacle)
I Concept categorization (car ISA vehicle; banana ISA fruit)
I Selectional preferences (eat topinambur vs. *eat sympathy)
I Analogy (mason is to stone like carpenter is to wood)
I Relation classification (exam-anxiety are in CAUSE-EFFECT relation)
I Qualia (TELIC ROLE of novel is to entertain)
I Salient properties (car-wheels, dog-barking)
I Argument alternations (John broke the vase - the vase broke, John minces the meat - *the meat minced)
Selectional preferences in semantic space
Padó et al. EMNLP 2007

To kill. . .

    object         cosine      with         cosine
    kangaroo       0.51        hammer       0.26
    person         0.45        stone        0.25
    robot          0.15        brick        0.18
    hate           0.11        smile        0.15
    flower         0.11        flower       0.12
    stone          0.05        antibiotic   0.12
    fun            0.05        person       0.12
    book           0.04        heroin       0.12
    conversation   0.03        kindness     0.07
    sympathy       0.01        graduation   0.04
Distributional semantics: some references
I Classics:
I Schütze's 1997 CSLI book
I Landauer and Dumais PsychRev 1997
I Griffiths et al. PsychRev 2007
I Overviews:
I Turney and Pantel JAIR 2010
I Erk LLC 2012
I Clark, to appear in Handbook of Contemporary Semantics
I Evaluation:
I Sahlgren's 2006 thesis
I Bullinaria and Levy BRM 2007, 2012
I Baroni, Dinu and Kruszewski ACL 2014
I Kiela and Clark CVSC 2014
Outline
Distributional semantics
Towards a compositional distributional semantics
Simple composition methods
The infinity of sentence meaning
Compositionality
The meaning of an utterance is a function of the meaning of its parts and their composition rules (Frege 1892)
Compositionality in formal semantics
Some donkey flies

[some(donkey)](flies) = FALSE

    some : <<e,t>,<<e,t>,t>>    donkey : <e,t>    flies : <e,t>
    some(donkey) : <<e,t>,t>
    [some(donkey)](flies) : t
E.g., Heim and Kratzer 1998
A compositional distributional semantics for phrases and sentences?
Mitchell and Lapata ACL 2008, CogSci journal 2010. . .
                          planet   night   full   blood   shine
    moon                      10      22     43       3      29
    red moon                  12      21     40      20      28
    the red moon shines       11      23     21      15      45
The unavoidability of distributional representations of phrases
What can you do with distributional representations of phrases and sentences?

NOT this: [garbled figure: a logic-style translation of a sentence]
What can you do with distributional representations of phrases and sentences?
Paraphrasing

[plot of sentence vectors in 2 dimensions: "cookie dwarfs hop under the crimson planet" lies near "gingerbread gnomes dance under the red moon", and far from "red gnomes love gingerbread cookies" and "students eat cup noodles"]
What can you do with distributional representations of phrases and sentences?
Contextual disambiguation
What can you do with distributional representations of phrases and sentences?
Semantic acceptability
Compositional distributional semantics
How?
                 planet   night   space   color   blood   brown
    red            15.3     3.7     2.2    24.3    19.1    20.2
    moon           24.3    15.2    20.1     3.0     1.2     0.5
    red+moon       39.6    18.9    22.3    27.3    20.3    20.7
    red⊙moon      371.8    56.2    44.2    72.9    22.9    10.1
    red(moon)      24.6    19.3    12.4    22.6    23.9     7.1
Outline
Distributional semantics
Towards a compositional distributional semantics
Simple composition methods
(Weighted) additive model
Mitchell and Lapata 2010
~p = α~u + β~v
                                     music   solution   economy   craft   reasonable
    practical                            0          6         2      10            4
    difficulty                           1          8         4       4            0
    practical + difficulty               1         14         6      14            4
    0.4×practical + 0.6×difficulty     0.6        7.2       3.2     6.4          1.6
Multiplicative model
Mitchell and Lapata 2010
~p = ~u � ~v
pi = uivi
                                     music   solution   economy   craft   reasonable
    practical                            0          6         2      10            4
    difficulty                           1          8         4       4            0
    practical + difficulty               1         14         6      14            4
    0.4×practical + 0.6×difficulty     0.6        7.2       3.2     6.4          1.6
    practical ⊙ difficulty               0         48         8      40            0
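The weighted additive and multiplicative models can be sketched directly over the table's vectors:

```python
import numpy as np

practical = np.array([0.0, 6.0, 2.0, 10.0, 4.0])
difficulty = np.array([1.0, 8.0, 4.0, 4.0, 0.0])

def weighted_add(u, v, alpha=1.0, beta=1.0):
    """(Weighted) additive model: p = alpha*u + beta*v."""
    return alpha * u + beta * v

def mult(u, v):
    """Multiplicative model: p_i = u_i * v_i (component-wise product)."""
    return u * v

combined = weighted_add(practical, difficulty, 0.4, 0.6)
product = mult(practical, difficulty)
```

Note how multiplication acts as a feature intersection: dimensions on which either word has a zero count are zeroed out in the phrase vector.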
Composition as dimension averaging/mixing
I red moon
I red face? fake gun?? buy guns??
I the car??? of the car????
I pandas eat bamboo vs. bamboo eats pandas???
I lice and dogs vs. lice on dogs???
I some children walk happily in the valley of the moon????
Composition as (distributional) function application
Coecke+Clark+Grefenstette+Sadrzadeh et al., COMPOSES
Related: Guevara, Socher et al., Zanzotto et al.
Composition as (distributional) function application
[plots: a 2-d space with context dimensions "runs" and "barks"; the left panel shows the vectors dog, old and old + dog; the right panel shows the function old() mapping dog to old(dog)]
Outline
Composition as function application
Related models and general estimation of composition models
Neural network methods for composition
Function application in vector space
Baroni and Zamparelli’s EMNLP 2010 proposal
I Adjectives are linear functions
I Nouns are vectors
I Linear functions are matrices, function application is function-by-vector multiplication:
−→c = A−→n
Linear composition: function application as matrix-by-vector multiplication
−→c = A−→n
    | c1 |     | a11  a12  · · ·  a1m |   | n1 |
    | c2 |  =  | a21  a22  · · ·  a2m | × | n2 |
    | ·· |     | ···  ···  · · ·  ··· |   | ·· |
    | cm |     | am1  am2  · · ·  amm |   | nm |
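In code, the adjective is just a learned matrix applied to the noun vector (toy 3-d numbers, all invented for illustration):

```python
import numpy as np

# Toy 3-dimensional space: the adjective "red" as a matrix,
# the noun "moon" as a vector (all values illustrative).
A_red = np.array([[0.5, 0.1, 0.0],
                  [0.2, 1.0, 0.3],
                  [0.0, 0.4, 0.9]])
moon = np.array([24.3, 15.2, 20.1])

# Function application = matrix-by-vector multiplication
red_moon = A_red @ moon
```

Unlike addition or multiplication, the output dimensions here are learned mixtures of the input dimensions, so the adjective can reweight and redistribute the noun's features.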
Corpus-observed phrase vectors
    n and the moon shining i       f a large red moon , Campana
    with the moon shining s        , a blood red moon hung over
    rainbowed moon . And the       glorious red moon turning t
    crescent moon , thrille        The round red moon , she 's
    in a blue moon only , w        l a blood red moon emerged f
    now , the moon has risen       n rains , red moon blows , w
    d now the moon rises , f       monstrous red moon had climb
    y at full moon , get up        . A very red moon rising is
    crescent moon . Mr Angu        under the red moon a vampire

               shine   blood   Soviet
    moon         301      93        1
    red moon      11      90        0
    army           2     454       20
    red army       0      22       18
Learning to approximate corpus-observed phrase vectors
I Input:
I N: matrix of noun vectors
I C: matrix of adj-noun observed phrase vectors

    c1 = red.moon    n1 = moon
    c2 = red.army    n2 = army
    c3 = red.car     n3 = car

I Estimate:
I Matrix of the function word (A_red)

    [c1 c2 · · ·] ≈ A_red × [n1 n2 · · ·]

(each ci is an m-dimensional observed phrase vector, each ni the corresponding m-dimensional noun vector)
Learning to approximate corpus-observed phrase vectors
I Least squares regression problem
    A = arg min_A ||C − AN||²_F = arg min_A Σ_{i,j} (C_ij − (AN)_ij)²

    [c1 c2 · · ·] ≈ A × [n1 n2 · · ·]
I Closed form solution
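A minimal sketch of this estimation on synthetic data (the hidden matrix and the noise level are invented; `np.linalg.lstsq` gives the closed-form least-squares solution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: columns of N are noun vectors, columns of C the
# corpus-observed "red + noun" phrase vectors. Here C is generated
# from a hidden matrix plus noise so we can check recovery.
m, k = 4, 50                       # dimensions, training phrases
A_true = rng.normal(size=(m, m))
N = rng.normal(size=(m, k))
C = A_true @ N + 0.01 * rng.normal(size=(m, k))

# Solve A = argmin ||C - AN||_F^2.
# lstsq solves min ||N^T X - C^T||, whose solution X equals A^T.
A_hat = np.linalg.lstsq(N.T, C.T, rcond=None)[0].T
```

With enough training phrases per adjective, the estimated matrix closely recovers the generating one.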
From adjectives to higher valency relations
I How about two-place (or n-place) relations?
I Linear functions → matrices
I Multilinear functions → tensors
I Tensor of order n + 1 models functions with n arguments
I T: third-order tensor, R: binary relation

    R(a, b) = R(a)(b) = (T × ~a) × ~b

I ×: tensor contraction

    (T × ~a)_ij = Σ_k T_ijk a_k
    ((T × ~a) × ~b)_i = Σ_j (T × ~a)_ij b_j
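The two contractions above map directly onto `np.einsum` (tensor values are random placeholders, just to show the shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
V = rng.normal(size=(m, m, m))   # third-order tensor for a binary relation
a = rng.normal(size=m)           # first argument vector
b = rng.normal(size=m)           # second argument vector

# (T x a)_ij = sum_k T_ijk a_k  -> a matrix
T_a = np.einsum("ijk,k->ij", V, a)

# ((T x a) x b)_i = sum_j (T x a)_ij b_j  -> a vector
result = np.einsum("ij,j->i", T_a, b)
```

Each contraction consumes one argument and lowers the order by one: tensor → matrix → vector, mirroring curried function application.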
From adjectives to higher valency relations
I Transitive verbs:
    dogs chase cats = chase × ~cats × ~dogs

    ~c = V × ~n_obj × ~n_subj

I V: third-order tensor (cube)
I V × ~n_obj: second-order tensor (matrix)
I Composition as function application, connecting the distributional implementation to a general grammatical formalism: B. Coecke, M. Sadrzadeh and S. Clark 2010; E. Grefenstette, M. Sadrzadeh, S. Clark, B. Coecke and S. Pulman 2011
Recursive estimation of transitive verbs
Grefenstette, Dinu, Zhang, Sadrzadeh and Baroni 2013

[diagram: STEP 1 estimates VP matrices (EAT MEAT, EAT PIE) from subject vectors (dogs, cats, boys, girls; training input) and corpus-observed sentence vectors (dogs.eat.meat, cats.eat.meat, boys.eat.pie, girls.eat.pie; training output); STEP 2 estimates the EAT tensor (function to estimate) from object vectors (meat, pie) and the VP matrices obtained in STEP 1]
Evaluation
Similarity of small phrases/sentences, against human-assigned similarity scores.

I Subject-intransitive verb (Mitchell and Lapata 2008)

    SV1                SV2                  Sim.
    fire glow          fire burn               6
    fire glow          fire beam               1
    face glow          face burn               1
    face glow          face beam               1
    discussion stray   discussion digress      7
    child stray        child digress           2
Evaluation
I Subject-verb-object (Grefenstette et al 2011)

    SVO1                SVO2                   Sim.
    table show result   table express result      7
    map show location   map express location      1
Results
I Rank correlation (ρ) with human-assigned scores

    Comp. model   SV     SVO
    Humans        0.40   0.62
    Verb only     0.06   0.08
    Add           0.13   0.12
    Mult          0.19   0.23
    LexFunc       0.23   0.32

Problem
Too many parameters! Assume vectors of size 100:

I transitive verbs: 100³ parameters; adverbs: 100⁴ parameters

Paperno et al, ACL 2014

I Each word has a vector and a set of matrices. Each matrix models its functional behaviour in one argument at a time.
Outline
Composition as function application
Related models and general estimation of composition models
Neural network methods for composition
Related methods
Guevara 2010, Zanzotto et al 2010 propose similar ways to estimate a generic additive model

I Given two word vectors ~u and ~v
I Compute composed vector ~c:

~c = A~u + B~v

I Parameters: matrices A and B. They are not lexicalized: they are either common to all compositions or syntax-dependent, rather than tied to specific words.
Related methods
~c = A~u + B~v
Estimation
I Corpus-observed phrases (Guevara 2010)
I (Adj-Noun, Paraphrase) tuples (Zanzotto et al 2010), e.g. (close interaction, contact)
I Least squares regression problem

    ~c = [A, B] [~u; ~v]

    [A, B] = arg min ||C − [A, B] [U; V]||²_F
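A minimal sketch of this joint least-squares estimation (synthetic, noise-free data so the hidden matrices can be recovered exactly; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 4, 100
U = rng.normal(size=(m, k))        # first-word vectors (columns)
V = rng.normal(size=(m, k))        # second-word vectors (columns)
A_true = rng.normal(size=(m, m))
B_true = rng.normal(size=(m, m))
C = A_true @ U + B_true @ V        # observed phrase vectors (noise-free toy)

# Stack [U; V] and solve for the concatenated [A, B] by least squares
UV = np.vstack([U, V])                              # (2m, k)
AB = np.linalg.lstsq(UV.T, C.T, rcond=None)[0].T    # (m, 2m)
A_hat, B_hat = AB[:, :m], AB[:, m:]
```

Stacking the inputs turns the two-matrix problem into a single standard regression, which is why the model has a closed-form solution.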
Compositional Distributional Semantics Models (CDSMs). Big picture

From: Simple functions

    vec(very) + vec(good) + vec(movie) → vec(very good movie)
To: Complex composition operations
Socher at al. 2013
Learning CDSMs. Big picture
From: Distributional word vectors as input, no parameters

To: Various learning objectives

I Corpus-extracted phrase vectors
I Reconstruction error
I Sentiment prediction
I Cross-lingual signals
I ...
General estimation of CDSMs
Use corpus phrase vectors to learn the parameters of other composition functions:

Solve:

    θ* = arg min_θ ||P − f^comp_θ(U, V)||²_F

P / U, V: phrase / word occurrence matrices
General estimation of CDSMs
    Name      Composition function                              Parameters
    Add       w1·pink + w2·dogs                                 w1, w2 ∈ R
    Mult      pink^w1 ⊙ dogs^w2                                 w1, w2 ∈ R
    Dil       ||pink||²₂·dogs + (λ − 1)⟨pink, dogs⟩·pink        λ ∈ R
    Fulladd   W1·pink + W2·dogs                                 W1, W2 ∈ R^{m×m}
    Lexfunc   A_pink·dogs                                       A_u ∈ R^{m×m}
    Fulllex   tanh([W1, W2] [A_pink·dogs; A_dogs·pink])         W1, W2, A_u, A_v ∈ R^{m×m}

I Add, Mult, Dil: Mitchell and Lapata 2008; Fulladd: Guevara 2010, Zanzotto et al 2010; Lexfunc: Baroni and Zamparelli 2010; Fulllex: Socher et al 2012
General estimation of CDSMs: Data sets
Subj-Verb

    fire beam   fire burn   7
    skin glow   skin burn   1

Adj-Noun → Noun paraphrase

    archeological site     dig
    spousal relationship   marriage
    personal appeal        charisma
    electrical storm       thunderstorm
    indoor garden          hothouse
    double star            binary
General estimation of CDSMs: Data sets
Determiner phrase - Noun (Bernardi et al 2013)

I Nouns that are strongly related to a determiner phrase
I 23 determiners (two, various, no, many, too many, . . .)

    Noun       Correct DP      Foils
    duel       two opponents   various opponents, three opponents, two engineers, two, opponents
    homeless   no home         too few homes, one home, no incision, no, home
    polygamy   several wives   most wives, fewer wives, several negotiators, several, wives
General estimation of CDSMs: Results

[bar plots comparing Add, Dil, Mult, Fulladd, Lexfunc, Fulllex and corpus-extracted phrase vectors on three tasks: intransitive sentences, ANs, DPs]
Performance of composition models:
I Lexical function performs best and is stable across all tasks

Parameter estimation:
I For the simple composition models, observed-phrase estimation works just as well as choosing parameters through (supervised) cross-validation
General estimation of CDSMs: Examples
    Target         Method     Neighbours
    belief         Corpus     moral, dogma, worldview, religion, world-view, morality, theism
    false belief   Additive   belief, assertion, falsity, falsehood, truth, credence, dogma, supposition
                   Fulladd    pantheist, belief, agnosticism, religiosity, dogmatism, pantheism, theist
                   Lexfunc    self-deception, untruth, credulity, obfuscation, misapprehension, deceiver, disservice, falsehood
General estimation of CDSMs: Examples
DP neighbours of homeless
    Method     Neighbours
    Additive   three people, several orphanages, more people, many people, too many people
    Lexfunc    many people, more people, many victims, three people, no people, more deaths
    Corpus     those living, those homeless, more lives, those people, many families
Outline
Composition as function application
Related models and general estimation of composition models
Neural network methods for composition
Socher et al 2011
Recursive Autoencoder (RAE)

Vector representations for sentences are computed recursively through a parse tree.

Composition:

    p = g(W [c1; c2] + b)

c1, c2 ∈ R^n are inputs
W ∈ R^{n×2n}, b ∈ R^n weight matrix and bias
g a non-linear function
Socher et al 2011, Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
Socher et al 2011
Recursive Autoencoder (RAE)

Learn encoding/decoding matrices in order to be able to reconstruct the input.

Encoding:

    p = g(W_e [c1; c2] + b_e)

Decoding:

    [c1'; c2'] = g(W_d p + b_d)

Reconstruction error:

    ||[c1'; c2'] − [c1; c2]||²

E.g., W_d(W_e([red; moon])) should be close to [red; moon]
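The encode/decode step and its reconstruction error can be sketched as follows (dimensions and random initialization are placeholders; a real RAE would train these weights to minimize the error):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
We = rng.normal(scale=0.1, size=(n, 2 * n)); be = np.zeros(n)      # encoder
Wd = rng.normal(scale=0.1, size=(2 * n, n)); bd = np.zeros(2 * n)  # decoder
g = np.tanh

def encode(c1, c2):
    """Compose two child vectors into one parent vector."""
    return g(We @ np.concatenate([c1, c2]) + be)

def reconstruction_error(c1, c2):
    """How badly the parent vector reconstructs its children."""
    p = encode(c1, c2)
    rec = g(Wd @ p + bd)                  # [c1'; c2']
    target = np.concatenate([c1, c2])
    return np.sum((rec - target) ** 2)

red, moon = rng.normal(size=n), rng.normal(size=n)
err = reconstruction_error(red, moon)     # objective to minimize in training
```

Training would adjust We, be, Wd, bd by gradient descent so that this error, summed over all tree nodes, is small.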
Socher et al 2011
At sentence level
I Compute vector representations for all nodes in the parse tree (using unfolding RAE, embeddings as word input representations)
I Dynamic pooling for computing sentence similarities taking all nodes (words and phrases) into account

Evaluation
MSR Paraphrase corpus:
(1) The lies and deceptions from Saddam have been well documented for over 12 years.
(2) It has been all documented over 12 years of lies and deception from Saddam.
Dynamic pooling for paraphrase detection
I Given binary-branching parses of sentences of differing lengths n and m, first compute a matrix S of size (2n − 1) × (2m − 1) with all possible pairwise Euclidean distances between all nodes of both sentences.
I Superimpose a fixed-size n_p × n_p grid over the cells of S and pick the min of the values in each grid section as the number to be entered in the corresponding cell of the pooled S_p matrix.
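The min-pooling of a variable-sized similarity matrix into a fixed n_p × n_p grid can be sketched as follows (a simplified version: it lets `np.array_split` distribute leftover rows/columns, whereas the paper assigns the extras to the last windows):

```python
import numpy as np

def dynamic_min_pool(S, n_p):
    """Min-pool a variable-sized matrix S into a fixed n_p x n_p grid."""
    rows = np.array_split(np.arange(S.shape[0]), n_p)
    cols = np.array_split(np.arange(S.shape[1]), n_p)
    pooled = np.empty((n_p, n_p))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = S[np.ix_(r, c)].min()   # min within each window
    return pooled

# e.g. sentences of lengths 4 and 3 give a (2*4-1) x (2*3-1) = 7 x 5 matrix
S = np.arange(35, dtype=float).reshape(7, 5)
pooled = dynamic_min_pool(S, 3)
```

Whatever the sentence lengths, the pooled matrix always has the same shape, so it can feed a standard classifier.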
Dynamic pooling
2.4 Deep Recursive Autoencoder
Both types of RAE can be extended to have multiple encoding layers at each node in the tree. Instead of transforming both children directly into parent p, we can have another hidden layer h in between. While the top layer at each node has to have the same dimensionality as each child (in order for the same network to be recursively compatible), the hidden layer may have arbitrary dimensionality. For the two-layer encoding network, we would replace Eq. 1 with the following:

    h = f(W_e^(1) [c1; c2] + b_e^(1))    (7)
    p = f(W_e^(2) h + b_e^(2))           (8)

2.5 RAE Training
For training we use a set of parse trees and then minimize the sum of all nodes' reconstruction errors. We compute the gradient efficiently via backpropagation through structure [13]. Even though the objective is not convex, we found that L-BFGS run with mini-batch training works well in practice. Convergence is smooth and the algorithm typically finds a good locally optimal solution.

After the unsupervised training of the RAE, we demonstrate that the learned feature representations capture syntactic and semantic similarities and can be used for paraphrase detection.

3 An Architecture for Variable-Sized Similarity Matrices
Now that we have described the unsupervised feature learning, we explain how to use these features to classify sentence pairs as being in a paraphrase relationship or not.
Figure 3: Example of the dynamic min-pooling layer finding the smallest number in a pooling window region of the original similarity matrix S.
3.1 Computing Sentence Similarity Matrices
Our method incorporates both single word and phrase similarities in one framework. First, the RAE computes phrase vectors for the nodes in a given parse tree. We then compute Euclidean distances between all word and phrase vectors of the two sentences. These distances fill a similarity matrix S as shown in Fig. 1. For computing the similarity matrix, the rows and columns are first filled by the words in their original sentence order. We then add to each row and column the nonterminal nodes in a depth-first, right-to-left order.

Simply extracting aggregate statistics of this table such as the average distance or a histogram of distances cannot accurately capture the global structure of the similarity comparison. For instance, paraphrases often have low or zero Euclidean distances in elements close to the diagonal of the similarity matrix. This happens when similar words align well between the two sentences. However, since the matrix dimensions vary based on the sentence lengths one cannot simply feed the similarity matrix into a standard neural network or classifier.

3.2 Dynamic Pooling
Consider a similarity matrix S generated by sentences of lengths n and m. Since the parse trees are binary and we also compare all nonterminal nodes, S ∈ R^((2n−1)×(2m−1)). We would like to map S into a matrix S_pooled of fixed size, n_p × n_p. Our first step in constructing such a map is to partition the rows and columns of S into n_p roughly equal parts, producing an n_p × n_p grid.¹ We then define S_pooled to be the matrix of minimum values of each rectangular region within this grid, as shown in Fig. 3.

The matrix S_pooled loses some of the information contained in the original similarity matrix but it still captures much of its global structure. Since elements of S with small Euclidean distances show that

¹The partitions will only be of equal size if 2n − 1 and 2m − 1 are divisible by n_p. We account for this in the following way, although many alternatives are possible. Let the number of rows of S be R = 2n − 1. Each pooling window then has ⌊R/n_p⌋ many rows. Let M = R mod n_p be the number of remaining rows. We then evenly distribute these extra rows to the last M window regions, which will have ⌊R/n_p⌋ + 1 rows. The same procedure applies to the number of columns for the windows. This procedure will have a slightly finer granularity for the single word similarities, which is desired for our task since word overlap is a good indicator for paraphrases. In the rare cases when n_p > R, the pooling layer needs to first up-sample. We achieve this by simply duplicating pixels row-wise until R ≥ n_p.
I Since S_pooled has the same size for all sentence pairs, the values in its cells can be used as input to a supervised classifier
[Socher et al. 2011, Fig. 3]
Socher et al 2011. Results
Paraphrase detection results
    model                          Acc.
    all paraphrase baseline        66.5
    unfolding RAE                  72.6
    unfolding RAE + extra feats.   76.8
    interannotator agreement       83.0
Extra features
I Overlap, sentences contain same numbers, sentence length difference, ...
Matrix-Vector Recursive Neural Networks
Socher et al 2012
I Each word is represented both as a vector and a matrix(lexicalised model)
Training and evaluation
Trained to predict a distribution over sentiment of movie reviews and over semantic relations.

See the "New directions in vector space models of meaning" ACL 2014 tutorial for more on these models.
Comparison of Compositional DSM models
Blacoe and Lapata 2012

Compare

I Different types of input vector representations
I Different composition models (addition, multiplication, recursive autoencoders (RAE - simplified Socher 2011))

Evaluation
I Phrase similarity: Adj-Noun, Noun-Noun, Verb-Object
I Paraphrase detection
Comparison of Compositional DSM models
Phrase similarity results

Rank correlation

            Adj-N   N-N    V-Obj
    Add      0.37   0.38    0.28
    Mult     0.48   0.50    0.35
    RAE      0.34   0.29    0.32

I Input vectors: traditional distributional (count) vectors
I Shallow approaches are very good
Comparison of Compositional DSM models
Paraphrase detection

Classifier features: cosine of sentence vectors, sentence vectors, word overlap, ...

Classification accuracy

    Add            73.5
    Mult           73.0
    RAE            73.2
    Socher 2011*   74.2
    Socher 2011    76.8

* without dynamic pooling
I Best results for each method obtained with different types of input vectors and different combinations of features
I Shallow approaches are very good, again!
Evaluation and extensions of compositional distributional models
Outline
Scaling up to sentences: The SICK dataset and SemEval-2014 Task 1
Measuring phrase plausibility
Composition in morphology
Phrase generation
Here comes SICKness
I Sentences Involving Compositional Knowledge (or something. . . )
I Marco Marelli, Marco Baroni, Raffaella Bernardi, Roberto Zamparelli (UNITN), Stefano Menini, Luisa Bentivogli (FBK)
I http://clic.cimec.unitn.it/composes/sick.html
I Used for SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment
I http://alt.qcri.org/semeval2014/task1
I SICK slides courtesy of Marco Marelli
How to test compositional distributional semantic models at the sentence level?

I Existing datasets are not ideally suited to test linguistically-motivated semantic models
I either limited in size, variety of the represented phenomena, and complexity of the used sentences (e.g., Mitchell and Lapata, 2008):
The fire burned
The face burned
I or focusing on aspects that these models are not meant to handle (e.g., STS, RTE):
Deaths in Ukraine and Poland in freezing Europe weather
Ukraine: Europe hit by deadly cold snap
SICK: a new data set
I Not requiring to deal with "irrelevant" aspects of existing sentential data sets
I named entities:
Poland, Obama
I world knowledge:
Poland is a European country
Obama is the USA president
I "multiword expressions":
cold snap
League of Nations
I telegraphic language:
Ukraine: Europe hit by deadly cold snap
SICK
I Rich in phenomena of intrinsic linguistic interest
I contextual synonymy and other lexical variation phenomena:
The girl is bright
The sun is bright
I active/passive and other syntactic alternations:
The turtle is following the fish
The turtle is followed by the fish
I negation and other grammatical phenomena:
A boy is singing in the park
There is no boy singing in the park
The SemEval task
I Two subtasks:
I predicting the degree of semantic relatedness between two sentences (1-5 scale)
I detecting the entailment relation holding between two sentences (ENTAILMENT, CONTRADICTION, NEUTRAL)
Data set building
I Starting from the materials of two existing data sets (8K ImageFlickr and SemEval-2012 STS MSR-Video Description)
I normalization: removing unwanted phenomena
A woman is playing Mozart → A woman is playing classical music
I expansion: enriching the existing sentences with under-represented aspects
The surfer is riding a big wave → A surfer is riding a big wave
A woman is using a sewing machine → A woman is working on a machine used for sewing
I pairing: creating the final sentence pairs
Data set building
I Pairs of sentences with similar meanings
A man is driving a car
The car is being driven by a man
Data set building
I Pairs of sentences with opposed/contrasting meanings
The girl is spraying the plants with water
The boy is spraying the plants with water
Data set building
I Pairs of sentences with similar vocabulary and different meanings
Two children are lying in the snow and are making snow angels
Two angels are making snow on the lying children
Data set building
I Pairs of sentences with no apparent relation
A sea turtle is hunting for fish
A young woman is playing the guitar
Data set building
I Procedure aimed at generating a balanced distribution of possible sentence relations
I Eventually, about 10k sentence pairs
I half of them released as training set for the SemEval-2014 evaluation
I half of them used as test set
Gold standard
I Each pair was evaluated in a crowdsourcing study
Gold standard
I Double annotation task
I Final gold scores for each pair:
I Relatedness subtask: average of the obtained judgments
I Entailment subtask: majority vote of the obtainedjudgments
SemEval-2014 systems
I 21 participating systemsI Relatedness subtask: 17 submissions (66 runs)I Entailment subtask: 18 submissions (65 runs)
Results: relatedness subtask
System Pearson rECNU .8280StanfordNLP .8272The_Meaning_Factory .8268UNAL-NLP .8043Illinois-LH .7993CECL_ALL .7804...others .7802–.4795Word Overlap 0.63
Results: entailment subtask
System AccuracyIllinois-LH 84.58 %ECNU 83.64 %UNAL-NLP 83.05 %SemanticKLUE 82.32 %The_Meaning_Factory 81.59 %CECL_ALL 79.99 %...others 79.66%–48.73%Majority 56.7%Word Overlap 56.2%Chance 33.3%
Approaches
I All participating systems combine a number of differentmethods
I Distributional similarity, compositional or notI External resources (mostly WordNet, some paraphrase
corpora)I Supervised machine learning (SVMs)I Ad hoc features, exploiting properties of the dataset (e.g.,
using negative words as cue for contradictory pairs)
Outline
Scaling up to sentences: The SICK dataset and SemEval-2014Task 1
Measuring phrase plausibility
Composition in morphology
Phrase generation
Proximity, density and entropy
!"#$%"&'()"#*)+&")+
#%!',)+&")+ )+&")+
!"#$%"&'()"#*)+&")+
#%!',)+&")+
!" !# !$ !% !& !'
()*+,-./0-.
*0(1)0/+2-0(3,-./0-.
Measuring adjective-noun phrase plausibility
I Among phrases unattested in large corpus, distinguish theones that make sense (vulnerable gunman, huge joystick,academic crusade) from those that do not (academicbladder, parliamentary tomato, blind pronunciation)
I Proximity of composed vector to phrase is most stablepredictor of phrase acceptability (Vecchi et al. submitted)
I Composed-vector plausibility measures can besuccessfully used to predict right bracketing of nounphrases (miracle [home run] vs. [miracle home] run)(Lazaridou et al. EMNLP 2013)
Nearest neighbours of deviant/plausiblecomposed AN vectorsVecchi et al. submitted
*mathematical biscuit mathematical, mathematical result,mathematical system
*moral protein moral, moral system, morality*empty fungus empty tin, empty packet, empty container
continuous uprising constant warfare, constant conflict,continuous war
diverse farmland diverse wildlife, varied habitat, rare floraimportant coordinator instrumental role, integral role, significant role
Predicting the recursive behaviour of adjectivesVecchi et al. EMNLP 2013
I “Flexible” (daily national newspaper/national dailynewspaper) vs. “rigid” (rapid social change/*social rapidchange) adjective-adjective-noun phrases
I Proximity of recursively-constructed phrase vector tocomponents predicts if phrase is flexible or rigid, andcorrect order for rigid phrases
I Adjective nearest to noun has strong effect on phrasemeaning in rigid constructions
Outline
Scaling up to sentences: The SICK dataset and SemEval-2014Task 1
Measuring phrase plausibility
Composition in morphology
Phrase generation
Functional composition in morphologyLazaridou et al. ACL 2013, Marelli and Baroni submitted;see also Luong et al. CoNLL 2013
I Affixes as functions from stems to derived forms:
~redo = RE× ~do
I Affix matrices learned with least-squares techniques fromcorpus-extracted stem/derived vector pair examples(try/retry, climb/reclimb, open/reopen)
A look at composed derived forms
word nearest neighbours
carve+er potter, engraver, goldsmithbroil+er oven, stove, cooking, kebab, done
inter+ment cremate, disinter, burial, entombment, funeralequip+ment maintenance, servicing, transportable, deploy
re+issue original, expanded, long-awaitedre+touch repair, refashion, reconfigurere+sound reverberate, clangorous, echo
A look at composed derived forms
word nearest neighbourstype keyword, subtype, parsetype+ify embody, characterize, essentially
column arch, pillar, bracket, numericcolumn+ist publicist, journalist, correspondent
cello+ist flutist, virtuoso, quintetentomology+ist zoologist, biologist, botanistpropaganda+ist left-wing, agitator, dissidentrape+ist extortionist, bigamist, pornographer, pimp
A look at composed derived forms
word nearest neighbours
industry+al environmental, land-use, agricultureindustry+ous frugal, studious, hard-working
nervous anxious, excitability, panickynerve+ous bronchial, nasal, intestinal
Plausibility measures for morphology
!"#$%!"&'
!"#$%!()&&!"#$%!
!"#$%$&&
'()$%$&&
!" !# !$ !% !& !'
()*+,-.,**
*.)!/0*-
Predicting nonce derived-form acceptabilityMarelli and Baroni submitted
I Subject ratings of unattested formsI Acceptable: sketchable, hologramistI Unacceptable: happenable, windowist
I Plausibility measures applied to vectors derived byfunctional composition
I Entropy and proximity (non-linear effect) predict subjectintuitions (both p < 0.001 in mixed model regression)
Outline
Scaling up to sentences: The SICK dataset and SemEval-2014Task 1
Measuring phrase plausibility
Composition in morphology
Phrase generation
Phrase generation in distributional semanticsDinu and Baroni, ACL 2014
What for?
I Paraphrase generation, machine translationI From different modalities to language:
I Caption generationI “Read" brain activation patterns
Generation through decomposition functions
1. Decomposition[~u;~v ] = fdecompR(~p)
fdecompR : Rd → Rd × Rd
~p ∈ Rd phrase vector~u, ~v ∈ Rd output vectors
2. Nearest neighbor queryword = NNLex(~u)
Learning decomposition functions
Learn from corpus observed phrase vectorsExample: fdecompAN
−−−−→red.car [
−−→red ; −→car ]−−−−−−→
large.man → [−−−→large ;−−→man ]−−−−−−→
red.dress [−−→red ;
−−−→dress ]
Why corpus-observed phrase vectors?
I Polysemous behaviour of words: green jacket vs. greenpolitician
Learning decomposition functions
SpecificallyLearn linear transformations WR ∈ R2d×d
fdecompR(~p) = WR~p
arg minWR∈R2d×d
‖[U;V ]−WRP‖2F
Experiment 1: Paraphrasing
Noun to Adj-Noun paraphrasing
fallacy false beliefcharisma personal appealbinary double star
Adj-Noun to Noun-Prep-Noun dataset
pre-election promises promises before electioninexperienced user user without experience
Experiment 1: Noun to Adj-Noun
Decompose ANfdecompAN fallacy→[false ; belief]
Experiment 1: Adj-Noun to Noun-Prep-Nounparaphrasing
Compose AN - decompose NPN
fcompAN [pre-election ; promises] → ~pfdecompAN
~p → [~u ; promises]fdecompPN
~u → [before ; election]
Results
N-AN
Adjective NounNN Gen NN Gen
Median rank 168 67 60 29
I NN: Nearest Adj and Noun neighbours of input Noun
AN-NPN
Noun Prep NounPrec@1 0.98 0.08 0.13Median rank 1 5.5 20.5
I Search space: 20,000 Adjs and 20,000 Nouns, 14prepositions
Experiment 1: Examples
Noun → Adj Noun Gold
3 thunderstorm thundery storm electrical storm3 reasoning deductive thinking abstract thought3 jurisdiction legal authority legal power3 folk local music common people
7 superstition old-fashioned religion superstitious notion7 vitriol political bitterness sulfuric acid7 zoom fantastic camera rapid growth
I Better evaluation setting?I Most errors: related phrases instead of paraphrases
Experiment 1: Examples
Noun → Adj Noun Gold
3 thunderstorm thundery storm electrical storm3 reasoning deductive thinking abstract thought3 jurisdiction legal authority legal power3 folk local music common people7 superstition old-fashioned religion superstitious notion7 vitriol political bitterness sulfuric acid7 zoom fantastic camera rapid growth
I Better evaluation setting?I Most errors: related phrases instead of paraphrases
Experiment 1: Examples
Adj Noun Noun Prep Noun Gold
3 mountainous region region in highlands region with mountains3 undersea cable cable through ocean cable under sea? inter-war years years during 1930s years between wars7 post-operative pain pain through patient pain after operation7 superficial level level between levels level on surface
Cross-lingually
Map vectors from one language to another
I Use seed lexicon to learn a Lang1 → Lang2 mappingI Map arbitrary vectors
Experiment 2: Italian to English Adj-Noun translation
Compose AN - Translate - Decompose AN
fcompAN [spectacular ; woman] → ~pEnfEn→It ~pEn → ~pItfdecompAN
~pIt → [donna ; affascinante]
Experiment 2: Italian to English Adj-Noun translation
Compose AN - Translate - Decompose AN
I 1000 pairs from a movie subtitles phrase tableI It AN - En AN datasetI dictionary of size 5000 for learning a fEn→It map
spectacular woman donna affascinantevicious killer killer pericoloso
Experiment 2: Results
En→It It→EnThr. Accuracy Coverage Accuracy Coverage0.00 0.21 100% 0.32 100%0.55 0.25 70% 0.40 63%0.60 0.31 32% 0.45 37%0.65 0.45 9% 0.52 16%
I Accuracy when using generation confidence (cosine tonearest neighbour) as a threshold.
I Chance accuracy: 120,0002
Experiment 2: Examples
English → Italian
3 black tie cravatta nera3 indissoluble tie alleanza indissolubile3 vicious killer assasino feroce (killer pericoloso)3 rough neighborhood zona malfamata (ill-repute zone)7 mortal sin pecato eterno (eternal sin)7 canine star stella stellare (stellar star)
I Nice disambiguation effectI Some translations judged as more natural than the gold
standard
Object recognition in computer vision
Leverage language data to improve object recognitionZero-shot learning (Frome at al 2013, Socher et al 2013, Lazaridou et al 2014)
I Train a cross-modal map from visual to linguistic spaceI Tag an image of a previously unseen objects
How about attributes?Lazaridou, Dinu, Liška, and Baroni - in preparation
ObservationMapped vectors are closer to Adj-Noun phrase vectors
ObservationMapped vectors are closer to Adj-Noun phrase vectors
−200 −150 −100 −50 0 50 100 150 200−200
−150
−100
−50
0
50
100
150
cold butter
fresh strawberry
vintage champagne
dark rum
liqueur
herbal liqueur
yellow
white wine
mixed fruit tall glass
single bottle
only fruitcold milk
lemon
natural yogurt
smooth paste
purple
new orange
orange zestorange
irish cream
small cup dark beer
iced coffee
sparkling wineyoung wine
citrus
regional wine
dark yellow
clear honey
light orange
brandy
black institution
mixed herb
rich chocolate
cherry juice
other orange
more orange
vibrant red
mobile visit
first orange
sweet wine
orange liqueur
crushed ice
white rum
Image as visual phrase
I Gen - Generate Adj-Noun phrase from image projectionI NN - Nearest neighbour baseline
Image Rank Nearest NeighboursNN - crispy hyraxA:448 crispy, edible, tasty, delicious, crunchyN: 50 hyrax, fish, pangolin, dugong, bovidGen - wild fishA:340 wild, dark, edible, wet, saltyN: 27 fish, meat, animal, rabbit, bovidNN - yummy sauteA: 43 yummy, fluffy,cuddly, tasty, furryN:200 saute, food coloring, ramekin, tiramisuGen - fluffy rabbitA: 3 fluffy, white, pink, cuddly,crunchyN:144 rabbit, chinchilla, meat, cat, fish