10/28/2014 EMNLP 2014 in Doha, Qatar

Jointly Learning Word Representations and Composition Functions Using Predicate-Argument Structures

Kazuma Hashimoto (UT)

Pontus Stenetorp (UT)

Makoto Miwa (TTI)

Yoshimasa Tsuruoka (UT)

University of Tokyo (UT)

Toyota Technological Institute (TTI)

Neural Word Vector Representations

• Neural networks + large unlabeled corpora

– Learn word (i.e. single token) representations

• e.g. word2vec (Mikolov+ 2013; Mnih and Kavukcuoglu 2013; inter alia)

– Learn composed vector representations

• e.g. compositional neural language models for verb-object vectors (Tsubaki+ 2013)

Relation to Previous Work

                                             word2vec   Compositional neural language models   Our model
single token representations                 ✓          ✓                                      ✓
recursive structures of syntactic relations  x          x                                      ✓
pre-training                                 ✓          x                                      ✓
composition                                  x          ✓                                      ✓

A Joint Learning Model

• Learning word and composed representations

– using syntactic structures of unlabeled corpora

– without pre-trained word vectors

[Figure: example words and phrases embedded in one vector space: downpour, heavy rain; pay, make payment; overcome, solve problem, achieve objective, bridge gap]

• State-of-the-art scores for phrase similarity tasks with transitive verbs

Overview

1. Learning word representations using predicate-argument structures

2. Jointly learning word representations and composition functions

3. Evaluation on phrase similarity tasks

4. Conclusion

Predicate-Argument Structures (PASs)

• Standard dependency structures

– Relations between heads and modifiers

[Example: "the heavy rain caused the car accidents", annotated with the nsubj and dobj dependencies]

• Predicate-Argument Structures (PASs)

– Relations between predicates and arguments

• Each predicate in a sentence has (Enju parser (Miyao and Tsujii 2008))

– a specific category

– zero or more arguments

[Example: in "the heavy rain caused the car accidents", the adjective "heavy" takes "rain" as argument 1; the verb "caused" takes "rain" as argument 1 and "accidents" as argument 2; the noun "car" takes "accidents" as argument 1]
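
As a rough illustration of the structure described on this slide, each PAS can be held as a category plus a predicate and its arguments. The `PAS` class below is a hypothetical container written for this sketch; it is not the Enju parser's output format. The category names follow the labels used later in the slides (adj_arg1, noun_arg1, verb_arg12), and the words are lemmatized by hand.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PAS:
    """One predicate-argument structure (hypothetical container, not Enju's format)."""
    category: str               # e.g. "adj_arg1", "noun_arg1", "verb_arg12"
    predicate: str              # the predicate word
    arguments: Tuple[str, ...]  # zero or more argument words, in order


# The three PASs read off "the heavy rain caused the car accidents".
sentence_pass = [
    PAS("adj_arg1", "heavy", ("rain",)),
    PAS("noun_arg1", "car", ("accident",)),
    PAS("verb_arg12", "cause", ("rain", "accident")),
]

for pas in sentence_pass:
    print(pas.category, pas.predicate, pas.arguments)
```

Note that "rain" and "accident" each appear in two PASs, which is what later lets the PASs of a sentence be connected into a graph.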

A Word Prediction Model Using PASs

• Given a PAS, discriminating between

– a word in the specific PAS and

– a word drawn from a noise distribution (scaled unigram distribution in (Mikolov+, 2013))

[Example: for the PAS "rain cause accident" (category: verb), the target word "cause" is contrasted with the drawn word "eat", which gives "rain eat accident"]
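
A minimal sketch of drawing the noise word, assuming the scaled unigram distribution raises raw counts to the power 0.75 as in word2vec; the slides do not state the exponent. The function names, the toy counts, and the choice to redraw when the sample equals the target are all assumptions made for illustration.

```python
import random


def scaled_unigram(counts, power=0.75):
    """Return {word: probability} with P(w) proportional to count(w) ** power."""
    scaled = {w: c ** power for w, c in counts.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}


def draw_noise_word(dist, target, rng):
    """Draw a word from dist; redraw if it equals the target (an assumption)."""
    words = list(dist)
    weights = [dist[w] for w in words]
    while True:
        word = rng.choices(words, weights=weights, k=1)[0]
        if word != target:
            return word


# Toy verb counts, made up for illustration only.
counts = {"cause": 50, "eat": 30, "make": 100, "solve": 20}
dist = scaled_unigram(counts)
noise = draw_noise_word(dist, target="cause", rng=random.Random(0))
print(noise)
```

Raising counts to a power below 1 flattens the distribution, so rare words are drawn somewhat more often than their raw frequency would suggest.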

• Context information: the word vectors $v(\text{rain})$ (argument 1) and $v(\text{accident})$ (argument 2)

$p(\text{cause}) = \tanh\left(h^{verb\_arg12}_{arg1} \ast v(\text{rain}) + h^{verb\_arg12}_{arg2} \ast v(\text{accident})\right)$

• Scores of the target word "cause" and the drawn word "eat"

$s = v(\text{cause}) \cdot p(\text{cause})$

$s' = v(\text{eat}) \cdot p(\text{cause})$

• Cost: $\max(0, 1 - s + s')$
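
The prediction vector, the two scores, and the cost on this slide can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the multiplication by the weight vectors is taken to be element-wise, and the 3-dimensional vectors and all-ones weights are made up.

```python
import math


def predict(h_arg1, v_arg1, h_arg2, v_arg2):
    """p = tanh(h_arg1 * v_arg1 + h_arg2 * v_arg2), with * taken element-wise."""
    return [math.tanh(a * x + b * y)
            for a, x, b, y in zip(h_arg1, v_arg1, h_arg2, v_arg2)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge_cost(v_target, v_noise, p):
    """max(0, 1 - s + s') with s = v_target . p and s' = v_noise . p."""
    s = dot(v_target, p)
    s_noise = dot(v_noise, p)
    return max(0.0, 1.0 - s + s_noise)


# Toy vectors, made up for illustration.
v = {
    "rain": [0.5, -0.2, 0.1],
    "accident": [0.3, 0.4, -0.1],
    "cause": [1.0, 0.5, 0.0],
    "eat": [-0.5, 0.2, 0.3],
}
h_arg1 = [1.0, 1.0, 1.0]  # weights for argument 1 of verb_arg12
h_arg2 = [1.0, 1.0, 1.0]  # weights for argument 2 of verb_arg12

p_cause = predict(h_arg1, v["rain"], h_arg2, v["accident"])
print(hinge_cost(v["cause"], v["eat"], p_cause))
```

The cost is zero once the target word outscores the drawn word by a margin of at least 1, so only pairs that are still confusable contribute gradients.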

What We Expect from the Model

• Learning word representations based on

– specific PAS categories

– selectional preferences

[Figure: the word prediction model scoring "cause" (s) against "eat" (s′) given argument 1 and argument 2]

"rain" can be

• a subject of "cause" (not "eat")

• a cause of "accident"

Examples

• "eat at restaurant" (category: preposition): given v(eat) as argument 1 and v(at) as the predicate, the target word "restaurant" (score s) is contrasted with the drawn word "cupboard" (score s′)

• "heavy rain" (category: adjective): given v(rain) as argument 1, the target word "heavy" (score s) is contrasted with the drawn word "delicious" (score s′)

Adding Bag-of-Words Contexts

• Providing additional context information

– Nouns and Verbs in the same sentences

[Figure: word prediction for "cause" vs "eat" using v(rain), v(accident), v(road), and v(injure) as context]

Beyond Single Word Representations

• Learning representations composed by

– multiple words and

– specific relation categories

[Figure: the single word "downpour" alongside the PAS "heavy rain" (category: adjective, argument 1: rain)]

A Specific PAS as a Single Token

• Using connections on graphs of PASs

[Figure: in the word prediction model for "cause" vs "eat", argument 1 and argument 2 are themselves PASs: "heavy" (adjective) with argument 1 "rain", and "car" (noun) with argument 1 "accident"]

• Parameterization: each PAS gets its own vector, e.g. v(heavy_rain) and v(car_accident)

• Same as previously!

Learned PAS Representations

• Similar tokens for each PAS representation in terms of cosine similarity

heavy_rain: thunderstorm, downpour, blizzard, much_rain

chief_executive: general_manager, vice_president, executive_director, project_manager, managing_director

world_war: second_war, plane_crash, last_war, great_war

make_payment: make_order, carry_survey, pay_tax, impose_tax

solve_problem: achieve_objective, bridge_gap, improve_quality, deliver_information, encourage_development

meeting_take_place: hold_meeting, event_take_place, end_season, discussion_take_place, do_work
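
Lists like these can be produced by ranking every other token by its cosine similarity to the query token. The sketch below shows that lookup; the function names and the 2-dimensional toy vectors are made up for illustration, and real vectors would come from training.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, k=5):
    """Return the k tokens with the highest cosine similarity to the query token."""
    scored = [(cosine(vectors[query], vec), tok)
              for tok, vec in vectors.items() if tok != query]
    scored.sort(reverse=True)
    return [tok for _, tok in scored[:k]]


# Toy vectors, made up for illustration.
vectors = {
    "heavy_rain": [0.9, 0.1],
    "downpour": [0.8, 0.2],
    "thunderstorm": [0.7, 0.3],
    "chief_executive": [0.1, 0.9],
}
print(most_similar("heavy_rain", vectors, k=2))
```

Because single words and PAS tokens share one vector space, the same lookup can return a single word such as "downpour" for a PAS query such as heavy_rain.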

Why Composition?

• Fully parameterized PAS representations, e.g. v(heavy_rain) and v(car_accident)

• Very large number of combinations of words

– Data sparseness

• Ignoring information from individual words

Incorporating Composed Vectors

[Figure: in the word prediction model for "cause" vs "eat", v(heavy rain) and v(car accident) are now composed vectors: the word vectors v(heavy), v(rain) and v(car), v(accident) are combined by the composition functions $g_{adj\_arg1}$ and $g_{noun\_arg1}$]

• Same as previously!

Composition Functions in this Work

• Simple element-wise composition functions with and without tanh

– e.g. $v(\text{heavy rain}) = g_{adj\_arg1}(v(\text{heavy}), v(\text{rain}))$

Function   Composition function $g_{adj\_arg1}$

Add_l      $v(\text{heavy}) + v(\text{rain})$

Add_nl     $\tanh\left(v(\text{heavy}) + v(\text{rain})\right)$

WAdd_l     $m^{adj\_arg1}_{pred} \ast v(\text{heavy}) + m^{adj\_arg1}_{arg1} \ast v(\text{rain})$

WAdd_nl    $\tanh\left(m^{adj\_arg1}_{pred} \ast v(\text{heavy}) + m^{adj\_arg1}_{arg1} \ast v(\text{rain})\right)$
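
The four composition functions for a one-argument category such as adj_arg1 can be sketched as follows. The function names mirror the slide's labels; the vectors and weights are made-up toy values, and `*` is taken to be element-wise, as the slide's "element-wise composition functions" suggests.

```python
import math


def add_l(v_pred, v_arg):
    """Add_l: plain element-wise sum."""
    return [a + b for a, b in zip(v_pred, v_arg)]


def add_nl(v_pred, v_arg):
    """Add_nl: element-wise sum followed by tanh."""
    return [math.tanh(x) for x in add_l(v_pred, v_arg)]


def wadd_l(m_pred, v_pred, m_arg, v_arg):
    """WAdd_l: element-wise weighted sum with category-specific weights."""
    return [mp * a + ma * b
            for mp, a, ma, b in zip(m_pred, v_pred, m_arg, v_arg)]


def wadd_nl(m_pred, v_pred, m_arg, v_arg):
    """WAdd_nl: element-wise weighted sum followed by tanh."""
    return [math.tanh(x) for x in wadd_l(m_pred, v_pred, m_arg, v_arg)]


# Toy 2-dimensional vectors and weights, made up for illustration.
v_heavy, v_rain = [0.2, -0.4], [0.6, 0.1]
m_pred, m_arg1 = [0.5, 0.5], [2.0, 2.0]

print(add_l(v_heavy, v_rain))
print(wadd_nl(m_pred, v_heavy, m_arg1, v_rain))
```

With all weights set to 1, WAdd_l reduces to Add_l, so the weighted variants only add one weight vector per role and category on top of the word vectors.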

Learned Composed Vectors

• Similar composed representations in terms of cosine similarity

make payment: make repayment, make money, make indemnity, make saving, make sum

solve problem: solve dilemma, solve task, solve difficulty, solve trouble, solve contradiction

run company: run firm, run industry, run corporation, run enterprise, run club

people kill animal: anyone kill animal, man kill animal, person kill animal, people kill bird, predator kill animal

animal kill people: creature kill people, effusion kill people, elephant kill people, tiger kill people, people kill people

meeting take place: briefing take place, party take place, session take place, conference take place, investiture take place

Learned Composition Weights

• L2-norms of the weight vectors of WAdd_nl

– Clearly emphasizing head words

Category     Predicate   Argument 1   Argument 2
adj_arg1     2.38        6.55         -
noun_arg1    3.37        5.60         -
verb_arg12   6.78        2.57         2.18
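
To make the "emphasizing head words" reading concrete, the sketch below takes the norms reported in the table and picks the role with the largest norm in each category. The `l2_norm` helper shows how such a norm would be computed from a weight vector; the dictionary layout is an assumption for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a weight vector."""
    return math.sqrt(sum(x * x for x in vec))


# L2-norms copied from the table of learned WAdd_nl weights.
norms = {
    "adj_arg1": {"predicate": 2.38, "argument 1": 6.55},
    "noun_arg1": {"predicate": 3.37, "argument 1": 5.60},
    "verb_arg12": {"predicate": 6.78, "argument 1": 2.57, "argument 2": 2.18},
}

heaviest = {cat: max(by_role, key=by_role.get) for cat, by_role in norms.items()}
for category, role in heaviest.items():
    print(category, "->", role)
```

For adj_arg1 and noun_arg1 the largest norm belongs to argument 1, the modified noun (e.g. "rain" in "heavy rain"); for verb_arg12 it belongs to the predicate, the verb itself.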

Experimental Settings

• Training data

– PASs from BNC (~6 million sentences)

• adjective-noun, noun-noun

• prepositions and verbs with 2 arguments

• Dimensionality

– 50 and 1,000

• Optimization

– AdaGrad (Duchi+ 2011)

• learning rate: 0.05, mini-batch size: 32
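
For reference, a single AdaGrad update with the stated learning rate of 0.05 can be sketched as below. The small `eps` term and the idea that `grads` holds a gradient already averaged over a mini-batch of 32 examples are assumptions; the slides give only the learning rate and the batch size.

```python
import math


def adagrad_step(params, grads, accum, lr=0.05, eps=1e-8):
    """One in-place AdaGrad update on flat lists of floats.

    accum holds the running sum of squared gradients per parameter.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)


# Toy parameters and gradient, made up for illustration.
params = [1.0, -2.0]
accum = [0.0, 0.0]
adagrad_step(params, grads=[0.5, -0.5], accum=accum)
print(params)
```

Each parameter's effective step shrinks as its squared gradients accumulate, so frequently updated parameters move less over time than rarely updated ones.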

Datasets for Evaluation

• Measuring the semantic similarity between

– Adjective-Noun phrases (AN)

– Noun-Noun phrases (NN)

– Verb-Object phrases (VO)

(Mitchell and Lapata 2010)

– Subject-Verb-Object phrases (SVO)

(Grefenstette and Sadrzadeh 2011)

[Example from the AN dataset: for p1 = "vast amount" and p2 = "large quantity", annotators give a similarity score and the model gives cos(v(p1), v(p2)) = 0.85; the two are compared using Spearman's rank correlation]
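
The evaluation measure can be sketched without external libraries: rank both score lists (ties share the mean rank) and compute the Pearson correlation of the ranks. The human and model scores below are made-up toy values, and how individual annotators' ratings are aggregated is not specified on the slides, so that step is left out.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


human = [7, 6, 2, 1]              # toy annotator scores (1 to 7 scale)
model = [0.85, 0.60, 0.30, 0.35]  # toy cosine similarities
rho = spearman(human, model)
print(rho)
```

Only the ordering of the cosine values matters here, so the model is not penalized for producing scores on a different scale than the annotators' 1 to 7 ratings.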

Examples of Phrase Pairs

• Examples of phrase pairs for noun phrase tasks

phrase pair                             score
vast amount / large quantity            7
important part / significant role       7
efficient use / little room             1
early stage / dark eye                  1

phrase pair                             score
wage increase / tax rate                7
education course / training programme   6
office worker / kitchen door            2
study group / news agency               1

Examples of Phrase Pairs

• Examples of phrase pairs for verb phrase tasks

phrase pair                                      score
start work / begin career                        7
pour tea / drink water                           6
shut door / close eye                            1
wave hand / start work                           1

phrase pair                                      score
student write name / student spell name          7
child show sign / child express sign             6
river meet sea / river visit sea                 1
system meet criterion / system visit criterion   1

Main Results (50 dim)

• Strong baselines produced by word2vec

• Nice scores for verb phrase tasks

[Figure: bar chart of scores on the AN, NN, VO, and SVO tasks for Add_l, Add_nl, WAdd_l, WAdd_nl, word2vec, and Human]

Main Results (1,000 dim)

• Consistently outperforming 50 dimensional vectors

[Figure: bar chart of scores on the AN, NN, VO, and SVO tasks]

Comparison with Previous Work

• The AN, NN, and VO tasks

– BL: element-wise multiplications (Blacoe and Lapata 2012)

– HB: recursive neural networks with CCGs (Hermann and Blunsom 2013)

– KS: tensor-based composition models (Kartsaklis and Sadrzadeh 2013)

• The SVO task

– GS, VC: tensor-based composition models (Grefenstette and Sadrzadeh 2011), (Van de Cruys+ 2013)

The AN, NN, and VO Tasks

[Figure: bar chart of scores on the AN, NN, and VO tasks for Add_nl, WAdd_nl, BL, HB, KS, and Human]

• 50 dim

– Comparable to state-of-the-art scores

• 1,000 dim

– New state-of-the-art score for the VO task

The SVO Task

[Figure: bar chart of scores on the SVO task for WAdd_nl, GS, VC, and Human; corpus labels on the chart: BNC, BNC, ukWaC]

• State-of-the-art models use large corpora

– e.g. ukWaC corpus (~2B words)

• Achieving the state-of-the-art score using a much smaller corpus

– BNC (~0.1B words) vs ukWaC (~2B words)

Effects of BoW Contexts

• BoW contexts are helpful for the verb phrase tasks

– The results might depend on how the BoW contexts are constructed

[Figure: bar chart of scores on the AN, NN, VO, and SVO tasks for WAdd_nl w/o BoW, WAdd_nl w/ BoW, and Human]

Conclusion

• Jointly learning composition functions

– with syntactic structures

– without any pre-trained word vectors

• State-of-the-art scores for verb phrase similarity

Future Work

• Incorporating more sophisticated composition functions to improve verb phrase representations

• Learning full phrase representations rather than only 2 or 3 word phrases

Thank You Very Much!

• Any questions?
