TRANSCRIPT
NLU lecture 5: Word representations and morphology
Adam Lopez (alopez@inf.ed.ac.uk)

•Essential epistemology
•Word representations and word2vec
•Word representations and compositional morphology

Reading: Mikolov et al. 2013, Luong et al. 2013
Essential epistemology

              Exact sciences       Empirical sciences    Engineering
Deals with    Axioms & theorems    Facts & theories      Artifacts
Truth is      Forever              Temporary             It works
Examples      Mathematics,         Physics, Biology,     Many, including
              C.S. theory,         Linguistics           applied C.S.,
              F.L. theory                                e.g. NLP, MT
The slide builds up examples across these columns:
•morphological properties of words: facts (empirical science)
•Optimality theory: a linguistic theory
•"Optimality theory is finite-state": a theorem (C.S. theory)
•We can represent morphological properties of words with finite-state automata: an engineering artifact
Remem
ber the bandwagon
Word representations
Feedforward model

p(e) = \prod_{i=1}^{|e|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})

[Figure: feedforward network computing p(e_i | e_{i-n+1}, ..., e_{i-1}): one-hot inputs e_{i-3}, e_{i-2}, e_{i-1}, each mapped through a shared embedding matrix C, then a tanh hidden layer with weights W, then a softmax output layer with weights V]

Every word is a vector (a one-hot vector).
The concatenation of these vectors is an n-gram.
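The slide's point can be made concrete with a tiny sketch. The vocabulary here is hypothetical, and real vocabularies have tens of thousands of words, but the representation is exactly this:

```python
import numpy as np

# Toy vocabulary (hypothetical); in practice |V| is tens of thousands.
vocab = ["the", "cat", "sat", "on", "mat"]
V = len(vocab)

def one_hot(word):
    """One-hot vector for a word: all zeros except a 1 at the word's index."""
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# A trigram context is just the concatenation of its words' one-hot vectors.
context = ["the", "cat", "sat"]
x = np.concatenate([one_hot(w) for w in context])

print(x.shape)   # (15,): 3 words x |V| = 5
```

Note how sparse and high-dimensional this input is: 15 entries, only 3 of them nonzero.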
Feedforward model (repeated, with one observation added per slide):
•Word embeddings are vectors: continuous representations of each word.
•n-grams are vectors: continuous representations of n-grams (or, via recursion, larger structures).
•A discrete probability distribution over V outcomes is a vector: V non-negative reals summing to 1.
•No matter what we do in NLP, we'll (almost) always have words… Can we reuse these vectors?
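A minimal forward pass through a Bengio-style feedforward LM makes the three observations concrete. The matrix names C, W, V follow the network figure; all dimensions are made-up toy values, and the weights are random placeholders rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

V_size, d, h, n = 5, 8, 16, 4   # vocab size, embedding dim, hidden dim, n-gram order (toy values)

C = rng.normal(size=(V_size, d))        # embedding matrix: one row per word (continuous word vectors)
W = rng.normal(size=(h, (n - 1) * d))   # hidden-layer weights
Vout = rng.normal(size=(V_size, h))     # output-layer weights

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(context_ids):
    """p(e_i | e_{i-n+1}, ..., e_{i-1}) for a context of n-1 word ids."""
    x = np.concatenate([C[j] for j in context_ids])  # continuous n-gram representation
    hidden = np.tanh(W @ x)
    return softmax(Vout @ hidden)        # a distribution over V outcomes: a vector

p = predict([0, 1, 2])
```

The output `p` is itself just a vector of V non-negative reals summing to 1, exactly as the slide says.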
Design a POS tagger using an RNNLM

What are some difficulties with this?
What limitation do you have in learning a POS tagger that you don't have when learning a LM?
One big problem: LIMITED DATA
"You shall know a word by the company it keeps"
–John Rupert Firth (1957)
Learning word representations using language modeling

•Idea: we'll learn word representations using a language model, then reuse them in our POS tagger (or any other thing we predict from words).
•Problem: the Bengio language model is slow. Imagine computing a softmax over 10,000 words!
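To see why this is slow, consider what a single softmax prediction costs with a 10,000-word vocabulary (hidden size 100 assumed here for illustration): every output word needs a score before anything can be normalized.

```python
import numpy as np

V_size, h = 10_000, 100                 # assumed sizes for illustration
rng = np.random.default_rng(1)
Vout = rng.normal(size=(V_size, h))     # output weights: one row per vocabulary word
hidden = rng.normal(size=h)

# One prediction scores EVERY word: 10,000 x 100 = 1,000,000 multiplications...
scores = Vout @ hidden
# ...then normalizes over all 10,000 scores.
probs = np.exp(scores - scores.max())
probs /= probs.sum()
```

This O(|V| * h) cost per training example is the bottleneck that motivates word2vec's cheaper training objectives.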
Continuous bag-of-words (CBOW)

Skip-gram

Learning skip-gram

Word representations capture some world knowledge
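Skip-gram trains the model to predict each context word from the center word. A sketch of how the (center, context) training pairs are extracted, with a window size of 2 (the window size is a free parameter, chosen here for illustration):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: skip-gram is trained to predict
    each context word within the window from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=2)
# e.g. ("sat", "cat") and ("sat", "on") are both training pairs
```

CBOW uses the same windows but flips the direction: it predicts the center word from the (averaged) context words.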
Continuous Word Representations

Semantic: man : woman :: king : queen
Syntactic: walk : walks :: read : reads

Will it learn this?
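The famous analogy test solves "man : woman :: king : ?" by vector arithmetic. A sketch with hand-built toy embeddings (hypothetical 2-d vectors where dimension 0 loosely encodes gender and dimension 1 royalty; real embeddings are learned, not constructed):

```python
import numpy as np

# Hand-built toy embeddings for illustration (not learned).
emb = {
    "man":   np.array([ 1.0,  0.0]),
    "woman": np.array([-1.0,  0.0]),
    "king":  np.array([ 1.0,  1.0]),
    "queen": np.array([-1.0,  1.0]),
    "apple": np.array([ 0.0, -1.0]),   # distractor
}

def analogy(a, b, c):
    """Solve a : b :: c : ? by the vector offset b - a + c,
    returning the nearest word other than the three inputs."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

print(analogy("man", "woman", "king"))   # → queen
```

Whether learned embeddings actually organize themselves this way is exactly the empirical question the slide poses.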
(Additional) limitations of word2vec
•Closed vocabulary assumption
•Cannot exploit functional relationships in learning
Is this language?

What our data contains:
A Lorillard spokeswoman said, “This is an old story.”
What word2vec thinks our data contains:
A UNK UNK said, “This is an old story.”

Is it ok to ignore words?
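A sketch of the closed-vocabulary preprocessing the slide alludes to: any token outside the training vocabulary collapses to UNK before the model ever sees it (the toy vocabulary below is constructed so that exactly the slide's example falls out):

```python
def unkify(tokens, vocab):
    """Replace out-of-vocabulary tokens with UNK, as a closed-vocabulary
    model effectively does."""
    return [t if t in vocab else "UNK" for t in tokens]

vocab = {"A", "said", ",", "“", "This", "is", "an", "old", "story", ".", "”"}
sent = ["A", "Lorillard", "spokeswoman", "said", ",", "“", "This",
        "is", "an", "old", "story", ".", "”"]
print(" ".join(unkify(sent, vocab)))
# → A UNK UNK said , “ This is an old story . ”
```

The information that a "Lorillard spokeswoman" (and not, say, a senator) spoke is simply gone.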
What we know about linguistic structure

Morpheme: the smallest meaningful unit of language

“loves”
root/stem: love
affix: -s
morph. analysis: love +s (3rd.SG.PRES)
What if we embed morphemes rather than words?
Basic idea: compute representation recursively from children

Vectors in green are morpheme embeddings (parameters).
Vectors in grey are computed from their children (functions).
f is an activation function (e.g. tanh).
Train compositional morpheme model by minimizing distance to reference vector

Target output: reference vector p_r
Constructed vector is p_c
Minimize: ||p_r - p_c||^2
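A minimal sketch of this compositional step, in the style of Luong et al. (2013): the parent vector is f(W[x_left; x_right] + b) with f = tanh, and training minimizes the squared distance to a reference vector. Dimensions are toy values and W, b, and the embeddings are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4                                   # morpheme embedding dimension (toy value)

# Morpheme embeddings (parameters; the green vectors in the slides)
x_stem  = rng.normal(size=d)            # e.g. "love"
x_affix = rng.normal(size=d)            # e.g. "-s"

# Composition parameters (random placeholders here)
W = rng.normal(size=(d, 2 * d))
b = np.zeros(d)

def compose(left, right):
    """Parent vector computed recursively from its children: f(W [left; right] + b)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

p_c = compose(x_stem, x_affix)          # constructed vector for "loves"
p_r = rng.normal(size=d)                # reference vector (e.g. a pre-trained word vector)

loss = np.sum((p_r - p_c) ** 2)         # squared distance to minimize
```

Training would backpropagate this loss into W, b, and the morpheme embeddings; for multi-morpheme words, `compose` is applied recursively up the tree.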
Or, train in context using backpropagation

Vectors in green are morpheme embeddings (parameters).
Vectors in grey are computed from their children (functions).
Vectors in blue are word or n-gram embeddings (parameters).
(Basically a feedforward LM)
Where do we get morphemes?

•Use an unsupervised morphological analyzer (we'll talk about unsupervised learning later on).
•How many morphemes are there? New stems are invented every day! fleeking, fleeked, and fleeker are all attested…
Representations learned by compositional morphology model

Summary
•Deep learning is not magic and will not solve all of your problems, but representation learning is a very powerful idea.
•Word representations can be transferred between models.
•Word2vec trains word representations using an objective based on language modeling, so it can be trained on unlabeled data.
•Sometimes called unsupervised, but the objective is supervised!
•Vocabulary is not finite.
•Compositional representations based on morphemes make our models closer to open vocabulary.