TRANSCRIPT
NLU lecture 5: Word representations and morphology
Adam Lopez (alopez@inf.ed.ac.uk)

•Essential epistemology
•Word representations and word2vec
•Word representations and compositional morphology

Reading: Mikolov et al. 2013, Luong et al. 2013
Essential epistemology

              Exact sciences       Empirical sciences    Engineering
Deals with    Axioms & theorems    Facts & theories      Artifacts
Truth is      Forever              Temporary             It works
Examples      Mathematics,         Physics, Biology,     Many, including
              C.S. theory,         Linguistics           applied C.S.,
              F.L. theory                                e.g. NLP, MT
The slide builds up examples across these columns:
•morphological properties of words: facts (empirical science)
•Optimality theory: a linguistic theory
•"Optimality theory is finite-state": a theorem (C.S. theory)
•We can represent morphological properties of words with finite-state automata: an engineering artifact
Remem
ber the bandwagon
Word representations
Feedforward model

p(e) = \prod_{i=1}^{|e|} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1})

[Figure: feedforward network computing p(e_i | e_{i-n+1}, ..., e_{i-1}): one-hot inputs e_{i-3}, e_{i-2}, e_{i-1}, each mapped through a shared embedding matrix C, then a tanh hidden layer with weights W, then a softmax output layer with weights V]

Every word is a vector (a one-hot vector).
The concatenation of these vectors is an n-gram.
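The slide's point can be made concrete with a tiny sketch. The vocabulary here is hypothetical, and real vocabularies have tens of thousands of words, but the representation is exactly this:

```python
import numpy as np

# Toy vocabulary (hypothetical); in practice |V| is tens of thousands.
vocab = ["the", "cat", "sat", "on", "mat"]
V = len(vocab)

def one_hot(word):
    """One-hot vector for a word: all zeros except a 1 at the word's index."""
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# A trigram context is just the concatenation of its words' one-hot vectors.
context = ["the", "cat", "sat"]
x = np.concatenate([one_hot(w) for w in context])

print(x.shape)   # (15,): 3 words x |V| = 5
```

Note how sparse and high-dimensional this input is: 15 entries, only 3 of them nonzero.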
Feedforward model (repeated, with one observation added per slide):
•Word embeddings are vectors: continuous representations of each word.
•n-grams are vectors: continuous representations of n-grams (or, via recursion, larger structures).
•A discrete probability distribution over V outcomes is a vector: V non-negative reals summing to 1.
•No matter what we do in NLP, we'll (almost) always have words… Can we reuse these vectors?
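A minimal forward pass through a Bengio-style feedforward LM makes the three observations concrete. The matrix names C, W, V follow the network figure; all dimensions are made-up toy values, and the weights are random placeholders rather than trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

V_size, d, h, n = 5, 8, 16, 4   # vocab size, embedding dim, hidden dim, n-gram order (toy values)

C = rng.normal(size=(V_size, d))        # embedding matrix: one row per word (continuous word vectors)
W = rng.normal(size=(h, (n - 1) * d))   # hidden-layer weights
Vout = rng.normal(size=(V_size, h))     # output-layer weights

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(context_ids):
    """p(e_i | e_{i-n+1}, ..., e_{i-1}) for a context of n-1 word ids."""
    x = np.concatenate([C[j] for j in context_ids])  # continuous n-gram representation
    hidden = np.tanh(W @ x)
    return softmax(Vout @ hidden)        # a distribution over V outcomes: a vector

p = predict([0, 1, 2])
```

The output `p` is itself just a vector of V non-negative reals summing to 1, exactly as the slide says.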
Design a POS tagger using an RNNLM

What are some difficulties with this?
What limitation do you have in learning a POS tagger that you don't have when learning a LM?
One big problem: LIMITED DATA
"You shall know a word by the company it keeps"
–John Rupert Firth (1957)
Learning word representations using language modeling

•Idea: we'll learn word representations using a language model, then reuse them in our POS tagger (or any other thing we predict from words).
•Problem: the Bengio language model is slow. Imagine computing a softmax over 10,000 words!
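To see why this is slow, consider what a single softmax prediction costs with a 10,000-word vocabulary (hidden size 100 assumed here for illustration): every output word needs a score before anything can be normalized.

```python
import numpy as np

V_size, h = 10_000, 100                 # assumed sizes for illustration
rng = np.random.default_rng(1)
Vout = rng.normal(size=(V_size, h))     # output weights: one row per vocabulary word
hidden = rng.normal(size=h)

# One prediction scores EVERY word: 10,000 x 100 = 1,000,000 multiplications...
scores = Vout @ hidden
# ...then normalizes over all 10,000 scores.
probs = np.exp(scores - scores.max())
probs /= probs.sum()
```

This O(|V| * h) cost per training example is the bottleneck that motivates word2vec's cheaper training objectives.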
Continuous bag-of-words (CBOW)

Skip-gram

Learning skip-gram

Word representations capture some world knowledge
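Skip-gram trains the model to predict each context word from the center word. A sketch of how the (center, context) training pairs are extracted, with a window size of 2 (the window size is a free parameter, chosen here for illustration):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: skip-gram is trained to predict
    each context word within the window from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sat on the mat".split(), window=2)
# e.g. ("sat", "cat") and ("sat", "on") are both training pairs
```

CBOW uses the same windows but flips the direction: it predicts the center word from the (averaged) context words.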
Continuous Word Representations

Semantic: man : woman :: king : queen
Syntactic: walk : walks :: read : reads

Will it learn this?
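The famous analogy test solves "man : woman :: king : ?" by vector arithmetic. A sketch with hand-built toy embeddings (hypothetical 2-d vectors where dimension 0 loosely encodes gender and dimension 1 royalty; real embeddings are learned, not constructed):

```python
import numpy as np

# Hand-built toy embeddings for illustration (not learned).
emb = {
    "man":   np.array([ 1.0,  0.0]),
    "woman": np.array([-1.0,  0.0]),
    "king":  np.array([ 1.0,  1.0]),
    "queen": np.array([-1.0,  1.0]),
    "apple": np.array([ 0.0, -1.0]),   # distractor
}

def analogy(a, b, c):
    """Solve a : b :: c : ? by the vector offset b - a + c,
    returning the nearest word other than the three inputs."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

print(analogy("man", "woman", "king"))   # → queen
```

Whether learned embeddings actually organize themselves this way is exactly the empirical question the slide poses.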
(Additional) limitations of word2vec
•Closed vocabulary assumption
•Cannot exploit functional relationships in learning
Is this language?

What our data contains:
A Lorillard spokeswoman said, “This is an old story.”
What word2vec thinks our data contains:
A UNK UNK said, “This is an old story.”

Is it ok to ignore words?
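A sketch of the closed-vocabulary preprocessing the slide alludes to: any token outside the training vocabulary collapses to UNK before the model ever sees it (the toy vocabulary below is constructed so that exactly the slide's example falls out):

```python
def unkify(tokens, vocab):
    """Replace out-of-vocabulary tokens with UNK, as a closed-vocabulary
    model effectively does."""
    return [t if t in vocab else "UNK" for t in tokens]

vocab = {"A", "said", ",", "“", "This", "is", "an", "old", "story", ".", "”"}
sent = ["A", "Lorillard", "spokeswoman", "said", ",", "“", "This",
        "is", "an", "old", "story", ".", "”"]
print(" ".join(unkify(sent, vocab)))
# → A UNK UNK said , “ This is an old story . ”
```

The information that a "Lorillard spokeswoman" (and not, say, a senator) spoke is simply gone.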
What we know about linguistic structure

Morpheme: the smallest meaningful unit of language

“loves”
root/stem: love
affix: -s
morph. analysis: love +s (3rd.SG.PRES)
What if we embed morphemes rather than words?
Basic idea: compute representation recursively from children

Vectors in green are morpheme embeddings (parameters).
Vectors in grey are computed from their children (functions).
f is an activation function (e.g. tanh).
Train compositional morpheme model by minimizing distance to reference vector

Target output: reference vector p_r
Constructed vector is p_c
Minimize: ||p_r - p_c||^2
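A minimal sketch of this compositional step, in the style of Luong et al. (2013): the parent vector is f(W[x_left; x_right] + b) with f = tanh, and training minimizes the squared distance to a reference vector. Dimensions are toy values and W, b, and the embeddings are random placeholders, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4                                   # morpheme embedding dimension (toy value)

# Morpheme embeddings (parameters; the green vectors in the slides)
x_stem  = rng.normal(size=d)            # e.g. "love"
x_affix = rng.normal(size=d)            # e.g. "-s"

# Composition parameters (random placeholders here)
W = rng.normal(size=(d, 2 * d))
b = np.zeros(d)

def compose(left, right):
    """Parent vector computed recursively from its children: f(W [left; right] + b)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

p_c = compose(x_stem, x_affix)          # constructed vector for "loves"
p_r = rng.normal(size=d)                # reference vector (e.g. a pre-trained word vector)

loss = np.sum((p_r - p_c) ** 2)         # squared distance to minimize
```

Training would backpropagate this loss into W, b, and the morpheme embeddings; for multi-morpheme words, `compose` is applied recursively up the tree.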
Or, train in context using backpropagation

Vectors in green are morpheme embeddings (parameters).
Vectors in grey are computed from their children (functions).
Vectors in blue are word or n-gram embeddings (parameters).
(Basically a feedforward LM)
Where do we get morphemes?

•Use an unsupervised morphological analyzer (we'll talk about unsupervised learning later on).
•How many morphemes are there? New stems are invented every day! fleeking, fleeked, and fleeker are all attested…
Representations learned by compositional morphology model

Summary
•Deep learning is not magic and will not solve all of your problems, but representation learning is a very powerful idea.
•Word representations can be transferred between models.
•Word2vec trains word representations using an objective based on language modeling, so it can be trained on unlabeled data.
•Sometimes called unsupervised, but the objective is supervised!
•Vocabulary is not finite.
•Compositional representations based on morphemes make our models closer to open vocabulary.