
Page 1: 23 deep NLP - Wei Xu · 2020-06-08 · Takeaways ‣ Neural networks have several advantages for NLP: ‣ We can use simpler nonlinear functions instead of more complex linear functions

Deep Learning for NLP

Many slides from Greg Durrett

Instructor: Wei Xu Ohio State University

CSE 5525

Page 2:

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Page 3:

Sentiment Analysis

the movie was very good 👍

Page 4:

Sentiment Analysis with Linear Models

Example                                    Label  Feature type  Feature
the movie was very good                    👍     Unigrams      I[good]
the movie was not bad                      👍     Bigrams       I[not bad]
the movie was very bad                     👎     Unigrams      I[bad]
the movie was not very good                👎     Trigrams      I[not very good]
the movie was not really very enjoyable    👍     4-grams!

Page 5:

Drawbacks

‣ More complex features capture interactions but scale badly (13M unigrams, 1.3B 4-grams in Google n-grams)

the movie was not really very enjoyable

‣ Can we do better than seeing every n-gram once in the training data?

not very good / not so great

‣ Instead of more complex linear functions, let's use simpler nonlinear functions, namely neural networks

Page 6:

Neural Networks: XOR

‣ Inputs x1, x2 (generally x = (x1, . . . , xm))

‣ Output y (generally y = (y1, . . . , yn)); here y = x1 XOR x2

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

‣ Let's see how we can use neural nets to learn a simple nonlinear function

Page 7:

Neural Networks: XOR

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

‣ A linear function y = a1x1 + a2x2 cannot fit this table ✗

‣ Adding a nonlinear "or"-like term can: y = a1x1 + a2x2 + a3 tanh(x1 + x2)

(tanh looks like the action potential in a neuron)

Page 8:

Neural Networks: XOR

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

y = a1x1 + a2x2 ✗   →   y = a1x1 + a2x2 + a3 tanh(x1 + x2)  (the "or"-like unit)

‣ Choosing the coefficients: y = −x1 − x2 + 2 tanh(x1 + x2)
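The network on this slide can be checked numerically. A quick sketch (not from the slides; the 0.25 decision threshold is my own choice) that evaluates y = −x1 − x2 + 2 tanh(x1 + x2) on all four inputs:

```python
import math

def xor_net(x1, x2):
    # The slide's one-hidden-unit network: y = -x1 - x2 + 2*tanh(x1 + x2)
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

# Thresholding the output recovers the XOR truth table
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, 1 if xor_net(x1, x2) > 0.25 else 0)
```

The outputs are roughly 0, 0.52, 0.52, and −0.07, so any cutoff between those clusters separates the classes.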

Page 9:

Neural Networks: XOR

x1 = I[not], x2 = I[good]

y = −2x1 − x2 + 2 tanh(x1 + x2)

(plot of the resulting decision boundary in the x1–x2 plane omitted)

Page 10:

Neural Networks

(Linear model: y = w · x + b)

y = g(w · x + b)

y = g(Wx + b)

‣ Warp space (multiply by w), shift (add b), then apply a nonlinear transformation (g)

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 11:

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Linear classifier

Neural network

…possible because we transformed the space!

Page 12:

Deep Neural Networks

Adopted from Chris Dyer

y1 = g(w1 · x+ b1)

(this was our neural net from the XOR example)


Page 14:

Deep Neural Networks

Adopted from Chris Dyer

z = g(Vy + c), where y = g(Wx + b) is the output of the first layer

z = g(Vg(Wx + b) + c)

Input → Hidden Layer → Output
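A minimal numpy sketch of the two-layer computation z = g(Vg(Wx + b) + c); the dimensions and random weights below are my own illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.tanh                                   # elementwise nonlinearity

x = rng.standard_normal(4)                    # toy 4-dim input
W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
V, c = rng.standard_normal((2, 3)), rng.standard_normal(2)

y = g(W @ x + b)                              # output of first (hidden) layer
z = g(V @ y + c)                              # z = g(V g(Wx + b) + c)
print(z.shape)
```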

Page 15:

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Linear classifier

Neural network

…possible because we transformed the space!

Page 16:

Deep Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Page 17:

Deep Neural Networks

Adopted from Chris Dyer

z = g(Vg(Wx + b) + c)

With no nonlinearity: z = V(Wx + b) + c = VWx + Vb + c

Equivalent to z = Ux + d, a single linear layer with U = VW and d = Vb + c

Input → Hidden Layer → Output
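The collapse of stacked linear layers can be verified numerically; a sketch with made-up dimensions and weights:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
V, c = rng.standard_normal((2, 3)), rng.standard_normal(2)

# Two layers with the nonlinearity removed...
z_two = V @ (W @ x + b) + c
# ...equal a single linear layer z = Ux + d with U = VW and d = Vb + c
U, d = V @ W, V @ b + c
z_one = U @ x + d
print(np.allclose(z_two, z_one))
```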

Page 18:

Deep Neural Networks

Input → Hidden Layer → Output

‣ Nodes in the hidden layer can learn interactions or conjunctions of features

x1 = I[not], x2 = I[good]: the hidden tanh unit responds to "not OR good"

y = −2x1 − x2 + 2 tanh(x1 + x2)

Page 19:

Learning Neural Networks

‣ Learning requires gradients; by the chain rule, the change in output w.r.t. input = (change in output w.r.t. hidden) × (change in hidden w.r.t. input)

‣ Computing these looks like running this network in reverse (backpropagation)

Input → Hidden Layer → Output
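The chain-rule gradient can be checked with finite differences. A sketch (not from the slides) for a one-hidden-layer scalar output y = v · tanh(Wx + b), with shapes and weights of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(3)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
v = rng.standard_normal(4)

def f(x):
    return v @ np.tanh(W @ x + b)              # scalar output

# Chain rule: dy/dx = (dh/dx)^T (dy/dh), with hidden h = tanh(Wx + b)
h = np.tanh(W @ x + b)
grad = W.T @ (v * (1 - h ** 2))

# Numerical check via central differences
eps = 1e-6
num = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(grad, num, atol=1e-5))
```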

Page 20:

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Page 21:

Feedforward Bag-of-words

Features: I[good], I[not], I[bad], I[to], I[a], …

y = g(Wx + b)

x: binary vector, length = vocabulary size

W: real-valued matrix, dims = vocabulary size (~10k) x hidden layer size (~100)

(decision boundary over x1 = I[not], x2 = I[good], as on the earlier XOR slide)
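A sketch of the bag-of-words input and the y = g(Wx + b) step; the toy vocabulary and hidden size of 3 are my own illustration (the slide's sizes are ~10k and ~100):

```python
import numpy as np

vocab = {"good": 0, "not": 1, "bad": 2, "to": 3, "a": 4}   # toy vocabulary
x = np.zeros(len(vocab))
for w in "the movie was not good".split():
    if w in vocab:
        x[vocab[w]] = 1.0                     # binary indicator I[word]

rng = np.random.default_rng(3)
W, b = rng.standard_normal((3, len(vocab))), rng.standard_normal(3)
y = np.tanh(W @ x + b)                        # y = g(Wx + b)
print(x)
```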

Page 22:

Drawbacks to FFBoW

Features: I[good], I[not], I[bad], I[to], I[a], …

‣ really not very good and really not very enjoyable: we don't know the relationship between good and enjoyable

‣ Doesn't preserve ordering in the input

‣ Lots of parameters to learn

Page 23:

Word Embeddings

(embedding-space sketch: good, enjoyable, and great cluster together; bad, dog, and is lie elsewhere)

‣ word2vec: turn each word into a 100-dimensional vector

‣ Context-based embeddings: find a vector predictive of a word's context

‣ Words in similar contexts will end up with similar vectors
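"Similar vectors" is usually measured by cosine similarity. A sketch with hypothetical 4-d vectors (real word2vec embeddings are ~100-d and learned from corpora):

```python
import numpy as np

# Hypothetical embeddings, hand-picked so that "good" and "enjoyable"
# point in a similar direction while "bad" points the opposite way
vec = {
    "good":      np.array([0.9, 0.8, 0.1, 0.0]),
    "enjoyable": np.array([0.8, 0.9, 0.2, 0.1]),
    "bad":       np.array([-0.9, -0.7, 0.1, 0.0]),
}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Words used in similar contexts end up with similar (high-cosine) vectors
print(cos(vec["good"], vec["enjoyable"]), cos(vec["good"], vec["bad"]))
```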

Page 24:

Feedforward with word vectors

the movie was good .    the movie was great .

‣ Can capture word similarity

‣ Each x now represents multiple bits of input

y = g(Wx + b)

x: concatenated word vectors, length = sentence length x vector size

W: hidden layer size (~100) x (sentence length (~10) x vector size (~100))
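A sketch of the concatenated-vector input and the parameter count it implies; the 5-word sentence and 100-d vectors below are my own stand-ins for the slide's approximate sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical embeddings: 5 words in the sentence, 100 dims each
sent = [rng.standard_normal(100) for _ in range(5)]
x = np.concatenate(sent)                 # length = sentence length x vector size
W = rng.standard_normal((100, x.size))   # hidden (~100) x (5 x 100) weights
b = rng.standard_normal(100)
y = np.tanh(W @ x + b)                   # y = g(Wx + b)
print(x.size, W.size)                    # each weight is tied to one position
```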

Page 25:

Feedforward with word vectors

the movie was good .
the movie was very good

y = g(Wx + b)

‣ Inserting "very" shifts the later words to different input positions: we need our model to be shift-invariant, like bag-of-words is

Page 26:

Comparing Architectures

‣ Instead of more complex linear functions, let's use simpler nonlinear functions

‣ Feedforward bag-of-words: didn't take advantage of word similarity, lots of parameters to learn

‣ Feedforward with word vectors: our parameters are attached to particular indices in a sentence

‣ Solution: convolutional neural nets

Page 27:

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Page 28:

Convolutional Networks

the movie was good .

"good" filter output at each word: 0.03  0.02  0.1  1.1  0.0  →  max = 1.1

Page 29:

Convolutional Networks

the movie was good .

"good" filter:     0.03  0.02  0.1  1.1  0.0  →  max = 1.1
"bad" filter:      →  max = 0.1
"okay" filter:     →  max = 0.3
"terrible" filter: →  max = 0.1

Page 30:

Convolutional Networks

the movie was good .

"good" filter: 0.03  0.02  0.1  1.1  0.0  →  max = 1.1  (similarly for the other filters)

‣ Filters are initialized randomly and then learned

‣ Input: n vectors of length m each; k filters of length m each; k filter outputs of length 1 each

‣ Pooled outputs (1.1, 0.1, 0.3, 0.1): features for a classifier, or input to another neural net layer

‣ Takes variable-length input and turns it into fixed-length output
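The n-vectors-in, k-outputs-out shape bookkeeping can be sketched in numpy (random vectors and filters, dimensions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 100, 5, 4                      # vector dim, sentence length, filters
words = rng.standard_normal((n, m))      # n word vectors of length m
filters = rng.standard_normal((k, m))    # k filters of length m (width 1)

scores = words @ filters.T               # one score per word per filter: (n, k)
pooled = scores.max(axis=0)              # max over positions: k outputs
print(pooled.shape)                      # fixed length k for any sentence length n
```

Because the max is taken over positions, the output size depends only on k, which is what makes the input length variable.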

Page 31:

Convolutional Networks

the movie was great .

"good" filter output at each word: 0.03  0.02  0.1  1.8  0.0  →  max = 1.8

‣ Word vectors for similar words are similar, so convolutional filters will have similar outputs

Page 32:

Convolutional Networks

the movie was not good .

"not good" filter of width 2, applied to each pair of adjacent words (summing over both positions): outputs 0.03  0.05  0.1  0.2  1.5, with the window covering "not good" scoring 1.5  →  max = 1.5

‣ Analogous to bigram features in bag-of-words models
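A width-2 filter like this slide's "not good" detector can be sketched as one weight vector spanning two adjacent word vectors (random data, my own dimensions):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 100, 6                            # vector dim; "the movie was not good ."
words = rng.standard_normal((n, m))
filt = rng.standard_normal(2 * m)        # one filter spanning two word positions

# Slide the filter over adjacent word pairs (the "bigram" windows)
windows = [np.concatenate([words[i], words[i + 1]]) for i in range(n - 1)]
out = np.array([filt @ w for w in windows])
pooled = out.max()                       # still a single pooled score
print(out.shape)                         # n - 1 windows
```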

Page 33:

Comparing Architectures

‣ Instead of more complex linear functions, let's use simpler nonlinear functions

‣ Convolutional networks let us take advantage of word similarity

‣ Convolutional networks are translation-invariant like bag-of-words

‣ Convolutional networks can capture local interactions with filters of width > 1 (e.g. "not good")

Page 34:

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Page 35:

Sentence Classification

the movie was not good .  →  convolutional layer  →  fully connected layer  →  prediction

Page 36:

Object Recognition

AlexNet (2012): convolutional layers followed by fully connected layers

(filter visualizations from Conv layer 1 and Conv layer 3 omitted)

Page 37:

Neural networks are

‣ NNs are built from convolutional layers, fully connected layers, and some other types

‣ Can chain these together into various architectures

‣ Any neural network built this way can be learned from data!

Page 38:

Sentence Classification

the movie was not good .  →  convolutional layer  →  fully connected layer  →  prediction

Page 39:

Sentence Classification

‣ Tasks: question type classification, subjectivity/objectivity detection, movie review sentiment, product reviews

‣ Outperforms highly-tuned bag-of-words models

Taken from Kim (2014)

Page 40:

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Which Armstrong?

‣ Lance Edward Armstrong is an American former professional road cyclist

‣ Armstrong County is a county in Pennsylvania…

‣ Conventional: compare vectors of tf-idf features for overlap

‣ Convolutional networks can capture many of the same effects: distill notions of topic from n-grams

Francis-Landau, Durrett, and Klein (NAACL 2016)

Page 41:

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

‣ A convolutional network maps the source context and each candidate's description to a topic vector

‣ Lance Edward Armstrong is an American former professional road cyclist: similar topic vector, probable link

‣ Armstrong County is a county in Pennsylvania…: dissimilar topic vector, improbable link

Francis-Landau, Durrett, and Klein (NAACL 2016)

Page 42:

Syntactic Parsing

He wrote a long report on Mars .

‣ Two parses differ in where the PP "on Mars" attaches: to the NP ("report on Mars") or to the VP ("wrote on Mars")

Page 43:

Chart Parsing

He wrote a long report on Mars

NP → NP PP

chart value = score(rule) + chart(left child) + chart(right child)
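The chart recurrence can be sketched directly; the rule scores and child scores below are invented for illustration:

```python
# Rule scores keyed by (parent, left child, right child); values are made up
score = {("NP", "NP", "PP"): 1.5, ("NP", "DT", "NN"): 2.0}

def chart_value(rule, left_score, right_score):
    # chart value = score(rule) + chart(left child) + chart(right child)
    return score[rule] + left_score + right_score

best_np = chart_value(("NP", "DT", "NN"), 0.0, 0.0)   # children are leaves
best = chart_value(("NP", "NP", "PP"), best_np, 0.7)  # attach a PP to that NP
print(best)
```

In a full CKY parser this recurrence is maximized over all split points and rules for every span, bottom-up.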

Page 44:

Syntactic Parsing

He wrote a long report on Mars .
     2         5          7

score(NP → NP PP, (2, 5, 7)) = w^T f(NP → NP PP, (2, 5, 7))

feat = I[left child last word = report ∧ rule = NP → NP PP]

‣ Features need to combine surface information and syntactic information, but looking at words directly ends up being very sparse

Page 45:

Scoring parses with neural nets

He wrote a long report on Mars .
     2         5          7

score(NP → NP PP, (2, 5, 7)) = s^T v

s: vector representation of the rule being applied
v: neural network over the anchored span's surface features (word vectors generalize to similar words, e.g. Mars vs. Jupiter)

Durrett and Klein (ACL 2015)
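A shape-level sketch of the s^T v scoring (all dimensions and weights below are my own stand-ins, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(7)
f = rng.standard_normal(50)              # surface features of the anchored span
W = rng.standard_normal((20, 50))
v = np.tanh(W @ f)                       # neural representation of the span
S = rng.standard_normal((3, 20))         # one row s per candidate rule
scores = S @ v                           # score(rule) = s^T v, one per rule
print(scores.shape)
```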

Page 46:

Syntactic Parsing

He wrote a long report on Mars

NP → NP PP

Parsing a sentence (Discrete + Continuous):

‣ Discrete feature computation

‣ Feedforward pass on nets

‣ Run CKY dynamic program

Durrett and Klein (ACL 2015)

Page 47:

Machine Translation

le chat a mangé

the cat ate STOP

‣ Long short-term memory units

Page 48:

Long Short-Term Memory Networks

‣ Map sequence of inputs to sequence of outputs

Taken from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 49:

Machine Translation

le chat a mangé

the cat ate STOP

‣ Google is moving toward this architecture; performance is steadily improving compared to phrase-based methods

Page 50:

Neural Network Tools

‣ Torch: http://torch.ch/ (Facebook AI Research, Lua)

‣ Tensorflow: https://www.tensorflow.org/ (by Google, actively maintained, bindings for many languages)

‣ Theano: http://deeplearning.net/software/theano/ (University of Montreal, less and less maintained)

Page 51:

Neural Network

http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png

Page 52:

Word Vector Tools

‣ Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html (Python code, actively maintained); original implementation: https://code.google.com/archive/p/word2vec/

‣ GloVe: http://nlp.stanford.edu/projects/glove/ (word vectors trained on very large corpora)

Page 53:

Convolutional Networks

‣ CNNs for sentence classification: https://github.com/yoonkim/CNN_sentence

‣ Based on tutorial from: http://deeplearning.net/tutorial/lenet.html

‣ Python code, trains very quickly

Page 54:

Takeaways

‣ Neural networks have several advantages for NLP:

‣ We can use simpler nonlinear functions instead of more complex linear functions

‣ We can take advantage of word similarity

‣ We can build models that are both position-dependent (feedforward neural networks) and position-independent (convolutional networks)

‣ NNs have natural applications to many problems

‣ While conventional linear models often still do well, neural nets are increasingly the state of the art for many tasks