
Deep Learning for NLP

Many slides from Greg Durrett

Instructor: Wei Xu, Ohio State University

CSE 5525

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Sentiment Analysis

the movie was very good 👍

Sentiment Analysis with Linear Models

Example | Label | Feature type
the movie was very good | 👍 | Unigrams: I[good]
the movie was very bad | 👎 | Unigrams: I[bad]
the movie was not bad | 👍 | Bigrams: I[not bad]
the movie was not very good | 👎 | Trigrams: I[not very good]
the movie was not really very enjoyable | 👎 | 4-grams!
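To make the feature types concrete, here is a small sketch (not from the slides) of indicator n-gram features feeding a linear scorer; the feature vocabulary and weights below are invented for illustration.

```python
# Sketch: n-gram indicator features for a linear sentiment classifier.
# The example weights are invented for illustration only.

def ngram_features(tokens, max_n=3):
    """Return the set of indicator features I[ngram] for all n-grams up to max_n."""
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats

weights = {"good": 1.0, "bad": -1.0, "not bad": 1.5, "not very good": -2.0}

def score(sentence):
    feats = ngram_features(sentence.split())
    return sum(weights.get(f, 0.0) for f in feats)

print(score("the movie was very good"))      # positive
print(score("the movie was not very good"))  # negative only because the trigram is in the model
```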

Drawbacks

‣ More complex features capture interactions but scale badly (13M unigrams, 1.3B 4-grams in Google n-grams)

‣ Instead of more complex linear functions, let’s use simpler nonlinear functions, namely neural networks

the movie was not really very enjoyable

‣ Can we do better than seeing every n-gram once in the training data?

not very good ↔ not so great

Neural Networks: XOR

‣ Inputs: x1, x2 (generally x = (x1, ..., xm))

‣ Output: y (generally y = (y1, ..., yn)); here y = x1 XOR x2

‣ Let's see how we can use neural nets to learn a simple nonlinear function

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

Neural Networks: XOR

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

y = a1x1 + a2x2  ✗ (no linear function of x1 and x2 can compute XOR)

y = a1x1 + a2x2 + a3 tanh(x1 + x2)

(the tanh term acts like an "or" of x1 and x2; its shape looks like the action potential in a neuron)

Neural Networks: XOR

y = a1x1 + a2x2  ✗

y = a1x1 + a2x2 + a3 tanh(x1 + x2)

with a1 = -1, a2 = -1, a3 = 2 ("or"):

y = -x1 - x2 + 2 tanh(x1 + x2)

[figure: decision regions of this function plotted over x1 and x2, with the XOR truth table]
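A quick numerical check (my own addition, not from the slides) that this function separates the XOR cases:

```python
import numpy as np

# Check y = -x1 - x2 + 2*tanh(x1 + x2) on the four XOR inputs.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = -x1 - x2 + 2 * np.tanh(x1 + x2)
    print(x1, x2, x1 ^ x2, round(y, 3))
# y is ~0 or slightly negative when XOR is 0 and clearly positive (~0.52)
# when XOR is 1, so a threshold between them separates the two classes.
```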

Neural Networks: XOR

y = -2x1 - x2 + 2 tanh(x1 + x2)

[figure: the same kind of decision boundary plotted over the sentiment features I[not] and I[good]]

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

(Linear model: y = w · x + b)

y = g(w · x + b)

y = g(Wx + b)

Warp space (W), shift (b), nonlinear transformation (g)
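As a concrete illustration (a sketch, not from the slides) of one layer y = g(Wx + b), with tanh as the nonlinearity g and made-up dimensions:

```python
import numpy as np

def layer(x, W, b, g=np.tanh):
    # One neural-network layer: warp space (W), shift (b), nonlinear transformation (g).
    return g(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # input vector
W = rng.normal(size=(3, 4))     # weight matrix: 4-dim input -> 3-dim output
b = rng.normal(size=3)          # bias
print(layer(x, W, b))
```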

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Linear classifier

Neural network

…possible because we transformed the space!

Deep Neural Networks

Adopted from Chris Dyer

y1 = g(w1 · x + b1)

(this was our neural net from the XOR example)

Deep Neural Networks

Adopted from Chris Dyer

y1 = g(w1 · x + b1)

Deep Neural Networks

Adopted from Chris Dyer

z = g(Vg(Wx + b) + c)

(the inner g(Wx + b) is the output of the first layer, y)

z = g(Vy + c)

[figure: Input → Hidden Layer → Output]
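A minimal sketch (assumed dimensions, not from the slides) of the two-layer computation z = g(Vg(Wx + b) + c):

```python
import numpy as np

g = np.tanh
rng = np.random.default_rng(1)

x = rng.normal(size=5)                                 # input
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)     # input -> hidden
V, c = rng.normal(size=(2, 3)), rng.normal(size=2)     # hidden -> output

y = g(W @ x + b)          # output of first layer (hidden layer)
z = g(V @ y + c)          # z = g(V g(Wx + b) + c)
print(y, z)
```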


Deep Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Deep Neural Networks

Adopted from Chris Dyer

z = g(Vg(Wx + b) + c)

With no nonlinearity: z = VWx + Vb + c

Equivalent to z = Ux + d (a single linear model)

[figure: Input → Hidden Layer → Output]
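The collapse can be checked numerically; this sketch (my own, with random matrices) shows that without g the two layers equal one linear map with U = VW and d = Vb + c:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5)
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)
V, c = rng.normal(size=(2, 3)), rng.normal(size=2)

z_two_layer = V @ (W @ x + b) + c      # two layers with no nonlinearity
U, d = V @ W, V @ b + c                # equivalent single linear layer
print(np.allclose(z_two_layer, U @ x + d))   # True
```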

Deep Neural Networks

[figure: Input → Hidden Layer → Output, with inputs I[not] and I[good]]

‣ Nodes in the hidden layer can learn interactions or conjunctions of features, e.g. "not OR good" inside the tanh of y = -2x1 - x2 + 2 tanh(x1 + x2)

Learning Neural Networks

‣ To train, we need gradients; by the chain rule,

change in output w.r.t. input = (change in output w.r.t. hidden) × (change in hidden w.r.t. input)

‣ Computing these looks like running this network in reverse (backpropagation)

[figure: Input → Hidden Layer → Output]
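A tiny illustration of the chain rule on a one-hidden-unit network, with made-up weights, checked against a finite-difference gradient (my own sketch, not from the slides):

```python
import numpy as np

w1, w2 = 0.5, -1.3          # input->hidden and hidden->output weights (made up)
x = 2.0

def forward(x):
    h = np.tanh(w1 * x)     # hidden value
    return w2 * h           # output

# Chain rule: d(output)/dx = d(output)/d(hidden) * d(hidden)/dx
h = np.tanh(w1 * x)
grad = w2 * (1 - h**2) * w1

# Finite-difference check
eps = 1e-6
numeric = (forward(x + eps) - forward(x - eps)) / (2 * eps)
print(grad, numeric)        # should match closely
```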

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Feedforward Bag-of-words

[figure: input indicator features I[good], I[not], I[bad], I[to], I[a], ...]

y = g(Wx + b)

x: binary vector, length = vocabulary size

W: real-valued matrix, dims = vocabulary size (~10k) x hidden layer size (~100)

[figure: decision boundary over I[not] and I[good], as in the XOR example]
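A sketch of the feedforward bag-of-words computation with a toy vocabulary and random parameters (all hypothetical; here W is stored as hidden size × vocabulary size so that Wx is the hidden vector, and g is tanh):

```python
import numpy as np

vocab = {"the": 0, "movie": 1, "was": 2, "not": 3, "good": 4, "bad": 5}
hidden_size = 3

def bow_vector(tokens):
    # Binary vector, length = vocabulary size.
    x = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            x[vocab[t]] = 1.0
    return x

rng = np.random.default_rng(3)
W = rng.normal(size=(hidden_size, len(vocab)))   # hidden_size x vocab_size
b = rng.normal(size=hidden_size)

x = bow_vector("the movie was not good".split())
y = np.tanh(W @ x + b)                           # y = g(Wx + b)
print(y)
```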

Drawbacks to FFBoW

[figure: input indicator features I[good], I[not], I[bad], I[to], I[a], ...]

‣ really not very good and really not very enjoyable — we don’t know the relationship between good and enjoyable

‣ Doesn’t preserve ordering in the input

‣ Lots of parameters to learn

Word Embeddings

[figure: 2D word-vector space containing good, enjoyable, bad, dog, great, is; similar words (e.g. good, enjoyable, great) appear near each other]

‣ word2vec: turn each word into a 100-dimensional vector

‣ Context-based embeddings: find a vector predictive of a word’s context

‣ Words in similar contexts will end up with similar vectors

Feedforward with word vectors

‣ Can capture word similarity

[figure: "the movie was good ." and "the movie was great ." as sequences of word vectors]

‣ Each x now represents multiple bits of input

y = g(Wx + b)

x: concatenated word vectors, length = sentence length x vector size

W: real-valued matrix, dims = hidden layer size (~100) x (sentence length (~10) x vector size (~100))
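A sketch of this setup with random word vectors (all names and dimensions invented): the input x is the concatenation of the word vectors in order, so each sentence position gets its own block of columns in W.

```python
import numpy as np

rng = np.random.default_rng(4)
vec_size, hidden_size = 4, 3
embeddings = {w: rng.normal(size=vec_size)
              for w in ["the", "movie", "was", "good", "great", "."]}

def sentence_input(tokens):
    # Concatenate word vectors: length = sentence length * vector size.
    return np.concatenate([embeddings[t] for t in tokens])

tokens = "the movie was good .".split()
x = sentence_input(tokens)
W = rng.normal(size=(hidden_size, len(tokens) * vec_size))
b = rng.normal(size=hidden_size)
print(np.tanh(W @ x + b))      # y = g(Wx + b)
```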

Feedforward with word vectors

[figure: "the movie was good ." vs. "the movie was very good": the same words appear at different positions]

y = g(Wx + b)

‣ Need our model to be shift-invariant, like bag-of-words is

Comparing Architectures

‣ Instead of more complex linear functions, let's use simpler nonlinear functions

‣ Feedforward bag-of-words: didn't take advantage of word similarity, lots of parameters to learn

‣ Feedforward with word vectors: our parameters are attached to particular indices in a sentence

‣ Solution: convolutional neural nets

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Convolutional Networks

the movie was good .

"good" filter output at each position: 0.03 0.02 0.1 1.1 0.0 → max = 1.1

the movie was good .

"good" filter: 0.03 0.02 0.1 1.1 0.0 → max = 1.1

"bad" filter → max = 0.1

"okay" filter → max = 0.3

"terrible" filter → max = 0.1

Convolutional Networks

‣ Filters are initialized randomly and then learned

‣ Input: n vectors of length m each; k filters of length m each → k filter outputs of length 1 each

Pooled filter outputs [1.1, 0.1, 0.3, 0.1]: features for a classifier, or input to another neural net layer

‣ Takes variable-length input and turns it into fixed-length output

the movie was great .

"good" filter: 0.03 0.02 0.1 1.8 0.0 → max = 1.8

‣ Word vectors for similar words are similar, so convolutional filters will have similar outputs
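A minimal sketch (not from the slides) of width-1 convolutional filters over word vectors with max pooling; a width-2 filter like "not good" would instead dot with the concatenation of each adjacent pair of word vectors.

```python
import numpy as np

rng = np.random.default_rng(5)
vec_size = 4
tokens = "the movie was good .".split()
word_vecs = np.stack([rng.normal(size=vec_size) for _ in tokens])   # n x m

k = 3                                     # number of filters
filters = rng.normal(size=(k, vec_size))  # k filters of length m each

# Each filter is dotted with every word vector, then max-pooled over positions:
# variable-length input -> fixed-length output of k values.
responses = word_vecs @ filters.T         # n x k filter responses
pooled = responses.max(axis=0)            # k filter outputs of length 1 each
print(pooled)
```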

Convolutional Networks

the movie was not good .

"not good" filter (width 2): 0.03 0.05 0.1 0.2 1.5 → max = 1.5

‣ Analogous to bigram features in bag-of-words models

Convolutional Networks

[figure: the pooled outputs of multiple filters are combined into a single feature vector]

Comparing Architectures

‣ Instead of more complex linear functions, let's use simpler nonlinear functions

‣ Convolutional networks let us take advantage of word similarity

‣ Convolutional networks are translation-invariant like bag-of-words

‣ Convolutional networks can capture local interactions with filters of width > 1 (i.e. "not good")

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Sentence Classification

[figure: "the movie was not good ." → convolutional layer → fully connected layer → prediction]

Object Recognition

[figure: AlexNet (2012): convolutional layers followed by fully connected layers, with visualizations of Conv layer 1 and Conv layer 3]

Neural Networks

‣ NNs are built from convolutional layers, fully connected layers, and some other types

‣ Can chain these together into various architectures

‣ Any neural network built this way can be learned from data!

Sentence Classification

[figure: "the movie was not good ." → convolutional layer → fully connected layer → prediction]

Sentence Classification

[figure: benchmark results on question type classification, subjectivity/objectivity detection, movie review sentiment, and product reviews]

Taken from Kim (2014)

‣ Outperforms highly-tuned bag-of-words model

Entity Linking

Francis-Landau, Durrett, and Klein (NAACL 2016)

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Lance Edward Armstrong is an American former professional road cyclist

Armstrong County is a county in Pennsylvania…

??

‣ Conventional: compare vectors from tf-idf features for overlap

‣ Convolutional networks can capture many of the same effects: distill notions of topic from n-grams

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Armstrong County is a county in Pennsylvania…

Lance Edward Armstrong is an American former professional road cyclist

[figure: a convolutional network maps the source context and each candidate entity's article text to topic vectors; similar topic vectors → probable link, dissimilar → improbable link]

Francis-Landau, Durrett, and Klein (NAACL 2016)

Syntactic Parsing

[figure: PP-attachment ambiguity for "He wrote a long report on Mars .": the PP "on Mars" can attach to the NP ("report on Mars") or to the VP ("wrote ... on Mars")]

Chart Parsing

[figure: chart cell for a span of "He wrote a long report on Mars", built from NP and PP child cells]

chart value = score(rule) + chart(left child) + chart(right child)
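A sketch of this recurrence as a CKY-style dynamic program over a toy grammar; the grammar, rule scores, and part-of-speech scores below are invented for illustration.

```python
# chart[(i, j, X)] = best score of building symbol X over span (i, j):
#   max over rules X -> Y Z and split points k of
#   score(rule) + chart[(i, k, Y)] + chart[(k, j, Z)]

words = ["a", "long", "report", "on", "Mars"]
# Toy grammar with invented scores: (parent, left child, right child) -> score
rules = {("NP", "DT", "N"): 1.0, ("N", "ADJ", "N"): 0.5,
         ("PP", "P", "NP"): 1.0, ("NP", "NP", "PP"): 0.8, ("NP", "N", "PP"): 0.2}
# Invented part-of-speech scores for each word position.
pos = {0: {"DT": 0.0}, 1: {"ADJ": 0.0}, 2: {"N": 0.0}, 3: {"P": 0.0}, 4: {"NP": 0.0, "N": 0.0}}

chart = {}
n = len(words)
for i in range(n):                       # width-1 spans from POS scores
    for sym, s in pos[i].items():
        chart[(i, i + 1, sym)] = s
for width in range(2, n + 1):            # larger spans from the recurrence above
    for i in range(n - width + 1):
        j = i + width
        for (parent, left, right), rule_score in rules.items():
            for k in range(i + 1, j):
                if (i, k, left) in chart and (k, j, right) in chart:
                    cand = rule_score + chart[(i, k, left)] + chart[(k, j, right)]
                    if cand > chart.get((i, j, parent), float("-inf")):
                        chart[(i, j, parent)] = cand

print(chart.get((0, n, "NP")))           # best score for an NP over the whole span
```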

Syntactic Parsing

[figure: the rule NP → NP PP applied over the span (2, 7) of "He wrote a long report on Mars .", with split point 5]

score(NP → NP PP, 2, 5, 7) = w^T f(NP → NP PP, 2, 5, 7)

Example feature: left child last word = report ∧ rule = NP → NP PP

‣ Features need to combine surface information and syntactic information, but looking at words directly ends up being very sparse

[figure: surface features feat = I[...] of the anchored span are fed through a neural network to produce a vector v]

Scoring parses with neural nets

score(NP → NP PP) = s^T · v

s: vector representation of the rule being applied

[figure: the rule NP → NP PP applied to "He wrote a long report on Mars", with "Jupiter" shown as an alternative to "Mars"]

Durrett and Klein (ACL 2015)

Parsing a sentence:

‣ Discrete feature computation

‣ Feedforward pass on nets

‣ Run CKY dynamic program

Discrete + Continuous

Syntactic Parsing

Durrett and Klein (ACL 2015)

Machine Translation

[figure: sequence-to-sequence model translating "le chat a mangé" into "the cat ate STOP"]

‣ Long short-term memory units

Long Short-Term Memory Networks

‣ Map sequence of inputs to sequence of outputs

Taken from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Machine Translation

[figure: sequence-to-sequence model translating "le chat a mangé" into "the cat ate STOP"]

‣ Google is moving towards this architecture; performance is constantly improving compared to phrase-based methods

Neural Network Tools

‣ Torch: http://torch.ch/ (Facebook AI Research, Lua)

‣ Tensorflow: https://www.tensorflow.org/ (by Google, actively maintained, bindings for many languages)

‣ Theano: http://deeplearning.net/software/theano/ (University of Montreal, less and less maintained)

Neural Network Tools

[figure: a Theano computation graph, from http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png]

Word Vector Tools

‣ Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html (Python code, actively maintained); original implementation: https://code.google.com/archive/p/word2vec/

‣ GloVe: http://nlp.stanford.edu/projects/glove/ (word vectors trained on very large corpora)
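A hedged example of training word vectors with gensim's Word2Vec on an invented toy corpus (in older gensim versions the vector_size keyword was called size):

```python
from gensim.models import Word2Vec

# Toy corpus (invented); real word2vec models are trained on very large corpora.
sentences = [["the", "movie", "was", "good"],
             ["the", "movie", "was", "great"],
             ["the", "movie", "was", "bad"]]

# 100-dimensional vectors, as on the word2vec slide.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, epochs=50)

print(model.wv["good"][:5])                     # first few dimensions of one vector
print(model.wv.similarity("good", "great"))     # words in similar contexts -> similar vectors
```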

Convolutional Networks

‣ CNNs for sentence classification: https://github.com/yoonkim/CNN_sentence

‣ Based on tutorial from: http://deeplearning.net/tutorial/lenet.html

‣ Python code

‣ Trains very quickly

Takeaways

‣ Neural networks have several advantages for NLP:

‣ We can use simpler nonlinear functions instead of more complex linear functions

‣ We can take advantage of word similarity

‣ We can build models that are both position-dependent (feedforward neural networks) and position-independent (convolutional networks)

‣ NNs have natural applications to many problems

‣ While conventional linear models often still do well, neural nets are increasingly the state-of-the-art for many tasks
