
Deep Learning for NLP

Many slides from Greg Durrett

Instructor: Wei Xu, Ohio State University

CSE 5525

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Sentiment Analysis

the movie was very good 👍

Sentiment Analysis with Linear Models

Example | Label | Feature type
the movie was very good | 👍 | Unigrams: I[good]
the movie was very bad | 👎 | Unigrams: I[bad]
the movie was not bad | 👍 | Bigrams: I[not bad]
the movie was not very good | 👎 | Trigrams: I[not very good]
the movie was not really very enjoyable | 👎 | 4-grams!
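To make the feature types concrete, here is a small sketch (not from the slides) of indicator n-gram features feeding a linear scorer; the feature vocabulary and weights below are invented for illustration.

```python
# Sketch: n-gram indicator features for a linear sentiment classifier.
# The example weights are invented for illustration only.

def ngram_features(tokens, max_n=3):
    """Return the set of indicator features I[ngram] for all n-grams up to max_n."""
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats

weights = {"good": 1.0, "bad": -1.0, "not bad": 1.5, "not very good": -2.0}

def score(sentence):
    feats = ngram_features(sentence.split())
    return sum(weights.get(f, 0.0) for f in feats)

print(score("the movie was very good"))      # positive
print(score("the movie was not very good"))  # negative only because the trigram is in the model
```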

Drawbacks

‣ More complex features capture interactions but scale badly (13M unigrams, 1.3B 4-grams in Google n-grams)

‣ Instead of more complex linear functions, let’s use simpler nonlinear functions, namely neural networks

the movie was not really very enjoyable

‣ Can we do better than seeing every n-gram once in the training data?

not very good ↔ not so great

Neural Networks: XOR

‣ Inputs: x1, x2 (generally x = (x1, ..., xm))

‣ Output: y (generally y = (y1, ..., yn)); here y = x1 XOR x2

‣ Let's see how we can use neural nets to learn a simple nonlinear function

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

Neural Networks: XOR

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

y = a1x1 + a2x2  ✗ (no linear function of x1 and x2 can compute XOR)

y = a1x1 + a2x2 + a3 tanh(x1 + x2)

(the tanh term acts like an "or" of x1 and x2; its shape looks like the action potential in a neuron)

Neural Networks: XOR

y = a1x1 + a2x2  ✗

y = a1x1 + a2x2 + a3 tanh(x1 + x2)

with a1 = -1, a2 = -1, a3 = 2 ("or"):

y = -x1 - x2 + 2 tanh(x1 + x2)

[figure: decision regions of this function plotted over x1 and x2, with the XOR truth table]
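A quick numerical check (my own addition, not from the slides) that this function separates the XOR cases:

```python
import numpy as np

# Check y = -x1 - x2 + 2*tanh(x1 + x2) on the four XOR inputs.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = -x1 - x2 + 2 * np.tanh(x1 + x2)
    print(x1, x2, x1 ^ x2, round(y, 3))
# y is ~0 or slightly negative when XOR is 0 and clearly positive (~0.52)
# when XOR is 1, so a threshold between them separates the two classes.
```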

Neural Networks: XOR

y = -2x1 - x2 + 2 tanh(x1 + x2)

[figure: the same kind of decision boundary plotted over the sentiment features I[not] and I[good]]

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

(Linear model: y = w · x + b)

y = g(w · x + b)

y = g(Wx + b)

Warp space (W), shift (b), nonlinear transformation (g)
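As a concrete illustration (a sketch, not from the slides) of one layer y = g(Wx + b), with tanh as the nonlinearity g and made-up dimensions:

```python
import numpy as np

def layer(x, W, b, g=np.tanh):
    # One neural-network layer: warp space (W), shift (b), nonlinear transformation (g).
    return g(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # input vector
W = rng.normal(size=(3, 4))     # weight matrix: 4-dim input -> 3-dim output
b = rng.normal(size=3)          # bias
print(layer(x, W, b))
```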

Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Linear classifier

Neural network

…possible because we transformed the space!

Deep Neural Networks

Adopted from Chris Dyer

y1 = g(w1 · x + b1)

(this was our neural net from the XOR example)

Deep Neural Networks

Adopted from Chris Dyer

y1 = g(w1 · x + b1)

Deep Neural Networks

Adopted from Chris Dyer

z = g(Vg(Wx + b) + c)

(the inner g(Wx + b) is the output of the first layer, y)

z = g(Vy + c)

[figure: Input → Hidden Layer → Output]
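A minimal sketch (assumed dimensions, not from the slides) of the two-layer computation z = g(Vg(Wx + b) + c):

```python
import numpy as np

g = np.tanh
rng = np.random.default_rng(1)

x = rng.normal(size=5)                                 # input
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)     # input -> hidden
V, c = rng.normal(size=(2, 3)), rng.normal(size=2)     # hidden -> output

y = g(W @ x + b)          # output of first layer (hidden layer)
z = g(V @ y + c)          # z = g(V g(Wx + b) + c)
print(y, z)
```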


Deep Neural Networks

Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Deep Neural Networks

Adopted from Chris Dyer

z = g(Vg(Wx + b) + c)

With no nonlinearity: z = VWx + Vb + c

Equivalent to z = Ux + d (a single linear model)

[figure: Input → Hidden Layer → Output]
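The collapse can be checked numerically; this sketch (my own, with random matrices) shows that without g the two layers equal one linear map with U = VW and d = Vb + c:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5)
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)
V, c = rng.normal(size=(2, 3)), rng.normal(size=2)

z_two_layer = V @ (W @ x + b) + c      # two layers with no nonlinearity
U, d = V @ W, V @ b + c                # equivalent single linear layer
print(np.allclose(z_two_layer, U @ x + d))   # True
```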

Deep Neural Networks

[figure: Input → Hidden Layer → Output, with inputs I[not] and I[good]]

‣ Nodes in the hidden layer can learn interactions or conjunctions of features, e.g. "not OR good" inside the tanh of y = -2x1 - x2 + 2 tanh(x1 + x2)

Learning Neural Networks

‣ To train, we need gradients; by the chain rule,

change in output w.r.t. input = (change in output w.r.t. hidden) × (change in hidden w.r.t. input)

‣ Computing these looks like running this network in reverse (backpropagation)

[figure: Input → Hidden Layer → Output]
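A tiny illustration of the chain rule on a one-hidden-unit network, with made-up weights, checked against a finite-difference gradient (my own sketch, not from the slides):

```python
import numpy as np

w1, w2 = 0.5, -1.3          # input->hidden and hidden->output weights (made up)
x = 2.0

def forward(x):
    h = np.tanh(w1 * x)     # hidden value
    return w2 * h           # output

# Chain rule: d(output)/dx = d(output)/d(hidden) * d(hidden)/dx
h = np.tanh(w1 * x)
grad = w2 * (1 - h**2) * w1

# Finite-difference check
eps = 1e-6
numeric = (forward(x + eps) - forward(x - eps)) / (2 * eps)
print(grad, numeric)        # should match closely
```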

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Feedforward Bag-of-words

[figure: input indicator features I[good], I[not], I[bad], I[to], I[a], ...]

y = g(Wx + b)

x: binary vector, length = vocabulary size

W: real-valued matrix, dims = vocabulary size (~10k) x hidden layer size (~100)

[figure: decision boundary over I[not] and I[good], as in the XOR example]
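A sketch of the feedforward bag-of-words computation with a toy vocabulary and random parameters (all hypothetical; here W is stored as hidden size × vocabulary size so that Wx is the hidden vector, and g is tanh):

```python
import numpy as np

vocab = {"the": 0, "movie": 1, "was": 2, "not": 3, "good": 4, "bad": 5}
hidden_size = 3

def bow_vector(tokens):
    # Binary vector, length = vocabulary size.
    x = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            x[vocab[t]] = 1.0
    return x

rng = np.random.default_rng(3)
W = rng.normal(size=(hidden_size, len(vocab)))   # hidden_size x vocab_size
b = rng.normal(size=hidden_size)

x = bow_vector("the movie was not good".split())
y = np.tanh(W @ x + b)                           # y = g(Wx + b)
print(y)
```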

Drawbacks to FFBoW

[figure: input indicator features I[good], I[not], I[bad], I[to], I[a], ...]

‣ really not very good and really not very enjoyable — we don’t know the relationship between good and enjoyable

‣ Doesn’t preserve ordering in the input

‣ Lots of parameters to learn

Word Embeddings

[figure: 2D word-vector space containing good, enjoyable, bad, dog, great, is; similar words (e.g. good, enjoyable, great) appear near each other]

‣ word2vec: turn each word into a 100-dimensional vector

‣ Context-based embeddings: find a vector predictive of a word’s context

‣ Words in similar contexts will end up with similar vectors

Feedforward with word vectors

‣ Can capture word similarity

[figure: "the movie was good ." and "the movie was great ." as sequences of word vectors]

‣ Each x now represents multiple bits of input

y = g(Wx + b)

x: concatenated word vectors, length = sentence length x vector size

W: real-valued matrix, dims = hidden layer size (~100) x (sentence length (~10) x vector size (~100))
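A sketch of this setup with random word vectors (all names and dimensions invented): the input x is the concatenation of the word vectors in order, so each sentence position gets its own block of columns in W.

```python
import numpy as np

rng = np.random.default_rng(4)
vec_size, hidden_size = 4, 3
embeddings = {w: rng.normal(size=vec_size)
              for w in ["the", "movie", "was", "good", "great", "."]}

def sentence_input(tokens):
    # Concatenate word vectors: length = sentence length * vector size.
    return np.concatenate([embeddings[t] for t in tokens])

tokens = "the movie was good .".split()
x = sentence_input(tokens)
W = rng.normal(size=(hidden_size, len(tokens) * vec_size))
b = rng.normal(size=hidden_size)
print(np.tanh(W @ x + b))      # y = g(Wx + b)
```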

Feedforward with word vectors

[figure: "the movie was good ." vs. "the movie was very good": the same words appear at different positions]

y = g(Wx + b)

‣ Need our model to be shift-invariant, like bag-of-words is

Comparing Architectures

‣ Instead of more complex linear functions, let's use simpler nonlinear functions

‣ Feedforward bag-of-words: didn't take advantage of word similarity, lots of parameters to learn

‣ Feedforward with word vectors: our parameters are attached to particular indices in a sentence

‣ Solution: convolutional neural nets

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Convolutional Networks

the movie was good .

"good" filter output at each position: 0.03 0.02 0.1 1.1 0.0 → max = 1.1

the movie was good .

"good" filter: 0.03 0.02 0.1 1.1 0.0 → max = 1.1

"bad" filter → max = 0.1

"okay" filter → max = 0.3

"terrible" filter → max = 0.1

Convolutional Networks

‣ Filters are initialized randomly and then learned

‣ Input: n vectors of length m each; k filters of length m each → k filter outputs of length 1 each

Pooled filter outputs [1.1, 0.1, 0.3, 0.1]: features for a classifier, or input to another neural net layer

‣ Takes variable-length input and turns it into fixed-length output

the movie was great .

"good" filter: 0.03 0.02 0.1 1.8 0.0 → max = 1.8

‣ Word vectors for similar words are similar, so convolutional filters will have similar outputs
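A minimal sketch (not from the slides) of width-1 convolutional filters over word vectors with max pooling; a width-2 filter like "not good" would instead dot with the concatenation of each adjacent pair of word vectors.

```python
import numpy as np

rng = np.random.default_rng(5)
vec_size = 4
tokens = "the movie was good .".split()
word_vecs = np.stack([rng.normal(size=vec_size) for _ in tokens])   # n x m

k = 3                                     # number of filters
filters = rng.normal(size=(k, vec_size))  # k filters of length m each

# Each filter is dotted with every word vector, then max-pooled over positions:
# variable-length input -> fixed-length output of k values.
responses = word_vecs @ filters.T         # n x k filter responses
pooled = responses.max(axis=0)            # k filter outputs of length 1 each
print(pooled)
```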

Convolutional Networks

the movie was not good .

"not good" filter (width 2): 0.03 0.05 0.1 0.2 1.5 → max = 1.5

‣ Analogous to bigram features in bag-of-words models

Convolutional Networks

[figure: the pooled outputs of multiple filters are combined into a single feature vector]

Comparing Architectures

‣ Instead of more complex linear functions, let's use simpler nonlinear functions

‣ Convolutional networks let us take advantage of word similarity

‣ Convolutional networks are translation-invariant like bag-of-words

‣ Convolutional networks can capture local interactions with filters of width > 1 (i.e. "not good")

Outline

‣ Motivation for neural networks

‣ Feedforward neural networks

‣ Applying feedforward neural networks to NLP

‣ Convolutional neural networks

‣ Application examples

‣ Tools

Sentence Classification

[figure: "the movie was not good ." → convolutional layer → fully connected layer → prediction]

Object Recognition

[figure: AlexNet (2012): convolutional layers followed by fully connected layers, with visualizations of Conv layer 1 and Conv layer 3]

Neural Networks

‣ NNs are built from convolutional layers, fully connected layers, and some other types

‣ Can chain these together into various architectures

‣ Any neural network built this way can be learned from data!

Sentence Classification

[figure: "the movie was not good ." → convolutional layer → fully connected layer → prediction]

Sentence Classification

[figure: benchmark results on question type classification, subjectivity/objectivity detection, movie review sentiment, and product reviews]

Taken from Kim (2014)

‣ Outperforms highly-tuned bag-of-words model

Entity Linking

Francis-Landau, Durrett, and Klein (NAACL 2016)

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Lance Edward Armstrong is an American former professional road cyclist

Armstrong County is a county in Pennsylvania…

??

‣ Conventional: compare vectors from tf-idf features for overlap

‣ Convolutional networks can capture many of the same effects: distill notions of topic from n-grams

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Armstrong County is a county in Pennsylvania…

Lance Edward Armstrong is an American former professional road cyclist

[figure: a convolutional network maps the source context and each candidate entity's article text to topic vectors; similar topic vectors → probable link, dissimilar → improbable link]

Francis-Landau, Durrett, and Klein (NAACL 2016)

Syntactic Parsing

[figure: PP-attachment ambiguity for "He wrote a long report on Mars .": the PP "on Mars" can attach to the NP ("report on Mars") or to the VP ("wrote ... on Mars")]

Chart Parsing

[figure: chart cell for a span of "He wrote a long report on Mars", built from NP and PP child cells]

chart value = score(rule) + chart(left child) + chart(right child)
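A sketch of this recurrence as a CKY-style dynamic program over a toy grammar; the grammar, rule scores, and part-of-speech scores below are invented for illustration.

```python
# chart[(i, j, X)] = best score of building symbol X over span (i, j):
#   max over rules X -> Y Z and split points k of
#   score(rule) + chart[(i, k, Y)] + chart[(k, j, Z)]

words = ["a", "long", "report", "on", "Mars"]
# Toy grammar with invented scores: (parent, left child, right child) -> score
rules = {("NP", "DT", "N"): 1.0, ("N", "ADJ", "N"): 0.5,
         ("PP", "P", "NP"): 1.0, ("NP", "NP", "PP"): 0.8, ("NP", "N", "PP"): 0.2}
# Invented part-of-speech scores for each word position.
pos = {0: {"DT": 0.0}, 1: {"ADJ": 0.0}, 2: {"N": 0.0}, 3: {"P": 0.0}, 4: {"NP": 0.0, "N": 0.0}}

chart = {}
n = len(words)
for i in range(n):                       # width-1 spans from POS scores
    for sym, s in pos[i].items():
        chart[(i, i + 1, sym)] = s
for width in range(2, n + 1):            # larger spans from the recurrence above
    for i in range(n - width + 1):
        j = i + width
        for (parent, left, right), rule_score in rules.items():
            for k in range(i + 1, j):
                if (i, k, left) in chart and (k, j, right) in chart:
                    cand = rule_score + chart[(i, k, left)] + chart[(k, j, right)]
                    if cand > chart.get((i, j, parent), float("-inf")):
                        chart[(i, j, parent)] = cand

print(chart.get((0, n, "NP")))           # best score for an NP over the whole span
```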

Syntactic Parsing

[figure: the rule NP → NP PP applied over the span (2, 7) of "He wrote a long report on Mars .", with split point 5]

score(NP → NP PP, 2, 5, 7) = w^T f(NP → NP PP, 2, 5, 7)

Example feature: left child last word = report ∧ rule = NP → NP PP

‣ Features need to combine surface information and syntactic information, but looking at words directly ends up being very sparse

[figure: surface features feat = I[...] of the anchored span are fed through a neural network to produce a vector v]

Scoring parses with neural nets

score(NP → NP PP) = s^T · v

s: vector representation of the rule being applied

[figure: the rule NP → NP PP applied to "He wrote a long report on Mars", with "Jupiter" shown as an alternative to "Mars"]

Durrett and Klein (ACL 2015)

Parsing a sentence:

‣ Discrete feature computation

‣ Feedforward pass on nets

‣ Run CKY dynamic program

Discrete + Continuous

Syntactic Parsing

Durrett and Klein (ACL 2015)

Machine Translation

[figure: sequence-to-sequence model translating "le chat a mangé" into "the cat ate STOP"]

‣ Long short-term memory units

Long Short-Term Memory Networks

‣ Map sequence of inputs to sequence of outputs

Taken from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Machine Translation

[figure: sequence-to-sequence model translating "le chat a mangé" into "the cat ate STOP"]

‣ Google is moving towards this architecture; performance is constantly improving compared to phrase-based methods

Neural Network Tools

‣ Torch: http://torch.ch/ (Facebook AI Research, Lua)

‣ Tensorflow: https://www.tensorflow.org/ (by Google, actively maintained, bindings for many languages)

‣ Theano: http://deeplearning.net/software/theano/ (University of Montreal, less and less maintained)

Neural Network Tools

[figure: a Theano computation graph, from http://tmmse.xyz/content/images/2016/02/theano-computation-graph.png]

Word Vector Tools

‣ Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html (Python code, actively maintained); original implementation: https://code.google.com/archive/p/word2vec/

‣ GloVe: http://nlp.stanford.edu/projects/glove/ (word vectors trained on very large corpora)
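A hedged example of training word vectors with gensim's Word2Vec on an invented toy corpus (in older gensim versions the vector_size keyword was called size):

```python
from gensim.models import Word2Vec

# Toy corpus (invented); real word2vec models are trained on very large corpora.
sentences = [["the", "movie", "was", "good"],
             ["the", "movie", "was", "great"],
             ["the", "movie", "was", "bad"]]

# 100-dimensional vectors, as on the word2vec slide.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, epochs=50)

print(model.wv["good"][:5])                     # first few dimensions of one vector
print(model.wv.similarity("good", "great"))     # words in similar contexts -> similar vectors
```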

Convolutional Networks

‣ CNNs for sentence classification: https://github.com/yoonkim/CNN_sentence

‣ Based on tutorial from: http://deeplearning.net/tutorial/lenet.html

‣ Python code

‣ Trains very quickly

Takeaways

‣ Neural networks have several advantages for NLP:

‣ We can use simpler nonlinear functions instead of more complex linear functions

‣ We can take advantage of word similarity

‣ We can build models that are both position-dependent (feedforward neural networks) and position-independent (convolutional networks)

‣ NNs have natural applications to many problems

‣ While conventional linear models often still do well, neural nets are increasingly the state-of-the-art for many tasks
