Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)

A Survey of Current Neural Network Architectures for NLP
Márton Miháltz
Meltwater Group
Hungarian NLP Meetup

• Introduction
  • Short intro to NN concepts
• Recurrent neural networks
  • Long Short-Term Memory, Gated Recurrent Unit
• Recursive neural networks
  • Applications to sentiment analysis: Socher et al. 2013; Tai et al. 2015
• Convolutional neural networks
  • Applications to text classification: Kim 2014
• Some more recent architectures
  • Memory networks, attention models, hybrid architectures
• Tools
  • Theano, Torch, TensorFlow, Caffe, Keras


• Feed-forward neural network
  • Activation functions: tanh, ReLU, Leaky/Parametric ReLU, SoftPlus, …
  • Logistic regression or softmax function for the classification layer
  • Loss functions (objectives): categorical cross-entropy, negative log-likelihood, …
• Training (optimizers): Gradient Descent, SGD, Mini-batch GD, RMSprop, Ada, Adagrad, Adam, Adamax, Nesterov Momentum, L-BFGS, …
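The classification layer and loss named above can be sketched in a few lines of plain Python. This is only an illustration: the scores are made-up numbers, and the function names are chosen here, not taken from any of the libraries in the talk.

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, gold):
    """Categorical cross-entropy = negative log-likelihood of the gold class index."""
    return -math.log(probs[gold])

scores = [2.0, 1.0, 0.1]                     # made-up output-layer scores for 3 classes
probs = softmax(scores)
loss = cross_entropy(probs, gold=0)
print(probs, loss)
```

An optimizer such as SGD would then adjust the network weights to lower this loss on the training examples.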

Very Short Intro to Modern Neural Networks

• Input embeddings
  • 1-hot encoding
  • Random vectors
  • Pre-trained vectors, e.g. distributional similarity
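A minimal sketch of the first two input options, assuming a toy four-word vocabulary and an arbitrary embedding size of 3 (both invented for illustration). It also shows that looking up a row of the embedding table gives the same result as multiplying the 1-hot vector by that table.

```python
import random

vocab = ["the", "movie", "was", "great"]     # toy vocabulary (assumption)
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector of vocabulary size with a single 1 at the word's index."""
    vec = [0.0] * len(vocab)
    vec[word_to_id[word]] = 1.0
    return vec

# Dense embedding table with random initial vectors (dimension 3 chosen arbitrarily).
rng = random.Random(0)
dim = 3
embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]

def embed(word):
    """Look up the dense vector for a word."""
    return embeddings[word_to_id[word]]

def matvec_lookup(word):
    """The same lookup written as an explicit one-hot times matrix product."""
    oh = one_hot(word)
    return [sum(oh[i] * embeddings[i][j] for i in range(len(vocab)))
            for j in range(dim)]

print(one_hot("movie"), embed("movie"))
```

With pre-trained vectors, the random table would simply be replaced by vectors loaded from a trained model.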

• Powerful apparatus for learning complex functions for ML
• Better at certain NLP tasks than previous methods
• Pre-trained distributed representation vectors
  • Word2vec, GloVe, GenSim, doc2vec, skip-thought vectors etc.
  • Vector space properties: similarity, analogies, compositionality etc.
• Less feature engineering needed
  • Network learns abstract representations
• Transfer learning / domain adaptation
• Joint learning/execution of NLP steps possible
• Easy to go multimodal
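The similarity and analogy properties listed above can be illustrated with cosine similarity and vector arithmetic. The 2-d vectors below are hand-made toy values, not real word2vec or GloVe output, and `analogy` is a hypothetical helper written for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors: axis 0 is roughly "royalty", axis 1 roughly "gender".
vectors = {
    "king":   [0.9, 0.8],
    "queen":  [0.9, -0.8],
    "prince": [0.7, 0.8],
    "man":    [0.1, 0.8],
    "woman":  [0.1, -0.8],
    "apple":  [-0.7, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "king", "woman"))   # prints "queen" for these toy vectors
```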

Why Deep Learning for NLP?

● About RNNs
  ○ Internal state depends on the state of the previous step
  ○ Good for sequential input
  ○ Backpropagation Through Time (BPTT) training
● Applications
  ○ Language modeling (e.g. in machine translation)
  ○ Sequential labeling
  ○ Text generation (e.g. image description generation, together with a CNN)
● Problems with RNNs
  ○ Long sentences, long-term dependencies
  ○ Exponentially shrinking gradients (“vanishing gradients”)
  ○ Solutions:
    ■ Initialization of weights; regularization; using the ReLU activation function
    ■ RNN variations: bidirectional RNN, deep RNN etc.
    ■ Gated RNNs: LSTM, GRU
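The recurrence and the vanishing-gradient problem described above can be seen in a scalar toy RNN. This is a sketch under simplifying assumptions: one hidden unit, a tanh activation, and made-up weights and inputs.

```python
import math

def rnn_forward(xs, w_h, w_x, h0=0.0):
    """Scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t). Returns [h_0, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))
    return hs

def grad_wrt_initial_state(hs, w_h):
    """d h_T / d h_0 by the chain rule: product of w_h * (1 - h_t^2) over all steps."""
    grad = 1.0
    for h in hs[1:]:
        grad *= w_h * (1.0 - h * h)
    return grad

xs = [0.5] * 50                              # made-up constant input sequence
for steps in (5, 20, 50):
    hs = rnn_forward(xs[:steps], w_h=0.9, w_x=0.5)
    print(steps, abs(grad_wrt_initial_state(hs, w_h=0.9)))
```

Each step multiplies the gradient by a factor smaller than 1 in magnitude here, so the influence of early inputs on the final state shrinks exponentially with sequence length.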

Recurrent Neural Networks

• Long Short-Term Memory networks
  • A special recurrent network
  • Has a memory cell (internal memory) (c)
  • 3 gates: input, forget, output; sigmoid layers with a pointwise multiplication operation (vector of values in [0, 1])
  • LSTM is able to remove or add information to the cell state, regulated by gates, which optionally let information through
• Gated Recurrent Units
  • Another RNN variant
  • No memory cell separate from the hidden state
  • 2 gates: reset, update (z)
  • Reset gate: how to combine new input with the previous state; update gate: how much of the previous state to keep
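To make the gating concrete, here is a scalar sketch of one GRU step, following the formulation in Chung et al. 2014 as far as I can reproduce it. The parameter names in the dictionary are invented for this example. In this convention `1 - z` is the share of the previous state that is kept; some presentations swap the roles of `z` and `1 - z`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step with a single hidden unit; p holds weights and biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])      # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])      # reset gate
    # Candidate state: the reset gate scales how much of h_prev is mixed with x.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Interpolate between the previous state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand

zeros = {k: 0.0 for k in ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
print(gru_step(1.0, 0.8, zeros))                 # z = 0.5, candidate = 0, so half of 0.8

keep = dict(zeros, bz=-50.0)                     # update gate almost closed
print(gru_step(1.0, 0.8, keep))                  # stays very close to 0.8
```

An LSTM adds a separate cell state and a third gate on top of the same idea of sigmoid-valued gates multiplied pointwise with the state.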

LSTMs and GRUs

[Figure: LSTM and GRU unit diagrams from Chung et al. 2014, with red labels added by the presenter]

• Overcome RNNs’ long-dependency limitations and the vanishing gradients problem
• Very hip in current NLP applications, e.g. SOTA in MT
• More complex architectures:
  • Bi-directional LSTM
  • Stacked (deep) (B-)LSTM/GRU layers
  • Another extension, Grid-LSTM (Kalchbrenner et al. 2015)
  • Still evolving!
• LSTM vs. GRU, which is better: the jury is still out
  • GRU has fewer parameters, may be faster to train
  • LSTM may be better with more data

LSTMs and GRUs

• About recursive neural networks
  • Hierarchical architecture
  • Shared weights
  • Plausible approach for modeling linguistic structures
• Sentiment analysis with recursive networks (Socher et al. 2013)
  • Compositional processing of parsed input (e.g. able to handle negations)
  • Performs sentence-level sentiment classification: Rotten Tomatoes dataset (Pang & Lee 2005), 11K movie review sentences labeled pos or neg; 85.4% accuracy on the binary-class subset, 45.7% on 5-class
  • Not the SOTA score any more, but was the first to go over 80% after 7 years
  • Sentiment Treebank for training

Recursive Networks

• Sentence words: embedding layer with random initial vectors (d = 25..35)
• Parse nodes: compositionality function computes the representation recursively
• Softmax classifier: pos/neg (or 5-class) label for each word and each parse node

Recursive Neural Tensor Network

● Weight tensor V:

● Intuition: each slice of the tensor captures a specific type of composition
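For reference, the RNTN composition function as I recall it from Socher et al. 2013 can be written as below. The symbols a and b (the two child vectors), p (the parent vector) and f (a tanh nonlinearity) are notation chosen for this sketch and may differ from the original slide.

```latex
p = f\!\left(
      \begin{bmatrix} a \\ b \end{bmatrix}^{\top}
      V^{[1:d]}
      \begin{bmatrix} a \\ b \end{bmatrix}
      + W \begin{bmatrix} a \\ b \end{bmatrix}
    \right),
\qquad
V^{[1:d]} \in \mathbb{R}^{2d \times 2d \times d},
\quad
W \in \mathbb{R}^{d \times 2d}
```

Each slice V^{[i]} is a 2d × 2d matrix that produces one coordinate of the parent vector, which is where the "one type of composition per slice" intuition comes from. Setting V to zero recovers the standard recursive network with only the W term.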

Sentiment Analysis with RNTN

• Tree-LSTM
  • Using constituency parsing
  • Using GloVe word vectors, updated during training
• Idea: sum the hidden states of each tree node's children
  • Each child has its own forget gate
  • Polarity softmax classifiers on tree nodes
• Improves on Socher et al. 2013
  • Fine-grained sentence sentiment: 51.0% vs. 45.7%
  • Binary sentence sentiment: 88.0% vs. 85.4%
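The "sum the children, one forget gate per child" idea corresponds to the child-sum variant of the Tree-LSTM in Tai et al. 2015 (the paper also describes an N-ary variant). A sketch of the key equations for a node j with children C(j), written from memory of the paper, with the input, output and update terms i_j, o_j, u_j computed from x_j and the summed child state as in a standard LSTM:

```latex
\tilde{h}_j = \sum_{k \in C(j)} h_k
\\
f_{jk} = \sigma\!\left( W^{(f)} x_j + U^{(f)} h_k + b^{(f)} \right)
\quad \text{for each child } k \in C(j)
\\
c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k
\\
h_j = o_j \odot \tanh(c_j)
```

Because each forget gate f_{jk} depends on the hidden state of its own child, the node can keep information from one subtree while discarding it from another.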

Tree-LSTMs for Sentiment Analysis (Tai et al 2015)

Convolutional Neural Networks
• CNNs (ConvNets) widely used in image processing
  • Location invariance
  • Compositionality
  • Fast
• Convolution layers
  • “Sliding window” over input representation: filter/kernel/feature generator
  • Local connectivity
  • Sharing weights
• Hyperparameters
  • Wide vs. narrow convolution (padding)
  • Filter size (width, height, depth)
  • Number of filters/layer
  • Stride size
  • Channels (R, G, B)
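The effect of the padding and stride hyperparameters above is easiest to see in one dimension. The function below is a minimal sketch written for this illustration (it computes cross-correlation, which is what neural network libraries usually call convolution); the signal and kernel values are made up.

```python
def conv1d(signal, kernel, stride=1, pad=0):
    """Slide `kernel` over `signal`, with optional zero padding and stride."""
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    k = len(kernel)
    out = []
    for start in range(0, len(padded) - k + 1, stride):
        window = padded[start:start + k]
        # The same kernel weights are reused at every position (weight sharing).
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]

narrow = conv1d(signal, kernel)               # no padding: 5 - 3 + 1 = 3 outputs
wide = conv1d(signal, kernel, pad=2)          # k - 1 zeros per side: 5 + 3 - 1 = 7 outputs
strided = conv1d(signal, kernel, stride=2)    # every second window: 2 outputs
print(narrow, wide, strided)
```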

CNNs for Text Classification

● Intuition: filter windows over sentence words <-> n-grams

● Advantage over Recursive NN/Tree-LSTM: does not require parsing

● Becoming a standard baseline for new text classification architectures

● Easy to parallelize on GPUs

CNN for Sentiment Analysis (Kim 2014)
• Sentence polarity classification (RT dataset/Sentiment Treebank)
  • 88.1% on binary sentiment classification
• Use word2vec vectors
  • Sentences: concatenated word vectors
  • 2 channels: static word2vec vectors and vectors tuned via backprop
• Multiple window sizes (h = 3, 4, 5) and multiple filters (e.g. 100)
• Apply max-pooling on each feature map
  • Selects the most important feature from the feature map
• Penultimate layer: final feature vector
  • Concatenate all pooled features
• Final layer: softmax classifier (pos/neg sentiment)
• Regularization: dropout on the penultimate layer
  • Randomly sets a proportion of the hidden units to 0 during training
  • Prevents co-adaptation of hidden units (reduces overfitting)
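The convolution and pooling steps of this architecture can be sketched in plain Python. This is an illustration only: the embedding size, the random "sentence" and the random filters are invented here, only one input channel is used, and there are two filters per window size instead of the roughly 100 mentioned above.

```python
import math
import random

def feature_map(sentence, filt, bias=0.0):
    """Apply one filter (h rows of word-vector weights) to every window of h words."""
    h = len(filt)
    feats = []
    for i in range(len(sentence) - h + 1):
        total = bias
        for row, word_vec in zip(filt, sentence[i:i + h]):
            total += sum(w * x for w, x in zip(row, word_vec))
        feats.append(math.tanh(total))
    return feats

def sentence_features(sentence, filters):
    """Max-over-time pooling per filter, concatenated into one feature vector."""
    return [max(feature_map(sentence, f)) for f in filters]

rng = random.Random(0)
dim = 4                                          # toy embedding size (assumption)
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(7)]   # 7 "words"
# Two random filters for each window size h = 3, 4, 5.
filters = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)]
           for h in (3, 4, 5) for _ in range(2)]

features = sentence_features(sentence, filters)
print(len(features))                             # one pooled value per filter
```

Because pooling keeps one value per filter, the penultimate feature vector has the same length for sentences of any length, as long as the sentence has at least as many words as the largest window. That fixed length is what lets a plain softmax layer sit on top.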

Adaptation of Word Vectors



• Recursive NNs
  • Linguistically plausible, applicable to grammatical structures, needs parsing
• Recurrent NNs
  • Engineered for sequential input, current improvements with gated RNNs (LSTM, GRU etc.)
• Convolutional NNs
  • Exceptionally good for classification; unclear how to incorporate phrase-level structures, hard to interpret, needs zero padding, good for GPUs


• Memory Networks
  • MemN2N (Sukhbaatar et al. 2015): Facebook’s bAbI Question Answering tasks 90-90%
  • Dynamic Memory Networks (Kumar, Irsoy et al. 2015): sentiment on RT dataset 88.6%; episodic memory: input sequences, questions, reasoning about answers
• Attention models
  • Parsing (Vinyals & Hinton et al. 2015); Machine Translation (Bahdanau & Bengio et al. 2016)
  • Relation extraction with LSTM + attention (Zhou et al. 2016)
  • Sentence embeddings with attention model (Wang et al. 2016)
• Hybrid architectures
  • NER with BLSTM-CNN (Chiu & Nichols 2016): 91.62% CoNLL, 86.28% OntoNotes
  • Sequential labeling with BLSTM-CNN-CRF (Ma & Hovy 2016): 97.55% PoS, 91.21% NER
  • Sentiment Analysis using CNN-LSTM (Wang et al. 2016)
• Joint learning of NLP tasks
  • POS tagging, chunking and CCG supertagging with one network (Søgaard & Goldberg 2016)
  • JEDI: Joint learning of NER and RE (Kirschnick et al. 2016)

Some Recent Work

● CUDA, cuDNN
  ○ You need these installed to utilize the GPU (Nvidia)
● Theano
  ○ Low-level abstraction; you define symbolic variables & functions; Python
● TensorFlow
  ○ Low-level abstraction; you define data flow graphs; C++, Python
● Torch
  ○ High abstraction level; very easy C interfacing; Lua

Tools for Hacking
● Caffe
  ○ Very high level, simple JSON-like config (protobuf text format), little versatility, most useful with convnets (C++/Python to extend)
● High-level wrappers
  ○ Keras: can bind to either TensorFlow or Theano; Python
  ○ SkFlow: wrapper around TensorFlow for those familiar with scikit-learn; Python
  ○ Pretty Tensor, TensorFlow-Slim: high-level wrapper functions for TensorFlow; Python
  ○ DIGITS: supports Caffe and Torch
● More
  ○ Nice overview here

Thank you!