Effective Composition of Dense Features in Natural Language Processing
Xipeng Qiu
[email protected]
http://nlp.fudan.edu.cn/~xpqiu
Fudan University
CCIR 2015, August 25, 2015
Outline
1 Basic Concepts
    Machine Learning & Deep Learning
2 Neural Models for NLP
    General Architecture
    Various Models
    Feature Composition Problems
3 Feature Compositions
    Cube Activation Function
    Neural Tensor Model
        Multi-relational Data Embeddings
        Neural Tensor Model
        Training
        Applications
    Gated Recursive Neural Network
    Attention Model
Basic Concepts of Machine Learning
Input Data: $(x_i, y_i),\ 1 \le i \le m$
Model:
    Linear Model: $y = f(x) = \mathbf{w}^\top \mathbf{x} + b$
    Generalized Linear Model: $y = f(x) = \mathbf{w}^\top \phi(\mathbf{x}) + b$
    Non-linear Model: Neural Network
Criterion:
    Loss Function: $L(y, f(x))$ → Optimization
    Empirical Risk: $Q(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(y_i, f(x_i, \theta))$ → Minimization
    Regularization: $\|\theta\|^2$
Objective Function: $Q(\theta) + \lambda \|\theta\|^2$
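A minimal numeric sketch of these definitions, assuming a squared loss and toy data of my own choosing (not from the talk):

```python
import numpy as np

# Toy data (x_i, y_i), 1 <= i <= m, for a linear model f(x) = w^T x + b.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # m = 3 samples, 2 features
y = np.array([1.0, 2.0, 3.0])

def f(X, theta):
    """Linear model; theta = (w, b)."""
    w, b = theta
    return X @ w + b

def objective(theta, X, y, lam=0.1):
    """Q(theta) + lambda * ||theta||^2, using squared loss as the illustrative L."""
    w, b = theta
    residual = f(X, theta) - y
    Q = np.mean(residual ** 2)        # empirical risk (1/m) * sum_i L(y_i, f(x_i, theta))
    reg = np.sum(w ** 2) + b ** 2     # ||theta||^2
    return Q + lam * reg

theta = (np.array([1.0, 1.0]), 0.5)
print(objective(theta, X, y))
```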
Basic Concepts of Deep Learning
Model: Artificial Neural Network (ANN)
Function: non-linear function $y = \sigma(\sum_i w_i x_i + b)$
[Figure: feed-forward network with an input layer (x1, ..., x4), two hidden layers, and an output layer producing y]
Traditional ML methods for NLP
Structured Learning
Structured learning is the task of assigning a group of labels $y = y_1, \dots, y_n$ to a group of inputs $x = x_1, \dots, x_n$. Given a sample $x$, we define the feature $\Phi(x, y)$. Thus, we can label $x$ with a score function,

$y = \arg\max_{y} F(\mathbf{w}, \Phi(x, y)), \quad (1)$

where $\mathbf{w}$ is the parameter of the function $F(\cdot)$.
General Neural Architectures for NLP
How to use neural networks for NLP tasks?
Distributed Representation
from [Collobert et al., 2011]
1 represent the words/features with dense vectors (embeddings) via a lookup table;
2 concatenate the vectors;
3 classify/match/rank with multi-layer neural networks (sketched below).
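A minimal sketch of this lookup, concatenate, classify pipeline; all sizes, the random initialization, and the window of word ids are illustrative assumptions, not values from [Collobert et al., 2011]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary of 10 words, 4-dim embeddings, window of 3 words.
vocab_size, emb_dim, window, hidden, n_classes = 10, 4, 3, 8, 5

lookup = rng.normal(scale=0.1, size=(vocab_size, emb_dim))   # embedding lookup table
W1 = rng.normal(scale=0.1, size=(hidden, window * emb_dim))  # hidden layer
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(n_classes, hidden))         # output layer
b2 = np.zeros(n_classes)

def classify(word_ids):
    """1) look up embeddings, 2) concatenate them, 3) classify with an MLP."""
    x = np.concatenate([lookup[i] for i in word_ids])         # steps 1 and 2
    h = np.tanh(W1 @ x + b1)                                  # step 3: hidden layer
    scores = W2 @ h + b2
    return np.exp(scores) / np.exp(scores).sum()              # softmax over classes

print(classify([1, 4, 7]))   # a 3-word window of word ids
```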
Differences from the traditional methods

                Traditional methods                 Neural methods
  Features      Discrete vector                     Dense vector
                (one-hot representation),           (distributed representation),
                high-dimensional                    low-dimensional
  Classifier    Linear                              Non-linear
Beyond word/feature embeddings
Can we encode a phrase, sentence, paragraph, or even a document into a distributed representation?
General Neural Architectures for NLP
Word Level
    NNLM
    C&W
    CBOW & Skip-Gram
Sentence Level
    NBOW
    Sequence Models: Recurrent NN, LSTM, Paragraph Vector
    Topological Models: Recursive NN
    Convolutional Models: DCNN
Document Level
    NBOW
    Hierarchical Models: two-level CNN
    Sequence Models: LSTM, Paragraph Vector
Skip-Gram Model
from [Mikolov et al., 2013]
Skip-Gram Model
Given a pair of words $(w, c)$, the probability that the word $c$ is observed in the context of the target word $w$ is given by

$\Pr(D = 1 \mid w, c) = \dfrac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{c})},$

where $\mathbf{w}$ and $\mathbf{c}$ are the embedding vectors of $w$ and $c$ respectively. The probability of not observing word $c$ in the context of $w$ is given by

$\Pr(D = 0 \mid w, c) = 1 - \dfrac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{c})}.$
Skip-Gram Model with Negative Sampling
Given a training set $D$, the word embeddings are learned by maximizing the following objective function:

$J(\theta) = \sum_{(w,c) \in D} \Pr(D = 1 \mid w, c) + \sum_{(w,c) \in D'} \Pr(D = 0 \mid w, c),$

where the set $D'$ consists of randomly sampled negative examples, assumed to be all incorrect.
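A minimal sketch of these probabilities and the objective, written exactly as on the slide (the widely used SGNS variant sums log-probabilities instead); the vocabulary size, embedding dimension, and the toy word/context pairs are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pr_observed(w_vec, c_vec):
    """Pr(D = 1 | w, c) = 1 / (1 + exp(-w^T c))."""
    return sigmoid(w_vec @ c_vec)

def objective(pos_pairs, neg_pairs, W, C):
    """J = sum of Pr(D=1|w,c) over observed pairs + sum of Pr(D=0|w,c) over sampled negatives."""
    J = sum(pr_observed(W[w], C[c]) for w, c in pos_pairs)
    J += sum(1.0 - pr_observed(W[w], C[c]) for w, c in neg_pairs)
    return J

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(10, 4))   # target-word embeddings
C = rng.normal(scale=0.1, size=(10, 4))   # context-word embeddings
print(objective([(1, 2), (3, 4)], [(1, 7), (3, 9)], W, C))
```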
Recurrent Neural Network (RNN) [Elman, 1990]
The RNN has recurrent hidden states whose output at each time step depends on that of the previous time step. More formally, given a sequence $x^{(1:n)} = (x^{(1)}, \dots, x^{(t)}, \dots, x^{(n)})$, the RNN updates its recurrent hidden state $h^{(t)}$ by

$h^{(t)} = g(U h^{(t-1)} + W x^{(t)} + b).$
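A minimal sketch of this update rule with $g = \tanh$; the dimensions and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

def rnn(xs, g=np.tanh):
    """h(t) = g(U h(t-1) + W x(t) + b), starting from h(0) = 0."""
    h = np.zeros(hidden_dim)
    hs = []
    for x in xs:
        h = g(U @ h + W @ x + b)
        hs.append(h)
    return hs

xs = rng.normal(size=(4, input_dim))      # a sequence x(1:4)
print(rnn(xs)[-1])                        # final hidden state h(4)
```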
Long Short-Term Memory Neural Network (LSTM) [Hochreiter and Schmidhuber, 1997]
[Figure: LSTM memory unit: the memory cell c(t) is updated from c(t-1) and the input x(t), controlled by the input gate i(t), forget gate f(t), and output gate o(t), producing the hidden state h(t)]
The core of the LSTM model is a memory cell c encoding, at every time step, what inputs have been observed up to that step. The behavior of the cell is controlled by three "gates" (sketched below):
input gate i
output gate o
forget gate f
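A minimal sketch of one LSTM step under the standard formulation of these three gates; the slide does not spell out the gate equations, so the exact parameterization below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 4

def new_W():
    # One weight matrix per gate plus the candidate, acting on [h(t-1); x(t)].
    return rng.normal(scale=0.1, size=(d_h, d_h + d_in))

Wi, Wf, Wo, Wc = new_W(), new_W(), new_W(), new_W()
bi = bf = bo = bc = np.zeros(d_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev):
    """One step: gates i, f, o control how the memory cell c is updated and exposed."""
    z = np.concatenate([h_prev, x])
    i = sigmoid(Wi @ z + bi)                 # input gate
    f = sigmoid(Wf @ z + bf)                 # forget gate
    o = sigmoid(Wo @ z + bo)                 # output gate
    c_tilde = np.tanh(Wc @ z + bc)           # candidate memory
    c = f * c_prev + i * c_tilde             # new memory cell
    h = o * np.tanh(c)                       # new hidden state
    return h, c

h = c = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):         # run over a length-5 sequence
    h, c = lstm_step(x, h, c)
print(h)
```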
Unfolded LSTM for Text Classification
[Figure: LSTM unfolded over the inputs x1 ... xT, producing hidden states h1 ... hT; the final state is fed to a softmax to predict y]
Drawback: long-term dependencies need to be transmitted one by one along the sequence.
Multi-Timescale LSTM
(a) Fast-to-Slow Strategy (b) Slow-to-Fast Strategy
Figure: Two feedback strategies of our model. The dashed line shows the feedback connection, and the solid link shows the connection at the current time.
from [Liu et al., 2015a]
Unfolded Multi-Timescale LSTM with Fast-to-Slow Feedback Strategy
[Figure: multi-timescale LSTM unfolded over the inputs x1 ... xT, with three groups of hidden states g^1, g^2, g^3 and a softmax output y]
from [Liu et al., 2015a]
LSTM for Sentiment Analysis
[Figure: sentiment-prediction curves of LSTM vs. MT-LSTM over three example sentences: "Is this progress ?", "He 'd create a movie better than this .", and "It 's not exactly a gourmet meal but the fare is fair , even coming from the drive ."]
Recursive Neural Network (RecNN) [Socher et al., 2013]
[Figure: binary parse tree for "a red bike": (a, Det), (red, JJ), (bike, NN); (red bike, NP); (a red bike, NP)]
Topological models compose the sentence representation following a given topological structure over the words.
Given a labeled binary parse tree, $((p_2 \to a\, p_1), (p_1 \to b\, c))$, the node representations are computed by

$p_1 = f\left(W \begin{bmatrix} b \\ c \end{bmatrix}\right), \quad p_2 = f\left(W \begin{bmatrix} a \\ p_1 \end{bmatrix}\right).$
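A minimal sketch of this bottom-up composition for the "a red bike" tree; the dimension, the shared matrix W, and the (random) word vectors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.normal(scale=0.1, size=(d, 2 * d))    # shared composition matrix

def compose(left, right, f=np.tanh):
    """p = f(W [left; right]) for one node of the binary parse tree."""
    return f(W @ np.concatenate([left, right]))

# Word vectors for "a red bike" (random placeholders).
a, red, bike = rng.normal(size=(3, d))
p1 = compose(red, bike)    # p1 -> (red, bike):  "red bike"
p2 = compose(a, p1)        # p2 -> (a, p1):      "a red bike"
print(p2)
```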
A variant of RecNN for Dependency Parse Tree [Zhu et al., 2015]
The recursive neural network can only process binary combinations and is not suitable for dependency parsing.
[Figure: recursive convolutional composition for "a red bike": a convolution layer over the head-child pairs (a, bike) and (red, bike), followed by pooling, produces the representation of (a red bike, NN)]
Recursive Convolutional Neural Network
    introducing convolution and pooling layers;
    modeling the complicated interactions of the head word and its children.
Convolutional Neural Network (CNN)
Key steps
Convolution
(optional) Folding
Pooling
Various models
DCNN (k-max pooling) [Kalchbrenner et al., 2014] (sketched below)
CNN (binary pooling) [Hu et al., 2014]
· · ·
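A minimal sketch of the convolution and k-max pooling steps listed above; the narrow 1-D convolution with a single filter and the toy sentence are illustrative assumptions rather than the exact DCNN formulation:

```python
import numpy as np

def conv1d(X, filt):
    """Narrow 1-D convolution of one filter over a sequence of d-dim word vectors X (n x d)."""
    n, d = X.shape
    width = filt.shape[0]
    return np.array([np.sum(X[i:i + width] * filt) for i in range(n - width + 1)])

def k_max_pooling(feature_map, k):
    """Keep the k largest values of the feature map, preserving their original order."""
    idx = np.sort(np.argsort(feature_map)[-k:])
    return feature_map[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 4))            # a 7-word sentence, 4-dim embeddings
filt = rng.normal(size=(3, 4))         # one filter spanning 3 words
fmap = conv1d(X, filt)
print(k_max_pooling(fmap, k=3))
```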
Overview of state-of-the-art neural models in NLP
Not “Really” Deep Learning in NLP
Most of the neural models in NLP are very shallow.
The major benefit is the introduction of dense representations.
The feature composition is also quite simple:
    Concatenation
    Sum/Average
    Bilinear model
Quite Simple Feature Composition
Given two embeddings $\mathbf{a}$ and $\mathbf{b}$ (see the sketch below),
1 how to calculate their similarity/relevance/relation?
    1 Concatenation: $\mathbf{a} \oplus \mathbf{b}$ → ANN → output
    2 Bilinear: $\mathbf{a}^\top M \mathbf{b}$ → output
2 how to use them in a classification task?
    1 Concatenation: $\mathbf{a} \oplus \mathbf{b}$ → ANN → output
    2 Sum/Average: $\mathbf{a} + \mathbf{b}$ → ANN → output
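A minimal sketch of these three compositions (concatenation, sum/average, bilinear); the dimensions and random parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
a, b = rng.normal(size=(2, d))

# Concatenation: a (+) b, fed to a small ANN layer.
W = rng.normal(scale=0.1, size=(k, 2 * d))
concat_out = np.tanh(W @ np.concatenate([a, b]))

# Sum/average, fed to the same kind of ANN layer.
W_sum = rng.normal(scale=0.1, size=(k, d))
sum_out = np.tanh(W_sum @ (a + b))

# Bilinear: a^T M b gives a single relevance score.
M = rng.normal(scale=0.1, size=(d, d))
bilinear_score = a @ M @ b

print(concat_out, sum_out, bilinear_score)
```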
Problem
How to enhance the neural model without increasing the network depth?
Solution 1: Cube Activation Function [Chen and Manning, 2014]
How to enhance the neural model without increasing the network depth?
A simple solution: the cube activation function. Suppose $\mathbf{x} = \mathbf{a} \oplus \mathbf{b}$,

$f(\mathbf{w}\mathbf{x} + b) = (w_1 x_1 + \cdots + w_m x_m + b)^3 = \sum_{i,j,k} (w_i w_j w_k)\, x_i x_j x_k + \sum_{i,j} b\,(w_i w_j)\, x_i x_j + \cdots$
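A minimal sketch of a single cube unit applied to $\mathbf{x} = \mathbf{a} \oplus \mathbf{b}$; the dimensions and random weights are illustrative assumptions:

```python
import numpy as np

def cube_unit(x, w, b):
    """f(w x + b) = (w_1 x_1 + ... + w_m x_m + b)^3: the cube implicitly mixes all
    triples x_i x_j x_k of the concatenated input features."""
    return (w @ x + b) ** 3

rng = np.random.default_rng(0)
a, b_vec = rng.normal(size=(2, 4))
x = np.concatenate([a, b_vec])        # x = a (+) b
w = rng.normal(scale=0.1, size=x.shape)
print(cube_unit(x, w, 0.1))
```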
Solution 2: Neural Tensor Model [Chen et al., 2013]
How to enhance the neural model without increasing the network depth?
Another intuitive solution: the Neural Tensor Model.

$s(\mathbf{a}, \mathbf{b}) = \mathbf{u}^\top f\left(\mathbf{a}^\top W^{[1:k]} \mathbf{b} + V^\top (\mathbf{a} \oplus \mathbf{b}) + \mathbf{c}\right)$
Neural Tensor Model
The original Neural Tensor Model was proposed to model multi-relational data embeddings [Chen et al., 2013].
Multi-relational Data Embeddings
Multi-relational data $(e_1, e_2, r)$: a pair of entities $(e_1, e_2)$ and their relation mention $r$.
The basic idea is that the relationship between two entities corresponds to a translation between the embeddings of the entities, that is, $\mathbf{e}_1 + \mathbf{r} \approx \mathbf{e}_2$ when $(e_1, e_2, r)$ holds.
    the embedding $\mathbf{r}$ of the relation mention $r$
    the entity embeddings $\mathbf{e}_1, \mathbf{e}_2$ of the entity pair $e_1, e_2$
We can get each of them from a lookup table.
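The slide only states the translation idea $\mathbf{e}_1 + \mathbf{r} \approx \mathbf{e}_2$; below is a minimal sketch of a translation-based plausibility score under that idea, where the negative L2 distance and all sizes are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, d = 6, 3, 4
E = rng.normal(scale=0.1, size=(n_entities, d))    # entity lookup table
R = rng.normal(scale=0.1, size=(n_relations, d))   # relation lookup table

def translation_score(e1, e2, r):
    """Smaller distance ||e1 + r - e2|| means (e1, e2, r) is more plausible,
    following the idea e1 + r ~= e2 when the triplet holds."""
    return -np.linalg.norm(E[e1] + R[r] - E[e2])

print(translation_score(0, 1, 2))
```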
Neural Tensor Model [Chen et al., 2013]
The Neural Tensor Model gives a score function $s(e_1, e_2, r)$ to model the triplet $(e_1, e_2, r)$ as follows:

$s(e_1, e_2, r) = \mathbf{u}^\top f\left(\mathbf{e}_1^\top W_r^{[1:k]} \mathbf{e}_2 + V_r^\top (\mathbf{e}_1 \oplus \mathbf{e}_2) + \mathbf{b}_r\right),$

where $W_r^{[1:k]} \in \mathbb{R}^{d \times d \times k}$ is a tensor, and the bilinear tensor product takes the two vectors $\mathbf{e}_1$ and $\mathbf{e}_2$ as input and generates a $k$-dimensional vector $\mathbf{z}$ as output:

$\mathbf{z} = \mathbf{e}_1^\top W_r^{[1:k]} \mathbf{e}_2.$
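A minimal sketch of the bilinear tensor product and the full score for a single relation $r$; the dimensions, the shape convention for $V_r$, and the random parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
W = rng.normal(scale=0.1, size=(k, d, d))          # tensor W_r^{[1:k]}, one d x d slice per k
V = rng.normal(scale=0.1, size=(k, 2 * d))         # plays the role of V_r^T
b = np.zeros(k)                                    # b_r
u = rng.normal(scale=0.1, size=k)

def ntn_score(e1, e2, f=np.tanh):
    """s(e1, e2, r) = u^T f(e1^T W^{[1:k]} e2 + V^T (e1 (+) e2) + b) for one relation r."""
    z = np.array([e1 @ W[i] @ e2 for i in range(k)])   # bilinear tensor product, k slices
    return u @ f(z + V @ np.concatenate([e1, e2]) + b)

e1, e2 = rng.normal(size=(2, d))
print(ntn_score(e1, e2))
```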
Neural Tensor Model
from [Chen et al., 2013]
Training
The objective function is based on the idea that the similarity scores of the observed triplets in the training set should be larger than those of any other triplets.

$L = \sum_{(e_1, e_2, r) \in D} \; \sum_{(e_1', e_2', r) \in D'} \left[\gamma - s(e_1, e_2, r) + s(e_1', e_2', r)\right]_+$

where $[x]_+$ denotes the positive part of $x$; $\gamma > 0$ is a margin hyper-parameter; $D$ is the set of all triplets extracted from the plain text; and $D'$ denotes the corrupted triplets, composed of training triplets with either $e_1$ or $e_2$ replaced by a random entity (but not both at the same time).
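A minimal sketch of this margin-based loss; the toy score function, the entity/relation tables, the particular corrupted triplets, and pairing observed triplets with corruptions that share the same relation are illustrative assumptions about the slide's double sum:

```python
import numpy as np

def margin_loss(observed, corrupted, score, gamma=1.0):
    """L = sum over observed triplets and their corruptions of
    [gamma - s(e1, e2, r) + s(e1', e2', r)]_+ (hinge on the score difference)."""
    loss = 0.0
    for (e1, e2, r) in observed:
        for (e1c, e2c, rc) in corrupted:
            if rc != r:
                continue   # corruptions keep the relation; only e1 or e2 is replaced
            loss += max(0.0, gamma - score(e1, e2, r) + score(e1c, e2c, r))
    return loss

# Toy score and triplets, purely for illustration.
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3))
R = rng.normal(size=(2, 3))
score = lambda e1, e2, r: float(E[e1] @ (E[e2] + R[r]))
observed  = [(0, 1, 0), (2, 3, 1)]
corrupted = [(0, 4, 0), (4, 3, 1)]       # e1 or e2 replaced by a random entity
print(margin_loss(observed, corrupted, score))
```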
Tensor Factorization
The tensor operation complexity is $O(d^2 k)$. To remedy this, we use a tensor factorization approach that factorizes each tensor slice as the product of two low-rank matrices. Formally, each tensor slice $M^{[i]} \in \mathbb{R}^{d \times d}$ is factorized into two low-rank matrices $P^{[i]} \in \mathbb{R}^{d \times r}$ and $Q^{[i]} \in \mathbb{R}^{r \times d}$:

$M^{[i]} = P^{[i]} Q^{[i]}, \quad 1 \le i \le k,$

where $r \ll d$ is the number of factors.

$g(w, c, t) = \mathbf{u}^\top f\left(\mathbf{w}^\top P^{[1:k]} Q^{[1:k]} \mathbf{t} + V_c^\top (\mathbf{w} \oplus \mathbf{t}) + \mathbf{b}_c\right).$

The complexity of the tensor operation is now $O(rdk)$.
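A minimal sketch of the factorized bilinear product, showing why each slice now costs O(rd) instead of O(d^2); the dimensions and random parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 3, 2                                   # r << d factors per slice
P = rng.normal(scale=0.1, size=(k, d, r))           # P^{[i]} in R^{d x r}
Q = rng.normal(scale=0.1, size=(k, r, d))           # Q^{[i]} in R^{r x d}

def factored_bilinear(w, t):
    """Each slice M^{[i]} = P^{[i]} Q^{[i]}, so w^T M^{[i]} t = (w^T P^{[i]}) (Q^{[i]} t),
    which costs O(r d) per slice instead of O(d^2)."""
    return np.array([(w @ P[i]) @ (Q[i] @ t) for i in range(k)])

w, t = rng.normal(size=(2, d))
print(factored_bilinear(w, t))
```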
Applications
Parsing [Socher et al., 2012] [Socher et al., 2013]
Chinese Word Segmentation [Pei et al., 2014]
Semantic Composition [Zhao et al., 2015b]
Sentence Matching [Qiu and Huang, 2015]
Context-Sensitive Word Embeddings [Liu et al., 2015b]
· · ·
Convolutional Neural Tensor Network for CQA [Qiu and Huang, 2015]
[Figure: architecture of the Convolutional Neural Tensor Network: the question and the answer are encoded and combined to produce a matching score]
Solution 3: Gated Recursive Neural Network [Chen et al., 2015a]
[Figure: Chinese word segmentation example "下雨天地面积水" with segmentation labels B, M, E, S; glosses: Rainy (下雨), Day (天), Ground (地面), Accumulated water (积水)]
DAG-based Recursive Neural Network
Gating mechanism
A relatively complicated solution
GRNN models the complicated combinations of the features, selecting and preserving the useful combinations via reset and update gates.
A similar model: AdaSent [Zhao et al., 2015a]
Gated Recursive Neural Network
[Figure: DAG structure of the GRNN over the input characters c_{i-2} ... c_{i+2}; the top-layer nodes are concatenated into x_i and passed through a linear layer y_i = W_s x_i + b_s to score the labels B, M, E, S]
The RecNN needs a topological structure to model a sequence, such as a syntactic tree. GRNN instead uses a directed acyclic graph (DAG) to model the combinations of the input characters, in which two consecutive nodes in the lower layer are combined into a single node in the upper layer.
GRNN Unit
[Figure: GRNN unit combining h_{j-1}^{(l-1)} and h_j^{(l-1)} into h_j^{(l)} through an update gate z and two reset gates r_L, r_R]
Two Gates
reset gate
update gate
The new activation $h_j^{l}$ is computed as:

$h_j^{l} = \tanh\left(W_h \begin{bmatrix} r_L \odot h_{j-1}^{l-1} \\ r_R \odot h_j^{l-1} \end{bmatrix}\right).$
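A minimal sketch of this new-activation computation for one GRNN node; the slide does not give the formula for the reset gates, so computing them with a sigmoid over both children is an assumption (and the update gate z is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wh = rng.normal(scale=0.1, size=(d, 2 * d))     # composition weights W_h
Wr = rng.normal(scale=0.1, size=(2 * d, 2 * d)) # reset-gate weights (assumed form)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grnn_new_activation(h_left, h_right):
    """h_j^(l) = tanh(W_h [r_L * h_{j-1}^(l-1) ; r_R * h_j^(l-1)]),
    with the reset gates r_L, r_R computed from both children (assumed here)."""
    z = np.concatenate([h_left, h_right])
    r = sigmoid(Wr @ z)                   # reset gates for both children
    r_left, r_right = r[:d], r[d:]
    gated = np.concatenate([r_left * h_left, r_right * h_right])
    return np.tanh(Wh @ gated)

h_left, h_right = rng.normal(size=(2, d))
print(grnn_new_activation(h_left, h_right))
```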
Applications
Chinese Word Segmentation [Chen et al., 2015a]
Dependency Parsing [Chen et al., 2015c]
Sentence Modeling [Chen et al., 2015b]
· · ·
Attention Model
from [Bahdanau et al., 2014]
Similar to the gating mechanism!
1 The context vector $c_i$ is computed as a weighted sum of the $h_j$:

$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$

2 The weight $\alpha_{ij}$ is computed by

$\alpha_{ij} = \mathrm{softmax}(a(s_{i-1}, h_j))$
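A minimal sketch of these two steps; the small MLP used as the alignment model a(·) and all sizes are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, a):
    """alpha_ij = softmax over j of a(s_{i-1}, h_j), then c_i = sum_j alpha_ij h_j."""
    scores = np.array([a(s_prev, h) for h in H])
    alphas = softmax(scores)
    return alphas @ H, alphas

rng = np.random.default_rng(0)
T, d = 5, 4
H = rng.normal(size=(T, d))               # encoder hidden states h_1 ... h_T
s_prev = rng.normal(size=d)               # previous decoder state s_{i-1}
Wa = rng.normal(scale=0.1, size=(d, 2 * d))
va = rng.normal(scale=0.1, size=d)
a = lambda s, h: va @ np.tanh(Wa @ np.concatenate([s, h]))   # assumed MLP alignment model
c_i, alphas = attention_context(s_prev, H, a)
print(alphas, c_i)
```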
Applications
Machine Translation [Bahdanau et al., 2014, Meng et al., 2015]
Speech Recognition [Chan et al., 2015]
· · ·
Conclusion
Four intuitive ways to model the combination of distributed features (dense vectors):
Cube Activation Function
Neural Tensor Model
Gated Recursive Neural Network
Attention Model
Better models?
References I
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ArXiv e-prints, September 2014.

W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals. Listen, Attend and Spell. ArXiv e-prints, August 2015.

Danqi Chen and Christopher D. Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, 2014.

Danqi Chen, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Learning new facts from knowledge bases with neural tensor networks and semantic word vectors. arXiv preprint arXiv:1301.3618, 2013.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. Gated recursive neural network for Chinese word segmentation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2015a.
References II
Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Shiyu Wu, and Xuanjing Huang. Sentence modeling with gated recursive neural network. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015b.

Xinchi Chen, Yaqian Zhou, Chenxi Zhu, Xipeng Qiu, and Xuanjing Huang. Transition-based dependency parsing using two heterogeneous gated recursive neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015c.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.

Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
References III
Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, 2014.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, 2014.

PengFei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. Multi-timescale long short-term memory neural network for modelling sentences and documents. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015a.

PengFei Liu, Xipeng Qiu, and Xuanjing Huang. Learning context-sensitive word embeddings with neural tensor skip-gram model. In Proceedings of the International Joint Conference on Artificial Intelligence, 2015b.

F. Meng, Z. Lu, Z. Tu, H. Li, and Q. Liu. Neural Transformation Machine: A New Architecture for Sequence-to-Sequence Learning. ArXiv e-prints, June 2015.
References IV
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Wenzhe Pei, Tao Ge, and Baobao Chang. Max-margin tensor neural network for Chinese word segmentation. In Proceedings of ACL, 2014.

Xipeng Qiu and Xuanjing Huang. Convolutional neural tensor network architecture for community-based question answering. In Proceedings of the International Joint Conference on Artificial Intelligence, 2015.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211, 2012.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. Parsing with compositional vector grammars. In Proceedings of the ACL Conference, 2013.
References V
Han Zhao, Zhengdong Lu, and Pascal Poupart. Self-adaptive hierarchical sentence model. arXiv preprint arXiv:1504.05070, 2015a.

Yu Zhao, Zhiyuan Liu, and Maosong Sun. Phrase type sensitive tensor indexing model for semantic composition. In AAAI, 2015b.

Chenxi Zhu, Xipeng Qiu, Xinchi Chen, and Xuanjing Huang. A re-ranking model for dependency parser with recursive convolutional neural network. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2015.