On Dual Decomposition and Linear Programming Relaxations
for Natural Language Processing
Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola
Dynamic Programming
Dynamic programming is a dominant technique in NLP.
- Fast
- Exact
- Easy to implement
Examples:
- Viterbi algorithm for hidden Markov models
- CKY algorithm for weighted context-free grammars
$y^* = \arg\max_y f(y)$ ← Decoding
Model Complexity
Unfortunately, dynamic programming algorithms do not scale well with model complexity.
As our models become complex, these algorithms can explode in terms of computational or implementational complexity.
Integration:
- f ← Easy
- g ← Easy
- f + g ← Hard
Integration (1)
(S (NP (N Red)) (VP (V flies) (NP (D some) (A large) (N jet))))
Red1 flies2 some3 large4 jet5
N V D A N
f(y) + g(z)
- Classical problem in NLP.
- The dynamic programming intersection is prohibitively slow and complicated to implement.
Integration (2)
(S (NP (N Red)) (VP (V flies) (NP (D some) (A large) (N jet))))
*0 Red1 flies2 some3 large4 jet5
f(y) + g(z)
- Important for improving parsing accuracy.
- The dynamic programming intersection is slow and complicated to implement.
Dual Decomposition
A general technique for constructing decoding algorithms
Solve complicated models
$y^* = \arg\max_y f(y)$
by decomposing into smaller problems.
Upshot: Can utilize a toolbox of combinatorial algorithms.
- Dynamic programming
- Minimum spanning tree
- Shortest path
- Min-Cut
- ...
Dual Decomposition Algorithms
Simple - Uses basic dynamic programming algorithms
Efficient - Faster than full dynamic programming intersections
Strong Guarantees - Gives a certificate of optimality when exact
In experiments, we find the global optimum on 99% of examples.
Widely Applicable - Similar techniques extend to other problems
Roadmap
Algorithm
Experiments
LP Relaxations
Integrated Parsing and Tagging
Red1 flies2 some3 large4 jet5
[Figure: the sentence is decoded by an HMM and by a CFG, and the two are linked by Dual Decomposition.]
HMM: N V D A N
CFG: (S (NP (N Red)) (VP (V flies) (NP (D some) (A large) (N jet))))
HMM for Tagging
Red1 flies2 some3 large4 jet5
N V D A N
- Let Z be the set of all valid taggings of a sentence and g(z) be a scoring function.
e.g. g(z) = log p(Red1|N) + log p(V|N) + ...
$z^* = \arg\max_{z \in Z} g(z)$ ← Viterbi decoding
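A minimal sketch of Viterbi decoding for a bigram HMM, written so that it can also subtract per-position penalties as the algorithm later requires. The function name, the dictionary layout, the `<s>` start symbol, and the toy probabilities are all assumptions made for illustration; the slides do not specify an implementation (and the experiments use a trigram tagger, not a bigram one).

```python
import math

def viterbi(words, tags, log_emit, log_trans, penalty=None):
    """Best tag sequence under a bigram HMM.

    log_emit[(tag, word)] and log_trans[(prev_tag, tag)] are log-probabilities;
    missing entries count as -inf. penalty[(i, tag)] (1-based position i) is
    subtracted from the score, i.e. we maximize g(z) - sum u(i,t) z(i,t).
    """
    penalty = penalty or {}
    neg_inf = float("-inf")

    def local(i, tag):
        # Emission score of tag at 0-based position i, minus its penalty.
        return log_emit.get((tag, words[i]), neg_inf) - penalty.get((i + 1, tag), 0.0)

    # best[i][t]: score of the best tagging of words[0..i] that ends in tag t.
    best = [{t: log_trans.get(("<s>", t), neg_inf) + local(0, t) for t in tags}]
    back = []
    for i in range(1, len(words)):
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans.get((p, t), neg_inf))
            scores[t] = best[-1][prev] + log_trans.get((prev, t), neg_inf) + local(i, t)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)

    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Toy model (made-up numbers) over the first two words of the example.
TAGS = ["N", "V", "A"]
EMIT = {("N", "Red"): math.log(0.5), ("A", "Red"): math.log(0.5),
        ("V", "flies"): math.log(0.6), ("N", "flies"): math.log(0.4)}
TRANS = {("<s>", "N"): math.log(0.6), ("<s>", "A"): math.log(0.4),
         ("N", "V"): math.log(0.7), ("N", "N"): math.log(0.3),
         ("A", "N"): math.log(0.9), ("A", "V"): math.log(0.1)}

plain = viterbi(["Red", "flies"], TAGS, EMIT, TRANS)
penalized = viterbi(["Red", "flies"], TAGS, EMIT, TRANS, penalty={(2, "V"): 2.0})
```

Without penalties the toy model prefers N V; a penalty of 2 on tag V at position 2 is enough to flip the answer to A N, which is the mechanism dual decomposition relies on.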
CFG for Parsing
(S (NP (N Red)) (VP (V flies) (NP (D some) (A large) (N jet))))
- Let Y be the set of all valid parse trees for a sentence and f(y) be a scoring function.
e.g. f(y) = log p(S → NP VP|S) + log p(NP → N|NP) + ...
$y^* = \arg\max_{y \in Y} f(y)$ ← CKY Algorithm
Problem Definition
(S (NP (N Red)) (VP (V flies) (NP (D some) (A large) (N jet))))
- Find parse tree that optimizes
score(S → NP VP) + score(VP → V NP) + ... + score(Red1, N) + score(V, N) + ...
- Conventional Approach (Bar-Hillel et al., 1961)
  - Replace rules like S → NP VP with rules like $S_{N,N} \to NP_{N,V}\ VP_{V,N}$
Painful. $O(t^6)$ increase in complexity for trigram tagging.
The Integrated Parsing and Tagging Problem
Find $\arg\max_{y \in Y,\, z \in Z} f(y) + g(z)$
such that for all i, t: y(i, t) = z(i, t)
(Y: trees, scored by the CFG f. Z: taggings, scored by the HMM g. The equalities are the agreement constraints.)
Where y(i, t) = 1 if parse includes tag t at position i
z(i, t) = 1 if tagging includes tag t at position i
Algorithm Sketch
Set penalty weights equal to 0 for the tag at each position.
For k = 1 to K
  $y^{(k)}$ ← Decode (f(y) + penalty) by CKY Algorithm
  $z^{(k)}$ ← Decode (g(z) − penalty) by Viterbi Decoding
  If $y^{(k)}(i, t) = z^{(k)}(i, t)$ for all i, t Return ($y^{(k)}$, $z^{(k)}$)
  Else Update penalty weights based on $y^{(k)}(i, t) - z^{(k)}(i, t)$
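The loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the two decoders are replaced by brute-force search over a hypothetical two-word problem with made-up score tables `F` and `G`, and the step size `step / k` is one common diminishing schedule chosen here because the sketch only says the update is "based on" y − z.

```python
from itertools import product

def dual_decompose(decode_y, decode_z, keys, max_iters=50, step=1.0):
    """Run the penalty loop; return (y, z, k) with k the converging iteration or None.

    decode_y(u) must return argmax_y f(y) + sum u(i,t) y(i,t) and decode_z(u)
    must return argmax_z g(z) - sum u(i,t) z(i,t), each as a frozenset of the
    (position, tag) pairs that are switched on.
    """
    u = {key: 0.0 for key in keys}          # penalties start at zero
    y = z = frozenset()
    for k in range(1, max_iters + 1):
        y, z = decode_y(u), decode_z(u)
        if y == z:                          # agreement on every (i, t)
            return y, z, k
        alpha = step / k                    # assumed diminishing step size
        for key in y - z:                   # y(i,t) - z(i,t) = +1
            u[key] -= alpha
        for key in z - y:                   # y(i,t) - z(i,t) = -1
            u[key] += alpha
    return y, z, None

# Hypothetical toy problem: two positions, three tags, table-based scores.
TAGS = ("A", "N", "V")
F = {("A", "N"): 3.0, ("N", "V"): 2.5}      # stands in for the CFG score f
G = {("N", "V"): 3.0, ("A", "N"): 0.5}      # stands in for the HMM score g

def indicators(seq):
    return frozenset((i, t) for i, t in enumerate(seq, start=1))

def brute_force(table, sign):
    """Decoder that enumerates all tag sequences (stand-in for CKY / Viterbi)."""
    def decode(u):
        def score(seq):
            return table.get(seq, 0.0) + sign * sum(u[key] for key in indicators(seq))
        return indicators(max(product(TAGS, repeat=2), key=score))
    return decode

keys = [(i, t) for i in (1, 2) for t in TAGS]
y, z, iters = dual_decompose(brute_force(F, +1), brute_force(G, -1), keys)
joint_best = indicators(max(product(TAGS, repeat=2),
                            key=lambda s: F.get(s, 0.0) + G.get(s, 0.0)))
```

On this toy problem the two decoders disagree for two iterations and agree on the third, and the agreed structure equals the brute-force joint optimum `joint_best`.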
CKY Parsing
$y^* = \arg\max_{y \in Y} \left( f(y) + \sum_{i,t} u(i,t)\, y(i,t) \right)$
Parse trees shown across iterations:
(S (NP (A Red) (N flies) (D some) (A large)) (VP (V jet)))
(S (NP (N Red)) (VP (V flies) (NP (D some) (A large) (N jet))))
Viterbi Decoding
$z^* = \arg\max_{z \in Z} \left( g(z) - \sum_{i,t} u(i,t)\, z(i,t) \right)$
Red1 flies2 some3 large4 jet5
Taggings shown across iterations:
N V D A N
A N D A N
Penalties
u(i, t) = 0 for all i, t
Iteration 1: u(1,A) = -1, u(1,N) = 1, u(2,N) = -1, u(2,V) = 1, u(5,V) = -1, u(5,N) = 1
Iteration 2: u(5,V) = -1, u(5,N) = 1
Converged
$y^* = \arg\max_{y \in Y} f(y) + g(y)$
Key
f(y) ⇐ CFG, g(z) ⇐ HMM
Y ⇐ Parse Trees, Z ⇐ Taggings
y(i, t) = 1 if y contains tag t at position i
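The two penalty tables above can be reproduced with the update u(i,t) ← u(i,t) − α (y(i,t) − z(i,t)). The sketch below makes two assumptions that the transcript does not state outright: a constant step size α = 1, and a particular pairing of decoder outputs to iterations (iteration 1: parse tags A N D A V against tagging N V D A N; iteration 2: parse tags N V D A N against tagging A N D A N). That pairing is inferred from the tables themselves.

```python
def tag_indicators(tags):
    """Set of (position, tag) pairs, positions counted from 1."""
    return {(i, t) for i, t in enumerate(tags, start=1)}

def update(u, y_tags, z_tags, step=1.0):
    """Apply u <- u - step * (y - z) and return only the non-zero penalties."""
    y, z = tag_indicators(y_tags), tag_indicators(z_tags)
    for key in y - z:            # parser uses the tag, tagger does not
        u[key] = u.get(key, 0.0) - step
    for key in z - y:            # tagger uses the tag, parser does not
        u[key] = u.get(key, 0.0) + step
    return {key: value for key, value in u.items() if value != 0.0}

# Iteration 1 (assumed outputs): CKY tags vs. Viterbi tags.
u1 = update({}, "A N D A V".split(), "N V D A N".split())
# Iteration 2 (assumed outputs), starting from the iteration-1 penalties.
u2 = update(dict(u1), "N V D A N".split(), "A N D A N".split())
```

Under these assumptions `u1` matches the Iteration 1 table and `u2` matches the Iteration 2 table: the disagreements at positions 1 and 2 cancel, leaving only the penalties at position 5.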
Guarantees
Theorem: If at any iteration $y^{(k)}(i, t) = z^{(k)}(i, t)$ for all i, t, then ($y^{(k)}$, $z^{(k)}$) is the global optimum.
In experiments, we find the global optimum on 99% of examples.
If we do not converge to a match, we can still get a result (more in paper).
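A short derivation of why agreement certifies optimality, using the penalized objectives from the algorithm. The symbol L(u) for the sum of the two penalized maxima is introduced here for convenience; this is a sketch of the standard Lagrangian argument, not a quotation of the paper's proof.

```latex
\begin{align*}
L(u) &= \max_{y \in Y}\Big(f(y) + \sum_{i,t} u(i,t)\,y(i,t)\Big)
      + \max_{z \in Z}\Big(g(z) - \sum_{i,t} u(i,t)\,z(i,t)\Big) \\
     &\ge f(y') + g(z') + \sum_{i,t} u(i,t)\,\big(y'(i,t) - z'(i,t)\big)
      = f(y') + g(z')
      \quad \text{for every agreeing pair } (y', z').
\end{align*}
% So L(u) upper-bounds the score of every feasible pair, for any u.
% If the decoders' outputs at iteration k agree, the penalty terms cancel:
\begin{align*}
L(u^{(k)}) = f(y^{(k)}) + g(z^{(k)})
           + \sum_{i,t} u^{(k)}(i,t)\,\big(y^{(k)}(i,t) - z^{(k)}(i,t)\big)
           = f(y^{(k)}) + g(z^{(k)}).
\end{align*}
% A feasible pair that attains the upper bound must be a global optimum.
```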
Integrated CFG and Dependency Parsing
Red1 flies2 some3 large4 jet5
[Figure: the sentence is decoded by a Dependency Model and by a Lexicalized CFG, and the two are linked by Dual Decomposition.]
Dependency Model: *0 Red1 flies2 some3 large4 jet5
Lexicalized CFG: (S(flies) (NP (N Red)) (VP(flies) (V flies) (NP(jet) (D some) (A large) (N jet))))
Dependency Parsing
*0 Red1 flies2 some3 large4 jet5
- Let Z be the set of all valid dependency parses of a sentence and g(z) be a scoring function.
e.g. g(z) = log p(some3|jet5, large4) + ...
$z^* = \arg\max_{z \in Z} g(z)$ ← Eisner (2000) algorithm
Lexicalized PCFG
(S(flies) (NP (N Red)) (VP(flies) (V flies) (NP(jet) (D some) (A large) (N jet))))
- Let Y be the set of all valid lexicalized parse trees of a sentence and f(y) be a scoring function.
e.g. f(y) = log p(S(flies) → NP(Red) VP(flies)|S(flies)) + ...
$y^* = \arg\max_{y \in Y} f(y)$ ← Modified CKY algorithm
The Integrated Constituency and Dependency Parsing Problem
Find $\arg\max_{y \in Y,\, z \in Z} f(y) + g(z)$
such that for all i, j: y(i, j) = z(i, j)
(Y: trees, scored by the CFG f. Z: dependency trees, scored by the dependency model g. The equalities are the agreement constraints.)
Where y(i, j) = 1 if the constituency parse includes a dependency from word i to j
z(i, j) = 1 if the dependency parse includes a dependency from word i to j
CKY Parsing
$y^* = \arg\max_{y \in Y} \left( f(y) + \sum_{i,j} u(i,j)\, y(i,j) \right)$
Parse trees shown across iterations:
(S(flies) (NP (N Red)) (VP(flies) (V flies) (D some) (NP(jet) (A large) (N jet))))
(S(flies) (NP (N Red)) (VP(flies) (V flies) (NP(jet) (D some) (A large) (N jet))))
Dependency Parsing
$z^* = \arg\max_{z \in Z} \left( g(z) - \sum_{i,j} u(i,j)\, z(i,j) \right)$
*0 Red1 flies2 some3 large4 jet5
Penalties
u(i, j) = 0 for all i, j
Iteration 1: u(2, 3) = -1, u(5, 3) = 1
Converged
$y^* = \arg\max_{y \in Y} f(y) + g(y)$
Key
f(y) ⇐ CFG, g(z) ⇐ Dependency Model
Y ⇐ Parse Trees, Z ⇐ Dependency Trees
y(i, j) = 1 if y contains dependency i, j
Roadmap
Algorithm
Experiments
LP Relaxations
Experiment
Properties:
- Exactness
- Parsing Accuracy
Experiments on:
- English Penn Treebank
Models
- Collins (1997) Model 1
- Semi-Supervised Dependency Parser (Koo, 2008)
- Trigram Tagger (Toutanova, 2000)
How quickly do the models converge?
[Figure: two panels, "Integrated Dependency Parsing" and "Integrated POS Tagging", each plotting % examples converged (0 to 100) against number of iterations (<=1, <=2, <=3, <=4, <=10, <=20, <=50).]
Integrated Constituency and Dependency Parsing: Accuracy
[Figure: bar chart of F1 Score (axis 87 to 92) for three systems labeled Collins, Dep, and Dual.]
- Collins (1997) Model 1
- Fixed, First-best Dependencies from Koo (2008)
- Dual Decomposition
Integrated Parsing and Tagging: Accuracy
[Figure: bar chart of F1 Score (axis 87 to 92) for two systems labeled Fixed and Dual.]
- Fixed, First-Best Tags From Toutanova (2000)
- Dual Decomposition
Roadmap
Algorithm
Experiments
LP Relaxations
Dual Decomposition and Linear Programming Relaxations
Theorem
- If the dual decomposition algorithm converges, then ($y^{(k)}$, $z^{(k)}$) is the global optimum.
Questions
- What problem is dual decomposition solving?
- How come the algorithm doesn't always converge?
Dual decomposition searches over a linear programming relaxation of the original problem.
Convex Hulls for CKY
A parse tree can be represented as a binary vector y ∈ Y.
y(A → B C, i, j, k) = 1 if rule A → B C is used at span i, j, k.
[Figure: "Parsing" panel showing the points of Y, their convex hull conv(Y), the weight vector w, and the optimum y*.]
- If f is linear, $\arg\max_{y \in \mathrm{conv}(Y)} f(y)$ is a linear program.
- The best point in an LP is a vertex. So CKY solves this LP.
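A small numerical illustration of the vertex claim: for a linear objective, no convex combination of the points in Y scores higher than the best single point. The three indicator vectors and the weight vector `w` are hypothetical values invented for this check; they do not encode any real grammar.

```python
import random

# Hypothetical indicator vectors standing in for a tiny set Y of structures.
Y = [(1, 0, 0, 1), (0, 1, 0, 1), (0, 0, 1, 0)]
w = (0.3, 1.2, -0.5, 0.4)                   # made-up linear scoring weights

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Best vertex, found by enumerating Y (what a combinatorial decoder returns).
best_vertex = max(Y, key=lambda y: dot(w, y))

# Sample random points of conv(Y) and record the largest amount by which
# any of them beats the best vertex (it should never be positive).
rng = random.Random(0)
worst_gap = float("-inf")
for _ in range(1000):
    raw = [rng.random() for _ in Y]
    lam = [r / sum(raw) for r in raw]       # convex weights: >= 0, sum to 1
    mu = [sum(l * y[d] for l, y in zip(lam, Y)) for d in range(len(w))]
    worst_gap = max(worst_gap, dot(w, mu) - dot(w, best_vertex))
```

Because w · μ is a weighted average of the vertex scores, `worst_gap` stays at or below zero (up to rounding), which is why maximizing over conv(Y) and maximizing over Y give the same value.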
Combined Problem
$Q = \{(y, z) : y \in Y,\ z \in Z,\ y(i,t) = z(i,t) \text{ for all } (i,t)\}$
$Q' = \{(\mu, \nu) : \mu \in \mathrm{conv}(Y),\ \nu \in \mathrm{conv}(Z),\ \mu(i,t) = \nu(i,t) \text{ for all } (i,t)\}$
[Figure: conv(Q) drawn inside the outer bound Q′, with the weight vector w and possible optima (y*, z*) marked.]
Dual decomposition searches over Q′
Depending on the weight vector, (y*, z*) ∈ Q′ could be in Q or in the strict outer bound.
Are there points strictly in the outer bound?
[Figure: Q′ with a candidate optimum (y*, z*) marked.]
Taggings: 0.5 × (w1/A w2/A w3/A) + 0.5 × (w1/A w2/B w3/B)
Parses: 0.5 × (X (A w1) (X (A w2) (B w3))) + 0.5 × (X (A w1) (X (B w2) (A w3)))
Best result can be a fractional solution: a convex combination of these structures.
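The example above can be checked mechanically: the two half-weighted taggings and the two half-weighted parses induce the same fractional tag indicators at every position, so the pair satisfies the agreement constraints of Q′, even though no single tagging matches either parse. The sketch below only verifies these two facts; the tag sequences read off the parses (A A B and A B A) follow the bracketing reconstructed from the slide, and whether the point is actually optimal depends on model scores that the slide does not give.

```python
from collections import Counter
from fractions import Fraction

half = Fraction(1, 2)
# (weight, tag sequence) pairs for the two mixtures on the slide.
taggings = [(half, ("A", "A", "A")), (half, ("A", "B", "B"))]
parse_tags = [(half, ("A", "A", "B")), (half, ("A", "B", "A"))]

def marginals(mixture):
    """Fractional indicator value for each (position, tag) under the mixture."""
    totals = Counter()
    for weight, tags in mixture:
        for i, t in enumerate(tags, start=1):
            totals[(i, t)] += weight
    return dict(totals)

mu = marginals(parse_tags)     # plays the role of mu(i, t)
nu = marginals(taggings)       # plays the role of nu(i, t)

# No individual tagging equals the tag sequence of either parse.
shared = {tags for _, tags in taggings} & {tags for _, tags in parse_tags}
```

Here `mu == nu` holds with values such as 1 for (1, A) and 1/2 for (2, A) and (2, B), while `shared` is empty.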
Summary
A Dual Decomposition algorithm for integrated decoding
Simple - Uses only simple, off-the-shelf dynamic programming algorithms to solve a harder problem.
Efficient - Faster than classical methods for dynamic programming intersection.
Strong Guarantees - Solves a linear programming relaxation which gives a certificate of optimality.
Finds the exact solution on 99% of the examples.
Widely Applicable - Similar techniques extend to other problems
Appendix
Iterative Progress
[Figure: percentage (50 to 100) against maximum number of dual decomposition iterations (0 to 50), with curves for f score, % certificates, and % match K=50.]
Deriving the Algorithm
Goal: $y^* = \arg\max_{y \in Y} f(y)$
Rewrite: $\arg\max_{z \in Z,\, y \in Y} f(z) + g(y)$ s.t. z(i, j) = y(i, j) for all i, j
Note: this derivation attaches f to z and g to y, the reverse of the earlier slides.
Lagrangian: $L(u, y, z) = f(z) + g(y) + \sum_{i,j} u(i,j)\,\big(z(i,j) - y(i,j)\big)$
The dual problem is to find $\min_u L(u)$ where
$L(u) = \max_{y \in Y,\, z \in Z} L(u, y, z) = \max_{z \in Z} \Big( f(z) + \sum_{i,j} u(i,j)\, z(i,j) \Big) + \max_{y \in Y} \Big( g(y) - \sum_{i,j} u(i,j)\, y(i,j) \Big)$
Dual is an upper bound: $L(u) \ge f(z^*) + g(y^*)$ for any u
A Subgradient Algorithm for Minimizing L(u)
$L(u) = \max_{z \in Z} \Big( f(z) + \sum_{i,j} u(i,j)\, z(i,j) \Big) + \max_{y \in Y} \Big( g(y) - \sum_{i,j} u(i,j)\, y(i,j) \Big)$
L(u) is convex, but not differentiable. A subgradient of L(u) at u is a vector $g_u$ such that for all v,
$L(v) \ge L(u) + g_u \cdot (v - u)$
Subgradient methods use updates $u' = u - \alpha g_u$
In fact, for our L(u), $g_u(i, j) = z^*(i, j) - y^*(i, j)$, where z* and y* are the maximizers of the two terms at the current u.
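A toy check of the subgradient claim, in this appendix's notation (f scores z with +u, g scores y with −u). The two-element structure set and the score tables `F` and `G` are hypothetical; the code evaluates L(u) by enumeration, tests the subgradient inequality at random points, and takes one step with an assumed step size α = 1.

```python
import random

STRUCTS = [(1, 0), (0, 1)]            # hypothetical indicator vectors (Y = Z)
F = {(1, 0): 2.0, (0, 1): 1.0}        # toy f(z)
G = {(1, 0): 0.5, (0, 1): 3.0}        # toy g(y)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dual(u):
    """Return L(u) and the subgradient z* - y* at u."""
    z = max(STRUCTS, key=lambda s: F[s] + dot(u, s))
    y = max(STRUCTS, key=lambda s: G[s] - dot(u, s))
    value = F[z] + dot(u, z) + G[y] - dot(u, y)
    return value, tuple(zi - yi for zi, yi in zip(z, y))

# Subgradient inequality L(v) >= L(u) + g_u . (v - u) at random u, v.
rng = random.Random(0)
violations = 0
for _ in range(1000):
    u = (rng.uniform(-3, 3), rng.uniform(-3, 3))
    v = (rng.uniform(-3, 3), rng.uniform(-3, 3))
    (lu, gu), (lv, _) = dual(u), dual(v)
    if lv < lu + dot(gu, tuple(a - b for a, b in zip(v, u))) - 1e-9:
        violations += 1

# One subgradient step from u = 0 with alpha = 1: u' = u - alpha * g_u.
value0, grad0 = dual((0.0, 0.0))
u1 = tuple(ui - 1.0 * gi for ui, gi in zip((0.0, 0.0), grad0))
value1, grad1 = dual(u1)
primal_best = max(F[s] + G[s] for s in STRUCTS)
```

On this toy instance the dual value drops from 5.0 to 4.0 after one step, the subgradient becomes zero (the two maximizers agree), and 4.0 equals the best joint score, so the bound is tight.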
Related Work
- Methods that use general purpose linear programming or integer linear programming solvers (Martins et al. 2009; Riedel and Clarke 2006; Roth and Yih 2005)
- Dual decomposition/Lagrangian relaxation in combinatorial optimization (Dantzig and Wolfe, 1960; Held and Karp, 1970; Fisher 1981)
- Dual decomposition for inference in MRFs (Komodakis et al., 2007; Wainwright et al., 2005)
- Methods that incorporate combinatorial solvers within loopy belief propagation (Duchi et al. 2007; Smith and Eisner 2008)