advanced dynamic programming in...
Post on 24-Jun-2020
32 Views
Preview:
TRANSCRIPT
Advanced Dynamic Programming in CL:
Theory, Algorithms, and Applications
Liang HuangUniversity of Pennsylvania
(S, 0, n)
w0 w1 ... wn-1
Liang Huang (Penn) Dynamic Programming
A Little Bit of History...
2
Liang Huang (Penn) Dynamic Programming
A Little Bit of History...• Who invented Dynamic Programming?
and when was it invented?
2
Liang Huang (Penn) Dynamic Programming
A Little Bit of History...• Who invented Dynamic Programming?
and when was it invented?
• R. Bellman (1940s-50s)
• A. Viterbi (1967)
• E. Dijkstra (1959)
• Hart, Nilsson, and Raphael (1968)
• Dijkstra => A* Algorithm
• D. Knuth (1977)
• Dijkstra on Grammar (Hypergraph)
2
Andrew Viterbi
Richard Bellman
Liang Huang (Penn) Dynamic Programming
A Little Bit of History...• Who invented Dynamic Programming?
and when was it invented?
• R. Bellman (1940s-50s)
• A. Viterbi (1967)
• E. Dijkstra (1959)
• Hart, Nilsson, and Raphael (1968)
• Dijkstra => A* Algorithm
• D. Knuth (1977)
• Dijkstra on Grammar (Hypergraph)
2
Andrew Viterbi
Richard BellmanA. Turing
Liang Huang (Penn) Dynamic Programming
Dynamic Programming• Dynamic Programming is everywhere in NLP
• Viterbi Algorithm for Hidden Markov Models
• CKY Algorithm for Parsing and Machine Translation
• Forward-Backward and Inside-Outside Algorithms
• Also everywhere in AI/ML
• Reinforcement Learning, Planning (POMDP)
• AI Search: Uniform-cost, A*, etc.
• This tutorial: a unified theoretical view of DP
• Focusing on Optimization Problems3
Liang Huang (Penn) Dynamic Programming
Review: DP Basics• DP = Divide-and-Conquer + Two Principles:
• [required] Optimal Subproblem Property
• [recommended] Sharing of Common Subproblems
• Structure of the Search Space
• Incremental
• Graph
• Knapsack, Edit Dist., Sequence Alignment
• Branching
• Hypergraph
• Matrix-Chain, Polygon Triangulation, Optimal BST4
Liang Huang (Penn) Dynamic Programming
Two Dimensional Survey
5
topological(acyclic)
best-first(superior)
graphs with semirings
hypergraphs with weight functions
Viterbi Dijkstra
Generalized Viterbi Knuth
traversing order
sear
ch s
pace
Liang Huang (Penn) Dynamic Programming
Graphs in NLP
6
part-of-speech tagging
lattice in speech
Liang Huang (Penn) Dynamic Programming
Semirings on Graphs• in a weighted graph, we need two operators:
• extension (multiplicative) and summary (additive)
• the weight of a path is the product of edge weights
• the weight of a vertex is the summary of path weights
7
s
ue1 v
t
...
...
e2e3
d(!1) =!
ei!!1
w(ei) = w(e1) ! w(e2) ! w(e3)
d(t) =!
!i
w(!i)
= w(p1) ! w(p2) ! · · ·
Liang Huang (Penn) Dynamic Programming
Semiring Definitions
8
A monoid is a triple (A,!, 1) where
1. ! is a closed associative binary operator on the set A,
2. 1 is the identity element for !, i.e., for all a " A, a ! 1 = 1 ! a = a.
A monoid is commutative if ! is commutative.
Liang Huang (Penn) Dynamic Programming
Semiring Definitions
8
A monoid is a triple (A,!, 1) where
1. ! is a closed associative binary operator on the set A,
2. 1 is the identity element for !, i.e., for all a " A, a ! 1 = 1 ! a = a.
A monoid is commutative if ! is commutative. ([0, 1], +, 0)([0, 1], ×, 1)
([0, 1], max, 0)
Liang Huang (Penn) Dynamic Programming
Semiring Definitions
8
A monoid is a triple (A,!, 1) where
1. ! is a closed associative binary operator on the set A,
2. 1 is the identity element for !, i.e., for all a " A, a ! 1 = 1 ! a = a.
A monoid is commutative if ! is commutative.
A semiring is a 5-tuple R = (A,!,", 0, 1) such that
1. (A,!, 0) is a commutative monoid.
2. (A,", 1) is a monoid.
3. " distributes over !: for all a, b, c in A,
(a ! b) " c = (a " c) ! (b " c),
c " (a ! b) = (c " a) ! (c " b).
4. 0 is an annihilator for ": for all a in A, 0 " a = a " 0 = 0.
([0, 1], +, 0)([0, 1], ×, 1)
([0, 1], max, 0)
Liang Huang (Penn) Dynamic Programming
Semiring Definitions
8
A monoid is a triple (A,!, 1) where
1. ! is a closed associative binary operator on the set A,
2. 1 is the identity element for !, i.e., for all a " A, a ! 1 = 1 ! a = a.
A monoid is commutative if ! is commutative.
A semiring is a 5-tuple R = (A,!,", 0, 1) such that
1. (A,!, 0) is a commutative monoid.
2. (A,", 1) is a monoid.
3. " distributes over !: for all a, b, c in A,
(a ! b) " c = (a " c) ! (b " c),
c " (a ! b) = (c " a) ! (c " b).
4. 0 is an annihilator for ": for all a in A, 0 " a = a " 0 = 0.
([0, 1], +, 0)([0, 1], ×, 1)
([0, 1], max, 0)
([0, 1], max, ×, 0, 1)([0, 1], +, ×, 0, 1)
Liang Huang (Penn) Dynamic Programming
Semiring Definitions
8
A monoid is a triple (A,!, 1) where
1. ! is a closed associative binary operator on the set A,
2. 1 is the identity element for !, i.e., for all a " A, a ! 1 = 1 ! a = a.
A monoid is commutative if ! is commutative.
A semiring is a 5-tuple R = (A,!,", 0, 1) such that
1. (A,!, 0) is a commutative monoid.
2. (A,", 1) is a monoid.
3. " distributes over !: for all a, b, c in A,
(a ! b) " c = (a " c) ! (b " c),
c " (a ! b) = (c " a) ! (c " b).
4. 0 is an annihilator for ": for all a in A, 0 " a = a " 0 = 0.
([0, 1], +, 0)([0, 1], ×, 1)
([0, 1], max, 0)
([0, 1], max, ×, 0, 1)([0, 1], +, ×, 0, 1)
Liang Huang (Penn) Dynamic Programming
Examples
9
Semiring Set ! " 0 1 intuition/applicationBoolean {0, 1} # $ 0 1 logical deduction, recognition
Viterbi [0, 1] max % 0 1 prob. of the best derivation
Inside R+ & {+'} + % 0 1 prob. of a string
Real R & {+'} min + +' 0 shortest-distance
Tropical R+ & {+'} min + +' 0 with non-negative weights
Counting N + % 0 1 number of paths
Liang Huang (Penn) Dynamic Programming
Ordering
10
Liang Huang (Penn) Dynamic Programming
Ordering
• idempotent
10
A semiring (A,!,", 0, 1) is idempotent if for all a in A, a ! a = a.
Liang Huang (Penn) Dynamic Programming
Ordering
• idempotent
• comparison
• examples: boolean, viterbi, tropical, real, ...
10
A semiring (A,!,", 0, 1) is idempotent if for all a in A, a ! a = a.
(a ! b) " (a # b = a) defines a partial ordering.
({0, 1},!,", 0, 1) (R+ # {+$},min,+,+$, 0)
([0, 1],max,%, 0, 1) (R # {+$},min,+,+$, 0)
Liang Huang (Penn) Dynamic Programming
Ordering
• idempotent
• comparison
• examples: boolean, viterbi, tropical, real, ...
• total-order for optimization problems
• examples: all of the above10
A semiring (A,!,", 0, 1) is idempotent if for all a in A, a ! a = a.
(a ! b) " (a # b = a) defines a partial ordering.
A semiring is totally-ordered if ! defines a total ordering.
({0, 1},!,", 0, 1) (R+ # {+$},min,+,+$, 0)
([0, 1],max,%, 0, 1) (R # {+$},min,+,+$, 0)
Liang Huang (Penn) Dynamic Programming
Monotonicity
11
Liang Huang (Penn) Dynamic Programming
Monotonicity• monotonicity
11
Liang Huang (Penn) Dynamic Programming
Monotonicity• monotonicity
11
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is monotonic if for all a, b, c $ A
(a # b) % (a " c # b " c) (a # b) % (c " a # c " b)
Liang Huang (Penn) Dynamic Programming
Monotonicity• monotonicity
• optimal substructure in dynamic programming
11
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is monotonic if for all a, b, c $ A
(a # b) % (a " c # b " c) (a # b) % (c " a # c " b)
Liang Huang (Penn) Dynamic Programming
Monotonicity• monotonicity
• optimal substructure in dynamic programming
11
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is monotonic if for all a, b, c $ A
(a # b) % (a " c # b " c) (a # b) % (c " a # c " b)
B: b C: c
A: b ⊗ c
Liang Huang (Penn) Dynamic Programming
Monotonicity• monotonicity
• optimal substructure in dynamic programming
11
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is monotonic if for all a, b, c $ A
(a # b) % (a " c # b " c) (a # b) % (c " a # c " b)
B: b C: c
A: b ⊗ c
B: b’ ≤ b C: c
A: b’ ⊗ c≤ b ⊗ c
Liang Huang (Penn) Dynamic Programming
Monotonicity• monotonicity
• optimal substructure in dynamic programming
• idempotent => monotone (from distributivity)
11
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is monotonic if for all a, b, c $ A
(a # b) % (a " c # b " c) (a # b) % (c " a # c " b)
B: b C: c
A: b ⊗ c
B: b’ ≤ b C: c
A: b’ ⊗ c≤ b ⊗ c
Liang Huang (Penn) Dynamic Programming
Monotonicity• monotonicity
• optimal substructure in dynamic programming
• idempotent => monotone (from distributivity)
• (a+b)⊗c = (a⊗c)+(b⊗c); if a≤b, (a⊗c)=(a⊗c)+(b⊗c)
11
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is monotonic if for all a, b, c $ A
(a # b) % (a " c # b " c) (a # b) % (c " a # c " b)
B: b C: c
A: b ⊗ c
B: b’ ≤ b C: c
A: b’ ⊗ c≤ b ⊗ c
Liang Huang (Penn) Dynamic Programming
Monotonicity• monotonicity
• optimal substructure in dynamic programming
• idempotent => monotone (from distributivity)
• (a+b)⊗c = (a⊗c)+(b⊗c); if a≤b, (a⊗c)=(a⊗c)+(b⊗c)
• by def. of comparison, a⊗c ≤ b⊗c11
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is monotonic if for all a, b, c $ A
(a # b) % (a " c # b " c) (a # b) % (c " a # c " b)
B: b C: c
A: b ⊗ c
B: b’ ≤ b C: c
A: b’ ⊗ c≤ b ⊗ c
Liang Huang (Penn) Dynamic Programming
DP on Graphs
• optimization problems on graphs=> generic shortest-path problem
• weighted directed graph G=(V, E) with a function w that assigns each edge a weight from a semiring
• compute the best weight of the target vertex t
• generic update along edge (u, v)
• how to avoid cyclic updates?
• only update when d(u) is fixed
12
vu w(u, v) d(v) ! = d(u) " w(u, v)
d(v) ! d(v) " (d(u) # w(u, v))
Liang Huang (Penn) Dynamic Programming
Two Dimensional Survey
13
topological(acyclic)
best-first(superior)
graphs with semirings
(e.g., FSMs)
hypergraphs with weight functions
(e.g., CFGs)
Viterbi Dijkstra
Generalized Viterbi Knuth
traversing order
sear
ch s
pace
Liang Huang (Penn) Dynamic Programming
Viterbi Algorithm for DAGs1. topological sort
2. visit each vertex v in sorted order and do updates
• for each incoming edge (u, v) in E
• use d(u) to update d(v):
• key observation: d(u) is fixed to optimal at this time
• time complexity: O( V + E )14
v
u w(u, v)
d(v) ! = d(u) " w(u, v)
Liang Huang (Penn) Dynamic Programming
Variant 1: forward-update1. topological sort
2. visit each vertex v in sorted order and do updates
• for each outgoing edge (v, u) in E
• use d(v) to update d(u):
• key observation: d(v) is fixed to optimal at this time
• time complexity: O( V + E )15
d(u) ! = d(v) " w(v, u)
v
uw(v, u)
Liang Huang (Penn) Dynamic Programming
Examples
16
Liang Huang (Penn) Dynamic Programming
Examples• [Number of Paths in a DAG]
16
Liang Huang (Penn) Dynamic Programming
Examples• [Number of Paths in a DAG]
• just use the counting semiring (N, +, ×, 0, 1)
• note: this is not an optimization problem!
16
Liang Huang (Penn) Dynamic Programming
Examples• [Number of Paths in a DAG]
• just use the counting semiring (N, +, ×, 0, 1)
• note: this is not an optimization problem!
• [Longest Path in a DAG]
16
Liang Huang (Penn) Dynamic Programming
Examples• [Number of Paths in a DAG]
• just use the counting semiring (N, +, ×, 0, 1)
• note: this is not an optimization problem!
• [Longest Path in a DAG]
• just use the semiring
16
(R ! {"#},max,+,"#, 0)
Liang Huang (Penn) Dynamic Programming
Examples• [Number of Paths in a DAG]
• just use the counting semiring (N, +, ×, 0, 1)
• note: this is not an optimization problem!
• [Longest Path in a DAG]
• just use the semiring
• [Part-of-Speech Tagging with a Hidden Markov Model]
16
(R ! {"#},max,+,"#, 0)
Liang Huang (Penn) Dynamic Programming
Examples• [Number of Paths in a DAG]
• just use the counting semiring (N, +, ×, 0, 1)
• note: this is not an optimization problem!
• [Longest Path in a DAG]
• just use the semiring
• [Part-of-Speech Tagging with a Hidden Markov Model]
16
(R ! {"#},max,+,"#, 0)
Liang Huang (Penn) Dynamic Programming
Example: Speech Alignment
17
time complexity: O(n2)
also used in:edit distance
biological sequence alignment
Liang Huang (Penn) Dynamic Programming
Example: Word Alignment
18
• key difference
• reorderings in translation!
• sequence/speech alignment is always monotonic
• complexity under HMM
• word alignment is O(n3)
• for every (i, j)
• enumerate all (i-1, k)
• sequence alignment O(n2)
I love you .
Je
t’
aime
.
ii-1
j
k
Liang Huang (Penn) Dynamic Programming
Chinese Word Segmentation
19
下 雨 天 地 面 积 水 xia yu tian di mian ji shui
Liang Huang (Penn) Dynamic Programming
Chinese Word Segmentation
19
民主min-zhu
people-dominate
“democracy”
下 雨 天 地 面 积 水 xia yu tian di mian ji shui
Liang Huang (Penn) Dynamic Programming
Chinese Word Segmentation
19
民主min-zhu
people-dominate
“democracy”
江泽民 主席jiang-ze-min zhu-xi
... - ... - people dominate-podium
“President Jiang Zemin”
下 雨 天 地 面 积 水 xia yu tian di mian ji shui
Liang Huang (Penn) Dynamic Programming
Chinese Word Segmentation
19
民主min-zhu
people-dominate
“democracy”
江泽民 主席jiang-ze-min zhu-xi
... - ... - people dominate-podium
“President Jiang Zemin”
this was 5 years ago.
now Google isgood at segmentation!
下 雨 天 地 面 积 水 xia yu tian di mian ji shui
Liang Huang (Penn) Dynamic Programming
Chinese Word Segmentation
19
民主min-zhu
people-dominate
“democracy”
江泽民 主席jiang-ze-min zhu-xi
... - ... - people dominate-podium
“President Jiang Zemin”
this was 5 years ago.
now Google isgood at segmentation!
下 雨 天 地 面 积 水 xia yu tian di mian ji shui
Liang Huang (Penn) Dynamic Programming
Chinese Word Segmentation
19
民主min-zhu
people-dominate
“democracy”
江泽民 主席jiang-ze-min zhu-xi
... - ... - people dominate-podium
“President Jiang Zemin”
this was 5 years ago.
now Google isgood at segmentation!
下 雨 天 地 面 积 水 xia yu tian di mian ji shui
graph search
Huang and Chiang Forest Rescoring
Phrase-based Decoding
20
yu Shalong juxing le huitan
与 沙龙 举行 了 会谈
held a talk with Sharon
yu Shalong juxing le huitanwith Sharon held a talk
talksSharon heldwith
Huang and Chiang Forest Rescoring
Phrase-based Decoding
20
yu Shalong juxing le huitan
与 沙龙 举行 了 会谈
held a talk with Sharon
yu Shalong juxing le huitanwith Sharon held a talk
talksSharon heldwith
_ _ _ _ _
Huang and Chiang Forest Rescoring
Phrase-based Decoding
20
yu Shalong juxing le huitan
与 沙龙 举行 了 会谈
held a talk with Sharon
yu Shalong juxing le huitanwith Sharon held a talk
talksSharon heldwith
_ _ _ _ _ _ _●●●
Huang and Chiang Forest Rescoring
Phrase-based Decoding
20
yu Shalong juxing le huitan
与 沙龙 举行 了 会谈
held a talk with Sharon
yu Shalong juxing le huitanwith Sharon held a talk
talksSharon heldwith
_ _ _ _ _ _ _●●● ●●●●●
Huang and Chiang Forest Rescoring
Phrase-based Decoding
21
yu Shalong juxing le huitan
与 沙龙 举行 了 会谈
held a talk with Sharon
yu Shalong juxing le huitanwith Sharon held a talk
talksSharon heldwith
_ _ _ _ _ _ _●●● ●●●●●
Huang and Chiang Forest Rescoring
Phrase-based Decoding
21
yu Shalong juxing le huitan
与 沙龙 举行 了 会谈
held a talk with Sharon
yu Shalong juxing le huitanwith Sharon held a talk
talksSharon heldwith
_ _ _ _ _ _ _●●● ●●●●●
●_●●●
Huang and Chiang Forest Rescoring
Phrase-based Decoding
22
yu Shalong juxing le huitan
与 沙龙 举行 了 会谈
held a talk with Sharon
_ _●●●held a talk held a talk with Sharon
_ _ _ _ _
...
...
...
●●●●●
...
_ _●●●held a talk
source-side: coverage vector
target-side: grow hypotheses strictly left-to-right
space: O(2n), time: O(2n n2) -- cf. traveling salesman problem
Huang and Chiang Forest Rescoring
Traveling Salesman Problem & MT
• a classical NP-hard problem
• goal: visit each city once and only once
• exponential-time dynamic programming
• state: cities visited so far (bit-vector)
• search in this O(2n) transformed graph
• MT: each city is a source-language word
• restrictions in reordering can reduce complexity => distortion limit
• => syntax-based MT
23(Held and Karp, 1962; Knight, 1999)
Huang and Chiang Forest Rescoring
Traveling Salesman Problem & MT
• a classical NP-hard problem
• goal: visit each city once and only once
• exponential-time dynamic programming
• state: cities visited so far (bit-vector)
• search in this O(2n) transformed graph
• MT: each city is a source-language word
• restrictions in reordering can reduce complexity => distortion limit
• => syntax-based MT
23(Held and Karp, 1962; Knight, 1999)
Huang and Chiang Forest Rescoring
Adding a Bigram Model• “refined” graph: annotated with language model words
• still dynamic programming, just larger search space
24
_ _●●● ... talk_ _ _ _ _ ●●●●● ... Sharon
_ _●●● ... talks
_ _●●● ... meeting ●●●●● ... Shalong
Huang and Chiang Forest Rescoring
Adding a Bigram Model• “refined” graph: annotated with language model words
• still dynamic programming, just larger search space
24
_ _●●● ... talk_ _ _ _ _ ●●●●● ... Sharon
_ _●●● ... talks
_ _●●● ... meeting ●●●●● ... Shalong
with Sharon
Huang and Chiang Forest Rescoring
Adding a Bigram Model• “refined” graph: annotated with language model words
• still dynamic programming, just larger search space
24
_ _●●● ... talk_ _ _ _ _ ●●●●● ... Sharon
_ _●●● ... talks
_ _●●● ... meeting ●●●●● ... Shalong
with Sharon
bigram
Huang and Chiang Forest Rescoring
Adding a Bigram Model• “refined” graph: annotated with language model words
• still dynamic programming, just larger search space
24
_ _●●● ... talk_ _ _ _ _ ●●●●● ... Sharon
_ _●●● ... talks
_ _●●● ... meeting ●●●●● ... Shalong
with Sharon
bigram
space: O(2n), time: O(2n n2) => space: O(2n Vm-1), time: O(2n Vm-1 n2)
for m-gram language models
Liang Huang (Penn) Dynamic Programming
Two Dimensional Survey
25
topological(acyclic)
best-first(superior)
graphs with semirings
(e.g., FSMs)
hypergraphs with weight functions
(e.g., CFGs)
Viterbi Dijkstra
Generalized Viterbi Knuth
traversing order
sear
ch s
pace
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm
26
d(u) d(u) ⊗ w(e)w(e)
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm• Dijkstra does not require acyclicity
• instead of topological order, we use best-first order
• but this requires superiority of the semiring
• intuition: combination always gets worse
26
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is superior if for all a, b $ A
a # a " b, b # a " b.
d(u) d(u) ⊗ w(e)w(e)
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm• Dijkstra does not require acyclicity
• instead of topological order, we use best-first order
• but this requires superiority of the semiring
• intuition: combination always gets worse
• contrast: monotonicity: combination preserves order
26
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is superior if for all a, b $ A
a # a " b, b # a " b.
d(u) d(u) ⊗ w(e)w(e)
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm• Dijkstra does not require acyclicity
• instead of topological order, we use best-first order
• but this requires superiority of the semiring
• intuition: combination always gets worse
• contrast: monotonicity: combination preserves order
26
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is superior if for all a, b $ A
a # a " b, b # a " b.
d(u) d(u) ⊗ w(e)w(e)
({0, 1},!,", 0, 1)([0, 1],max,#, 0, 1)
(R+ $ {+%},min,+,+%, 0)(R $ {+%},min,+,+%, 0)
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm• Dijkstra does not require acyclicity
• instead of topological order, we use best-first order
• but this requires superiority of the semiring
• intuition: combination always gets worse
• contrast: monotonicity: combination preserves order
26
Let K = (A,!,", 0, 1) be a semiring, and # a partial ordering over A.We say K is superior if for all a, b $ A
a # a " b, b # a " b.
d(u) d(u) ⊗ w(e)w(e)
({0, 1},!,", 0, 1)([0, 1],max,#, 0, 1)
(R+ $ {+%},min,+,+%, 0)(R $ {+%},min,+,+%, 0)
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
27
S V - S
s ...
d(u) ! = d(v) " w(v, u)
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
v
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
27
S V - S
vs ...
d(u) ! = d(v) " w(v, u)
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
27
uw(v, u)
S V - S
vs ...
d(u) ! = d(v) " w(v, u)
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
Liang Huang (Penn) Dynamic Programming
Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra only applicable to optimization problems
28
monotonic optimization problems
Liang Huang (Penn) Dynamic Programming
Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra only applicable to optimization problems
28
monotonic optimization problems
acyclic: Viterbi
Liang Huang (Penn) Dynamic Programming
Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra only applicable to optimization problems
28
monotonic optimization problems
acyclic: Viterbi
superior: Dijkstra
Liang Huang (Penn) Dynamic Programming
Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra only applicable to optimization problems
28
monotonic optimization problems
acyclic: Viterbi
superior: Dijkstra
many NLP
problems
Liang Huang (Penn) Dynamic Programming
Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra only applicable to optimization problems
28
monotonic optimization problems
acyclic: Viterbi
superior: Dijkstra
many NLP
problems
forward-backward(Inside semiring)
Liang Huang (Penn) Dynamic Programming
Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra only applicable to optimization problems
28
monotonic optimization problems
acyclic: Viterbi
superior: Dijkstra
many NLP
problems
forward-backward(Inside semiring) non-probabilistic
models
Liang Huang (Penn) Dynamic Programming
Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra only applicable to optimization problems
28
monotonic optimization problems
acyclic: Viterbi
superior: Dijkstra
many NLP
problems
forward-backward(Inside semiring) non-probabilistic
models
cyclic FSMs/grammars
Liang Huang (Penn) Dynamic Programming
What if both fail?
29
monotonic optimization problems
acyclic: Viterbi
superior: Dijkstra
many NLP
problems
generalized Bellman-Ford(CLR, 1990; Mohri, 2002)
or, first do strongly-connected components (SCC)which gives a DAG; use Viterbi globally on this SCC-DAG;
use Bellman-Ford locally within each SCC
Liang Huang (Penn) Dynamic Programming
What if both work?
30
monotonic optimization problems
acyclic: Viterbi
superior: Dijkstra
many NLP
problems
full Dijkstra is slower than ViterbiO((V + E) lgV) vs. O(V + E)
but it can finish as early as the target vertex is poppeda (V + E) lgV vs. V + E
Q: how to (magically) reduce a?
Liang Huang (Penn) Dynamic Programming
A* Search: Intuition
• Dijkstra is “blind” about how far the target is
• may get “trapped” by obstacles
• can we be more intelligent about the future?
• idea: prioritize by s-v distance + v-t estimate
31
s
v
ut
Liang Huang (Penn) Dynamic Programming
A* Search: Intuition
• Dijkstra is “blind” about how far the target is
• may get “trapped” by obstacles
• can we be more intelligent about the future?
• idea: prioritize by s-v distance + v-t estimate
31
s
v
ut
Liang Huang (Penn) Dynamic Programming
A* Search: Intuition
• Dijkstra is “blind” about how far the target is
• may get “trapped” by obstacles
• can we be more intelligent about the future?
• idea: prioritize by s-v distance + v-t estimate
31
s
v
ut
Liang Huang (Penn) Dynamic Programming
A* Heuristic
• h(v): the distance from v to target t
• ĥ(v) must be an optimistic estimate of h(v): ĥ(v)≤ h(v)
• Dijkstra is a special case where ĥ(v) = ī (0 for dist.)
• now, prioritize the queue by d(v) ⊗ ĥ(v)
• can stop when target gets popped -- why?
• optimal subpaths should pop earlier than non-optimal
• d(v) ⊗ ĥ(v) ≤ d(v) ⊗ h(v) ≤ d(t) ≤ non-optimal paths of t32
s v t
d(v) h(v)
ĥ(v)
Liang Huang (Penn) Dynamic Programming
How to design a heuristic?• more of an art than science
• basic idea: projection into coarser space
• cluster: w’(U, V) = min { w(u, v) | u ∈ U, v ∈ V }
• exact cost in coarser graph is estimate of finer graph
33 (Raphael, 2001)
Liang Huang (Penn) Dynamic Programming
How to design a heuristic?• more of an art than science
• basic idea: projection into coarser space
• cluster: w’(U, V) = min { w(u, v) | u ∈ U, v ∈ V }
• exact cost in coarser graph is estimate of finer graph
33
U V U V
(Raphael, 2001)
Liang Huang (Penn) Dynamic Programming
Viterbi or A*?• A* intuition: d(t) ⊗ ĥ(t) ranks higher among d(v) ⊗ ĥ(v)
• can finish early if lucky
• actually, d(t) ⊗ ĥ(t) = d(t) ⊗ h(t) = d(t) ⊗ ī = d(t)
• with the price of maintaining priority queue - O(log V)
• Q: how early? worth the price?
• if the rank is r, then A* is better when r/V log V < 1
34Dijkstra
d(v) pool
d(t)A*
d(v) ⊗ ĥ(v) pool
d(t) r
1
V
Liang Huang (Penn) Dynamic Programming
Viterbi or A*?• A* intuition: d(t) ⊗ ĥ(t) ranks higher among d(v) ⊗ ĥ(v)
• can finish early if lucky
• actually, d(t) ⊗ ĥ(t) = d(t) ⊗ h(t) = d(t) ⊗ ī = d(t)
• with the price of maintaining priority queue - O(log V)
• Q: how early? worth the price?
• if the rank is r, then A* is better when r/V log V < 1
34
r < V / log V
Dijkstra
d(v) pool
d(t)A*
d(v) ⊗ ĥ(v) pool
d(t) r
1
V
Liang Huang (Penn) Dynamic Programming
Two Dimensional Survey
35
topological(acyclic)
best-first(superior)
graphs with semirings
(e.g., FSMs)
hypergraphs with weight functions
(e.g., CFGs)
Viterbi Dijkstra
Generalized Viterbi Knuth
traversing order
sear
ch s
pace
Liang Huang (Penn) Dynamic Programming
Two Dimensional Survey
35
topological(acyclic)
best-first(superior)
graphs with semirings
(e.g., FSMs)
hypergraphs with weight functions
(e.g., CFGs)
Viterbi Dijkstra
Generalized Viterbi Knuth
traversing order
sear
ch s
pace
Liang Huang (Penn) Dynamic Programming
Background: CFG and Parsing
36
(S, 0, n)
w0 w1 ... wn-1
Liang Huang (Penn) Dynamic Programming
Background: CFG and Parsing
36
(S, 0, n)
w0 w1 ... wn-1
Liang Huang (Penn) Dynamic Programming
Background: CFG and Parsing
37
(S, 0, n)
w0 w1 ... wn-1
Liang Huang (Penn) Dynamic Programming
Background: CFG and Parsing
37
(S, 0, n)
w0 w1 ... wn-1
Liang Huang (Penn) Dynamic Programming
(Directed) Hypergraphs• a generalization of graphs
• edge => hyperedge: several vertices to one vertex
• e = (T(e), h(e), fe). arity |e| = |T(e)|
• a totally-ordered weight set R
• we borrow the ⊕ operator to be the comparison
• weight function fe : R|e| to R
• generalizes the ⊗ operator in semirings
38
vu1
u2
fe
tails
headd(v) ! = fe(d(u1), d(u2))
simple case: fe(a, b) = a ⊗ b ⊗ w(e)
Yi,j e
Zj,kXi,k
Liang Huang (Penn) Dynamic Programming
Hypergraphs and Deduction
39
(Nederhof, 2003)
: b
v
u1 u2
fe
: a
: a × b × Pr(A → B C)
(A, i, j)
(C, k, j)(B, i, k) (B, i, k) (C, k, j)
(A, i, j)A→B C
Liang Huang (Penn) Dynamic Programming
Hypergraphs and Deduction
39
(Nederhof, 2003)
: b
v
u1 u2
fe
: a
: a × b × Pr(A → B C)
(A, i, j)
(C, k, j)(B, i, k) (B, i, k) (C, k, j)
(A, i, j)A→B C
v
u1 u2tails
head
fe
: a
: fe (a,b) v
u1 u2
fe
: a : b
: fe (a,b)
antecedents
consequent
: b
Liang Huang (Penn) Dynamic Programming
Related Formalisms
40
v
u1 u2
e
v
u1 u2
e AND-node
OR-node
OR-nodes
Liang Huang (Penn) Dynamic Programming
Packed Forests• a compact representation of many parses
• by sharing common sub-derivations
• polynomial-space encoding of exponentially large set
41
(Klein and Manning, 2001; Huang and Chiang, 2005)
0 I 1 saw 2 him 3 with 4 a 5 mirror 6
Liang Huang (Penn) Dynamic Programming
Packed Forests• a compact representation of many parses
• by sharing common sub-derivations
• polynomial-space encoding of exponentially large set
41
(Klein and Manning, 2001; Huang and Chiang, 2005)
0 I 1 saw 2 him 3 with 4 a 5 mirror 6
nodes hyperedges
a hypergraph
Liang Huang (Penn) Dynamic Programming
Weight Functions and Semirings
42
v
u1
u2
tails head
uk
fe
...
fe(a1, ..., ak)
Liang Huang (Penn) Dynamic Programming
Weight Functions and Semirings
42
v
u1
u2
tails head
uk
fe
...
fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)special case
Liang Huang (Penn) Dynamic Programming
Weight Functions and Semirings
42
d(u) d(u) ⊗ w(e)w(e)
d(u) fe(d(u))fe
fe(a) = a ⊗ w(e)
v
u1
u2
tails head
uk
fe
...
fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)special case
Liang Huang (Penn) Dynamic Programming
Weight Functions and Semirings
42
d(u) d(u) ⊗ w(e)w(e)
d(u) fe(d(u))fe
fe(a) = a ⊗ w(e)
v
u1
u2
tails head
uk
fe
...
fe(a1, ..., ak)
semiring-composed
= a1 ⊗ ... ⊗ ak ⊗ w(e)special case
Liang Huang (Penn) Dynamic Programming
Weight Functions and Semirings
42
d(u) d(u) ⊗ w(e)w(e)
d(u) fe(d(u))fe
fe(a) = a ⊗ w(e)
v
u1
u2
tails head
uk
fe
...
fe(a1, ..., ak)
semiring-composed
can also extend monotonicity and superiority to general weight functions
= a1 ⊗ ... ⊗ ak ⊗ w(e)special case
Liang Huang (Penn) Dynamic Programming
Generalizing Semiring Properties• monotonicity
• semiring: a ≤ b => a x c ≤ b x c
• for all weight function f, for all a1... ak, for all i, if a’i ≤ ai then f(a1... a’i ... ak) ≤ f(a1... ai ... ak)
• superiority
• semiring: a ≤ a x b, b ≤ a x b
• for all f, for all a1... ak, for all i, ai ≤ f(a1, ..., ak)
• acyclicity
• degenerate a hypergraph back into a graph43
Liang Huang (Penn) Dynamic Programming
Two Dimensional Survey
44
topological(acyclic)
best-first(superior)
graphs with semirings
(e.g., FSMs)
hypergraphs with weight functions
(e.g., CFGs)
Viterbi Dijkstra
Generalized Viterbi Knuth
traversing order
sear
ch s
pace
Liang Huang (Penn) Dynamic Programming
Viterbi Algorithm for DAGs1. topological sort
2. visit each vertex v in sorted order and do updates
• for each incoming edge (u, v) in E
• use d(u) to update d(v):
• key observation: d(u) is fixed to optimal at this time
• time complexity: O( V + E )45
v
u w(u, v)
d(v) ! = d(u) " w(u, v)
Liang Huang (Penn) Dynamic Programming
Viterbi Algorithm for DAHs1. topological sort
2. visit each vertex v in sorted order and do updates
• for each incoming hyperedge e = ((u1, .., u|e|), v, fe)
• use d(ui)’s to update d(v)
• key observation: d(ui)’s are fixed to optimal at this time
• time complexity: O( V + E ) (assuming constant arity)46
vu1
u2
fed(v) ! = fe(d(u1), · · · , d(u|e|))
Liang Huang (Penn) Dynamic Programming
Example: CKY Parsing• parsing with CFGs in Chomsky Normal Form (CNF)
• typical instance of the generalized Viterbi for DAHs
• many variants of CKY ~ various topological ordering
47
O(n3|P|)
(S, 0, n) (S, 0, n)
Liang Huang (Penn) Dynamic Programming
Example: CKY Parsing• parsing with CFGs in Chomsky Normal Form (CNF)
• typical instance of the generalized Viterbi for DAHs
• many variants of CKY ~ various topological ordering
47
O(n3|P|)
bottom-up
(S, 0, n) (S, 0, n)
Liang Huang (Penn) Dynamic Programming
Example: CKY Parsing• parsing with CFGs in Chomsky Normal Form (CNF)
• typical instance of the generalized Viterbi for DAHs
• many variants of CKY ~ various topological ordering
47
O(n3|P|)
bottom-up left-to-right
(S, 0, n) (S, 0, n)
Liang Huang (Penn) Dynamic Programming
Example: CKY Parsing• parsing with CFGs in Chomsky Normal Form (CNF)
• typical instance of the generalized Viterbi for DAHs
• many variants of CKY ~ various topological ordering
48
O(n3|P|)
bottom-up left-to-right
(S, 0, n) (S, 0, n) (S, 0, n)
Liang Huang (Penn) Dynamic Programming
Example: CKY Parsing• parsing with CFGs in Chomsky Normal Form (CNF)
• typical instance of the generalized Viterbi for DAHs
• many variants of CKY ~ various topological ordering
48
O(n3|P|)
bottom-up left-to-right right-to-left
(S, 0, n) (S, 0, n) (S, 0, n)
Liang Huang (Penn) Dynamic Programming
Example: Syntax-based MT
49
• synchronous context-free grammars (SCFGs)
• context-free grammar in two dimensions
• generating pairs of strings/trees simultaneously
• co-indexed nonterminal further rewritten as a unit
VP
PP
yu Shalong
VP
juxing le huitan
VP
VP
held a meeting
PP
with Sharon
VP ! PP(1) VP(2), VP(2) PP(1)
VP ! juxing le huitan, held a meetingPP ! yu Shalong, with Sharon
Liang Huang (Penn) Dynamic Programming
Translation as Parsing
50
• translation with SCFGs => monolingual parsing
• parse the source input with the source projection
• build the corresponding target sub-strings in parallel
PP1, 3 VP3, 6
VP1, 6
yu Shalong juxing le huitan
VP ! PP(1) VP(2), VP(2) PP(1)
VP ! juxing le huitan, held a meetingPP ! yu Shalong, with Sharon
Liang Huang (Penn) Dynamic Programming
Translation as Parsing
50
• translation with SCFGs => monolingual parsing
• parse the source input with the source projection
• build the corresponding target sub-strings in parallel
PP1, 3 VP3, 6
VP1, 6
yu Shalong juxing le huitan
VP ! PP(1) VP(2), VP(2) PP(1)
VP ! juxing le huitan, held a meetingPP ! yu Shalong, with Sharon
Liang Huang (Penn) Dynamic Programming
Translation as Parsing
50
• translation with SCFGs => monolingual parsing
• parse the source input with the source projection
• build the corresponding target sub-strings in parallel
PP1, 3 VP3, 6
VP1, 6
yu Shalong juxing le huitan
with Sharon held a talk
held a talk with Sharon
VP ! PP(1) VP(2), VP(2) PP(1)
VP ! juxing le huitan, held a meetingPP ! yu Shalong, with Sharon
Liang Huang (Penn) Dynamic Programming
Translation as Parsing
50
• translation with SCFGs => monolingual parsing
• parse the source input with the source projection
• build the corresponding target sub-strings in parallel
PP1, 3 VP3, 6
VP1, 6
yu Shalong juxing le huitan
with Sharon held a talk
held a talk with Sharon
VP ! PP(1) VP(2), VP(2) PP(1)
VP ! juxing le huitan, held a meetingPP ! yu Shalong, with Sharon
complexity: same as CKY parsing -- O(n3)
Liang Huang (Penn) Dynamic Programming
Adding a Bigram Model
51
PP1, 3 VP3, 6
VP1, 6
_ _●●● ... talk_ _ _ _ _ ●●●●● ... Sharon
_ _●●● ... talks
_ _●●● ... meeting ●●●●● ... Shalong
with ... Sharon
along ... Sharonwith ... Shalong
held ... talkheld ... meeting
hold ... talks
with Sharon
bigram
held ... talk
VP3, 6
with ... Sharon
PP1, 3
bigram
Liang Huang (Penn) Dynamic Programming
Adding a Bigram Model
51
PP1, 3 VP3, 6
VP1, 6
_ _●●● ... talk_ _ _ _ _ ●●●●● ... Sharon
_ _●●● ... talks
_ _●●● ... meeting ●●●●● ... Shalong
with ... Sharon
along ... Sharonwith ... Shalong
held ... talkheld ... meeting
hold ... talks
with Sharon
bigram
held ... talk
VP3, 6
with ... Sharon
PP1, 3
bigram
held ... Sharon
VP1, 6
Liang Huang (Penn) Dynamic Programming
Adding a Bigram Model
51
PP1, 3 VP3, 6
VP1, 6
_ _●●● ... talk_ _ _ _ _ ●●●●● ... Sharon
_ _●●● ... talks
_ _●●● ... meeting ●●●●● ... Shalong
with ... Sharon
along ... Sharonwith ... Shalong
held ... talkheld ... meeting
hold ... talks
with Sharon
bigram
complexity: O(n3 V4(m-1) )
held ... talk
VP3, 6
with ... Sharon
PP1, 3
bigram
held ... Sharon
VP1, 6
Liang Huang (Penn) Dynamic Programming
Two Dimensional Survey
52
topological(acyclic)
best-first(superior)
graphs with semirings
(e.g., FSMs)
hypergraphs with weight functions
(e.g., CFGs)
Viterbi Dijkstra
Generalized Viterbi Knuth
traversing order
sear
ch s
pace
Liang Huang (Penn) Dynamic Programming
Viterbi Algorithm for DAHs1. topological sort
2. visit each vertex v in sorted order and do updates
• for each incoming hyperedge e = ((u1, .., u|e|), v, fe)
• use d(ui)’s to update d(v)
• key observation: d(ui)’s are fixed to optimal at this time
• time complexity: O( V + E ) (assuming constant arity)53
vu1
u2
fed(v) ! = fe(d(u1), · · · , d(u|e|))
Liang Huang (Penn) Dynamic Programming
Forward Variant for DAHs1. topological sort
2. visit each vertex v in sorted order and do updates
• for each outgoing hyperedge e = ((u1, .., u|e|), h(e), fe)
• if d(ui)’s have all been fixed to optimal
• use d(ui)’s to update d(h(e))
• time complexity: O( V + E )54
v = ui
h(e)u1
v
fe
u2 =
h(e)fe
Liang Huang (Penn) Dynamic Programming
Forward Variant for DAHs1. topological sort
2. visit each vertex v in sorted order and do updates
• for each outgoing hyperedge e = ((u1, .., u|e|), h(e), fe)
• if d(ui)’s have all been fixed to optimal
• use d(ui)’s to update d(h(e))
• time complexity: O( V + E )54
v = ui
h(e)u1
v
fe
u2 =
h(e)fe
Liang Huang (Penn) Dynamic Programming
Forward Variant for DAHs1. topological sort
2. visit each vertex v in sorted order and do updates
• for each outgoing hyperedge e = ((u1, .., u|e|), h(e), fe)
• if d(ui)’s have all been fixed to optimal
• use d(ui)’s to update d(h(e))
• time complexity: O( V + E )54
v = ui
h(e)u1
v
fe
u2 =
Q: how to avoid repeated checking?maintain a counter r[e] for each e: how many tails yet to be fixed?fire this hyperedge only if r[e]=0
h(e)fe
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
55
S V - S
s ...
d(u) ! = d(v) " w(v, u)
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
v
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
55
S V - S
vs ...
d(u) ! = d(v) " w(v, u)
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
Liang Huang (Penn) Dynamic Programming
Dijkstra Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
55
uw(v, u)
S V - S
vs ...
d(u) ! = d(v) " w(v, u)
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
Liang Huang (Penn) Dynamic Programming
Knuth (1977) Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
56
S V - S
vs ...
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
u1
v
Liang Huang (Penn) Dynamic Programming
Knuth (1977) Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
56
S V - S
vs ...
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
u1
v
Liang Huang (Penn) Dynamic Programming
Knuth (1977) Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
56
S V - S
vs ...
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
u1
vh(e)
fe
Liang Huang (Penn) Dynamic Programming
Knuth (1977) Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others
56
S V - S
vs ...
time complexity:O((V+E) lgV) (binary heap)
O(V lgV + E) (fib. heap)
u1
vh(e)
fe
Liang Huang (Penn) Dynamic Programming
Example: Best-First/A* Parsing
• Knuth for parsing: best-first (Caraballo & Charniak, 1998)
• further speed-up: use A* heuristics
• showed significant speed up with carefully designed heuristic functions (Klein and Manning, 2003)
• heuristic function: an estimate of outside cost
57
[open problem] can you still define heuristic function if weight functions are not semiring-composed?
(S, 0, n)
Liang Huang (Penn) Dynamic Programming
Example: Best-First/A* Parsing
• Knuth for parsing: best-first (Caraballo & Charniak, 1998)
• further speed-up: use A* heuristics
• showed significant speed up with carefully designed heuristic functions (Klein and Manning, 2003)
• heuristic function: an estimate of outside cost
57
[open problem] can you still define heuristic function if weight functions are not semiring-composed?
(S, 0, n)
Liang Huang (Penn) Dynamic Programming
Outside Cost in Hypergraph• outside cost: yet to pay to reach goal
• let’s only consider semiring-composed case
• and only acyclic hypergraphs
• after computing d(v) for all v from bottom-up
• backwards Viterbi from top-down (outside-in)
58
e
...
...h(S0,n) = īh(v) ⊕= h(u)⊗w(e)⊗d(v’)
v v’
u
d(v)
s
v
t
d(v)
h(v)
S0,n
h(v)
d(v)
Liang Huang (Penn) Dynamic Programming
Outside Cost in Hypergraph• outside cost: yet to pay to reach goal
• let’s only consider semiring-composed case
• and only acyclic hypergraphs
• after computing d(v) for all v from bottom-up
• backwards Viterbi from top-down (outside-in)
58
e
...
...h(S0,n) = īh(v) ⊕= h(u)⊗w(e)⊗d(v’)
v v’
u
d(v)
Q: d(v)⊗h(v) = ?
s
v
t
d(v)
h(v)
S0,n
h(v)
d(v)
Liang Huang (Penn) Dynamic Programming
Projection-based Heuristics
• how to guess? project onto a coarser-grained space
• and parse with the coarser grammar
• outside cost of of the coarser item as heuristics
59
(Klein and Manning, 2003)
Liang Huang (Penn) Dynamic Programming
Projection-based Heuristics
• how to guess? project onto a coarser-grained space
• and parse with the coarser grammar
• outside cost of of the coarser item as heuristics
59
(Klein and Manning, 2003)
Liang Huang (Penn) Dynamic Programming
Projection-based Heuristics
• how to guess? project onto a coarser-grained space
• and parse with the coarser grammar
• outside cost of of the coarser item as heuristics
60
(Klein and Manning, 2003)
Liang Huang (Penn) Dynamic Programming
Projection-based Heuristics
• how to guess? project onto a coarser-grained space
• and parse with the coarser grammar
• outside cost of of the coarser item as heuristics
61
(Klein and Manning, 2003)
Liang Huang (Penn) Dynamic Programming
Projection-based Heuristics
• how to guess? project onto a coarser-grained space
• and parse with the coarser grammar
• outside cost of of the coarser item as heuristics
61
(Klein and Manning, 2003)
Liang Huang (Penn) Dynamic Programming
Projection-based Heuristics
• how to guess? project onto a coarser-grained space
• and parse with the coarser grammar
• outside cost of of the coarser item as heuristics
61
(Klein and Manning, 2003)ĥ (VBD2,3) = h’ (V2,3)
Liang Huang (Penn) Dynamic Programming
Analogy with Graphs
62
Liang Huang (Penn) Dynamic Programming
Analogy with Graphs
62
Liang Huang (Penn) Dynamic Programming
More on Coarse-to-Fine• multilevel coarse-to-fine A*
• heuristic = exact outside cost in previous stage
• ĥi (v) = hi-1 (proj i-1(v))
• VBD>V>X. ĥi (VBD1,5) = hi-1 (V1,5); ĥi-1 (V1,5) = hi-2 (X1,5)
• multilevel coarse-to-fine Viterbi w/ beam-search
• Viterbi + beam pruning in each stage
• prune according to merit: d(v)⊗h(v) ⊘ d(TOP)
• hard to derive a provably correct threshold
• in practice: use a preset threshold (but works well!)63
Liang Huang (Penn) Dynamic Programming
More on Coarse-to-Fine• multilevel coarse-to-fine A*
• heuristic = exact outside cost in previous stage
• ĥi (v) = hi-1 (proj i-1(v))
• VBD>V>X. ĥi (VBD1,5) = hi-1 (V1,5); ĥi-1 (V1,5) = hi-2 (X1,5)
• multilevel coarse-to-fine Viterbi w/ beam-search
• Viterbi + beam pruning in each stage
• prune according to merit: d(v)⊗h(v) ⊘ d(TOP)
• hard to derive a provably correct threshold
• in practice: use a preset threshold (but works well!)63
Liang Huang (Penn) Dynamic Programming
More on Coarse-to-Fine• multilevel coarse-to-fine A*
• heuristic = exact outside cost in previous stage
• ĥi (v) = hi-1 (proj i-1(v))
• VBD>V>X. ĥi (VBD1,5) = hi-1 (V1,5); ĥi-1 (V1,5) = hi-2 (X1,5)
• multilevel coarse-to-fine Viterbi w/ beam-search
• Viterbi + beam pruning in each stage
• prune according to merit: d(v)⊗h(v) ⊘ d(TOP)
• hard to derive a provably correct threshold
• in practice: use a preset threshold (but works well!)63
Liang Huang (Penn) Dynamic Programming
Same Picture Again
64
monotonic optimization problems
acyclic: Viterbi
superior: Knuth
many NLP
problems
Liang Huang (Penn) Dynamic Programming
Same Picture Again
64
monotonic optimization problems
acyclic: Viterbi
superior: Knuth
many NLP
problems
PCFG parsing with CNF
Liang Huang (Penn) Dynamic Programming
Same Picture Again
64
monotonic optimization problems
acyclic: Viterbi
superior: Knuth
many NLP
problems
Inside-Outside Alg.(Inside semiring)
PCFG parsing with CNF
Liang Huang (Penn) Dynamic Programming
Same Picture Again
64
monotonic optimization problems
acyclic: Viterbi
superior: Knuth
many NLP
problems
Inside-Outside Alg.(Inside semiring) non-prob.
(discriminative) parsing
PCFG parsing with CNF
Liang Huang (Penn) Dynamic Programming
Same Picture Again
64
monotonic optimization problems
acyclic: Viterbi
superior: Knuth
many NLP
problems
Inside-Outside Alg.(Inside semiring) non-prob.
(discriminative) parsing
cyclic grammars
PCFG parsing with CNF
Liang Huang (Penn) Dynamic Programming
Same Picture Again
64
monotonic optimization problems
acyclic: Viterbi
superior: Knuth
many NLP
problems
Inside-Outside Alg.(Inside semiring) non-prob.
(discriminative) parsing
cyclic grammars
PCFG parsing with CNF
generalized generalized
Bellman-Ford (open)
Liang Huang (Penn) Dynamic Programming
Take Home Message
• Dynamic Programming is cool, easy, and universal!
• two frameworks and two types of algorithms
• monotonicity; acyclicity and/or superiority
• topological (Viterbi) vs. best-first style (Dijkstra/Knuth/A*)
• when to choose which: A* can finish early if lucky
• graph (lattice) vs. hypergraph (forest)
• incremental, finite-state vs. branching, context-free
• covered many typical NLP applications
• a better understanding of theory helps in practice65
THE END - Thanks!Thanks!
66final slides will be available on my website.
(S, 0, n)
w0 w1 ... wn-1
Questions?Comments?
top related