two algorithms for finding k shortest paths of a weighted ... · weighted pushdown automata (wpdas)...

Two Algorithms for Finding k Shortest Paths of a WeightedPushdown Automaton

WU KeDepartment of Computer Scienceand CLIP Lab at the Institute for

Advanced Computer StudiesUniversity of Maryland

College Park, MD 20742, USA

Philip RESNIKDepartment of Linguistics

and CLIP Lab at the Institute forAdvanced Computer Studies

University of MarylandCollege Park, MD 20742, USA

February 5, 2013

1 Introduction

Weighted pushdown automata (WPDAs) have recently been adopted in some applications such as machinetranslation [Iglesias et al., 2011] as a more compact alternative to weighted finite-state automata (WFSAs)for representing a weighted set of strings. Allauzen and Riley [2012] introduce a set of basic algorithms forconstruction and inference of WPDAs, and the corresponding implementation as an extension of the opensource finite-state transducer toolkit OpenFst.1

Although a shortest-path algorithm for WPDAs with bounded stack is described in Allauzen and Riley[2012], it does not give a k-shortest-path algorithm, which finds the k shortest accepting paths of the givenautomaton. Other than just the single shortest path, k shortest paths are useful for many purposes suchas reranking the output in parsing [Collins and Koo, 2005] or tuning feature weights in machine translation[Chiang et al., 2009]. One existing work-around is to first expand the WPDA into an equivalent WFSA and thenfind the k shortest paths of the WFSA using the k-shortest-path algorithm for WFSAs (the expansion approach).Since the WPDA expansion has an exponential time and space complexity with respect to the size of theautomaton, one usually has to prune the WPDA before expansion (the pruned expansion approach), i.e. removethose transitions and states that are not on any accepting path with a weight at most a given threshold greaterthan the shortest distance. However, setting an adequate threshold that neither prunes nor keeps too manystates or transitions a priori is almost impossible in practice.

In this paper, we introduce two efficient algorithms for finding the k shortest paths of a WPDA, bothderived from the same weighted deductive logic description of the execution of a WPDA using different searchstrategies.

2 Weighted pushdown automata

2.1 Formal definitions

Following Allauzen and Riley [2012], we represent a WPDA as directed graph with labeled and weighted arcs(transitions).

Definition 1. A WPDA M over a semiring ⟨K,⊕,⊗, 0, 1⟩ is a tuple ⟨Σ, Π, Π, Q, E, s, f ⟩, where

1http://www.openfst.org/twiki/bin/view/FST/FstExtensions

1

..q1..

q2

. q3.

q4

.

(:1

.b:1

.

a:1

.

):1

.

b:1

Figure 1: A WPDA of {anbn|n > 0}

• Σ, Π and Π are disjoint finite sets of symbols;• Σ is the alphabet of input symbols;• Π and Π are the alphabets of respectively opening and closing parentheses; there exists a bijection between them

that pairs the parentheses; for any a ∈ Π ∪ Π, we represent its counterpart in the other alphabet as a;• Q is a finite set of states; s ∈ Q is the start state and f ∈ Q is the final state;• E ⊆ Q × (Σ ∪ Π ∪ Π ∪ {ϵ}) ×K× Q is a finite set of transitions; e = ⟨p[e], i[e], w[e], n[e]⟩ ∈ E denotes a

transition from state p[e] to state n[e] with label i[e] and weight w[e], where w[e] = 0.

A path π is a sequence of transitions π = e1e2 . . . em, such that n[ei] = p[ei+1] for all 1 ≤ i < m. p[·], i[·],w[·] and n[·] can all be generalized to paths. For a given path π = e1e2 . . . em, define p[π] = p[e1], n[π] = n[em],i[π] = i[e1]i[e2] . . . i[em], and w[π] = w[e1]⊗ w[e2]⊗ . . .⊗ w[em]. Unlike a WFSA, not all paths from s to f in aWPDA are accepting paths. For a set of symbols S, let cS[π] be the substring of i[π] consisting of all and onlythe symbols from set S. For example, cΠ∪Π[π] is the substring of i[π] consisting of all and only the openingand closing parentheses. Then,

Definition 2. The Dyck language on finite parenthesis alphabets Π and Π consists of strings of balanced parentheses. Apath π is balanced if cΠ∪Π[π] belongs to the Dyck language on Π and Π.

For example, when Π = {‘(’, ‘[’} and Π = {‘)’, ‘]’} with normal pairing by appearance, strings such as (),([()])[] are members of the Dyck language while ( or (][) are not.

Finally,

Definition 3. A path π is an accepting path if and only if p[π] = s, n[π] = f and π is balanced.

This representation of WPDAs is slightly different from the classical representation of PDAs, where astack alphabet is defined with optional push or pop operations at each transition. Here the stack alphabetis essentially Π and Π, paired by the bijection between them. Whenever a symbol from Π is consumed, itis equivalent to pushing the particular symbol onto the stack in the classical representation; and whenever asymbol from Π is consumed, it is equivalent to popping a symbol off the stack and checking if the symbol is itscounterpart from Π. As discussed in Allauzen and Riley [2012], such representation leads to easy adaptationof some WFSA algorithms for similar purposes on a WPDA.

Following Allauzen and Riley [2012], we limit our effort in finding k shortest paths to WPDAs with abounded stack in both pushing and popping.2

Definition 4. A WPDA has a bounded stack if there exists an integer K such that for any path π, the number ofunmatched parenthesis in cΠ[π] is no greater than K.

Although this rules out all WPDAs with recursion, the ones found in applications that need to find thek shortest paths usually do not have recursion [Iglesias et al., 2011]. Thus an algorithm that only works onWPDAs with a bounded stack is already very useful.

Allauzen and Riley [2012] give a general algorithm for converting a context free grammar into an equivalentWPDA. Figure 1 is an example WPDA representing the classical context free language {anbn|n > 0} constructed

2This definition is slightly different from Allauzen and Riley [2012], which only bounds pushing.

2

..s1.. s2. s3. s4. s5. a:1. a:1. b:1. b:1

Figure 2: Input string “aabb” encoded as a WFSA

..⟨q1, s1⟩..

⟨q2, s1⟩

.

⟨q1, s2⟩

.

⟨q2, s2⟩

.

⟨q1, s3⟩

.

⟨q3, s4⟩

.

⟨q4, s4⟩

.

⟨q3, s5⟩

. ⟨q4, s5⟩.

(:1

.

a:1

.

(:1

.

a:1

.

b:1

.

):1

.

b:1

.

):1

Figure 3: The result of intersection

following their algorithm. It is easy to see this WPDA does not have a bounded stack. However, consideringthis as a “grammar”, one can then “parse” strings with the grammar by encoding the input as a WFSA andintersecting the WPDA with it. For example, Figure 3 is the result of intersecting Figure 2 with Figure 1, whichnow has a bounded stack.

2.2 Automata execution as weighted deduction

A deductive logic defines a space of weighted items, some of which are axioms or goals (items to prove), anda set of inference rules of the form,

A1 : w1 A2 : w2 . . . Am : wm ϕB : g(w1, w2, . . . , wm)

which means if items A1, A2, . . . , Am are provable respectively with weights w1, w2, . . . , wm, then item B is alsoprovable with weight g(w1, w2, . . . , wm) given the side condition ϕ is satisfied. We also call B proved this wayan instantiation of B with weight g(w1, w2, . . . , wm). This style of system has been commonly used to expressparsing strategies since Shieber et al. [1995].

The execution of a WPDA M can be described using the following weighted deductive logic LM.• The items are of the form q1 ; q2, where q1, q2 ∈ Q. An instantiation q1 ; q2 : u for some u ∈ K

intuitively means there is a balanced path from q1 to q2 with weight u.• Axioms are

q = s or there exists e ∈ E such that n[e] = q and i[e] ∈ Πq ; q : 1

Furthermore, we call any state q an entering state if q ; q : 1 is an axiom .• There are two inference rules,

1. Scanq ; p[e] : u

e ∈ E such that i[e] ∈ Σ ∪ {ϵ}q ; n[e] : u⊗ w[e]

2. Complete

q ; p[e1] : u1 n[e1] ; p[e2] : u2e1, e2 ∈ E such that i[e1] ∈ Π, i[e2] ∈ Π, i[e1] = i[e2]q ; n[e2] : u1 ⊗ w[e1]⊗ u2 ⊗ w[e2]

3

q1 ; q1 : 1

q2 ; q2 : 1q2

a−→ q1 : 1 (1)q2 ; q1 : 1

q2 ; q2 : 1q2

a−→ q1 : 1 (2)q2 ; q1 : 1

q1b−→ q3 : 1 (3)

q2 ; q3 : 1q1

(−→ q2 : 1 (4), q3)−→ q4 : 1 (5)q2 ; q4 : 1

q4b−→ q3 : 1 (6)

q2 ; q3 : 1q1

(−→ q2 : 1 (7), q3)−→ q4 : 1 (8)q1 ; q4 : 1

Figure 4: A proof of the accepting path of “aabb”

• The only goal item iss ; f

Any valid proof forms a tree that induces a path. The induced path can be obtained by reading off thetransitions in side conditions through a left-to-right post-order traversal of the proof tree. Take the WPDA inFigure 1 for example; Figure 4 is a proof of the accepting path of the string “aabb”. The accepting path is thus

(7)(1)(4)(2)(3)(5)(6)(8), i.e. q1(−→ q2

a−→ q1(−→ q2

a−→ q1b−→ q3

)−→ q4b−→ q3

)−→ q4.One can easily prove the following by induction for any WPDA M (see the appendix),3

Theorem 1 (Soundness). Any valid proof of an instantiation q1 ; q2 : u in LM induces a balanced path from q1 to q2with weight u in M.

Theorem 2 (Completeness). Any balanced path from an entering state q1 to some state q2 with weight u in M has avalid proof of an instantiation q1 ; q2 : u in LM whose induced path is that path.

Theorem 3 (In-ambiguity). Any balanced path from an entering state in M has a unique proof in LM.4

The three properties together essentially state that there is a one-to-one correspondence between proofs ofgoal items in LM and accepting paths in M.

2.3 The k-shortest-path problem

The k-shortest-path problem on a WPDA M with a bounded stack is to find k accepting paths from M with thesmallest weights with respect to the natural ordering of M’s weight semiring K.5

The natural ordering ≤⊆ K×K is defined as

Definition 5. For any a, b ∈ K, a ≤ b if and only if a⊕ b = a.

For the problem to be well-defined, the natural ordering also has to be total, which is equivalent to requiringthe ⊕ operator to have the following path property: for any a, b ∈ K, a⊕ b = a or a⊕ b = b. An example meetingthese conditions is the tropical semiring ⟨R∪ {∞}, min,+, ∞, 0⟩, one of the most commonly used as weights inparsing and machine translation. Its natural ordering is simply the ordering of real numbers and infinity.

3 Computing the Shortest Distance

One of the benefits of the above weighted deduction representation is that many properties can be computedby carrying out the deductions in a uniform style. As a starting point, we are interested in finding the smallest-weight instantiation of some item q1 ; q2. For reasons which will become clear later, we call the weight of that

3Note especially that a bounded stack is not required.4Up to the tree structure with side conditions.5In the rest of this paper, we always assume the WPDA M has a bounded stack.

4

instantiation the inside weight of q1 ; q2. Let R be the set of all instantiations of provable items. Because of thepath property, computing the inside weight of q1 ; q2 is equivalent to computing

α(q1 ; q2) =⊕

{u|q1;q2 :u∈R}u

The sum can be further grouped by the last step taken in a proof of q1 ; q2 : u. Define A(q1 ; q2) to bethe following,

A(q1 ; q2) =

{1 q1 ; q2 is an axiom0 otherwise

Define Sq1;q2 ⊆ E be the set of “last steps taken” to prove q1 ; q2 with a Scan, i.e. e is in Sq1;q2 if and onlyif some instantiation q1 ; p[e] : u with e as the side condition can prove q1 ; q2 with the Scan rule. Similarly,define Cq1;q2 ⊆ E× E be the set of “last steps taken” to prove q1 ; q2 with a Complete, i.e. ⟨e1, e2⟩ is in Cq1;q2

if and only if some instantiations q1 ; p[e1] : u1 and n[e1] ; p[e2] : u2 can prove q1 ; q2 with the Completerule. Then, α(q1 ; q2) can be rewritten as

α(q1 ; q2) =A(q1 ; q2)⊕

⊕e∈Sq1;q2

⊕{u|q1;p[e]:u∈R}

u⊗ w[e]

⊕ ⊕⟨e1,e2⟩∈Cq1;q2

⊕{u1|q1;p[e1]:u1∈R}

⊕{u2|n[e1];p[e2]:u2∈R}

u1 ⊗ w[e1]⊗ u2 ⊗ w[e2]

=A(q1 ; q2)⊕

⊕e∈Sq1;q2

α(q1 ; p[e])⊗ w[e]

⊕ ⊕⟨e1,e2⟩∈Cq1;q2

α(q1 ; p[e1])⊗ w[e1]⊗ α(n[e1] ; p[e2])⊗ w[e2]

This recursive formulation allows us to compute the shortest distance of an item using the shortest distance

of its component sub-items. When the WPDA M has a bounded stack, one can easily derive an algorithm thatcomputes the shortest distance using LM. Figure 5 is a simple example of such an algorithm. This algorithmcarries out a standard agenda-based reasoning with the relaxation technique [Cormen et al., 2009], where Qis the agenda. The map α maintains the current estimate of each proven item’s inside weight. Lines 4-7 seedthe axioms as the starting point of reasoning. Then lines 8-26 try to prove new items by applying the Scanrule (lines 12-13) and the Complete rule (lines 14-24). Any item that is newly proven or proven with a smallerweight is added back to the agenda in the Relax function.

The above algorithm is conveniently derived from the weighted deduction system using standard tech-niques. Nevertheless, there are other strategies that can also be used; for example, the shortest path algorithmin Allauzen and Riley [2012] is essentially computing the inside weights with a multi-agenda strategy.

4 Algorithm 1

Having discussed the shortest distance problem in a WPDA, we now move on to the k-shortest-path problem.The key idea of our first algorithm is similar to the A* k-best parsing algorithm in Pauls and Klein [2009]. Aswe have shown in Section 2.2, similar to parsing, the execution of a WPDA can be described as a weighteddeductive logic. The generalized A* search algorithm from Felzenszwalb and McAllester [2007] can then beapplied with a monotonic and admissible heuristic function to find the k instantiations of the goal item withsmallest weights, from which we get the k shortest paths. The outside weight of items can be defined withsimilar meanings to parsing and used as an exact heuristic. Another, inexact heuristic will also be discussed,which will eventually lead to our second algorithm.

5

function Insideα← empty mapQ← empty queuefor all entering state q do

5: Push(q ; q, Q)α[q ; q]← 1

end forwhile Q is not empty do

q1 ; q2 ← Pop(Q)10: u← α[q1 ; q2]

for all transition e such that p[e] = q2 doif i[e] ∈ Σ ∪ {ϵ} then ▷ Scan

Relax(q1 ; n[e], u⊗ w[e])else if i[e] ∈ Π then ▷ Complete; as the left antecedent

15: for all e′ such that i[e′] = i[e] and n[e] ; p[e′] is in α doRelax(q1 ; n[e′], u⊗ w[e]⊗ α[n[e] ; p[e′]]⊗ w[e′])

end forelse if i[e] ∈ Π then ▷ Complete; as the right antecedent

for all e′ such that i[e′] = i[e] and n[e′] = q1 do20: for all q3 such that q3 ; p[e′] is in α do

Relax(q3 ; n[e], α[q3 ; p[e′]]⊗ w[e′]⊗ u⊗ w[e])end for

end forend if

25: end forend while

end function

function Relax(q1 ; q2, w)30: if q1 ; q2 is in α then

u← α[q1 ; q2]⊕ wif u = α[q1 ; q2] then

α[q1 ; q2]← uPush(q1 ; q2, Q) if q1 ; q2 not already in Q

35: end ifelse

α[q1 ; q2]← uPush(q1 ; q2, Q) if q1 ; q2 not already in Q

end if40: end function

Figure 5: A simple Inside algorithm

6

S← empty set of proven instantiationsQ← empty min-priority queuefor all axiom A : u do

Push(A : u, Q) with priority H(A : u)5: end for

while Q is not empty doA : u← Pop(Q)if A : u is a goal item then

Output A : u10: end if

Add A : u to Sfor all new instantiation B : v proveable using A : u and any member of S do

Push(B : v, Q) with priority H(B : v)end for

15: end while

Figure 6: The generalized A* algorithm

4.1 A* search on a deductive logic

Felzenszwalb and McAllester [2007] introduce the generalized A* search algorithm on a deductive logic. Al-though the original algorithm assumes the weights are from a positive tropical semiring, this is not a necessaryrequirement in our problem, as we show next.

Similar to the original A* algorithm on graphs [Hart et al., 1968], we need a heuristic function H to estimatethe final weight continuing from the current search state (an instantiation in this case) to the closest goal item.More formally, for a weighted logic L with (unweighted) item space I on semiring ⟨K,⊕,⊗, 0, 1⟩, a heuristicfunction H : ⟨I, K⟩ → K is any function satisfying the following,

Admissibility For any provable instantiation of the goal item G : w,

H(G : w) = w

Monotonicity For any provable instantiations A1 : w1, A2 : w2, . . . , Am : wm and an inference rule

A1 : w1 A2 : w2 . . . Am : wm

B : g(w1, w2, . . . , wm)

and 1 ≤ i ≤ n,H(Ai : wi) ≤ H(B : g(w1, w2, . . . , wm))

where ≤ is the natural ordering of the semiring.

With such an H, the A* algorithm on a deductive logic can then be described as in Figure 6.Similar to the original A* algorithm, the following property holds for the generalized A* algorithm as well:

Theorem 4. If a monotonic H is used, the generalized A* algorithm pops instantiations in increasing order of their Hvalue.

The proof of the tropical semiring case can be found in Felzenszwalb and McAllester [2007]. We include theproof simply to show this is the case with any monotonic heuristic function and any semiring with the pathproperty; not just the tropical semiring.

7

..s... q1. q2... f. (. )...

Inside

.......

Outside

Figure 7: Inside and outside weights on a shortest path

Proof. Suppose some instantiation is not popped in order of the H value. Let the instantiations popped in orderbe A1 : w1, A2 : w2, . . . and let i be the smallest index such that H(Ai−1 : wi−1) > H(Ai : wi). Right beforeAi−1 : wi−1 is popped, Ai : wi cannot be inside Q, otherwise it will be popped instead. This means Ai : wi isadded into Q after popping Ai−1 : wi−1 by applying some inference rule with Ai−1 : wi−1. The application isof the form

. . . Ai−1 : wi−1 . . .Ai : g(. . . , wi−1, . . .)

Because H is monotonic,

H(Ai−1 : wi−1) ≤ H(Ai : g(. . . , wi−1, . . .)) = H(Ai : wi)

This contradicts the assumption H(Ai−1 : wi−1) > H(Ai : wi).

If H is also admissible, then for any instantiation of a goal item G : w, H(G : w) is just w. Thus suchinstantiations are popped in increasing order of their weights and the first k such instantiations popped are theones with the smallest weight.

4.2 Outside weight as an exact heuristic

For a given instantiation q1 ; q2 : u, we want the heuristic to tell us the weight of the shortest accepting pathcontinuing from this instantiation. Such a heuristic is trivially monotonic and admissible. Let π be the pathinduced by q1 ; q2 : u, and define

H1(q1 ; q2 : u) =⊕µ,ν

w[µ]⊗ u⊗ w[ν]

where the sum is over all pairs of prefixes and suffixes of transitions such that µπν forms an accepting path.When the semiring is commutative, the heuristic has a simple form. Define β(q1 ; q2) =

⊕µ,ν w[µ]⊗ w[ν];

thenH1(q1 ; q2 : u) = β(q1 ; q2)⊗ u

We call β(q1 ; q2) the outside weight of q1 ; q2, because on the shortest accepting path going throughq1 ; q2, β(q1 ; q2) is the weight of the partial path “outside” of q1 ; q2, as illustrated in Figure 7. Thiscan be easily computed by applying the Scan and Complete rules in reverse, starting from the goal after theinside weights have been computed. See Figure 8 for a simple algorithm. Very similar to the inside algorithmin Figure 5, we use agenda-based reasoning, but with the goal item as the starting point (lines 5-6). Then lines7-20 try to propagate the estimates to inner items by applying inference rules in reverse.

8

function Outsideα← the inside weights from Insideβ← empty mapQ← empty queue

5: β[s ; f ]← 1Push(s ; f , Q)while Q is not empty do

q1 ; q2 ← Pop(Q)u← β[q1 ; q2]

10: for all incoming e of q2 doif i[e] ∈ Σ ∪ {ϵ} then ▷ Scan in reverse

Relax(q1 ; p[e], u⊗ w[e])else if i[e] ∈ Π then ▷ Complete in reverse

for all e′ such that i[e′] = i[e] and q1 ; p[e′] and n[e′] ; p[e] both in α do15: Relax(q1 ; p[e′], u⊗ w[e′]⊗ α[n[e′] ; p[e]]⊗ w[e])

Relax(n[e′] ; p[e], u⊗ w[e′]⊗ α[q1 ; p[e′]]⊗ w[e])end for

end ifend for

20: end whileend function

function Relax(q1 ; q2, w)if q1 ; q2 is in β then

25: u← β[q1 ; q2]⊕ wif u = β[q1 ; q2] then

β[q1 ; q2]← uPush(q1 ; q2, Q) if q1 ; q2 not already in Q

end if30: else

β[q1 ; q2]← uPush(q1 ; q2, Q) if q1 ; q2 not already in Q

end ifend function

Figure 8: A simple Outside algorithm

9

.. · · ·. q1. q2.

q3

.

q4

.

· · ·

.

· · ·

. (.

)

.

)

...

D(q 2,

q 3)

...

D(q2 , q

4 )

Figure 9: γ(q1 ; q2) = D(q2, q3)⊕ D(q2, q4) is the shortest distance from q2 to an “exit”

.. · · ·. q1. q2.

q3

.

q4

.

· · ·

.

· · ·

. (.

)

.

)

...

α(q 3;

q 2)...

α(q4 ;

q2 )

Figure 10: The reversed WPDA of Figure 9

4.3 An inexact heuristic and its problems

The above heuristic is very effective in the search because the outside weight gives an exact estimate. However,pre-computation of the outside weight requires two passes traversing the automaton. A natural question iswhether there is an inexact heuristic, yet still monotonic and admissible, which takes less time to compute.

When the multiplication does not decrease the weight,6 one may use the weight of only part of the finalshortest accepting path as an estimate. This can produce a heuristic that is less expensive to compute, possiblyat the cost of increasing the search time. In particular, define D(q1, q2) to be the shortest distance between anypair of states q1 and q2, and γ(q1 ; q2) =

⊕q3

D(q2, q3), where the summation is over all states reachable fromq2 that have a closing parenthesis or simply f when q1 is s (call such a state an exiting state, in contrast with anentering state). γ(q1 ; q2) is roughly how far away q1 ; q2 is to a pair of immediate enclosing parenthesis.7

For example, in Figure 9, γ(q1 ; q2) = D(q2, q3)⊕ D(q2, q4) is the shortest distance from q2 to exitting statesq3 and q4. All the relevant values of D are in fact the inside weights of the reversed WPDA of M (for example,see Figure 10),8 therefore we call it the reverse inside weight.

Then, we can define the following heuristic,

H2(q1 ; q2 : u) = u⊗ γ(q1 ; q2)

6For example, the tropical semiring with only non-negative weights in the setting of the classical shortest path problem on a graph,where real valued weights are summed within the path and the minimum is taken (i.e. use + as ⊗ and min as ⊕).

7This is only a rough estimate since there may not be a opening parenthesis going to q1 that matches the closing parenthesis of theselected exitting state. However, the actual shortest distance is never smaller than this, which means the estimate is still admissible.

8That is, reverse the direction of transitions; swap s and f ; and swap Π and Π.

10

..s.. q1.

q2

.

q3

.

q4

.

q5

.

q6

.

q7

.

q8

.

f

. (:0.

a:1

.

b:0

.

a:1

.

):0

.

a:1

.

b:0

.

b:0

.

b:0

. ):4

Figure 11: A problematic WPDA for H2, weights are in the tropical semiring

Item β γ

s ; s 3 3s ; q6 1 1s ; f 0 0

q1 ; q1 3 0q1 ; q2 2 1q1 ; q3 4 0q1 ; q4 1 0q1 ; q5 4 0q1 ; q7 4 0q1 ; q8 4 0

Table 1

This gives us the weight of the shortest path starting at the induced path of q1 ; q2 : u to any exiting state,which may be a part of an accepting path. It is trivially admissible because γ(s ; f ) = D( f , f ) = 1. Whenmultiplication does not decrease the weight, the weight of part of a path is always smaller than or equal to theweight of the whole path. Therefore, the heuristic is monotonic. Unlike the outside weight, the semiring doesnot need to be commutative for this heuristic to be well-defined.

Unfortunately, though we now spend less time in pre-computing the heuristic, the actual A* search usuallyends up taking much longer because of the inexactness. To see why, notice that any instantiation of an itemwith a weight smaller than the shortest accepting path has to be visited, even if it will only be used in anaccepting path far longer than the k shortest ones. For example, consider the WPDA in Figure 11, with therelevant values of β and γ listed in Table 1. When using H1, the following are instantiated before reaching the1-shortest path:

s ; s : 0 (H1 = 0 + 3 = 3, Q = {q1 ; q1 : 0})q1 ; q1 : 0 (H1 = 0 + 3 = 3, Q = {q1 ; q2 : 1, q1 ; q3 : 0})q1 ; q2 : 1 (H1 = 1 + 2 = 3, Q = {q1 ; q4 : 2, q1 ; q3 : 0})q1 ; q4 : 2 (H1 = 2 + 1 = 3, Q = {s ; q6 : 2, q1 ; q3 : 0})s ; q6 : 2 (H1 = 2 + 1 = 3, Q = {s ; f : 3, q1 ; q3 : 0})s ; f : 3 (H1 = 3 + 0 = 3, Q = {q1 ; q3 : 0})

11

Q← empty min-priority queuePush(⟨1, 1⟩, Q) with priority A1 + B1while Q is not empty do⟨i, j⟩ ← Pop(Q)

5: Output ⟨Ai, Bj⟩if ⟨i + 1, j⟩ not already in Q then

Push(⟨i + 1, j⟩, Q) with priority Ai+1 + Bjend ifif ⟨i, j + 1⟩ not already in Q then

10: Push(⟨i, j + 1⟩, Q) with priority Ai + Bj+1end if

end while

Figure 12

However, when H2 is used, the following are instantiated before the 1-shortest,

q1 ; q1 : 0 (H2 = 0 + 0 = 0, Q = {q1 ; q3 : 0, q1 ; q2 : 1, s ; s : 0})q1 ; q3 : 0 (H2 = 0 + 0 = 0, Q = {q1 ; q5 : 0, q1 ; q2 : 1, s ; s : 0})q1 ; q5 : 0 (H2 = 0 + 0 = 0, Q = {q1 ; q7 : 0, q1 ; q2 : 1, s ; s : 0})q1 ; q7 : 0 (H2 = 0 + 0 = 0, Q = {q1 ; q8 : 0, q1 ; q2 : 1, s ; s : 0})q1 ; q8 : 0 (H2 = 0 + 0 = 0, Q = {q1 ; q2 : 1, s ; s : 0, s ; f : 4})q1 ; q2 : 1 (H2 = 1 + 1 = 2, Q = {q1 ; q4 : 2, s ; s : 0, s ; f : 4})q1 ; q4 : 2 (H2 = 2 + 0 = 2, Q = {s ; s : 0, s ; f : 4})s ; s : 0 (H2 = 0 + 3 = 3, Q = {s ; q6 : 0, s ; f : 4})s ; q6 : 2 (H2 = 2 + 1 = 3, Q = {s ; f : 3, s ; f : 4})s ; f : 3 (H2 = 3 + 0 = 3, Q = {s ; f : 4})

H2 ends up visiting more instantiations along the path from q1 to q8 because that path is shorter in the scopeof the enclosing parentheses. The following closing parenthesis completely flips the position, but this is someinformation H1 “knows” while H2 does not. In practice, we find this happens so frequently that H2 fails tooutput the shortest path within a reasonable amount of time.

Another problem with H2 is that when multiplication may increase the weight (for example, in the trop-ical semiring with negative weights, which is commonly used in applications such machine translation), theheuristic is no longer monotonic.

5 Algorithm 2

We can adopt a new search strategy to address the problems with H2. Before describing the algorithm, we takea brief excursion to introduce a technique from Huang and Chiang [2005]. Consider the following problem:

Let A and B be two (possibly infinite) ordered sequence of real numbers (i.e. for any i, Ai ≤ Ai+1and Bi ≤ Bi+1). Find the k smallest elements in A× B, ordered by the sum of the pair.

For example, when A is {0, 2, 2} and B is {1, 2, 4}, the 3 smallest elements are {⟨0, 1⟩, ⟨0, 2⟩, ⟨2, 1⟩}. A naivesolution is to compute the first k elements in both A and B then sort all the k2 combinations. The techniquefrom Huang and Chiang [2005], described in Figure 12, visits at most 2k combinations and usually a lot fewerin practice. The key insight is that there is no need to explore ⟨Ai+1, Bj⟩ or ⟨Ai, Bj+1⟩ before ⟨Ai, Bj⟩ is poppedbecause both of them are guaranteed to be sub-optimal compared with ⟨Ai, Bj⟩. When computing elements inA and B is expensive, this technique is substantially faster than the naive solution.

12

Pair of states Ds, f 3

q1, q4 2q2, q4 1q4, q4 0q1, q8 0q3, q8 0q5, q8 0q7, q8 0q8, q8 0

Table 2

The same idea can be applied in our problem. For any pair of entering and exiting states ⟨p, q⟩, let Gpq bethe sequence of balanced paths from p to q ordered by their weight and let Gi

pq be the i-th path. Followingsimilar reasoning, we know there is no need to compute the actual value of Gi+1

pq before Gipq is ever used as part

of a larger path, in search of the k shortest accepting path. Furthermore, Gpq can be incrementally computed,as we show next in Figure 13.

The algorithm operates as follows. First of all, instead of having a single priority queue, now for everyrelevant Gpq, we have a corresponding priority queue Qpq. Qpq is only responsible for finding the intermediate“goal”, i.e. balanced paths from p to q, in increasing order of their weights. Further, only items of the formp ; r are pushed into Qpq, which allows us to use the following heuristic that only requires the reverse insideweights,

Hpq(p ; r : u) = u⊗ D(r, q)

Items are then proved in a top-down fashion, starting with Gs f . The search process can be described recursively(Figure 13). Let the sequence in consideration be Gpq,

• If there is no balanced path from p to q using any parenthesis, all proofs only involve the Scan rule. As aresult, Gpq can be incrementally computed without consulting any other sequence (lines 23-27).

• Otherwise, let e and e′ be the pair of parentheses encountered during the search. Simply query Gn[e]p[e′ ]to get the shortest path (line 30), and only use the (k + 1)-th shortest path after an instantiation provedwith the k-th one is popped (lines 14-18).

Though omitted in Figure 13 for a simpler presentation, a further optimization is essential to achieve thedesired efficiency. Observe in the second case above, that the exact knowledge of the shortest path from n[e]to p[e′] is not required until an item proved using that path is popped. Therefore, instead of directly callingFindKth(n[e], p[e′], 1), one can query D(n[e], p[e′]) to get the shortest distance. This is sufficient to compute thepriority and “promise” an actual proof, which will be realized once the item is popped. To distinguish actualinstantiations from those with a promise, we denote q1 ∼ q2 : u as an instantiation where the last step is basedon a promise.

To see the new algorithm at work, consider again the WPDA in Figure 11. Relevant values of D are listedin Table 2. Then the following are instantiated before reaching the 1-shortest path,

Gs f Gq1q4

s ; s : 0 (Hs f = 0 + 3 = 3, Qs f = {s ∼ q6 : 2, s ∼ f : 4})s ∼ q6 : 2 (Hs f = 2 + 1 = 3, Qs f = {s ∼ f : 4})

q1 ; q1 : 0 (Hq1q4 = 0 + 2 = 2, Qq1q4 = {q1 ; q2 : 1})q1 ; q2 : 1 (Hq1q4 = 1 + 1 = 2, Qq1q4 = {q1 ; q4 : 2})q1 ; q4 : 2 (Hq1q4 = 2 + 0 = 2, Qq1q4 = {})

s ; q6 : 2 (Hs f = 2 + 1 = 3, Qs f = {s ; f : 3, s ∼ f : 4})s ; f : 3 (Hs f = 3 + 0 = 3, Qs f = {s ∼ f : 4})

13

function FindKth(p, q, k) ▷ Finds the k-th element of Gpqif the result has been cached then

return the cached resultend if

5: S is a global variable storing proven items, initialized as empty outside the functionQpq is a min-priority queue, initialized as empty outside the functionif p ; p not in S then ▷ First time called

Push(p ; p : 1, Qpq) with priority D(p, q)end if

10: while Qpq is not empty doif top of Qpq is proven via Scan then

p ; r : u← Pop(Qpq)else ▷ via Complete; further pushing is needed⟨p ; r : u, v, e, e′, j⟩ ← Pop(Qpq)

15: n[e] ; p[e′] : w← FindKth(n[e], p[e′], j + 1)h← v⊗ w[e]⊗ w⊗ w[e′]⊗ D(n[e′], q)if h = 0 then

Push(⟨p ; r : v⊗ w[e]⊗ w⊗ w[e′], v, e, e′, j + 1⟩, Qpq) with priority h ▷ Storeinformation for further pushing in the future

end if20: end if

Add p ; r : u to Sfor all transition e such that p[e] = r do

if i[e] ∈ Σ ∪ {ϵ} then ▷ Scanh← u⊗ w[e]⊗ D(n[e], q)

25: if h = 0 thenPush(p ; n[e] : u⊗ w[e], Qpq) with priority h

end ifelse if i[e] ∈ Π then ▷ Complete; as the left antecedent

for all transition e′ such that i[e′] = i[e] and D(n[e], p[e′]) = 0 do30: n[e] ; p[e′] : v← FindKth(n[e], p[e′], 1)

h← u⊗ w[e]⊗ v⊗ w[e′]⊗ D(n[e′], q)if h = 0 then

Push(⟨p ; n[e′] : u⊗ w[e]⊗ v⊗ w[e′], u, e, e′, 1⟩, Qpq) with priority h ▷Store information for further pushing in the future

end if35: end for

end ifend forif p ; r : w is a goal item then

Cache p ; r : w, then return p ; r : w40: end if

end whileend function

Figure 13: Algorithm 2

14

103 104 105 106 107

|Q|+ |E|101

102

103

104

105

Tim

e(m

s)

Algorithm 1 with H1

Algorithm 2Oracle Pruned-ExpansionExpansion

Figure 14: Timing on WPDAs with various sizes, k = 1000

Notice no item is ever instantiated from Gq1q8 , which is exactly the desired result.Another benefit of grouping the search by the intermediate “goals” is there is not any special requirement

on the semiring — multiplication neither has to be commutative nor non-decreasing.

6 Experimental Results

We tested our algorithms on WPDAs generated from the machine translation system described in Iglesias et al.[2011]. Figure 14 compares the running time of the two algorithms with two previous approaches (expansionand pruned-expansion with orcale threshold) in finding the 1000 shortest paths on sample WPDAs with varioussizes. Due to the exponential time complexity of WPDA expansion, the expansion baseline is only able to finishwithin our time and memory limit on the 5 smallest sample inputs.9 For the pruned-expansion approach, wepick the oracle threshold (the exact weight difference between the shortest path and the 1000th shortest one)for each sample.

Both of our algorithms are significantly faster than the expansion baseline, and their performance is compa-rable on smaller input. But as the size of the WPDA grows, the advantage of the single pass pre-computationof Algorithm 2 becomes clear, resulting in a very large time improvement in this case.

The performance of Algorithm 2 is close to the pruned-expansion’s oracle best case in almost all sampleinputs. However, it is worth noting that the perfect threshold varies significantly between samples — even forthose generated from the same system using different inputs, the factor of the perfect threshold relative to theweight of the shortest path can vary from 0.35% to 160% while the median is 7%. This justifies our previousclaim about the difficulty in picking an appropriate threshold.

Figure 15 breaks down the running time of our algorithms on a large WPDA. Both of them spend most oftheir time on pre-computing the heuristics and the actual search takes very little time even with k as large as10000.

92 CPU hours; 4 GB of memory.

15

100 101 102 103 104

k

10000

20000

30000

40000

50000

60000

70000

80000

Tim

e(m

s)

Algorithm 1 with H1

Algorithm 2Inside + OutsideReverse InsideShortest Path

Figure 15: Timing on a WPDA with 398347 states and 951889 transitions

7 Conclusion

In this paper, we developed two algorithms for finding k shortest paths of a WPDA. Previously, there weretwo approaches to this problem. The expansion approach expands the WPDA into an equivalent WFSA,which requires exponential time and space, and then finds the k shortest paths of the WFSA. Another pruned-expansion approach expands the WPDA into a WFSA with states or transitions not on a path close enough bya given threshold to the shortest path by weight removed, and then finds the k shortest paths of the prunedWFSA. This requires less time and space, but an appropriate threshold is almost impossible to set.

In contrast, our algorithms do not need any pruning or threshold picking and give the exact k shortestpaths. The experimental results on real world input show that Algorithm 2 is highly efficient, adding very littleoverhead to the shortest distance pre-computation, whose running time is comparable to the original shortestpath algorithm in Allauzen and Riley [2012].

Acknowledgements

We would like to thank Gonzalo Iglesias for providing test inputs for our experiments. This research wassupported in part by the BOLT program of the Defense Advanced Research Projects Agency, Contract No.HR0011-12-C-0015. Any opinions, findings, conclusions or recommendations expressed in this paper are thoseof the authors and do not necessarily reflect the view of DARPA.

References

C. Allauzen and M. Riley. A pushdown transducer extension for the openfst library. In Proceedings of theSeventeenth International Conference on Implementation and Application of Automata, (CIAA 2012), 2012.

David Chiang, Kevin Knight, and Wei Wang. 11,001 new features for statistical machine translation. In Proceed-ings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Associationfor Computational Linguistics, page 218226, Boulder, Colorado, June 2009. Association for Computational Lin-guistics. URL http://www.aclweb.org/anthology/N/N09/N09-1025.

16

http://www.aclweb.org/anthology/N/N09/N09-1025

Michael Collins and Terry Koo. Discriminative reranking for natural language parsing. Computational Linguis-tics, 31(1):25–70, 2005. ISSN 0891-2017. doi: 10.1162/0891201053630273. URL http://dx.doi.org/10.1162/

0891201053630273.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Relaxation. In Introduction toAlgorithms, pages 648–650. The MIT Press, third edition edition, July 2009. ISBN 0262033844.

P. F. Felzenszwalb and D. McAllester. The generalized a* architecture. Journal of Artificial Intelligence Research,29(1):153–190, 2007. URL https://www.aaai.org/Papers/JAIR/Vol29/JAIR-2906.pdf.

P.E. Hart, N.J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths.Systems Science and Cybernetics, IEEE Transactions on, 4(2):100–107, 1968.

Liang Huang and David Chiang. Better k-best parsing. In Proceedings of the Ninth International Workshop onParsing Technology, pages 53–64, Vancouver, British Columbia, October 2005. Association for ComputationalLinguistics. URL http://www.aclweb.org/anthology/W/W05/W05-1506.

Gonzalo Iglesias, Cyril Allauzen, William Byrne, Adrià de Gispert, and Michael Riley. Hierarchical phrase-based translation representations. In Proceedings of the 2011 Conference on Empirical Methods in Natural LanguageProcessing, pages 1373–1383, Edinburgh, Scotland, UK., July 2011. Association for Computational Linguistics.URL http://www.aclweb.org/anthology/D11-1127.

Adam Pauls and Dan Klein. K-Best a* parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting ofthe ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 958–966,Suntec, Singapore, August 2009. Association for Computational Linguistics. URL http://www.aclweb.org/

anthology/P/P09/P09-1108.

S. M. Shieber, Y. Schabes, and F. C. N. Pereira. Principles and implementation of deductive parsing. TheJournal of Logic Programming, 24(1-2):3–36, 1995. URL http://www.sciencedirect.com/science/article/

pii/074310669500035I.

A Proof of Properties of LM

Theorem (Soundness). Any valid proof of an instantiation q1 ; q2 : u in LM induces a balanced path from q1 to q2with weight u in M.

Proof. Base An axiom of the form q1 ; q2 : u must have q1 = q2 and u = 1. The yield of a proof using onlythe axiom is an empty path, thus a balanced path with weight 1.

Induction Assuming proofs with at most n steps satisfy the above lemma. For any proof of q1 ; q2 : u inn + 1 steps,

• If the last step uses the Scan rule, then it must be of the following form,

q1 ; q3 : u1q3

a−→ q2 : u2q1 ; q2 : u

where u1 ⊗ u2 = u, q3a−→ q2 : u2 ∈ E and q1 ; q3 : u1 is the outcome of some proof in at most n

steps. Let the induced path of the proof of q1 ; q3 : u1 be π′ = e1e2 . . . em. The induced path ofthe whole proof is thus π = e1e2 . . . em(q3

a−→ q2). By the induction hypothesis, π′ is a balanced pathfrom q1 to q3 with weight u1. As a result, π is also balanced because a ∈ Σ ∪ {ϵ} by definition of thelogic; its weight is w[π] = w[π′]⊗ u2 = u1 ⊗ u2 = u.

• If the last step uses the Complete rule, then it must be of the following form,

17

http://dx.doi.org/10.1162/0891201053630273

http://dx.doi.org/10.1162/0891201053630273

https://www.aaai.org/Papers/JAIR/Vol29/JAIR-2906.pdf

http://www.aclweb.org/anthology/W/W05/W05-1506

http://www.aclweb.org/anthology/D11-1127

http://www.aclweb.org/anthology/P/P09/P09-1108

http://www.aclweb.org/anthology/P/P09/P09-1108

http://www.sciencedirect.com/science/article/pii/074310669500035I

http://www.sciencedirect.com/science/article/pii/074310669500035I

q1 ; q3 : u1 q4 ; q5 : u3q3

a−→ q4 : u2, q5a−→ q2 : u4q1 ; q2 : u

where u1 ⊗ u2 ⊗ u3 ⊗ u4 = u, a ∈ Π is an opening parenthesis, and a ∈ Π is the correspondingclosing parenthesis. Similar to the Scan rule case, one can prove the induced path is a balanced pathfrom q1 to q2 with weight u using the associativity of ⊗.

Theorem (Completeness). Any balanced path from an entering state q1 to some state q2 with weight u in M has a validproof of an instantiation q1 ; q2 : u in LM whose induced path is that path.

Proof. Base For any empty balanced path from a state q such that q ; q : 1 is an axiom, the proof is just theaxiom itself.

Induction Assuming all balanced paths from any entering state of at most length n satisfy the above lemma.For any balanced path of length n + 1 from an entering state π = e1e2 . . . en+1 from q1 to q2 with weightu,

• If en+1 is q3a−→ q2 : u2 with a ∈ Σ ∪ {ϵ}, then π′ = e1e2 . . . em is a balanced of length n and

w[π′]⊗ u2 = u. By induction hypothesis, there exists a proof of π′ (via item q1 ; n[em] : w[π′]).Applying the Scan rule then gives a proof of q1 ; q2 : u1 ⊗ u2 = u.

• If en+1 is q3a−→ q2 : u2 with a ∈ Π, then there must be a k ≤ n such that ek balances with en+1. Similar

to the above case, one can prove the item by combining the proof of e1 . . . ek−1 and ek+1 . . . em.

• If en+1 is q3a−→ q2 : u2 with a ∈ Π, the path cannot be balanced.

Theorem (In-ambiguity). Any balanced path from an entering state in M has a unique proof in LM.

Proof. This is very similar to the completeness proof, thus we only give a sketch of the proof. First note allempty paths from an entering state has a unique proof (if the start state happens to have an incoming open-parenthesis transition, we consider the two inducing the same axiom). For any longer paths, if the last transitionhas a label from Σ∪ {ϵ} then the last step must be using the Scan rule with antecedents with unique proof andthat particular transition as the side condition; otherwise the last transition must have a closing parenthesis,which means a unique application of the Complete rule.

18

two algorithms for finding k shortest paths of a weighted ... · weighted pushdown automata (wpdas)...

Documents