
University of Tokyo Minematsu Lab Josef R. Novak

Weighted Finite-State Transducers Important Algorithms

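[Figure 12: four weighted automata, panels (a)-(d). Arcs as read from the figure:]
(a) A: 0 → 1: a/0, b/1, c/5; 0 → 2: d/0, e/1; 1 → 3: e/0, f/1; 2 → 3: e/4, f/5; final state 3.
(b) B: initial state 0/0; 0 → 1: a/0, b/1, c/5; 0 → 2: d/4, e/5; 1 → 3: e/0, f/1; 2 → 3: e/0, f/1; final state 3/0.
(c) C: initial state 0/15; 0 → 1: a/0, b/(1/15), c/(5/15); 0 → 2: d/0, e/(9/15); 1 → 3: e/0, f/1; 2 → 3: e/(4/9), f/(5/9); final state 3/1.
(d) Minimal automaton: initial state 0/0; recovered arc labels a/0, b/1, c/5 and e/0, f/1; final state 3/0.

Fig. 12. Weight pushing algorithm. (a) Weighted automaton A. (b) Equivalent weighted automaton B obtained by weight pushing in the tropical semiring. (c) Weighted automaton C obtained from A by weight pushing in the probability semiring. (d) Minimal weighted automaton over the tropical semiring equivalent to A.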

large graphs or automata of several hundred million states or transitions. An approximate version of a generic shortest-distance algorithm can be used instead to compute d[q] efficiently.

Note that if d[q] = 0̄, then, since S is zero-sum-free, the weight of all paths from q to F is 0̄. Let A be a weighted automaton over the semiring S. Assume that S is complete or k-closed and that the shortest distances d[q] are all well-defined and in S − {0̄}. Note that in both cases we can use the distributivity over the infinite sums defining shortest distances. Let e′ (π′) denote the transition e (path π) after application of the weight pushing algorithm. e′ (π′) differs from e (resp. π) only by its weight. Let λ′ denote the new initial weight function, and ρ′ the new final weight function.

The following proposition is proven in [38, 43].

Proposition 7 ([38, 43]). Let B = (A, Q, I, F, E′, λ′, ρ′) be the result of the weight pushing algorithm applied to the weighted automaton A, then

1. the weight of a successful path π is unchanged after application of weight pushing:

λ′[p[π′]] ⊗ w[π′] ⊗ ρ′[n[π′]] = λ(p[π]) ⊗ w[π] ⊗ ρ(n[π]). (39)

2. the weighted automaton B is stochastic, i.e.

∀q ∈ Q, ⊕_{e′ ∈ E′[q]} w[e′] = 1̄. (40)

These two properties of weight pushing are illustrated by Figures 12(a)-(c): the total weight of a successful path is unchanged after pushing; at each state of the weighted automaton of Figure 12(b), the minimum weight of the outgoing transitions is 0, and at each state of the weighted automaton of Figure 12(c), the weights of outgoing transitions sum to 1.
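The tropical case can be checked with a small sketch. It assumes the arcs of automaton A as read from Figure 12(a), an initial weight of 1̄ = 0, and a final weight of 0 at state 3. The names `ARCS`, `FINAL`, `shortest_distance_to_final`, and `push_weights` are illustrative and not taken from any library. Each arc weight becomes d[p]⁻¹ ⊗ w ⊗ d[n], which in the tropical semiring is w + d[n] − d[p].

```python
import math

# Automaton A of Figure 12(a): state -> list of (label, weight, next_state).
# The topology is an assumption read off the figure.
ARCS = {
    0: [("a", 0, 1), ("b", 1, 1), ("c", 5, 1), ("d", 0, 2), ("e", 1, 2)],
    1: [("e", 0, 3), ("f", 1, 3)],
    2: [("e", 4, 3), ("f", 5, 3)],
    3: [],
}
FINAL = {3: 0}  # final weight function (assumed 0 at state 3)


def shortest_distance_to_final(arcs, final):
    """d[q]: tropical shortest distance from q to the final states.

    Plain Bellman-Ford style relaxation; adequate for a small automaton
    without negative cycles.
    """
    d = {q: final.get(q, math.inf) for q in arcs}
    for _ in range(len(arcs)):
        for q, out in arcs.items():
            for _, w, r in out:
                d[q] = min(d[q], w + d[r])
    return d


def push_weights(arcs, final, initial_state=0):
    """Tropical weight pushing; assumes every d[q] is finite."""
    d = shortest_distance_to_final(arcs, final)
    pushed = {
        q: [(a, w + d[r] - d[q], r) for a, w, r in out]
        for q, out in arcs.items()
    }
    new_final = {q: w - d[q] for q, w in final.items()}
    initial_weight = d[initial_state]  # initial weight 0 (1-bar) times d[i]
    return pushed, new_final, initial_weight


pushed, new_final, initial_weight = push_weights(ARCS, FINAL)
for state, out in pushed.items():
    print(state, out)
```

Under these assumptions the arcs leaving state 0 towards state 2 become d/4 and e/5, and the arcs leaving state 2 become e/0 and f/1, as in Figure 12(b).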

Weight pushing can also be used to test the equivalence of two subsequential weighted automata [38, 43]. Let A and B be two subsequential weighted automata to which weight pushing can be applied and let A′ and B′ be the

Lecture Outline
• Preliminaries
  – Semirings
  – Transducers and Acceptors
• Important Algorithms
  – Composition
  – Determinization
  – Epsilon Removal
  – Other algorithms

Semirings (I)
• WFSTs and WFST-based operations are underpinned by algebraic objects called semirings
• This has implications for optimization, search, and combination algorithms such as determinization, shortest-path, and composition

Definition. A semiring is a system (K, ⊕, ⊗, 0̄, 1̄) such that
(1) (K, ⊕, 0̄) is a commutative monoid with 0̄ as the identity element for ⊕,
(2) (K, ⊗, 1̄) is a monoid with 1̄ as the identity element for ⊗,
(3) ⊗ distributes over ⊕: for all a, b, c in K,
    (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c),
    c ⊗ (a ⊕ b) = (c ⊗ a) ⊕ (c ⊗ b),
(4) 0̄ is an annihilator for ⊗: ∀a ∈ K, a ⊗ 0̄ = 0̄ ⊗ a = 0̄.

A monoid is an algebraic structure that supports a single associative binary operation and an identity element. (Wikipedia)

Source: Speech Recognition with Weighted Finite-State Transducers, Mohri et al.

Semirings (II)
• A variety of semirings exist, but there are two that are of particular interest for NLP and ASR applications
  – The log semiring: isomorphic to the real (probability) semiring, high numerical stability
  – The tropical semiring: convenient for shortest-path algorithms because global path weights are guaranteed to be monotonic non-decreasing; the Viterbi approximation is fast
• See "Semiring frameworks and algorithms for shortest-distance problems", Mohri, 2002, for more details

SEMIRING      SET               ⊕      ⊗    0̄     1̄
Boolean       {0, 1}            ∨      ∧    0     1
Probability   R+ ∪ {+∞}         +      ×    0     1
Log           R ∪ {−∞, +∞}      ⊕log   +    +∞    0
Tropical      R+ ∪ {+∞}         min    +    +∞    0

Table 1. Semiring examples. ⊕log is defined by: x ⊕log y = −log(e^(−x) + e^(−y)).
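As a rough illustration of the ⊕log operation from Table 1, the sketch below evaluates it in a numerically stable way and compares it with the tropical ⊕ (min). The function name `log_add` is illustrative only; weights are taken to be negative log probabilities.

```python
import math


def log_add(x, y):
    """x (+)log y = -log(e^-x + e^-y), evaluated without overflow."""
    if x == math.inf:  # +inf is the 0-bar of the log semiring
        return y
    if y == math.inf:
        return x
    m = min(x, y)
    if m == -math.inf:
        return m
    # Factor out the smaller weight: -log(e^-m * (1 + e^-|x-y|))
    return m - math.log1p(math.exp(-abs(x - y)))


x, y = -math.log(0.3), -math.log(0.2)
print(log_add(x, y))   # equals -log(0.3 + 0.2) up to rounding
print(min(x, y))       # tropical (+): keeps only the better of the two paths
```

Because log_add(x, y) never exceeds min(x, y), the tropical result can be read as an approximation of the log-semiring sum that keeps only the best path.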


Semirings (example)
• Tropical semiring example
• Simple, but unintuitive!

Operation definitions:
a ⊗ b = a + b
a ⊕ b = min(a, b)
1̄ = 0
0̄ = +∞

Trivial examples:
5 ⊕ 3 = 3
4 ⊕ 2 ⊕ 7 = 2
(3 ⊗ 4) ⊕ 6 = 6
1 ⊗ 5 = 6
(1 ⊗ 5) ⊕ (2 ⊗ 4) = 6
3 ⊕ 1̄ ⇔ 3 ⊕ 0 ⇔ 0
3 ⊗ 1̄ ⇔ 3 ⊗ 0 ⇔ 3
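The trivial examples can be replayed with a few lines of Python. This is only a sketch of the tropical operations; the names `oplus`, `otimes`, `ZERO`, and `ONE` are illustrative.

```python
import math

ZERO = math.inf  # 0-bar: identity for (+), annihilator for (x)
ONE = 0.0        # 1-bar: identity for (x)


def oplus(*xs):
    """Tropical (+): minimum of the arguments."""
    return min(xs)


def otimes(*xs):
    """Tropical (x): ordinary sum of the arguments."""
    return sum(xs)


print(oplus(5, 3))                        # 5 (+) 3
print(oplus(4, 2, 7))                     # 4 (+) 2 (+) 7
print(oplus(otimes(3, 4), 6))             # (3 (x) 4) (+) 6
print(otimes(1, 5))                       # 1 (x) 5
print(oplus(otimes(1, 5), otimes(2, 4)))  # (1 (x) 5) (+) (2 (x) 4)
print(oplus(3, ONE), otimes(3, ONE))      # 3 (+) 1-bar and 3 (x) 1-bar
```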

Transducers and Acceptors
• A WFSA is simply a WFST where the output labels have been omitted
• Similarly, FSAs and FSTs lack weights on the arcs or states

Definition. A WFST is defined as an 8-tuple, T = (Σ, Δ, Q, I, F, E, λ, ρ). Here Σ represents the finite alphabet of input symbols, Δ represents the finite output alphabet, Q represents the finite set of states, I ⊆ Q the set of initial states, F ⊆ Q the set of final states, E ⊆ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q a finite set of state-to-state transitions, λ : I → K the initial weight function, and ρ : F → K the final weight function.


Basic examples
• Finite-State Acceptor (FSA)
• Weighted Finite-State Acceptor (WFSA)
• Finite-State Transducer (FST)
• Weighted Finite-State Transducer (WFST)

WFST Practice
• OpenFST (www.openfst.org)

Transducer example:
    $ emacs test.fst.txt
    $ emacs test.syms
    $ fstcompile --isymbols=test.syms --osymbols=test.syms test.fst.txt | fstdraw --isymbols=test.syms --osymbols=test.syms --portrait=true | dot -Tpdf > test.pdf

Acceptor example:
    $ emacs test2.fst.txt
    $ emacs test2.syms
    $ fstcompile --acceptor=true --isymbols=test2.syms test2.fst.txt | fstdraw --portrait=true --acceptor=true --isymbols=test2.syms | dot -Tpdf > test2.pdf

test.fst.txt:
    0 1 t test
    1 2 eh -
    2 3 s -
    3 4 t -
    4

test.syms:
    - 0
    test 1
    t 2
    eh 3
    s 4

test2.fst.txt:
    0 1 write .3
    0 1 right .7
    1

test2.syms:
    - 0
    right 1
    write 2
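To make the layout of the text files concrete, here is a minimal sketch of a reader for them. It assumes whitespace-separated columns (source state, destination state, input label, output label for transducers, optional weight), and that a line with only a state and an optional weight marks a final state. `parse_fst_text` is a hypothetical helper, not part of OpenFST, and the default weight of 0.0 is an assumption matching 1̄ in the tropical semiring.

```python
def parse_fst_text(text, acceptor=False):
    """Parse arc and final-state lines into Python tuples.

    Returns (arcs, finals): arcs are (src, dst, ilabel, olabel, weight),
    finals maps a state to its final weight.
    """
    arcs, finals = [], {}
    n_labels = 1 if acceptor else 2
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) <= 2:  # final state line: state [weight]
            finals[int(fields[0])] = float(fields[1]) if len(fields) == 2 else 0.0
            continue
        src, dst = int(fields[0]), int(fields[1])
        labels = fields[2:2 + n_labels]
        has_weight = len(fields) > 2 + n_labels
        weight = float(fields[2 + n_labels]) if has_weight else 0.0
        # An acceptor carries one label, used for both input and output.
        arcs.append((src, dst, labels[0], labels[-1], weight))
    return arcs, finals


TEST_FST = """
0 1 t test
1 2 eh -
2 3 s -
3 4 t -
4
"""
TEST2_FST = """
0 1 write .3
0 1 right .7
1
"""

print(parse_fst_text(TEST_FST))
print(parse_fst_text(TEST2_FST, acceptor=True))
```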

Important algorithms
• There exists a wide variety of algorithms that operate on Weighted Finite-State Transducers
  – composition, determinization, minimization, epsilon-removal, epsilon-normalization, synchronization, weight-pushing, reversal, projection, shortest-path, connection, closure, concatenation, pruning, re-weighting, union, etc.
• Arguably the most important operations are composition, determinization, epsilon-removal, weight-pushing, and minimization

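[Figure 6: weighted transducers T1 (a), T2 (b), and their composition (c). Arcs as read from the figure:]
(a) T1: 0 → 1: a:b/0.1; 1 → 0: a:b/0.2; 1 → 2: b:b/0.3; 1 → 3: b:b/0.4; 2 → 3: a:b/0.5; 3 → 3: a:a/0.6; final state 3/0.7.
(b) T2: 0 → 1: b:b/0.1; 1 → 1: b:a/0.2; 1 → 2: a:b/0.3; 1 → 3: a:b/0.4; 2 → 3: b:a/0.5; final state 3/0.6.
(c) T1 ◦ T2: (0,0) → (1,1): a:b/.01; (1,1) → (0,1): a:a/.04; (1,1) → (2,1): b:a/.06; (1,1) → (3,1): b:a/.08; (0,1) → (1,1): a:a/.02; (2,1) → (3,1): a:a/0.1; (3,1) → (3,2): a:b/.18; (3,1) → (3,3): a:b/.24; final state (3,3)/.42.

Fig. 6. Weighted transducers (a) T1 and (b) T2 over the probability semiring. (c) Illustration of composition of T1 and T2, T1 ◦ T2. Some states might be constructed during the execution of the algorithm that are not co-accessible, e.g., (3, 2). Such states and the related transitions can be removed by a trimming (or connection) algorithm in linear time.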

T1 = (Σ*, Δ*, Q1, I1, F1, E1, λ1, ρ1) and T2 = (Δ*, Ω*, Q2, I2, F2, E2, λ2, ρ2) such that the input alphabet of T2, Δ, coincides with the output alphabet of T1, outputs a weighted transducer T = (Σ*, Ω*, Q, I, F, E, λ, ρ) realizing the composition of T1 and T2.

Let T1 and T2 be two weighted transducers defined over S such that the input alphabet of T2 coincides with the output alphabet of T1. Assume that the infinite sum ⊕_{z∈Δ*} T1(x, z) ⊗ T2(z, y) is defined and in S for all (x, y) ∈ Σ* × Ω*. This condition holds for all transducers defined over a complete semiring such as the Boolean semiring, the tropical semiring, the probability semiring and the log semiring, and for all acyclic transducers defined over an arbitrary semiring. Then, the result of the composition of T1 and T2 is a weighted transducer denoted by T1 ◦ T2 and defined for all x, y by:

(T1 ◦ T2)(x, y) = ⊕_{z∈Δ*} T1(x, z) ⊗ T2(z, y). (23)

The sum runs over all strings z labeling a path of T1 on the output side and a path of T2 on input label z. The matrix notation we have used emphasizes the connection of composition with matrix multiplication.9

There exists a general and efficient algorithm to compute the composition of two weighted transducers. In the absence of εs on the input side of T1 or the output side of T2, the states of T1 ◦ T2 can be identified with pairs of a

9 Our choice of a matrix notation as opposed to a functional notation is motivated by its convenience in applications.

Composition (I)
• Fundamental operation for combining related transducers
• Used to iteratively combine multiple related knowledge sources to produce a single, integrated result

(T1 ◦ T2)(x, y) = ⊕_{z∈Δ*} T1(x, z) ⊗ T2(z, y),

where T1 and T2 represent the two transducers to be composed. In the epsilon-free case, the composition algorithm itself is defined as follows.

Algorithm 1 Weighted-Composition(T1, T2)

Q ← I1 × I2
K ← I1 × I2
while K ≠ ∅ do
    q = (q1, q2) ← Head(K)
    Dequeue(K)
    if q ∈ I1 × I2 then
        I ← I ∪ {q}
        λ(q) ← λ1(q1) ⊗ λ2(q2)
    if q ∈ F1 × F2 then
        F ← F ∪ {q}
        ρ(q) ← ρ1(q1) ⊗ ρ2(q2)
    for each (e1, e2) ∈ E[q1] × E[q2] such that o[e1] = i[e2] do
        if q′ = (n[e1], n[e2]) ∉ Q then
            Q ← Q ∪ {q′}
            Enqueue(K, q′)
        E ← E ⊎ {(q, i[e1], o[e2], w[e1] ⊗ w[e2], q′)}
return T

Composition (II)
Annotated steps of Algorithm 1:
• Initialize the queue and new WFST (Q ← I1 × I2, K ← I1 × I2)
• Initial state matches
• Final state matches
• Create new arcs for any matching labels, and add to the new WFST

Algorithm complexity
Time O(V1 V2 D1 (log D2 + M2))
Space O(V1 V2 D1 M2)
Vi = #states, Di = max out-degree, Mi = max multiplicity of the i-th WFST. (www.OpenFST.org)

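Composition (example)
T1 is transducer (a), T2 is transducer (b), and C = T1 ◦ T2 is the result (c). State na belongs to T1 and state nb to T2. Each step dequeues one composite state, lists the outgoing arcs of its two component states (destination state in parentheses), and adds an arc to C for every pair in which the output label of the T1 arc equals the input label of the T2 arc. Weights are multiplied (probability semiring).

Step 1. Current state (0a,0b). Queue (K): (0a,0b)
    T1 arcs: a:b/0.1 (1a)
    T2 arcs: b:b/0.1 (1b)
    New arcs: a:b/0.01 (1a,1b)
    Queue (K+1): (1a,1b)

Step 2. Current state (1a,1b). Queue (K): (1a,1b)
    T1 arcs: a:b/0.2 (0a), b:b/0.3 (2a), b:b/0.4 (3a)
    T2 arcs: b:a/0.2 (1b), a:b/0.3 (2b), a:b/0.4 (3b)
    New arcs: a:a/0.04 (0a,1b), b:a/0.06 (2a,1b), b:a/0.08 (3a,1b)
    Queue (K+1): (0a,1b), (2a,1b), (3a,1b)

Step 3. Current state (0a,1b). Queue (K): (0a,1b), (2a,1b), (3a,1b)
    T1 arcs: a:b/0.1 (1a)
    T2 arcs: b:a/0.2 (1b), a:b/0.3 (2b), a:b/0.4 (3b)
    New arcs: a:a/0.02 (1a,1b)
    Queue (K+1): (2a,1b), (3a,1b)

Step 4. Current state (2a,1b). Queue (K): (2a,1b), (3a,1b)
    T1 arcs: a:b/0.5 (3a)
    T2 arcs: b:a/0.2 (1b), a:b/0.3 (2b), a:b/0.4 (3b)
    New arcs: a:a/0.1 (3a,1b)
    Queue (K+1): (3a,1b)

Step 5. Current state (3a,1b). Queue (K): (3a,1b)
    T1 arcs: a:a/0.6 (3a)
    T2 arcs: b:a/0.2 (1b), a:b/0.3 (2b), a:b/0.4 (3b)
    New arcs: a:b/0.18 (3a,2b), a:b/0.24 (3a,3b)
    Queue (K+1): (3a,2b), (3a,3b)

Step 6. Current state (3a,2b). Queue (K): (3a,2b), (3a,3b)
    T1 arcs: a:a/0.6 (3a)
    T2 arcs: b:a/0.5 (3b)
    New arcs: none
    Queue (K+1): (3a,3b)

Step 7. Current state (3a,3b). Queue (K): (3a,3b)
    T1 arcs: a:a/0.6 (3a)
    T2 arcs: none
    New arcs: none
    Queue (K+1): empty

(C) composition result

Source     Arc         Destination
(0a,0b)    a:b/0.01    (1a,1b)
(1a,1b)    a:a/0.04    (0a,1b)
(1a,1b)    b:a/0.06    (2a,1b)
(1a,1b)    b:a/0.08    (3a,1b)
(0a,1b)    a:a/0.02    (1a,1b)
(2a,1b)    a:a/0.1     (3a,1b)
(3a,1b)    a:b/0.18    (3a,2b)
(3a,1b)    a:b/0.24    (3a,3b)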
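The example can be replayed with a short Python sketch of the epsilon-free composition algorithm over the probability semiring. The dictionary layout and the name `compose` are illustrative. The arcs are those of Figure 6 as read from the example, and the initial weights are assumed to be 1̄ = 1.0 since the figure does not show them.

```python
from collections import deque

# Transducers as dicts; arcs: state -> list of (input, output, weight, next).
T1 = {
    "initial": {0: 1.0},  # assumption: initial weight 1.0
    "final": {3: 0.7},
    "arcs": {
        0: [("a", "b", 0.1, 1)],
        1: [("a", "b", 0.2, 0), ("b", "b", 0.3, 2), ("b", "b", 0.4, 3)],
        2: [("a", "b", 0.5, 3)],
        3: [("a", "a", 0.6, 3)],
    },
}
T2 = {
    "initial": {0: 1.0},  # assumption: initial weight 1.0
    "final": {3: 0.6},
    "arcs": {
        0: [("b", "b", 0.1, 1)],
        1: [("b", "a", 0.2, 1), ("a", "b", 0.3, 2), ("a", "b", 0.4, 3)],
        2: [("b", "a", 0.5, 3)],
        3: [],
    },
}


def compose(t1, t2, times=lambda x, y: x * y):
    """Epsilon-free composition; `times` is the semiring multiplication."""
    initial = {
        (i1, i2): times(w1, w2)
        for i1, w1 in t1["initial"].items()
        for i2, w2 in t2["initial"].items()
    }
    states = set(initial)
    queue = deque(initial)
    final, arcs = {}, []
    while queue:
        q = queue.popleft()
        q1, q2 = q
        if q1 in t1["final"] and q2 in t2["final"]:
            final[q] = times(t1["final"][q1], t2["final"][q2])
        for i1, o1, w1, n1 in t1["arcs"].get(q1, []):
            for i2, o2, w2, n2 in t2["arcs"].get(q2, []):
                if o1 == i2:  # output of T1 must match input of T2
                    nxt = (n1, n2)
                    if nxt not in states:
                        states.add(nxt)
                        queue.append(nxt)
                    arcs.append((q, i1, o2, times(w1, w2), nxt))
    return {"initial": initial, "final": final, "arcs": arcs}


result = compose(T1, T2)
for src, ilabel, olabel, weight, dst in result["arcs"]:
    print(src, f"{ilabel}:{olabel}/{weight:.2f}", dst)
```

Passing `times=lambda x, y: x + y` gives the same construction for the tropical or log semiring, where ⊗ is ordinary addition. Like the trace, the sketch also creates the state (3, 2), which is not co-accessible and would be removed by trimming.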

Composition (III)
• ε-transitions (epsilons) cause problems
  – Wildcard (*) that matches zero or more occurrences of any label in the input/output alphabet
  – Create redundant paths during simple composition
  – May produce incorrect results for non-idempotent semirings, as path weights may be counted more than once
• Solution is to utilize an intermediate filter transducer to eliminate the redundant paths

[Figure 7: (a) transducers T1 and T2, their ε-augmented versions T̃1 and T̃2, and the grid of composite states containing the redundant ε-paths; (b) filter transducer F. Graphics not recoverable from the extraction.]

Fig. 7. Redundant ε-paths in composition. All transition and final weights are equal to 1. (a) A straightforward generalization of the ε-free case would generate all the paths from (1, 1) to (3, 2) when composing T1 and T2 and produce an incorrect result in non-idempotent semirings. (b) Filter transducer F [46]. The shorthand x is used to represent an element of Σ.

In the worst case, all transitions of T1 leaving a state q1 match all those of T2 leaving state q′1, thus the space and time complexity of composition is quadratic: O(|T1||T2|). However, an important feature of composition is that it admits a natural on-demand computation which can be used to construct only the part of the composed transducer that is needed. Figures 6(a)-(c) illustrate the algorithm when applied to the transducers of Figures 6(a)-(b) defined over the probability semiring.

More care is needed when T1 admits output ε labels or T2 input ε labels. Indeed, as illustrated by Figure 7, a straightforward generalization of the ε-free case would generate redundant ε-paths and, in the case of non-idempotent semirings, would lead to an incorrect result. The weight of the matching paths of the original transducers would be counted p times, where p is the number of redundant paths in the result of composition.

To cope with this problem, all but one ε-path must be filtered out of the composite transducer. Figure 7 indicates in boldface one possible choice


Composition (IV)
Annotations on Figure 7(a):
• Valid because the output epsilon label doesn't need to match anything on the 2nd WFST
• Valid because the input epsilon label doesn't need to match anything on the 1st WFST

Composition (example): composition with epsilons, generic
State na belongs to the first WFST and state nb to the second. Each step lists the pairs of moves that can be combined (destination state in parentheses) and the arc each pair adds to C. A bare ε means that the corresponding WFST stays in its current state.

Epsilon labels can either generate a wildcard match, or act as a null transition. This causes the corresponding WFST to either stay in the current state or skip to the next.

Step 1. Current state (0a,0b). Queue (K): (0a,0b)
    a:a (1a) with a:d (1b) gives a:d (1a,1b)
    Queue (K+1): (1a,1b)

Step 2. Current state (1a,1b). Queue (K): (1a,1b)
    ε (1a) with ε:e (2b) gives ε:e (1a,2b)
    b:ε (2a) with ε (1b) gives b:ε (2a,1b)
    b:ε (2a) with ε:e (2b) gives b:e (2a,2b)
    Queue (K+1): (1a,2b), (2a,1b), (2a,2b)

Step 3. Current state (1a,2b). Queue (K): (1a,2b), (2a,1b), (2a,2b)
    b:ε (2a) with ε (2b) gives b:ε (2a,2b)
    Queue (K+1): (2a,1b), (2a,2b)

Step 4. Current state (2a,1b). Queue (K): (2a,1b), (2a,2b)
    ε (2a) with ε:e (2b) gives ε:e (2a,2b)
    c:ε (3a) with ε (1b) gives c:ε (3a,1b)
    Queue (K+1): (2a,2b), (3a,1b)

Step 5. Current state (2a,2b). Queue (K): (2a,2b), (3a,1b)
    c:ε (3a) with ε (2b) gives c:ε (3a,2b)
    Queue (K+1): (3a,1b), (3a,2b)

Step 6. Current state (3a,1b). Queue (K): (3a,1b), (3a,2b)
    ε (3a) with ε:e (2b) gives ε:e (3a,2b)
    Queue (K+1): (3a,2b)

Step 7. Current state (3a,2b). Queue (K): (3a,2b)
    d:d (4a) with d:a (3b) gives d:a (4a,3b)
    Queue (K+1): (4a,3b)

Step 8. Current state (4a,3b). Queue (K): (4a,3b)
    No outgoing arcs in either WFST; no new arcs.
    Queue (K+1): empty

• Naive application of the algorithm generates many superfluous transitions as well as states
  – It is necessary to filter the composition operation in order to prevent the unnecessary generation of these elements
  – The final result should only include unique, valid paths
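The effect of the redundant paths can be made concrete by counting them. The sketch below assumes the composite arcs between (1a,1b) and (3a,2b) produced by the naive trace above, written as plain state pairs. The names `NAIVE_ARCS` and `count_paths` are illustrative.

```python
# Composite arcs generated by the naive epsilon handling, restricted to the
# region between (1, 1) and (3, 2): state -> list of next states.
NAIVE_ARCS = {
    (1, 1): [(1, 2), (2, 1), (2, 2)],
    (1, 2): [(2, 2)],
    (2, 1): [(2, 2), (3, 1)],
    (2, 2): [(3, 2)],
    (3, 1): [(3, 2)],
    (3, 2): [],
}


def count_paths(src, dst, arcs=NAIVE_ARCS):
    """Number of distinct paths from src to dst in an acyclic graph."""
    if src == dst:
        return 1
    return sum(count_paths(nxt, dst, arcs) for nxt in arcs[src])


n_paths = count_paths((1, 1), (3, 2))

# With all weights equal to 1, the probability semiring sums the path
# weights, while the tropical semiring takes their minimum (weights 0).
probability_total = sum(1.0 for _ in range(n_paths))
tropical_total = min(0.0 for _ in range(n_paths))
print(n_paths, probability_total, tropical_total)
```

All of these paths correspond to the same single pair of matching paths in the original transducers, so a non-idempotent semiring such as the probability semiring counts that weight several times. An idempotent semiring such as the tropical semiring is unaffected, because taking the minimum of identical weights changes nothing.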

Determinization (I)
• Eliminates ambiguity in the input paths
• Improves efficiency of downstream operations such as shortest-path
• Each state has at most one outgoing transition containing any particular input label

[Figure: automata a and a′]
Determinization of weighted automata. (a) Weighted automaton a over the tropical semiring. (a′) Equivalent weighted automaton a′ obtained by determinization of a. (Mohri, 2002)

Determinization (II)

Weighted-Determinization(A)
 1  i′ ← {(i, λ(i)) : i ∈ I}
 2  λ′(i′) ← 1̄
 3  Q ← {i′}
 4  while Q ≠ ∅ do
 5      p′ ← Head(Q)
 6      Dequeue(Q)
 7      for each x ∈ i[E[Q[p′]]] do
 8          w′ ← ⊕{v ⊗ w : (p, v) ∈ p′, (p, x, w, q) ∈ E}
 9          q′ ← {(q, ⊕{w′⁻¹ ⊗ (v ⊗ w) : (p, v) ∈ p′, (p, x, w, q) ∈ E}) : q = n[e], i[e] = x, e ∈ E[Q[p′]]}
10          E′ ← E′ ∪ {(p′, x, w′, q′)}
11          if q′ ∉ Q′ then
12              Q′ ← Q′ ∪ {q′}
13              if Q[q′] ∩ F ≠ ∅ then
14                  F′ ← F′ ∪ {q′}
15                  ρ′(q′) ← ⊕{v ⊗ ρ(q) : (q, v) ∈ q′, q ∈ F}
16              Enqueue(Q, q′)
17  return T′

set of transitions leaving these states, and i[E[Q[p′]]] the set of input labels of these transitions.

The states of the output automaton can be identified with (weighted) subsets of the states of the original automaton. A state r of the output automaton that can be reached from the start state by a path π is identified with the set of pairs (q, x) ∈ Q × S such that q can be reached from an initial state of the original machine by a path σ with i[σ] = i[π] and λ(p[σ]) ⊗ w[σ] = λ(p[π]) ⊗ w[π] ⊗ x. Thus, x can be viewed as the residual weight at state q.

Determinization does not terminate for all weighted automata. As we shall see, not all weighted automata are determinizable by the algorithm just described. When it terminates, the algorithm returns a subsequential weighted automaton A′ = (Σ, Q′, I′, F′, E′, λ′, ρ′), equivalent to the input A = (Σ, Q, I, F, E, λ, ρ).

The algorithm uses a queue Q containing the set of states of the resulting automaton A′, yet to be examined. The queue discipline of Q can be arbitrarily chosen and does not affect the termination of the algorithm. A′ admits a unique initial state, i′, defined as the set of initial states of A augmented with their respective initial weights. Its input weight is 1̄ (lines 1-2). Q originally contains only the subset i′ (line 3). At each execution of the loop of lines 4-16, a new subset p′ is extracted from Q (lines 5-6). For each x labeling at least one of the transitions leaving a state p of the subset p′, a new transition with input label x is constructed. The weight w′ associated to that transition is the sum of the weights of all transitions in E[Q[p′]] labeled with x pre-⊗-multiplied


Annotated steps of Weighted-Determinization:
• Line 8: compute new arc weight
• Lines 9-12: compute new states and residual weights
• Lines 13-15: compute new final weight, if necessary

Time exponential
Space exponential

Not the same!!

Not the same!!

Determinization (example)
Recall the properties of the tropical semiring: ⊕ = min, ⊗ = +, 1̄ = 0, and the ⊗-inverse w⁻¹ is −w. a is the input automaton and a′ the result. Each state of a′ is a set of pairs (input state ID, residual weight from original path).

Initialization. Queue (K): (0, 1̄) = (0,0)

Step 1. Dequeue (0,0); input label a.
    w′ = ⊕{(1̄ ⊗ 1), (1̄ ⊗ 2)} = min{(0 + 1), (0 + 2)} = 1
    q′ = {(1, ⊕{1⁻¹ ⊗ (1̄ ⊗ 1)}), (2, ⊕{1⁻¹ ⊗ (1̄ ⊗ 2)})}
       = {(1, min{−1 + (0 + 1)}), (2, min{−1 + (0 + 2)})}
       = {(1,0), (2,1)}
    New arc: { (0,0), a, 1, ((1,0),(2,1)) }
    Queue (K): ((1,0),(2,1))
    Note!

Step 2. Dequeue ((1,0),(2,1)); input label b.
    w′ = ⊕{(0 ⊗ 3), (1 ⊗ 3)} = min{(0 + 3), (1 + 3)} = 3
    q′ = {(1, ⊕{3⁻¹ ⊗ (0 ⊗ 3)}), (2, ⊕{3⁻¹ ⊗ (1 ⊗ 3)})}
       = {(1, min{−3 + (0 + 3)}), (2, min{−3 + (1 + 3)})}
       = {(1,0), (2,1)}
    New arc: { ((1,0),(2,1)), b, 3, ((1,0),(2,1)) }
    Queue (K): empty (the destination state already exists)

Step 3. Same state ((1,0),(2,1)); input label c.
    w′ = ⊕{(0 ⊗ 5)} = min{(0 + 5)} = 5
    q′ = {(3, ⊕{5⁻¹ ⊗ (0 ⊗ 5)})} = {(3, min{−5 + (0 + 5)})} = {(3,0)}
    New arc: { ((1,0),(2,1)), c, 5, (3,0) }
    Queue (K): (3,0)

Step 4. Same state ((1,0),(2,1)); input label d.
    w′ = ⊕{(1 ⊗ 6)} = min{(1 + 6)} = 7
    q′ = {(3, ⊕{7⁻¹ ⊗ (1 ⊗ 6)})} = {(3, min{−7 + (1 + 6)})} = {(3,0)}
    New arc: { ((1,0),(2,1)), d, 7, (3,0) }
    Queue (K): (3,0)

Step 5. Dequeue (3,0). No outgoing arcs; no new arcs.
    Queue (K): empty

a′ determinization result

Source           Arc    Destination
(0,0)            a/1    ((1,0),(2,1))
((1,0),(2,1))    b/3    ((1,0),(2,1))
((1,0),(2,1))    c/5    (3,0)
((1,0),(2,1))    d/7    (3,0)
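The trace can be reproduced with a compact sketch of weighted determinization over the tropical semiring. The arc weights are those appearing in the computations above. The final weight 0 at state 3, the initial weight 0, and the names `ARCS`, `INITIAL`, `FINAL`, and `determinize` are assumptions made for illustration. Subsets are stored as frozensets of (state, residual weight) pairs.

```python
from collections import deque

# Input automaton a: state -> list of (label, weight, next_state).
ARCS = {
    0: [("a", 1, 1), ("a", 2, 2)],
    1: [("b", 3, 1), ("c", 5, 3)],
    2: [("b", 3, 2), ("d", 6, 3)],
    3: [],
}
INITIAL = {0: 0}  # assumption: initial weight 1-bar = 0
FINAL = {3: 0}    # assumption: final weight 0 at state 3


def determinize(arcs, initial, final):
    """Tropical weighted determinization (may not terminate in general)."""
    start = frozenset(initial.items())
    queue, seen = deque([start]), {start}
    new_arcs, new_final = [], {}
    while queue:
        subset = queue.popleft()
        # Final weight: (+) over final members of residual (x) final weight.
        fin = [v + final[q] for q, v in subset if q in final]
        if fin:
            new_final[subset] = min(fin)
        labels = sorted({a for q, _ in subset for a, _, _ in arcs[q]})
        for x in labels:
            # New arc weight: (+) of residual (x) arc weight over label x.
            w_new = min(v + w for q, v in subset
                        for a, w, _ in arcs[q] if a == x)
            # Residuals: divide out w_new, keep the best per destination.
            residual = {}
            for q, v in subset:
                for a, w, r in arcs[q]:
                    if a == x:
                        cand = v + w - w_new
                        residual[r] = min(residual.get(r, float("inf")), cand)
            dest = frozenset(residual.items())
            new_arcs.append((subset, x, w_new, dest))
            if dest not in seen:
                seen.add(dest)
                queue.append(dest)
    return start, new_arcs, new_final


start, new_arcs, new_final = determinize(ARCS, INITIAL, FINAL)
for src, label, weight, dst in new_arcs:
    print(sorted(src), f"{label}/{weight}", sorted(dst))
```

The loop has no termination guard, so on an automaton that is not determinizable it would keep generating subsets with new residual weights.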

Determinization (III)
• Some issues with weighted determinization
  – May not terminate for some inputs, even if they are valid!
  – May blow up in memory
• The twins property can be utilized to test for determinizability

w(P(q, y, q)) = w(P(q′, y, q′))

Here q and q′ are two states that can be reached from the initial states by paths with the same input string, and P(q, y, q) is the set of cycles at q labeled with y.

[Figure: a WFST that cannot be determinized]
This WFST cannot be determinized, but why? The residual weights generated by these loops cannot be normalized.

See "Weighted Automata Algorithms", Mehryar Mohri, 2002, for details.

Generic Epsilon-Removal (I)
• Removes ε-transitions from the input transducer and produces a new, epsilon-free transducer equivalent to the original
• Preferable to remove ε-transitions prior to performing downstream processing or optimization, because they
  – Introduce delays
  – Cause problems or complications

[Figure: automata A and Aε]

Generic Epsilon-Removal (II)
• ε-closure computation of state p:

C(p) = {(q, w) : P(p, ε, q) ≠ ∅, w = d_ε[p, q]}

Epsilon-Removal(T)
 1  for each p ∈ Q do
 2      Compute-Closure(C(p))
 3  E′ ← {(p, a, b, w, q) ∈ E : (a, b) ≠ (ε, ε)}
 4  F′ ← F
 5  ρ′ ← ρ
 6  for each p ∈ Q do
 7      for each (q, w′) ∈ C[p] do
 8          E′[p] ← E′[p] ⊎ {(p, a, b, w′ ⊗ w, r) : (q, a, b, w, r) ∈ E′}
 9          if q ∈ F then
10              if p ∉ F then
11                  F′ ← F′ ∪ {p}
12                  ρ′[p] ← 0̄
13              ρ′[p] ← ρ′[p] ⊕ (w′ ⊗ ρ(q))
14  return T′ = (Σ, Δ, Q, I, F′, E′, λ, ρ′)

The proof is simple and is given in [40].

The ε-closures C(p) can be computed using an all-pairs shortest-distance algorithm over Tε when the semiring S is complete, or by applying the single-source shortest-distance algorithm from each source p when the semiring is k-closed, as described in Section 2. The complexity of the second stage of the algorithm (lines 6-13) is in O(|Q|² + |Q||E|) since in the worst case each C(p) contains all states of the transducer. Thus, the overall complexity of the algorithm is

O(|Compute-Closure| + |Q|² + |Q||E|(T⊕ + T⊗)), (30)

where O(|Compute-Closure|) denotes the total cost of the closure computation. We are now examining several special cases of practical interest:

• Tε is acyclic, that is T admits no ε-cycle, a rather frequent case in practice. In that case, the single-source shortest-distance algorithm can be used for any semiring S and has linear time complexity. The total complexity of applying the algorithm at each state is O(|Q|² + |Q||E|(T⊕ + T⊗)) and matches that of the second stage. Thus, the overall complexity of epsilon-removal is then

O(|Q|² + |Q||E|(T⊕ + T⊗)). (31)

The algorithm can in fact be improved in that special case. The complexity of the computation of the all-pairs shortest distances can be substantially improved if the states of Tε are visited in reverse topological order and if the single-source shortest-distance algorithm is interleaved with the actual removal of εs as follows: for each state p of Tε visited in reverse topological order,

– run a single-source shortest-distance algorithm with source p to compute the distance from p to each state q in Tε;

6 Optimization algorithms

6.1 Epsilon-Removal

The use of various automata or transducer operations such as rational operations generates ε-transitions. These transitions cause some delay in the use of the resulting transducers since the search for an alphabet symbol to match in composition or other similar operations requires reading some sequences of εs first. To make these weighted transducers more efficient to use, it may be preferable to remove all ε-transitions, that is to create an equivalent weighted transducer with no ε-transition. This section describes a general ε-removal algorithm that precisely achieves this task.

Simply removing ε-transitions from the input transducer clearly does not result in an equivalent one. Instead, for a given state p, the non-ε-transitions of all states q reachable from p via ε-transitions should be added to those of p. In the weighted case, this does not result in an equivalent transducer since the weights of the ε-transitions from p to q would be ignored. Thus, before adding an outgoing transition of state q to p, the weight it carries must be pre-⊗-multiplied by the sum of the weights of the ε-paths from p to q. To ensure that this weight is an element of the semiring, we will assume that S is complete.11

This leads to a two-step algorithm [40]. Given a transducer T, let Tε denote the transducer derived from T by keeping only ε-transitions and let d_ε[p, q] = ⊕_{π∈P(p,ε,q)} w[π] denote the distance from state p to state q in Tε. Then, the following are the two main steps of the algorithm:

• ε-closure computation: at each state p, the weighted ε-closure defined by

C(p) = {(q, w) : P(p, ε, q) ≠ ∅, w = d_ε[p, q]} (28)

is computed.
• actual removal of εs: all ε-transitions are removed and for each p and each (q, w) ∈ C(p), the transition set of p is augmented with the following transitions

{(p, a, b, d_ε[p, q] ⊗ w, r) : (q, a, b, w, r) ∈ E, (a, b) ≠ (ε, ε)}. (29)

If there exists (q, w) ∈ C(p) with q ∈ F, then d_ε[p, q] ⊗ ρ(q) must be ⊕-added to the final weight of p.

The following is the pseudocode of the algorithm which follows the main steps just discussed.

Theorem 4 ([40]). Let T be a weighted transducer over a complete semiring S. Assume that the closures C(p) can be computed for any state p of T. Then, the weighted transducer T′ returned by the epsilon-removal algorithm just described is equivalent to T.

11 The results presented also hold in the case of closed semirings.
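The two steps can be sketched in Python for the tropical semiring. The sketch assumes non-negative weights and no ε-cycles, so that Dijkstra's algorithm yields d_ε[p, q] and p itself is left out of C(p). The small transducer and the names `EPS`, `eps_closure`, and `remove_epsilons` are hypothetical and chosen only for illustration.

```python
import heapq
import math

EPS = "<eps>"  # hypothetical marker for the epsilon label


def eps_closure(p, arcs):
    """C(p) as {q: d_eps[p, q]}: tropical shortest distances over eps arcs."""
    dist = {p: 0.0}
    heap = [(0.0, p)]
    while heap:
        d, q = heapq.heappop(heap)
        if d > dist[q]:
            continue  # stale heap entry
        for src, a, b, w, r in arcs:
            is_eps = a == EPS and b == EPS
            if src == q and is_eps and d + w < dist.get(r, math.inf):
                dist[r] = d + w
                heapq.heappush(heap, (d + w, r))
    del dist[p]  # assumption: no eps-cycles, so p is not in its own closure
    return dist


def remove_epsilons(states, arcs, final):
    """Arcs are (src, input, output, weight, dst); final maps state -> weight."""
    non_eps = [e for e in arcs if (e[1], e[2]) != (EPS, EPS)]
    new_arcs = list(non_eps)
    new_final = dict(final)
    for p in states:
        for q, w_eps in eps_closure(p, arcs).items():
            # Copy the non-eps arcs of q to p, pre-multiplied by d_eps[p, q].
            new_arcs += [(p, a, b, w_eps + w, r)
                         for s, a, b, w, r in non_eps if s == q]
            if q in final:
                # (+)-add d_eps[p, q] (x) final weight of q to p's final weight.
                new_final[p] = min(new_final.get(p, math.inf),
                                   w_eps + final[q])
    return new_arcs, new_final


# Hypothetical three-state example.
EXAMPLE_ARCS = [
    (0, EPS, EPS, 1.0, 1),
    (1, EPS, EPS, 0.5, 2),
    (1, "a", "a", 2.0, 2),
    (0, "b", "b", 4.0, 2),
]
new_arcs, new_final = remove_epsilons([0, 1, 2], EXAMPLE_ARCS, {2: 0.0})
print(new_arcs)
print(new_final)
```

In this example state 0 inherits the arc a:a from state 1 with weight 1.0 + 2.0, and both state 0 and state 1 become final because the final state 2 lies in their ε-closures.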


Algorithm complexity*
Time O(V² log V + VE)
Space O(VE)
V = #states, E = #arcs. *Tropical semiring (www.OpenFST.org)




Annotated steps of Epsilon-Removal(T):
• Initialize the new WFST with all non-epsilon arcs and states
• Compute the ε-closure for each state p
• Compute new, non-epsilon transitions for each state p, from its closure, C[p]
• Handle final states and weights
• The epsilon-closure of p is the set of states reachable from p via epsilon transitions



• Disjoint union (⊎): A ∪ B with A ∩ B = ∅

Generic Epsilon-Removal (example)
[Figures: automata A and Aε at each step; graphics not recoverable from the extraction.]

1. Algorithm initialization.
2. Copy all states and non-epsilon arcs from the input transducer.
3. For each state in the original input transducer, compute the epsilon-closure: the states that are accessible via epsilon-arcs.
4. Generate new, non-epsilon arcs based on the epsilon closure. Similarly, generate new final states where necessary.
5. States 2 and 3 are not co-accessible. These may be deleted along with associated transitions.

Generic Epsilon-Removal (III)
• Epsilon removal may be employed to remove arbitrary symbols other than ε
  – Relabel the ε to an alternative, map other desired symbols to epsilon, perform generic epsilon removal, and restore the original labels
• Choice of semiring affects the time complexity of the algorithm
  – Tropical semiring is faster because it takes advantage of the Viterbi approximation for the shortest path computation
  – Other semirings exhibit exponential time complexity
  – Un-weighted epsilon-removal exhibits time complexity O(V² + VE)


Other algorithms
• Minimization
  – Given an input WFST, produces a 'minimal' version which is guaranteed to have the smallest possible number of states while preserving the input language and the weight/path properties of the original
  – [Figure: A and Minimize(A)]
• Projection
  – Turns a transducer into an acceptor by projecting either just the input language or just the output language
  – [Figure: A and Project(A)]
• Weight-pushing
  – Pushes arc weights forward or backward, accumulating and/or distributing them where appropriate according to the semiring
  – [Figure: A and Push(A)]

Limitations
• Weighted Finite-State Transducers provide a powerful and flexible framework for performing many operations; however, they do have some limitations
  – Expressive power: WFSTs are equivalent in terms of representational power to regular languages, and thus they suffer from the same limitations. In particular, there are certain grammars which cannot be represented by WFSTs, e.g., B^N A^N where N is not specified at compile time. Although this may be approximated for various values of N, it cannot be explicitly represented by a WFST or a regular expression.
  – Computing resources: Large static graphs such as those often employed for LVCSR tasks may require large amounts of memory, space, and time to construct, compute, and store.
