Finite state subautomata
Application to Electronic Dictionaries

Lamia Tounsi
Polytech'Tours, Computer Science Laboratory
François Rabelais University of Tours, France


Motivation

• DFSA are widely used in natural language processing.
• Goal: find all substructures in a given FSA.
• Searching for subautomata in a DFSA makes it possible to:
  - decompose a very large FSA into smaller ones,
  - discover frequently occurring data,
  - reduce memory consumption.

Plan

• Mathematical preliminaries
  - Automaton
  - Subautomaton
• Search for subautomata
  - Smallest closed subautomaton (SCSA)
  - Smallest subautomaton (SSA)
• Application to automata representing dictionaries
• Indexing and compression
• Conclusion


Automaton

A deterministic acyclic automaton A = <Σ, Q, δ, qi, qf>:
• Σ is the alphabet
• Q is the finite set of states
• δ is the transition function, δ : Q × Σ → Q
• qi is the initial state (qi ∈ Q)
• qf is the final state (qf ∈ Q)

Let a ∈ Σ and w ∈ Σ*:
• δ(p, ε) = p
• δ(p, wa) = δ(δ(p, w), a)
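The recursive definition of δ on words can be sketched in a few lines of Python. The transition table below is a hypothetical toy automaton (it is not the automaton shown on the slides), and the names DELTA, delta_star and accepts are illustrative only.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy automaton: it accepts "ab" and "cb"; 0 is qi and 3 is qf.
DELTA: Dict[Tuple[int, str], int] = {
    (0, "a"): 1,
    (0, "c"): 2,
    (1, "b"): 3,
    (2, "b"): 3,
}
Q_I, Q_F = 0, 3


def delta_star(p: int, w: str) -> Optional[int]:
    """Extended transition: delta(p, eps) = p, delta(p, wa) = delta(delta(p, w), a)."""
    if w == "":
        return p
    q = delta_star(p, w[:-1])      # state reached after reading the prefix w
    if q is None:
        return None                # delta is partial: the prefix already failed
    return DELTA.get((q, w[-1]))   # read the last symbol a


def accepts(w: str) -> bool:
    """A word is recognized when it leads from qi to qf."""
    return delta_star(Q_I, w) == Q_F
```

Here the transition function is stored as a partial map, so an undefined transition simply yields None.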

Successors & predecessors

Succ(p) = {q ∈ Q : ∃a ∈ Σ, δ(p, a) = q}
Succ*(p) = {q ∈ Q : ∃w ∈ Σ*, δ(p, w) = q}

Pred(p) = {q ∈ Q : ∃a ∈ Σ, δ(q, a) = p}
Pred*(p) = {q ∈ Q : ∃w ∈ Σ*, δ(q, w) = p}

Height:
• H(qf) = 0
• H(p) = Max{H(q) : q ∈ Succ(p)} + 1
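As a minimal sketch of these definitions, the following Python computes Succ, Pred, Succ* and H on a hypothetical acyclic automaton. The dictionary layout (state to symbol to target) is an assumption made for the example, and the height function relies on qf being the only state without successors.

```python
from functools import lru_cache

# Hypothetical acyclic automaton: state -> {symbol: target}; 0 is qi, 4 is qf.
DELTA = {
    0: {"a": 1, "b": 2},
    1: {"c": 3},
    2: {"c": 3, "d": 4},
    3: {"e": 4},
    4: {},
}


def succ(p):
    """Succ(p): states reached from p by one transition."""
    return set(DELTA[p].values())


def pred(p):
    """Pred(p): states with a transition into p."""
    return {q for q in DELTA if p in DELTA[q].values()}


def succ_star(p):
    """Succ*(p): states reached from p by some word, including p (empty word)."""
    seen, stack = {p}, [p]
    while stack:
        for q in succ(stack.pop()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


@lru_cache(maxsize=None)
def height(p):
    """H(qf) = 0 and H(p) = max(H(q) for q in Succ(p)) + 1."""
    s = succ(p)
    return 0 if not s else max(height(q) for q in s) + 1
```

With this definition the height of a state is the length of the longest path from it to the final state.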

Automaton

[Figure: an automaton that recognizes the inflected forms of nine verbs. In this automaton, H(14) = 4 and H(13) = 5.]

Source (E) & Initial State (p)

Let E ⊆ Q, E ≠ ∅
• AP(E) = {w : w is a path from qi to p, p ∈ E}
• AN(E) = {p ∈ Q : ∀w ∈ AP(E), p ∈ w}
• Source(E) is the state of AN(E) with minimal height: Source(E) ∈ AN(E) and H(Source(E)) = Min{H(q) : q ∈ AN(E)}

Let p ∈ Q, p ≠ qi
• IS(p) = Source(Pred(p))
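One possible way to compute Source(E), shown here only as a sketch, uses the fact that a state d lies on every path from qi to the states of E exactly when no state of E remains reachable from qi once d is deleted. The graph SUCC is a hypothetical example, and the function names are illustrative.

```python
# Hypothetical acyclic automaton as state -> set of successors; 0 is qi, 6 is qf.
SUCC = {0: {1}, 1: {2, 3}, 2: {4}, 3: {4, 5}, 4: {6}, 5: {6}, 6: set()}
QI = 0


def height(p):
    """H(qf) = 0, H(p) = 1 + max height of the successors."""
    return 0 if not SUCC[p] else 1 + max(height(q) for q in SUCC[p])


def reachable_avoiding(removed):
    """States reachable from qi when the state `removed` is deleted."""
    if removed == QI:
        return set()
    seen, stack = {QI}, [QI]
    while stack:
        for q in SUCC[stack.pop()]:
            if q != removed and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def an(E):
    """AN(E): states that lie on every path from qi to a state of E."""
    return {d for d in SUCC if not (set(E) & reachable_avoiding(d))}


def source(E):
    """Source(E): the state of AN(E) with minimal height."""
    return min(an(E), key=height)


def initial_state(p):
    """IS(p) = Source(Pred(p)), defined for p != qi."""
    preds = {q for q in SUCC if p in SUCC[q]}
    return source(preds)
```

This brute-force version runs one traversal per candidate state; it is meant to illustrate the definition, not to match the complexity of the method presented in the talk.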


Source (E) & Initial State (p)

Source(q2, q3, q5) = Source(q3, q4) = q2

Source(q3, q4, q5) = Source(q3, q4, q5 , q6) = q1

IS(q3)= q2

IS(q5)= q1

IS(q6)= q1

Sink (E) & Final State (p)

Let E ⊆ Q, E ≠ ∅
• PP(E) = {w : w is a path from p to qf, p ∈ E}
• PN(E) = {p ∈ Q : ∀w ∈ PP(E), p ∈ w}
• Sink(E) is the state of PN(E) with maximal height: Sink(E) ∈ PN(E) and H(Sink(E)) = Max{H(q) : q ∈ PN(E)}

Let p ∈ Q, p ≠ qf
• FS(p) = Sink(Succ(p))

Subautomaton (SA)

A' = <Σ, Q', δ', si, sf> is a subautomaton of A iff:
• Q' ⊆ Q
• {si, sf} ⊆ Q'
• δ' : Q' × Σ → Q' and ∀(q, a) ∈ Q' × Σ : δ'(q, a) = δ(q, a)
• ∀q ∈ Q' : q ∈ Succ*(si) and q ∈ Pred*(sf)
• ∀q ∈ Q' \ {si, sf} : Succ(q) ⊆ Q' and Pred(q) ⊆ Q'
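The conditions on Q' can be checked mechanically. The sketch below tests the state-set conditions only (it takes δ' to be the restriction of δ to Q'), on a hypothetical graph; SUCC, PRED and is_subautomaton are names chosen for the example.

```python
# Hypothetical acyclic automaton as state -> set of successors; 0 is qi, 6 is qf.
SUCC = {0: {1}, 1: {2, 3}, 2: {4}, 3: {4, 5}, 4: {6}, 5: {6}, 6: set()}
PRED = {p: {q for q in SUCC if p in SUCC[q]} for p in SUCC}


def closure(start, edges):
    """Reflexive-transitive closure of `start` along `edges` (Succ* or Pred*)."""
    seen, stack = {start}, [start]
    while stack:
        for q in edges[stack.pop()]:
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def is_subautomaton(si, sf, Qp):
    """Check the conditions on Q' for the candidate subautomaton (si, sf, Q')."""
    Qp = set(Qp)
    # {si, sf} is included in Q', which is included in Q
    if not ({si, sf} <= Qp <= set(SUCC)):
        return False
    # every state of Q' is in Succ*(si) and in Pred*(sf)
    if not (Qp <= closure(si, SUCC) and Qp <= closure(sf, PRED)):
        return False
    # inner states keep all their successors and predecessors inside Q'
    return all(SUCC[q] <= Qp and PRED[q] <= Qp for q in Qp - {si, sf})
```

In this toy graph, (3, 6, {3, 5, 6}) passes the test although state 3 also has a successor outside Q', because only the inner states are constrained.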

Subautomaton (SA)

[Figure, repeated over four slides: the automaton that recognizes the inflected forms of nine verbs, with a subautomaton (SA) marked on each slide.]

Closed subautomaton (CSA)

Let Q' ⊆ Q and let si, sf be two distinct states.

A subautomaton A' = <Σ, Q', δ', si, sf> is a closed subautomaton iff:
• ∀q ∈ Q' \ {si} : Pred(q) ⊆ Q'
• ∀q ∈ Q' \ {sf} : Succ(q) ⊆ Q'
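The two closure conditions translate directly into code. This is a sketch on a hypothetical graph; it assumes that (si, sf, Q') has already been verified to be a subautomaton and only tests the additional conditions of the closed case.

```python
# Hypothetical acyclic automaton as state -> set of successors; 0 is qi, 6 is qf.
SUCC = {0: {1}, 1: {2, 3}, 2: {4}, 3: {4, 5}, 4: {6}, 5: {6}, 6: set()}
PRED = {p: {q for q in SUCC if p in SUCC[q]} for p in SUCC}


def is_closed(si, sf, Qp):
    """Closure conditions of a CSA for a candidate (si, sf, Q')."""
    Qp = set(Qp)
    # every state except si receives transitions only from inside Q'
    preds_ok = all(PRED[q] <= Qp for q in Qp - {si})
    # every state except sf sends transitions only to states inside Q'
    succs_ok = all(SUCC[q] <= Qp for q in Qp - {sf})
    return preds_ok and succs_ok
```

Compared with a plain subautomaton, the constraint now also applies to sf (for predecessors) and to si (for successors), so the only entry point is si and the only exit point is sf.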

Closed subautomaton (CSA)

[Figure, repeated over three slides: the automaton that recognizes the inflected forms of nine verbs, with a closed subautomaton (CSA) marked on each slide.]

Smallest closed subautomaton (SCSA)

Let Q' ⊆ Q and let si, sf be two distinct states.

A closed subautomaton A' = <Σ, Q', δ', si, sf> is a smallest closed subautomaton iff, ∀q ∈ Q':
• (si, q) is a CSA ⇒ q = sf
• (q, sf) is a CSA ⇒ q = si

Smallest closed subautomaton (SCSA)

[Figure: the automaton that recognizes the inflected forms of nine verbs, with four regions labelled SCSA.]

Smallest subautomaton (SSA)

Let p ∈ Q \ {si, sf}

The subautomaton A' = <Σ, Q', δ', si, sf> is SSA(p) iff:
- A' strictly contains p
- ∀A'' = <Σ, Q'', δ'', s''i, s''f> which strictly contains p : Q' ⊆ Q''

Smallest subautomaton (SSA)

[Figure: the automaton that recognizes the inflected forms of nine verbs, with SSA(6) and SSA(18) marked.]


Search for SCSA

Property 1.
(si, sf) is a SCSA iff IS(sf) = si and FS(si) = sf

Property 2. (Associativity)
If E = E1 ∪ E2 with E1 ≠ ∅ and E2 ≠ ∅, then
Source(E) = Source({Source(E1), Source(E2)})

Property 3. (Hierarchy between two SCSA)
• Either they have no common transitions,
• or one is strictly included in the other.
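Property 1 suggests a direct test for SCSA: compute IS and FS for every state and keep the pairs that point at each other. The sketch below does this by brute force on a hypothetical graph. The helper on_every_path, which computes AN(E) from qi along Succ or PN(E) from qf along Pred, is an illustrative name and not part of the original method.

```python
# Hypothetical acyclic automaton as state -> set of successors; 0 is qi, 6 is qf.
SUCC = {0: {1}, 1: {2, 3}, 2: {4}, 3: {4, 5}, 4: {6}, 5: {6}, 6: set()}
PRED = {p: {q for q in SUCC if p in SUCC[q]} for p in SUCC}
QI, QF = 0, 6


def height(p):
    return 0 if not SUCC[p] else 1 + max(height(q) for q in SUCC[p])


def on_every_path(E, root, edges):
    """States lying on every path between `root` and a state of E.

    With (QI, SUCC) this is AN(E); with (QF, PRED) this is PN(E).
    """
    result = set()
    for d in SUCC:
        seen, stack = set(), []
        if d != root:
            seen, stack = {root}, [root]
        while stack:
            for q in edges[stack.pop()]:
                if q != d and q not in seen:
                    seen.add(q)
                    stack.append(q)
        if not (set(E) & seen):      # deleting d cuts every path to E
            result.add(d)
    return result


def source(E):
    return min(on_every_path(E, QI, SUCC), key=height)


def sink(E):
    return max(on_every_path(E, QF, PRED), key=height)


def IS(p):
    return source(PRED[p])


def FS(p):
    return sink(SUCC[p])


def scsa_pairs():
    """Pairs (si, sf) such that IS(sf) = si and FS(si) = sf (Property 1)."""
    return [(IS(sf), sf) for sf in SUCC if sf != QI and FS(IS(sf)) == sf]
```

On this toy graph the test keeps the single transition from 0 to 1 and the diamond-shaped block between 1 and 6; the pair (0, 6) is rejected because IS(6) = 1.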

Search for SCSA

Let p ∈ Q
1. p.IS: initial state associated to p.
2. p.FSmin: minimal final state associated to p, assuming that p is the initial state of a SCSA.
3. p.FSmax: maximal final state associated to p, assuming that p is the initial state of a SCSA.

Property 4.
∀p > qi, (p.IS, p) is a SCSA iff p.IS.FSmin ≤ p ≤ p.IS.FSmax

Complexity of the algorithm: O(n²)

Search for SCSA

[Figure, repeated over two slides: an automaton with the labels is, FSmin and FSmax.]

Search for SSA

Let A' = <Σ, Q', δ', si, sf> be a subautomaton.

Property 5.
• ∀E ⊆ Q' \ {sf} : Succ*(si) ∩ Pred*(E) ⊆ Q'
• ∀E ⊆ Q' \ {si} : Pred*(sf) ∩ Succ*(E) ⊆ Q'

SSA associated to grey states

[Figure, built up over five slides: a set E of grey states, then its Source, then its Sink.]

Search for SSA

Property 6.
Let p, p', q, q' ∈ Q such that
• {p, p'} ⊆ Pred(q) and {q, q'} ⊆ Succ(p)
• H(p') ≥ H(p) and H(q') ≤ H(q)

Then p and q belong to the same SSA.

All subautomata of an automaton

Algorithm. Input: A. Output: subautomata.

repeat
  repeat
    Detect, store and replace each parallel by one transition;
    Detect, store and replace each sequence by one transition;
  until the automaton is free of all its parallels and sequences
  Detect, store and replace each smallest subautomaton by one transition;
until the automaton A is reduced to a single transition

Valdes J., Tarjan R. E., Lawler E. L., The recognition of series parallel digraphs, SIAM J. Comput. 11(2):298-313, 1982.
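The inner loop of the algorithm (parallel and sequence reductions) can be sketched as follows. This is a simplified illustration under several assumptions: the automaton is given as a list of labelled edges, merged labels are written as regular-expression-like strings, the "store" step is omitted, and the replacement of smallest subautomata in the outer loop is not covered.

```python
def reduce_series_parallel(edges, qi, qf):
    """Apply parallel and sequence reductions until none applies.

    edges: list of (source, label, target) triples. Returns the reduced list.
    """
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        # Parallel: merge all transitions sharing the same source and target.
        groups = {}
        for s, lab, t in edges:
            groups.setdefault((s, t), []).append(lab)
        if any(len(labs) > 1 for labs in groups.values()):
            edges = [
                (s, "(" + "|".join(sorted(labs)) + ")" if len(labs) > 1 else labs[0], t)
                for (s, t), labs in groups.items()
            ]
            changed = True
            continue
        # Sequence: remove one inner state with exactly one incoming
        # and one outgoing transition, concatenating the two labels.
        states = sorted({s for s, _, _ in edges} | {t for _, _, t in edges})
        for q in states:
            ins = [e for e in edges if e[2] == q]
            outs = [e for e in edges if e[0] == q]
            if q not in (qi, qf) and len(ins) == 1 and len(outs) == 1:
                edges = [e for e in edges if e not in (ins[0], outs[0])]
                edges.append((ins[0][0], ins[0][1] + outs[0][1], outs[0][2]))
                changed = True
                break
    return edges


# Hypothetical automaton recognizing "abs" and "cbs".
EXAMPLE = [(0, "a", 1), (1, "b", 3), (0, "c", 2), (2, "b", 3), (3, "s", 4)]
REDUCED = reduce_series_parallel(EXAMPLE, qi=0, qf=4)
```

If the reduced list still contains more than one edge, the automaton is not series-parallel, which is the situation where the outer loop of the algorithm has to replace a smallest subautomaton.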

All subautomata of an automaton

[Figures, six slides: illustration of the algorithm on an example automaton.]


Dictionaries and automata

10 dictionaries, with words in lexicographic order:

• 6 Delaf: French, English, Serbian, German, English multiword (polylexical) units, French cities.
• 4 Web: French, Hungarian, Bulgarian and Portuguese.

Properties of the automata: finite set of states, acyclic, deterministic, unique initial state, unique final state, minimal.

Internal structure of automata

[Figures, two slides: not recoverable from the transcript.]

Experimental Results

[Figure or table: not recoverable from the transcript.]


Factorization, indexing and compression

The search for subautomata detects sequences and parallels.

[Figure: a sequence subautomaton and a parallel subautomaton.]

Proposal:
- apply the directed acyclic word graph (DAWG), initially designed for indexing text, to index the subautomata;
- use a heuristic to select the most interesting substructures to factorize.

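Storage of an automaton

[Figure: example automaton; surviving labels: c, c, d, d.]

Address  Boolean  Character  Arrival state address
1        1        a          8
2        0        c          3
3        1        a          5
4        0        b          6
5        1        b          7
6        1        c          10
7        1        c          9
8        1        b          11
9        1        d          0
10       1        d          11
11       1        b          0

Size of the fields of each entry:
• Boolean
• Character: log2(|Σ|) bits
• Arrival state address: log2(Max address + 1) bits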
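As a worked example of the field sizes, the sketch below computes the storage cost of the eleven entries of the table. Two points are assumptions rather than statements from the slides: the Boolean is taken to occupy one bit, and the logarithms are rounded up to whole bits.

```python
from math import ceil, log2

# The eleven entries of the table: (Boolean, character, arrival state address).
ENTRIES = [
    (1, "a", 8), (0, "c", 3), (1, "a", 5), (0, "b", 6), (1, "b", 7), (1, "c", 10),
    (1, "c", 9), (1, "b", 11), (1, "d", 0), (1, "d", 11), (1, "b", 0),
]


def bits_per_entry(alphabet_size, max_address):
    flag_bits = 1                              # assumption: one bit for the Boolean
    char_bits = ceil(log2(alphabet_size))      # log2(|Sigma|), rounded up
    addr_bits = ceil(log2(max_address + 1))    # log2(Max address + 1), rounded up
    return flag_bits + char_bits + addr_bits


alphabet = {c for _, c, _ in ENTRIES}                   # {a, b, c, d}
max_address = max(addr for _, _, addr in ENTRIES)       # 11
entry_bits = bits_per_entry(len(alphabet), max_address)
total_bits = entry_bits * len(ENTRIES)
```

Under these assumptions each entry takes 1 + 2 + 4 = 7 bits, so the whole table takes 77 bits.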

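Factorization

[Figure: the example automaton and its factorized form; surviving labels: c, c, d, d, b, a, cb.]

After factorization (7 entries; "?" marks a character lost in extraction):

Address  Boolean  Character  Arrival state address
1        1        a          5
2        0        c          3
3        1        a          7
4        0        ?          6
5        1        b          6
6        1        b          0
7        1        ?          0

Before factorization (11 entries):

Address  Boolean  Character  Arrival state address
1        1        a          8
2        0        c          3
3        1        a          5
4        0        b          6
5        1        b          7
6        1        c          10
7        1        c          9
8        1        b          11
9        1        d          0
10       1        d          11
11       1        b          0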

Factorization

[Figure: three successive factorization steps on the example automaton; surviving labels: b, bb, c, c, d, d.]

How can we choose the subautomata to factorize?

- The best candidates for factorization are those that improve storage efficiency and reduce the size of the initial automaton.

Profit = saved memory - consumed memory

- Memory is saved by eliminating all occurrences of the substructure.
- Memory is consumed by the extension of the alphabet and of the index.

Directed acyclic word graph (DAWG)

The frequency and the profit associated with each sequence are computed with a DAWG.

[Figure: DAWG of the word aabba.]

Greedy Algorithm of Compression

Algorithm. Input: A. Output: A, Alphabet.

Iterative process:
  Select the best sequence s from the DAWG
  Extend the alphabet to represent s
  Delete s from A and from the DAWG
  Update the DAWG
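The greedy loop can be illustrated on plain words instead of automaton transitions. The sketch below is a simplification under stated assumptions: frequencies are obtained by naive substring counting rather than with a DAWG, every symbol is assumed to cost one unit, and the cost of a new index entry is assumed to be the length of the sequence plus one. The names greedy_compress, count_occurrences and replace are illustrative.

```python
def count_occurrences(word, seq):
    """Non-overlapping occurrences of seq in word, scanning left to right."""
    i = n = 0
    while i + len(seq) <= len(word):
        if word[i:i + len(seq)] == seq:
            n += 1
            i += len(seq)
        else:
            i += 1
    return n


def replace(word, seq, symbol):
    """Rewrite word with every occurrence of seq replaced by the new symbol."""
    out, i = [], 0
    while i < len(word):
        if word[i:i + len(seq)] == seq:
            out.append(symbol)
            i += len(seq)
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def greedy_compress(words, max_len=4):
    """Repeatedly factorize the most profitable sequence while profit > 0."""
    words = [tuple(w) for w in words]
    index = {}                                  # new symbol -> sequence it replaces
    while True:
        candidates = {w[i:i + n] for w in words
                      for n in range(2, max_len + 1)
                      for i in range(len(w) - n + 1)}
        best, best_profit = None, 0
        for seq in sorted(candidates):
            freq = sum(count_occurrences(w, seq) for w in words)
            saved = freq * (len(seq) - 1)       # each occurrence becomes one symbol
            consumed = len(seq) + 1             # assumed cost of the index entry
            if saved - consumed > best_profit:
                best, best_profit = seq, saved - consumed
        if best is None:
            return words, index
        symbol = f"<{len(index)}>"              # extend the alphabet
        index[symbol] = best
        words = [replace(w, best, symbol) for w in words]


WORDS, INDEX = greedy_compress(["abx", "aby", "abz", "abw"], max_len=3)
```

In this example only the sequence "ab" has a positive profit (4 occurrences saving 1 unit each, against an assumed cost of 3), so one new symbol is created and the loop stops.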

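Compression FCM

[Figure: compression results for FCM.]

Compression FCNM

[Figure: compression results for FCNM.]

Compression FCDic

[Figure: compression results for FCDic.]

Best Compressions

[Figure, repeated over two slides: best compressions; surviving labels: 1024, FCNM, FCNM.]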


Conclusion

Search for two kinds of smallest subautomata

Statistical analysis of the internal structure of some automata associated with dictionaries

Compression method based on the factorization of sequence or parallel subautomata

A minimized automaton does not always lead to the best compression.

Future work

Factorize more kinds of subautomata,

Find a way to de-minimize an automaton in order to get a better compression,

Work on alternative encodings of automata, for example a depth-first encoding.