ternary directed acyclic word graphs (tdawg)


Post on 19-Jan-2016


DESCRIPTION

Ternary Directed Acyclic Word Graphs (TDAWG). Satoru Miyamoto, Shunsuke Inenaga, Masayuki Takeda and Ayumi Shinohara. Presented by Peera Liewlom (The Last Algorithm Group). CIAA 2003. Eighth International Conference on Implementation and Application of Automata.

TRANSCRIPT

Page 1: Ternary Directed Acyclic Word Graphs  (TDAWG)

Ternary Directed Acyclic Word Graphs (TDAWG)

Satoru Miyamoto, Shunsuke Inenaga, Masayuki Takeda and Ayumi Shinohara

Presented by Peera Liewlom (The Last Algorithm Group)

Page 2: Ternary Directed Acyclic Word Graphs  (TDAWG)

CIAA 2003

• Eighth International Conference on Implementation and Application of Automata

• July 16-18, 2003, Santa Barbara, CA, USA

• Topic / Committee / Community

Page 3: Ternary Directed Acyclic Word Graphs  (TDAWG)

Why did I select this paper?

• DAWG dates from 1985, which is not so long ago
• Continuing development
• cDAWG, ASDAWG, morphic DAWG, WDAWG, SDAWG, two-tree DAWG, DASG, CSDAWG, etc.
• TST: 1997-98; TDAWG: 2003
• DAWG: widely applied in bioinformatics, NLP, graph theory, string matching, automata, etc.
• Speed and space trends in huge data management
• Topic for the Algorithm Group
• Matches the topics of interest in this seminar group

Page 4: Ternary Directed Acyclic Word Graphs  (TDAWG)

Content

• DFA (used in string matching problems)

• DAWG

• Ternary Search Tree

• Paper : TDAWG, Experiment & Result

• Paper : Conclusion

• Paper : Discussion

Page 5: Ternary Directed Acyclic Word Graphs  (TDAWG)

DFA

Deterministic Finite Automata

Page 6: Ternary Directed Acyclic Word Graphs  (TDAWG)

Formalities

• Deterministic Finite Accepter (DFA): $M = (Q, \Sigma, \delta, q_0, F)$

$Q$ : set of states

$\Sigma$ : input alphabet

$\delta$ : transition function

$q_0$ : initial state

$F$ : set of final states

Page 7: Ternary Directed Acyclic Word Graphs  (TDAWG)

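Set of States $Q$

[Figure: DFA state diagram. The chain $q_0 \to q_1 \to q_2 \to q_3 \to q_4$ is labelled a, b, b, a; the remaining transitions lead to the trap state $q_5$, which loops on a, b.]

$Q = \{q_0, q_1, q_2, q_3, q_4, q_5\}$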

Page 8: Ternary Directed Acyclic Word Graphs  (TDAWG)

Input Alphabet $\Sigma$

[Figure: the same DFA state diagram, states $q_0$ to $q_5$, with the edge labels marked.]

$\Sigma = \{a, b\}$

Page 9: Ternary Directed Acyclic Word Graphs  (TDAWG)

Initial State $q_0$

[Figure: the same DFA state diagram, states $q_0$ to $q_5$, with the initial state $q_0$ marked.]

Page 10: Ternary Directed Acyclic Word Graphs  (TDAWG)

Set of Final States $F$

[Figure: the same DFA state diagram, states $q_0$ to $q_5$, with the final state $q_4$ marked.]

$F = \{q_4\}$

Page 11: Ternary Directed Acyclic Word Graphs  (TDAWG)

Transition Function $\delta$

[Figure: the same DFA state diagram, states $q_0$ to $q_5$.]

$\delta : Q \times \Sigma \to Q$

Page 12: Ternary Directed Acyclic Word Graphs  (TDAWG)

$\delta(q_0, a) = q_1$

[Figure: the same DFA state diagram, with the transition from $q_0$ to $q_1$ on a marked.]

Page 13: Ternary Directed Acyclic Word Graphs  (TDAWG)

$\delta(q_0, b) = q_5$

[Figure: the same DFA state diagram, with the transition from $q_0$ to $q_5$ on b marked.]

Page 14: Ternary Directed Acyclic Word Graphs  (TDAWG)

$\delta(q_2, b) = q_3$

[Figure: the same DFA state diagram, with the transition from $q_2$ to $q_3$ on b marked.]

Page 15: Ternary Directed Acyclic Word Graphs  (TDAWG)

Transition Function $\delta$

[Figure: the same DFA state diagram, states $q_0$ to $q_5$.]

$\delta$      a        b
$q_0$    $q_1$    $q_5$
$q_1$    $q_5$    $q_2$
$q_2$    $q_5$    $q_3$
$q_3$    $q_4$    $q_5$
$q_4$    $q_5$    $q_5$
$q_5$    $q_5$    $q_5$
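
The transition table can be run directly as a lookup table. The following is a minimal Python sketch, not taken from the slides: the names `DELTA`, `START`, `FINAL` and `accepts` are illustrative, and the table entries are read off the state diagram ($q_0 \to q_1 \to q_2 \to q_3 \to q_4$ on a, b, b, a, everything else to the trap state $q_5$).

```python
# Transition table of the six-state DFA from the slides.
# Assumption: entries are reconstructed from the state diagram; q5 is the trap state.
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q5",
    ("q1", "a"): "q5", ("q1", "b"): "q2",
    ("q2", "a"): "q5", ("q2", "b"): "q3",
    ("q3", "a"): "q4", ("q3", "b"): "q5",
    ("q4", "a"): "q5", ("q4", "b"): "q5",
    ("q5", "a"): "q5", ("q5", "b"): "q5",
}
START = "q0"
FINAL = {"q4"}


def accepts(word, delta=DELTA, start=START, final=FINAL):
    """Run the DFA on word and report whether it stops in a final state."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]  # one table lookup per input symbol
    return state in final


print(accepts("abba"))   # True
print(accepts("abbab"))  # False: the extra b moves q4 to the trap state q5
```

Each input symbol costs one dictionary lookup, which is the table-style implementation of $\delta$ that the later slides compare against linked lists and search trees.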

Page 16: Ternary Directed Acyclic Word Graphs  (TDAWG)

Another Example

[Figure: the same DFA state diagram, states $q_0$ to $q_5$, now with three states marked "accept".]

$L(M) = \{\lambda, ab, abba\}$

Page 17: Ternary Directed Acyclic Word Graphs  (TDAWG)

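• $L(M) = \{\text{all strings with prefix } ab\}$

[Figure: state diagram with $q_0 \to q_1$ on a and $q_1 \to q_2$ on b; $q_2$ is the accepting state and loops on a, b; $q_3$ loops on a, b.]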

Page 18: Ternary Directed Acyclic Word Graphs  (TDAWG)

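$L(M) = \{\text{all strings without substring } 001\}$

[Figure: state diagram over the alphabet {0, 1} whose states are labelled with the part of 001 matched so far (0, 00, 001); the state 001 loops on 0, 1.]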
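
A small Python sketch of this automaton follows. It is an illustration rather than the slide's own figure: the state names (the empty string, "0", "00", "001") and the function name `avoids_001` are assumptions, with each state recording how much of 001 has just been matched.

```python
from itertools import product

# Assumed encoding: each state is the longest suffix of the input that is a prefix of "001".
DELTA = {
    ("", "0"): "0",       ("", "1"): "",
    ("0", "0"): "00",     ("0", "1"): "",
    ("00", "0"): "00",    ("00", "1"): "001",
    ("001", "0"): "001",  ("001", "1"): "001",  # trap: 001 has already been seen
}
ACCEPTING = {"", "0", "00"}


def avoids_001(word):
    """Return True when the binary string word does not contain 001."""
    state = ""
    for bit in word:
        state = DELTA[(state, bit)]
    return state in ACCEPTING


# Cross-check the automaton against a direct substring test on all short inputs.
for n in range(9):
    for bits in product("01", repeat=n):
        w = "".join(bits)
        assert avoids_001(w) == ("001" not in w)
print("DFA agrees with the substring test for all strings up to length 8")
```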

Page 19: Ternary Directed Acyclic Word Graphs  (TDAWG)

DAWG

Directed Acyclic Word Graph

Page 20: Ternary Directed Acyclic Word Graphs  (TDAWG)

DAWG

Page 21: Ternary Directed Acyclic Word Graphs  (TDAWG)

DAWG

Page 22: Ternary Directed Acyclic Word Graphs  (TDAWG)

DAWG

Page 23: Ternary Directed Acyclic Word Graphs  (TDAWG)

cDAWG

Page 24: Ternary Directed Acyclic Word Graphs  (TDAWG)

Main development ideas

Methodology / node / edge / Key points of the development

1. DAWG

The prototype for all later DAWG development. It redirects the edges of a branching tree-shaped graph so that the graph can point back into itself, which removes a large number of nodes and makes it faster than a DAG.

2. cDAWG

Focuses on reducing the number of nodes, which in turn reduces the number of edges and makes processing faster than with a DAWG.

3. ASDAWG

Can store all subsequences together in a single graph. Suited to subsequence analysis, and greatly reduces memory use.

4. morphic DAWG

An application that applies functions to data held in DAWG form.

5. WDAWG

Puts a length window on the sequence to restrict attention to the parts of interest (VLDC); the parts that are not of interest are set to wildcards, which makes it easier to target the analysis.

6. SDAWG

Restructures the DAWG so that it has the symmetric-tree property, giving the highest average speed in use.

7. two-tree DAWG

A technique that splits the DAWG into 2 parts, so that updates are faster because the whole tree does not have to be restructured.

8. DASG

Extends cDAWG by allowing each edge between nodes to run both forwards and backwards.

9. CSDAWG

Adjusts the DAWG structure so that the start point and the end point can be the same point, which lets this storage scheme be used for graphic or geometric data such as circles or polygons.

Page 25: Ternary Directed Acyclic Word Graphs  (TDAWG)

TST

Ternary Search Tree

Page 26: Ternary Directed Acyclic Word Graphs  (TDAWG)

TST History

• Jon L. Bentley and Robert Sedgewick
• Fast Algorithms for Sorting and Searching Strings, Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), January 1997.

• Ternary Search Trees, Dr. Dobb's Journal, April 1998.

• Dictionary of Algorithms and Data Structures, National Institute of Standards and Technology, http://www.nist.gov/

Page 27: Ternary Directed Acyclic Word Graphs  (TDAWG)

BST DST

TST
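
To make the TST idea concrete, here is a minimal Python sketch of insertion and lookup. It follows the usual textbook description of a ternary search tree (each node holds one character plus a lower, an equal and a higher child); the names `TstNode`, `tst_insert` and `tst_contains` are illustrative and not taken from the slides.

```python
class TstNode:
    """One character of a key, with BST-style lo/hi children and a trie-style eq child."""
    __slots__ = ("char", "lo", "eq", "hi", "is_end")

    def __init__(self, char):
        self.char = char
        self.lo = self.eq = self.hi = None
        self.is_end = False  # True when a stored word ends at this node


def tst_insert(node, word, i=0):
    """Insert the non-empty word (from position i) and return the subtree root."""
    ch = word[i]
    if node is None:
        node = TstNode(ch)
    if ch < node.char:
        node.lo = tst_insert(node.lo, word, i)       # BST step, same character
    elif ch > node.char:
        node.hi = tst_insert(node.hi, word, i)       # BST step, same character
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)   # trie step, next character
    else:
        node.is_end = True
    return node


def tst_contains(node, word):
    """Return True when word was inserted as a complete key."""
    if not word:
        return False
    i = 0
    while node is not None:
        ch = word[i]
        if ch < node.char:
            node = node.lo
        elif ch > node.char:
            node = node.hi
        elif i + 1 == len(word):
            return node.is_end
        else:
            node = node.eq
            i += 1
    return False


root = None
for w in ["as", "at", "be", "by", "he", "in", "is", "it", "of", "on", "or", "to"]:
    root = tst_insert(root, w)
print(tst_contains(root, "is"), tst_contains(root, "i"))  # True False
```

The lo/hi links behave like a binary search tree over the characters at one position, while the eq link behaves like a trie edge to the next position, which is how the TST sits between the BST and the digital search tree.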

Page 28: Ternary Directed Acyclic Word Graphs  (TDAWG)

Page 29: Ternary Directed Acyclic Word Graphs  (TDAWG)

TDAWG

Ternary Directed Acyclic Word Graph

Page 30: Ternary Directed Acyclic Word Graphs  (TDAWG)

Introduction

• DFA: how should the transitions of each state be implemented? (time and space efficiency)

• TST: "implants" a BST for the transitions (good time)

• DAWG: the smallest DFA for all suffixes of a string (good space)

• TDAWG

• Proof: TDAWG vs. DAWG

Page 31: Ternary Directed Acyclic Word Graphs  (TDAWG)

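Hypothesis / Theorem (1/2)

• Time = construction + search (usable online)
• DFA: $M = (Q, \Sigma, \delta, q_0, F)$
• $\Sigma$ = alphabet (Chinese and Japanese: ~1000 characters)
• $Q$ = set of states
• $\delta : Q \times \Sigma \to Q$ = transition function
• Table: $O(|p|)$ search time, where $|p|$ is the length of the pattern $p$
• Table uses a very large amount of memory
• Linked list: $O(|\Sigma| \times |p|)$ search time
• If $\Sigma$ is large, search time becomes a problem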

Page 32: Ternary Directed Acyclic Word Graphs  (TDAWG)

Hypothesis / Theorem (2/2)

• For TDAWG
  – Uses $O(|S|)$ space
  – Uses $O(\log|\Sigma| \times |p|)$ search time
  – Uses $O(|\Sigma| \times |S|^2)$ construction time (Bentley & Sedgewick)
  – Uses $O(|\Sigma| \times |S|)$ construction time (this paper, adapted from Blumer's online DAWG construction)

• Comparison: TDAWG vs. DAWG (table and linked list)
  – Space, search time, construction time

Page 33: Ternary Directed Acyclic Word Graphs  (TDAWG)

TST TDAWG

Page 34: Ternary Directed Acyclic Word Graphs  (TDAWG)

Online DAWG Construction
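
The slide itself carries only a figure. As a rough illustration, the sketch below uses the standard textbook suffix-automaton formulation of online DAWG construction (one character appended at a time, with suffix links and state cloning). It is not the slide's own pseudocode, and the class and method names (`State`, `Dawg`, `extend`, `contains`, `is_suffix`) are assumptions.

```python
class State:
    __slots__ = ("next", "link", "length")

    def __init__(self, length=0, link=None):
        self.next = {}        # outgoing transitions: character -> State
        self.link = link      # suffix link
        self.length = length  # length of the longest string reaching this state


class Dawg:
    def __init__(self, text=""):
        self.root = State()
        self.last = self.root
        self.states = [self.root]
        for ch in text:
            self.extend(ch)

    def extend(self, ch):
        """Append one character to the indexed text (the online step)."""
        cur = State(self.last.length + 1)
        self.states.append(cur)
        p = self.last
        while p is not None and ch not in p.next:
            p.next[ch] = cur
            p = p.link
        if p is None:
            cur.link = self.root
        else:
            q = p.next[ch]
            if q.length == p.length + 1:
                cur.link = q
            else:
                # Split q: the clone keeps q's transitions but a shorter length.
                clone = State(p.length + 1, q.link)
                clone.next = dict(q.next)
                self.states.append(clone)
                while p is not None and p.next.get(ch) is q:
                    p.next[ch] = clone
                    p = p.link
                q.link = clone
                cur.link = clone
        self.last = cur

    def _walk(self, pattern):
        state = self.root
        for ch in pattern:
            state = state.next.get(ch)
            if state is None:
                return None
        return state

    def contains(self, pattern):
        """True when pattern is a substring of the text."""
        return self._walk(pattern) is not None

    def is_suffix(self, pattern):
        """True when pattern is a suffix of the text."""
        state = self._walk(pattern)
        s = self.last
        while s is not None:      # final states lie on the suffix-link chain
            if s is state:
                return True
            s = s.link
        return False


d = Dawg("abcbc")
print(d.contains("bcb"), d.contains("cc"))      # True False
print(d.is_suffix("cbc"), d.is_suffix("abc"))   # True False
```

Here each state stores its transitions in a Python dictionary; the table, linked-list and ternary variants discussed in the slides differ only in how that per-state transition set is stored.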

Page 35: Ternary Directed Acyclic Word Graphs  (TDAWG)

Online TDAWG Construction
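
Again the slide is a figure only. The sketch below is a simplified illustration of the idea, not the paper's algorithm: it runs the textbook online suffix-automaton construction but keeps each state's outgoing transitions in a small, unbalanced binary search tree keyed by character, in the spirit of a TST. All names (`Edge`, `bst_get`, `bst_put`, `bst_copy`, `build_tdawg`, `contains`) are assumptions.

```python
class Edge:
    """BST node holding one outgoing transition of a state."""
    __slots__ = ("char", "target", "left", "right")

    def __init__(self, char, target):
        self.char, self.target = char, target
        self.left = self.right = None


def bst_get(edge, char):
    """Return the target state for char, or None when there is no such edge."""
    while edge is not None:
        if char == edge.char:
            return edge.target
        edge = edge.left if char < edge.char else edge.right
    return None


def bst_put(edge, char, target):
    """Insert or overwrite the transition on char; return the subtree root."""
    if edge is None:
        return Edge(char, target)
    if char == edge.char:
        edge.target = target
    elif char < edge.char:
        edge.left = bst_put(edge.left, char, target)
    else:
        edge.right = bst_put(edge.right, char, target)
    return edge


def bst_copy(edge):
    """Copy a transition tree (used when a state is cloned)."""
    if edge is None:
        return None
    copy = Edge(edge.char, edge.target)
    copy.left, copy.right = bst_copy(edge.left), bst_copy(edge.right)
    return copy


class State:
    __slots__ = ("edges", "link", "length")

    def __init__(self, length, link=None):
        self.edges, self.link, self.length = None, link, length


def build_tdawg(text):
    """Build the automaton one character at a time and return its root state."""
    root = last = State(0)
    for ch in text:
        cur = State(last.length + 1)
        p = last
        while p is not None and bst_get(p.edges, ch) is None:
            p.edges = bst_put(p.edges, ch, cur)
            p = p.link
        if p is None:
            cur.link = root
        else:
            q = bst_get(p.edges, ch)
            if q.length == p.length + 1:
                cur.link = q
            else:
                clone = State(p.length + 1, q.link)
                clone.edges = bst_copy(q.edges)
                while p is not None and bst_get(p.edges, ch) is q:
                    p.edges = bst_put(p.edges, ch, clone)
                    p = p.link
                q.link = cur.link = clone
        last = cur
    return root


def contains(root, pattern):
    """True when pattern is a substring of the indexed text."""
    state = root
    for ch in pattern:
        state = bst_get(state.edges, ch)
        if state is None:
            return False
    return True


root = build_tdawg("abracadabra")
print(contains(root, "cadab"), contains(root, "rab"))  # True False
```

Because the per-state trees here are not rebalanced, a lookup can degrade to a linear scan when characters arrive in sorted order; the logarithmic bound quoted in the slides presumes reasonably balanced trees.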

Page 36: Ternary Directed Acyclic Word Graphs  (TDAWG)

Experiment Result

Page 37: Ternary Directed Acyclic Word Graphs  (TDAWG)

Conclusion

• New data structure: TDAWG

• Construction time (English text, $|\Sigma| = 256$)
  – TDAWG < linked-list DAWG < table DAWG

• Space requirement
  – linked-list DAWG < TDAWG (by about 20%)
  – table DAWG is not comparable on the same scale

• Search time
  – Short patterns: table DAWG is best; TDAWG < linked-list DAWG
  – Log curve vs. linear curve (long patterns?)

Page 38: Ternary Directed Acyclic Word Graphs  (TDAWG)

Discussion & Future Work

• Asian languages (~1000s of characters) should see a larger search-time improvement than English (256 characters), because search takes $O(\log|\Sigma| \times |p|)$

• Apply to other DAWGs: cDAWG, minimum DAWG, etc.

• More efficiency with an AVL tree (AVL balancing)

• Bioinformatics has only 4 characters, but a sliding window of 12 characters gives $4^{12}$ symbols
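
The alphabet sizes mentioned above can be turned into rough per-character costs. The Python sketch below is only back-of-the-envelope arithmetic under stated assumptions: a linked list is charged its worst case of one comparison per alphabet symbol, and a search tree is assumed to be perfectly balanced (which the slides list as future work via AVL trees). The function name `comparisons_per_char` is illustrative.

```python
def comparisons_per_char(alphabet_size):
    """Worst-case comparisons to find one transition: (linked list, balanced BST)."""
    linear = alphabet_size                 # scan every entry of the list
    balanced = alphabet_size.bit_length()  # height of a balanced BST with that many keys
    return linear, balanced


cases = [
    ("DNA", 4),
    ("English text", 256),
    ("Asian text", 1000),
    ("DNA, 12-character window", 4 ** 12),
]
for name, sigma in cases:
    linear, balanced = comparisons_per_char(sigma)
    print(f"{name}: |Sigma| = {sigma}, list <= {linear}, balanced tree <= {balanced}")
```

Under these assumptions the tree needs at most 9 comparisons per character for 256 symbols and at most 10 for 1000 symbols, so the gap to a linear scan widens as the alphabet grows. A 12-character window over DNA gives $4^{12} = 16{,}777{,}216$ symbols, for which a balanced tree would need at most 25 comparisons.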