
Page 1

北海道大学 Hokkaido University

Lecture on Information Knowledge Network (2011/1/7)

"Information retrieval and pattern matching"

Laboratory of Information Knowledge Network, Division of Computer Science,

Graduate School of Information Science and Technology, Hokkaido University

Takuya KIDA

Page 2

The 5th: Regular expression matching

– About regular expressions
– Flow of processing
– Construction of the syntax tree (parse tree)
– Construction of an NFA for the RE
– Simulating the NFA

Page 3

What is a regular expression?

A notation for flexible and powerful pattern matching.
– Example of regular expressions for filenames:
  > rm *.txt
  > cp Important[0-9].doc
  The first matches any file whose extension is ".txt"; the second matches Important0.doc through Important9.doc.
– Example of a regular expression for the search tool grep:
  > grep -E "for.+(256|CHAR_SIZE)" *.c
– Example of a regular expression in the programming language Perl:
  $line =~ m|^http://.+\.jp/.+$|
  This matches strings which start with "http://" and include ".jp/".

A regular expression can express a regular set (regular language).
– It expresses a language L (a set of strings) which can be accepted by a finite automaton.

Page 4

Definition of regular expression

Definition: A regular expression is a string over Σ∪{ε, |, ・, *, (, )} which is recursively defined by the following rules.
– (1) An element of {ε}∪Σ is a regular expression.
– (2) If α and β are regular expressions, then (α・β) is a regular expression.
– (3) If α and β are regular expressions, then (α|β) is a regular expression.
– (4) If α is a regular expression, then α* is a regular expression.
– (5) Only the expressions derived from the above rules are regular expressions.

Example: (A・((A・T)|(C・G))*) → A(AT|CG)*; (α・β) is often written αβ for short.

※ The symbols '|', '・', '*' are called operators. Moreover, for a regular expression α, "+" is often used in the sense of α+ = α・α*.

Page 5

Semantics of regular expressions

A regular expression is mapped to a subset of Σ* (a language L):
– (i) ||ε|| = {ε}
– (ii) For a∈Σ, ||a|| = {a}
– (iii) For regular expressions α and β, ||(α・β)|| = ||α||・||β||
– (iv) For regular expressions α and β, ||(α|β)|| = ||α|| ∪ ||β||
– (v) For a regular expression α, ||α*|| = ||α||*

For example, for (a・(a|b)*):
||(a・(a|b)*)|| = ||a||・||(a|b)*|| = {a}・||(a|b)||* = {a}・({a}∪{b})* = { ax | x∈{a, b}* }

[Figure: a DFA equivalent to the example above, with initial state q0, accepting state q1, and sink state q2; q0 goes to q1 on a and to q2 on b, and both q1 and q2 loop on a and b.]

※ Exercise: what is the language equivalent to (AT|GA)(TT)*?
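As a quick way to check the answer empirically, the following small Python sketch (added here as an illustration; it is not part of the original slides) tests whole-string membership in the language of (AT|GA)(TT)* using the standard re module:

import re

# Membership test for the language of (AT|GA)(TT)*: fullmatch requires the
# whole string to be generated by the expression, not just a substring.
pattern = re.compile(r"(AT|GA)(TT)*")

for s in ["AT", "GA", "ATTT", "GATTTT", "ATTTT", "GATT", "AA"]:
    print(s, bool(pattern.fullmatch(s)))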

Page 6

What is the regular expression matching problem?

Regular expression matching problem:
– The problem of finding all strings in L(α) = ||α||, defined by a given α, occurring in a given text.

The ability of regular expressions to define a language is equal to that of finite automata!
– We can construct a finite automaton that accepts the same language expressed by a regular expression.
– We can also describe a regular expression that expresses the same language accepted by a finite automaton.
※ Please refer to "Automaton and Computability" (2.5 Regular expressions and regular sets), written by Setsuo Arikawa and Satoru Miyano.

What we should do for matching a regular expression is to build an automaton (NFA/DFA) corresponding to the regular expression and then simulate it.
– A regular expression is easier to convert to an NFA than to a DFA.
– The initial state of the automaton is always kept active.
– An occurrence of the pattern expressed by the regular expression is found whenever the automaton reaches a final state while reading the text.
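For comparison with the automaton-based machinery developed in the rest of the lecture, here is what the matching problem looks like when delegated to a library engine (an illustrative sketch added to this transcript; note that re.finditer reports leftmost non-overlapping matches rather than every occurrence):

import re

text = "CCATTTGAGATTCC"
for m in re.finditer(r"(AT|GA)(TT)*", text):
    print(f"occurrence at [{m.start()}, {m.end()}): {m.group(0)}")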

Page 7

Flow of pattern matching process

General flow:
Regular expression → (parsing) → parse tree → (constructing an NFA by the Thompson or Glushkov method) → NFA → (scan the text, or translate into a DFA and scan the text) → report the occurrences

Flow with filtering technique:
Regular expression → (extracting) → a set of strings → (multiple pattern matching) → find the candidates → (verify) → report the occurrences

Page 8

Construction of parse tree

Parse tree: a tree structure used in preparation for making the NFA
– Each leaf node is labeled by a symbol a∈Σ or the empty word ε.
– Each internal node is labeled by an operator symbol in {|, ・, *}.
– Although a parser tool like Lex or Flex can parse regular expressions, that would be overkill. (The pseudo code on the next slide is enough for this purpose.)

Example: the parse tree TRE for the regular expression RE = (AT|GA)((AG|AAA)*), built by reading off the depth of parentheses and the operators in ( A T | G A ) ( ( A G | A A A ) * ). [Figure omitted: the tree has leaves labeled A, T, G, A, A, G, A, A, A and internal nodes labeled with the operators ・, |, and *.]

Page 9

Pseudo code

Parse(p = p1p2…pm, last)
  v ← θ;
  while p_last ≠ $ do
    if p_last ∈ Σ or p_last = ε then      /* normal character */
      vr ← create a node with p_last;
      if v ≠ θ then v ← [・](v, vr); else v ← vr;
      last ← last + 1;
    else if p_last = '|' then             /* union operator */
      (vr, last) ← Parse(p, last + 1);
      v ← [|](v, vr);
    else if p_last = '*' then             /* star operator */
      v ← [*](v);
      last ← last + 1;
    else if p_last = '(' then             /* open parenthesis */
      (vr, last) ← Parse(p, last + 1);
      last ← last + 1;
      if v ≠ θ then v ← [・](v, vr); else v ← vr;
    else if p_last = ')' then             /* close parenthesis */
      return (v, last);
    end of if
  end of while
  return (v, last);
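The following Python transcription of the Parse pseudo code above may help readers who want to run it (a sketch added to this transcript; the '$' terminator follows the slides, while the tuple encoding of tree nodes and the function name are ours):

def parse(p, last=0):
    """Recursive-descent parser following the Parse pseudo code.
    p is a regular expression terminated by '$'.
    Returns (tree, next_position); tree nodes are nested tuples:
    ('sym', c), ('cat', l, r), ('alt', l, r), ('star', u)."""
    v = None
    while p[last] != '$':
        c = p[last]
        if c == '|':                       # union operator
            vr, last = parse(p, last + 1)
            v = ('alt', v, vr)
        elif c == '*':                     # star operator
            v = ('star', v)
            last += 1
        elif c == '(':                     # open parenthesis
            vr, last = parse(p, last + 1)
            last += 1                      # skip the closing ')'
            v = vr if v is None else ('cat', v, vr)
        elif c == ')':                     # close parenthesis
            return v, last
        else:                              # ordinary character
            vr = ('sym', c)
            v = vr if v is None else ('cat', v, vr)
            last += 1
    return v, last

tree, _ = parse("(AT|GA)((AG|AAA)*)$")
print(tree)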

Page 10

NFA construction by Thompson method

Idea:
– Traversing the parse tree TRE for a given RE in post-order, we construct an automaton Th(v) that accepts the language L(RE_v) corresponding to the subtree rooted at node v.
– The key point is that Th(v) can be obtained by connecting, with ε transitions, the automata corresponding to the subtrees rooted at the children of v.

Properties of the Thompson NFA
– The number of states is < 2m and the number of state transitions is < 4m → O(m).
– It contains many ε transitions.
– The transitions other than ε go from state i to state i+1.

K. Thompson. Regular expression search algorithm. Communications of the ACM, 11:419-422, 1968.

Example :  Thompson NFA for RE = (AT|GA)((AG|AAA)*)

[Figure omitted: the Thompson NFA, with states 0–17 connected by the labeled transitions A, T, G, A, A, G, A, A, A and many ε transitions.]

Page 11

NFA construction algorithm

For the parse tree TRE, traversing the tree in post-order, the algorithm generates and connects the automata for each node as follows (the figures are summarized in words):
(i) When v is the empty word ε: a new initial state I and final state F connected by an ε transition.
(ii) When v is a character a: a new initial state I and final state F connected by a transition labeled a.
(iii) When v is a concatenation [・](vL, vR): Th(vL) and Th(vR) are connected in series, identifying the final state of Th(vL) with the initial state of Th(vR).
(iv) When v is a selection [|](vL, vR): a new initial state I has ε transitions to the initial states of Th(vL) and Th(vR), and their final states have ε transitions to a new final state F.
(v) When v is a repetition [*](v): a new initial state I and final state F; ε transitions go from I to the initial state of Th(v), from the final state of Th(v) to F, from I directly to F, and from the final state of Th(v) back to its initial state.

Page 12

How the NFA construction algorithm works

Ex.: for RE = (AT|GA)((AG|AAA)*), the parse tree TRE (nodes numbered in post-order) and the Thompson NFA built from it (states 0–17). [Figures omitted.]

Page 13

Pseudo code

Thompson_recur(v)
  if v = [|](vL, vR) or v = [・](vL, vR) then
    Th(vL) ← Thompson_recur(vL);
    Th(vR) ← Thompson_recur(vR);
  else if v = [*](vC) then Th(vC) ← Thompson_recur(vC);
  /* the above is the recursive traversal (post-order) */
  if v = (ε) then return construction (i);
  if v = (α), α ∈ Σ then return construction (ii);
  if v = [・](vL, vR) then return construction (iii);
  if v = [|](vL, vR) then return construction (iv);
  if v = [*](vC) then return construction (v);

Thompson(RE)
  vRE ← Parse(RE$, 1);   /* construct the parse tree */
  Th(vRE) ← Thompson_recur(vRE);
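A compact Python version of Thompson_recur is sketched below (added to this transcript as an illustration; the nested-tuple tree encoding and the helper names are ours). It emits the transition list of the resulting NFA, merging states for concatenation as in rule (iii):

def thompson(v, trans, new_state):
    """Return (initial, final) of Th(v); trans collects (source, label, target),
    where label None stands for an ε transition."""
    kind = v[0]
    if kind == 'sym':                               # constructions (i) and (ii)
        i, f = new_state(), new_state()
        trans.append((i, v[1], f))                  # v[1] may be None for ε
        return i, f
    if kind == 'cat':                               # construction (iii)
        il, fl = thompson(v[1], trans, new_state)
        ir, fr = thompson(v[2], trans, new_state)
        for k, (s, a, t) in enumerate(trans):       # merge final of Th(vL) with initial of Th(vR)
            if s == ir:
                trans[k] = (fl, a, t)
        return il, fr
    if kind == 'alt':                               # construction (iv)
        il, fl = thompson(v[1], trans, new_state)
        ir, fr = thompson(v[2], trans, new_state)
        i, f = new_state(), new_state()
        trans += [(i, None, il), (i, None, ir), (fl, None, f), (fr, None, f)]
        return i, f
    if kind == 'star':                              # construction (v)
        ic, fc = thompson(v[1], trans, new_state)
        i, f = new_state(), new_state()
        trans += [(i, None, ic), (fc, None, f), (i, None, f), (fc, None, ic)]
        return i, f
    raise ValueError(kind)

# Parse tree of the smaller example (AT|GA)(TT)*, written out by hand.
tree = ('cat',
        ('alt', ('cat', ('sym', 'A'), ('sym', 'T')),
                ('cat', ('sym', 'G'), ('sym', 'A'))),
        ('star', ('cat', ('sym', 'T'), ('sym', 'T'))))

transitions, counter = [], [0]
def new_state():
    counter[0] += 1
    return counter[0] - 1

start, accept = thompson(tree, transitions, new_state)
print("initial state:", start, " final state:", accept)
for t in transitions:
    print(t)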

Page 14

NFA construction by Glushkov method

Idea
– Make a new expression RE' by numbering each symbol a∈∑ sequentially from the beginning to the end. (Let ∑' be the alphabet with subscripts.)
  Example: RE = (AT|GA)((AG|AAA)*) → RE' = (A1T2|G3A4)((A5G6|A7A8A9)*)
– After constructing an NFA that accepts the language L(RE'), we obtain the final NFA by removing the subscript numbers.

Properties of the Glushkov NFA
– The number of states is exactly m+1, and the number of state transitions is O(m^2).
– It doesn't contain any ε transitions.
– For any node, all the labels of the transitions entering the node are the same.

V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53, 1961.

Example: an NFA for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*), and the Glushkov NFA (states 0–9) obtained from it by removing the subscripts. [Figures omitted.]

Page 15

NFA construction algorithm (1)

Construction procedure:
– Make a new expression RE' by numbering each symbol a∈∑ sequentially from the beginning to the end. Pos(RE') = {1…m}, and ∑' is the alphabet with subscript numbers.
– While traversing the parse tree TRE' in post-order, for each subexpression RE'_v corresponding to the subtree rooted at v, calculate the set First(RE'_v), the set Last(RE'_v), the function Empty_v, and the function Follow(RE', x) of position x, defined as follows.

First(RE') = {x ∈ Pos(RE') | ∃u ∈ ∑'*, α_x u ∈ L(RE')}
Last(RE') = {x ∈ Pos(RE') | ∃u ∈ ∑'*, u α_x ∈ L(RE')}
Follow(RE', x) = {y ∈ Pos(RE') | ∃u, v ∈ ∑'*, u α_x α_y v ∈ L(RE')}
Empty_RE: a function that returns {ε} if ε belongs to L(RE), and otherwise returns φ. It can be calculated recursively as follows:
  Empty_ε = {ε},   Empty_α = φ for α∈∑,
  Empty_{RE1|RE2} = Empty_RE1 ∪ Empty_RE2,
  Empty_{RE1・RE2} = Empty_RE1 ∩ Empty_RE2,
  Empty_{RE*} = {ε}.

– The NFA is constructed from the values obtained above: First gives the transitions from the initial state, Last gives the positions of the final states, Follow gives the transition function, and Empty_RE tells whether the initial state of the NFA is also a final state.
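Worked out for the running example (added here as a check; it can be compared with the example NFA on the next slides): for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*),
First(RE') = {1, 3}, Last(RE') = {2, 4, 6, 9}, Empty_RE' = φ,
Follow(RE', 1) = {2}, Follow(RE', 3) = {4}, Follow(RE', 5) = {6}, Follow(RE', 7) = {8}, Follow(RE', 8) = {9},
Follow(RE', 2) = Follow(RE', 4) = Follow(RE', 6) = Follow(RE', 9) = {5, 7}.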

Page 16

NFA construction algorithm (2)

The Glushkov NFA GL’= (S, ∑’, I, F, δ’) that accepts language L(RE')–S :A set of states. S = {0, 1, …, m}–∑' :The alphabet with subscript numbers– I :The initial state id; I = 0–F :The final states;

F = Last(RE’)∪(EmptyRE ・ {0}).–δ' :Transition function defined by the followings

∀x∈ Pos(RE’), ∀y∈ Follow(RE’, x), δ’(x, αy) = yThe transitions from the initial state are as follows: ∀y∈ First(RE’), δ’(0, αy) = y

Example: the NFA for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*) constructed from these sets. [Figure omitted.]

Page 17

Pseudo code

Glushkov_variables(v, lpos)
  if v = [|](vl, vr) or v = [・](vl, vr) then
    lpos ← Glushkov_variables(vl, lpos);
    lpos ← Glushkov_variables(vr, lpos);
  else if v = [*](v*) then lpos ← Glushkov_variables(v*, lpos);
  end of if
  if v = (ε) then
    First(v) ← φ, Last(v) ← φ, Empty_v ← {ε};
  else if v = (a), a ∈ Σ then
    lpos ← lpos + 1;
    First(v) ← {lpos}, Last(v) ← {lpos}, Empty_v ← φ, Follow(lpos) ← φ;
  else if v = [|](vl, vr) then
    First(v) ← First(vl) ∪ First(vr);
    Last(v) ← Last(vl) ∪ Last(vr);
    Empty_v ← Empty_vl ∪ Empty_vr;
  else if v = [・](vl, vr) then
    First(v) ← First(vl) ∪ (Empty_vl ・ First(vr));
    Last(v) ← (Empty_vr ・ Last(vl)) ∪ Last(vr);
    Empty_v ← Empty_vl ∩ Empty_vr;
    for x ∈ Last(vl) do Follow(x) ← Follow(x) ∪ First(vr);
  else if v = [*](v*) then
    First(v) ← First(v*), Last(v) ← Last(v*), Empty_v ← {ε};
    for x ∈ Last(v*) do Follow(x) ← Follow(x) ∪ First(v*);
  end of if
  return lpos;

The Follow updates take O(m^2) time per node, so the whole computation takes O(m^3) time in total.

Page 18

Pseudo code (cont.)

Glushkov(RE)
  /* make the parse tree by parsing the regular expression */
  vRE ← Parse(RE$, 1);
  /* calculate each variable by using the parse tree */
  m ← Glushkov_variables(vRE, 0);
  /* construct the NFA GL = (S, ∑, I, F, δ) from the variables */
  Δ ← φ;
  for i ∈ 0…m do create state i;
  for x ∈ First(vRE) do Δ ← Δ ∪ {(0, α_x, x)};
  for i ∈ 0…m do
    for x ∈ Follow(i) do Δ ← Δ ∪ {(i, α_x, x)};
  end of for
  for x ∈ Last(vRE) ∪ (Empty_vRE ・ {0}) do mark x as terminal;
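Putting the two procedures together, here is a runnable Python sketch of the Glushkov construction (added to this transcript; the nested-tuple tree encoding and the names are ours, and the ε leaf case is omitted for brevity):

from collections import defaultdict

def glushkov(tree):
    """Build the Glushkov NFA from a parse tree given as nested tuples
    ('sym', c), ('cat', l, r), ('alt', l, r), ('star', u).
    Returns (m, delta, final): states are 0..m, delta maps (state, symbol)
    to a set of target states, and there are no ε transitions."""
    follow, labels = {}, {}

    def walk(v, lpos):
        kind = v[0]
        if kind == 'sym':                       # a position x carrying label α_x
            lpos += 1
            labels[lpos] = v[1]
            follow[lpos] = set()
            return {lpos}, {lpos}, False, lpos
        f1, l1, e1, lpos = walk(v[1], lpos)
        if kind == 'star':
            for x in l1:
                follow[x] |= f1
            return f1, l1, True, lpos
        f2, l2, e2, lpos = walk(v[2], lpos)
        if kind == 'alt':
            return f1 | f2, l1 | l2, e1 or e2, lpos
        if kind == 'cat':
            for x in l1:
                follow[x] |= f2
            return f1 | (f2 if e1 else set()), l2 | (l1 if e2 else set()), e1 and e2, lpos
        raise ValueError(kind)

    first, last, empty, m = walk(tree, 0)
    delta = defaultdict(set)
    for y in first:                             # transitions from the initial state 0
        delta[(0, labels[y])].add(y)
    for x, ys in follow.items():                # transitions given by the Follow sets
        for y in ys:
            delta[(x, labels[y])].add(y)
    final = set(last) | ({0} if empty else set())
    return m, dict(delta), final

# The running example RE = (AT|GA)((AG|AAA)*), with its parse tree written by hand.
tree = ('cat',
        ('alt', ('cat', ('sym', 'A'), ('sym', 'T')),
                ('cat', ('sym', 'G'), ('sym', 'A'))),
        ('star', ('alt', ('cat', ('sym', 'A'), ('sym', 'G')),
                         ('cat', ('cat', ('sym', 'A'), ('sym', 'A')), ('sym', 'A')))))

m, delta, final = glushkov(tree)
print("states 0..%d, final states %s" % (m, sorted(final)))
for (state, symbol), targets in sorted(delta.items()):
    print(" %d --%s--> %s" % (state, symbol, sorted(targets)))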

Page 19

Take a breath

Taiwan High-speed Railway@Taipei 2011.11.8

Page 20

Flow of pattern matching process

General flow:
Regular expression → (parsing) → parse tree → (constructing an NFA by the Thompson or Glushkov method) → NFA → scan the text → report the occurrences; alternatively, translate the NFA into a DFA and scan the text with it.
– An NFA can be simulated in O(mn) time.
– To translate an NFA into a DFA, we need O(2^m) time and space.
– There also exists a method of converting a regular expression directly into a DFA. ※ Please refer to Section 3.9 of "Compilers: Principles, Techniques and Tools," by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986.

Page 21

Methods of simulating NFAs

Simulating a Thompson NFA directly
– The most naïve method.
– Storing the current active states in a list of size O(m), the method updates the states of the NFA in O(m) time for each symbol read from the text.
– It obviously takes O(mn) time.

Simulating a Thompson NFA by converting it into an equivalent DFA
– A classical technique.
– Refer to "Compilers: Principles, Techniques and Tools," by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986.
– The conversion is done as preprocessing → it takes O(2^m) time and space.
– There are also techniques that do the conversion dynamically while scanning the text.

Hybrid method
– E. W. Myers. A four Russians algorithm for regular expression pattern matching. Journal of the ACM, 39(2):430-448, 1992.
– A method that combines NFA and DFA to do efficient matching.
– It divides the Thompson NFA into modules of O(k) nodes each, converts each module into a DFA, and simulates the transitions between modules as an NFA.

High-speed NFA simulation by the bit-parallel technique
– Simulating the Thompson NFA: proposed by S. Wu and U. Manber [1992].
– Simulating the Glushkov NFA: proposed by G. Navarro and M. Raffinot [1999].
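As an illustration of the first (naïve) method, the sketch below simulates an ε-NFA directly by maintaining the set of active states and taking ε-closures; the tiny hand-written Thompson-style NFA for A(T|G) and all names are assumptions made for this transcript:

def eps_closure(states, trans):
    """States reachable from `states` via ε transitions (label None)."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in trans:
            if src == s and label is None and dst not in closure:
                closure.add(dst)
                stack.append(dst)
    return closure

def simulate(trans, start, final, text):
    """Naïve direct simulation; the initial state is kept active so that
    an occurrence may start at any position of the text."""
    active = eps_closure({start}, trans)
    for pos, c in enumerate(text, start=1):
        if final in active:
            print("occurrence ending at", pos - 1)
        step = {dst for src, label, dst in trans if src in active and label == c}
        active = eps_closure(step | {start}, trans)
    if final in active:
        print("occurrence ending at", len(text))

# Hand-written Thompson-style NFA for A(T|G): initial state 0, final state 7.
trans = [(0, 'A', 1), (1, None, 2), (2, None, 3), (3, 'T', 4),
         (2, None, 5), (5, 'G', 6), (4, None, 7), (6, None, 7)]
simulate(trans, start=0, final=7, text="CATAG")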

Page 22

Simulating by converting into an equivalent DFA

DFAClassical(N = (Q, ∑, I, F, Δ), T = t1t2…tn)
  Preprocessing:
    for σ ∈ ∑ do Δ ← Δ ∪ {(I, σ, I)};   /* self-loop on the initial state */
    (Qd, ∑, Id, Fd, δ) ← BuildDFA(N);   /* make a DFA equivalent to the NFA N */
  Searching:
    s ← Id;
    for pos ∈ 1…n do
      if s ∈ Fd then report an occurrence ending at pos – 1;
      s ← δ(s, t_pos);
    end of for

Ex.: a DFA converted from the Glushkov NFA for RE = (AT|GA)((AG|AAA)*). [Figure omitted: each DFA state is a set of NFA states, e.g. {0}, {0,1}, {0,3}, {0,3,6}, {0,1,4,5,7}, …, and the transitions are labeled with A, C, G, T.]
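The classical conversion can also be done lazily, building each DFA state (a set of NFA states) only when the scan first needs it. The Python sketch below (added as an illustration; the hand-coded NFA is the Glushkov automaton of the smaller example (AT|GA)(TT)*, and all names are ours) follows the Preprocessing/Searching structure of DFAClassical:

from collections import defaultdict

# Glushkov NFA of (AT|GA)(TT)*: states 0..6, final states {2, 4, 6}.
nfa = defaultdict(set)
for state, sym, target in [(0, 'A', 1), (0, 'G', 3), (1, 'T', 2), (2, 'T', 5),
                           (3, 'A', 4), (4, 'T', 5), (5, 'T', 6), (6, 'T', 5)]:
    nfa[(state, sym)].add(target)
final = {2, 4, 6}

dfa = {}   # cache: (frozenset of NFA states, character) -> frozenset of NFA states

def dfa_step(d_state, c):
    """One DFA transition, built on demand by the subset construction."""
    key = (d_state, c)
    if key not in dfa:
        nxt = {0}                       # state 0 stays active (self-loop on the initial state)
        for q in d_state:
            nxt |= nfa[(q, c)]
        dfa[key] = frozenset(nxt)
    return dfa[key]

text = "CCGATTTTAGATT"
d = frozenset({0})
for pos, c in enumerate(text, start=1):
    if d & final:
        print("occurrence ending at", pos - 1)
    d = dfa_step(d, c)
if d & final:
    print("occurrence ending at", len(text))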

Page 23

Bit-parallel Thompson (BPThompson)

Simulating the Thompson NFA by the bit-parallel technique
– For Thompson NFAs, note that the successor of the i-th state is the (i+1)-th, except for ε transitions.
  → Bit-parallelism similar to the Shift-And method can be applied.
– ε transitions are simulated separately. This needs a mask table of size 2^L (L is the number of states of the NFA).
– It takes O(2^L + m|∑|) time for preprocessing.
– It scans a text in O(n) time when L is small enough.

For the NFA N = (Q = {s0, …, s_{|Q|-1}}, ∑, I = s0, F, Δ)
– Bit-vector representation: Qn = {0, …, |Q|-1}, In = 0^{|Q|-1} 1, Fn = OR_{sj∈F} 0^{|Q|-1-j} 1 0^j
– Definitions of the mask tables:
  Bn[i, σ] = OR_{(si,σ,sj)∈Δ} 0^{|Q|-1-j} 1 0^j
  En[i] = OR_{sj∈E(i)} 0^{|Q|-1-j} 1 0^j   (where E(i) is the ε-closure of state si)
  Ed[D] = OR_{i : i=0 or D & 0^{L-i-1} 1 0^i ≠ 0^L} En[i]
  B[σ] = OR_{i∈0…m} Bn[i, σ]

S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.

Page 24

Pseudo code

BuildEps(N = (Qn, ∑, In, Fn, Bn, En))
  for σ ∈ ∑ do
    B[σ] ← 0^L;
    for i ∈ 0…L–1 do B[σ] ← B[σ] | Bn[i, σ];
  end of for
  Ed[0] ← En[0];
  for i ∈ 0…L–1 do
    for j ∈ 0…2^i – 1 do
      Ed[2^i + j] ← En[i] | Ed[j];
    end of for
  end of for
  return (B, Ed);

BPThompson(N = (Qn, ∑, In, Fn, Bn, En), T = t1t2…tn)
  Preprocessing:
    (B, Ed) ← BuildEps(N);
  Searching:
    D ← Ed[In];   /* initial state */
    for pos ∈ 1…n do
      if D & Fn ≠ 0^L then report an occurrence ending at pos – 1;
      D ← Ed[(D << 1) & B[t_pos]];
    end of for

Page 25

Bit-parallel Glushkov (BPGlushkov)

Simulating the Glushkov NFA by the bit-parallel technique
– For Glushkov NFAs, note that, for any node, all the labels of the transitions entering the node are the same.
  → Although bit-parallelism in exactly the Shift-And style cannot be applied, each state transition can be calculated as Td[D] & B[σ].
– The number of entries of the mask table is 2^|Q| (while it is 2^L for BPThompson).
– It takes O(2^m + m|∑|) time for preprocessing.
– It scans a text in O(n) time when m is small enough.
– It is more efficient than BPThompson in almost all cases.

About the NFA GL = (Q = {s0, …, s_{|Q|-1}}, ∑, I = s0, F, Δ)
– Bit-vector representation: Qn = {0, …, |Q|-1}, In = 0^{|Q|-1} 1, Fn = OR_{sj∈F} 0^{|Q|-1-j} 1 0^j
– Definitions of the mask tables:
  Bn[i, σ] = OR_{(si,σ,sj)∈Δ} 0^{|Q|-1-j} 1 0^j
  B[σ] = OR_{i∈0…m} Bn[i, σ]
  Td[D] = OR_{(i,σ) : D & 0^{m-i} 1 0^i ≠ 0^{m+1}, σ∈∑} Bn[i, σ]

G. Navarro and M. Raffinot. Fast regular expression search. In Proc. of WAE99, LNCS1668, 199-213, 1999.

Page 26

Pseudo code

BuildTran(N = (Qn, ∑, In, Fn, Bn, En))
  for i ∈ 0…m do A[i] ← 0^{m+1};
  for σ ∈ ∑ do B[σ] ← 0^{m+1};
  for i ∈ 0…m, σ ∈ ∑ do
    A[i] ← A[i] | Bn[i, σ];
    B[σ] ← B[σ] | Bn[i, σ];
  end of for
  Td[0] ← 0^{m+1};
  for i ∈ 0…m do
    for j ∈ 0…2^i – 1 do
      Td[2^i + j] ← A[i] | Td[j];
    end of for
  end of for
  return (B, Td);

BPGlushkov(N = (Qn, ∑, In, Fn, Bn, En), T = t1t2…tn)
  Preprocessing:
    for σ ∈ ∑ do Bn[0, σ] ← Bn[0, σ] | 0^m 1;   /* initial self-loop */
    (B, Td) ← BuildTran(N);
  Searching:
    D ← 0^m 1;   /* initial state */
    for pos ∈ 1…n do
      if D & Fn ≠ 0^{m+1} then report an occurrence ending at pos – 1;
      D ← Td[D] & B[t_pos];
    end of for
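To make the table-driven simulation concrete, here is a condensed Python illustration of BPGlushkov (added to this transcript; the Glushkov NFA of RE = (AT|GA)((AG|AAA)*) is hard-coded from the worked example, and keeping bit 0 set after each step plays the role of the initial self-loop added in the pseudo code):

labels = {1: 'A', 2: 'T', 3: 'G', 4: 'A', 5: 'A', 6: 'G', 7: 'A', 8: 'A', 9: 'A'}
first = {1, 3}
last = {2, 4, 6, 9}
follow = {1: {2}, 2: {5, 7}, 3: {4}, 4: {5, 7}, 5: {6},
          6: {5, 7}, 7: {8}, 8: {9}, 9: {5, 7}}
m = len(labels)

# B[c]: bitmask of the positions (states) that are entered by reading character c.
B = {}
for x, c in labels.items():
    B[c] = B.get(c, 0) | (1 << x)

# Bn[i]: bitmask of the states reachable from state i by one transition.
Bn = [0] * (m + 1)
for y in first:
    Bn[0] |= 1 << y
for x, ys in follow.items():
    for y in ys:
        Bn[x] |= 1 << y

# Td[D]: OR of Bn[i] over every state i active in the set D
# (a full table of 2^(m+1) entries, as in BuildTran).
Td = [0] * (1 << (m + 1))
for D in range(1 << (m + 1)):
    mask = 0
    for i in range(m + 1):
        if D & (1 << i):
            mask |= Bn[i]
    Td[D] = mask

F = sum(1 << x for x in last)       # Empty_RE = φ, so state 0 is not final
text = "CCGATTAGAAAAGT"
D = 1                               # only the initial state is active
for pos, c in enumerate(text, start=1):
    if D & F:
        print("occurrence ending at", pos - 1)
    D = (Td[D] & B.get(c, 0)) | 1   # the "| 1" keeps the initial state active
if D & F:
    print("occurrence ending at", len(text))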

Page 27

Other topics

Extended regular expressions:
– Regular expressions that additionally allow two operations, intersection and complementation, besides concatenation, selection, and repetition.
  Example: ¬(UNIX) ∧ (UNI(.)* | (.)*NIX)
– This is different from POSIX regular expressions.

H. Yamamoto, An Automata-based Recognition Algorithm for Semi-extended Regular Expressions, Proc. MFCS2000, LNCS1893, 699-708, 2000.

O. Kupferman and S. Zuhovitzky, An Improved Algorithm for the Membership Problem for Extended Regular Expressions, Proc. MFCS2002, LNCS2420, 446-458, 2002.

Research on speeding up regular expression matching
– Filtration technique using BNDM + verification:

G. Navarro and M. Raffinot, New Techniques for Regular Expression Searching, Algorithmica, 41(2): 89-116, 2004.

– In this paper, a method of simulating the Glushkov NFA with mask tables of O(m·2^m) bits is also presented.

Page 28

The 5th summary

Regular expressions
– Their ability to define a language is the same as that of finite automata.

Flow of regular expression matching
– After translating the expression into a parse tree, the corresponding NFA is constructed. Matching is done by simulating the NFA.
– Alternatively: filtration + multiple pattern matching + verification + NFA simulation.

Methods for constructing an NFA
– Thompson NFA: the number of states is < 2m and the number of state transitions is < 4m → O(m). It contains many ε transitions. The transitions other than ε go from state i to state i+1.
– Glushkov NFA: the number of states is exactly m+1, and the number of state transitions is O(m^2). It doesn't contain any ε transitions. For any node, all the labels of the transitions entering the node are the same.

Methods of simulating NFAs
– Simulating a Thompson NFA directly → O(mn) time.
– Converting into an equivalent DFA → scanning runs in O(n), but preprocessing takes O(2^m) time and space.
– Speeding up by bit-parallel techniques: Bit-parallel Thompson and Bit-parallel Glushkov.

The next theme
– Pattern matching on compressed texts: an introduction to Kida's research (a trend of the 90's in this field!).

Page 29

Appendix

About the definitions of terms which I didn’t explain in the first lecture.

– A subset of ∑* is called a formal language, or a language for short.
– For languages L1, L2 ⊆ ∑*, the set { xy | x∈L1 and y∈L2 } is called the product of L1 and L2, and is denoted by L1・L2, or L1L2 for short.
– For a language L ⊆ ∑*, we define L^0 = {ε} and L^n = L^{n-1}・L (n ≥ 1). Moreover, we define L* = ∪_{n=0…∞} L^n and call it the closure of L. We also write L+ = ∪_{n=1…∞} L^n.
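For instance (a small example added for concreteness): if L1 = {a, ab} and L2 = {b}, then L1・L2 = {ab, abb}; and for L = {ab}, L* = {ε, ab, abab, ababab, …} while L+ = {ab, abab, ababab, …}.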

About look-behind notations
– I said in the lecture that I couldn't find a precise description of look-behind notations, but I eventually found one:
  Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, The MIT Press / Elsevier, 1990.
  (Japanese translation) コンピュータ基礎理論ハンドブックⅠ:アルゴリズムと複雑さ,丸善, 1994.
– Chapter 5, Sections 2.3 and 6.1.
– According to this, the notion of look-behind seems to have appeared in 1964.
– It goes beyond the power of context-free grammars!
– Its matching problem is proved to be NP-complete.