
Page 1

北海道大学 Hokkaido University

Lecture on Information Knowledge Network (2011/1/7)

"Information retrieval and pattern matching"

Laboratory of Information Knowledge Network, Division of Computer Science,

Graduate School of Information Science and Technology, Hokkaido University

Takuya KIDA

Page 2

The 5th: Regular expression matching

– About regular expressions
– Flow of processing
– Construction of the syntax tree (parse tree)
– Construction of an NFA for the RE
– Simulating the NFA

Page 3

What is a regular expression?

A notation for flexible and powerful pattern matching.
– Example of regular expressions for filenames:
  > rm *.txt
  > cp Important[0-9].doc
  The first matches any file whose extension is ".txt"; the second matches Important0.doc through Important9.doc.
– Example of a regular expression for the search tool grep:
  > grep -E "for.+(256|CHAR_SIZE)" *.c
– Example of a regular expression in the programming language Perl:
  $line =~ m|^http://.+\.jp/.+$|
  This matches strings which start with "http://" and include ".jp/".

A regular expression can express a regular set (regular language).
– It expresses a language L (a set of strings) which can be accepted by a finite automaton.

Page 4

Definition of regular expression

Definition: A regular expression is a string over Σ∪{ε, |, ・, *, (, )} which is recursively defined by the following rules.
– (1) An element of {ε}∪Σ is a regular expression.
– (2) If α and β are regular expressions, then (α・β) is a regular expression.
– (3) If α and β are regular expressions, then (α|β) is a regular expression.
– (4) If α is a regular expression, then α* is a regular expression.
– (5) Only the expressions derived from the above rules are regular expressions.

Example: (A・((A・T)|(C・G))*) → A(AT|CG)*; (α・β) is often written αβ for short.

※ The symbols '|', '・', '*' are called operators. Moreover, for a regular expression α, "+" is often used in the sense of α+ = α・α*.

Page 5

Semantics of regular expressions

A regular expression is mapped to a subset of Σ* (a language L):
– (i) ||ε|| = {ε}
– (ii) For a∈Σ, ||a|| = {a}
– (iii) For regular expressions α and β, ||(α・β)|| = ||α||・||β||
– (iv) For regular expressions α and β, ||(α|β)|| = ||α|| ∪ ||β||
– (v) For a regular expression α, ||α*|| = ||α||*

For example, for (a・(a|b)*):
||(a・(a|b)*)|| = ||a||・||(a|b)*|| = {a}・||(a|b)||* = {a}・({a}∪{b})* = { ax | x∈{a, b}* }

[Figure: a DFA equivalent to the example above, with initial state q0, accepting state q1, and sink state q2; q0 goes to q1 on a and to q2 on b, and both q1 and q2 loop on a and b.]

※ Exercise: what is the language equivalent to (AT|GA)(TT)*?
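As a quick way to check the answer empirically, the following small Python sketch (added here as an illustration; it is not part of the original slides) tests whole-string membership in the language of (AT|GA)(TT)* using the standard re module:

import re

# Membership test for the language of (AT|GA)(TT)*: fullmatch requires the
# whole string to be generated by the expression, not just a substring.
pattern = re.compile(r"(AT|GA)(TT)*")

for s in ["AT", "GA", "ATTT", "GATTTT", "ATTTT", "GATT", "AA"]:
    print(s, bool(pattern.fullmatch(s)))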

Page 6

What is the regular expression matching problem?

Regular expression matching problem:
– The problem of finding all strings in L(α) = ||α||, defined by a given α, occurring in a given text.

The ability of regular expressions to define a language is equal to that of finite automata!
– We can construct a finite automaton that accepts the same language expressed by a regular expression.
– We can also describe a regular expression that expresses the same language accepted by a finite automaton.
※ Please refer to "Automaton and Computability" (2.5 Regular expressions and regular sets), written by Setsuo Arikawa and Satoru Miyano.

What we should do for matching a regular expression is to build an automaton (NFA/DFA) corresponding to the regular expression and then simulate it.
– A regular expression is easier to convert to an NFA than to a DFA.
– The initial state of the automaton is always kept active.
– An occurrence of the pattern expressed by the regular expression is found whenever the automaton reaches a final state while reading the text.
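For comparison with the automaton-based machinery developed in the rest of the lecture, here is what the matching problem looks like when delegated to a library engine (an illustrative sketch added to this transcript; note that re.finditer reports leftmost non-overlapping matches rather than every occurrence):

import re

text = "CCATTTGAGATTCC"
for m in re.finditer(r"(AT|GA)(TT)*", text):
    print(f"occurrence at [{m.start()}, {m.end()}): {m.group(0)}")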

Page 7

Flow of pattern matching process

General flow:
Regular expression → (parsing) → parse tree → (constructing an NFA by the Thompson or Glushkov method) → NFA → (scan the text, or translate into a DFA and scan the text) → report the occurrences

Flow with filtering technique:
Regular expression → (extracting) → a set of strings → (multiple pattern matching) → find the candidates → (verify) → report the occurrences

Page 8

Construction of parse tree

Parse tree: a tree structure used in preparation for making the NFA
– Each leaf node is labeled by a symbol a∈Σ or the empty word ε.
– Each internal node is labeled by an operator symbol in {|, ・, *}.
– Although a parser tool like Lex or Flex can parse regular expressions, that would be overkill. (The pseudo code on the next slide is enough for this purpose.)

Example: the parse tree TRE for the regular expression RE = (AT|GA)((AG|AAA)*), built by reading off the depth of parentheses and the operators in ( A T | G A ) ( ( A G | A A A ) * ). [Figure omitted: the tree has leaves labeled A, T, G, A, A, G, A, A, A and internal nodes labeled with the operators ・, |, and *.]

Page 9

Pseudo code

Parse(p = p1p2…pm, last)
  v ← θ;
  while p_last ≠ $ do
    if p_last ∈ Σ or p_last = ε then      /* normal character */
      vr ← create a node with p_last;
      if v ≠ θ then v ← [・](v, vr); else v ← vr;
      last ← last + 1;
    else if p_last = '|' then             /* union operator */
      (vr, last) ← Parse(p, last + 1);
      v ← [|](v, vr);
    else if p_last = '*' then             /* star operator */
      v ← [*](v);
      last ← last + 1;
    else if p_last = '(' then             /* open parenthesis */
      (vr, last) ← Parse(p, last + 1);
      last ← last + 1;
      if v ≠ θ then v ← [・](v, vr); else v ← vr;
    else if p_last = ')' then             /* close parenthesis */
      return (v, last);
    end of if
  end of while
  return (v, last);
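The following Python transcription of the Parse pseudo code above may help readers who want to run it (a sketch added to this transcript; the '$' terminator follows the slides, while the tuple encoding of tree nodes and the function name are ours):

def parse(p, last=0):
    """Recursive-descent parser following the Parse pseudo code.
    p is a regular expression terminated by '$'.
    Returns (tree, next_position); tree nodes are nested tuples:
    ('sym', c), ('cat', l, r), ('alt', l, r), ('star', u)."""
    v = None
    while p[last] != '$':
        c = p[last]
        if c == '|':                       # union operator
            vr, last = parse(p, last + 1)
            v = ('alt', v, vr)
        elif c == '*':                     # star operator
            v = ('star', v)
            last += 1
        elif c == '(':                     # open parenthesis
            vr, last = parse(p, last + 1)
            last += 1                      # skip the closing ')'
            v = vr if v is None else ('cat', v, vr)
        elif c == ')':                     # close parenthesis
            return v, last
        else:                              # ordinary character
            vr = ('sym', c)
            v = vr if v is None else ('cat', v, vr)
            last += 1
    return v, last

tree, _ = parse("(AT|GA)((AG|AAA)*)$")
print(tree)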

Page 10

NFA construction by Thompson method

Idea:
– Traversing the parse tree TRE for a given RE in post-order, we construct an automaton Th(v) that accepts the language L(RE_v) corresponding to the subtree rooted at node v.
– The key point is that Th(v) can be obtained by connecting, with ε transitions, the automata corresponding to the subtrees rooted at the children of v.

Properties of the Thompson NFA
– The number of states is < 2m and the number of state transitions is < 4m → O(m).
– It contains many ε transitions.
– The transitions other than ε go from state i to state i+1.

K. Thompson. Regular expression search algorithm. Communications of the ACM, 11:419-422, 1968.

Example :  Thompson NFA for RE = (AT|GA)((AG|AAA)*)

[Figure omitted: the Thompson NFA, with states 0–17 connected by the labeled transitions A, T, G, A, A, G, A, A, A and many ε transitions.]

Page 11

NFA construction algorithm

For the parse tree TRE, traversing the tree in post-order, the algorithm generates and connects the automata for each node as follows (the figures are summarized in words):
(i) When v is the empty word ε: a new initial state I and final state F connected by an ε transition.
(ii) When v is a character a: a new initial state I and final state F connected by a transition labeled a.
(iii) When v is a concatenation [・](vL, vR): Th(vL) and Th(vR) are connected in series, identifying the final state of Th(vL) with the initial state of Th(vR).
(iv) When v is a selection [|](vL, vR): a new initial state I has ε transitions to the initial states of Th(vL) and Th(vR), and their final states have ε transitions to a new final state F.
(v) When v is a repetition [*](v): a new initial state I and final state F; ε transitions go from I to the initial state of Th(v), from the final state of Th(v) to F, from I directly to F, and from the final state of Th(v) back to its initial state.

Page 12

How the NFA construction algorithm works

Ex.: for RE = (AT|GA)((AG|AAA)*), the parse tree TRE (nodes numbered in post-order) and the Thompson NFA built from it (states 0–17). [Figures omitted.]

Page 13

Pseudo code

Thompson_recur(v)
  if v = [|](vL, vR) or v = [・](vL, vR) then
    Th(vL) ← Thompson_recur(vL);
    Th(vR) ← Thompson_recur(vR);
  else if v = [*](vC) then Th(vC) ← Thompson_recur(vC);
  /* the above is the recursive traversal (post-order) */
  if v = (ε) then return construction (i);
  if v = (α), α ∈ Σ then return construction (ii);
  if v = [・](vL, vR) then return construction (iii);
  if v = [|](vL, vR) then return construction (iv);
  if v = [*](vC) then return construction (v);

Thompson(RE)
  vRE ← Parse(RE$, 1);   /* construct the parse tree */
  Th(vRE) ← Thompson_recur(vRE);
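A compact Python version of Thompson_recur is sketched below (added to this transcript as an illustration; the nested-tuple tree encoding and the helper names are ours). It emits the transition list of the resulting NFA, merging states for concatenation as in rule (iii):

def thompson(v, trans, new_state):
    """Return (initial, final) of Th(v); trans collects (source, label, target),
    where label None stands for an ε transition."""
    kind = v[0]
    if kind == 'sym':                               # constructions (i) and (ii)
        i, f = new_state(), new_state()
        trans.append((i, v[1], f))                  # v[1] may be None for ε
        return i, f
    if kind == 'cat':                               # construction (iii)
        il, fl = thompson(v[1], trans, new_state)
        ir, fr = thompson(v[2], trans, new_state)
        for k, (s, a, t) in enumerate(trans):       # merge final of Th(vL) with initial of Th(vR)
            if s == ir:
                trans[k] = (fl, a, t)
        return il, fr
    if kind == 'alt':                               # construction (iv)
        il, fl = thompson(v[1], trans, new_state)
        ir, fr = thompson(v[2], trans, new_state)
        i, f = new_state(), new_state()
        trans += [(i, None, il), (i, None, ir), (fl, None, f), (fr, None, f)]
        return i, f
    if kind == 'star':                              # construction (v)
        ic, fc = thompson(v[1], trans, new_state)
        i, f = new_state(), new_state()
        trans += [(i, None, ic), (fc, None, f), (i, None, f), (fc, None, ic)]
        return i, f
    raise ValueError(kind)

# Parse tree of the smaller example (AT|GA)(TT)*, written out by hand.
tree = ('cat',
        ('alt', ('cat', ('sym', 'A'), ('sym', 'T')),
                ('cat', ('sym', 'G'), ('sym', 'A'))),
        ('star', ('cat', ('sym', 'T'), ('sym', 'T'))))

transitions, counter = [], [0]
def new_state():
    counter[0] += 1
    return counter[0] - 1

start, accept = thompson(tree, transitions, new_state)
print("initial state:", start, " final state:", accept)
for t in transitions:
    print(t)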

Page 14

NFA construction by Glushkov method

Idea
– Make a new expression RE' by numbering each symbol a∈∑ sequentially from the beginning to the end. (Let ∑' be the alphabet with subscripts.)
  Example: RE = (AT|GA)((AG|AAA)*) → RE' = (A1T2|G3A4)((A5G6|A7A8A9)*)
– After constructing an NFA that accepts the language L(RE'), we obtain the final NFA by removing the subscript numbers.

Properties of the Glushkov NFA
– The number of states is exactly m+1, and the number of state transitions is O(m^2).
– It doesn't contain any ε transitions.
– For any node, all the labels of the transitions entering the node are the same.

V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16:1-53, 1961.

Example: an NFA for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*), and the Glushkov NFA (states 0–9) obtained from it by removing the subscripts. [Figures omitted.]

Page 15

NFA construction algorithm (1)

Construction procedure:
– Make a new expression RE' by numbering each symbol a∈∑ sequentially from the beginning to the end. Pos(RE') = {1…m}, and ∑' is the alphabet with subscript numbers.
– While traversing the parse tree TRE' in post-order, for each subexpression RE'_v corresponding to the subtree rooted at v, calculate the set First(RE'_v), the set Last(RE'_v), the function Empty_v, and the function Follow(RE', x) of position x, defined as follows.

First(RE') = {x ∈ Pos(RE') | ∃u ∈ ∑'*, α_x u ∈ L(RE')}
Last(RE') = {x ∈ Pos(RE') | ∃u ∈ ∑'*, u α_x ∈ L(RE')}
Follow(RE', x) = {y ∈ Pos(RE') | ∃u, v ∈ ∑'*, u α_x α_y v ∈ L(RE')}
Empty_RE: a function that returns {ε} if ε belongs to L(RE), and otherwise returns φ. It can be calculated recursively as follows:
  Empty_ε = {ε},   Empty_α = φ for α∈∑,
  Empty_{RE1|RE2} = Empty_RE1 ∪ Empty_RE2,
  Empty_{RE1・RE2} = Empty_RE1 ∩ Empty_RE2,
  Empty_{RE*} = {ε}.

– The NFA is constructed from the values obtained above: First gives the transitions from the initial state, Last gives the positions of the final states, Follow gives the transition function, and Empty_RE tells whether the initial state of the NFA is also a final state.
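Worked out for the running example (added here as a check; it can be compared with the example NFA on the next slides): for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*),
First(RE') = {1, 3}, Last(RE') = {2, 4, 6, 9}, Empty_RE' = φ,
Follow(RE', 1) = {2}, Follow(RE', 3) = {4}, Follow(RE', 5) = {6}, Follow(RE', 7) = {8}, Follow(RE', 8) = {9},
Follow(RE', 2) = Follow(RE', 4) = Follow(RE', 6) = Follow(RE', 9) = {5, 7}.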

Page 16

NFA construction algorithm (2)

The Glushkov NFA GL’= (S, ∑’, I, F, δ’) that accepts language L(RE')–S :A set of states. S = {0, 1, …, m}–∑' :The alphabet with subscript numbers– I :The initial state id; I = 0–F :The final states;

F = Last(RE’)∪(EmptyRE ・ {0}).–δ' :Transition function defined by the followings

∀x∈ Pos(RE’), ∀y∈ Follow(RE’, x), δ’(x, αy) = yThe transitions from the initial state are as follows: ∀y∈ First(RE’), δ’(0, αy) = y

Example: the NFA for RE' = (A1T2|G3A4)((A5G6|A7A8A9)*) constructed from these sets. [Figure omitted.]

Page 17

Pseudo code

Glushkov_variables(v, lpos)
  if v = [|](vl, vr) or v = [・](vl, vr) then
    lpos ← Glushkov_variables(vl, lpos);
    lpos ← Glushkov_variables(vr, lpos);
  else if v = [*](v*) then lpos ← Glushkov_variables(v*, lpos);
  end of if
  if v = (ε) then
    First(v) ← φ, Last(v) ← φ, Empty_v ← {ε};
  else if v = (a), a ∈ Σ then
    lpos ← lpos + 1;
    First(v) ← {lpos}, Last(v) ← {lpos}, Empty_v ← φ, Follow(lpos) ← φ;
  else if v = [|](vl, vr) then
    First(v) ← First(vl) ∪ First(vr);
    Last(v) ← Last(vl) ∪ Last(vr);
    Empty_v ← Empty_vl ∪ Empty_vr;
  else if v = [・](vl, vr) then
    First(v) ← First(vl) ∪ (Empty_vl ・ First(vr));
    Last(v) ← (Empty_vr ・ Last(vl)) ∪ Last(vr);
    Empty_v ← Empty_vl ∩ Empty_vr;
    for x ∈ Last(vl) do Follow(x) ← Follow(x) ∪ First(vr);
  else if v = [*](v*) then
    First(v) ← First(v*), Last(v) ← Last(v*), Empty_v ← {ε};
    for x ∈ Last(v*) do Follow(x) ← Follow(x) ∪ First(v*);
  end of if
  return lpos;

The Follow updates take O(m^2) time per node, so the whole computation takes O(m^3) time in total.

Page 18

Pseudo code (cont.)

Glushkov(RE)
  /* make the parse tree by parsing the regular expression */
  vRE ← Parse(RE$, 1);
  /* calculate each variable by using the parse tree */
  m ← Glushkov_variables(vRE, 0);
  /* construct the NFA GL = (S, ∑, I, F, δ) from the variables */
  Δ ← φ;
  for i ∈ 0…m do create state i;
  for x ∈ First(vRE) do Δ ← Δ ∪ {(0, α_x, x)};
  for i ∈ 0…m do
    for x ∈ Follow(i) do Δ ← Δ ∪ {(i, α_x, x)};
  end of for
  for x ∈ Last(vRE) ∪ (Empty_vRE ・ {0}) do mark x as terminal;
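Putting the two procedures together, here is a runnable Python sketch of the Glushkov construction (added to this transcript; the nested-tuple tree encoding and the names are ours, and the ε leaf case is omitted for brevity):

from collections import defaultdict

def glushkov(tree):
    """Build the Glushkov NFA from a parse tree given as nested tuples
    ('sym', c), ('cat', l, r), ('alt', l, r), ('star', u).
    Returns (m, delta, final): states are 0..m, delta maps (state, symbol)
    to a set of target states, and there are no ε transitions."""
    follow, labels = {}, {}

    def walk(v, lpos):
        kind = v[0]
        if kind == 'sym':                       # a position x carrying label α_x
            lpos += 1
            labels[lpos] = v[1]
            follow[lpos] = set()
            return {lpos}, {lpos}, False, lpos
        f1, l1, e1, lpos = walk(v[1], lpos)
        if kind == 'star':
            for x in l1:
                follow[x] |= f1
            return f1, l1, True, lpos
        f2, l2, e2, lpos = walk(v[2], lpos)
        if kind == 'alt':
            return f1 | f2, l1 | l2, e1 or e2, lpos
        if kind == 'cat':
            for x in l1:
                follow[x] |= f2
            return f1 | (f2 if e1 else set()), l2 | (l1 if e2 else set()), e1 and e2, lpos
        raise ValueError(kind)

    first, last, empty, m = walk(tree, 0)
    delta = defaultdict(set)
    for y in first:                             # transitions from the initial state 0
        delta[(0, labels[y])].add(y)
    for x, ys in follow.items():                # transitions given by the Follow sets
        for y in ys:
            delta[(x, labels[y])].add(y)
    final = set(last) | ({0} if empty else set())
    return m, dict(delta), final

# The running example RE = (AT|GA)((AG|AAA)*), with its parse tree written by hand.
tree = ('cat',
        ('alt', ('cat', ('sym', 'A'), ('sym', 'T')),
                ('cat', ('sym', 'G'), ('sym', 'A'))),
        ('star', ('alt', ('cat', ('sym', 'A'), ('sym', 'G')),
                         ('cat', ('cat', ('sym', 'A'), ('sym', 'A')), ('sym', 'A')))))

m, delta, final = glushkov(tree)
print("states 0..%d, final states %s" % (m, sorted(final)))
for (state, symbol), targets in sorted(delta.items()):
    print(" %d --%s--> %s" % (state, symbol, sorted(targets)))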

Page 19

Take a breath

Taiwan High-speed Railway@Taipei 2011.11.8

Page 20

Flow of pattern matching process

General flow:
Regular expression → (parsing) → parse tree → (constructing an NFA by the Thompson or Glushkov method) → NFA → scan the text → report the occurrences; alternatively, translate the NFA into a DFA and scan the text with it.
– An NFA can be simulated in O(mn) time.
– To translate an NFA into a DFA, we need O(2^m) time and space.
– There also exists a method of converting a regular expression directly into a DFA. ※ Please refer to Section 3.9 of "Compilers: Principles, Techniques and Tools," by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986.

Page 21

Methods of simulating NFAs

Simulating a Thompson NFA directly
– The most naïve method.
– Storing the current active states in a list of size O(m), the method updates the states of the NFA in O(m) time for each symbol read from the text.
– It obviously takes O(mn) time.

Simulating a Thompson NFA by converting it into an equivalent DFA
– A classical technique.
– Refer to "Compilers: Principles, Techniques and Tools," by A. V. Aho, R. Sethi, and J. D. Ullman, Addison-Wesley, 1986.
– The conversion is done as preprocessing → it takes O(2^m) time and space.
– There are also techniques that do the conversion dynamically while scanning the text.

Hybrid method
– E. W. Myers. A four Russians algorithm for regular expression pattern matching. Journal of the ACM, 39(2):430-448, 1992.
– A method that combines NFA and DFA to do efficient matching.
– It divides the Thompson NFA into modules of O(k) nodes each, converts each module into a DFA, and simulates the transitions between modules as an NFA.

High-speed NFA simulation by the bit-parallel technique
– Simulating the Thompson NFA: proposed by S. Wu and U. Manber [1992].
– Simulating the Glushkov NFA: proposed by G. Navarro and M. Raffinot [1999].
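As an illustration of the first (naïve) method, the sketch below simulates an ε-NFA directly by maintaining the set of active states and taking ε-closures; the tiny hand-written Thompson-style NFA for A(T|G) and all names are assumptions made for this transcript:

def eps_closure(states, trans):
    """States reachable from `states` via ε transitions (label None)."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in trans:
            if src == s and label is None and dst not in closure:
                closure.add(dst)
                stack.append(dst)
    return closure

def simulate(trans, start, final, text):
    """Naïve direct simulation; the initial state is kept active so that
    an occurrence may start at any position of the text."""
    active = eps_closure({start}, trans)
    for pos, c in enumerate(text, start=1):
        if final in active:
            print("occurrence ending at", pos - 1)
        step = {dst for src, label, dst in trans if src in active and label == c}
        active = eps_closure(step | {start}, trans)
    if final in active:
        print("occurrence ending at", len(text))

# Hand-written Thompson-style NFA for A(T|G): initial state 0, final state 7.
trans = [(0, 'A', 1), (1, None, 2), (2, None, 3), (3, 'T', 4),
         (2, None, 5), (5, 'G', 6), (4, None, 7), (6, None, 7)]
simulate(trans, start=0, final=7, text="CATAG")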

Page 22

Simulating by converting into an equivalent DFA

DFAClassical(N = (Q, ∑, I, F, Δ), T = t1t2…tn)
  Preprocessing:
    for σ ∈ ∑ do Δ ← Δ ∪ {(I, σ, I)};   /* self-loop on the initial state */
    (Qd, ∑, Id, Fd, δ) ← BuildDFA(N);   /* make a DFA equivalent to the NFA N */
  Searching:
    s ← Id;
    for pos ∈ 1…n do
      if s ∈ Fd then report an occurrence ending at pos – 1;
      s ← δ(s, t_pos);
    end of for

Ex.: a DFA converted from the Glushkov NFA for RE = (AT|GA)((AG|AAA)*). [Figure omitted: each DFA state is a set of NFA states, e.g. {0}, {0,1}, {0,3}, {0,3,6}, {0,1,4,5,7}, …, and the transitions are labeled with A, C, G, T.]
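The classical conversion can also be done lazily, building each DFA state (a set of NFA states) only when the scan first needs it. The Python sketch below (added as an illustration; the hand-coded NFA is the Glushkov automaton of the smaller example (AT|GA)(TT)*, and all names are ours) follows the Preprocessing/Searching structure of DFAClassical:

from collections import defaultdict

# Glushkov NFA of (AT|GA)(TT)*: states 0..6, final states {2, 4, 6}.
nfa = defaultdict(set)
for state, sym, target in [(0, 'A', 1), (0, 'G', 3), (1, 'T', 2), (2, 'T', 5),
                           (3, 'A', 4), (4, 'T', 5), (5, 'T', 6), (6, 'T', 5)]:
    nfa[(state, sym)].add(target)
final = {2, 4, 6}

dfa = {}   # cache: (frozenset of NFA states, character) -> frozenset of NFA states

def dfa_step(d_state, c):
    """One DFA transition, built on demand by the subset construction."""
    key = (d_state, c)
    if key not in dfa:
        nxt = {0}                       # state 0 stays active (self-loop on the initial state)
        for q in d_state:
            nxt |= nfa[(q, c)]
        dfa[key] = frozenset(nxt)
    return dfa[key]

text = "CCGATTTTAGATT"
d = frozenset({0})
for pos, c in enumerate(text, start=1):
    if d & final:
        print("occurrence ending at", pos - 1)
    d = dfa_step(d, c)
if d & final:
    print("occurrence ending at", len(text))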

Page 23

Bit-parallel Thompson (BPThompson)

Simulating the Thompson NFA by the bit-parallel technique
– For Thompson NFAs, note that the successor of the i-th state is the (i+1)-th, except for ε transitions.
  → Bit-parallelism similar to the Shift-And method can be applied.
– ε transitions are simulated separately. This needs a mask table of size 2^L (L is the number of states of the NFA).
– It takes O(2^L + m|∑|) time for preprocessing.
– It scans a text in O(n) time when L is small enough.

For the NFA N = (Q = {s0, …, s_{|Q|-1}}, ∑, I = s0, F, Δ)
– Bit-vector representation: Qn = {0, …, |Q|-1}, In = 0^{|Q|-1} 1, Fn = OR_{sj∈F} 0^{|Q|-1-j} 1 0^j
– Definitions of the mask tables:
  Bn[i, σ] = OR_{(si,σ,sj)∈Δ} 0^{|Q|-1-j} 1 0^j
  En[i] = OR_{sj∈E(i)} 0^{|Q|-1-j} 1 0^j   (where E(i) is the ε-closure of state si)
  Ed[D] = OR_{i : i=0 or D & 0^{L-i-1} 1 0^i ≠ 0^L} En[i]
  B[σ] = OR_{i∈0…m} Bn[i, σ]

S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.

Page 24

Pseudo code

BuildEps(N = (Qn, ∑, In, Fn, Bn, En))
  for σ ∈ ∑ do
    B[σ] ← 0^L;
    for i ∈ 0…L–1 do B[σ] ← B[σ] | Bn[i, σ];
  end of for
  Ed[0] ← En[0];
  for i ∈ 0…L–1 do
    for j ∈ 0…2^i – 1 do
      Ed[2^i + j] ← En[i] | Ed[j];
    end of for
  end of for
  return (B, Ed);

BPThompson(N = (Qn, ∑, In, Fn, Bn, En), T = t1t2…tn)
  Preprocessing:
    (B, Ed) ← BuildEps(N);
  Searching:
    D ← Ed[In];   /* initial state */
    for pos ∈ 1…n do
      if D & Fn ≠ 0^L then report an occurrence ending at pos – 1;
      D ← Ed[(D << 1) & B[t_pos]];
    end of for

Page 25

Bit-parallel Glushkov (BPGlushkov)

Simulating the Glushkov NFA by the bit-parallel technique
– For Glushkov NFAs, note that, for any node, all the labels of the transitions entering the node are the same.
  → Although bit-parallelism in exactly the Shift-And style cannot be applied, each state transition can be calculated as Td[D] & B[σ].
– The number of entries of the mask table is 2^|Q| (while it is 2^L for BPThompson).
– It takes O(2^m + m|∑|) time for preprocessing.
– It scans a text in O(n) time when m is small enough.
– It is more efficient than BPThompson in almost all cases.

About the NFA GL = (Q = {s0, …, s_{|Q|-1}}, ∑, I = s0, F, Δ)
– Bit-vector representation: Qn = {0, …, |Q|-1}, In = 0^{|Q|-1} 1, Fn = OR_{sj∈F} 0^{|Q|-1-j} 1 0^j
– Definitions of the mask tables:
  Bn[i, σ] = OR_{(si,σ,sj)∈Δ} 0^{|Q|-1-j} 1 0^j
  B[σ] = OR_{i∈0…m} Bn[i, σ]
  Td[D] = OR_{(i,σ) : D & 0^{m-i} 1 0^i ≠ 0^{m+1}, σ∈∑} Bn[i, σ]

G. Navarro and M. Raffinot. Fast regular expression search. In Proc. of WAE99, LNCS1668, 199-213, 1999.

Page 26

Pseudo code

BuildTran(N = (Qn, ∑, In, Fn, Bn, En))
  for i ∈ 0…m do A[i] ← 0^{m+1};
  for σ ∈ ∑ do B[σ] ← 0^{m+1};
  for i ∈ 0…m, σ ∈ ∑ do
    A[i] ← A[i] | Bn[i, σ];
    B[σ] ← B[σ] | Bn[i, σ];
  end of for
  Td[0] ← 0^{m+1};
  for i ∈ 0…m do
    for j ∈ 0…2^i – 1 do
      Td[2^i + j] ← A[i] | Td[j];
    end of for
  end of for
  return (B, Td);

BPGlushkov(N = (Qn, ∑, In, Fn, Bn, En), T = t1t2…tn)
  Preprocessing:
    for σ ∈ ∑ do Bn[0, σ] ← Bn[0, σ] | 0^m 1;   /* initial self-loop */
    (B, Td) ← BuildTran(N);
  Searching:
    D ← 0^m 1;   /* initial state */
    for pos ∈ 1…n do
      if D & Fn ≠ 0^{m+1} then report an occurrence ending at pos – 1;
      D ← Td[D] & B[t_pos];
    end of for
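To make the table-driven simulation concrete, here is a condensed Python illustration of BPGlushkov (added to this transcript; the Glushkov NFA of RE = (AT|GA)((AG|AAA)*) is hard-coded from the worked example, and keeping bit 0 set after each step plays the role of the initial self-loop added in the pseudo code):

labels = {1: 'A', 2: 'T', 3: 'G', 4: 'A', 5: 'A', 6: 'G', 7: 'A', 8: 'A', 9: 'A'}
first = {1, 3}
last = {2, 4, 6, 9}
follow = {1: {2}, 2: {5, 7}, 3: {4}, 4: {5, 7}, 5: {6},
          6: {5, 7}, 7: {8}, 8: {9}, 9: {5, 7}}
m = len(labels)

# B[c]: bitmask of the positions (states) that are entered by reading character c.
B = {}
for x, c in labels.items():
    B[c] = B.get(c, 0) | (1 << x)

# Bn[i]: bitmask of the states reachable from state i by one transition.
Bn = [0] * (m + 1)
for y in first:
    Bn[0] |= 1 << y
for x, ys in follow.items():
    for y in ys:
        Bn[x] |= 1 << y

# Td[D]: OR of Bn[i] over every state i active in the set D
# (a full table of 2^(m+1) entries, as in BuildTran).
Td = [0] * (1 << (m + 1))
for D in range(1 << (m + 1)):
    mask = 0
    for i in range(m + 1):
        if D & (1 << i):
            mask |= Bn[i]
    Td[D] = mask

F = sum(1 << x for x in last)       # Empty_RE = φ, so state 0 is not final
text = "CCGATTAGAAAAGT"
D = 1                               # only the initial state is active
for pos, c in enumerate(text, start=1):
    if D & F:
        print("occurrence ending at", pos - 1)
    D = (Td[D] & B.get(c, 0)) | 1   # the "| 1" keeps the initial state active
if D & F:
    print("occurrence ending at", len(text))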

Page 27

Other topics

Extended regular expressions:
– Regular expressions that additionally allow two operations, intersection and complementation, besides concatenation, selection, and repetition.
  Example: ¬(UNIX) ∧ (UNI(.)* | (.)*NIX)
– This is different from POSIX regular expressions.

H. Yamamoto, An Automata-based Recognition Algorithm for Semi-extended Regular Expressions, Proc. MFCS2000, LNCS1893, 699-708, 2000.

O. Kupferman and S. Zuhovitzky, An Improved Algorithm for the Membership Problem for Extended Regular Expressions, Proc. MFCS2002, LNCS2420, 446-458, 2002.

Research on speeding up regular expression matching
– Filtration technique using BNDM + verification:

G. Navarro and M. Raffinot, New Techniques for Regular Expression Searching, Algorithmica, 41(2): 89-116, 2004.

– In this paper, a method of simulating the Glushkov NFA with mask tables of O(m·2^m) bits is also presented.

Page 28

The 5th summary

Regular expressions
– Their ability to define a language is the same as that of finite automata.

Flow of regular expression matching
– After translating the expression into a parse tree, the corresponding NFA is constructed. Matching is done by simulating the NFA.
– Alternatively: filtration + multiple pattern matching + verification + NFA simulation.

Methods for constructing an NFA
– Thompson NFA: the number of states is < 2m and the number of state transitions is < 4m → O(m). It contains many ε transitions. The transitions other than ε go from state i to state i+1.
– Glushkov NFA: the number of states is exactly m+1, and the number of state transitions is O(m^2). It doesn't contain any ε transitions. For any node, all the labels of the transitions entering the node are the same.

Methods of simulating NFAs
– Simulating a Thompson NFA directly → O(mn) time.
– Converting into an equivalent DFA → scanning runs in O(n), but preprocessing takes O(2^m) time and space.
– Speeding up by bit-parallel techniques: Bit-parallel Thompson and Bit-parallel Glushkov.

The next theme
– Pattern matching on compressed texts: an introduction to Kida's research (a trend of the 90's in this field!).

Page 29

Appendix

About the definitions of terms which I didn’t explain in the first lecture.

– A subset of ∑* is called a formal language, or a language for short.
– For languages L1, L2 ⊆ ∑*, the set { xy | x∈L1 and y∈L2 } is called the product of L1 and L2, and is denoted by L1・L2, or L1L2 for short.
– For a language L ⊆ ∑*, we define L^0 = {ε} and L^n = L^{n-1}・L (n ≥ 1). Moreover, we define L* = ∪_{n=0…∞} L^n and call it the closure of L. We also write L+ = ∪_{n=1…∞} L^n.
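For instance (a small example added for concreteness): if L1 = {a, ab} and L2 = {b}, then L1・L2 = {ab, abb}; and for L = {ab}, L* = {ε, ab, abab, ababab, …} while L+ = {ab, abab, ababab, …}.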

About look-behind notations
– I said in the lecture that I couldn't find a precise description of look-behind notations, but I eventually found one:
  Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, The MIT Press / Elsevier, 1990.
  (Japanese translation) コンピュータ基礎理論ハンドブックⅠ:アルゴリズムと複雑さ,丸善, 1994.
– Chapter 5, Sections 2.3 and 6.1.
– According to this, the notion of look-behind seems to have appeared in 1964.
– It goes beyond the power of context-free grammars!
– Its matching problem is proved to be NP-complete.