Regular expressions that produce parse trees
DESCRIPTION
Presenting a regular expression engine that gives parse trees in a single pass by modifying the standard nondeterministic finite-state automaton algorithm. My master's thesis.
TRANSCRIPT
Efficient Regular Expressions that produce Parse Trees
Aaron Karper Niko Schwarz
University of Bern
January 7, 2014
Aaron Karper, Niko Schwarz (UniBe) Regex Parse Trees January 7, 2014 1 / 38
Regular expressions so far
Regular expressions
https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)
domain: (([a-z]+\.)+([a-z]+))
path segments: ((/[a-z0-9]+)/?)
http://www.reddit.com/r/computerscience/comments/1sg69d/
domain: www, reddit, com
path: r, computerscience, comments, 1sg69d
Regular expressions so far
Regular expressions are greedy by default: (a+)(a?) on "aaa" → "aaa" in group 0 and "" in group 1.
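The greedy behaviour can be checked with any common engine. The sketch below uses Python's standard `re` module, which is an assumption of this example and not the engine presented in the talk. Note that Python numbers capture groups from 1 (group 0 is the whole match), whereas the slide counts from 0.

```python
import re

# (a+) is greedy and takes all three characters,
# so the optional (a?) is left with the empty string.
m = re.fullmatch(r"(a+)(a?)", "aaa")
first, second = m.group(1), m.group(2)
print(repr(first), repr(second))  # 'aaa' ''
```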
Regular expressions so far
POSIX gives only one match per capture group.
Regular languages are recognized, but parsing with combinatorial parsers takes O(n^3).
Backtracking implementations (Java, Python, Perl, ...) are exponentially slow in the worst case.
Benchmarks
Parsing with https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)
Figure: Posix. Groups matched on http://www.reddit.com/r/computerscience/comments/1sg69d, one match per group.
Figure: Our approach. Groups matched on the same URL, with repeated groups listed once per repetition (group 4 appears four times, once per path segment).
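To see the "one match per group" limitation concretely, here is a small sketch using Python's `re` module. This is an assumption of the example: `re` is a backtracking engine rather than a POSIX one, but it likewise keeps only the last repetition of a repeated group. The second `findall` pass is a common workaround, not the single-pass technique of the talk.

```python
import re

URL = "http://www.reddit.com/r/computerscience/comments/1sg69d"
PATTERN = r"https?://(([a-z]+\.)+([a-z]+))((/[a-z0-9]+)/?)"

m = re.match(PATTERN, URL)
domain = m.group(1)       # whole domain
last_label = m.group(2)   # only the LAST repetition of ([a-z]+\.) survives
tld = m.group(3)

# Workaround with a conventional engine: re-scan the captured text.
labels = re.findall(r"[a-z]+\.", domain)

print(domain, last_label, tld, labels)
```

The information about "www." is lost in `m` itself and has to be recovered by a second scan; a parse tree would keep every repetition.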
Benchmarks
Matching ((a+b)+c)+ against (a^200 b c)^2000.
Tool              Time
JParsec           4,498
java.util.regex   1,992
Ours              5,332
Extract all class names from our project with a complex regular expression [1].
Tool              Time
java.util.regex   11,319
Ours              8,047
[1] (.*?([a-z]+\.)*([A-Z][a-zA-Z]*))*.*?
Benchmarks Optimizations of the algorithm
Benchmarks – Optimizations of the algorithm
Typically most time is spent in long repetitions. We optimize for that case by:
Lazily compiling a deterministic FA.
Avoiding recreating a state if a similar state has already been seen.
Using a compressed representation when in a static repetition.
Benchmarks NFA interpretation
Example: (a?(a)b)+
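Parse (a?(a)b)+ over "a0a1b2a3b4" (the digits are the positions 0 to 4 of the characters a a b a b).
Figure: parse tree with two matches of group 1, each containing one match of group 2.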
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[], [], [], []]
q2 q3 q4
q9
q5 q6 q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[], [], [], []]
q2
[[0], [], [], []]
q3 q4
q9
q5 q6 q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[], [], [], []]
q2
[[0], [], [], []]
q3
[[0], [], [], []]
q4
q9
q5 q6 q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[], [], [], []]
q2
[[0], [], [], []]
q3
[[0], [], [], []]
q4
[[0], [], [0], []]
q9
q5 q6 q7 q8
Threads
State: q
Histories: h1 h2 h3 h4 h5 h6
A copy of the thread is modified.
Copying the array of histories makes reading a character O(m^2).
A faster persistent data structure is needed to get O(m log m).
Optimized thread forking
Set entry 2 to 20:
Figure: tree with nodes 1 to 13. The update creates a new copy of node 1 and a new node 20 in place of node 2; all other nodes are shared with the old tree.
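The idea of updating one entry while sharing the rest of the structure can be sketched with path copying. The code below is a minimal illustration under stated assumptions: it uses a balanced binary tree with values only in the leaves, and the names `Node`, `build`, `get` and `updated` are hypothetical and not taken from the tree-regex code base. An update allocates O(log m) new nodes instead of copying all m entries.

```python
from typing import Any, List, Optional


class Node:
    """Tree node that is never mutated after creation; leaves carry a value."""
    __slots__ = ("left", "right", "value")

    def __init__(self, left: Optional["Node"] = None,
                 right: Optional["Node"] = None, value: Any = None) -> None:
        self.left, self.right, self.value = left, right, value


def build(values: List[Any], lo: int, hi: int) -> Node:
    """Balanced tree over values[lo:hi]; requires hi > lo."""
    if hi - lo == 1:
        return Node(value=values[lo])
    mid = (lo + hi) // 2
    return Node(build(values, lo, mid), build(values, mid, hi))


def get(node: Node, lo: int, hi: int, index: int) -> Any:
    """Walk from the root to the leaf that stores `index`."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if index < mid:
            node, hi = node.left, mid
        else:
            node, lo = node.right, mid
    return node.value


def updated(node: Node, lo: int, hi: int, index: int, value: Any) -> Node:
    """Return a new root; only nodes on the path to `index` are copied."""
    if hi - lo == 1:
        return Node(value=value)
    mid = (lo + hi) // 2
    if index < mid:
        return Node(updated(node.left, lo, mid, index, value), node.right)
    return Node(node.left, updated(node.right, mid, hi, index, value))


old = build(list(range(8)), 0, 8)
new = updated(old, 0, 8, 2, 20)    # "set entry 2 to 20"
print(get(old, 0, 8, 2), get(new, 0, 8, 2))  # 2 20
```

Both versions stay valid after the update, which is what a forked thread needs: the parent keeps `old`, the child continues with `new`, and the untouched subtrees are shared.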
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[], [], [], []]
q2
[[0], [], [], []]
q3
[[0], [], [], []]
q4
[[0], [], [0], []]
q9
q5 q6 q7 q8
For each character read, threads start hungry and must eat immediately.
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[], [], [], []]
q2
[[0], [], [], []]
q3
[[0], [], [], []]
q4
q9
q5
[[0], [], [0], []]
q6 q7 q8
For each character read, threads start hungry and must eat immediately. Only a hungry thread can eat.
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[], [], [], []]
q2
[[0], [], [], []]
q3
[[0], [], [], []]
q4
q9
q5
[[0], [], [0], []]
q6
[[0], [], [0], [0]]
q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[], [], [], []]
q2
[[0], [], [], []]
q3 q4
q9
q5
[[0], [], [0], []]
q6
[[0], [], [0], [0]]
q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1 q2 q3
[[0], [], [], []]
q4
[[0], [], [1], []]
q9
q5
[[0], [], [0], []]
q6
[[0], [], [0], [0]]
q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1 q2 q3 q4
[[0], [], [1], []]
q9
q5 q6
[[0], [], [0], [0]]
q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1 q2 q3 q4
q9
q5
[[0], [], [1], []]
q6
[[0], [], [1], [1]]
q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1 q2 q3 q4
q9
q5 q6
[[0], [], [1], [1]]
q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[0], [2], [1], [1]]
q2
[[0,2], [2], [1], [1]]
q3
[[0,2], [2], [1], [1]]
q4
[[0,2], [2], [1,3], [1]]
q9
[[0], [2], [1], [1]]
q5 q6 q7
[[0], [], [1], [1]]
q8
[[0], [2], [1], [1]]
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1 q2 q3
[[0,2], [2], [1], [1]]
q4
[[0,2], [2], [1,4], [1]]
q9
q5
[[0,2], [2], [1,3], [1]]
q6
[[0,2], [2], [1,3], [1,3]]
q7 q8
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q1
[[0,2], [2,4], [1,3], [1,3]]
q2
[[0,2,5], [2,4], [1,3], [1,3]]
q3
[[0,2,5], [2,4], [1,3], [1,3]]
q4
[[0,2,5], [2,4,5], [1,3], [1,3]]
q9
[[0,2], [2,4], [1,3], [1,3]]
q5 q6 q7
[[0,2], [2], [1,3], [1,3]]
q8
[[0,2], [2,4], [1,3], [1,3]]
Example: (a?(a)b)+
Reading "a0a1b2a3b4"
q9
[[0,2], [2,4], [1,3], [1,3]]
Figure: parse tree over a a b a b (positions 0 to 4) with two matches of group 1, each containing one match of group 2.
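The walkthrough above can be approximated by a small thread-list simulation. The following is a sketch under several assumptions, not the implementation from the thesis: the automaton for (a?(a)b)+ is hand-compiled into a hypothetical instruction list (its numbering does not correspond to q1 to q9), histories are plain tuples (the naive copy, not the persistent structure), and positions are recorded as start index and exclusive end index, which differs from the numbers shown on the slides.

```python
from typing import List, Optional, Set, Tuple

History = Tuple[Tuple[int, ...], ...]  # (starts g1, ends g1, starts g2, ends g2)

# Hand-compiled program for (a?(a)b)+ ; "split" lists the preferred branch first.
PROGRAM = [
    ("open", 1),       # 0: enter group 1
    ("split", 2, 3),   # 1: a? prefers to read the optional a
    ("char", "a"),     # 2
    ("open", 2),       # 3: enter group 2
    ("char", "a"),     # 4
    ("close", 2),      # 5: leave group 2
    ("char", "b"),     # 6
    ("close", 1),      # 7: leave group 1
    ("split", 0, 9),   # 8: + prefers another iteration
    ("match",),        # 9
]


def record(hist: History, slot: int, pos: int) -> History:
    """Return a copy of hist with pos appended to one of its four lists."""
    return hist[:slot] + (hist[slot] + (pos,),) + hist[slot + 1:]


def add_thread(threads: List[Tuple[int, History]], seen: Set[int],
               pc: int, pos: int, hist: History) -> None:
    """Follow non-reading instructions; the first thread to reach a state wins."""
    if pc in seen:
        return
    seen.add(pc)
    op = PROGRAM[pc]
    if op[0] == "open":
        add_thread(threads, seen, pc + 1, pos, record(hist, 2 * (op[1] - 1), pos))
    elif op[0] == "close":
        add_thread(threads, seen, pc + 1, pos, record(hist, 2 * (op[1] - 1) + 1, pos))
    elif op[0] == "split":
        add_thread(threads, seen, op[1], pos, hist)
        add_thread(threads, seen, op[2], pos, hist)
    else:  # "char" or "match": the thread waits here
        threads.append((pc, hist))


def parse(text: str) -> Optional[History]:
    """Return the history of the highest-priority full match, or None."""
    threads: List[Tuple[int, History]] = []
    add_thread(threads, set(), 0, 0, ((), (), (), ()))
    for pos, ch in enumerate(text):
        nxt: List[Tuple[int, History]] = []
        seen: Set[int] = set()
        for pc, hist in threads:
            op = PROGRAM[pc]
            if op[0] == "char" and op[1] == ch:
                add_thread(nxt, seen, pc + 1, pos + 1, hist)
        threads = nxt
    for pc, hist in threads:
        if PROGRAM[pc][0] == "match":
            return hist
    return None


print(parse("aabab"))  # ((0, 3), (3, 5), (1, 3), (2, 4))
```

Under this convention, group 1 matched text[0:3] and text[3:5], and group 2 matched text[1:2] and text[3:4], which is the tree in the figure: two matches of group 1, each containing one match of group 2.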
Download
https://github.com/nes1983/tree-regex
NFA construction
Figure: NFA fragments for the following constructions.
Alternation: S1|S2
Optional: S?
Capture group: (S)
Star operation: S*?
Backtracking’s nightmare
(a+a+)+b against "a^n b" will backtrack Θ(2^n) times.
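The growth can be made visible with a toy backtracking matcher written only for the pattern (a+a+)+b. This is a sketch, not how any production engine is implemented, and `count_backtracking_steps` is a hypothetical name. It is run on an input that fails to match (n copies of a with no trailing b), an assumption chosen here because a failing input forces the matcher to exhaust every way of splitting the a's.

```python
from typing import Iterator, Tuple


def count_backtracking_steps(text: str) -> Tuple[bool, int]:
    """Full-match (a+a+)+b by naive backtracking; count group iterations tried."""
    steps = 0

    def a_plus(i: int) -> Iterator[int]:
        """Yield every end position of a+ starting at i, longest first (greedy)."""
        j = i
        while j < len(text) and text[j] == "a":
            j += 1
        for end in range(j, i, -1):
            yield end

    def group_plus(i: int) -> bool:
        """Try one iteration of (a+a+), then either loop again or read the final b."""
        nonlocal steps
        for mid in a_plus(i):
            for end in a_plus(mid):
                steps += 1
                if group_plus(end):
                    return True
                if end + 1 == len(text) and text[end] == "b":
                    return True
        return False

    matched = group_plus(0)
    return matched, steps


for n in (5, 10, 15):
    print(n, count_backtracking_steps("a" * n))
# 5 (False, 15)
# 10 (False, 511)
# 15 (False, 16383)
```

In this toy matcher the count on a failing input of n a's is 2^(n-1) - 1, so each extra character doubles the work, while a thread-list simulation reads each character once.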
Backtracking’s nightmare
Extract the first cell in a CSV that starts with "P" [1]:
^(.*?,)+(P.*?),
failing against "1,2,3,4,5,6,7,8,9,10,11,12,13"
is exponential.
[1] From http://www.regular-expressions.info/catastrophic.html
Thread execution order matters
.*(a?)
Figure: NFA for .*(a?) with states q1 (start) to q5 and transitions labelled any, τ1↑, a, τ1↓.
Priority matters
(a)|(a)
Figure: NFA for (a)|(a) with states q1 (start) to q6. One branch has the transitions τ1↑, a, τ1↓, the other τ2↑, a, τ2↓.
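Both patterns are ambiguous: more than one parse exists, and the engine has to pick one by priority. As a point of comparison, the sketch below shows which parse Python's `re` module selects (an assumption of this example; it says nothing about the internals of the engine from the talk, only about the expected result of the usual greedy, leftmost-alternative rules).

```python
import re

# .*(a?) on "a": the greedy .* runs first and eats the a,
# so the group matches the empty string at the end.
m1 = re.fullmatch(r".*(a?)", "a")
group_text, group_span = m1.group(1), m1.span(1)

# (a)|(a) on "a": the left alternative has priority,
# so the second group does not participate at all.
m2 = re.fullmatch(r"(a)|(a)", "a")
left, right = m2.group(1), m2.group(2)

print(repr(group_text), group_span, left, right)  # '' (1, 1) a None
```

A thread-based simulation has to visit threads in an order that reproduces these choices; otherwise it would report a different, equally valid but unexpected parse.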
Optimization Pipeline
1. Convert to a nondeterministic FA.
2. Interpret the nondeterministic FA, building a deterministic FA lazily.
3. Find similar/mappable states to avoid creating an infinite DFA.
4. Run on the DFA if possible.
5. Compact the DFA if creating new states has not been necessary for a while.
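Steps 2 and 4 of the pipeline can be illustrated with a lazily built DFA. The sketch below is deliberately simplified and rests on assumptions: the NFA (for (a|b)*abb) is hypothetical, has no epsilon transitions, and carries no capture histories, so step 3 (mapping similar states, which is what keeps the DFA finite once histories are involved) does not arise. `LazyDFA` is an illustrative name.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical NFA for (a|b)*abb: state -> character -> set of next states.
NFA: Dict[int, Dict[str, Set[int]]] = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
    3: {},
}
START, ACCEPT = 0, 3

DfaState = FrozenSet[int]


class LazyDFA:
    """Subset construction on demand: a DFA edge is computed the first time it is used."""

    def __init__(self) -> None:
        self.transitions: Dict[Tuple[DfaState, str], DfaState] = {}

    def step(self, state: DfaState, ch: str) -> DfaState:
        key = (state, ch)
        if key not in self.transitions:
            # Cache miss: interpret the NFA once and remember the result.
            self.transitions[key] = frozenset(
                t for s in state for t in NFA[s].get(ch, ())
            )
        return self.transitions[key]

    def matches(self, text: str) -> bool:
        state: DfaState = frozenset({START})
        for ch in text:
            state = self.step(state, ch)
        return ACCEPT in state


dfa = LazyDFA()
print(dfa.matches("ababb"), len(dfa.transitions))  # True 4
print(dfa.matches("abab"), len(dfa.transitions))   # False 4
```

The second call reuses only cached edges, which is the case the pipeline aims for: once no new states have to be created, matching is a table lookup per character.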
NFA interpretation