Regular Expression, made by To Minh Hoang - Portal Team
This is a presentation from eXo Platform SEA.
Regular Expressions
Minh Hoang TO
Portal Team
Agenda
» Finite State Machine
» Pattern Parser
» Java Regex
» Parsers in GateIn
» Advanced Theory
Finite State Machine
State Diagram
JIRA Issue Lifecycle
Java Thread Lifecycle
Java Compilation Flow
Finite State Machine - FSM
» A behavioral model that describes the workflow of a system
Finite State Machine - FSM
» A directed graph with labeled edges
Pattern Parser
Classic Problem
» A – a finite set of characters
Ex:
A = {a, b, c, d, ..., z} or A = {a, b, c, ..., z, public, class, extends, implements, while, if, ...}
» A pattern P and an input sequence INPUT, both made of elements of A
Ex:
P = “a.*b” or P = “class.*extends.*”
INPUT = “aaabbbcc” or INPUT = a Java source file
→ The parser reads INPUT character by character and recognizes all subsequences matching pattern P
Classic Problem - Samples
» Split a sequence of characters into an array of subsequences
String path = "/portal/en/classic/home";
String[] segments = path.split("/");
» Handle comment block encountered in a file
» Override readLine() in BufferedReader
» Extract data from REST response
» Write an XML parser from scratch
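As a small illustration of the first sample, the sketch below runs the same split on the slide's path. The class name SplitSample and the helper segments are made up for this example. Because the path starts with "/", the resulting array begins with an empty string.

```java
import java.util.Arrays;

public class SplitSample {
    // Splits a path on "/" exactly as in the sample above.
    static String[] segments(String path) {
        return path.split("/");
    }

    public static void main(String[] args) {
        String[] segments = segments("/portal/en/classic/home");
        // The leading "/" yields an empty first element.
        System.out.println(Arrays.toString(segments)); // prints [, portal, en, classic, home]
    }
}
```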
Finite State Machine & Classic Problem
» Acceptor FSM?
» How can the Classic Problem be transformed into a graph traversal problem with a well-known generic solution?
Find pattern occurrences ↔ Traverse a directed graph with labeled edges
FSM – Word Accepting
» Consider a word W, a sequence of characters from character set A
W = “abcd...xyz”
An FSM whose graph edges are labeled with characters from A accepts W if there exists a path connecting the START node to one of the END nodes
START = S1 → S2 → … → Sn = END
such that:
1. Intermediate nodes may appear more than once on the path
2. The transition S_i → S_(i+1) is labeled by the i-th character of W
Acceptor FSM
» Given a pattern P, an FSM is called an Acceptor FSM of P if it accepts exactly the words matching pattern P.
Ex:
The Acceptor FSM of “a[0-9]b” accepts exactly the words in the set
{ “a0b”, “a1b”, “a2b”, “a3b”, “a4b”, “a5b”, “a6b”, “a7b”, “a8b”, “a9b”}
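To make word accepting concrete, here is a minimal hand-built acceptor for a[0-9]b. The class DigitAcceptor and its state names are invented for this sketch; they are not taken from the slides or from java.util.regex.

```java
public class DigitAcceptor {
    enum State { START, A, A_DIGIT, END, FAILURE }

    // One labeled edge of the graph: (current state, character) -> next state.
    static State step(State s, char c) {
        switch (s) {
            case START:   return c == 'a' ? State.A : State.FAILURE;
            case A:       return (c >= '0' && c <= '9') ? State.A_DIGIT : State.FAILURE;
            case A_DIGIT: return c == 'b' ? State.END : State.FAILURE;
            default:      return State.FAILURE; // END and FAILURE have no outgoing edges
        }
    }

    // W is accepted if its characters label a path from START to END.
    static boolean accepts(String w) {
        State s = State.START;
        for (int i = 0; i < w.length(); i++) {
            s = step(s, w.charAt(i));
        }
        return s == State.END;
    }

    public static void main(String[] args) {
        System.out.println(accepts("a7b")); // prints true
        System.out.println(accepts("axb")); // prints false
    }
}
```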
How Does a Pattern Parser Work?
Traverse the directed graph associated with the Acceptor FSM
1. Start from the root node
2. Read the next character from INPUT, then move according to the transition rules
3. Repeat the second step until a leaf node is visited or INPUT becomes empty
4. Return OK if the leaf node represents a successful match
Example One
» Recognize pattern
eXo.*er
in:
AAAeXo123erBBBeXoerCCCeXoeXoerDDD
Example One
» Acceptor FSM with 8 states:
START – Start reading input sequence
e – encounter e
eX – encounter eX
eXo – encounter eXo
eXo.* – encounter eXo.*
eXo.*e – encounter eXo.*e
END – subsequence matching eXo.*er found
FAILURE
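The states listed above can be sketched as a hand-written scanner. This is only an illustration under stated assumptions: the class ExoScanner is invented, the "eXo" and "eXo.*" states are merged into one, END is represented by recording a match, and the scanner stops at the first "er" it meets (the behaviour of the reluctant pattern eXo.*?er rather than the greedy eXo.*er). It also lets "." match every character, including line terminators.

```java
import java.util.ArrayList;
import java.util.List;

public class ExoScanner {
    // EXO_ANY merges the "eXo" and "eXo.*" states of the slide.
    enum State { START, E, EX, EXO_ANY, EXO_ANY_E }

    static List<String> findAll(String input) {
        List<String> matches = new ArrayList<>();
        State state = State.START;
        int start = -1; // index where the current candidate match begins
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                case E:
                case EX:
                    if (state == State.E && c == 'X') {
                        state = State.EX;
                    } else if (state == State.EX && c == 'o') {
                        state = State.EXO_ANY;
                    } else if (c == 'e') {
                        state = State.E; // (re)start a candidate at this 'e'
                        start = i;
                    } else {
                        state = State.START;
                    }
                    break;
                case EXO_ANY:
                    if (c == 'e') state = State.EXO_ANY_E;
                    break;
                case EXO_ANY_E:
                    if (c == 'r') {
                        matches.add(input.substring(start, i + 1)); // END reached
                        state = State.START;
                    } else if (c != 'e') {
                        state = State.EXO_ANY;
                    }
                    break;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [eXo123er, eXoer, eXoeXoer]
        System.out.println(findAll("AAAeXo123erBBBeXoerCCCeXoeXoerDDD"));
    }
}
```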
Example Two
» Recognize comment block
/* */
in:
/* Don't ask */
final int innerClassVariable;
21
Example Two
» Acceptor FSM with 5 states:
START – start reading input sequence
OUT – stay away from comment blocks
ENTERING – at the beginning of comment block
IN – stay inside a comment block
LEAVING – at the end of comment block
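A possible rendering of these states in code is the comment stripper below. It is a sketch, not the deck's implementation: the class CommentStripper is invented, START is folded into OUT, and string literals and // line comments are deliberately ignored.

```java
public class CommentStripper {
    enum State { OUT, ENTERING, IN, LEAVING }

    // Returns the source with every /* ... */ block removed.
    static String strip(String source) {
        StringBuilder out = new StringBuilder();
        State state = State.OUT;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (state) {
                case OUT:
                    if (c == '/') state = State.ENTERING; // maybe the start of "/*"
                    else out.append(c);
                    break;
                case ENTERING:
                    if (c == '*') {
                        state = State.IN;
                    } else if (c == '/') {
                        out.append('/'); // the previous '/' was ordinary text
                    } else {
                        out.append('/').append(c);
                        state = State.OUT;
                    }
                    break;
                case IN:
                    if (c == '*') state = State.LEAVING; // maybe the start of "*/"
                    break;
                case LEAVING:
                    if (c == '/') state = State.OUT;
                    else if (c != '*') state = State.IN;
                    break;
            }
        }
        if (state == State.ENTERING) out.append('/'); // trailing lone '/'
        return out.toString();
    }

    public static void main(String[] args) {
        // prints final int innerClassVariable;
        System.out.println(strip("/* Don't ask */final int innerClassVariable;"));
    }
}
```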
Finite State Machine With Stack
» Example Two is slightly harder than Example One, as the transition decision depends on past information → We must keep something in memory
» FSM with Stack = Ordinary FSM + Stack Structure storing past info
A contextual transition is determined by the pair (next input character, stack state)
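A minimal sketch of a stack-assisted machine is a bracket checker, a task a plain FSM cannot handle for arbitrary nesting depth. The class BracketChecker is invented for illustration; each step depends on the next input character and on the top of the stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BracketChecker {
    // Accepts the input if every bracket is closed by the matching bracket, in order.
    static boolean balanced(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(' || c == '{' || c == '[') {
                stack.push(c); // remember the opening bracket
            } else if (c == ')' || c == '}' || c == ']') {
                if (stack.isEmpty()) return false;
                char open = stack.pop();
                boolean pair = (open == '(' && c == ')')
                        || (open == '{' && c == '}')
                        || (open == '[' && c == ']');
                if (!pair) return false;
            }
            // any other character leaves the stack untouched
        }
        return stack.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(balanced("f(a[0], { b })")); // prints true
        System.out.println(balanced("f(a[0)]"));        // prints false
    }
}
```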
Java Regex
Model
» Pattern: Acceptor Finite State Machine
» Matcher: Parser
java.util.regex.Pattern
» Constructs the FSM accepting a pattern
Pattern p = Pattern.compile("a.*b");
FSM states are instances of java.util.regex.Pattern$Node (in the OpenJDK implementation)
» Generates a parser working on an input sequence
Matcher matcher = p.matcher("aaabbbb");
java.util.regex.Matcher
» Find next subsequence matching pattern
find()
» Get capturing groups from latest match
group()
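The find()/group() loop can be sketched as follows, reusing the input from Example One. The class FindDemo and the helper findAll are names chosen for this illustration. The greedy pattern eXo.*er yields a single long match, while the reluctant variant eXo.*?er yields three short ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindDemo {
    // Collects every non-overlapping match of regex in input.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group()); // group() without argument is the whole match
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "AAAeXo123erBBBeXoerCCCeXoeXoerDDD";
        System.out.println(findAll("eXo.*er", input));  // prints [eXo123erBBBeXoerCCCeXoeXoer]
        System.out.println(findAll("eXo.*?er", input)); // prints [eXo123er, eXoer, eXoeXoer]
    }
}
```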
Capturing Group
Two Pattern objects
Pattern p = Pattern.compile("abcd.*efgh");
Pattern q = Pattern.compile("abcd(.*)efgh");
String text = "abcd12345efgh";
Matcher pM = p.matcher(text);
Matcher qM = q.matcher(text);
» pM.find() == qM.find(); both return true
» pM.group(1) throws IndexOutOfBoundsException (p has no group 1), while qM.group(1) returns "12345"
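The difference between the two patterns can be checked with a short sketch. GroupDemo and firstGroup are hypothetical names; the helper uses Matcher.groupCount() to avoid the IndexOutOfBoundsException that group(1) raises when the pattern declares no capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Returns group 1 of the first match, or null if there is no match
    // or the pattern declares no capturing group.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        return m.groupCount() >= 1 ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "abcd12345efgh";
        System.out.println(firstGroup("abcd.*efgh", text));   // prints null
        System.out.println(firstGroup("abcd(.*)efgh", text)); // prints 12345
    }
}
```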
Capturing Group
» Hold additional information on each match
while (matcher.find()) {
    matcher.group(index);
}
» Pattern P = (A)(B(C))
matcher.group(0) = the whole sequence ABC
matcher.group(1) = A
matcher.group(2) = BC
matcher.group(3) = C
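Groups are numbered by the position of their opening parenthesis, counted from left to right. The sketch below (GroupNumbering is an invented name) treats A, B and C as literal characters and lists every group of (A)(B(C)) matched against "ABC".

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {
    // Returns group(0)..group(n) when the whole text matches, else an empty array.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ABC, A, BC, C]
        System.out.println(Arrays.toString(groups("(A)(B(C))", "ABC")));
    }
}
```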
Capturing Group
» Pattern.compile("abc(defgh");
Pattern.compile("abcdef)gh");
→ PatternSyntaxException
» Pattern.compile("abc\\(defgh");
Pattern.compile("abcdef\\)gh");
→ Success thanks to escape character '\'
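The behaviour above can be verified with a small sketch. EscapeDemo and compiles are names made up here; Pattern.quote is the standard JDK helper that escapes a whole literal, shown as an alternative to escaping by hand.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class EscapeDemo {
    // True if the regex compiles, false if it raises PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("abc(defgh"));                // prints false
        System.out.println(compiles("abc\\(defgh"));              // prints true
        System.out.println(compiles(Pattern.quote("abc(defgh"))); // prints true
    }
}
```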
Operators
» Union
[a-zA-Z-0-9]
» Negation
[^abc]
[^X]
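A quick way to try character classes is Pattern.matches, which tests a whole input against a regex. The class CharClassDemo is invented for this sketch, and the union class used here, [a-zA-Z0-9], is a simplified variant chosen for the example.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // True if the single-character input belongs to the given character class.
    static boolean inClass(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(inClass("[a-zA-Z0-9]", "q")); // prints true
        System.out.println(inClass("[a-zA-Z0-9]", "_")); // prints false
        System.out.println(inClass("[^abc]", "d"));      // prints true
        System.out.println(inClass("[^abc]", "a"));      // prints false
    }
}
```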
Contextual Match
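» X(?=Y)
Once X is matched, look ahead and require that Y follows
» X(?!Y)
Once X is matched, look ahead and require that Y does not follow
» X(?<=Y)
Once X is matched, look behind and require that the text ending here matches Y
» X(?<!Y)
Once X is matched, look behind and require that the text ending here does not match Y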
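The four contextual constructs can be compared on small inputs. LookaroundDemo and starts are hypothetical names; the helper returns the start index of each match, which shows that the look-ahead and look-behind parts only constrain the match and are not part of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Start index of every match of regex in input.
    static List<Integer> starts(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(starts("foo(?=bar)", "foobar foobaz"));     // prints [0]
        System.out.println(starts("foo(?!bar)", "foobar foobaz"));     // prints [7]
        System.out.println(starts("bar(?<=foobar)", "foobar bazbar")); // prints [3]
        System.out.println(starts("bar(?<!foobar)", "foobar bazbar")); // prints [10]
    }
}
```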
Tips
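» A Pattern is immutable and thread-safe (a Matcher is not) → Maximize reuse
We often see:
static final Pattern p = Pattern.compile("a*b");
» Be careful with String.split
String.split treats its argument as a regex; on hot paths, compare it with a plain Java loop + String.charAt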
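Both tips can be sketched side by side. SplitTips and its two helpers are names invented for this illustration, and no performance figures are claimed. Note that the loop variant, by choice, skips empty segments, whereas Pattern.split keeps the leading empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitTips {
    // Compiled once and shared; Pattern instances are immutable.
    private static final Pattern SLASH = Pattern.compile("/");

    static String[] splitWithPattern(String path) {
        return SLASH.split(path);
    }

    // Plain loop alternative: no regex involved, empty segments are skipped.
    static List<String> splitWithLoop(String path, char separator) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= path.length(); i++) {
            if (i == path.length() || path.charAt(i) == separator) {
                if (i > start) {
                    segments.add(path.substring(start, i));
                }
                start = i + 1;
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(splitWithPattern("/portal/en/classic/home").length); // prints 5
        System.out.println(splitWithLoop("/portal/en/classic/home", '/'));      // prints [portal, en, classic, home]
    }
}
```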
Parsers in GateIn
Parsers in GateIn
» JavaScript Compressor
» CSS Compressor
» Groovy Template Optimizer
» Navigation Controller
Extracting URL param = Regex matching + Backtracking algorithm
» StaxNavigator (Nice XML parser based on StAX)
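As a rough idea of what regex-based URL parameter extraction can look like, here is a sketch using named capturing groups. It is not GateIn's code: the class UrlParams, the route layout /portal/{lang}/{site}/{path} and the parameter names are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlParams {
    // Hypothetical route: /portal/{lang}/{site}/{path}
    private static final Pattern ROUTE =
            Pattern.compile("/portal/(?<lang>[^/]+)/(?<site>[^/]+)/(?<path>.*)");

    // Returns the extracted parameters, or an empty map if the URL does not match.
    static Map<String, String> extract(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        Matcher m = ROUTE.matcher(url);
        if (m.matches()) {
            params.put("lang", m.group("lang"));
            params.put("site", m.group("site"));
            params.put("path", m.group("path"));
        }
        return params;
    }

    public static void main(String[] args) {
        // prints {lang=en, site=classic, path=home}
        System.out.println(extract("/portal/en/classic/home"));
    }
}
```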
Advanced Theory
Grammar & Language
» Any word matching pattern eXo.*er is produced by a combination of transforms, starting from S
S → eXoQer
Q → RQT
Q → ''
R → {a,b,c,d,...}
T → {a,b,c,d,...}
» Language of a Grammar = set of words generated by a finite combination of transforms, starting from S
Ex: Any valid Java source file is generated by a finite number of the transforms listed in the Java grammar (JLS)
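As a worked example, the word eXoaber can be derived step by step. This assumes the rule list reads as S → eXoQer, Q → RQT, Q → '', with R and T each producing a single character; the empty word is written ε.

```latex
\begin{align*}
S &\Rightarrow \mathrm{eXo}\,Q\,\mathrm{er} && (S \to \mathrm{eXo}Q\mathrm{er}) \\
  &\Rightarrow \mathrm{eXo}\,R\,Q\,T\,\mathrm{er} && (Q \to RQT) \\
  &\Rightarrow \mathrm{eXo}\,a\,Q\,T\,\mathrm{er} && (R \to a) \\
  &\Rightarrow \mathrm{eXo}\,a\,T\,\mathrm{er} && (Q \to \varepsilon) \\
  &\Rightarrow \mathrm{eXo}\,a\,b\,\mathrm{er} && (T \to b)
\end{align*}
```

Five transforms suffice here; longer words apply Q → RQT more times before ending with Q → ε.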
Finite State Machine & Language
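» A language accepted by an FSM with Stack (pushdown automaton) is generated by a context-free grammar
Explicit steps to build such a context-free grammar are given by the standard automaton-to-grammar construction
» A language generated by a context-free grammar is accepted by an FSM with Stack
Explicit steps to build such a Finite State Machine are given by the converse construction
Kleene's theorem is the analogous result for plain FSMs: they accept exactly the languages described by regular expressions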