Regexes vs Regular Expressions; and Recursive Descent Parser
Ras Bodik, Thibaud Hottelier, James Ide
UC Berkeley
CS164: Introduction to Programming Languages and Compilers Fall 2010

Expressiveness of recognizers
What does it mean to "tell strings apart"? Or to "test a string", or to "recognize a language", where a language is a (potentially infinite) set of strings?
It means accepting exactly those strings that have some property, such as:
– the string can be written as ('1'*k)*m, k>1, m>1, or
– the string contains only balanced parentheses: ((())()(()))
Why can't a regular expression test for ('1'*k)*m, k>1, m>1?
Recall the regular expression operators: char, concatenation (.), alternation (|), and Kleene star (*).
We can use sugar to add e+, by rewriting e+ to e.e*
We can also add e++, which means two or more of e: e++ --> e.e.e*

… continued
So it seems we can test for ('1'*k)*m, k>1, m>1, right?
    (1++)++        rewrite 1++ using e++ --> e.e+
    (11+)++        rewrite (11+)++ using e++ --> e.e+
    (11+)(11+)+
Now why isn't (11+)(11+)+ the same as (11+)\1+ ?
How do we show that these test for different properties?
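One way to show that the two patterns test different properties is to run both on a few strings of 1s. The sketch below is only an illustration: it assumes C++ `std::regex` with its default ECMAScript grammar (which supports the backreference `\1`), and the helper names `matches_plain` and `matches_backref` are invented for this example.

```cpp
#include <regex>
#include <string>

// Whole-string test against (11+)(11+)+ :
// two or more blocks, each block being two or more 1s (blocks may differ in length).
bool matches_plain(const std::string& s) {
    static const std::regex re("(11+)(11+)+");
    return std::regex_match(s, re);
}

// Whole-string test against (11+)\1+ :
// the SAME block of k >= 2 ones repeated m >= 2 times, i.e. length k*m.
bool matches_backref(const std::string& s) {
    static const std::regex re(R"((11+)\1+)");
    return std::regex_match(s, re);
}
```

A string of five 1s separates the two: `matches_plain` accepts it (11 followed by 111), while `matches_backref` rejects it, because 5 cannot be written as k*m with k>1, m>1.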

A refresher
Regexes and regular expressions both support the operators in this grammar:
R ::= char | R R | R* | R ‘|’ R
Regexes support more operators, such as backreferences (\1, \2) and capturing groups,
but let’s ignore this for now.

Regexes vs RE
Regexes are implemented with backtracking.
This regex requires exponential time to discover that it does not match the input string X==============
X(.+)+X
REs are implemented by translation to an NFA, which is then translated to a DFA.
The corresponding regular expression requires only linear time once it has been converted to a DFA.
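To make the exponential cost concrete, here is a minimal sketch of a hand-written backtracking matcher for X(.+)+X. It is not how any particular regex library is implemented. The struct `Backtracker`, the helpers `backtracking_steps` and `linear_match`, and the simplification that the match is anchored at index 0 and that `.` matches any character are all assumptions made for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical backtracking matcher for X(.+)+X, anchored at index 0.
// `steps` counts how many candidate pieces for one (.+) iteration are tried.
struct Backtracker {
    const std::string& s;
    long steps = 0;

    explicit Backtracker(const std::string& str) : s(str) {}

    // Try to match (.+)+X starting at position pos.
    bool groups_then_x(std::size_t pos) {
        // Greedy: try the longest piece for this (.+) iteration first.
        for (std::size_t end = s.size(); end > pos; --end) {
            ++steps;
            if (groups_then_x(end)) return true;               // another (.+) iteration
            if (end < s.size() && s[end] == 'X') return true;  // or the closing X
        }
        return false;
    }

    bool match() { return !s.empty() && s[0] == 'X' && groups_then_x(1); }
};

// Number of pieces the backtracker tries before it gives its answer.
long backtracking_steps(const std::string& s) {
    Backtracker b(s);
    b.match();
    return b.steps;
}

// Same question answered in one pass: the pattern accepts the same prefixes
// as X.+X, so we only need an X at index 0 and another X at index >= 2.
bool linear_match(const std::string& s) {
    return !s.empty() && s[0] == 'X' && s.find('X', 2) != std::string::npos;
}
```

On a failing input with n characters after the leading X, this sketch tries 2^n - 1 pieces, because every way of splitting the tail into non-empty pieces is explored before giving up. The single scan in `linear_match` plays the role of the DFA here.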

MatchAll
On the problem of detecting whether a pattern (regex or RE) matches the entire string, the regex and RE interpretations of a pattern agree.
– After all, to match the whole string, it is sufficient to find any one choice of repetition counts for the Kleene stars under which the pattern consumes the entire input
Let’s now focus on when regex and RE differ
Example from Jeff Friedl’s book
Imagine you want to parse a config file:
filesToCompile=a.cpp b.cpp
The regex for this command line format:
[a-zA-Z]+=.*
Now let’s allow an optional \n-separated 2nd line:
filesToCompile=a.cpp b.cpp \<\n> d.cpp e.h
We extend the original regex: [a-zA-Z]+=.*(\\\n.*)?
This regex does not match our two-line input. Why?

What compiler textbooks don’t teach you
The textbook string matching problem is simple: does a regex r match the entire string s?
– a clean statement and suitable for theoretical study
– here is where regexes and FSMs are equivalent
The matching problem in the Real World: given a string s and a regex r, find a substring in s matching r.
Do you see the language design issue here?
– There may be many such substrings.
– We need to decide which substring to find.
It is easy to agree where the substring should start:
– the matched substring should be the leftmost match

Two schools of regexes
They differ in where the match should end:
Declarative approach: longest of all matches
– conceptually, enumerate all matches and return the longest
Operational approach: define the behavior of the *, | operators
– e* : match e as many times as possible while allowing the remainder of the regex to match
– e|e : select the leftmost choice while allowing the remainder to match
[a-zA-Z]+ = .* ( \\ \n .* )?
filesToCompile=a.cpp b.cpp \<\n> d.cpp e.h
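The operational behavior can be observed directly. The following sketch assumes C++ `std::regex` with its default ECMAScript grammar, where `.` does not match a newline. The function name `leftmost_greedy_match` is made up for this example.

```cpp
#include <regex>
#include <string>

// Returns the substring that a backtracking (greedy) engine reports for
// the extended config-file regex, or an empty string if nothing matches.
std::string leftmost_greedy_match(const std::string& text) {
    // [a-zA-Z]+=.*(\\\n.*)?   (backslash, newline, then the rest of line 2)
    static const std::regex re(R"([a-zA-Z]+=.*(\\\n.*)?)");
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

On the two-line input, `.*` greedily consumes the rest of the first line, including the trailing backslash. The optional group then finds a newline where it expects a backslash, so it matches the empty string and the engine stops without ever backtracking. The reported match is the first line only. Under the declarative (longest match) reading, the same pattern would cover both lines.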

These are important differences
We saw that a non-contrived regex can behave differently under the two semantics
– personal story: I spent 3 hours debugging a similar regex
– despite reading the manual carefully
The (greedy) operational semantics of *
– does not guarantee longest match (in case you need it)
– forces the programmer to reason about backtracking
It seems that backtracking is nice to reason about
– because it’s local: no need to consider the entire regex
– cognitive load is actually higher, as it breaks composition

Where in history of re did things go wrong?
It’s tempting to blame Perl
– but the greedy regex semantics seems older
– there are other reasons why backtracking is used
Hypothesis 1: creators of re libs did not know that an NFA can
– be the target language for compiling regexes
– find all matches simultaneously (no backtracking)
– be implemented efficiently (convert NFA to DFA)
Hypothesis 2: their hands were tied
– Ken Thompson’s algorithm for re-to-NFA was patented
With backtracking came the greedy semantics
– longest match would be expensive (must try all matches)
– so semantics was defined greedily, and non-compositionally

Concepts
• Syntax tree-directed translation (re to NFA)
• recognizers: tell strings apart
• NFA, DFA, regular expressions = equally powerful
• but \1 (backreference) makes regexes more powerful
• Syntax sugar: e+ to e.e*
• Compositionality: be wary of greedy semantics
• Metacharacters: characters with special meaning

Summary of DFA, NFA, Regexp
What you need to understand and remember
– what is DFA, NFA, regular expression
– the three have equal expressive power
– what is the “expressive power”
– you can convert
  • RE → NFA → DFA
  • NFA → RE
  • and hence also DFA → RE, because DFA is a special case of NFA
– NFAs are easier to use, more costly to execute
  • NFA emulation is O(S^2)-times slower than DFA
  • conversion NFA → DFA incurs exponential cost in space
Some of these concepts will be covered in the section
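As a small illustration of NFA emulation without backtracking, the sketch below tracks the set of active states while reading the input once. The four-state NFA for the whole-string pattern X.+X is hand-built for this example, and the function name `nfa_accepts` and the bitmask encoding are assumptions, not part of the lecture.

```cpp
#include <string>

// Hand-built NFA for X.+X (whole-string match):
//   state 0 --X--> 1,  1 --any--> 2,  2 --any--> 2,  2 --X--> 3 (accepting)
// The set of active states is kept as a bitmask, one bit per state.
bool nfa_accepts(const std::string& s) {
    unsigned states = 1u << 0;  // start in state 0 only
    for (char c : s) {
        unsigned next_states = 0;
        if ((states & (1u << 0)) && c == 'X') next_states |= 1u << 1;
        if (states & (1u << 1)) next_states |= 1u << 2;
        if (states & (1u << 2)) {
            next_states |= 1u << 2;                  // stay inside .+
            if (c == 'X') next_states |= 1u << 3;    // or treat this X as the last one
        }
        states = next_states;
    }
    return (states & (1u << 3)) != 0;
}
```

Each input character is processed once, and all alternatives are followed simultaneously, so the running time is linear in the input length for a fixed NFA.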
Recursive descent parser
Recursive Descent Parser
Poor man’s backtracking parser
– does not do full backtracking, so you must be a bit careful
– but quite fast, despite backtracking, and simple to implement
many successful languages are implemented with an r.d. parser
– in many situations, this parser is all you will need
when could you use an even simpler parser?
– when the grammar is not (heavily) recursive
  • ex: parse a formatted email message for answers to a quiz
– you could use the “spaghetti code” parser from last lecture
– but this simplification may not be worth it
– because an r.d. parser makes the grammar clear and maintainable

A Recursive Descent Parser (1)
write a function for each terminal, production, non-terminal
– return true iff input matches that terminal, production, n/t
– advance next
Terminals:
    bool term(TOKEN tok) { return in[next++] == tok; }
nth production of non-terminal S:
    bool Sn() { … }
non-terminal S:
    bool S() { … }

A Recursive Descent Parser (2)
For production E → T + E
    bool E1() { return T() && term(PLUS) && E(); }
For production E → T
    bool E2() { return T(); }
For all productions of E (with backtracking)
    bool E() {
        int save = next;
        return (next = save, E1()) || (next = save, E2());
    }
Ras Bodik, CS 164, Spring 2007
A Recursive Descent Parser (4)
Functions for non-terminal T
    bool T1() { return term(OPEN) && E() && term(CLOSE); }
    bool T2() { return term(INT) && term(TIMES) && T(); }
    bool T3() { return term(INT); }
    bool T() {
        int save = next;
        return (next = save, T1())
            || (next = save, T2())
            || (next = save, T3());
    }
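Put together, the functions above can be assembled into a compilable sketch. The wrapper struct `Parser`, the `END` sentinel token, the bounds check in `term`, and the entry point `parse` are additions made for this example and do not appear in the slides.

```cpp
#include <cstddef>
#include <vector>

enum TOKEN { INT, PLUS, TIMES, OPEN, CLOSE, END };

// Grammar:  E -> T + E | T        T -> ( E ) | int * T | int
struct Parser {
    std::vector<TOKEN> in;   // token stream, terminated by END
    std::size_t next = 0;    // index of the next unread token

    // Match one terminal and advance (the bounds check is an addition).
    bool term(TOKEN tok) { return next < in.size() && in[next++] == tok; }

    bool E1() { return T() && term(PLUS) && E(); }
    bool E2() { return T(); }
    bool E() {
        std::size_t save = next;  // restore point for backtracking
        return (next = save, E1()) || (next = save, E2());
    }

    bool T1() { return term(OPEN) && E() && term(CLOSE); }
    bool T2() { return term(INT) && term(TIMES) && T(); }
    bool T3() { return term(INT); }
    bool T() {
        std::size_t save = next;
        return (next = save, T1())
            || (next = save, T2())
            || (next = save, T3());
    }
};

// Hypothetical entry point: accept iff E derives the whole token list.
bool parse(std::vector<TOKEN> tokens) {
    tokens.push_back(END);
    Parser p;
    p.in = tokens;
    return p.E() && p.term(END);
}
```

For example, `parse({INT, TIMES, INT})` succeeds through T2 followed by E2, while `parse({INT, PLUS})` fails because E stops after the first `INT` and the remaining `PLUS` is not the `END` token.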

Recursive Descent Parsing. Notes.
To start the parser
– Initialize next to point to first token
– Invoke E()
Notice how this simulates our backtracking parser
– but r.d. parser does not perform full backtracking
– this is important to remember (see example in a HW)
LL and LR parsing algorithms are more efficient
– see a compiler textbook if interested

First problem with Recursive-Descent Parsing
Parsing:
– given a string of tokens t1 t2 ... tn, find its parse tree
Recursive-descent parsing, backtracking parsing
– Try all the productions (almost) exhaustively
– At a given moment the fringe of the parse tree is: t1 t2 … tk A …
– i.e., the parser will eventually derive a string starting with terminals
– the parser compares this prefix with the remainder of the input
– if mismatch, the parser backtracks
• but there are grammars such that
– the parser will NEVER derive a string starting with a terminal
Eliminating left recursion

When Recursive Descent Does Not Always Work
Consider a production S → S a:
– In the process of parsing S we try the above rule
– What goes wrong?
A left-recursive grammar has a non-terminal S such that
    S →+ S α for some α        (→+ : derives in one or more steps)
Recursive descent may not work in such cases
– It may go into an infinite loop
You say “may”?
– is there a left-recursive grammar that r.d. parsers can handle?

Elimination of Left Recursion
• Consider the left-recursive grammar
    S → S α | β
• S generates all strings starting with a β and followed by a number of α
• Can rewrite using right-recursion
    S → β S’
    S’ → α S’ | ε

Elimination of Left-Recursion. Example
• Consider the grammar
    S → 1 | S 0        (β = 1 and α = 0)
can be rewritten as
    S → 1 S’
    S’ → 0 S’ | ε
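The rewritten grammar can be parsed by recursive descent without looping. The sketch below is a minimal recognizer for it over a string of characters. The struct `Recognizer`, the method name `Sprime`, and the helper `accepts` are names chosen for this example.

```cpp
#include <cstddef>
#include <string>

// Right-recursive grammar:   S -> 1 S'      S' -> 0 S' | epsilon
// (A direct function for the left-recursive S -> S 0 would call S() first
//  and recurse forever without consuming any input.)
struct Recognizer {
    std::string in;
    std::size_t next = 0;

    bool term(char c) { return next < in.size() && in[next++] == c; }

    bool Sprime() {
        std::size_t save = next;
        if (term('0') && Sprime()) return true;  // S' -> 0 S'
        next = save;                             // S' -> epsilon consumes nothing
        return true;
    }

    bool S() { return term('1') && Sprime(); }   // S -> 1 S'
};

// Accept iff S derives the whole string.
bool accepts(const std::string& s) {
    Recognizer r;
    r.in = s;
    return r.S() && r.next == s.size();
}
```

The recognizer accepts exactly a 1 followed by any number of 0s, which is the same language as the original left-recursive grammar.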

Oops, didn’t we break anything in the process?
Consider the grammar for additions:
    E → E + id | id
After left-recursion elimination:
    E → id E’
    E’ → + id E’ | ε
Draw the parse tree for id+id+id
(your figure comes here)

More Elimination of Left-Recursion
In general
    S → S α1 | … | S αn | β1 | … | βm
All strings derived from S start with one of β1,…,βm and continue with several instances of α1,…,αn
Rewrite as
    S → β1 S’ | … | βm S’
    S’ → α1 S’ | … | αn S’ | ε

General Left Recursion
• The grammar
    S → A α | δ
    A → S β
  is also left-recursive because
    S →+ S β α
• This left-recursion can also be eliminated
• See [ALSU], Section 4.3 for general algorithm

A comment on removing left recursion
• Not a big deal in practice
– i.e., you won’t have to convert from left recursion too often
• Just define a right-recursive grammar from the start
– works for many cases
• Example: list of arguments
– btw, lists are common in programming language grammars
– Left recursive: LIST → id | LIST , id
– Right recursive: LIST → id | id , LIST
– Just opt for the second alternative!
Left Factoring

Are all grammars equally efficient for r.d.p.?
• Consider this grammar:
    E → T + E | T – E | T
    T → id * T | id / T | id
• Parse this string
    id * id
• Do you see the inefficiency?
– the parser will repeat this derivation three times (try it)
    T → id * T → id * id

Left Factoring
• reduces backtracking in r.d. parser
• before
    E → T + E | T – E | T
    T → id * T | id / T | id
• after
    E → T E’
    E’ → + E | – E | ε
    T → id T’
    T’ → * T | / T | ε
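The effect of left factoring can be measured by counting how often the parser enters T. The sketch below is illustrative only: tokens are single characters ('i' stands for id), and the structs `Counted`, `Unfactored`, `Factored` and the two helper functions are names invented for this example.

```cpp
#include <cstddef>
#include <string>

// Shared state: input, cursor, and a counter of calls to T().
struct Counted {
    std::string in;
    std::size_t next = 0;
    int t_calls = 0;
    bool term(char c) { return next < in.size() && in[next++] == c; }
};

// Before:  E -> T + E | T - E | T      T -> id * T | id / T | id
struct Unfactored : Counted {
    bool T() {
        ++t_calls;
        std::size_t save = next;
        return (next = save, term('i') && term('*') && T())
            || (next = save, term('i') && term('/') && T())
            || (next = save, term('i'));
    }
    bool E() {
        std::size_t save = next;
        return (next = save, T() && term('+') && E())
            || (next = save, T() && term('-') && E())
            || (next = save, T());
    }
};

// After:  E -> T E'   E' -> + E | - E | eps   T -> id T'   T' -> * T | / T | eps
struct Factored : Counted {
    bool Tp() {
        std::size_t save = next;
        return (next = save, term('*') && T())
            || (next = save, term('/') && T())
            || (next = save, true);              // epsilon
    }
    bool T() { ++t_calls; return term('i') && Tp(); }
    bool Ep() {
        std::size_t save = next;
        return (next = save, term('+') && E())
            || (next = save, term('-') && E())
            || (next = save, true);              // epsilon
    }
    bool E() { return T() && Ep(); }
};

int t_calls_unfactored(const std::string& s) {
    Unfactored p; p.in = s; p.E(); return p.t_calls;
}
int t_calls_factored(const std::string& s) {
    Factored p; p.in = s; p.E(); return p.t_calls;
}
```

On the input `i*i`, the unfactored parser re-derives T once for each of the three E productions, while the factored parser derives it once and then decides what follows.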
Limited Backtracking

Order of productions may matter in r.d. parser
• Consider this grammar
    E → T + E | T – E | T
    T → F | F * T | F / T        (here we are trying T → F first)
    F → id | n | ( E )
• Now try to parse
    id * id
• Why does the r.d. parser return “syntax error”?
– it never backtracks and tries T → F * T
– it only tries T → F and succeeds
• Lesson: put longer productions first
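The ordering problem can be reproduced with a small sketch. Tokens are single characters here ('i' for id, 'n' for a number), and the struct `OrderedParser`, its `short_first` flag, and the helper `parses` are assumptions introduced for this example.

```cpp
#include <cstddef>
#include <string>

// E -> T + E | T - E | T      F -> id | n | ( E )
// T tries either  F | F * T | F / T   (short_first == true)
//         or      F * T | F / T | F   (short_first == false)
struct OrderedParser {
    std::string in;
    bool short_first = false;
    std::size_t next = 0;

    bool term(char c) { return next < in.size() && in[next++] == c; }

    bool F() {
        std::size_t save = next;
        return (next = save, term('i'))
            || (next = save, term('n'))
            || (next = save, term('(') && E() && term(')'));
    }
    bool T() {
        std::size_t save = next;
        if (short_first) {
            // T -> F succeeds at once, so the longer productions are never tried.
            return (next = save, F())
                || (next = save, F() && term('*') && T())
                || (next = save, F() && term('/') && T());
        }
        return (next = save, F() && term('*') && T())
            || (next = save, F() && term('/') && T())
            || (next = save, F());
    }
    bool E() {
        std::size_t save = next;
        return (next = save, T() && term('+') && E())
            || (next = save, T() && term('-') && E())
            || (next = save, T());
    }
};

// Accept iff E derives the whole input; leftover input is a syntax error.
bool parses(const std::string& s, bool short_first) {
    OrderedParser p;
    p.in = s;
    p.short_first = short_first;
    return p.E() && p.next == s.size();
}
```

With `short_first` set, `parses("i*i", true)` fails: T returns after consuming only the first `i`, and nothing ever asks it to try again. With the longer productions first, the same input is accepted.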

Summary of Recursive Descent
• Simple and general parsing strategy
– Left recursion must be eliminated first
– Left factoring not essential but helps reduce backtracking
– Ambiguity must be removed
– Order of productions compensates for limited backtracking
• Do you have to do all these by hand?
– first two can be done automatically
– third needs intelligence
– last could perhaps be automated, too