bottom-up parsing
DESCRIPTION
Bottom-Up Parsing. CS 471 September 19, 2007. Where Are We?. Finished Top-Down Parsing Starting Bottom-Up Parsing. Lexical Analysis. Syntactic Analysis. Semantic Analysis. Building a Parser. Have a complete recipe for building a parser. Language Grammar. LL(1) Grammar. - PowerPoint PPT PresentationTRANSCRIPT
Bottom-Up Parsing
CS 471September 19,
2007
CS 471 – Fall 2007
Where Are We?
Finished Top-Down Parsing
Starting Bottom-Up Parsing
Lexical Analysis
Syntactic Analysis
Semantic Analysis
CS 471 – Fall 2007
Building a Parser
Have a complete recipe for building a parser
Language Grammar
LL(1) Grammar
Predictive Parse Table
Recursive-Descent Parser
Recursive-Descent Parser w/AST Gen
CS 471 – Fall 2007
Bottom-Up Parsing
More general than top-down parsing• And just as efficient• Builds on ideas in top-down parsing• Preferred method in practice
Also called LR parsing• L means that tokens are read left to right• R means that it constructs a rightmost derivation
CS 471 – Fall 2007
Top Down vs. Bottom Up Parsing
Bottom-up: Don’t need to figure out as much of the parse tree for a given amount of input
Top Down Bottom Up
scanned
unscanned
CS 471 – Fall 2007
An Introductory Example
Consider the following grammar:
E E + ( E ) | int
• Why is this not LL(1)?
LR parsers:• Can handle left-recursion• Don’t need left factoring
CS 471 – Fall 2007
The Idea
LR parsing reduces a string to the start symbol by inverting productions:
str = input string of terminals repeat
– Identify in str such that A is a production (i.e., str = )– Replace by A in str (i.e., str becomes A )
until str = G
CS 471 – Fall 2007
A Bottom-up Parse in Detail (1)
int++int int
( )
int + (int) + (int)
()
CS 471 – Fall 2007
A Bottom-up Parse in Detail (2)
E
int++int int
( )
int + (int) + (int)E + (int) + (int)
()
CS 471 – Fall 2007
A Bottom-up Parse in Detail (3)
E
int++int int
( )
int + (int) + (int)E + (int) + (int)E + (E) + (int)
()
E
CS 471 – Fall 2007
A Bottom-up Parse in Detail (4)
E
int++int int
( )
int + (int) + (int)E + (int) + (int)E + (E) + (int)E + (int) E
()
E
CS 471 – Fall 2007
A Bottom-up Parse in Detail (5)
E
int++int int
( )
int + (int) + (int)E + (int) + (int)E + (E) + (int)E + (int)E + (E)
E
()
EE
CS 471 – Fall 2007
A Bottom-up Parse in Detail (6)
E
E
int++int int
( )
int + (int) + (int)E + (int) + (int)E + (E) + (int)E + (int)E + (E)E
E
()
EE
CS 471 – Fall 2007
Another example
Grammar:
Is “abbcde” in L(G)?
Yes“Reverse” derivation:
# Production rule
1
2
3
4
G → a A B e
A → A b c
| b
B → d
Rule
Sentential form
-
3
2
4
1
abbcde
aAbcde
aAde
aABe
G
a b b c d e
AB
A
G
CS 471 – Fall 2007
Choosing reductions
Basic algorithm:• Search for right sides of productions, reduce• Does this work?
Not always:
Problem: “aAAcde” is not part of any sentential form
Rule Sentential form
-
3
2
?
abbcde
aAbcde
aAAcde
…now what?
# Production rule
1
2
3
4
G → a A B e
A → A b c
| b
B → d
CS 471 – Fall 2007
How do we choose?
Important Fact #1 about bottom-up parsing:
An LR parser traces a rightmost derivation in reverse
CS 471 – Fall 2007
A →
Why does this help?
Right-most derivation
• A is the right-most non-terminal in 3
• contains only terminal symbols• Unambiguous grammar
– Right-most derivation is unique At each step, reduction is unique
G → 1 → 2 → 3 → 4 → 5 → input
CS 471 – Fall 2007
Notation
Split input into two substrings• Right substring (a string of terminals) is as yet
unexamined by parser• Left substring has terminals and non-terminals
The dividing point is marked by a I• The I is not part of the string
Initially, all input is unexamined: Ix1x2 . . . xn
CS 471 – Fall 2007
Shift-Reduce Parsing
Bottom-up parsing uses only two kinds of actions: Shift and Reduce
Shift: Move I one place to the right• Shifts a terminal to the left string
E + (I int ) E + (int I )
Reduce: Apply an inverse production at the right end of the left string• If E E + ( E ) is a production, then
E + (E + ( E ) I ) E +(E I )
CS 471 – Fall 2007
Shift-Reduce Example
E
E
int++int int
( )
E
()
EE
I int + (int) + (int)$ shift
int I + (int) + (int)$ red. E int
E I + (int) + (int)$ shift 3 times
E + (int I ) + (int)$ red. E int
E + (E I ) + (int)$ shift
E + (E) I + (int)$ red. E E + (E)
E I + (int)$ shift 3 times
E + (int I )$ red. E int
E + (E I )$ shift
E + (E) I $ red. E E + (E)
E I $ accept
CS 471 – Fall 2007
How do we keep track?
Left part string implemented as a stack• Top of the stack is the I• Shift:
– Pushes a terminal on the stack• Reduce:
– Pops 0 or more symbols off of the stack– Symbols are right-hand side of a production– Pushes a non-terminal on the stack (production
LHS)• Terminology
– We refer to the top set of symbols as a handle
CS 471 – Fall 2007
Shift-Reduce Parsing
derivation stack input stream action
(1+2+(3+4))+5 ← (1+2+(3+4))+5 shift
(1+2+(3+4))+5 ← ( 1+2+(3+4))+5 shift
(1+2+(3+4))+5 ← (1 +2+(3+4))+5 reduce E→num
(E+2+(3+4))+5 ← (E +2+(3+4))+5 reduce S → E
(S+2+(3+4))+5 ← (S +2+(3+4))+5 shift
(S+2+(3+4))+5 ← (S+ 2+(3+4))+5 shift
(S+2+(3+4))+5 ← (S+2 +(3+4))+5 reduce E→num
(S+E+(3+4))+5 ← (S+E +(3+4))+5 reduce S→S+E
(S+(3+4))+5 ← (S +(3+4))+5 shift
(S+(3+4))+5 ← (S+ (3+4))+5 shift
(S+(3+4))+5 ← (S+( 3+4))+5 shift
(S+(3+4))+5 ← (S+(3 +4))+5 reduce E→num
CS 471 – Fall 2007
Problem
• How do we know which action to take -- whether to shift or reduce, and which production?
• Sometimes can reduce but shouldn’t
–e.g., X → ε can always be reduced
• Sometimes can reduce in different ways
CS 471 – Fall 2007
Action Selection Problem
• Given stack σ and look-ahead symbol b, should parser:
– shift b onto the stack (making it σb)– reduce some production X → γ assuming that
stack has the form γ (making it X)
• If stack has form γ, should apply reduction X → γ (or shift) depending on stack prefix
is different for different possible reductions, since γ’s have different length.
– How to keep track of possible reductions?
CS 471 – Fall 2007
Parser States
• Goal: know what reductions are legal at any given point
• Idea: summarize all possible stack prefixes as a finite parser state
• Parser state is computed by a DFA that reads in the stack • Accept states of DFA: unique reduction!
• Summarizing discards information– affects what grammars parser handles– affects size of DFA (number of states)
CS 471 – Fall 2007
LR(0) Parser
• Left-to-right scanning, Right-most derivation, “zero” look-ahead characters
• Too weak to handle most language grammars (e.g., “sum” grammar)
• But will help us understand shift-reduce parsing
CS 471 – Fall 2007
LR(0) States
• A state is a set of items keeping track of progress on possible upcoming reductions
• An LR(0) item is a production from the language with a separator “.” somewhere in the RHS of the production
• Stuff before “.” is already on stack (beginnings of possible γ’s to be reduced)
• Stuff after “.” : what we might see next
• The prefixes represented by state itself
E→num ● E→ (● S )
state
item
CS 471 – Fall 2007
Start State & Closure
Constructing a DFA to read stack:
• First step: augment grammar with prod’n S →S $
• Start state of DFA: empty stack = S → . S $
• Closure of a state adds items for all productions whose LHS occurs in an item in the state, just after “.”
– Set of possible productions to be reduced next– Added items have the “.” located at the beginning: no
symbols for these items on the stack yet
S →. S $S →. S $
S → . ( L )S → . id
S →( L ) | idL →S | L , S
closure
CS 471 – Fall 2007
Applying Terminal Symbols
In new state, include all items that have appropriate input symbol just after dot, advance dot in those items, and take closure.
S ’ → . S $S → . ( L )S → . id
S → id
S →( . L )L → . S
L → . L , SS →. ( L )S → . id
S →( L ) | idL →S | L , S
(
id id (
CS 471 – Fall 2007
Applying Nonterminal Symbols
• Non-terminals on stack treated just like terminals (except added by reductions)
S ’ → . S $S → . ( L )S → . id
S → id
S →( . L )L → . S
L → . L , SS →. ( L )S → . id
(
idid (
L → S .
S →( L . )L → L . , S
L
S
S →( L ) | idL →S | L , S
CS 471 – Fall 2007
Applying Reduce Actions
• Pop RHS off stack, replace with LHS X (X→γ)
S ’ → . S $S → . ( L )S → . id
S → id .
S →( . L )L → . S
L → . L , SS →. ( L )S → . id
(
idid (
L → S .
S →( L . )L → L . , S
L
S
States causing reductions
CS 471 – Fall 2007
Full DFA (Appel p. 62)
• reduce-only state: reduce
• if shift transition for look-ahead: shift otherwise: syntax error
• current state: push stack through DFA
S ’ → . S $S → . ( L )S → . id
S ’ → S . $
final state
S →( . L )L → . SL → . L , SS →. ( L )S → . id
S →id . L → L , . SS → . ( L )S → . id
S → ( L . )L → L . , S
L → S . S → ( L ) .
L → L , S .
S
$
(
id
S
id
L
)
,
idS
(
1
4
2
3
7
8
5
6
9
S →( L ) | idL →S | L , S
(
CS 471 – Fall 2007
Parsing Example: ((x),y)
derivation stack input action((x),y) ← 1 ((x),y) shift, goto 3((x),y) ← 1 (3 (x),y) shift, goto 3((x),y) ← 1 (3 (3 x),y) shift, goto 2((x),y) ← 1 (3 (3 x2 ),y) reduce Sid((S),y) ← 1 (3 (3 S7 ),y) reduce LS((L),y) ← 1 (3 (3 L5 ),y) shift, goto 6((L),y) ← 1 (3 (3 L5)6 ,y) reduce S(L)(S,y) ← 1 (3 S7 ,y) reduce LS(L,y) ← 1 (3 L5 ,y) shift, goto 8(L,y) ← 1 (3 L5 , 8 y) shift, goto 9(L,y) ← 1 (3 L5 , 8 y2 ) reduce Sid(L,S) ← 1 (3 L5 , 8 S9 ) reduce LL,S(L) ← 1 (3 L5 ) shift, goto 6(L) ← 1 (3 L5 )6 reduce S(L)S 1 S4 $ done
S →( L ) | idL →S | L , S
CS 471 – Fall 2007
Implementation: LR Parsing Table
nextaction
nextstate
input (terminal) symbols non-terminal symbols
state
state
Action table
Used at every step to decide whether to
shift or reduce
Goto table
Used only when reducing, to determine
next state
a XX ▪
CS 471 – Fall 2007
Shift-Reduce Parsing Table
Action table
1. shift and goto state n
2. reduce using X → γ– pop symbols γ off stack– using state label of top (end) of stack, look up X
in goto table and goto that state
• DFA + stack = push-down automaton (PDA)
next actions
next state on red’n
state
terminal symbols non-terminal symbols
CS 471 – Fall 2007
List Grammar Parsing Table
( ) id , $ S L
1 s3 s2 g4
2 S→id S→id S→id S→id S→id
3 s3 s2 g7 g5
4 accept
5 s6 s8
6 S→(L) S→(L) S→(L) S→(L) S→(L)
7 L→S L→S L→S L→S L→S
8 s3 s2 g9
9 L→L,S L→L,S L→L,S L→L,S L→L,S
CS 471 – Fall 2007
Shift-Reduce Parsing
• Grammars can be parsed bottom-up using a DFA + stack
– DFA processes stack σ to decide what reductions might be possible given
– shift-reduce parser or push-down automaton (PDA)– Compactly represented as LR parsing table
• State construction converts grammar into states that decide action to take
CS 471 – Fall 2007
Checkpoint
• Limitations of LR(0) grammars
• SLR, LR(1), LALR parsers
• Automatic parser generators
CS 471 – Fall 2007
LR(0) Limitations
• An LR(0) machine only works if states with reduce actions have a single reduce action – in those states, always reduce ignoring lookahead
• With more complex grammar, construction gives states with shift/reduce or reduce/reduce conflicts
• Need to use look-ahead to choose
L → L , S . L → L , S .S → S ., L
L → L , S .L → S .
ok shift/reduce reduce/reduce
CS 471 – Fall 2007
LR(0) Construction
+ $ E
1 2
2 S3/ SE SE
S’ →. S $S→ . E + SS→ . EE→ . numE→ . ( S )
S→E . + SS→E .
S→E + . S
S→ E + S | EE→num | ( S )
E +
1
23
What do we do in state 2?
CS 471 – Fall 2007
SLR grammars
Idea: Only add reduce action to table if look-ahead symbol is in the FOLLOW set of the non-terminal being reduced
• Eliminates some conflicts
• FOLLOW(S) = { $, ) }
• Many language grammars are SLR
+ $ E
1 2
2 s3 SE
CS 471 – Fall 2007
LR(1) Parsing
• As much power as possible out of 1 lookahead symbol parsing table
• LR(1) grammar = recognizable by a shift/reduce parser with 1 look-ahead.
• LR(1) item = LR(0) item + look-ahead symbols possibly following production
LR(0): S→ . S + E
LR(1): S→ . S + E +
CS 471 – Fall 2007
LR(1) State
• LR(1) state = set of LR(1) items
• LR(1) item = LR(0) item + set of lookahead symbols
• No two items in state have same production + dot configuration
S→S . + E +S→S . + E $S→S + . E num
S→S . + E +,$S→S + . E num
CS 471 – Fall 2007
LR(1) Closure
Consider A→β . C δ λ Closure formed just as for LR(0) except
1. Look-ahead symbols include characters following the non-terminal symbol to the right of dot: FIRST(δ)
2. If non-terminal symbol may produce last symbol of production (δ is nullable), look-ahead symbols include look-ahead symbols of production (λ)
S →. S $S→ . E + S $S→ . E $E→ . num +,$E→ . ( S ) +,$
S→ E + S | EE→num | ( S )
1
2
CS 471 – Fall 2007
LR(1) DFA construction
Given LR(1) state, for each symbol (terminal or non-terminal) following a dot, construct a state with dot shifted across symbol, perform closure
S→ E + S | EE→num | ( S )
S’ →. S $S→ . E + S $S→ . E $E→ . num +,$E→ . ( S) +,$
S→E . + S $S→E . $
E
1
2
CS 471 – Fall 2007
LR(1) example
Reductions unambiguous if: look-aheads are disjoint, not to right of any dot in state
S→ E + S | EE→num | ( S )
S’ →. S $S→ . E + S $S→ . E $E→ . num +,$E→ . ( S) +,$
S→E . + S $S→E . $
E
1
2
+ $ E
1 2
2 s3 SE
CS 471 – Fall 2007
LALR Grammars
• Problem with LR(1): too many states
• LALR(1) (Look-Ahead LR)– Merge any two LR(1) states whose items are
identical except look-ahead– Results in smaller parser tables—works extremely
well in practice– Usual technology for automatic parser generators
S→id . +S→E . $
S→id . $S→E . ++ = ?
CS 471 – Fall 2007
How are Parsers Written?
• Automatic parser generators: yacc, bison
• Accept LALR(1) grammar specification– plus: declarations of precedence, associativity– output: LR parser code (inc. parsing table)
• Some parser generators accept LL(1) – less powerful
CS 471 – Fall 2007
Associativity
S→ S + E | E
E→num | ( S )
E→ E + E | num | ( E )
What happens if we run this grammar
through LALR construction?
CS 471 – Fall 2007
Conflict!
E→ E + E | num | ( E )
E→ E + E . +E→ E . + E +,$
1+2+3
shift/reduce conflict
Shift: 1+(2+3)
Reduce: (1+2)+3^
CS 471 – Fall 2007
Grammar in Parser Generators
non terminal E; terminal PLUS, LPAREN...
precedence left PLUS;
E ::= E PLUS E
| LPAREN E RPAREN
| NUMBER ;
“When shifting + conflicts with reducinga production containing +, choose reduce”
CS 471 – Fall 2007
Precedence
Can also handle operator precedence
E→ E + E | T
T →T T | num | ( E )
E→ E + E | E E
| num | ( E )
CS 471 – Fall 2007
Conflicts without Precedence
E→ E + E | E E
| num | ( E )
E→ E . + E …E→ E E . +
E→ E + E . E→ E . E …
CS 471 – Fall 2007
Conflicts without Precedence
precedence left PLUS;
precedence left TIMES; // TIMES > PLUS
E ::= E PLUS E | E TIMES E | ...
E→ E . + E …E→ E E . +
E→ E + E . E→ E . E …
Rule: in conflict, choose reduce if production symbol higher precedence than shifted symbol; choose shift if vice-versa
CS 471 – Fall 2007
Parsing Summary
• Look-ahead information makes SLR(1), LALR(1), LR(1) grammars expressive
• Automatic parser generators support LALR(1)
• Precedence, associativity declarations simplify grammar writing
• Easiest and best way to read structured human-readable input
Coming Up:
• PA3: Defining a grammar for Tiger (Fri 9/28)– No AST … yet
• Test1: Wed 10/3