everything you always wanted to know about parsing part v...
TRANSCRIPT
Generalized LR
Everything You AlwaysWanted to Know About Parsing
Part V : LR Parsing
Giorgio SattaUniversity of Padua, Italy
ESSLLI, August 2013
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Introduction
Parsing strategies classified by the time the associated PDAcommits to a production in the analysis :
A→ B β
B
Top-Down
Left-Corner
Bottom-Up
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Introduction
LR parsing discovered in 1965 in an article by Donald Knuth,California Institute of Technology, Pasadena, California
‘L’ stands for left-to-right parsing and ‘R’ stands for reversedrightmost derivation
Works in linear time for deterministic context-free languages
The algorithm consists of a shift-reduce, deterministic PDA calledLR PDA, realizing a bottom-up parsing strategy
Widely used for parsing of programming languages, where LRPDAs are mechanically generated from a formal grammar
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Introduction
Generalized LR parsing (GLR for short) extends the LR methodto handle nondeterministic and ambiguous CFGs
First discovered in 1985 by Masaru Tomita in his PhDdissertation, Carnegie Mellon University
Very popular in computational linguistics in the late eighties
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Introduction
The GLR algorithm consists of
the LR PDA as in Knuth original proposal
the definition of a technique for simulating the LR PDA,based on a data structure called graph structured stack
We disregard the graph structured stack, and apply ourtabulation algorithm to simulate the LR PDA
The outcome is an optimized version of the original GLR parsingalgorithm
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
Let P be a set of CFG productions; a stack symbol of the LRPDA is a non-empty subset of
IP = {A→ α1 • α2 | (A→ α1α2) ∈ P}
satisfying some additional property, to be introduced later
Informally, a stack symbol represents a disjunction of dotted rules;in this way several alternative parsing analyses can be carriedout simultaneously
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
Example : Stack symbol q = {A→ B C • γ, A′ → C • γ′}represents two alternative analyses; observe how the stringspreceding the dots are suffixes of a common string
S
A
a1 · · · aj BC γ
aj+1 · · · ai
S
a1 · · · aj
B
aj+1 · · · aj ′
A′′
A′ · · ·
C γ′
aj ′+1 · · · aiGiorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
We need a function closure to precompute the closure of theprediction step from the Earley algorithm
Let q ⊆ IP ; closure(q) is the smallest set such that :
q ⊆ closure(q); and
if (A→ α • Bα′) ∈ closure(q) and (B → β) ∈ P then(B → • β) ∈ closure(q)
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
Example :
if P consists of the productions :
S → S + S ,
S → a
we have
closure({S → S + • S}) =
{S → S + • S , S → • S + S , S → • a}
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
We need a function goto to precompute the scanner and thecompleter steps from the Earley algorithm, followed by closure()
Let q ⊆ IP and X ∈ N ∪Σ; we have
goto(q,X ) =
closure({A→ αX • α′ | (A→ α • Xα′) ∈ q})
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
Example (cont’d) :
if P consists of the productions :
S → S + S ,
S → a
we have
goto({S → S + • S , S → • S + S , S → • a}, a)
= {S → a •}goto({S → S + •S , S → • S + S , S → • a}, S)
= {S → S + S •, S → S •+S}
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
The initial stack symbol of the LR PDA is
qin = closure({S → •α | (S → α) ∈ P})
The final stack symbol of the LR PDA is a special symbol qfin;
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
The stack alphabet QLR of the LR PDA is the smallest setsatisfying :
qin ∈ QLR ;
qfin ∈ QLR ; and
if q ∈ QLR and goto(q,X ) 6= ∅ for some X ∈ N ∪Σ, thenq′ ∈ QLR
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Stack Symbols
Example (cont’d) :
P consists of the productions S → S + S and S → a
The goto function, also called the LR table, excluding qfin :
qin
S → • S + SS → • a
S
a
q2
S → S •+S +
q1
S → a • a
q3
S → S + • SS → • S + SS → • a
S
+
q4
S → S + S •S → S •+S
By construction, in each state the strings preceding the dots aresuffixes of a common string
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
LR PDA
The LR PDA MLR is constructed from G as follows
The input alphabet of MLR is the terminal alphabet of G
The stack alphabet of MLR is QLR , as previously defined
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
LR PDA
Set ∆LR contains transitions having one of the following forms :
shift
q1 7→a q1 q2
for each q1, q2 and a with q2 = goto(q1, a)
reduce
q0 q1 · · · qm 7→ε q0 q′
for each q0, q1, · · · , qm, q′ and A→ α with (A→ α •) ∈ qm,|α| = m and q′ = goto(q0,A)
reduce (special case)
qin q1 · · · qm 7→ε qfin
for each q1, · · · , qm and S → α with (S → α •) ∈ qm and|α| = m
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
LR PDA
Each stack symbol of MLR records the set of rules that arecompatible with the sequence of nonterminals recognized so far
Thus alternative analyses within a stack symbol match suffixes ofthe sequence of nonterminals recognized so far; we call this thesuffix property
The implemented strategy is bottom-up : the commitment toa production occurs as late as possible, after all of theproduction’s right-hand side symbols have been recognized
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Computational Complexity
A production A→ α has exactly |Aα| dotted items; therefore
|IP | = |G |
It is not difficult to construct a worst case set P such that
|QLR | = O(exp(|G |))
|MLR | = O(exp(|G |))
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
MLR Computation
Example (cont’d) : input string w = a + a + a
S
S
S + S
S + S
a a
a
S
S
S + S
a S + S
a a
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
MLR Computation
Example (cont’d) : input string w = a + a + a, first computation
qin
0
7→a qin
q1
1
7→ε
reduce S → a
qin
q2
1
7→+ qin
q2
q3
2
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
MLR Computation
Example (cont’d) : input string w = a + a + a, first computation
7→a qin
q2
q3
q1
3
7→ε
reduce S → a
qin
q2
q3
q4
3
7→ε
reduce S → S + S
qin
q2
3
7→+ qin
q2
q3
4
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
MLR Computation
Example (cont’d) : input string w = a + a + a, first computation
7→a qin
q2
q3
q1
5
7→ε
reduce S → a
qin
q2
q3
q4
5
7→ε
reduce S → S + S
qfin
5
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
MLR Computation
Example (cont’d) : input string w = a + a + a, secondcomputation
qin
0
7→a qin
q1
1
7→ε
reduce S → a
qin
q2
1
7→+ qin
q2
q3
2
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
MLR Computation
7→a qin
q2
q3
q1
3
7→ε
reduce S → a
qin
q2
q3
q4
3
7→+ qin
q2
q3
q4
q3
4
7→a qin
q2
q3
q4
q3
q1
5
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
MLR Computation
7→ε
reduce S → a
qin
q2
q3
q4
q3
q4
5
7→ε
reduce S → S + S
qin
q2
q3
q4
5
7→ε
reduce S → S + S
qfin
5Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
ME and MLR Comparison
Example : Earley PDA on sample rule A0 → A1A2A3 (simplifiednotation)
...
[•A1A2A3]
i1
· · ·...
[A1 • A2A3]
i2
· · ·...
[A1A2 • A3]
i3
· · ·...
[A1A2A3•]
i4
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
ME and MLR
Example (cont’d) : LR PDAs on sample rule A0 → A1A2A3
(simplified notation)
...
{•A1A2A3}
i1
· · ·...
{•A1A2A3}
{A1 • A2A3}
i2
· · ·...
{•A1A2A3}
{A1 • A2A3}
{A1A2 • A3}
i3
· · ·...
{•A1A2A3}
{A1 • A2A3}
{A1A2 • A3}
{A1A2A3•}
i4
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
ME and MLR
Example (cont’d) : ME tabulation on sample rule A0 → A1A2A3
i1
A1
i2
A2
i3
A3
i4
A1 • A2A3
A1A2 • A3
A1A2A3•
No need to reduce
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
ME and MLR
Example (cont’d) : MLR tabulation on sample rule A0 → A1A2A3
i1
A1
i2
A2
i3
A3
i4
•A1A2A3
· · ·
A1 • A2A3 A1A2 • A3 A1A2A3•
reduce
Single step reduction should be avoided; this issue will beaddressed later
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Remarks
MLR is deterministic only for so-called deterministiccontext-free languages
For general CFGs, MLR is strictly nondeterministic, with anexponential proliferation of computations in the length of theinput string
But this needs not concern us :
the only purpose of MLR is to specify a parsing strategy
nondeterminism can be simulated using (an extension of) ourdeterministic tabulation algorithm
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
GLR Algorithm
We derive the GLR algorithm by tabulation of the LR PDA
Items in the GLR algorithm have the form
(q, j , q′, i)
In this case we can not drop the first item component as we didfor the CKY and Earley algorithm
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Deduction Rules
We overview the GLR tabular algorithm using inference rules
MLR starts with qin in the stack. This corresponds to the init stepof the tabular algorithm
(⊥, 0, qin, 0)
0
(⊥, qin)
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Deduction Rules
The shift transition q1 7→a q1 q2 can be implemented as
(q′, j , q1, i − 1)
(q1, i − 1, q2, i)
{a = ai
j i − 1
(q′, q1)
ia
(q1, q2)
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Deduction Rules
The reduce transition
(q0q1 · · · qm−1qm 7→ε q0q′)
is in a form more general than the reduce transition available forour tabulation algorithm
This reduction can be implemented as
(q0, j0, q1, j1)(q1, j1, q2, j2)
...(qm−1, jm−1, qm, jm)
(q0, j0, q′, jm)
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Deduction Rules
The deduction rule for the reduce transition
(q0q1 · · · qm−1qm 7→ε q0q′)
can be graphically represented as
j0 j1 j2
· · ·
im−1 im
(q0, q1) (q1, q2) · · · (qm−1, qm)
(q0, q′)
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
GLR Pseudocode
Algorithm 5 GLR Algorithm (from Tabulation of LR PDA)
1: T ← {(⊥, 0, qin, 0)}2: for i ← 1, . . . , n do . edges with right end at i3: D ← ∅4: for each (q′, j , q1, i − 1) ∈ T do5: for each (q1 7→ai q1 q2) ∈ ∆ do . shift step6: T ← T ∪ {(q1, i − 1, q2, i)}7: D ← D ∪ {(q1, i − 1, q2, i)}8: end for9: end for
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Pseudocode
10: while D 6= ∅ do11: pop (qm−1, jm−1, qm, jm) from D12: for each (q0q1 · · · qm 7→ε q0q
′) ∈ ∆ do . reduce step13: for each (q0, j0, q1, j1), (q1, j1, q2, j2), . . . ,
(qm−2, jm−2, qm−1, jm−1) ∈ T do14: if (q0, j0, q
′, jm) 6∈ T then15: T ← T ∪ {(q0, j0, q′, jm)}16: D ← D ∪ {(q0, j0, q′, jm)}17: end if18: end for19: end for
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Pseudocode
20: for each (qinq1 · · · qm 7→ε qfin) ∈ ∆ do . reduce step21: for each (⊥, 0, qin, 0), (qin, 0, q1, j1), . . .,
(qm−2, jm−2, qm−1, jm−1) ∈ T do22: if (⊥, 0, qfin, jm) 6∈ T then23: T ← T ∪ {(⊥, 0, qfin, jm)}24: D ← D ∪ {(⊥, 0, qfin, jm)}25: end if26: end for27: end for28: end while29: end for30: accept iff (⊥, 0, qfin, n) ∈ T
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Computational Complexity
There are two different sources of complexity in the GLRalgorithm
The first is related to the worst case exponential size of the set ofstack symbols QLR ; see our previous analysis
This effect can be alleviated by using optimization techniques thatsimplify item representation for certain rules
Giorgio Satta Everything About Parsing
Generalized LRGLR PDAGeneralized LR AlgorithmFormal Properties
Computational Complexity
The second source of complexity is related to the length of theinput string; let m be the length of the longest rule in the grammar
The reduce step involves m + 1 indices and can be instantiated anumber of times O(|w |m+1), resulting in exponential timecomplexity in the grammar size
Such an exponential behavior can be avoided by factorizing eachreduce transition into a sequence of pop transitions with only twostack symbols in the left-hand side
Giorgio Satta Everything About Parsing