-
Copyright © 2005 Elsevier
Chapter 2 ::
Programming Language Syntax
Programming Language Pragmatics
Michael L. Scott
-
Introduction
• programming languages need to be precise
– natural languages less so
– both form (syntax) and meaning (semantics)
must be unambiguous
– example: digits
digit → 0|1|2|3|4|5|6|7|8|9
– we need good notation (or a metalanguage) to
describe precise languages by recognizing tokens
• regular expressions
• context-free grammars
-
Tokens
• tokens are the building blocks of programs
– shortest strings with individual meaning
– examples
• keywords (type names, control structures)
• identifiers (variable names)
• symbols (mathematical operators)
• constants (literals)
– considerations
• case sensitivity
• international characters
• maximum lengths
-
Regular Expressions
• a regular expression is one of the following:
– a character
– the empty string, denoted by ϵ
– two regular expressions concatenated
– two regular expressions separated by | (i.e., or)
– a regular expression followed by the Kleene star
(concatenation of zero or more strings)
• these simple rules help us find tokens in the
programming language
• useful in unix/linux environments
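The rules above map directly onto the syntax of most regex libraries. A minimal sketch using Python's `re` module follows; the pattern names and the `matches` helper are illustrative only and are not part of the slides.

```python
import re

# Each pattern uses only the constructs listed above.
char = re.compile(r"a")         # a single character
empty = re.compile(r"")         # the empty string
concat = re.compile(r"ab")      # two expressions concatenated
either = re.compile(r"a|b")     # two expressions separated by |
star = re.compile(r"(a|b)*")    # Kleene star: zero or more repetitions

def matches(pattern, text):
    """Return True when the whole of text belongs to the pattern's language."""
    return pattern.fullmatch(text) is not None
```

For example, `matches(star, "abba")` and `matches(star, "")` both hold, while `matches(either, "ab")` does not, because alternation picks exactly one side.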
-
Regular Expressions
• numerical literals in Pascal may be generated
by the following:
• the arrow (→) can be read as
– "can be replaced by"
– "goes to"
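The grammar figure for this slide did not survive in the transcript. As a rough sketch only, the code below assumes the usual shape of an unsigned numeric literal: one or more digits, an optional fraction, and an optional exponent. The names `DIGIT`, `UNSIGNED_INTEGER`, `UNSIGNED_NUMBER` and `is_number` are chosen here for illustration.

```python
import re

DIGIT = r"[0-9]"                                  # digit -> 0|1|...|9
UNSIGNED_INTEGER = rf"{DIGIT}{DIGIT}*"            # digit digit*
UNSIGNED_NUMBER = re.compile(
    rf"{UNSIGNED_INTEGER}"
    rf"(\.{UNSIGNED_INTEGER})?"                   # optional fraction
    rf"([eE][+-]?{UNSIGNED_INTEGER})?"            # optional exponent
)

def is_number(text):
    """Return True when text is a complete numeric literal under the assumed rules."""
    return UNSIGNED_NUMBER.fullmatch(text) is not None
```

Under these assumptions `42`, `3.14` and `6.02e23` are accepted, while `3.` and `.5` are rejected because a fraction needs digits on both sides of the dot.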
-
Context-Free Grammars
• the notation for context-free grammars (CFG)
is sometimes called Backus-Naur Form (BNF)
– necessary since regular expressions cannot specify nested
constructs
– used to define the syntax of a language
• with Kleene star and other facilitating
symbols, the notation is termed Extended BNF
(EBNF)
-
Context-Free Grammars
Source: Tucker & Noonan (2007)
-
Derivations
• example grammar
binaryDigit → 0
binaryDigit → 1
or equivalently
binaryDigit → 0 | 1
Source: Tucker & Noonan (2007)
-
Derivations
• consider the grammar
Integer → Digit | Integer Digit
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
we can derive any unsigned integer, like 352, from this
grammar:
Integer ⇒ Integer Digit
⇒ Integer 2
⇒ Integer Digit 2
⇒ Integer 5 2
⇒ Digit 5 2
⇒ 3 5 2
Source: Tucker & Noonan (2007)
-
Derivations
– a different derivation of 352
Integer ⇒ Integer Digit
⇒ Integer Digit Digit
⇒ Digit Digit Digit
⇒ 3 Digit Digit
⇒ 3 5 Digit
⇒ 3 5 2
– this is called a leftmost derivation since at each step, the
leftmost nonterminal is replaced
– the previous derivation was a rightmost derivation
Source: Tucker & Noonan (2007)
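The leftmost derivation can be reproduced mechanically. The sketch below hard-codes the Integer grammar from these slides; the function name `leftmost_derivation` is an illustrative choice, and the input is assumed to be a non-empty string of decimal digits.

```python
def leftmost_derivation(digits):
    """Return the sentential forms of a leftmost derivation of `digits`."""
    form = ["Integer"]
    steps = [form[:]]
    # Apply Integer -> Integer Digit until the form is long enough.
    while len(form) < len(digits):
        form = ["Integer", "Digit"] + form[1:]
        steps.append(form[:])
    # Integer -> Digit ends the recursion.
    form = ["Digit"] + form[1:]
    steps.append(form[:])
    # Replace each Digit, left to right, with the terminal it must produce.
    for i, d in enumerate(digits):
        form = form[:i] + [d] + form[i + 1:]
        steps.append(form[:])
    return [" ".join(f) for f in steps]
```

Calling `leftmost_derivation("352")` lists the same sentential forms as the leftmost derivation above, starting from `Integer` itself.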
-
Derivations
– notation for derivations
Integer ⇒* 352
– meaning that 352 can be derived in a finite number of
steps using the grammar for Integer
352 ∈ L(G)
– meaning that 352 is a member of the language
defined by grammar G
L(G) = { ω ∈ T* | Integer ⇒* ω }
– meaning that the language defined by grammar G is
the set of all symbol strings ω that can be derived as
an Integer
Source: Tucker & Noonan (2007)
-
Grammars
• conventional in general discussions of grammars to
use
– lower case letters near the beginning of the alphabet for
terminals
– lower case letters near the end of the alphabet for strings of
terminals
– upper case letters near the beginning of the alphabet for
non-terminals
– upper case letters near the end of the alphabet for arbitrary
symbols
– greek letters for arbitrary strings of symbols
-
Parse Trees
• a parse tree is a graphical representation of a
derivation
– each internal node of the tree corresponds to a step in the
derivation
– the children of a node represent a right-hand side of a
production
– each leaf node represents a symbol of the derived string
reading from left to right
Source: Tucker & Noonan (2007)
-
Parse Trees
• the step, Integer → Integer Digit appears in
the parse tree as
Source: Tucker & Noonan (2007)
-
Parse Trees
• parse tree for 352 as in Integer
Source: Tucker & Noonan (2007)
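The figure is missing from this transcript, so here is a small sketch of the same tree as nested tuples. The representation (a label followed by children) and the names `parse_integer` and `leaves` are assumptions made for illustration.

```python
def parse_integer(digits):
    """Build the parse tree for a non-empty digit string as nested tuples."""
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected a non-empty string of decimal digits")
    # Integer -> Digit
    tree = ("Integer", ("Digit", digits[0]))
    # Integer -> Integer Digit, applied once per remaining digit
    for d in digits[1:]:
        tree = ("Integer", tree, ("Digit", d))
    return tree

def leaves(tree):
    """Read the leaves left to right to recover the derived string."""
    if isinstance(tree, str):
        return tree
    return "".join(leaves(child) for child in tree[1:])
```

For `"352"` the tree leans to the left, because the recursive production places Integer first on its right-hand side, and `leaves` returns `"352"` again.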
-
Context-Free Grammars
• expression grammar with precedence and
associativity
-
Context-Free Grammars
• parse tree for expression grammar (with precedence) for 3 + 4 * 5
-
Context-Free Grammars
• parse tree for expression grammar (with left associativity) for 10 - 4 - 3
-
Context-Free Grammars
• another grammar with precedence and
associativity
– + and - are left-associative operators in mathematics
– * and / have higher precedence than + and -
• Grammar G1
Source: Tucker & Noonan (2007)
-
Context-Free Grammars
• parse tree for 4**2**3 + 5 * 6 + 7
Source: Tucker & Noonan (2007)
-
Context-Free Grammars
• associativity and precedence shown in the
structure of the parse tree
– highest precedence at the bottom
– left-associativity on the left at each level
Source: Tucker & Noonan (2007)
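A hand-written recursive-descent parser makes both properties visible. The grammar figures are missing from this transcript, so the sketch assumes a layered grammar with one level per precedence tier: + and - lowest, then * and /, then a right-associative **. It builds simplified operator trees rather than full parse trees, does no error recovery, and all function names are illustrative.

```python
import re

def tokenize(text):
    # '**' is listed before the single-character class so the longer operator wins.
    return re.findall(r"\d+|\*\*|[-+*/()]", text)

def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():                          # lowest precedence: + and -
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())    # the loop builds a left-leaning tree
        return node

    def term():                          # higher precedence: * and /
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():                        # highest precedence: **
        node = primary()
        if peek() == "**":
            advance()
            node = ("**", node, factor())    # recursion on the right
        return node

    def primary():                       # a number or a parenthesized expression
        tok = advance()
        if tok == "(":
            node = expr()
            advance()                    # consume ')'
            return node
        return int(tok)

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

With this sketch, `parse("3 + 4 * 5")` puts the multiplication below the addition, and `parse("10 - 4 - 3")` groups the leftmost subtraction first.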
-
Ambiguous Grammars
• a grammar is ambiguous if one of its strings
has two or more different parse trees
– grammar G1 above is unambiguous
• ambiguous expression grammar G2 equivalent
to G1
– fewer productions and nonterminals, but
ambiguous
Source: Tucker & Noonan (2007)
-
Ambiguous Grammars
• ambiguous parse of 5 – 4 + 3 using G2
Source: Tucker & Noonan (2007)
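The practical cost of ambiguity is that different trees can give different answers. Since the figure for G2 is missing from this transcript, the sketch below assumes a single flat rule of the form Expr → Expr Op Expr | number, restricted to + and -; the names `all_trees` and `evaluate` are illustrative.

```python
def all_trees(tokens):
    """Every tree for tokens under the assumed rule Expr -> Expr Op Expr | number."""
    if len(tokens) == 1:
        return [int(tokens[0])]
    trees = []
    # Any operator (odd index) may sit at the root of the tree.
    for i in range(1, len(tokens), 2):
        for left in all_trees(tokens[:i]):
            for right in all_trees(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees

def evaluate(tree):
    """Evaluate a tree built from integers, + and -."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a - b
```

For the tokens `5 - 4 + 3` this yields two trees, one grouping as 5 - (4 + 3) and one as (5 - 4) + 3, which evaluate to -2 and 4 respectively.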
-
Abstract Syntax Tree
• the shape of a parse tree reveals the meaning
of the program
• we want a tree that removes the parse tree's inefficiency
but keeps its shape
– remove separator/punctuation terminal symbols
– remove all trivial root nonterminals
– replace remaining nonterminals with leaf terminals
• removes syntactic sugar and keeps essential elements
of a language
Source: Tucker & Noonan (2007)
-
Abstract Syntax Tree
Source: Tucker & Noonan (2007)
-
Dangling Else
• with which if statement does the else associate?
Source: Tucker & Noonan (2007)
-
Dangling Else Ambiguity
Source: Tucker & Noonan (2007)
-
Dangling Else Solutions
• Algol 60, C, C++
– associate each else with closest if
– use {} or begin/end to override
• Algol 68, Modula, Ada
– use explicit delimiter to end every conditional (e.g., if..fi)
• Java
– rewrite the grammar to limit what can appear in a conditional
Source: Tucker & Noonan (2007)
-
Extended BNF (EBNF)
• BNF
– recursion for iteration
– nonterminals for grouping
• EBNF additional metacharacters
– { } for a series of zero or more
– ( ) for a list; must pick one
– [ ] for an optional list; pick none or one
Source: Tucker & Noonan (2007)
-
EBNF Examples
• Expression is a list of Terms separated by
operators + and -
Source: Tucker & Noonan (2007)
-
EBNF to BNF
• we can always rewrite an EBNF grammar as a
BNF grammar
can be rewritten as
• try rewriting EBNF rules with { } and ( )
• while EBNF is no more powerful than BNF,
its rules are often simpler and clearer
Source: Tucker & Noonan (2007)
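The slide's own example rules are missing from this transcript. As an illustration only, the sketch below rewrites a hypothetical rule A → x { y } z by introducing a fresh nonterminal that is either empty or one more repetition. The function name `expand_repetition` and the primed naming scheme are assumptions, and only a single { } group per rule is handled.

```python
def expand_repetition(lhs, rhs):
    """Rewrite one EBNF rule whose right side contains a single { ... } group.

    rhs is a list of symbols, e.g. ["x", "{", "y", "}", "z"].
    Returns BNF rules as (left side, list of alternatives).
    """
    start, end = rhs.index("{"), rhs.index("}")
    body = rhs[start + 1:end]
    fresh = lhs + "'"                        # hypothetical fresh nonterminal name
    return [
        (lhs, [rhs[:start] + [fresh] + rhs[end + 1:]]),
        (fresh, [["ε"], body + [fresh]]),    # zero repetitions, or one more
    ]
```

Applied to the hypothetical rule, this produces A → x A' z and A' → ε | y A', which is the recursion-for-iteration pattern that plain BNF relies on.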
-
Scanning
• recall that the scanner is responsible for
– tokenizing source
– removing comments
• may be difficult if nested
– (often) dealing with pragmas (i.e., significant
comments)
– saving text of identifiers, numbers, strings
– saving source locations (file, line, column) for
error messages
-
Scanning
• suppose we are building an ad-hoc (hand-
written) scanner for Pascal:
– we read the characters one at a time with look-
ahead
• if it is one of the one-character tokens { ( ) [ ] < > , ; = + - etc }
we announce that token
• if it is a ., we look at the next character
– if that is a dot, we announce ..
– otherwise, we announce . and reuse the look-
ahead
-
Scanning
• if it is a
-
Scanning
• if it is a digit, we keep reading until we find
a non-digit
– if that is not a . , we announce an integer
– otherwise, we keep looking for a real number
– if the character after the . is not a digit, we
announce an integer and reuse the . and the
look-ahead
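The dot and digit cases described on these slides can be sketched as an ad-hoc scanner. This is a simplification and not the textbook's scanner: it handles only whitespace, digits and dots, it ignores exponents, and the token names are invented here.

```python
DIGITS = set("0123456789")

def scan(text):
    """Split text into int, real, dot and dotdot tokens, reading one character at a time."""
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c.isspace():
            i += 1
        elif c == ".":
            if text[i + 1:i + 2] == ".":        # look ahead one character
                tokens.append(("dotdot", ".."))
                i += 2
            else:
                tokens.append(("dot", "."))     # the look-ahead is reused next time
                i += 1
        elif c in DIGITS:
            start = i
            while text[i:i + 1] in DIGITS:      # keep reading until a non-digit
                i += 1
            # Commit to a real only if a digit follows the dot.
            if text[i:i + 1] == "." and text[i + 1:i + 2] in DIGITS:
                i += 1
                while text[i:i + 1] in DIGITS:
                    i += 1
                tokens.append(("real", text[start:i]))
            else:
                tokens.append(("int", text[start:i]))
        else:
            raise ValueError(f"unexpected character {c!r}")
    return tokens
```

On `3..5` the scanner announces an integer, then the `..` symbol, then another integer, because the character after the first dot is not a digit.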
-
Scanning
• pictorial representation of a Pascal scanner as a finite automaton
-
Scanning
• a scanner can be represented by a
deterministic finite automaton (DFA)
– lex, scangen, etc. build these things
automatically from a set of regular expressions
– specifically, they construct a machine that
accepts the language identifier | int const
| real const | comment | symbol |
...
-
Scanning
• we run the machine over and over to get one
token after another
– nearly universal rule
• always take the longest possible token from the
input
– thus foobar is foobar and never f or foo or foob
• more to the point, 3.14159 is a real const and
never 3, ., and 14159
• regular expressions "generate" a regular
language; DFAs "recognize" it
-
Scanning
• scanners tend to be built three ways
– ad-hoc
– semi-mechanical pure DFA
(usually realized as nested case statements)
– table-driven DFA
• ad-hoc generally yields the fastest, most
compact code by doing lots of special-
purpose things, though good automatically-generated scanners come very close
-
Scanning
• writing a pure DFA as a set of nested case
statements is a surprisingly useful
programming technique
– though it's often easier to use perl, awk, sed
– for details see Figure 2.11
• table-driven DFA is what lex and scangen
produce
– lex (flex) in the form of C code
– scangen in the form of numeric tables and a
separate driver (for details see Figure 2.12)
-
Scanning
• note that the rule about longest-possible tokens means you return only when the next character can't be used to continue the current token
– the next character will generally need to be saved for the next token
• in some cases, you may need to peek at more than one character of look-ahead in order to know whether to proceed
– in Pascal, for example, when you have a 3 and you see a dot
• do you proceed (in hopes of getting 3.14)? or
• do you stop (in fear of getting 3..5)?
-
Scanning
• in messier cases, you may not be able to get
by with any fixed amount of look-ahead; in
Fortran, for example, we have
DO 5 I = 1,25 loop
DO 5 I = 1.25 assignment
• here, we need to remember we were in a
potentially final state, and save enough
information that we can back up to it, if we get stuck later
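Remembering the last final state is easy to show with a small table-driven DFA. The sketch below is not the Fortran case: it uses an assumed token set of integers, reals, `.` and `..`, and the state names, the `TABLE` layout and the function names are all illustrative.

```python
# States of a tiny DFA for an assumed token set: integers, reals, '.' and '..'.
START, INT, INT_DOT, REAL, DOT, DOTDOT = range(6)

def char_class(c):
    return "digit" if c in "0123456789" else "dot" if c == "." else "other"

TABLE = {
    (START, "digit"): INT,
    (START, "dot"): DOT,
    (INT, "digit"): INT,
    (INT, "dot"): INT_DOT,       # not final: '3.' alone is not a token
    (INT_DOT, "digit"): REAL,
    (REAL, "digit"): REAL,
    (DOT, "dot"): DOTDOT,
}
FINAL = {INT: "int", REAL: "real", DOT: "dot", DOTDOT: "dotdot"}

def next_token(text, pos):
    """Return (kind, lexeme, new_pos) for the longest token starting at pos."""
    state, i = START, pos
    last_final = None                        # (kind, end) of the last final state seen
    while i < len(text):
        state = TABLE.get((state, char_class(text[i])))
        if state is None:
            break                            # stuck: no transition on this character
        i += 1
        if state in FINAL:
            last_final = (FINAL[state], i)
    if last_final is None:
        raise ValueError(f"no token at position {pos}")
    kind, end = last_final
    return kind, text[pos:end], end          # back up to the remembered final state

def tokens(text):
    """Run the machine over and over to get one token after another."""
    pos, out = 0, []
    while pos < len(text):
        kind, lexeme, pos = next_token(text, pos)
        out.append((kind, lexeme))
    return out
```

On `3..5` the machine reads `3.`, gets stuck on the second dot, and backs up to the point where it had recognized the integer `3`. On `3.14159` it never gets stuck before the end and returns one real constant.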
-
Parsing
• terminology:
– context-free grammar (CFG)
– symbols
• terminals (tokens)
• non-terminals
– production
– derivations (left-most and right-most - canonical)
– parse trees
– sentential form
-
Parsing
• by analogy to RE and DFAs, a context-free
grammar (CFG) is a generator for a
context-free language (CFL)
– a parser is a language recognizer
• there is an infinite number of grammars for
every context-free language
– not all grammars are created equal, however
-
Parsing
• it turns out that for any CFG we can create a
parser that runs in O(n^3) time
• there are two well-known parsing
algorithms that permit this
– Earley's algorithm
– Cocke-Younger-Kasami (CYK) algorithm
• O(n^3) time is clearly unacceptable for a
parser in a compiler - too slow
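To see where the cubic bound comes from, here is a minimal CYK recognizer. It assumes the grammar is already in Chomsky normal form, and the rule encoding (`terminal_rules`, `pair_rules`) is an illustrative choice. The sample grammar is the Integer grammar from the earlier slides with the production Integer → Digit replaced by Integer → 0 | ... | 9.

```python
def cyk(word, terminal_rules, pair_rules, start):
    """CYK recognizer for a grammar in Chomsky normal form.

    terminal_rules: set of (A, a) for A -> a
    pair_rules:     set of (A, B, C) for A -> B C
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][length] holds the nonterminals that derive word[i:i+length]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][1] = {a for a, t in terminal_rules if t == ch}
    for length in range(2, n + 1):               # three nested loops over the input
        for i in range(n - length + 1):
            for split in range(1, length):
                for a, b, c in pair_rules:
                    if b in table[i][split] and c in table[i + split][length - split]:
                        table[i][length].add(a)
    return start in table[0][n]

DIGITS = "0123456789"
TERMINAL_RULES = {("Integer", d) for d in DIGITS} | {("Digit", d) for d in DIGITS}
PAIR_RULES = {("Integer", "Integer", "Digit")}
```

For a fixed grammar the loops over length, start position and split point each run at most n times, which is where the O(n^3) figure comes from. `cyk("352", TERMINAL_RULES, PAIR_RULES, "Integer")` accepts.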