introduction to compilers professor yihjia tsai 2006 spring tamkang university
Post on 19-Dec-2015
Introduction to Compilers
Professor Yihjia Tsai
2006 Spring
Tamkang University
2
What is a compiler?
• Translates source code to target code
– Source code is typically a high level programming language (Java, C++, etc) but does not have to be
– Target code is often a low level language like assembly or machine code but does not have to be
• Can you think of other compilers that you have used – according to this definition?
3
Before we begin
• A-Z, a-z, 0-9
• " double quote
• # hash
• $ dollar sign
• % percent
• & ampersand
• ' single quote
• ( left parenthesis
• ) right parenthesis
• * star
• + plus
• , comma
• - hyphen, minus
• / slash
• : colon
• ; semicolon
• < less than
• = equal
4
Symbols
• > greater than
• ? question mark
• @ at sign
• [ left (open) square bracket
• \ back slash
• ] right (close) square bracket
• ^ caret, power
• _ underscore
• ` back quote
• { open brace
• | or
• } close brace
• ~ tilde
• . period, dot
• • bullet
5
Greek symbols
α alpha, β beta, γ gamma, δ delta, ε epsilon, φ phi, ζ zeta, θ theta, ι iota, κ kappa, λ lambda,
μ mu, ν nu, ξ xi, π pi, ρ rho, σ sigma, τ tau, χ chi, ψ psi, η eta, ω omega
6
Other Compilers
• Javadoc -> HTML
• XML -> HTML
• SQL Query output -> Table
• PostScript -> PDF
• High level description of a circuit -> machine instructions to fabricate circuit
The Compilation Process
8
The Analysis Stage
• Broken up into four phases
– Lexical Analysis (also called scanning or tokenization)
– Parsing
– Semantic Analysis
– Intermediate Code Generation
9
Lexing Example
double d1;
double d2;
d2 = d1 * 2.0;

lexemes   token             description
double    TOK_DOUBLE        reserved word
d1        TOK_ID            variable name
;         TOK_PUNCT         has value of ";"
double    TOK_DOUBLE        reserved word
d2        TOK_ID            variable name
;         TOK_PUNCT         has value of ";"
d2        TOK_ID            variable name
=         TOK_OPER          has value of "="
d1        TOK_ID            variable name
*         TOK_OPER          has value of "*"
2.0       TOK_FLOAT_CONST   has value of 2.0
;         TOK_PUNCT         has value of ";"
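The lexing example can be reproduced with a small regular-expression scanner. This is only a sketch: the token names come from the slide, but the patterns, the `TOKEN_SPEC` table, and the `tokenize` function are assumptions made for illustration.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
# Order matters: the keyword must be tried before the general identifier rule.
TOKEN_SPEC = [
    ("TOK_FLOAT_CONST", r"\d+\.\d+"),
    ("TOK_DOUBLE",      r"\bdouble\b"),
    ("TOK_ID",          r"[A-Za-z_]\w*"),
    ("TOK_OPER",        r"[=*+\-/]"),
    ("TOK_PUNCT",       r";"),
    ("SKIP",            r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token, lexeme in tokenize("double d1;\ndouble d2;\nd2 = d1 * 2.0;"):
    print(f"{lexeme:8}{token}")
```

Each pair keeps both the token (the category) and the lexeme (the matched text), which is the distinction the table above draws.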
10
Syntax and Semantics
• Syntax - the form or structure of the expressions – whether an expression is well formed
• Semantics – the meaning of an expression
11
Syntactic Structure
• Syntax is almost always expressed using some variant of a notation called a context-free grammar (CFG) or simply grammar
– BNF
– EBNF
12
A CFG has 4 parts
• A set of tokens (lexemes), known as terminal symbols
• A set of non-terminals
• A set of rules (productions) where each production consists of a left-hand side (LHS) and a right-hand side (RHS). The LHS is a non-terminal and the RHS is a sequence of terminals and/or non-terminal symbols.
• A special non-terminal symbol designated as the start symbol
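The four parts map directly onto a small data structure. The sketch below is one possible representation, not a standard API: the `Grammar` class, its `check` method, and the toy grammar `S -> a S | b` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # part 1: terminal symbols
    nonterminals: frozenset   # part 2: non-terminal symbols
    productions: tuple        # part 3: (lhs, rhs) pairs, rhs is a tuple of symbols
    start: str                # part 4: the start symbol

    def check(self):
        """Return a list of problems; an empty list means the grammar is well formed."""
        problems = []
        if self.terminals & self.nonterminals:
            problems.append("terminals and non-terminals overlap")
        if self.start not in self.nonterminals:
            problems.append("start symbol is not a non-terminal")
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                problems.append(f"LHS {lhs!r} is not a non-terminal")
            for symbol in rhs:
                if symbol not in self.terminals and symbol not in self.nonterminals:
                    problems.append(f"unknown symbol {symbol!r} in RHS of {lhs!r}")
        return problems


# Hypothetical toy grammar: S -> a S | b
toy = Grammar(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    productions=(("S", ("a", "S")), ("S", ("b",))),
    start="S",
)
print(toy.check())
```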
13
An example of BNF syntax for real numbers
<r> ::= <ds> . <ds>
<ds> ::= <d> | <d> <ds>
<d> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

< > encloses non-terminal symbols
::= 'is' or 'is made up of' or 'derives' (sometimes denoted with an arrow ->)
| or
14
Example
• On the example from the previous slide:
– What are the tokens?
– What are the lexemes?
– What are the non-terminals?
– What are the productions?
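One way to get a feel for the real-number grammar is to write one function per non-terminal. The sketch below does that. The function names `match_d`, `match_ds`, and `is_real` are assumptions for illustration, and the recursion in `match_ds` mirrors the recursive production for a digit sequence.

```python
DIGITS = set("0123456789")


def match_d(text, pos):
    """One digit: return the position after it, or None if there is no digit."""
    if pos < len(text) and text[pos] in DIGITS:
        return pos + 1
    return None


def match_ds(text, pos):
    """A digit followed by an optional further digit sequence (recursive rule)."""
    after = match_d(text, pos)
    if after is None:
        return None
    rest = match_ds(text, after)
    return after if rest is None else rest


def is_real(text):
    """A digit sequence, a period, then another digit sequence."""
    pos = match_ds(text, 0)
    if pos is None or pos >= len(text) or text[pos] != ".":
        return False
    return match_ds(text, pos + 1) == len(text)


for sample in ["3.14", "2.0", ".5", "5.", "42"]:
    print(sample, is_real(sample))
```

Note that the grammar requires digits on both sides of the period, so strings such as ".5" and "5." are rejected.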
15
Token vs. lexeme
• to·ken One that represents a group, as an employee whose presence is used to deflect from the employer criticism or accusations of discrimination.
• to·ken A basic, grammatically indivisible unit of a language such as a keyword, operator or identifier.
• lexeme A minimal unit (as a word or stem) in the lexicon of a language; `go' and `went' and `gone' and `going' are all members of the English lexeme `go'
• lexeme A minimal lexical unit of a language. Lexical analysis converts strings in a language into a list of lexemes. For a programming language these word-like pieces would include keywords, identifiers, literals and punctuation. The lexemes are then passed to the parser for syntactic analysis.
16
BNF Points
• A non-terminal can have more than one RHS, or an OR (|) can be used
• Lists or sequences are expressed via recursion
• A derivation is just a repeated set of production (rule) applications
• Examples
17
Example Grammar
<program> -> <stmts>
<stmts> -> <stmt> | <stmt> ; <stmts>
<stmt> -> <var> = <expr>
<var> -> a | b | c | d
<expr> -> <term> + <term> | <term> - <term>
<term> -> <var> | const
18
Example Derivation
<program> => <stmts>
          => <stmt>
          => <var> = <expr>
          => a = <expr>
          => a = <term> + <term>
          => a = <var> + <term>
          => a = b + <term>
          => a = b + const
19
Parse Trees
• Alternative representation for a derivation
• Example parse tree for the previous example

              stmts
                |
              stmt
           /    |    \
        var     =     expr
         |          /   |   \
         a      term    +    term
                  |            |
                 var         const
                  |
                  b
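The derivation of a = b + const can be replayed mechanically: at each step, replace the leftmost non-terminal with one of its right-hand sides. The sketch below assumes a dictionary encoding of the example grammar; `GRAMMAR`, `derive`, and the list of production choices are illustrative names, not part of the slides.

```python
# The example grammar; each non-terminal maps to its list of right-hand sides.
GRAMMAR = {
    "<program>": [["<stmts>"]],
    "<stmts>":   [["<stmt>"], ["<stmt>", ";", "<stmts>"]],
    "<stmt>":    [["<var>", "=", "<expr>"]],
    "<var>":     [["a"], ["b"], ["c"], ["d"]],
    "<expr>":    [["<term>", "+", "<term>"], ["<term>", "-", "<term>"]],
    "<term>":    [["<var>"], ["const"]],
}


def derive(start, choices):
    """Apply one production per choice to the leftmost non-terminal.

    Each entry in choices is an index into that non-terminal's alternatives.
    Returns every sentential form, starting with the start symbol.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        index = next(i for i, symbol in enumerate(form) if symbol in GRAMMAR)
        form[index:index + 1] = GRAMMAR[form[index]][choice]
        steps.append(" ".join(form))
    return steps


for step in derive("<program>", [0, 0, 0, 0, 0, 0, 1, 1]):
    print("=>", step)
```

Recording which production was applied at each step is also enough to rebuild the parse tree, since every application adds one interior node and its children.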
20
Another Example
Expression -> Expression + Expression
            | Expression - Expression
            | ...
            | Variable
            | Constant
            | ...
Variable -> T_IDENTIFIER
Constant -> T_INTCONSTANT | T_DOUBLECONSTANT
21
The Parse
Expression -> Expression + Expression
           -> Variable + Expression
           -> T_IDENTIFIER + Expression
           -> T_IDENTIFIER + Constant
           -> T_IDENTIFIER + T_INTCONSTANT
a + 2
22
Parse Trees
PS -> P | P PS
P -> | '(' PS ')' | '<' PS '>' | '[' PS ']'
What's the parse tree for this statement? < [ ] [ < > ] >
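A hand-written recursive parser is one way to check an answer. The sketch below is an assumption-laden simplification: it reads PS as a run of bracketed groups and does not record empty P nodes, so it shows the nesting structure rather than a full parse tree. The names `PAIRS`, `parse_ps`, `parse_p`, and `parse` are hypothetical, and tokens are assumed to be separated by spaces.

```python
PAIRS = {"(": ")", "<": ">", "[": "]"}


def parse_ps(tokens, pos):
    """PS: a run of bracketed groups (empty P alternatives are not recorded)."""
    children = []
    while pos < len(tokens) and tokens[pos] in PAIRS:
        node, pos = parse_p(tokens, pos)
        children.append(node)
    return children, pos


def parse_p(tokens, pos):
    """P: an opening bracket, a nested PS, then the matching closing bracket."""
    opener = tokens[pos]
    inner, pos = parse_ps(tokens, pos + 1)
    if pos >= len(tokens) or tokens[pos] != PAIRS[opener]:
        raise SyntaxError(f"expected {PAIRS[opener]!r} at token {pos}")
    return (opener + PAIRS[opener], inner), pos + 1


def parse(text):
    """Parse a space-separated bracket string into nested (brackets, children) tuples."""
    tokens = text.split()
    tree, pos = parse_ps(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r} at token {pos}")
    return tree


print(parse("< [ ] [ < > ] >"))
```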
23
EBNF - Extended BNF
• Like BNF except that
• Non-terminals start with an uppercase letter
• Parens are used for grouping terminals
• Braces {} represent zero or more occurrences (iteration)
• Brackets [] represent an optional construct, that is, a construct that appears either once or not at all.
24
EBNF example
Exp -> Term { ('+' | '-') Term }
Term -> Factor { ('*' | '/') Factor }
Factor -> '(' Exp ')' | variable | constant
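This EBNF translates almost line for line into a recursive-descent parser: each rule becomes a method, and each { } iteration becomes a while loop. The sketch below evaluates the expression as it parses. The `Parser` class, the `tokenize` and `evaluate` helpers, and the `env` dictionary for variable values are assumptions made for illustration.

```python
import re


def tokenize(text):
    # Numbers, identifiers, and single-character operators or parentheses.
    # Characters matching none of these are silently ignored in this sketch.
    return re.findall(r"\d+\.\d+|\d+|[A-Za-z_]\w*|[-+*/()]", text)


class Parser:
    def __init__(self, tokens, env=None):
        self.tokens = tokens
        self.pos = 0
        self.env = env or {}   # variable name -> value

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def exp(self):
        # Exp -> Term { ('+' | '-') Term }
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # Term -> Factor { ('*' | '/') Factor }
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        # Factor -> '(' Exp ')' | variable | constant
        token = self.take()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            value = self.exp()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token[0].isdigit():
            return float(token)
        if token[0].isalpha() or token[0] == "_":
            return self.env[token]
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text, env=None):
    parser = Parser(tokenize(text), env)
    value = parser.exp()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("2 + 3 * 4"))
print(evaluate("a - b / 2", {"a": 10, "b": 4}))
```

Because Term sits below Exp in the grammar, multiplication and division bind tighter than addition and subtraction without any extra precedence table.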
25
EBNF/BNF
• EBNF and BNF are equivalent
• How can {} be expressed in BNF?
• How can ( ) be expressed?
• How can [ ] be expressed?
26
Semantic Analysis
• The syntactically correct parse tree (or derivation) is checked for semantic errors
• Check for constructs that, while syntactically valid, do not obey the semantic rules of the source language.
• Examples:
– Use of an undeclared/uninitialized variable
– Function called with improper arguments
– Incompatible operands and type mismatches
27
Examples
int i;
int j;
i = i + 2;

int arr[2], c;
c = arr * 10;

Most semantic analysis pertains to the checking of types.

void fun1(int i);
double d;
d = fun1(2.1);
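A toy checker makes the first two examples concrete. This is a sketch under strong assumptions: programs are given as hand-built tuples rather than parsed source, only declarations and arithmetic assignments are modeled, and function calls (the third example) are not handled. The tuple layout, the `check` function, and the error wording are all hypothetical.

```python
def check(statements):
    """Return a list of semantic error messages.

    Statements are ("decl", type, name) or ("assign", target, operands),
    where operands mixes variable names (str) and numeric literals.
    """
    declared = {}        # variable name -> declared type
    initialized = set()  # variables that have been assigned a value
    errors = []
    for stmt in statements:
        kind = stmt[0]
        if kind == "decl":
            _, var_type, name = stmt
            declared[name] = var_type
        elif kind == "assign":
            _, target, operands = stmt
            for operand in operands:
                if not isinstance(operand, str):
                    continue  # numeric literal, nothing to check
                if operand not in declared:
                    errors.append(f"'{operand}' is not declared")
                elif declared[operand].endswith("[]"):
                    errors.append(f"array '{operand}' used as an arithmetic operand")
                elif operand not in initialized:
                    errors.append(f"'{operand}' is used before it is initialized")
            if target not in declared:
                errors.append(f"'{target}' is not declared")
            else:
                initialized.add(target)
    return errors


# int i; int j; i = i + 2;
first = [("decl", "int", "i"), ("decl", "int", "j"), ("assign", "i", ["i", 2])]
# int arr[2], c; c = arr * 10;
second = [("decl", "int[]", "arr"), ("decl", "int", "c"), ("assign", "c", ["arr", 10])]

print(check(first))
print(check(second))
```

Both programs are syntactically valid, which is why these errors can only be caught after parsing.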
28
Intermediate Code Generation
• Where the intermediate representation of the source program is created.
• The representation can have a variety of forms, but a common one is called three-address code (TAC)
• Like assembly – the TAC is a sequence of simple instructions, each of which can have at most three operands.
29
Example
_t1 = b * c
_t2 = b * d
_t3 = _t1 + _t2
a = _t3

a = b * c + b * d

Note: temps
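The temporaries fall out of a post-order walk of the expression tree: translate both operands first, then emit one instruction for the operator. The sketch below borrows Python's own `ast` module as a stand-in front end, which is an assumption made purely for convenience, since the statement happens to be valid Python. The `to_tac` function is hypothetical.

```python
import ast


def to_tac(statement):
    """Translate one assignment with binary arithmetic into three-address code."""
    tree = ast.parse(statement).body[0]
    if (not isinstance(tree, ast.Assign) or len(tree.targets) != 1
            or not isinstance(tree.targets[0], ast.Name)):
        raise ValueError("expected a single assignment to a variable")
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}
    code = []

    def emit(node):
        """Return the name holding the node's value, emitting code as needed."""
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            left = emit(node.left)
            right = emit(node.right)
            temp = f"_t{len(code) + 1}"   # one new temporary per instruction
            code.append(f"{temp} = {left} {ops[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    result = emit(tree.value)
    code.append(f"{tree.targets[0].id} = {result}")
    return code


for line in to_tac("a = b * c + b * d"):
    print(line)
```

Every emitted instruction has at most three operands (one result and two sources), which is where the name three-address code comes from.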
30
Another Example
_t1 = a > b
if _t1 goto L0
_t2 = a - c
a = _t2
L0: _t3 = b * c
c = _t3

if (a <= b) a = a - c;
c = b * c;

Note: temps, symbolic addresses
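To see that the three-address code matches the source statement, it can be run through a tiny interpreter. The sketch below is a toy, not a real virtual machine: `run_tac`, `OPS`, and the one-instruction-per-string format are assumptions, and only the instruction shapes used in this example are supported. The condition is inverted (a > b) so that the jump skips the body of the if.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, ">": operator.gt}

TAC = [
    "_t1 = a > b",
    "if _t1 goto L0",
    "_t2 = a - c",
    "a = _t2",
    "L0: _t3 = b * c",
    "c = _t3",
]


def value(name, env):
    """Look up a variable or temporary, falling back to a numeric literal."""
    return env[name] if name in env else float(name)


def run_tac(lines, env):
    """Execute the instructions and return the final variable bindings."""
    env = dict(env)
    labels = {}
    program = []
    for line in lines:
        if ":" in line:                       # symbolic address, e.g. "L0: ..."
            label, line = line.split(":", 1)
            labels[label.strip()] = len(program)
        program.append(line.split())
    pc = 0
    while pc < len(program):
        parts = program[pc]
        pc += 1
        if parts[0] == "if":                  # if x goto L
            if env[parts[1]]:
                pc = labels[parts[3]]
        elif len(parts) == 3:                 # x = y
            env[parts[0]] = value(parts[2], env)
        else:                                 # x = y op z
            env[parts[0]] = OPS[parts[3]](value(parts[2], env), value(parts[4], env))
    return env


result = run_tac(TAC, {"a": 1, "b": 5, "c": 2})
print(result["a"], result["c"])
```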
31
Next Time
• Finish introduction to compilation stages
• Read Appel Chapters 1 and 2 if you have not already done so.
• What is a splay tree?
32
Selected References
• Appel, A., Modern Compiler Implementation In Java (2nd Ed), Cambridge University Press, 2002. ISBN 052182060X.
• Aho, A.V., R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988. ISBN 0-201-10088-6.
• Muchnick, S., Advanced Compiler Design and Implementation, Morgan Kaufmann, 1998. ISBN 1-55860-320-4.