Post on 22-Dec-2015
A teaching compiler
Overview
• X is a programming language
• hyper supports edit-compile-go for X
• 1986 hyper was in C
  • used static, extern, longjmp
  • ran on VAX, alpha, SUN, …
• 2003 hyper implemented in 6000 lines of C++
  • uses exceptions, pure virtual methods
  • runs on Intel x86 linux, power pc
• Similar to MATLAB JIT
My Copyleft
~mckeeman/src/cxx/hyper
COPYRIGHT W. M. McKeeman 1987. You may do anything you like with the file except remove or alter this notice.
I expect the students to learn…
• the mathematics of language description
• scanning, parsing, abstract S/R sequence, semantics
• symbol tables, polish postfix, hardware
• integrated programming environment, editor
• separation of concerns, design, build, test, quality
• dealing with deadlines
I assume you already know…
• Something about compiler implementation
• Grammars
• Scanners
• Recursive Parsers
• Machine Language
• Something about C++
Main Points of Talk
The integrated programming environment consists of an emacs-style editor and a compiler. The compiler takes source from a source buffer, quickly compiles it into Intel x86 code, executes it, and leaves the results in an output buffer.
C++ class hierarchy is used to represent independent layers of the components and enforce separation of concerns. Pure virtual functions are used to communicate between the layers.
Individual components are in subdirectories together with unit tests. The containing directory contains the single makefile for the project and its components and tests.
A sample X program
` FILE: fact.x
n := inputval+0; res := 1; i := 1;
it
  if i <= n -> res := res*i; i := i + 1
  :: else exit
  fi
ti;
result := res;
inputval only appears on rhs – must be input
result only appears on lhs – must be output
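For readers more at home in C++, the sample can be read roughly as the function below. This is only a sketch under assumptions: it-ti is taken to be a loop that repeats until exit is executed, if-fi a guarded selection, and inputval and result are modeled as a parameter and a return value. The name factX is hypothetical and not part of hyper.

```cpp
#include <cassert>

// Hypothetical C++ reading of fact.x. inputval becomes the parameter,
// result becomes the return value.
int factX(int inputval) {
  assert(inputval >= 0 && inputval <= 12);  // 12! still fits in a 32 bit int
  int n = inputval + 0;
  int res = 1;
  int i = 1;
  for (;;) {            // it ... ti (assumed: repeat until exit)
    if (i <= n) {       // if i <= n ->
      res = res * i;
      i = i + 1;
    } else {            // :: else exit
      break;
    }
  }
  return res;           // result := res
}
```

Calling factX(5) walks the loop five times and leaves 120 in res, just as the X program would leave it in result.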
X Summary
Scalar assignments and expressions
Strong type inferred from use (int & logical)
Input/output inferred from use
if-fi and it-ti control flow
be-eb nested scopes
Subroutines defined but not implemented
Typical Student Projects
• Replace 32 bit int with 32 bit float (too easy)
• Implement 64 bit double
• Implement arrays
• Implement sets (perhaps infinite)
• Implement BIGNUM rationals
• Implement subroutines
• Implement C-style declarations (too dull)
• Implement conventional I/O (retro)
Assembler
The Intel x86 assembler implements an open-ended set of methods. Examples of calls to Asm object x86:
x86->addRR(int, int) – add register-register
x86->daddp(void) – double add, pop
x86->dldM(double*) – double load from double memory
x86->dildM(int*) – double load from int memory
x86-> callRi(int) – call indirect through register
x86-> pushA(void) – push all 8 registers
x86-> svcCC(int,int) – supervisor call with constant args
Asm Internals
The bit-layout is done with macros that implement the Intel documentation. Ugly stuff.
The data member code is an allocated array long enough to hold the assembled instructions (x86 hardware format).
The member function go(), called from the hyper environment, jumps to the code as a void subroutine. On some platforms the icache and dcache have to be tweaked.
Branches are relative. Forward fixups are inserted when the destination becomes known. Big/little endian problems are handled internally by Asm.
Destructing an Asm object also frees the code.
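The forward fixup idea can be sketched in a few lines. This is not the hyper Asm class: the names MiniAsm, jmpForward, fixup and skipTwoNops are hypothetical, and a std::vector stands in for the allocated code array. It emits an x86 jmp rel32 with a placeholder displacement and patches it, little endian, once the destination is known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical miniature of an assembler buffer with forward-branch fixups.
class MiniAsm {
 public:
  // Emit "jmp rel32" (opcode 0xE9) with a zero placeholder displacement.
  // Returns the offset of the 4-byte displacement for later patching.
  std::size_t jmpForward() {
    code.push_back(0xE9);
    std::size_t at = code.size();
    for (int i = 0; i < 4; ++i) code.push_back(0);
    return at;
  }
  // The destination is now known to be the current position: patch it in.
  void fixup(std::size_t at) {
    assert(at + 4 <= code.size());
    // x86 displacements are relative to the end of the branch instruction.
    std::uint32_t rel = static_cast<std::uint32_t>(code.size() - (at + 4));
    for (int i = 0; i < 4; ++i)  // store little endian, byte by byte
      code[at + i] = static_cast<std::uint8_t>((rel >> (8 * i)) & 0xFF);
  }
  void nop() { code.push_back(0x90); }
  const std::vector<std::uint8_t>& bytes() const { return code; }

 private:
  std::vector<std::uint8_t> code;
};

// Usage: a jump over two nops, fixed up when the target is reached.
std::vector<std::uint8_t> skipTwoNops() {
  MiniAsm a;
  std::size_t hole = a.jmpForward();
  a.nop();
  a.nop();
  a.fixup(hole);
  return a.bytes();
}
```

Writing the displacement one byte at a time keeps the sketch independent of the host byte order, which is one way to keep endian problems inside the assembler.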
A Sample Asm Routine
void Asm::addRC(int r, int c) {  // r += c
  if (r == EAX) {
    assmop(0x05);
  } else {
    assmop(0x81);
    assm8(MOD_REG(r,0));
  }
  assm(c);
}
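To see which bytes such a routine produces, here is a self-contained sketch. It assumes that MOD_REG(r, 0) builds a ModRM byte with mod = 11, opcode extension 0 and the register in the low three bits, and that assm(c) appends a 32 bit little endian constant. The names encodeAddRC and modReg are hypothetical stand-ins, not the hyper macros.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum Reg { EAX = 0, ECX = 1, EDX = 2, EBX = 3 };  // Intel register numbers

// Assumed meaning of MOD_REG(r, ext): mod = 11 (register direct),
// opcode extension in bits 5..3, register in bits 2..0.
inline std::uint8_t modReg(int r, int ext) {
  return static_cast<std::uint8_t>(0xC0 | ((ext & 7) << 3) | (r & 7));
}

// Encode r += c and return the instruction bytes.
std::vector<std::uint8_t> encodeAddRC(int r, std::int32_t c) {
  assert(r >= 0 && r < 8);
  std::vector<std::uint8_t> out;
  if (r == EAX) {
    out.push_back(0x05);          // ADD EAX, imm32 (short form)
  } else {
    out.push_back(0x81);          // ADD r/m32, imm32
    out.push_back(modReg(r, 0));  // /0 selects ADD
  }
  std::uint32_t u = static_cast<std::uint32_t>(c);
  for (int i = 0; i < 4; ++i)     // 32 bit constant, little endian
    out.push_back(static_cast<std::uint8_t>((u >> (8 * i)) & 0xFF));
  return out;
}
```

Under these assumptions encodeAddRC(EAX, 1) gives 05 01 00 00 00, and encodeAddRC(ECX, 0x12345678) gives 81 C1 78 56 34 12.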
Disassembler Internals
dis produces a printout of the assembly code.
It is a 256-way switch, each case of which is potentially another switch. In essence, dis is an Intel x86 interpreter that prints instead of doing.
The mnemonics are more closely related to asm than to Intel. The disassembler interprets only what the assembler makes. Otherwise it dumps the hexadecimal.
dis output is passed back via a callback. Hyper places it in the output buffer.
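A toy version of this structure, as a sketch only: a switch on the first opcode byte, a handful of recognized instructions, a hex dump for everything else, and output delivered through a callback. The names miniDis, Sink and disassemble and the printed mnemonics are assumptions for illustration; the real dis covers far more cases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

using Sink = std::function<void(const std::string&)>;  // hypothetical callback type

// Read a little endian 32 bit immediate starting at code[at].
static std::uint32_t imm32(const std::vector<std::uint8_t>& code, std::size_t at) {
  assert(at + 4 <= code.size());
  std::uint32_t v = 0;
  for (int i = 0; i < 4; ++i)
    v |= static_cast<std::uint32_t>(code[at + i]) << (8 * i);
  return v;
}

void miniDis(const std::vector<std::uint8_t>& code, const Sink& out) {
  static const char* const regs[8] = {"eax", "ecx", "edx", "ebx",
                                      "esp", "ebp", "esi", "edi"};
  char buf[64];
  std::size_t pc = 0;
  while (pc < code.size()) {
    const std::size_t left = code.size() - pc;
    switch (code[pc]) {  // first level of the switch: the opcode byte
      case 0x90: out("nop"); pc += 1; continue;
      case 0xC3: out("ret"); pc += 1; continue;
      case 0x05:  // add eax, imm32
        if (left >= 5) {
          std::snprintf(buf, sizeof buf, "addRC eax, %lu",
                        static_cast<unsigned long>(imm32(code, pc + 1)));
          out(buf); pc += 5; continue;
        }
        break;
      case 0x81:  // second level: ModRM with mod = 11 and /0 means add
        if (left >= 6 && (code[pc + 1] & 0xF8) == 0xC0) {
          std::snprintf(buf, sizeof buf, "addRC %s, %lu", regs[code[pc + 1] & 7],
                        static_cast<unsigned long>(imm32(code, pc + 2)));
          out(buf); pc += 6; continue;
        }
        break;
      default: break;
    }
    // Anything not recognized is dumped as one hexadecimal byte.
    std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(code[pc]));
    out(buf); pc += 1;
  }
}

// Usage: collect the callback output into lines, as an output buffer might.
std::vector<std::string> disassemble(const std::vector<std::uint8_t>& code) {
  std::vector<std::string> lines;
  miniDis(code, [&lines](const std::string& s) { lines.push_back(s); });
  return lines;
}
```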
Memory Manager
• MATLAB allocates run-frames in allocated blocks of store.
• Hyper does the same (but I might change it).
• The trick is getting constant addresses from an object containing an arbitrary number of memory locations. Solution: expandable array of fixed-sized blocks.
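One way to read "expandable array of fixed-sized blocks" is sketched below. The class name BlockStore, the block size and the method names are hypothetical. The point is that growing the outer array moves only the block pointers, never the cells, so an address handed out earlier stays valid and can be baked into generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical store of int cells whose addresses never change.
class BlockStore {
 public:
  static constexpr std::size_t kBlock = 256;  // cells per block (arbitrary)

  // Allocate one zeroed cell and return its permanent address.
  int* alloc() {
    if (used == blocks.size() * kBlock)  // all blocks full: add a new one
      blocks.push_back(std::unique_ptr<int[]>(new int[kBlock]()));
    int* p = &blocks[used / kBlock][used % kBlock];
    ++used;
    return p;
  }

  // Address of the i-th cell allocated so far.
  int* at(std::size_t i) {
    assert(i < used);
    return &blocks[i / kBlock][i % kBlock];
  }

  std::size_t size() const { return used; }

 private:
  std::vector<std::unique_ptr<int[]>> blocks;  // expandable array of blocks
  std::size_t used = 0;
};
```

A plain expandable array of cells would not do here, because reallocation on growth would move every cell and invalidate the addresses already compiled into the code.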
Symbol Table
The symbol table is a stack of frames, each of which is a stack of symbols. The classes Sym, Frame, Symbol represent the concepts above. There is an expandable array of frames in the symbol table class, and an expandable array of symbols in each frame. As frames are closed (exit scope) they are placed aside for later access via the symbol table dumper. Upon destruction the expandable arrays and the objects in them must be freed.
In X type is inferred from use; there are no declarations.
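A compact sketch of the stack-of-frames idea follows. It is not the hyper Sym, Frame and Symbol classes: the names MiniSym, MiniFrame, MiniSymbol and the type codes are hypothetical, std::vector replaces the hand-built expandable arrays, and entering an unknown name into the innermost frame on first use is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum Type { UNKNOWN, INT, LOGICAL };  // hypothetical type codes

struct MiniSymbol { std::string name; Type type; };
struct MiniFrame { std::vector<MiniSymbol> syms; };

class MiniSym {
 public:
  void enterScope() { open.push_back(MiniFrame()); }
  void exitScope() {
    assert(!open.empty());
    closed.push_back(open.back());  // set aside for a later dump
    open.pop_back();
  }
  // Record a use of name with type t. Returns false on a type conflict.
  bool use(const std::string& name, Type t) {
    assert(!open.empty());
    if (MiniSymbol* s = find(name)) {
      if (s->type == UNKNOWN) s->type = t;  // type inferred from this use
      return s->type == t;
    }
    open.back().syms.push_back(MiniSymbol{name, t});
    return true;
  }
  Type typeOf(const std::string& name) {
    MiniSymbol* s = find(name);
    return s ? s->type : UNKNOWN;
  }
  std::size_t closedCount() const { return closed.size(); }

 private:
  // Search from the innermost frame outward.
  MiniSymbol* find(const std::string& name) {
    for (std::size_t f = open.size(); f-- > 0;)
      for (std::size_t s = open[f].syms.size(); s-- > 0;)
        if (open[f].syms[s].name == name) return &open[f].syms[s];
    return nullptr;
  }
  std::vector<MiniFrame> open;    // the stack of frames
  std::vector<MiniFrame> closed;  // frames of scopes already exited
};
```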
Class Symbol
class Symbol {
Symbol(char*, int); // ptr to src, len
~Symbol(void); // dtor
char *getName(void);
void setType(int);
int getType(void);
void setAddress(int *);
int *getAddress(void);
…etc.
Class Scan
A scan object accepts a sequence of pointers to text fragments (actually lines) and computes a sequence of tokens, each consisting of
• a token code
• a pointer to the start (in the source)
• a length
All scanner table lookups use a perfect hash.
The scan object provides navigation of the token sequence, making parser lookahead straightforward. The tokens are good only so long as the source text is unchanged and not moved.
A case in scan
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
while (++end<lim && isdigit(*end)) ; // D+
report(numLEX, begin, end);
break;
Switch on raw character. Manipulate pointers only.
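The token representation (code, pointer into the source, length) can be tried out with the small sketch below. It is an illustration, not the hyper scanner: the names Tok, scanLine, idLEX and otherLEX are hypothetical, an if-chain on character class replaces the 256-way switch, and there is no hashed table lookup. Like the original, it copies no text and only moves pointers.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <vector>

enum { numLEX = 1, idLEX = 2, otherLEX = 3 };  // hypothetical token codes

struct Tok {
  int code;           // token code
  const char* start;  // pointer to the start (in the source)
  std::size_t len;    // length
};

// Scan one line [begin, lim) into tokens that point back into the line.
std::vector<Tok> scanLine(const char* begin, const char* lim) {
  assert(begin <= lim);
  std::vector<Tok> toks;
  while (begin < lim) {
    unsigned char c = static_cast<unsigned char>(*begin);
    if (std::isspace(c)) { ++begin; continue; }  // skip white space
    const char* end = begin;
    int code;
    if (std::isdigit(c)) {
      while (++end < lim && std::isdigit(static_cast<unsigned char>(*end))) {}  // D+
      code = numLEX;
    } else if (std::isalpha(c)) {
      while (++end < lim && std::isalnum(static_cast<unsigned char>(*end))) {}
      code = idLEX;
    } else {
      ++end;  // any other character is a one-character token here
      code = otherLEX;
    }
    toks.push_back(Tok{code, begin, static_cast<std::size_t>(end - begin)});
    begin = end;
  }
  return toks;
}
```

Because each Tok only points into the line, the tokens are valid exactly as long as the source text stays put, which is the restriction stated for the scan object.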
One Pass Compiling
Parsing is recursive; the basic token handling, shift/reduce output and diagnostic facilities are in Parse. Parse knows nothing of X.
The language specific recursive routines are in Lang:Parse. Here is where shift and reduce are called. Nothing is known of the semantics of X.
The class Gen:Lang:Parse implements shift and reduce, calling in turn a sequence of pure virtual abstract methods like endIffi(). Nothing is known of the target platform.
Finally, Compile:Gen:Lang:Parse having a symbol table, memory manager and assembler available, implements the concrete semantics of X.
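The layering can be shown in miniature. The sketch below is not hyper code: all class names carry a Mini prefix to mark them as hypothetical, the language is just sums of single digits, tokens are single characters, and the concrete layer evaluates on a value stack instead of emitting x86. What it keeps is the shape: each layer knows only the pure virtuals of the layer below it.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Layer 1: token handling only. Knows nothing of the language.
class MiniParse {
 public:
  explicit MiniParse(const std::string& s) : src(s), pos(0) {}
  virtual ~MiniParse() {}
  virtual void parse() = 0;
  virtual void shift(char tok) = 0;
  virtual void reduce(int rule) = 0;
 protected:
  char tok() const { return pos < src.size() ? src[pos] : '\0'; }
  void step() { if (pos < src.size()) ++pos; }
 private:
  std::string src;
  std::size_t pos;
};

enum { SUM1, SUM2 };  // hypothetical rule codes

// Layer 2: the grammar. Calls shift and reduce, knows no semantics.
class MiniLang : public MiniParse {
 public:
  explicit MiniLang(const std::string& s) : MiniParse(s) {}
  void parse() override {  // sum: digit | sum + digit
    digit(); reduce(SUM1);
    while (tok() == '+') { step(); digit(); reduce(SUM2); }
  }
 private:
  void digit() {
    assert(std::isdigit(static_cast<unsigned char>(tok())));
    shift(tok());
    step();
  }
};

// Layer 3: turns shift/reduce into abstract actions. Knows no target.
class MiniGen : public MiniLang {
 public:
  explicit MiniGen(const std::string& s) : MiniLang(s) {}
  void shift(char t) override { tokStack.push_back(t); }
  void reduce(int rule) override {
    char t = tokStack.back();
    tokStack.pop_back();
    genConst(t - '0');
    if (rule == SUM2) genAdd();
  }
  virtual void genConst(int v) = 0;
  virtual void genAdd() = 0;
 private:
  std::vector<char> tokStack;
};

// Layer 4: concrete semantics (a value stack here, x86 in hyper).
class MiniCompile : public MiniGen {
 public:
  explicit MiniCompile(const std::string& s) : MiniGen(s) {}
  void genConst(int v) override { vals.push_back(v); }
  void genAdd() override { int b = vals.back(); vals.pop_back(); vals.back() += b; }
  int result() const { return vals.back(); }
 private:
  std::vector<int> vals;
};

int evalSum(const std::string& s) { MiniCompile c(s); c.parse(); return c.result(); }
```

Swapping MiniCompile for a class that emits machine code would leave the three layers above it untouched, which is the separation of concerns the hierarchy is meant to enforce.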
Parse Class
class Parse {
  Parse(Scan*);
  ~Parse(void);
  virtual void parse(void) = 0;
  virtual void shift(Token*) = 0;
  virtual void reduce(int) = 0;
  virtual void hint(int) = 0;

  void start(void);
  void step(int);
  void accept(int,char*);
  void reject(char*);
  etc.
}
X.cfg
program
  statements
statements
  statement
  statements ; statement
statement
  exit
  block
  selection
  iteration
  assignment
block
  be identifiers . statements eb
etc.
A recursive routine
void Lang::conjunction(void) {  // a /\ b
  complement();
  reduce(CONJUNCTION1);
  while (tok == divslashLEX) {
    step();  // discard /\…
    complement();
    reduce(CONJUNCTION2);
  }
}
One could use LALR(1)
The Lang class could be implemented with a LALR(1) machine (YACC-like tables) and the rest of hyper would never know…
void Lang::lalr(void) {
  while (lhs != Program) {
    if (isShift(state,tok)) {
      shift(tok);
      step();
    } else {
      lhs = apply(state, tok);
      reduce(lhs);
    }
  }
}
shift/reduce
void Gen::shift(Token *tok) {  // stack token
  tokStack[tokPtr++] = tok;
}
void Gen::reduce(int rule) {  // obey rule
  switch (rule) {
  case PROGRAM1:
    getRet();  // return to hyper
    break;
  case STATEMENTS1:  // one stmt
    break;
  case STATEMENTS2:  // more stmts
    break;
  etc.
}
Compile Class
Implements pure virtuals called from Gen.
Creates a symbol table, assembler and memory manager.
Turns abstract actions into concrete actions.
Layers of the one pass compiler
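[Figure: layer diagram. Layers: Compile, Gen, Lang, Parse. Pure virtuals between the layers: parse(); shift(), reduce(); hint(); beginIffi(), endIffi(); genInfixop(). Components: Sym, Mem, Asm, Scan. Lang is labeled "Recursive routines or LALR(1)". Other labels: go(); abstract, concrete.]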
Some Emitters
void Emit::genRet(void) {  // end of program
  a->epilog();     // x86 return
  s->exitScope();  // close global frame
}
void Emit::genExit(void) {  // loop exit stmt
  if (itPtr > 0) {  // inside loop
    a->movRC(itVar[itPtr-1], 1);
  } else {
    failure("exit outside loop");
  }
}
Edit Class
A Line contains a fragment of text.
Lines contains an array of Line.
A Display is Lines with a visual presentation.
An Edit contains an array of Display (buffers)
Edit has one pure virtual function named callback used to pass uninterpretable keystrokes on to its user.
Hyper Class
Hyper is an Edit with the additional capability of compile and go. The editor knows nothing of this, merely passing the keystrokes ^xe to hyper via the pure virtual callback.
Hyper maintains 3 buffers:
• Source text
• Run shell
• Help
Results, i/o, dumps and diagnostics are placed in the run shell. The user initially sees the help display.
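The relation between Edit and Hyper can be sketched as follows. The classes MiniEdit and MiniHyper are hypothetical stand-ins: keys are passed as strings, a single string replaces the buffers, and "compile and go" is only recorded as a line in the run shell. The editor base class never mentions compiling; it simply hands on whatever it cannot interpret.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical miniature of Edit: handles the keys it understands and
// passes everything else to its user through a pure virtual callback.
class MiniEdit {
 public:
  virtual ~MiniEdit() {}
  void key(const std::string& k) {
    assert(!k.empty());
    if (k.size() == 1) text += k;  // ordinary character: insert it
    else callback(k);              // uninterpretable keystrokes: pass on
  }
  const std::string& source() const { return text; }
 protected:
  virtual void callback(const std::string& k) = 0;
 private:
  std::string text;  // stands in for the source buffer
};

// Hypothetical miniature of Hyper: an editor that also reacts to ^xe.
class MiniHyper : public MiniEdit {
 public:
  const std::vector<std::string>& runShell() const { return shell; }
 protected:
  void callback(const std::string& k) override {
    if (k == "^xe") shell.push_back("compile and go: " + source());
    else shell.push_back("unknown key " + k);
  }
 private:
  std::vector<std::string> shell;  // stands in for the run shell buffer
};
```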
Makefile
CC=g++ -g
MAKE=gmake

EDITDIR = 1edit
SCANDIR = 2scan
PARSEDIR= 3parse
SYMDIR = 4sym
MEMDIR = 5mem
ASMDIR = 6asm
GENDIR = 7gen

include $(EDITDIR)/edit.mk
include $(SCANDIR)/scan.mk
include $(PARSEDIR)/parse.mk
include $(SYMDIR)/sym.mk
include $(MEMDIR)/mem.mk
include $(ASMDIR)/asm.mk
include $(GENDIR)/gen.mk

OBJ=$(EDITOBJ) $(COMPILER)

test: hyper
	hyper smoke.x

unittests:
	$(MAKE) linetest scantest symtest memtest asmtest parsetest gentest

hyper: hyper.h hyper.cxx $(OBJ)
	$(CC) -o hyper hyper.cxx $(OBJ)

clean:
	rm -f *.o bin/*.o *.out *~ */*~ *driver hyper edit
Line Counts
• hyper.cxx 314
• 1edit/ 722
  • edit.cxx 336
  • display.cxx 137
  • lines.cxx 128
  • line.cxx 154
  • terminal.cxx 84
• 2scan/scan.cxx 314
• 3parse/ 454
  • parse.cxx 61
  • x.cxx 393
• 4sym/ 282
  • symbols.cxx 126
  • frame.cxx 60
  • symbol.cxx 96
• 5mem/int32.cxx 47
• 6asm/ 1595
  • x86dis.cxx 887
  • x86asm.cxx 708
• 7gen/ 907
  • gen.cxx 225
  • xcc.cxx 682
• dot-cxx files 4600+
• dot-h files 1300+
• test drivers 1300+