
Post on 22-Dec-2015


A teaching compiler: overview

• X is a programming language

• hyper supports edit-compile-go for X

• 1986 hyper was in C

  • used static, extern, longjmp

  • ran on VAX, alpha, SUN, …

• 2003 hyper implemented in 6000 lines of C++

  • uses exceptions, pure virtual methods

  • runs on Intel x86 linux, power pc

• Similar to MATLAB JIT

My Copyleft

~mckeeman/src/cxx/hyper

COPYRIGHT W. M. McKeeman 1987. You may do anything you like with the file except remove or alter this notice.

I expect the students to learn…

• the mathematics of language description

• scanning, parsing, abstract S/R sequence, semantics

• symbol tables, polish postfix, hardware

• integrated programming environment, editor

• separation of concerns, design, build, test, quality

• dealing with deadlines

I assume you already know…

• Something about compiler implementation

• Grammars

• Scanners

• Recursive Parsers

• Machine Language

• Something about C++

Main Points of Talk

The integrated programming environment consists of an emacs-style editor and a compiler. The compiler takes source from a source buffer, quickly compiles it into Intel x86 code, executes it, and leaves the results in an output buffer.

A C++ class hierarchy represents the independent layers of the components and enforces separation of concerns. Pure virtual functions are used to communicate between the layers.

Individual components are in subdirectories together with their unit tests. The containing directory holds the single makefile for the project, its components and its tests.

A sample X program

` FILE: fact.x
n := inputval+0; res := 1; i := 1;
it
  if i <= n -> res := res*i; i := i + 1
  :: else exit
  fi
ti;
result := res;

inputval only appears on rhs – must be input

result only appears on lhs – must be output
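The rule in the two notes above can be sketched in a few lines of C++. This is only an illustration of the idea, not code from hyper: the class Uses, the enum Role and the helper factUses are hypothetical names, and the uses of each name in fact.x are recorded by hand rather than by a parser.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: record how each name is used, then classify it.
enum class Role { Input, Output, Local };

class Uses {
public:
    void onLhs(const std::string& name) { seen[name].written = true; }
    void onRhs(const std::string& name) { seen[name].read = true; }

    // Only read -> input, only written -> output, both -> ordinary variable.
    Role role(const std::string& name) const {
        auto it = seen.find(name);
        assert(it != seen.end());            // the name must have been recorded
        const Flags& f = it->second;
        if (f.read && !f.written) return Role::Input;
        if (f.written && !f.read) return Role::Output;
        return Role::Local;
    }

private:
    struct Flags { bool read = false; bool written = false; };
    std::map<std::string, Flags> seen;
};

// The uses in fact.x, entered by hand for illustration.
inline Uses factUses() {
    Uses u;
    u.onLhs("n");      u.onRhs("inputval");          // n := inputval+0
    u.onLhs("res");    u.onLhs("i");                 // res := 1; i := 1
    u.onRhs("i");      u.onRhs("n");                 // i <= n
    u.onLhs("res");    u.onRhs("res"); u.onRhs("i"); // res := res*i
    u.onLhs("i");      u.onRhs("i");                 // i := i + 1
    u.onLhs("result"); u.onRhs("res");               // result := res
    return u;
}
```

With these uses, inputval is classified as input, result as output, and n, res and i as ordinary variables.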

X Summary

Scalar assignments and expressions

Strong type inferred from use (int & logical)

Input/output inferred from use

if-fi and it-ti control flow

be-eb nested scopes

Subroutines defined but not implemented

demo

Fast

Typical Student Projects

• Replace 32 bit int with 32 bit float (too easy)

• Implement 64 bit double

• Implement arrays

• Implement sets (perhaps infinite)

• Implement BIGNUM rationals

• Implement subroutines

• Implement C-style declarations (too dull)

• Implement conventional I/O (retro)

Object Structure

Hyper:Edit Display:Lines Line

Compile:Gen:Lang:Parse

Sym Frame Symbol

Mem

Asm

Scan

Assembler

The Intel x86 assembler implements an open-ended set of methods. Examples of calls to Asm object x86:

x86->addRR(int, int) – add register-register

x86->daddp(void) – double add, pop

x86->dldM(double*) – double load from double memory

x86->dildM(int*) – double load from int memory

x86->callRi(int) – call indirect through register

x86->pushA(void) – push all 8 registers

x86->svcCC(int,int) – supervisor call with constant args

Asm Internals

The bit-layout is done with macros that implement the Intel documentation. Ugly stuff.

The data member code is an allocated array long enough to hold the assembled instructions (x86 hardware format).

The member function go(), called from the hyper environment, jumps to the code as a void subroutine. On some platforms the icache and dcache have to be tweaked.

Branches are relative. Forward fixups are inserted when the destination becomes known. Big/little endian problems are handled internally by Asm.

Destructing an Asm object also frees the code.
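The forward-fixup idea can be sketched as follows. This is a minimal illustration under stated assumptions, not hyper's Asm class: CodeBuf, jmpForward and fixup are hypothetical names, and only the JMP rel32 encoding (opcode 0xE9) is shown. Bytes are stored one at a time in little-endian order, so the byte order of the host does not matter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of relative branches with forward fixups.
class CodeBuf {
public:
    // Emit JMP rel32 with a placeholder displacement.
    // Returns the offset of the 4-byte field to be patched later.
    std::size_t jmpForward() {
        code.push_back(0xE9);
        std::size_t field = code.size();
        emit32(0);
        return field;
    }

    // The destination is "here": patch the displacement stored at field.
    // x86 relative branches are measured from the end of the instruction.
    void fixup(std::size_t field) {
        assert(field + 4 <= code.size());
        put32(field, static_cast<std::int32_t>(code.size() - (field + 4)));
    }

    void emit8(std::uint8_t b) { code.push_back(b); }

    void emit32(std::int32_t v) {
        std::size_t at = code.size();
        code.resize(at + 4);
        put32(at, v);
    }

    const std::vector<std::uint8_t>& bytes() const { return code; }

private:
    // Little-endian store, independent of the host byte order.
    void put32(std::size_t at, std::int32_t v) {
        std::uint32_t u = static_cast<std::uint32_t>(v);
        for (int i = 0; i < 4; ++i)
            code[at + i] = static_cast<std::uint8_t>(u >> (8 * i));
    }

    std::vector<std::uint8_t> code;
};
```

A caller keeps the value returned by jmpForward() and hands it to fixup() once the code up to the destination has been assembled.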

A Sample Asm Routine

void Asm::addRC(int r, int c) {   // r += c
  if (r == EAX) {
    assmop(0x05);                 // short form: ADD EAX, imm32
  } else {
    assmop(0x81);                 // general form: ADD r/m32, imm32
    assm8(MOD_REG(r,0));
  }
  assm(c);                        // the constant
}

Disassembler Internals

dis produces a printout of the assembly code.

It is a 256-way switch, each case of which is potentially another switch. In essence, dis is an Intel x86 interpreter that prints instead of doing.

The mnemonics are more closely related to asm than to Intel. The disassembler interprets only what the assembler makes. Otherwise it dumps the hexadecimal.

The dis output is passed back via a callback. Hyper places it in the output buffer.
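A much reduced sketch of that structure: a switch on the first byte, a second decision on the following byte, output through a callback, and a hexadecimal dump for anything else. It is not hyper's x86dis.cxx. The names Sink, imm32 and dis1 are hypothetical, the register spelling r0 to r7 is made up, and only the two encodings of ADD with a 32-bit constant are decoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical sketch: output goes to a callback, one line per instruction.
using Sink = std::function<void(const std::string&)>;

inline int imm32(const std::uint8_t* p) {        // little-endian constant
    std::uint32_t u = p[0] | (p[1] << 8) | (p[2] << 16) | (std::uint32_t(p[3]) << 24);
    return static_cast<int>(static_cast<std::int32_t>(u));
}

// Disassemble one instruction at code[pos]; return the bytes consumed.
inline std::size_t dis1(const std::uint8_t* code, std::size_t len,
                        std::size_t pos, const Sink& out) {
    assert(pos < len);
    char buf[64];
    switch (code[pos]) {
    case 0x05:                                   // ADD EAX, imm32
        if (pos + 5 <= len) {
            std::snprintf(buf, sizeof buf, "addRC r0, %d", imm32(code + pos + 1));
            out(buf);
            return 5;
        }
        break;
    case 0x81:                                   // second decision on the ModRM byte
        if (pos + 6 <= len && (code[pos + 1] & 0xF8) == 0xC0) {  // mod=11, /0 = ADD
            std::snprintf(buf, sizeof buf, "addRC r%d, %d",
                          code[pos + 1] & 7, imm32(code + pos + 2));
            out(buf);
            return 6;
        }
        break;
    default:
        break;
    }
    std::snprintf(buf, sizeof buf, "0x%02X", static_cast<unsigned>(code[pos]));
    out(buf);                                    // not recognized: dump the byte
    return 1;
}
```

The caller loops, advancing its position by the returned byte count, and decides in the callback where the text goes.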

Memory Manager

• MATLAB allocates run-frames in allocated blocks of store.

• Hyper does the same (but I might change it).

• The trick is getting constant addresses from an object containing an arbitrary number of memory locations. Solution: expandable array of fixed-sized blocks.
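The "expandable array of fixed-sized blocks" can be sketched as below. This is an assumption-laden illustration, not hyper's memory manager: the class name Mem matches the slides, but alloc, blockCount and the tiny block size are hypothetical. The point is that the table of blocks may grow and move while the blocks themselves never do, so an address handed out once stays constant.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch: constant addresses from a growing store.
class Mem {
public:
    static constexpr std::size_t BLOCK = 4;   // tiny, so growth is easy to observe

    // Allocate one int cell; its address stays valid until Mem is destroyed.
    int* alloc() {
        if (blocks.empty() || used == BLOCK) {
            blocks.push_back(std::make_unique<int[]>(BLOCK));  // zero-initialized
            used = 0;
        }
        assert(used < BLOCK);
        return &blocks.back()[used++];
    }

    std::size_t blockCount() const { return blocks.size(); }

private:
    std::vector<std::unique_ptr<int[]>> blocks;  // the table may move; blocks do not
    std::size_t used = 0;                        // cells used in the last block
};
```

Generated code can therefore embed the address returned by alloc() as a constant, even if more cells are allocated afterwards.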

Symbol Table

The symbol table is a stack of frames, each of which is a stack of symbols. The classes Sym, Frame, Symbol represent the concepts above. There is an expandable array of frames in the symbol table class, and an expandable array of symbols in each frame. As frames are closed (exit scope) they are placed aside for later access via the symbol table dumper. Upon destruction the expandable arrays and the objects in them must be freed.

In X type is inferred from use; there are no declarations.
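A stack of frames, each a stack of symbols, might look like the sketch below. It is not hyper's code: the class names follow the slides, but the members enter, typeOf and closedCount are hypothetical, std::vector stands in for the expandable arrays, and the usual innermost-first lookup order is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the Sym / Frame / Symbol structure.
struct Symbol { std::string name; int type; };

struct Frame { std::vector<Symbol> symbols; };

class Sym {
public:
    void enterScope() { open.emplace_back(); }

    void exitScope() {                        // closed frames are kept for the dumper
        assert(!open.empty());
        closed.push_back(std::move(open.back()));
        open.pop_back();
    }

    // Enter a symbol in the innermost open frame.
    void enter(const std::string& name, int type) {
        assert(!open.empty());
        open.back().symbols.push_back(Symbol{name, type});
    }

    // Search from the innermost frame outward; -1 means "not found".
    int typeOf(const std::string& name) const {
        for (std::size_t f = open.size(); f-- > 0; )
            for (const Symbol& s : open[f].symbols)
                if (s.name == name) return s.type;
        return -1;
    }

    std::size_t closedCount() const { return closed.size(); }

private:
    std::vector<Frame> open;     // stack of frames, global frame first
    std::vector<Frame> closed;   // frames set aside at scope exit
};
```

Because the containers own their contents, destroying a Sym frees the frames and symbols, which the C-style expandable arrays have to do by hand.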

Symbol Table Objects

[Figure: sym holds a stack of frames (global, level 1, level 2); each frame holds its symbols; the lookup direction is marked.]

Class Symbol

class Symbol {
  Symbol(char*, int);      // ptr to src, len
  ~Symbol(void);           // dtor
  char *getName(void);
  void setType(int);
  int getType(void);
  void setAddress(int *);
  int *getAddress(void);
  …etc.

Class Scan

A scan object accepts a sequence of pointers to text fragments (actually lines) and computes a sequence of tokens, each consisting of

• a token code

• a pointer to the start (in the source)

• a length

All scanner table lookups use a perfect hash.

The scan object provides navigation of the token sequence, making parser lookahead straightforward. The tokens are good only so long as the source text is unchanged and not moved.

A case in scan

case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
  while (++end<lim && isdigit(*end)) ;   // D+
  report(numLEX, begin, end);
  break;

Switch on raw character. Manipulate pointers only.
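A self-contained sketch of the same style follows: tokens are a code, a pointer into the source and a length, and the scanner only moves pointers. It is not hyper's Scan class. The token codes idLEX and otherLEX, the Token struct and scanLine are hypothetical, and an if-chain on character class replaces the switch to keep the example short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <vector>

// Hypothetical token codes; only numLEX appears in the slides.
enum { numLEX = 1, idLEX = 2, otherLEX = 3 };

struct Token {
    int code;            // token code
    const char* start;   // points into the source text, not a copy
    std::size_t length;
};

// Scan one line [begin, lim) and append its tokens to out.
inline void scanLine(const char* begin, const char* lim, std::vector<Token>& out) {
    assert(begin <= lim);
    while (begin < lim) {
        unsigned char c = static_cast<unsigned char>(*begin);
        if (std::isspace(c)) { ++begin; continue; }
        const char* end = begin;
        int code;
        if (std::isdigit(c)) {                        // D+
            while (++end < lim && std::isdigit(static_cast<unsigned char>(*end))) ;
            code = numLEX;
        } else if (std::isalpha(c)) {                 // L(L|D)*
            while (++end < lim && std::isalnum(static_cast<unsigned char>(*end))) ;
            code = idLEX;
        } else {                                      // any other single character
            ++end;
            code = otherLEX;
        }
        out.push_back(Token{code, begin, static_cast<std::size_t>(end - begin)});
        begin = end;
    }
}
```

Since each token points into the source, the vector of tokens is only meaningful while the scanned text stays where it is, which is the restriction stated on the Scan slide.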

One Pass Compiling

Parsing is recursive; the basic token handling, shift/reduce output and diagnostic facilities are in Parse. Parse knows nothing of X.

The language specific recursive routines are in Lang:Parse. Here is where shift and reduce are called. Nothing is known of the semantics of X.

The class Gen:Lang:Parse implements shift and reduce, calling in turn a sequence of pure virtual abstract methods like endIffi(). Nothing is known of the target platform.

Finally, Compile:Gen:Lang:Parse, having a symbol table, memory manager and assembler available, implements the concrete semantics of X.
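The four layers can be reduced to a toy that still shows how pure virtuals carry the calls downward. Everything here beyond the class names is hypothetical: the rule number ASSIGNMENT1, the abstract action genAssign and the log string are stand-ins, and each class omits nearly all of what the real one does.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the layering Compile:Gen:Lang:Parse.
class Parse {                          // token handling; knows nothing of X
public:
    virtual ~Parse() {}
    virtual void parse() = 0;
    virtual void reduce(int rule) = 0;
};

enum { ASSIGNMENT1 = 1 };              // made-up rule number

class Lang : public Parse {            // X syntax; knows nothing of semantics
public:
    void parse() override { reduce(ASSIGNMENT1); }  // stands in for the recursive routines
};

class Gen : public Lang {              // abstract semantics; knows nothing of the target
public:
    void reduce(int rule) override {
        if (rule == ASSIGNMENT1) genAssign();
    }
    virtual void genAssign() = 0;      // made-up abstract action
};

class Compile : public Gen {           // concrete semantics for one target
public:
    void genAssign() override { log += "store;"; }  // stands in for assembler calls
    std::string log;
};
```

Only Compile can be instantiated; calling parse() on it runs Lang's syntax code, which reaches Gen's reduce and finally Compile's concrete action, without any upper layer naming a lower one.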

Parse Class

class Parse {
  Parse(Scan*);
  ~Parse(void);
  virtual void parse(void) = 0;
  virtual void shift(Token*) = 0;
  virtual void reduce(int) = 0;
  virtual void hint(int) = 0;

  void start(void);
  void step(int);
  void accept(int,char*);
  void reject(char*);
  etc.
}

X.cfg

program
    statements
statements
    statement
    statements ; statement
statement
    exit
    block
    selection
    iteration
    assignment
block
    be identifiers . statements eb
etc.

A recursive routine

void Lang::conjunction(void) {   // a /\ b
  complement();
  reduce(CONJUNCTION1);
  while (tok == divslashLEX) {
    step();                      // discard /\…
    complement();
    reduce(CONJUNCTION2);
  }
}

One could use LALR(1)

The Lang class could be implemented with a LALR(1) machine (YACC-like tables) and the rest of hyper would never know…

void Lang::lalr(void) {
  while (lhs != Program) {
    if (isShift(state,tok)) {
      shift(tok);
      step();
    } else {
      lhs = apply(state, tok);
      reduce(lhs);
    }
  }
}

shift/reduce

void Gen::shift(Token *tok) {   // stack token
  tokStack[tokPtr++] = tok;
}
void Gen::reduce(int rule) {    // obey rule
  switch (rule) {
  case PROGRAM1:
    getRet();                   // return to hyper
    break;
  case STATEMENTS1:             // one stmt
    break;
  case STATEMENTS2:             // more stmts
    break;
  etc.
}

Compile Class

Implements pure virtuals called from Gen.

Creates a symbol table, assembler and memory manager.

Turns abstract actions into concrete actions.

Layers of the one pass compiler

[Figure: the layers Compile, Gen, Lang, Parse along an abstract-to-concrete axis. Pure virtuals between the layers: parse(), shift(), reduce(), hint(), beginIffi(), endIffi(), genInfixop(). Other labels: Sym, Mem, Asm, Scan, "Recursive routines or LALR(1)", go().]

Using an AST

[Figure: the layering with an Ast class: Ast, Lang, Parse, with parse(), shift(), reduce(), hint() and Scan. Other labels: Compile, Sym, Mem, Asm, go().]

Some Emitters

void Emit::genRet(void) {     // end of program
  a->epilog();                // x86 return
  s->exitScope();             // close global frame
}

void Emit::genExit(void) {    // loop exit stmt
  if (itPtr > 0) {            // inside loop
    a->movRC(itVar[itPtr-1], 1);
  } else {
    failure("exit outside loop");
  }
}

Edit Class

A Line contains a fragment of text.

Lines contains an array of Line.

A Display is Lines with a visual presentation.

An Edit contains an array of Display (buffers)

Edit has one pure virtual function named callback used to pass uninterpretable keystrokes on to its user.

Hyper Class

Hyper is an Edit with the additional capability of compile and go. The editor knows nothing of this, merely passing the keystrokes ^xe to hyper via the pure virtual callback.

Hyper maintains 3 buffers:

• Source text

• Run shell

• Help

Results, i/o, dumps and diagnostics are placed in the run shell. The user initially sees the help display.
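The relation between Edit and Hyper can be sketched with one pure virtual. This is an illustration only: the key() entry point, the text and runs members, and the rule that every multi-character key is "uninterpretable" are assumptions made to keep the example small.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: the editor forwards what it cannot interpret.
class Edit {
public:
    virtual ~Edit() {}

    void key(const std::string& k) {
        if (k.size() == 1) text += k;   // ordinary character: the editor handles it
        else callback(k);               // anything else goes to the user of Edit
    }

    std::string text;                   // stands in for the current buffer

protected:
    virtual void callback(const std::string& k) = 0;
};

class Hyper : public Edit {
public:
    int runs = 0;                       // counts compile-and-go requests

protected:
    void callback(const std::string& k) override {
        if (k == "^xe") ++runs;         // stands in for compile and go
    }
};
```

Edit never mentions compiling; a different subclass could give ^xe another meaning without touching the editor.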

Directory structure

/hyper
    /1edit
    /2scan
    /3parse
    /4sym
    /5mem
    /6asm
    /7gen

Makefile

CC=g++ -g
MAKE=gmake

EDITDIR = 1edit
SCANDIR = 2scan
PARSEDIR= 3parse
SYMDIR = 4sym
MEMDIR = 5mem
ASMDIR = 6asm
GENDIR = 7gen

include $(EDITDIR)/edit.mk
include $(SCANDIR)/scan.mk
include $(PARSEDIR)/parse.mk
include $(SYMDIR)/sym.mk
include $(MEMDIR)/mem.mk
include $(ASMDIR)/asm.mk
include $(GENDIR)/gen.mk

OBJ=$(EDITOBJ) $(COMPILER)

test: hyper
	hyper smoke.x

unittests:
	$(MAKE) linetest scantest symtest memtest asmtest parsetest gentest

hyper: hyper.h hyper.cxx $(OBJ)
	$(CC) -o hyper hyper.cxx $(OBJ)

clean:
	rm -f *.o bin/*.o *.out *~ */*~ *driver hyper edit

Line Counts

• hyper.cxx 314
• 1edit/ 722
  • edit.cxx 336
  • display.cxx 137
  • lines.cxx 128
  • line.cxx 154
  • terminal.cxx 84
• 2scan/scan.cxx 314
• 3parse/ 454
  • parse.cxx 61
  • x.cxx 393
• 4sym/ 282
  • symbols.cxx 126
  • frame.cxx 60
  • symbol.cxx 96
• 5mem/int32.cxx 47
• 6asm/ 1595
  • x86dis.cxx 887
  • x86asm.cxx 708
• 7gen/ 907
  • gen.cxx 225
  • xcc.cxx 682
• dot-cxx files 4600+
• dot-h files 1300+
• test drivers 1300+