    Compilers and Compiler Generators P.D. Terry, 2000

    14 A SIMPLE COMPILER - THE FRONT END

At this point it may be of interest to consider the construction of a compiler for a simple programming language, specifically that of section 8.7. In a text of this nature it is impossible to discuss a full-blown compiler, and the value of our treatment may arguably be reduced by the fact that in dealing with toy languages and toy compilers we shall be evading some of the real issues that a compiler writer has to face. However, we hope the reader will find the ensuing discussion of interest, and that it will serve as a useful preparation for the study of much larger compilers. The technique we shall follow is one of slow refinement, supplementing the discussion with numerous asides on the issues that would be raised in compiling larger languages. Clearly, we could opt to develop a completely hand-crafted compiler, or simply to use a tool like Coco/R. We shall discuss both approaches. Even when a compiler is constructed by hand, having an attributed grammar to describe it is very worthwhile.

On the source diskette can be found a great deal of code, illustrating different stages of development of our system. Although some of this code is listed in appendices, its volume precludes printing all of it. Some of it has deliberately been written in a way that allows for simple modification when attempting the exercises, and is thus not really of "production quality". For example, in order to allow components such as the symbol table handler and code generator to be used either with hand-crafted or with Coco/R generated systems, some compromises in design have been necessary.

Nevertheless, the reader is urged to study the code along with the text, and to attempt at least some of the many exercises based on it. A particularly worthwhile project is to construct a similar compiler, based on a language whose syntax resembles C++ rather more than it does Pascal, and whose development, like that of C++, will be marked by the steady assimilation of extra features. This language we shall name "Topsy", after the little girl in Harriet Beecher Stowe's story who knew little of her genealogy except a suspicion that she had "grow'd". A simple Topsy program was illustrated in Exercise 8.25, where the reader was invited to create an initial syntactic specification in Cocol.

    14.1 Overall compiler structure

In Chapter 2 we commented that a compiler is often developed as a sequence of phases, of which syntactic analysis is only one. Although a recursive descent parser is easily written by applying the ideas of earlier chapters, it should be clear that consideration will have to be given to the relationship of this to the other phases. We can think of a compiler with a recursive descent parser at its core as having the structure depicted in Figure 14.1.


We emphasize that phases need not be sequential, as passes would be. In a recursive descent compiler the phases of syntax analysis, semantic analysis and code generation are very often interleaved, especially if the source language is designed in such a way as to permit one-pass compilation. Nevertheless, it is useful to think of developing modular components to handle the various phases, with clear simple interfaces between them.

In our Modula-2 implementations, the various components of our diagram have been implemented as separate modules, whose DEFINITION MODULE components export only those facilities that the clients need be aware of. The corresponding C++ implementations use classes to achieve the same sort of abstraction and protection.

    In principle, the main routine of our compiler must resemble something like the following

    void main(int argc, char *argv[])
    { char SourceName[256], ListName[256];

      // handle command line parameters
      strcpy(SourceName, argv[1]);
      if (argc > 2) strcpy(ListName, argv[2]);
      else appendextension(SourceName, ".lst", ListName);

      // instantiate compiler components
      SRCE   *Source  = new SRCE(SourceName, ListName, "Compiler Version 1", true);
      REPORT *Report  = new REPORT(Source);
      SCAN   *Scanner = new SCAN(Source, Report);
      CGEN   *CGen    = new CGEN(Report);
      TABLE  *Table   = new TABLE(Report);
      PARSER *Parser  = new PARSER(CGen, Scanner, Table, Report);

      // start compilation
      Parser->parse();
    }

where we notice that instances of the various classes are constructed dynamically, and that their constructors establish links between them that correspond to those shown in Figure 14.1.

In practice our compilers do not look exactly like this. For example, Coco/R generates only a scanner, a parser, a rudimentary error report generator and a driver routine. The scanner and the source handling section of the file handler are combined into one module, and the routines for producing the error listing are generated along with the main driver module. The C++ version of Coco/R makes use of a standard class hierarchy involving parser, scanner and error reporter classes, and establishes links between the various instances of these classes as they are constructed. This gives the flexibility of having multiple instances of parsers or scanners within one system (however, our case studies will not exploit this power).


    14.2 Source handling

Among the file handling routines is to be found one that has the task of transmitting the source, character by character, to the scanner or lexical analyser (which assembles it into symbols for subsequent parsing by the syntax analyser). Ideally this source handler should have to scan the program text only once, from start to finish, and in a one-pass compiler this should always be possible.

    14.2.1 A hand-crafted source handler

The interface needed between source handler and lexical analyser is straightforward, and can be supplied by a routine that simply extracts the next character from the source each time it is invoked. It is convenient to package this with the routines that assume responsibility for producing a source listing and, where necessary, producing an error message listing, because all these requirements will be input/output device dependent. It is useful to add some extra functionality, and so the public interface to our source handling class is defined by

    class SRCE {
      public:
        FILE *lst;             // listing file
        char ch;               // latest character read

        void nextch(void);
        // Returns ch as the next character on this source line, reading a new
        // line where necessary. ch is returned as NUL if src is exhausted.

        bool endline(void);
        // Returns true when end of current line has been reached

        void listingon(void);
        // Requests source to be listed as it is read

        void listingoff(void);
        // Requests source not to be listed as it is read

        void reporterror(int errorcode);
        // Points out error identified by errorcode with suitable message

        virtual void startnewline() {;}
        // Called at start of each line

        int getline(void);
        // Returns current line number

        SRCE(char *sourcename, char *listname, char *version, bool listwanted);
        // Opens src and lst files using given names.
        // Resets internal state in readiness for starting to scan.
        // Notes whether listwanted. Displays version information on lst file.

        ~SRCE();
        // Closes src and lst files
    };

    Some aspects of this interface deserve further comment:

    We have not shown the private members of the class, but of course there are several of these.

The startnewline routine has been declared virtual so that a simple class can be derived from this one to allow for the addition of extra material at the start of each new line on the listing - for example, line numbers or object code addresses (a sketch of this idea appears after this list). In Modula-2, Pascal or C, the same sort of functionality may be obtained by manipulating a procedure variable or function pointer.

Ideally, both source and listing files should remain private. This source handler declares the listing file public only so that we can add trace or debugging information while the system is being developed.

The class constructor and destructor assume responsibility for opening and closing the files, whose names are passed as arguments.
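As an illustration of the use of the virtual startnewline routine, a derived class might prefix each listed line with its line number. This is only a sketch of the idea (the class name is ours, and it assumes that stdio.h has been included along with the SRCE interface above); it is not code from the source diskette:

    class numberedSRCE : public SRCE {
      public:
        numberedSRCE(char *sourcename, char *listname, char *version, bool listwanted)
          : SRCE(sourcename, listname, version, listwanted) {}
        virtual void startnewline()
        // prefix each reflected line with its number
        { fprintf(lst, "%4d : ", getline()); }
    };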

The implementation of this class is fairly straightforward, and has much in common with the similar class used for the assemblers of Chapter 6. The code appears in Appendix B, and the following implementation features are worthy of brief comment:

The source is scanned and reflected a whole line at a time, as this makes subsequent error reporting much easier.

The handler effectively inserts an extra blank at the end of each line (see the sketch after this list). This decouples the rest of the system from the vagaries of whatever method the host operating system uses to represent line ends in text files. It also ensures that no symbol may extend over a line break.

    It is not possible to read past the end of file - attempts to do so simply return a NUL character.

The reporterror routine will not display an error message unless a minimum number of characters have been scanned since the last error was reported. This helps suppress the cascade of error messages that might otherwise appear at any one point during error recovery of the sort discussed in sections 10.3 and 14.6.

Our implementation has chosen to use the stdio library, rather than iostreams, mainly to take advantage of the concise facilities provided by the printf routine.
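To make the first three of these points concrete, a minimal sketch of how nextch might be implemented follows. The private members line, charpos, linelength and exhausted, and the helper getnextline, are our assumed names and are not part of the interface shown earlier; the actual code in Appendix B differs in detail:

    void SRCE::nextch(void)
    // Returns ch as the next character on this source line, reading a new
    // line (and reflecting it on the listing) where necessary
    { if (exhausted) { ch = '\0'; return; }   // reading past end of file yields NUL
      if (charpos > linelength)               // current line, plus its extra blank, used up
        getnextline();                        // read and reflect the next line, append a
                                              // blank, reset charpos, note end of file
      if (exhausted) { ch = '\0'; return; }
      ch = line[charpos++];                   // hand over the next character
    }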

    Exercises

14.1 The nextch routine will be called once for every character in the source text. This can represent a considerable bottleneck, especially as programs are often prepared with a great many blanks at the starts of indented lines. Some editors also pad out the ends of lines with unnecessary blanks. Can you think of any way in which this overhead might be reduced?

14.2 Some systems allowing an ASCII character set (with ordinal values in the range 0 ... 127) are used with input devices which generate characters having ordinal values in the range 0 ... 255 - typically the "eighth bit" might always be set, or used or confused with parity checking. How and where could this bit be discarded?

14.3 A source handler might improve on efficiency rather dramatically were it able to read the entire source file into a large memory buffer. Since modern systems are often blessed with relatively huge amounts of RAM, this is usually quite feasible. Develop such a source handler, compatible with the class interface suggested above, bearing in mind that we wish to be able to reflect the source line by line as it is read, so that error messages can be appended as exemplified in section 14.6.

14.4 Develop a source handler implementation that uses the C++ stream-based facilities from the iostreams library.


    14.2.2 Source handling in Coco/R generated systems

As we have already mentioned, Coco/R integrates the functions of source handler and scanner, so as to be able to cut down on the number of files it has to generate. The source and listing files have to be opened before making the call to instantiate or initialize the scanner, but this is handled automatically by the generated driver routine. It is of interest that the standard frame files supplied with Coco/R arrange for this initialization to read the entire source file into a buffer, as suggested in Exercise 14.3.

    14.3 Error reporting

As can be seen from Figure 14.1, most components of a compiler have to be prepared to signal that something has gone awry in the compilation process. To allow all of this to take place in a uniform way, we have chosen to introduce a base class with a very small interface:

    class REPORT {
      public:
        REPORT();
        // Initializes error reporter

        virtual void error(int errorcode);
        // Reports on error designated by suitable errorcode number

        bool anyerrors(void);
        // Returns true if any errors have been reported

      protected:
        bool errors;
    };

Error reporting is then standardized by calling on the error member of this class whenever an error is detected, passing it a unique number to distinguish the error.

The base class can choose simply to abort compilation altogether. Although at least one highly successful microcomputer Pascal compiler uses this strategy (Turbo Pascal, from Borland International), it tends to become very annoying when one is developing large systems. Since the error member is virtual, it is an easy matter to derive a more suitable class from this one, without, of course, having to amend any other part of the system. For our hand-crafted system we can do this as follows:

    class clangReport : public REPORT {
      public:
        clangReport(SRCE *S)              { Srce = S; }
        virtual void error(int errorcode) { Srce->reporterror(errorcode); errors = true; }
      private:
        SRCE *Srce;
    };
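In the driver routine sketched in section 14.1 one would then simply construct this derived reporter in place of the base class, for example

    REPORT *Report = new clangReport(Source);

so that error messages are reflected on the source listing, while the rest of the system continues to see only the REPORT interface.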

The same technique can be used to enhance Coco/R generated systems. The Modula-2 and Pascal implementations achieve the same functionality through the use of procedure variables.

    14.4 Lexical analysis

The main task of the scanner is to provide some way of uniquely identifying each successive token or symbol in the source code that is being compiled. Lexical analysis was discussed in section 10.4, and presents few problems for a language as simple as ours.

    14.4.1 A hand-crafted scanner

The interface between the scanner and the main parser is conveniently provided by a routine getsym for returning a parameter SYM of a record or structure type assembled from the source text. This can be achieved by defining a class with a public interface as follows:

    enum SCAN_symtypes {
      SCAN_unknown, SCAN_becomes, SCAN_lbracket, SCAN_times, SCAN_slash, SCAN_plus,
      SCAN_minus, SCAN_eqlsym, SCAN_neqsym, SCAN_lsssym, SCAN_leqsym, SCAN_gtrsym,
      SCAN_geqsym, SCAN_thensym, SCAN_dosym, SCAN_rbracket, SCAN_rparen, SCAN_comma,
      SCAN_lparen, SCAN_number, SCAN_stringsym, SCAN_identifier, SCAN_coendsym,
      SCAN_endsym, SCAN_ifsym, SCAN_whilesym, SCAN_stacksym, SCAN_readsym,
      SCAN_writesym, SCAN_returnsym, SCAN_cobegsym, SCAN_waitsym, SCAN_signalsym,
      SCAN_semicolon, SCAN_beginsym, SCAN_constsym, SCAN_varsym, SCAN_procsym,
      SCAN_funcsym, SCAN_period, SCAN_progsym, SCAN_eofsym
    };

    const int lexlength = 128;
    typedef char lexeme[lexlength + 1];

    struct SCAN_symbols {
      SCAN_symtypes sym;       // symbol type
      int num;                 // value
      lexeme name;             // lexeme
    };

    class SCAN {
      public:
        void getsym(SCAN_symbols &SYM);
        // Obtains the next symbol in the source text

        SCAN(SRCE *S, REPORT *R);
        // Initializes scanner
    };

    Some aspects of this interface deserve further comment:

SCAN_symbols makes provision for returning not only a unique symbol type, but also the corresponding textual representation (known as a lexeme), and also the numeric value when a symbol is recognized as a number.

SCAN_unknown caters for erroneous characters like # and ? which do not really form part of the terminal alphabet. Rather than take action, the humble scanner returns the symbol without comment, and leaves the parser to cope. Similarly, an explicit SCAN_eofsym is always returned if getsym is called after the source code has been exhausted.

The ordering of the SCAN_symtypes enumeration is significant, and supports an interesting form of error recovery that will be discussed in section 14.6.1.

The enumeration has also made provision for a few symbol types that will be used in the extensions of later chapters.

A scanner for Clang is readily programmed in an ad hoc manner, driven by a selection statement, and an implementation can be found in Appendix B. As with source handling, some implementation issues call for comment:

Some ingenuity has to be applied to the recognition of literal strings. A repeated quote within a string is used (as in Pascal) to denote a single quote, so that the end of a string can only be detected when an odd number of quotes is followed by a non-quote.


The scanner has assumed responsibility for a small amount of semantic activity, namely the evaluation of a number. Although it may seem a convenient place to do this, such analysis is not always as easy as it might appear. It becomes slightly more difficult to develop scanners that have to distinguish between numbers represented in different bases, or between real and integer numbers.

    There are two areas where the scanner has been given responsibility for detecting errors:

Although the syntactic description of the language does not demand it, practical considerations require that the value of a numeric constant should be within the range of the machine. This is somewhat tricky to ensure in the case of cross-compilers, where the range on the host and target machines may be different. Some authors go so far as to suggest that this semantic activity be divorced from lexical analysis for that reason. Our implementation shows how range checking can be handled for a self-resident compiler.

Not only do many languages insist that no identifier (or any other symbol) be carried across a line break, they usually do this for strings as well. This helps to guard against the chaos that would arise were a closing quote to be omitted - further code would become string text, and future string text would become code! The limitation that a string be confined to one source line is, in practice, rarely a handicap, and the restriction is easily enforced.

We have chosen to use a binary search to recognize the reserved keywords. Tables of symbol types that correspond to keywords, and of symbols that correspond to single character terminals, are initialized as the scanner is instantiated. The idioms of C++ programming suggest that such activities are best achieved by having static members of the class, set up by an "initializer" that forms part of their definition. For a binary search to function correctly it is necessary that the table of keywords be in alphabetic order, and care must be taken if and when the scanner is extended.
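A minimal sketch of such a lookup follows. The table layout, the function name and the selection of keywords shown are ours (the full table would list every Clang keyword, in strictly alphabetic order), and string.h is assumed to have been included:

    struct keyword { char resword[16]; SCAN_symtypes sym; };

    static const keyword keywords[] = {        // must be in alphabetic order
      { "BEGIN", SCAN_beginsym }, { "CONST", SCAN_constsym }, { "DO", SCAN_dosym },
      /* ... remaining keywords ... */
      { "WHILE", SCAN_whilesym }, { "WRITE", SCAN_writesym }
    };
    static const int nkeywords = sizeof(keywords) / sizeof(keywords[0]);

    SCAN_symtypes keywordoridentifier(const char *lexeme)
    // Binary search of the keyword table; any word not found is an identifier
    { int low = 0, high = nkeywords - 1;
      while (low <= high)
      { int mid = (low + high) / 2;
        int cmp = strcmp(lexeme, keywords[mid].resword);
        if (cmp == 0) return keywords[mid].sym;
        if (cmp < 0) high = mid - 1; else low = mid + 1;
      }
      return SCAN_identifier;
    }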

    Exercises

14.5 The only screening this scanner does is to strip blanks separating symbols. How would you arrange for it to strip comments

(a) of the form { comment in curly braces }
(b) of the form (* comment in Modula-2 braces *)
(c) of the form // comment to end of the line as in C++
(d) of either or both of forms (a) and (b), allowing for nesting?

14.6 Balanced comments are actually dangerous. If not properly closed, they may consume valid source code. One way of assisting the coder is to issue a warning if a semicolon is found within a comment. How could this be implemented as part of the answer to Exercise 14.5?

14.7 The scanner does not react sensibly to the presence of tab or formfeed characters in the source. How can this be improved?

14.8 Although the problem does not arise in the case of Clang, how do you suppose a hand-crafted scanner is written for languages like Modula-2 and Pascal that must distinguish between REAL literals of the form 3.4 and subrange specifiers of the form 3..4, where no spaces delimit the "..",


scanner simply return a token number. If we need to determine the text of a string, the name of an identifier, or the value of a numeric literal, we are obliged to write appropriately attributed productions into the phrase structure grammar. This is easily done, as will be seen by studying these productions in the grammars to be presented later.

    Exercises

14.16 Is it possible to write a Cocol specification that generates a scanner that can handle the suggestions made in Exercises 14.10 and 14.11 (allowing strings that immediately follow one another to be automatically concatenated, and allowing for escape sequences like "\n" to appear within strings to have the meanings that they do in C++)? If not, how else might such features be incorporated into Coco/R generated systems?

    14.4.3 Efficient keyword recognition

The subject of keyword recognition is important enough to warrant further comment. It is possible to write an FSA to do this directly (see section 10.5). However, in most languages, including Clang and Topsy, identifiers and keywords have the same basic format, suggesting the construction of scanners that simply extract a "word" into a string, which is then tested to see whether it is, in fact, a keyword. Since string comparisons are tedious, and since typically 50%-70% of program text consists of either identifiers or keywords, it makes sense to be able to perform this test as quickly as possible. The technique used in our hand-crafted scanner of arranging the keywords in an alphabetically ordered table and then using a binary search is only one of several ideas that strive for efficiency. At least three other methods are often advocated:

The keywords can be stored in a table in length order, and a sequential search used among those that have the same length as the word just assembled.

The keywords can be stored in alphabetic order, and a sequential search used among those that have the same initial letter as the word just assembled. This is the technique employed by Coco/R.

    A "perfect hashing function" can be derived for the keyword set, allowing for a single stringcomparison to distinguish between all identifiers and keywords.

A hashing function is one that is applied to a string so as to extract particular characters, map these onto small integer values, and return some combination of those. The function is usually kept very simple, so that no time is wasted in its computation. A perfect hash function is one chosen to be clever enough so that its application to each of the strings in a set of keywords results in a unique value for each keyword. Several such functions are known. For example, if we use an ASCII character set, then the C++ function

    int hash (char *s)
    { int L = strlen(s); return (256 * s[0] + s[L-1] + L) % 139; }

will return 40 unique values in the range 0 ... 138 when applied to the 40 strings that are the keywords of Modula-2 (Gough and Mohay, 1988). Of course, it will return some of these values for non-keywords as well (for example the keyword "VAR" maps to the value 0, as does any other three letter word starting with "V" and ending with "R"). To use this function one would first construct a 139 element string table, with the appropriate 40 elements initialized to store the keywords, and the rest to store null strings. As each potential identifier is scanned, its hash value is computed using the above formula. A single probe into the table will then ascertain whether the word just recognized is a keyword or not.
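A sketch of that single probe follows, under the assumption (ours) of a table reswords of 139 strings initialized as just described, and with string.h included:

    static char reswords[139][16];      // reswords[hash(k)] == k for each keyword k,
                                        // and "" for every unused slot

    bool iskeyword(char *s)
    // A word is a keyword if and only if it matches the entry it hashes to
    { return strcmp(reswords[hash(s)], s) == 0; }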

Considerable effort has gone into the determination of "minimal perfect hashing functions" - ones in which the number of possible values that the function can return is exactly the same as the number of keywords. These have the advantage that the lookup table can be kept small (Gough's function would require a table in which nearly 60% of the space was wasted).

    For example, when applied to the 19 keywords used for Clang, the C++ function

    int hash (char *s)
    { int L = strlen(s); return Map[s[0]] + Map[s[L-2]] + L - 2; }

will return a unique value in the range 0 ... 18 for each of them. Here the mapping is done via a 256 element array Map, which is initialized so that all values contain zero save for those shown below:

    Map['B'] = 6;  Map['D'] = 8;  Map['E'] = 5;  Map['L'] = 9;
    Map['M'] = 7;  Map['N'] = 8;  Map['O'] = 12; Map['P'] = 3;
    Map['S'] = 3;  Map['T'] = 8;  Map['W'] = 1;

Clearly this particular function cannot be applied to strings consisting of a single character, but such strings can easily be recognized as identifiers anyway. It is one of a whole class of similar functions proposed by Cichelli (1980), who also developed a backtracking search technique for determining the values of the elements of the Map array.

It must be emphasized that if a perfect hash function technique is used for constructing scanners for languages like Clang and Topsy that are in a constant state of flux as new keywords are proposed, then the hash function has to be devised afresh with each language change. This makes it an awkward technique to use for prototyping. However, for production quality compilers for well established languages, the effort spent in finding a perfect hash function can have a marked influence on the compilation time of tens of thousands of programs thereafter.

    Exercises

To assist with these exercises, a program incorporating Cichelli's algorithm, based on the one published by him in 1979, appears on the source diskette. Another well known program for the construction of perfect hash functions is known as gperf. This is written in C, and is available from various Internet sites that mirror the extensive GNU archives of software distributed by the Free Software Foundation (see Appendix A).

14.17 Develop hand-crafted scanners that make use of the alternative methods of keyword identification suggested here.

14.18 Carry out experiments to discover which method seems to be best for the reserved word lists of languages like Clang, Topsy, Pascal, Modula-2 or C++. To do so it is not necessary to develop a full scale parser for each of these languages. It will suffice to invoke the getsym routine repeatedly on a large source program until all symbols have been scanned, and to time how long this takes.
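A sketch of such a timing driver, using the hand-crafted classes of this chapter (the listing file name is arbitrary, and stdio.h and time.h are assumed to have been included), might read:

    void timescanner(char *sourcename)
    { SRCE *Source = new SRCE(sourcename, "scan.lst", "Scanner timing", false);
      REPORT *Report = new clangReport(Source);
      SCAN *Scanner = new SCAN(Source, Report);
      SCAN_symbols SYM;
      clock_t started = clock();
      do { Scanner->getsym(SYM); } while (SYM.sym != SCAN_eofsym);   // scan to end of file
      printf("%.2f seconds\n", (clock() - started) / (double) CLOCKS_PER_SEC);
    }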

    Further reading


Several texts treat lexical analysis in far more detail than we have done; justifiably, since for larger languages there are considerably more problem areas than our simple one raises. Good discussions are found in the books by Gough (1988), Aho, Sethi and Ullman (1986), Welsh and Hay (1986) and Elder (1994). Pemberton and Daniels (1982) give a very detailed discussion of the lexical analyser found in the Pascal-P compiler.

Discussion of perfect hash function techniques is the source of a steady stream of literature. Besides the papers by Cichelli (1979, 1980), the reader might like to consult those by Cormack, Horspool and Kaiserwerth (1985), Sebesta and Taylor (1985), Panti and Valenti (1992), and Trono (1995).

    14.5 Syntax analysis

For languages like Clang or Topsy, which are essentially described by LL(1) grammars, construction of a simple parser presents few problems, and follows the ideas developed in earlier chapters.

    14.5.1 A hand-crafted parser

Once again, if C++ is the host language, it is convenient to define a hand-crafted parser in terms of its own class. If all that is required is syntactic analysis, the public interface to this can be kept very simple:

    class PARSER {
      public:
        PARSER(SCAN *S, REPORT *R);
        // Initializes parser

        void parse(void);
        // Parses the source code
    };

where we note that the class constructor associates the parser instance with the appropriate instances of a scanner and error reporter. Our complete compiler will need to go further than this - an association will have to be made with at least a code generator and symbol table handler. As should be clear from Figure 14.1, in principle no direct association need be made with a source handler (in fact, our system makes such an association, but only so that the parser can direct diagnostic output to the source listing).

An implementation of this parser, devoid of any attempt at providing error recovery, constraint analysis or code generation, is provided on the source diskette. The reader who wishes to see a much larger application of the methods discussed in section 10.2 might like to study this. In this connection it should be noted that Modula-2 and Pascal allow for procedures and functions to be nested. This facility (which is lacking in C and C++) can be used to good effect when developing compilers in those languages, so as to mirror the highly embedded nature of the phrase structure grammar.

    14.5.2 A Coco/R generated parser

A parser for Clang can be generated immediately from the Cocol grammar presented in section 8.7.2. At this stage, of course, no attempt has been made to attribute the grammar to incorporate error recovery, constraint analysis, or code generation.


    Exercises

Notwithstanding the fact that the construction of a parser that does little more than check syntax is still some distance away from having a complete compiler, the reader might like to turn his or her attention to some of the following exercises, which suggest extensions to Clang or Topsy, and to construct grammars, scanners and parsers for recognizing such extensions.

14.19 Compare the hand-crafted parser found on the source diskette with the source code that is produced by Coco/R.

    14.20 Develop a hand-crafted parser for Topsy as suggested by Exercise 8.25.

14.21 Extend your parser for Clang to accept the REPEAT ... UNTIL loop as it is found in Pascal or Modula-2, or add an equivalent do loop to Topsy.

    14.22 Extend the IF ... THEN statement to provide an ELSE clause.

14.23 How would you parse a Pascal-like CASE statement? The standard Pascal CASE statement does not have an ELSE or OTHERWISE option. Suggest how this could be added to Clang, and modify the parser accordingly. Is it a good idea to use OTHERWISE or ELSE for this purpose - assuming that you already have an IF ... THEN ... ELSE construct?

14.24 What advantages does the Modula-2 CASE statement have over the Pascal version? How would you parse the Modula-2 version?

14.25 The C++ switch statement bears some resemblance to the CASE statement, although its semantics are rather different. Add the switch statement to Topsy.

    14.26 How would you add a Modula-2 or Pascal-like FOR loop to Clang?

14.27 The C++ for statement is rather different from the Pascal one, although it is often used in much the same way. Add a for statement to Topsy.

14.28 The WHILE, FOR and REPEAT loops used in Wirth's languages are structured - they have only one entry point, and only one exit point. Some languages allow a slightly less structured loop, which has only one entry point, but which allows exit from various places within the loop body. An example of this might be as follows
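A sketch of ours, in a Modula-2-like notation, of what such a loop might look like:

    LOOP
      READ(Valu);
      IF Valu = 0 THEN EXIT;
      Total := Total + Valu
    END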

Like others, LOOP statements can be nested. However, EXIT statements may only appear within LOOP sequences. Can you find context-free productions that will allow you to incorporate these statements into Clang?


14.29 If you are extending Topsy to make it resemble C++ as closely as possible, the equivalent of the EXIT statement would be found in the break or continue statements that C++ allows within its various structured statements like switch, do and while. How would you extend the grammar for Topsy to incorporate these statements? Can the restrictions on their placement be expressed in a context-free grammar?

14.30 As a more challenging exercise, suppose we wished to extend Clang or Topsy to allow for variables and expressions of other types besides integer (for example, Boolean). Various approaches might be taken, as exemplified by the following:

    (a) Replacing the Clang keyword VAR by a set of keywords used to introduce variable lists:

    int X, Y, Z[4];
    bool InTime, Finished;

(b) Retention of the VAR symbol, along with a set of standard type identifiers, used after variable lists, as in Pascal or Modula-2:

    VAR
      X, Y, Z[4] : INTEGER;
      InTime, Finished : BOOLEAN;

Develop a grammar (and parser) for an extended version of Clang or Topsy that uses one or other of these approaches. The language should allow expressions to use Boolean operators (AND, OR, NOT) and Boolean constants (TRUE and FALSE). Some suggestions were made in this regard in Exercise 13.17.

14.31 The approach used in Pascal and Modula-2 has the advantage that it extends seamlessly to the more general situations in which users may introduce their own type identifiers. In C++ one finds a hybrid: variable lists may be preceded either by special keywords or by user defined type names:

    typedef bool sieve[1000];
    int X, Y;        // introduced by keyword
    sieve Primes;    // introduced by identifier

Critically examine these alternative approaches, and list the advantages and disadvantages either seems to offer. Can you find context-free productions for Topsy that would allow for the introduction of a simple typedef construct?

A cynic might contend that if a language has features which are the cause of numerous beginners' errors, then one should redesign the language. Consider a selection of the following:

14.32 Bailes (1984) made a plea for the introduction of a "Rational Pascal". According to him, the keywords DO (in WHILE and FOR statements), THEN (in IF statements) and the semicolons which are used as terminators at the ends of declarations and as statement separators should all be discarded. (He had a few other ideas, some even more contentious.) Can you excise semicolons from Clang and Topsy, and then write a recursive descent parser for them? If, indeed, semicolons seem to serve no purpose other than to confuse learner programmers, why do you suppose language designers use them?

14.33 The problems with IF ... THEN and IF ... THEN ... ELSE statements are such that one might be tempted to try a language construct described by

    IfStatement = "IF" Condition "THEN" Statement{ "ELSIF" Condition "THEN" Statement }[ "ELSE Statement ] .


Discuss whether this statement form might easily be handled by extensions to your parser. Does it have any advantages over the standard IF ... THEN ... ELSE arrangement - in particular, does it resolve the "dangling else" problem neatly?

14.34 Extend your parser to accept structured statements on the lines of those used in Modula-2, for example

    IfStatement = "IF" Condition "THEN" StatementSequence{ "ELSIF" Condition "THEN" StatementSequence }[ "ELSE" StatementSequence ]

    "END" . WhileStatement = "WHILE" Condition "DO" StatementSequence "END" . StatementSequence = Statement { ";" Statement } .

14.35 Brinch Hansen (1983) did not approve of implicit "empty" statements. How do these appear in our languages, are they ever of practical use, and if so, in what ways would an explicit statement (like the SKIP suggested by Brinch Hansen) be any improvement?

14.36 Brinch Hansen incorporated only one form of loop into Edison - the WHILE loop - arguing that the other forms of loops were unnecessary. What particular advantages and disadvantages do these loops have from the points of view of a compiler writer and a compiler user respectively? If you were limited to only one form of loop, which would you choose, and why?

    14.6 Error handling and constraint analysis

In section 10.3 we discussed techniques for ensuring that a recursive descent parser can recover after detecting a syntax error in the source code presented to it. In this section we discuss how best to apply these techniques to our Clang compiler, and then go on to discuss how the parser can be extended to perform context-sensitive or constraint analysis.

    14.6.1 Syntax error handling in hand-crafted parsers

The scheme discussed previously - in which each parsing routine is passed a set of "follower" symbols that it can use in conjunction with its own known set of "first" symbols - is easily applied systematically to hand-crafted parsers. It suffers from a number of disadvantages, however:

It is quite expensive, since each call to a parsing routine is effectively preceded by two time-consuming operations - the dynamic construction of a set object, and the parameter passing operation itself - operations which turn out not to have been required if the source being translated is correct.

If, as often happens, seemingly superfluous symbols like semicolons are omitted from the source text, the resynchronization process can be overly severe.

Thus the scheme is usually adapted somewhat, often in the light of experience gained by observing typical user errors. A study of the source code for such parsers on the source diskette will reveal examples of the following useful variations on the basic scheme:

    In those many places where "weak" separators are found in constructs involving iterations,such as

    VarDeclarations = "VAR" OneVar{ "," OneVar} ";"

  • 8/8/2019 Simple Compiler

    15/26

    CompoundStatement = "BEGIN" Statement { ";" Statement } "END" . Term = Factor { MulOp Factor} .

the iteration is started as long as the parser detects the presence of the weak separator or a valid symbol that would follow it in that context (of course, appropriate errors are reported if the separator has been omitted). This has the effect of "inserting" such missing separators into the stream of symbols being parsed, and proves to be a highly effective enhancement to the basic technique (a sketch of this idea appears after this list).

Places where likely errors are expected - such as confusion between the ":=" and "=" operators, or attempting to provide an integer expression rather than a Boolean comparison expression in an IfStatement or WhileStatement - are handled in an ad hoc way.

Many sub-parsers do not need to make use of the prologue and epilogue calls to the test routine. In particular, there is no need to do this in routines like those for IfStatement, WhileStatement and so on, which have been introduced mainly to enhance the modularity of Statement.

The Modula-2 and Pascal implementations nest their parsing routines as tightly as possible. Not only does this match the embedded nature of the grammar very nicely, it also reduces the number of parameters that have to be passed around.
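A sketch of the weak separator idea, for the VarDeclarations production quoted above, might read as follows. It uses accept, test, GetSym and symset in the manner of the Factor routine shown in section 14.6.3, but the inFirstOneVar predicate and the error codes are our own inventions, and the routine is not taken from the source diskette:

    void VarDeclarations(symset followers)
    // VarDeclarations = "VAR" OneVar { "," OneVar } ";" .
    { accept(SCAN_varsym, 36);                        // "VAR" expected
      OneVar(symset(SCAN_comma) + symset(SCAN_semicolon) + followers);
      while (SYM.sym == SCAN_comma || inFirstOneVar(SYM.sym))
      { if (SYM.sym == SCAN_comma) GetSym();          // separator present
        else Report->error(31);                       // missing "," effectively inserted
        OneVar(symset(SCAN_comma) + symset(SCAN_semicolon) + followers);
      }
      accept(SCAN_semicolon, 2);                      // ";" expected
    }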

Because of the inherent cost in the follower-set based approach to error recovery, some compiler writers make use of simpler schemes that can achieve very nearly the same degree of success at far less cost. One of these, suggested by Wirth (1986, 1996), is based on the observation that the symbols passed as members of follower sets to high level parsing routines - such as Block - effectively become members of every follower set parameter computed thereafter. When one finally gets to parse a Statement, for example, the set of stopping symbols used to establish synchronization at the start of the Statement routine is the union of FIRST(Statement) + FOLLOW(Program) + FOLLOW(Block), while the set of stopping symbols used to establish synchronization at the end of Statement is the union of FOLLOW(Statement) + FOLLOW(Program) + FOLLOW(Block). Furthermore, if we treat the semicolon that separates statements as a "weak" separator, as previously discussed, no great harm is done if the set used to establish synchronization at the end of Statement also includes the elements of FIRST(Statement).

Careful consideration of the SCAN_symtypes enumeration introduced in section 14.4.1 will reveal that the values have been ordered so that the following patterns hold to a high degree of accuracy:

    SCAN_unknown .. SCAN_lbracket       Miscellaneous
    SCAN_times, SCAN_slash              FOLLOW(Factor)
    SCAN_plus, SCAN_minus               FOLLOW(Term)
    SCAN_eqlsym .. SCAN_geqsym          FOLLOW(Expression1) in Condition
    SCAN_thensym, SCAN_dosym            FOLLOW(Condition)
    SCAN_rbracket .. SCAN_comma         FOLLOW(Expression)
    SCAN_lparen .. SCAN_identifier      FIRST(Factor)
    SCAN_coendsym, SCAN_endsym          FOLLOW(Statement)
    SCAN_ifsym .. SCAN_signalsym        FIRST(Statement)
    SCAN_semicolon                      FOLLOW(Block)
    SCAN_beginsym .. SCAN_funcsym       FIRST(Block)
    SCAN_period                         FOLLOW(Program)
    SCAN_progsym                        FIRST(Program)
    SCAN_eofsym

The argument now goes that, with this carefully ordered enumeration, virtually all of the tests of the form

    Sym ∈ SynchronizationSet


    can be accurately replaced by tests of the form

    Sym >= SmallestElement(SynchronizationSet)

and that synchronization at crucial points in the grammar can be achieved by using a routine developed on the lines of

    void synchronize(SCAN_symtypes SmallestElement, int errorcode)
    { if (SYM.sym >= SmallestElement) return;
      reporterror(errorcode);
      do { getsym(); } while (SYM.sym < SmallestElement);
    }

    The way in which this idea could be used is exemplified in a routine for parsing Clang statements.

    void Statement(void)
    // Statement = [ CompoundStatement | Assignment | IfStatement
    //               | WhileStatement | WriteStatement | ReadStatement ] .
    { synchronize(SCAN_identifier, 15);
      // We shall return correctly if SYM.sym is a semicolon or END (empty statement)
      // or if we have synchronized (prematurely) on a symbol that really follows
      // a Block
      switch (SYM.sym)
      { case SCAN_identifier: Assignment(); break;
        case SCAN_ifsym:      IfStatement(); break;
        case SCAN_whilesym:   WhileStatement(); break;
        case SCAN_writesym:   WriteStatement(); break;
        case SCAN_readsym:    ReadStatement(); break;
        case SCAN_beginsym:   CompoundStatement(); break;
        default:              return;
      }
      synchronize(SCAN_endsym, 32);
      // In some situations we shall have synchronized on a symbol that can start
      // a further Statement, but this should be handled correctly from the call
      // made to Statement from within CompoundStatement
    }

It turns out to be necessary to replace some other set inclusion tests by providing predicate functions exemplified by

    bool inFirstStatement(SCAN_symtypes Sym)
    // Returns true if Sym can start a Statement
    { return (Sym == SCAN_identifier || Sym == SCAN_beginsym ||
              Sym >= SCAN_ifsym && Sym <= SCAN_signalsym);
    }


the use of WEAK can actually deteriorate, since the union of all the SYNC symbol sets tends to become the entire universe. Below we show a modification of the grammar in section 8.7.2 that has been found to work quite well, and draw attention to two places (in the productions for Condition and Term) where an explicit call to the error reporting interface has been used to handle situations where one wishes to be lenient in the treatment of missing symbols.

    PRODUCTIONS /* some omitted to save space */
      Clang             = "PROGRAM" identifier WEAK ";" Block "." .
      Block             = SYNC { ( ConstDeclarations | VarDeclarations ) SYNC }
                          CompoundStatement .
      OneConst          = identifier WEAK "=" number ";" .
      VarDeclarations   = "VAR" OneVar { WEAK "," OneVar } ";" .
      CompoundStatement = "BEGIN" Statement { WEAK ";" Statement } "END" .
      Statement         = SYNC [ CompoundStatement | Assignment
                                 | IfStatement | WhileStatement
                                 | ReadStatement | WriteStatement ] .
      Assignment        = Variable ":=" Expression SYNC .
      Condition         = Expression ( RelOp Expression | (. SynError(91); .) ) .
      ReadStatement     = "READ" "(" Variable { WEAK "," Variable } ")" .
      WriteStatement    = "WRITE"
                          [ "(" WriteElement { WEAK "," WriteElement } ")" ] .
      Term              = Factor { ( MulOp | (. SynError(92); .) ) Factor } .

    14.6.3 Constraint analysis and static semantic error handling

We have already had cause to remark that the boundary between syntactic and semantic errors can be rather vague, and that there are features of real computer languages that cannot be readily described by context-free grammars. To retain the advantages of simple one-pass compilation when we include semantic analysis, and start to attach meaning to our identifiers, usually requires that the "declaration" parts of a program come before the "statement" parts. This is easily enforced by a context-free grammar, all very familiar to a Modula-2, Pascal or C programmer, and seems quite natural after a while. But it is only part of the story. Even if we insist that declarations precede statements, a context-free grammar is still unable to specify that only those identifiers which have appeared in the declarations (so-called defining occurrences) may appear in the statements (in so-called applied occurrences). Nor is a context-free grammar powerful enough to specify such constraints as insisting that only a variable identifier can be used to denote the target of an assignment statement, or that a complete array cannot be assigned to a scalar variable. We might be tempted to write productions that seem to capture these constraints:

    Clang = "PROGRAM" ProgIdentifier";" Block "." . Block = { ConstDeclarations | VarDeclarations } CompoundStatement . ConstDeclarations = "CONST" OneConst { OneConst } . OneConst = ConstIdentifier"=" number";" . VarDeclarations = "VAR" OneVar{ "," OneVar} ";" . OneVar = ScalarVarIdentifier| ArrayVarIdentifier UpperBound. UpperBound = "[" number"]" . Assignment = Variable ":=" Expression . Variable = ScalarVarIdentifier| ArrayVarIdentifier"[" Expression "]" . ReadStatement = "READ" "(" Variable { "," Variable } ")" . Expression = ( "+" Term | "-" Term | Term ) { AddOp Term } . Term = Factor{ MulOp Factor} . Factor = ConstIdentifier| Variable | number

    | "(" Expression ")" .

This would not really get us very far, since all identifiers are lexically equivalent! We could attempt to use a context-sensitive grammar to overcome such problems, but that turns out to be unnecessarily complicated, for they are easily solved by leaving the grammar as it was, adding attributes in the form of context conditions, and using a symbol table.

Demanding that identifiers be declared in such a way that their static semantic attributes can be recorded in a symbol table, whence they can be retrieved at any future stage of the analysis, is not nearly as tedious as users might at first imagine. It is clearly a semantic activity, made easier by a syntactic association with keywords like CONST, VAR and PROGRAM.


Setting up a symbol table may be done in many ways. If one is interested merely in performing the sort of constraint analysis suggested earlier for a language as simple as Clang, we may begin by noting that identifiers designate objects that are restricted to one of three simple varieties - namely constant, variable and program. The only apparent complication is that, unlike the other two, a variable identifier can denote either a simple scalar, or a simple linear array. A simple table handler can then be developed with a class having a public interface like the following:

    const int TABLE_alfalength = 15;    // maximum length of identifiers
    typedef char TABLE_alfa[TABLE_alfalength + 1];

    enum TABLE_idclasses { TABLE_consts, TABLE_vars, TABLE_progs };

    struct TABLE_entries {
      TABLE_alfa name;                  // identifier
      TABLE_idclasses idclass;          // class
      bool scalar;                      // distinguish arrays from scalars
    };

    class TABLE {
      public:
        TABLE(REPORT *R);
        // Initializes symbol table

        void enter(TABLE_entries &entry);
        // Adds entry to symbol table

        void search(char *name, TABLE_entries &entry, bool &found);
        // Searches table for presence of name. If found then returns entry

        void printtable(FILE *lst);
        // Prints symbol table for diagnostic purposes
    };

An augmented parser must construct appropriate entry structures and enter these into the table when identifiers are first recognized by the routine that handles the productions for OneConst and OneVar. Before identifiers are accepted by the routines that handle the production for Designator they are checked against this table for non-declaration of name, abuse of idclass (such as trying to assign to a constant) and abuse of scalar (such as trying to subscript a constant, or a scalar variable). The interface suggested here is still inadequate for the purposes of code generation, so we delay further discussion of the symbol table itself until the next section.

However, we may take the opportunity of pointing out that the way in which the symbol table facilities are used depends rather critically on whether the parser is being crafted by hand, or by using a parser generator. We draw the reader's attention to the production for Factor, which we have written

    Factor = Designator | number | "(" Expression ")" .

    rather than the more descriptive

    Factor = ConstIdentifier | Variable | number | "(" Expression ")" .

which does not satisfy the LL(1) constraints. In a hand-crafted parser we are free to use semantic information to drive the parsing process, and to break this LL(1) conflict, as the following extract from such a parser will show.

    void Factor(symset followers)
    // Factor = Variable | ConstIdentifier | Number | "(" Expression ")" .
    // Variable = Designator .
    { TABLE_entries entry;
      bool found;
      test(FirstFactor, followers, 14);                  // Synchronize
      switch (SYM.sym)
      { case SCAN_identifier:
          Table->search(SYM.name, entry, found);         // Look it up
          if (!found) Report->error(202);                // Undeclared identifier
          if (entry.idclass == TABLE_consts) GetSym();   // ConstIdentifier
          else Designator(entry, followers, 206);        // Variable
          break;
        case SCAN_number:
          GetSym(); break;
        case SCAN_lparen:
          GetSym(); Expression(symset(SCAN_rparen) + followers);
          accept(SCAN_rparen, 17); break;
        default:                                         // Synchronized on a
          Report->error(14); break;                      // follower instead
      }
    }

In a Coco/R generated parser some other way must be found to handle the conflict. The generated parser will always set up a call to parse a Designator, and so the distinction must be drawn at that stage. The following extracts from an attributed grammar show one possible way of doing this.

    Factor =                   (. int value; TABLE_entries entry; .)
         Designator
       | Number
       | "(" Expression ")" .

Notice that the Designator routine is passed the set of idclasses that are acceptable in this context. The production for Designator needs to check quite a number of context conditions:

    Designator =               (. TABLE_alfa name;
                                  bool isvariable, found; .)
       Ident                   (. Table->search(name, entry, found);
                                  if (!found) SemError(202);
                                  if (!allowed.memb(entry.idclass)) SemError(206);
                                  isvariable = entry.idclass == TABLE_vars; .)
       (   "["                 (. if (!isvariable || entry.scalar) SemError(204); .)
           Expression "]"
         |                     (. if (isvariable && !entry.scalar) SemError(205); .)
       ) .

Other variations on this theme are possible. One of these is interesting in that it effectively uses semantic information to drive the parser, returning prematurely if it appears that a subscript should not be allowed:

    Designator =               (. TABLE_alfa name;
                                  bool found; .)
       Ident                   (. Table->search(name, entry, found);
                                  if (!found) SemError(202);
                                  if (!allowed.memb(entry.idclass)) SemError(206);
                                  if (entry.idclass != TABLE_vars) return; .)
       (   "["                 (. if (entry.scalar) SemError(204); .)
           Expression "]"
         |                     (. if (!entry.scalar) SemError(205); .)
       ) .

As an example of how these ideas combine in the reporting of incorrect programs we present a source listing produced by the hand-crafted parser found on the source diskette:

     1 : PROGRAM Debug
     2 : CONST
     2 :      ^; expected
     3 :   TooBigANumber = 328000;
     3 :                         ^Constant out of range
     4 :   Zero := 0;
     4 :         ^:= in wrong context
     5 : VAR
     6 :   Valu, Smallest, Largest, Total;
     7 : CONST
     8 :   Min = Zero;
     8 :             ^Number expected
     9 : BEGIN
    10 :   Total := Zero;
    11 :   IF Valu THEN;
    11 :               ^Relational operator expected
    12 :   READ (Valu); IF Valu > Min DO WRITE(Valu);
    12 :                                 ^THEN expected
    13 :   Largest := Valu; Smallest = Valu;
    13 :                              ^:= expected
    14 :   WHILE Valu Zero DO
    15 :     BEGIN
    16 :       Total := Total + Valu
    17 :       IF Valu > = Largest THEN Largest := Value;
    17 :       ^; expected
    17 :                 ^Invalid factor
    17 :                                            ^Undeclared identifier
    18 :       IF Valu < Smallest THEN Smallest := Valu;
    19 :       READLN(Valu); IF Valu > Zero THEN WRITE(Valu)
    19 :             ^Undeclared identifier
    20 :     END;
    21 :   WRITE('TOTAL:', Total, 'LARGEST:', Largest);
    22 :   WRITE('SMALLEST: , Smallest)
    22 :                               ^Incomplete string
    23 : END.
    23 :     ^) expected

    Exercises

14.37 Submit the incorrect program given above to a Coco/R generated parser, and compare the quality of error reporting and recovery with that achieved by the hand-crafted parser.

14.38 At present the error messages for the hand-crafted system are reported one symbol after the point where the error was detected. Can you find a way of improving on this?

14.39 A disadvantage of the error recovery scheme used here is that a user may not realize which symbols have been skipped. Can you find a way to mark some or all of the symbols skipped by test? Has test been used in the best possible way to facilitate error recovery?

14.40 If, as we have done, all error messages after the first at a given point are suppressed, one might occasionally find that the quality of error message deteriorates - "early" messages might be less apposite than "later" messages might have been. Can you implement a better method than the one we have? (Notice that the Followers parameter passed to a sub-parser for S includes not only the genuine FOLLOW(S) symbols, but also further Beacons.)

14.41 If you study the code for the hand-crafted parser carefully you will realize that Identifier effectively appears in all the Followers sets. Is this a good idea? If not, what alterations are needed?

14.42 Although not strictly illegal, the appearance of a semicolon in a program immediately following a DO or THEN, or immediately preceding an END, may be symptomatic of omitted code. Is it possible to warn the user when this has occurred, and if so, how?

14.43 The error reporter makes no real distinction between context-free and semantic or context-sensitive errors. Do you suppose it would be an improvement to try to do this, and if so, how could it be done?

14.44 Why does this parser not allow you to assign one array completely to another array? What modifications would you have to make to the context-free grammar to permit this? How would the constraint analysis have to be altered?

14.45 In Topsy - at least as it is used in the example program of Exercise 8.25 - all "declarations" seem to precede "statements". In C++ it is possible to declare variables at the point where they are first needed. How would you define Topsy to support the mingling of declarations and statements?


14.46 One school of thought maintains that in a statement like a Modula-2 FOR loop, the control variable should be implicitly declared at the start of the loop, so that it is truly local to the loop. It should also not be possible to alter the value of the control variable within the loop. Can you extend your parser and symbol table handler to support these ideas?

14.47 Exercises 14.21 through 14.36 suggested many syntactic extensions to Clang or Topsy. Extend your parsers so that they incorporate error recovery and constraint analysis for all these extensions.

14.48 Experiment with error recovery mechanisms that depend on the ordering of the SCAN_symtypes enumeration, as discussed in section 14.6.1. Can you find an ordering that works adequately for Topsy?

    14.7 The symbol table handler

In an earlier section we claimed that it would be advantageous to split our compiler into distinct phases for syntax/constraint analysis and code generation. One good reason for doing this is to isolate the machine dependent part of compilation as far as possible from the language analysis. The degree to which we have succeeded may be measured by the fact that we have not yet made any mention of what sort of object code we are trying to generate.

Of course, any interface between source and object code must take cognizance of data-related concepts like storage, addresses and data representation, as well as control-related ones like location counter, sequential execution and branch instruction, which are fundamental to nearly all machines on which programs in our imperative high-level languages execute. Typically, machines allow some operations which simulate arithmetic or logical operations on data bit patterns which simulate numbers or characters, these patterns being stored in an array-like structure of memory, whose elements are distinguished by addresses. In high-level languages these addresses are usually given mnemonic names. The context-free syntax of many high-level languages, as it happens, rarely seems to draw a distinction between the "address" for a variable and the "value" associated with that variable, and stored at its address. Hence we find statements like

    X := X + 4

in which the X on the left of the := operator actually represents an address (sometimes called the L-value of X), while the X on the right (sometimes called the R-value of X) actually represents the value of the quantity currently residing at the same address. Small wonder that mathematically trained beginners sometimes find the assignment notation strange! After a while it usually becomes second nature - by which time notations in which the distinction is made clearer possibly only confuse still further, as witness the problems beginners often have with pointer types in C++ or Modula-2, where *P or P^ (respectively) denote the explicit value residing at the explicit address P. If we relate this back to the productions used in our grammar, we would find that each X in the above assignment was syntactically a Designator. Semantically these two designators are very different - we shall refer to the one that represents an address as a Variable Designator, and to the one that represents a value as a Value Designator.

    To perform its task, the code generation interface will require the extraction of further information associated with user-defined identifiers and best kept in the symbol table. In the case of constants we need to record the associated values, and in the case of variables we need to record the associated addresses and storage demands (the elements of array variables will occupy a contiguous block of memory). If we can assume that our machine incorporates a "linear array" model of memory, this information is easily added as the variables are declared.

    Handling the different sorts of entries that need to be stored in a symbol table can be done in various ways. In an object-oriented class-based implementation one might define an abstract base class to represent a generic type of entry, and then derive classes from this to represent entries for variables or constants (and, in due course, records, procedures, classes and any other forms of entry that seem to be required). The traditional way, still required if one is hosting a compiler in a language that does not support inheritance as a concept, is to make use of a variant record (in Modula-2 terminology) or union (in C++ terminology). Since the class-based implementation gives so much scope for exercises, we have chosen to illustrate the variant record approach, which is very efficient, and quite adequate for such a simple language. We extend the declaration of the TABLE_entries type to be

    struct TABLE_entries {
      TABLE_alfa name;              // identifier
      TABLE_idclasses idclass;      // class
      union {
        struct {
          int value;
        } c;                        // constants
        struct {
          int size, offset;         // number of words, relative address
          bool scalar;              // distinguish arrays
        } v;                        // variables
      };
    };
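
    By way of illustration, entries for declarations like CONST Max = 100; and VAR List[10]; might be filled in as sketched below. The stand-in declarations of TABLE_alfa (here 16 characters) and TABLE_idclasses are assumptions made purely so that the fragment is self-contained.

    #include <cstring>

    // Stand-ins for declarations assumed to appear elsewhere in the compiler sources
    typedef char TABLE_alfa[16];
    enum TABLE_idclasses { TABLE_consts, TABLE_vars, TABLE_progs };

    struct TABLE_entries {
      TABLE_alfa name;
      TABLE_idclasses idclass;
      union {
        struct { int value; } c;                      // constants
        struct { int size, offset; bool scalar; } v;  // variables
      };
    };

    int main() {
      TABLE_entries constant, variable;

      std::strcpy(constant.name, "Max");   // as if from  CONST Max = 100;
      constant.idclass = TABLE_consts;
      constant.c.value = 100;

      std::strcpy(variable.name, "List");  // as if from  VAR List[10];
      variable.idclass = TABLE_vars;
      variable.v.scalar = false;           // an array, not a scalar
      variable.v.size   = 11;              // indices 0 .. 10 occupy 11 words
      variable.v.offset = 1;               // first variable declared in this block
      return 0;
    }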

    The way in which the symbol table is constructed can be illustrated with reference to the relevant parts of a Cocol specification for handling OneConst and OneVar:

    OneConst
    =                             (. TABLE_entries entry; .)
       Ident<entry.name>          (. entry.idclass = TABLE_consts; .)
       WEAK "="
       Number<entry.c.value> ";"  (. Table->enter(entry); .) .

    OneVar
    =                             (. TABLE_entries entry;
                                     entry.idclass = TABLE_vars;
                                     entry.v.size = 1; entry.v.scalar = true;
                                     entry.v.offset = framesize + 1; .)
       Ident<entry.name>
       [ UpperBound<entry.v.size> (. entry.v.scalar = false; .)
       ]                          (. Table->enter(entry);
                                     framesize += entry.v.size; .) .

    UpperBound<int &size>
    =  "[" Number<size> "]"       (. size++; .) .

    Ident<char *name>
    =  identifier                 (. LexName(name, TABLE_alfalength); .) .

    Here framesize is a simple count, which is initialized to zero at the start of parsing a Block. It keeps track of the number of variables declared, and also serves to define the addresses which these variables will have relative to some known location in memory when the program runs. A trivial modification gets around the problem if it is impossible or inconvenient to use zero-based addresses in the real machine.
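
    As a purely hypothetical trace of these actions, consider the declaration VAR X, Y, List[10]; and recall that in Clang the bracketed number is the highest permitted index, so that List occupies eleven words. The snippet below merely replays the arithmetic; it is not part of the compiler.

    #include <iostream>

    int main() {
      int framesize = 0;                            // reset at the start of each Block
      struct Decl { const char *name; int size; };
      Decl decls[] = { {"X", 1}, {"Y", 1}, {"List", 10 + 1} };

      for (const Decl &d : decls) {
        int offset = framesize + 1;                 // as in the OneVar actions above
        framesize += d.size;
        std::cout << d.name << ": size " << d.size << ", offset " << offset
                  << ", framesize now " << framesize << "\n";
      }
      return 0;                                     // X at 1, Y at 2, List at 3 .. 13
    }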

    Programming a symbol table handler for a language as simple as ours can be correspondingly simple. On the source diskette can be found such implementations, based on the idea that the symbol table can be stored within a fixed length array. A few comments on implementation techniques will guide the reader who wishes to study this code:

    The table is set up so that the entry indexed by zero can be used as a sentinel in the simple sequential search performed by search. Although this is inefficient, it is adequate for prototyping the system.

    A call to Table->search(name, entry, found) will always return with a well defined value for entry, even if the name had never been declared. Such undeclared identifiers will seem to have an effective idclass = TABLE_progs, which will be semantically unacceptable everywhere, thus ensuring that incorrect code can never be generated.
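
    A minimal sketch of such a table follows. The public interface (enter and search) matches the description above, but the class body, the table bound of 100 and the lastentry counter are illustrative assumptions; the sketch also presumes that the TABLE_entries declaration given earlier is in scope.

    #include <cstring>

    class TABLE {
      public:
        TABLE() : lastentry(0)                    // entry 0 is reserved as the sentinel
        { table[0].idclass = TABLE_progs; }       // an idclass acceptable nowhere

        void enter(TABLE_entries &entry)          // no overflow check in this sketch
        { table[++lastentry] = entry; }

        void search(char *name, TABLE_entries &entry, bool &found) {
          std::strcpy(table[0].name, name);       // the sentinel guarantees a match
          int i = lastentry;
          while (std::strcmp(table[i].name, name) != 0) i--;
          entry = table[i];
          found = (i != 0);                       // i == 0 : only the sentinel matched
        }

      private:
        TABLE_entries table[100];
        int lastentry;
    };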

    Exercises

    14.49 How would you check that no identifier is declared more than once?

    14.50 Identifiers that are undeclared by virtue of mistyped declarations tend to be annoying, for they result in many subsequent errors being reported. Perhaps in languages as simple as ours one could assume that all undeclared identifiers should be treated as variables, and entered as such in the symbol table at the point of first reference. Is this a good idea? Can it easily be implemented? What happens if arrays are undeclared?

    14.51 Careful readers may have noticed that a Clang array declaration is different from a C++ one - the bracketed number in Clang specifies the highest permitted index value, rather than the array length. This has been done so that one can declare variables like

    VAR Scalar, List[10], VeryShortList[0];

    How would you modify Clang and Topsy to use C++ semantics, where the declaration of VeryShortList would have to be forbidden?

    14.52 The names of identifiers are held within the symbol table as fixed length strings, truncated if necessary. It may seem unreasonable to expect compilers (especially those written in Modula-2 or Pascal, which do not have dynamic strings as standard types) to cater for identifiers of any length, but too small a limitation on length is bound to prove irksome sooner or later, and too generous a limitation simply wastes valuable space when, as so often happens, users choose very short names. Develop a variation on the symbol table handler that allocates the name fields dynamically, to be of the correct size. (This can, of course, also be done in Modula-2.) Making table entries should be quite simple; searching for them may call for a little more ingenuity.

    14.53 A simple sequential search algorithm is probably perfectly adequate for the small Clang programs that one is likely to write. It becomes highly inefficient for large applications. It is far more efficient to store the table in the form of a binary search tree, of the sort that you may have encountered in other courses in Computer Science. Develop such an implementation, noting that it should not be necessary to alter the public interface to the table class.

    14.54 Yet another approach is to construct the symbol table using a hash table, which probably yields the shortest times for retrievals. Hash tables were briefly discussed in Chapter 7, and should also be familiar from other courses you may have taken in Computer Science. Develop a hash table implementation for your Clang or Topsy compiler.

    14.55 We might consider letting the scanner interact with the symbol table. Consider the implications of developing a scanner that stores the strings for identifiers and string literals in a string table, as suggested in Exercise 6.6 for the assemblers of Chapter 6.

    14.56 Develop a symbol table handler that utilizes a simple class hierarchy for the possible types of entries, inheriting appropriately from a suitable base class. Once again, construction of such a table should prove to be straightforward, regardless of whether you use a linear array, tree, or hash table as the underlying storage structure. Retrieval might call for more ingenuity, since C++ does not provide syntactic support for determining the exact class of an object that has been statically declared to be of a base class type.

    14.8 Other aspects of symbol table management - further types

    It will probably not have escaped the reader's attention, especially if he or she has attempted the exercises in the last few sections, that compilers for languages which handle a wide variety of types, both "standard" and "user defined", must surely take a far more sophisticated approach to constructing a symbol table and to keeping track of storage requirements than anything we have seen so far. Although the nature of this text does not warrant a full discussion of this point, a few comments may be of interest, and in order.

    In the first place, a compiler for a block-structured language will probably organize its symbol table as a collection of dynamically allocated trees, with one root for each level of nesting. Although using simple binary trees runs the risk of producing badly unbalanced trees, this is unlikely. Except for source programs which are produced by program generators, user programs tend to introduce identifiers with fairly random names; few compilers are likely to need really sophisticated tree-constructing algorithms.
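
    By way of illustration only (none of these names is taken from the Clang or Topsy sources), such an organization might be sketched as follows, with one tree of entries per nesting level and a lookup that falls back through the enclosing scopes.

    #include <cstring>

    struct TypeDesc;                   // type descriptors, sketched further below

    struct Node {                      // one identifier entry
      const char *name;
      TypeDesc   *type;                // description of the identifier's type
      Node       *left, *right;        // binary search tree links
    };

    struct Scope {                     // one lexical nesting level
      Node  *root;                     // root of this level's tree of entries
      Scope *down;                     // the enclosing scope, if any
    };

    // Search the innermost scope first, then fall back to the enclosing ones.
    Node *find(Scope *scope, const char *name) {
      for ( ; scope != nullptr; scope = scope->down) {
        Node *n = scope->root;
        while (n != nullptr) {
          int cmp = std::strcmp(name, n->name);
          if (cmp == 0) return n;
          n = (cmp < 0) ? n->left : n->right;
        }
      }
      return nullptr;                  // undeclared identifier
    }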

    Secondly, the nodes in the trees will be fairly complex record structures. Besides the obvious links to other nodes in the tree, there will probably be pointers to other dynamically constructed nodes, which contain descriptions of the types of the identifiers held in the main tree.

    Thus a Pascal declaration like

    VAR
      Matrix : ARRAY [1 .. 10, 2 .. 20] OF SET OF CHAR;

    might result in a structure that can be depicted something like that of Figure 14.2.
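
    A hypothetical sketch of the descriptors involved, fleshing out the TypeDesc hinted at in the previous sketch, might run along the following lines (again, the names are ours, chosen only to suggest the kind of linked structure that Figure 14.2 depicts).

    enum TypeKind { Scalars, Arrays, Sets };

    struct TypeDesc {
      TypeKind  kind;
      int       low, high;             // index bounds (meaningful for Arrays only)
      TypeDesc *component;             // element type for Arrays, base type for Sets
    };

    int main() {
      // Matrix : ARRAY [1 .. 10] OF ARRAY [2 .. 20] OF SET OF CHAR
      TypeDesc charType = { Scalars, 0,  0, nullptr   };
      TypeDesc setType  = { Sets,    0,  0, &charType };
      TypeDesc innerDim = { Arrays,  2, 20, &setType  };
      TypeDesc outerDim = { Arrays,  1, 10, &innerDim };
      (void)outerDim;                  // the entry for Matrix would point at this node
      return 0;
    }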

    We may take this opportunity to comment on a rather poorly defined area in Pascal, one that came in for much criticism. Suppose we were to declare

    TYPE
      LISTS = ARRAY [1 .. 10] OF CHAR;
    VAR
      X : LISTS;
      A : ARRAY [1 .. 10] OF CHAR;
      B : ARRAY [1 .. 10] OF CHAR;
      Z : LISTS;

    A and B are said to be of anonymous type, but to most people it would seem obvious that A and B are of the same type, implying that an assignment of the form A := B should be quite legal, and, furthermore, that X and Z would be of the same type as A and B. However, some compilers will be satisfied with mere structural equivalence of types before such an assignment would be permitted, while others will insist on so-called name equivalence. The original Pascal Report did not specify which was standard.

    In this example A, B, X and Z all have structural equivalence. X and Z have name equivalence as well, as they have been specified in terms of a named type LISTS.

    With the insight we now have we can see what this difference means from the compiler writer's viewpoint. Suppose A and B have entries in the symbol table pointed to by ToA and ToB respectively.

    Then for name equivalence we should insist on ToA^.Type and ToB^.Type being the same (that is, their Type pointers address the same descriptor), while for structural equivalence we should insist on ToA^.Type^ and ToB^.Type^ being the same (that is, their Type pointers address descriptors that have the same structure).
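
    In terms of the illustrative TypeDesc nodes sketched earlier (and assuming, for simplicity, that the descriptors form an acyclic structure), the two tests might be coded along the following lines.

    // Sketch only: a and b stand for the Type fields of the entries for A and B.
    bool nameEquivalent(const TypeDesc *a, const TypeDesc *b) {
      return a == b;                             // the entries share one descriptor
    }

    bool structurallyEquivalent(const TypeDesc *a, const TypeDesc *b) {
      if (a == b) return true;                   // covers shared and both-null cases
      if (a == nullptr || b == nullptr) return false;
      return a->kind == b->kind
          && a->low  == b->low
          && a->high == b->high
          && structurallyEquivalent(a->component, b->component);
    }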

    Further reading and exploration

    We have just touched on the tip of a large and difficult iceberg. If one adds the concept of types and type constructors into a language, and insists on strict type-checking, the compilers become much larger and harder to follow than we have seen up till now. The energetic reader might like to follow up several of the ideas which should now come to mind. Try a selection of the following, which are deliberately rather vaguely phrased.

    14.57 Just how do real compilers deal with symbol tables?

    14.58 Just how do real compilers keep track of type checking? Why should name equivalence be easier to handle than structural equivalence?

    14.59 Why do some languages simply forbid the use of "anonymous types", and why don't more languages forbid them?

    14.60 How do you suppose compilers keep track of storage allocation for struct or RECORD types, and for union or variant record types?

    14.61 Find out how storage is managed for dynamically allocated variables in languages like C++, Pascal, or Modula-2.

    14.62 How does one cope with arrays of variable (dynamic) length in subprograms?

    14.63 Why can we easily allow the declaration of a pointer type to precede the definition of the type it points to, even in a one-pass system? For example, in Modula-2 we may write

    TYPE
      LINKS = POINTER TO NODES (* NODES not yet seen *);
      NODES = RECORD
                ETC  : JUNK;
                Link : LINKS;
                . . .

    14.64 Brinch Hansen did not like the Pascal subrange type because it seems to lead to ambiguities (for example, a value of 34 can be of type 0 .. 45, and also of type 30 .. 90 and so on), and so omitted them from Edison. Interestingly, a later Wirth language, Oberon, omits them as well. How might Pascal and Modula-2 otherwise have introduced the subrange concept, how could we overcome Brinch Hansen's objections, and what is the essential point that he seems to have overlooked in discarding them?

    14.65 One might accuse the designers of Pascal, Modula-2 and C of making a serious error of judgement - they do not introduce a string type as standard, but rely on programmers to manipulate arrays of characters, and to use error prone ways of recognizing the end, or the length, of a string. Do you agree? Discuss whether what they offer in return is adequate, and if not, why not. Suggest why they might deliberately not have introduced a string type.

    14.66 Brinch Hansen did not like the Pascal variant record (or union). What do such types allow one to do in Pascal which is otherwise impossible, and why should it be necessary to provide the facility to do this? How else are these facilities catered for in Modula-2, C and C++? Which is the better way, and why? Do the ideas of type extension, as found in Oberon, C++ and other "object-oriented" languages provide even better alternatives?

    14.67 Many authors dislike pointer types because they allow "insecure" programming. What is meant by this? How could the security be improved? If you do not like pointer types, can you think of any alternative feature that would be more secure?

    There is quite a lot of material available on these subjects in many of the references cited previously. Rather than give explicit references, we leave the Joys of Discovery to the reader.