
Comput. Lang. Vol. 18, No. 4, pp. 251-271, 1993 0096-0551/93 $5.00+0.00 Printed in Great Britain. All rights reserved Copyright © 1992 Pergamon Press Ltd

FACET GRAMMARS: TOWARDS STATIC SEMANTIC ANALYSIS BY CONTEXT-FREE PARSING

PAUL A. BAILES and TREVOR CHORVAT Language Design Laboratory, Key Centre for Software Technology, Department of Computer Science,

University of Queensland, QLD 4072, Australia

(Received 3 July 1991; revision received 8 April 1992)

Abstract--Use of executable declarative metalanguages has simplified programming language syntax specification and implementation, whereas existing formalisms for static semantics are still relatively procedural. A working hypothesis is that the context-sensitivity of languages (under static semantic rules) is derived in significant part from the interleaved presence therein of sentences in implicitly-defined and effectively invisible context-free languages. Procedures by which these sentences, and context-free grammars for their languages, can respectively be derived from the original sentence and from the combination of the original language's grammar and semantic rules, lead to the possibility of automatic generation of static semantic analysers from the purely context-free specifications of "Facet Grammars" (FG).

We show that the utility of FG for static semantic analysis has a non-trivial lower bound, by specifying the relatively complicated identifier scope and accessibility rules for Dijkstra's Guarded Commands Language.

Automata Context-free grammar Context-free language Semantics

1. INTRODUCTION

Increased reliability for language processors doubly in turn increases the reliability of applications software products developed using them and flexibility in experimentation in language design and implementation. Consider, e.g. our overall confidence in a compiler whose syntax analyser is derived, automatically or not, from a formal-language-theoretic parsing model. This exploitation of the "context-free paradigm" in parser development, whereby essentially all the creativity deployed in crafting a parser goes into the enunciation of the relevant context-free grammar (CFG), is a good example, both of how an executable declarative specification (EDS) of a computation is better than a procedural one, and how special-purpose declarative notations (in this case, CFG) can arise.

The purpose of our paper is to advance the employment of EDS in language processing by exposing a glimpse of the extent to which Facet Grammars (FG), a particular application of CFG, can be used to specify static semantics, and how practical implementations of the specified procedures can be derived.

1.1. Aims and significance

Our concern is that the exploitation of EDS in the parsing phase of language processing has not been matched to any similar extent in subsequent phases. Certainly, numerous special-purpose notations and formalisms have been proposed (e.g. attribute grammars [1], two-level grammars [2]), but we contend that none of these have the obvious simplicity, nor the claimed psychological resonance, of CFG. Context-free grammars are generally accepted as being easy to use, with concepts that most people can grasp. Many language specification methods such as the above-mentioned use CFG as bases. However, they have added layers of additional and relatively complicating metalinguistic baggage, which have not yielded semantic specifications of simplicity and clarity comparable to CFG syntactic specifications.

Facet Grammars (FG), on the other hand, re-use the CFG concept together with only a couple of very simple linking concepts, taking advantage of human acceptance of the CFG concept without adding a complicated superstructure. The strategies used by humans for understanding and processing FG attempt to more closely match human processing capabilities. In particular, FG try to make greater use of human patterning ability rather than over-extend human memory capabilities.

Facet grammars contribute to the modularity of programming language specifications by transmitting the following desirable complementary characteristics of CFG specifications of programming language syntax in the specifications of programming language semantics:

(a) logically-separate concepts may be written separately;
(b) variations of a single concept all appear together.

This improved modularity allows a language designer to deal with different language components independently and incidentally fosters orthogonality in the resulting design (as exemplified in the decoupling of the "nesting", "initialization" and "access" aspects of variables and identifiers given in Section 10--"extended application in language design", below). For example, consider how the CFG-defined syntax rules for declaration parts of programming languages can be separated from statement parts [(a) above], but all the alternative production rules defining the syntax of declarations can appear together [(b) above]. With FG, this decoupling is applicable in semantic situations, e.g. type-checking can be specified with entire textual independence from the constraints of declaration of identifiers prior to use that many languages enforce.

Facet grammars also support the specification, without repetition, of a semantic concept with multiple instances. For example, with attribute grammars a repeated construct--say, a list of identifiers described as

identifier-list → identifier-list identifier | identifier

requires the semantic specification for "identifier" to be repeated for each occurrence in the underlying CFG. Alternatively, misleading, artificial attributes need to be introduced to promote information pertaining to identifiers up the tree to a point where the separate occurrences can be treated once (at last!).

1.2. Technical overview

The Facet grammar (FG) model of static semantic checking rests upon the hypothesis that, in a significant number of cases, complex static semantic rules requiring e.g. the power of a Context-Sensitive Grammar for their expression, can instead be expressed as the logical conjunction of sets of simpler rules specified using only CFG. Moreover, each CFG will define natural, orthogonal facets of the language being described. Just as a language consists of a number of facets, so are sentences comprised of interleaved facet instances, or threads. Semantic checking therefore involves first a process of thread extraction, in the course of which putative threads are dis-interleaved, and then each putative thread is subject to context-free parsing according to the appropriate FG.
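The conjunction of per-thread context-free parses just described can be sketched concretely. The following toy illustration in Python is our own, not the paper's: the token shapes, the per-identifier extraction, and the facet predicate are all assumptions chosen only to show the shape of the pipeline.

```python
# Acceptance as a conjunction of parses: a sentence is statically valid
# iff every extracted thread belongs to the facet language.
def check_sentence(tokens, extract_threads, in_facet_language):
    return all(in_facet_language(t) for t in extract_threads(tokens))

# Toy facet: events are (kind, identifier) pairs; one thread per identifier,
# and the facet language demands that an identifier's first event is 'dcl'.
def extract_threads(tokens):
    names = {ident for _, ident in tokens}
    return [[kind for kind, ident in tokens if ident == n] for n in sorted(names)]

def declared_first(thread):
    return bool(thread) and thread[0] == "dcl"

valid = [("dcl", "x"), ("dcl", "y"), ("use", "x"), ("use", "y")]
invalid = [("use", "x"), ("dcl", "x")]
```

Here `check_sentence(valid, extract_threads, declared_first)` accepts, while the second stream is rejected because the thread for "x" begins with a use.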

A distinguished facet is the context-free syntax facet, the language of which we call the Concrete Language (CL). The mechanism by which interleaved sub-sentences are separated (see Section 3.2 "extraction", below) will as a matter of course ensure that putative sentences belong to the CL. Also, the input sentence will typically only be considered as valid if all of the sub-sentences are valid, but logical combinations other than conjunction could conceivably be accommodated. (See Appendix A for glossary of technical vocabulary used and introduced in this paper.)

Corresponding to this Thread Extraction (TE) that applies to sentences, we have Grammar Extraction (GE) that applies to the language in order to identify its component facet grammars. GE may be applied to an existing, quasi-formal language design, or at the time of the original language design/definition, and involves the following four stages.

1.2.1. Separation. The whole language (including the informal static semantic rules) is partitioned into orthogonal concepts (or facets). This is done following the usual software engineering precepts of structuring (orthogonality, modularity, etc.). For example, the guarded commands language (see below) can be separated into a number of facets (e.g. concrete syntax, visibility, access, type).

1.2.2. Discernment. A grammar for a (separate) facet is transformed to recognise a single thread of that facet. A single concept (facet) may allow for multiple and independent instances. For example, a grammar dealing with the concept identifier or label needs to acknowledge that there may be many individual instances of such a concept. Discernment is a mechanism which allows each and every instance to be treated separately while checking each against the general pattern for that concept.

1.2.3. Tuning. Conceptually, all the syntactically allowable patterns of discerned, and other instances and keywords are examined and only those that satisfy the "static semantic" patterns are allowed to remain in the grammar. That is, the meaning of the informal static semantic rules is formalized by incorporation into the grammar.

1.2.4. Optimization. As there are by this stage a number of CF parsers, each checking a subsentence of the input, there is no need to verify syntax if it is checked somewhere else. That is, in a situation where there are a number of parsers checking various facets of the sentence, optimisation allows the incidence of double-checking of syntax to be eliminated. The removal of double-checking, and of the tokens to which that checking related, allows considerable simplification in each separate FG, and further focuses attention on just those objects and relationships of importance to the current facet.

2. STATIC SEMANTIC ANALYSIS: INTRODUCING A SIMPLE EXAMPLE

A simple programming language "P0" has the following static semantics for identifier behaviours:

(1) declaration of identifiers required before use; but
(2) declaration of some identifiers may involve the use of others.

Syntactically, identifier scopes are not nested. An essential abstract syntax for P0 would therefore at least involve:

prog :: dcl* use*
dcl :: name use
use :: name*

where:

• "*" is Kleene star;
• name denotes an effectively infinite set of distinguishable token values;
• and a language's "essence" is a simplified description of a language relative to some linguistic concept, and ignores symbols that do not relate. For example, a semantic facet solely concerned with the declaration of names prior to use need not retain in its threads information about whether variables are being updated or merely accessed, and could delete occurrences of the ":=" token.

The language generated by some language generator ("gen", say) for this abstract syntax-- "gen (prog)"--is a set of trees. So that we can directly inherit formal-language-theoretical results, a concrete syntax defining the set of representations of these trees as strings (in effect, the CL for P0) is:

PROG → DCL* USE*
DCL → dcl NAME [with USE]
USE → use NAME*

where NAME is as name above. The set "gen (PROG)" essentially characterises the set of traversals of elements of "gen (prog)". Because the traversal mapping

Traverse : gen (prog) → gen (PROG)

is bijective, PROG and prog may be viewed in effect synonymously, as facets of a common concept, this being the essence of P0 from the "identifier behaviours" point of view.


Now, the static semantic rules for any programming language P may be viewed as a filter F for the putative sentences of P. Where P is partially specified by some CFG G, then

F : gen (G) → {true, false}

P itself is the (sub-) set (of "gen (G)") characterized by the predicate F:

P = chr (F) = {p ∈ gen (G) | F (p)}

where "chr" maps a predicate into the set of which it (the predicate) is the characteristic predicate. Thus

P ⊂ gen (G)

in any non-trivial case.

In our case, definition of P0 therefore involves a pair (prog, F):

P0 = {p ∈ gen (prog) | F (p)}

where F enforces the static semantics conditions (1, 2 above) that apply to P0. Because this "chr (F)", viewed from its PROG facet, is a well-known context-sensitive language, static semantic correctness is typically checked by explicitly programming a traversal of the abstract tree, in which the focus is on the maintenance of a symbol table.
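For concreteness, the conventional symbol-table traversal that checks P0's static semantics might look like the following sketch. The program encoding (tuples of declarations with their embedded uses) is our own assumption, not a representation taken from the paper.

```python
# Procedural static semantic check for P0, maintaining a symbol table.
# A program is a list of entries:
#   ("dcl", name, uses) -- declare `name`, using the identifiers in `uses`
#   ("use", names)      -- use the identifiers in `names`
def p0_check(program):
    declared = set()  # the symbol table: names declared so far
    for entry in program:
        if entry[0] == "dcl":
            _, name, uses = entry
            if any(u not in declared for u in uses):   # rule (2): a dcl may
                return False                           # use only prior dcls
            declared.add(name)
        else:
            _, names = entry
            if any(n not in declared for n in names):  # rule (1): declare
                return False                           # before use
    return True
```

So `p0_check([("dcl", "x", []), ("dcl", "y", ["x"]), ("use", ["x", "y"])])` accepts, while a use of an undeclared name rejects. It is precisely this hand-written traversal that the FG approach seeks to replace with declarative, context-free specifications.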

3. FACETS

3.1. Semantic processing--an automata-theoretic view

Consider the symbol table typically needed for P0 above. It contains a single entry for each distinct name encountered in a traversal. Now, for the universe of names, the symbol table may be thought of as indicating one of two states: "not declared (yet)", indicated by the absence of the name; or "already declared", indicated by its presence. Likewise, from the point of view of each name, the traversal may be thought of as the execution of an automaton (here, a Finite-State Machine), where the symbol table provides state information, and events of interest to that name (its declaration(s) and use(s)) provide the transitions. We call this sort of traversal the execution of a Parallel Semantic Automaton (PSA), involving virtually the execution in parallel of a collection of automata, one for each name, the activations of all of which are interleaved. Note that the automata are the same, but with different instantiations for the behaviour of different names in the one program.

Now, an automaton recognises a language. Thus, corresponding to the above interleaved parallel execution of automaton instances as parts of a PSA, (putative) sentences in the language of the automaton must also occur interleaved in P0 programs. For each distinct name there will be one interleaved sentence corresponding to an automaton instance. We call this language the Facet Language (FL) for P0 (FLP0), and each interleaved sentence a thread. Therefore, the specification of static semantic rules for P0 reduces to the problem of defining FLP0, with their implementation being simply effected by extraction of the various separate threads in a P0 program, and application to each of them of the FL automaton.
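The per-name automaton and its interleaved execution can be sketched directly. The encodings below are our own assumptions: events are plain strings, and the dictionary of per-name states plays exactly the role of the symbol table.

```python
# One instance of the per-name automaton: two states ("not declared yet"
# and "already declared"), rejecting a 'use' before the first 'dcl'.
def name_automaton(events):
    declared = False
    for ev in events:
        if ev == "dcl":
            declared = True
        elif ev == "use" and not declared:
            return False
    return True

# The PSA: one automaton instance per name, all activations interleaved.
# The per-name states dictionary is the symbol table.
def run_psa(stream):
    declared = {}
    for ev, name in stream:
        if ev == "dcl":
            declared[name] = True
        elif not declared.get(name, False):
            return False
    return True
```

Note that `run_psa` accepts a stream exactly when `name_automaton` accepts each name's projected event sequence, which is the sense in which the PSA is "virtually the execution in parallel" of one automaton per name.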

To exploit this insight in general, rather than attempting to derive the FL from a hand-crafted PSA as above, we should begin with its enunciation, and derive the PSA, in just the same way that programming language syntactic analysers are derived from the defining CFG, the idea being that the FL will be easier than the automaton to develop ab initio.

The extraction process depends upon identifying the seed, the linguistic property that necessitates context-sensitive analysis and so induces the employment of a PSA therein. Our operating hypothesis is that a significant amount of the context-sensitivity of programming languages derives from the fact that they are "overlapped matching" languages [3], i.e. generalizations from the form

{a^n b^m a^n b^m | m, n ≥ 1}

Programming language features such as matching actual and formal parameters, dimensions of arrays, and declaration and use of identifiers could be considered as overlapped-matching language aspects. For example, the "declaration before use" rule for identifiers in P0 provides the seed upon which extraction is based.

3.2. Extraction

It follows from the above that, if the overlapped matching can be ignored in language specifications, context-sensitivity evaporates. This is possible, without any loss of information, by the following device: replace the original sentence by a set of sentential forms, one for each "overlapped matching" pattern in the original. (For example, each distinct identifier in P0 above is a pattern in this sense.) In each sentential form so derived from a particular pattern, occurrences of all but the particular matching pattern are replaced by some meta-variable, so that occurrences of one pattern only are discerned. By regarding the introduced meta-variable in each derived sentential form as standing for the set of arbitrary strings over the language's alphabet, then each separate sentential form actually defines a set of sentences, and, the intersection of all these sets is the original sentence. Moreover, if the meta-variable in each sentential form is alternatively viewed as a terminal symbol, the sentential forms become sentences in a context-free language (the threads above: note that the P0 case is degenerate in that the FL is regular). For the present, we will consider the case that there is one meta-variable common to each thread, and that the threads are all members of the same language.

In general, the use of Facet Languages relies on the intersection of context-free languages. Well-known results [4] imply that the intersection of two context-free languages results in a context-sensitive language (at worst). Thus, FG languages are capable of defining (many) context-sensitive languages as the intersection of two (or more) context-free Facet Languages. Consequently, FG emerge as a plausible tool for static semantics specification.

Precisely, given any language specified by the pair (CFG G, static semantic filter F as above), we require a thread extracting transformation "TE"

TE : gen (G) → P ((terms (G) ∪ {M})*)

(where: recalling that "gen" generates a language from a grammar, then "gen (G)" is the relevant CL; "terms" yields the set of terminal symbols of a grammar; M denotes the introduced meta-variable; and "P (S)" is the powerset of S) which will produce a sequence for each meta-variable consisting only of those symbols which appear as terminals in the grammar of the FL.

We also need to derive the FG for the FL from the grammar G and filter F. We represent this as a grammar-extracting transformation "GE"

GE : CFG → CFG

such that the identity

F (P) = ∀ p ∈ TE (P) . p ∈ gen (GE (G))

holds. That is, static semantic analysis of some putative program P conforming to syntax G with static semantics F, may be performed by thread-extracting from P its threads p and then checking each thread for membership of each FL of P, merely by performing a number of context-free parses. Diagrammatically, we can see the difference between the existing filter-based approach (lower path), by comparison with the proposed approach based on multiple context-free grammars (upper path), as shown in Fig. 1.

3.3. The reasonable limits on extraction processing

Of course, work must be done to extract the threads, and to maximize the benefits of our proposed approach, the extraction process should be as simple as possible. The essential process in extraction is to discern one instance of an object from another (e.g. discern different instances of object "identifier") and then to pass the occurrence of this instance to a relevant CF parser.

There is a trade-off between the power of the extraction process and the power of the FL. In Section 5--"optimization"--below, this trade-off is exercised. The extraction process should not do any real work (i.e. analysis) itself but rather facilitate the work of other processes. This is important both cognitively and in implementation terms. The definitive power of the FG model comes from a number of separate context-free grammars--the notion of extraction connects these grammars. So far we have considered just two: that describing the concrete syntax, and that describing the static semantics; but a generalisation appears in Section 9--"developing facet grammars as language design tools", below. We want the creator/user of the definition to concentrate on the grammars rather than the mechanism which allows the separate grammars to define the one language. In implementation terms the extraction process is an overhead which is the means of transferring information between the separate parsing processes. It is important to keep this overhead as small as possible--linear or close to linear if possible.

[Fig. 1. The existing filter-based approach (strings filtered to booleans by semantic rules, lower path) compared with the proposed approach of extraction followed by multiple context-free parses (upper path).]

The FL processor must therefore be designed so that thread extraction can be performed with minimal effort. Formally, we want to preclude the extraction process itself from performing any significant semantic analysis, typified in the extreme case by a TE and a GE where

TE (P) = if F (P) then "yes" else "no"

gen (GE (G)) = {"yes"}

(i.e. complete static semantic analysis is performed in the course of thread extraction TE, with an absolutely degenerate corresponding FL).

4. EXTRACTION: A SIMPLE EXAMPLE

In the case of P0, the separated occurrences of multiple, distinct names provides the context-sensitive seed. Thread extraction TE involves generation of putative sentences in FLP0 by creating, for each distinct name N in some putative P0 program P, a copy of P but with discernment of all names other than N effectively avoided. This may be implemented by designating occurrences of the name of interest by a distinct token name, and occurrences of all other names by the "non-discerning" token other (in other words, the meta-variable introduced in extraction). TE may be applied either to abstract trees or to the token stream produced by a lexical analyser, so introduction of the other token implies no complication of any existing lexical analyser per se.

For example, from the concrete, i.e. PROG facet of P0, applying thread extraction TE to the sentence

dcl x dcl y with use x y use x y use y z

yields the set of putative FLP0 threads--one for each distinct identifier.


Extracting for "x" gives the thread:

dcl name dcl other with use name other use name other use other other

Extracting for "y":

dcl other dcl name with use other name use other name use name other

Likewise for "z". The view of each thread as a set of sentences is achieved by regarding other in each case as standing for at least the set of tokens {"x", "y", "z"}, in which case the derivation of the original example sentence as the intersection of the threads is obvious.
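This extraction can be reproduced mechanically. The sketch below is our own; the keyword set is simply read off the concrete syntax of P0 above.

```python
# Thread extraction TE for the worked example: the identifier of interest
# becomes the token `name`, every other identifier becomes `other`, and
# keywords pass through unchanged.
KEYWORDS = {"dcl", "use", "with"}

def extract_thread(tokens, ident):
    return ["name" if t == ident else ("other" if t not in KEYWORDS else t)
            for t in tokens]

sentence = "dcl x dcl y with use x y use x y use y z".split()
```

Applied to the example sentence, `extract_thread(sentence, "x")` and `extract_thread(sentence, "y")` yield exactly the two threads displayed above.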

Grammar extraction involves derivation of a CFG for FLP0, and therefore proceeds by first transforming PROG so that only a single name is discerned:

PROG' → DCL* USE*
DCL → dcl NAME [with USE]
USE → use NAME*
NAME → name | other

A derived set of static semantic rules F' is still required to filter out invalid sentences, so the definition of FLP0 still involves a pair, now (PROG', F').

However, we are at the same time in a position to refine (or tune) PROG' so that the meaning of the static semantic rules can be incorporated into the grammar, thus obviating the need for accompanying (non-declarative) static semantic rules:

PROG'' → DCL0* DCL DCL1* USE*
DCL0 → dcl other [with USE0]
DCL → dcl name [with USE]
DCL1 → dcl other [with USE]
USE0 → use other*
USE → use NAME*
NAME → name | other

Now, the identity

chr (F') = FLP0 = gen (PROG'')

holds, so static semantic analysis (modulo extraction) is able to be performed entirely by context-free parsing.
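Since FLP0 is regular (as noted earlier, the P0 case is degenerate in this respect), the tuned grammar PROG'' can even be checked with an ordinary regular expression over the token stream. The transliteration of the productions below is our own sketch, not an implementation from the paper.

```python
import re

# Transliteration of PROG'' into a regular expression over space-joined tokens.
NAME = r"(?:name|other)"
USE0 = r"use(?: other)*"
USE = rf"use(?: {NAME})*"
DCL0 = rf"dcl other(?: with {USE0})?"
DCL = rf"dcl name(?: with {USE})?"
DCL1 = rf"dcl other(?: with {USE})?"
PROG2 = re.compile(rf"(?:{DCL0} )*{DCL}(?: {DCL1})*(?: {USE})*")

def in_flp0(thread):
    """Membership test for the tuned thread language gen(PROG'')."""
    return PROG2.fullmatch(" ".join(thread)) is not None
```

On the worked example, the threads for "x" and "y" are accepted, while the thread for "z" (all other except for its single use) is rejected: it contains no `dcl name`, reflecting that "z" is used but never declared.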

Incidentally, ambiguity in the FL is not a problem because parsing is purely a recognition process with a yes/no answer.

5. OPTIMIZATION

Parsing the threads of P0 (= "gen (PROG'')") corresponds to the execution of the PSA thereof as introduced above (Section 3.1--"semantic processing--an automata-theoretic view"). The comparatively extra states apparent in the automaton that would be derived for parsing PROG'' derive from:

(i) the explicit processing of punctuation tokens (e.g. dcl) in PROG'' that were implicit in control flows in the PSA driver program;

(ii) the explicit presence of other tokens in the thread extracted for a given name.


Both these drawbacks may be overcome by the employment of more sophisticated extraction techniques. Optimization is basically a more selective form of extraction that, subsequent to discernment, ignores sub-sentences in which there does not appear the pattern upon which the discernment was performed. Moreover, when we view this simplification as operating on the abstract tree facet of the FL, opportunities for relatively simpler corresponding concrete syntaxes emerge, thus reducing the amount of punctuation needed.

5.1. Suppressing extraneous detail

The basic idea, with respect to our P0 example, is to ignore those parts of the threads that do not involve name. That is, to ignore sub-threads that refer to other instead. This reduces the length of each thread. We can ignore other parts safely as the static semantics of the thread for every other identifier will be checked by its own separate parse. For example, when parsing the declaration facet of an identifier "x" we need not consider whether the class of other identifiers generally obeys declaration semantics as each identifier in turn (e.g. "y" then "z") will be checked separately.

We can ignore non-identifier tokens not appearing in the facet language FL safely because the general syntax of the putative sentence as a whole has been checked by the basic parse (checking CL). For example we can ignore objects such as while or if tokens in the declaration FG because the correct use of these would already have been verified in the parse prior to (thread) extraction TE.

Consider initially, how (unoptimised) grammar extraction is performed in detail on the abstract syntax, e.g. on prog (mirroring that on PROG above). The first stage is to discern only one name:

prog' :: dcl* use*
dcl :: name use
use :: name*
name = name | other

The second stage (implementation of static semantic filter by context-free rules) may be detailed as follows. First, identify the meta-variables by which the semantic filter is to be expressed (dcl0 etc., corresponding to DCL0 etc.), and simply re-express the existing meta-variables as alternations thereof, e.g.

dclold = dcl0 | dclnew | dcl1

The next step is to supersede the static semantic rules by replacing occurrences of dclold etc. by particular patterns of use of the new meta-variables. After having done so, e.g.

prog'' :: dcl0* dcl dcl1* use* etc.

we have achieved the abstract counterpart of the P0 static semantic FG, and are in a position to optimize. Commence by suppressing occurrences of other, as well as occurrences of those new meta-variables from which other is derived as alternatives to those from which name is derived instead (e.g. suppress dcl0). Subsequently, now-vacuous productions, e.g.

dcl1 :: use

can also be omitted.

5.2. Optimization example

The full treatment of P0 proceeds as follows. The second stage of grammar extraction would give

prog'' :: dcl0* dcl dcl1* use*
dcl0 :: other use0
dcl :: name use
dcl1 :: other use
use0 :: other*
use :: name*
name = name | other


corresponding to PROG'' above. Optimizing, first by simple suppression of other and related meta-variables, yields

prog'' :: dcl dcl1* use*
dcl :: name use
dcl1 :: use
use :: name*
name = name

If we additionally remove vacuous rules, and merge occurrences of the form "X* X*" into "X*", we achieve

prog''' :: dcl use*
dcl :: name use*
use :: name*

Finally, we may consider a corresponding concrete syntactic facet, options in the design of which include promoting the occurrence of the concrete counterpart of use, from the counterpart of dcl to that of prog''':

PROG''' → DCL USE*
DCL → name
USE → name*

(Note the with keyword is no longer needed.) At this point we realize that an insignificant extra duty for the thread extraction mechanism here would be condensation: to replace occurrences of applications of, e.g. the rule "DCL → name" by a single distinguished token dcl, and to replace other occurrences of name (under USE) by a token use, giving a concrete FL with grammar

PROG"' → dcl use*

which is precisely the language of the PSA found in the explicit processing of a symbol table used to implement static semantic rules for P0!
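The condensed per-name thread language dcl use* is regular, so the PSA for a single name can be as simple as a regular-expression match. The following Python check is our own illustrative encoding of that automaton, not from the paper:

```python
import re

# Per-name thread language for P0 after condensation: dcl use*
# A thread is the sequence of condensed tokens seen for one identifier.
THREAD = re.compile(r"dcl( use)*")

def valid_thread(tokens):
    """Accept a thread iff it is a declaration followed by zero or more uses."""
    return THREAD.fullmatch(" ".join(tokens)) is not None

assert valid_thread(["dcl", "use", "use"])   # declared, then used
assert not valid_thread(["use", "dcl"])      # used before declaration
assert not valid_thread(["dcl", "dcl"])      # doubly declared
```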

6. EXAMPLE--HANDLING MULTIPLE SCOPES

A slightly more complicated example is P1, the essence of a language with nested block structure:

(1) names not declared in a block may not be used unless previously declared in an enclosing block;

(2) if a name is declared in a block, it may not be previously used in that block (including enclosed blocks).

P1 is interesting in that (unlike P0) the FL is inherently context-free. An abstract syntax is:

prog :: dcl* use*
dcl :: name use
use :: name* | prog

A corresponding concrete syntax is:

PROG → DCL* USE*
DCL → dcl NAME [with USE]
USE → use NAME*
    | begin PROG end


Grammar extraction proceeds following the pattern established for P0, first by ignoring the multiplicity of names:

PROG' → DCL* USE*
DCL → dcl NAME [with USE]
USE → use NAME*
    | begin PROG' end
NAME → name
     | other

The second stage of grammar extraction is to impose the static semantic filter on the extracted language by tuning the grammar:

PROG" → DCL0* USE0*
      | DCL0* DCL DCL1* USE*
DCL0 → dcl other [with USE0]
DCL → dcl name [with USE]
DCL1 → dcl other [with USE]
USE0 → use other*
     | begin PROG" end
USE → USE0
    | use NAME*
    | begin DCL1* USE* end
NAME → name
     | other

Applying the optimization procedure (suppression and condensation as above) gives

PROG" → dcl USE*
USE → use
    | begin USE+ end
    | begin PROG" end
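The optimized P1 thread grammar above is small enough to recognize with a few lines of recursive descent. The sketch below is a hypothetical illustration under our own token encoding (strings "dcl", "use", "begin", "end"); it is not an implementation prescribed by the paper. Note that the two begin alternatives are distinguished by one token of lookahead (dcl opens a nested redeclaration, anything else a nested use).

```python
# Recursive-descent recognizer for the optimized P1 thread grammar:
#   PROG" -> dcl USE*
#   USE   -> use | begin USE+ end | begin PROG" end

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at {pos}, saw {peek()!r}")
        pos += 1

    def prog():                       # PROG" -> dcl USE*
        eat("dcl")
        while peek() in ("use", "begin"):
            use()

    def use():                        # USE
        if peek() == "use":
            eat("use")
        else:
            eat("begin")
            if peek() == "dcl":       # begin PROG" end: name redeclared inside
                prog()
            else:                     # begin USE+ end: name used inside
                use()
                while peek() in ("use", "begin"):
                    use()
            eat("end")

    prog()
    return pos == len(tokens)

def recognize(tokens):
    try:
        return parse(tokens)
    except SyntaxError:
        return False

# name declared at top level, then used inside a nested block:
assert recognize(["dcl", "begin", "use", "end"])
# redeclared in a nested block after outer uses:
assert recognize(["dcl", "use", "begin", "dcl", "use", "end"])
# use before declaration is rejected:
assert not recognize(["use", "dcl"])
```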

7. IMPLEMENTATION CONSIDERATIONS

Specifications of the above form can be used to inspire hand-crafted static semantics analysers, or implementations can be automatically generated according to specifications as follows.

Canonically, thread extraction is two-pass: one to determine the set of present names; and a second to generate the thread for each name and "feed" each to the appropriate PSA instance. This process is admittedly context-sensitive, but trivially so. Moreover, this first pass would be standard for all static semantic analysers and can be implemented once (modulo lexical structure) for different languages and their static semantics, thus effectively factoring out the hand-crafted aspect of semantic analysis and implementing it once and for all.

A single combined pass is possible by the device of maintaining a special instance of the PSA for which all names met to date are other, from which for each new name met there may be "peeled off" a new instance. This "dummy" instance represents the set of PSA instances for those names at any stage as yet unmet.
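The single-pass "dummy instance" device can be sketched concretely. In the fragment below (our own encoding, assuming the condensed P0 thread automaton dcl use* as the per-name acceptor) the dummy is simply the start state shared by all as-yet-unmet names, from which a fresh instance is peeled off on first encounter:

```python
# Single-pass thread extraction with a "dummy" PSA instance.
# States of the per-name P0 acceptor: 'unmet', 'declared', 'error'.

def step(state, token):
    if state == "unmet" and token == "dcl":
        return "declared"
    if state == "declared" and token == "use":
        return "declared"
    return "error"

def analyse(pairs):
    """pairs: (name, token) stream condensed from the input sentence."""
    dummy = "unmet"                    # stands for every name not yet met
    instances = {}
    for name, token in pairs:
        if name not in instances:      # peel off a new PSA instance
            instances[name] = dummy
        instances[name] = step(instances[name], token)
    return all(s == "declared" for s in instances.values())

assert analyse([("x", "dcl"), ("x", "use"), ("y", "dcl")])
assert not analyse([("x", "use"), ("x", "dcl")])   # use before declaration
```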

Thread extraction for optimized FL grammars involves mirroring the suppression and condensation of material that occurs in the corresponding optimized grammar extraction. The conceptual challenge is that the optimization is to be performed while traversing the original abstract tree, when the optimization process is characterized in terms of the tree derived from the extraction process. To our relief, the nodes of the (grammar-extracted) tree are each derived from those of the original, and when traversing the original, information about the variety of node in the derivative to which each original node corresponds is available via synthesized attributes (e.g. the identities of names that occur in a sub-tree). Likewise, condensation may be performed within the synthesized attribute framework, preserving our goal of keeping as simple as possible any residual context-sensitive processing that is performed during extraction.


7.1. Abstract implementation

Implementation must make the conceptual framework of the FG model executable. It must effect the notions of:

(i) logically separate grammars (e.g. a basic syntax, plus one for scoping, plus one for typing),
(ii) a means of relating the separate grammars to the overall language,
(iii) multiple instances of the one concept (e.g. identifiers or labels),
(iv) transferral of information from one grammar to another, and
(v) removal of irrelevant information.

Points (iv) and (v) refer, of course, to the extraction process. The implementation considerations will be discussed in terms of processes and communication. These are fairly abstract implementation terms. The actual implementation method of the processes and the communication between them will depend on the relative costs of the various techniques available, and whether it is desired to rely on operating system or programming language facilities. The implementation could also be realized with different degrees of effectiveness by different paradigms. For example, a standard imperative programming approach using, say, Ada would need to settle on a uniform way to deal with the various objects involved, as it is a very general language. Approaches using logic programming or a dataflow model may have a more natural correspondence.

7.2. Grammars as processes

Each grammar can be directly implemented as a separate process which recognizes a specific (sub-)language. It has a single input stream (its partial putative sentence) but can send messages (output) to a number of processes. These output messages will consist of tokens in the terminal vocabulary of the grammar of the process to which they are sent. These messages are sent when the relevant production has been recognized. Messages to multiple instance grammars will be qualified by the instance of that particular concept (as a pair (concept class, instance value), e.g. (identifier, "fred")). The process will terminate when it encounters the end-of-input token on its input stream. It then reports its status (i.e. whether it analysed a valid sentence or not) to the process which initiated it. The process may also terminate before all its input has been processed if an error is detected (i.e. there is no possible way that the input will ever be a sentence). In this case it reports to its initiating process that an invalid sentence was seen.
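A grammar-as-process can be modelled compactly with a coroutine: it consumes its input stream token by token and reports a boolean status when it meets end-of-input. The toy sketch below (our own encoding; the embedded language is again the P0 thread language dcl use*) illustrates the protocol, with Python generators standing in for whatever process mechanism an implementation chooses:

```python
END = object()                         # distinguished end-of-input token

def p0_process():
    """Process recognizing the thread language dcl use*; returns its status."""
    state = "unmet"
    while True:
        tok = yield
        if tok is END:
            return state == "declared"
        if state == "unmet" and tok == "dcl":
            state = "declared"
        elif state == "declared" and tok == "use":
            pass
        else:
            state = "error"

def run(proc, tokens):
    """Initiate a grammar process, feed it a stream, collect its status."""
    p = proc()
    next(p)                            # start the process
    try:
        for t in tokens + [END]:
            p.send(t)
    except StopIteration as fin:       # process terminated with its status
        return fin.value
    return False

assert run(p0_process, ["dcl", "use", "use"])
assert not run(p0_process, ["use"])
```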

Multiple instance grammars may be implemented by a single process which contains the grammar information (table or code) with a number of tasks--one for each instance. Each task will have its specific data (e.g. a stack) which represents the point in the grammar at which that particular instance is. The process will have a mechanism to determine where the information provided by incoming messages should be directed. That is, non-qualified messages will be given to all tasks (perhaps only logically), and qualified messages will be directed to a specific task determined by referring to a symbol table.

Removal of irrelevant information is implemented by not sending a message. That is, information transference is a positive process--a message has to be sent from the production (rule) of one grammar to the input of another grammar--and only relevant information is sent; thus irrelevant information is ignored.

7.3. Coordination of grammar processes

The overall system is managed by a "master grammar" process which initiates the facet grammar processes and sends the input sentence to the relevant grammar(s). It also collates the information gained by the status of each process when it terminates. These boolean values represent whether some aspect of the language was satisfied or not. These boolean values can be combined using conjunction (corresponding to language intersection), disjunction (union), or negation (complementation) depending on the definition of the overall language to arrive at a verdict as to whether the input sequence was a member of the overall language.
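The master process's collation step is just boolean algebra over facet verdicts. A toy sketch (the facet names are invented for illustration; here the overall language is taken to be the intersection of its facets, so verdicts are conjoined):

```python
def collate(verdicts):
    """Overall membership = conjunction of all facet verdicts (intersection)."""
    return all(verdicts.values())

def failed_facets(verdicts):
    """For error reporting: which facets of the language were not satisfied."""
    return [facet for facet, ok in verdicts.items() if not ok]

verdicts = {"syntax": True, "scoping": True, "typing": False}
assert not collate(verdicts)             # one facet violated => reject
assert failed_facets(verdicts) == ["typing"]
```

Union and complementation of facet languages would correspond to replacing the conjunction by disjunction or negation, as the text notes.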

The system may terminate early on encountering an error in the input if desired, or some other error-handling mechanism can be developed. Error reporting can be done at two levels--the process level, which will report "facet specific" errors, and the overall level, which will report which facets of the overall language were not satisfied.


The production of code or some other intermediate form from the system can be viewed in a number of ways. A standard approach would involve the basic syntax parser operating as normal and producing output to some file. The other "parser" processes would then more accurately be viewed as recognition processes which merely validate putative sentences. This is only one approach and there is no reason why any process could not output some form of information whether it be code or some other intermediate form or analytical report.

8. AN EXTENDED APPLICATION--DIJKSTRA'S LANGUAGE

Our culminating example is illustrative of how language design experiments are facilitated by simplifying language definition. Among the innovations of Dijkstra's Guarded Commands Language (GCL) [5] is a sophisticated system of identifier declaration and initialization, the understanding, use, implementation and propagation of which is militated against by the hitherto lack of a succinct definition.

For example, an essential concrete syntax for the identifier behaviour facet [6] is

BLOCK → begin NOMENC+ STMT* end
NOMENC → CLASS NAME+
CLASS → privar
      | pricon | virvar | vircon | glovar | glocon
STMT → init NAME+ READ
     | write NAME+ READ
     | do [READ STMT* (□ READ STMT*)*] od
     | if [READ STMT* (□ READ STMT*)*] fi
     | BLOCK
READ → read NAME*

This "identifier behaviour" essence is able to dispense with punctuation such as the → separating guards from guarded statement lists, because all that matters to the ultimate semantic facet is whether a variable is updated or merely accessed, and when extraneous information is removed from the threads, less disambiguating punctuation is needed. For example, assignment statements

variables := expressions

are essentially

write  list of names of assigned variables
read  list of names of accessed variables on rhs

By comparison, the collected, condensed accompanying static semantic rules [6] are approximately six times as long again, and are far removed from formality, let alone executable form (see Appendix B for a summary of the GCL static semantics). The optimized CFG for FLGCL is brief by comparison:

BLOCK → begin privar V-INIT-SEQ end
      | begin pricon C-INIT-SEQ end
      | begin NO-USE* end
V-INIT-SEQ → NO-USE* V-INIT V-USE*
C-INIT-SEQ → NO-USE* C-INIT C-USE*
V-INIT → if V-INIT-SEQ (□ V-INIT-SEQ)* fi
       | C-INIT
C-INIT → if C-INIT-SEQ (□ C-INIT-SEQ)* fi
       | begin virvar V-INIT-SEQ end
       | begin vircon C-INIT-SEQ end
       | init


V-USE → write
      | do V-USE* (□ V-USE*)* od
      | if V-USE* (□ V-USE*)* fi
      | begin glovar V-USE* end
      | C-USE
C-USE → read
      | do C-USE* (□ C-USE*)* od
      | if C-USE* (□ C-USE*)* fi
      | begin glocon C-USE* end
      | NO-USE
NO-USE → do NO-USE* (□ NO-USE*)* od
       | if NO-USE* (□ NO-USE*)* fi
       | BLOCK

A thread is extracted for each distinct identifier, but may embrace multiple bindings of the identifier in nested blocks. The significances of the symbols are as follows:

• BLOCK: represents part of a thread extracted from a GCL BLOCK
• V-INIT-SEQ: part of a thread in which an identifier must be initialized as a variable
• C-INIT-SEQ: part of a thread in which an identifier must be initialized as a constant
• V-INIT: part of a thread which initializes a variable
• C-INIT: part of a thread which initializes a constant or a variable (subset of variable initialization; N.B. especially that inside a nested BLOCK, variable initialization is possible)
• V-USE: part of a thread in which an identifier is used as a variable
• C-USE: part of a thread in which an identifier is used as a constant
• NO-USE: part of a thread in which an occurrence of an identifier is forbidden (but in which a nested distinct occurrence may be declared)
• privar etc.: declare discerned identifier with given nomenclature class
• init: initialize discerned identifier
• read: access discerned identifier
• write: update discerned identifier

As an example of how the rules "work", the GCL prohibition on initialization inside a loop (as a consequence of the rule that initialization be performed once only) is effected by the non-appearance of do ... od inside V-INIT and C-INIT. Similarly, the rule that initialization of an identifier in one arm of an if ... fi must be matched in all the others is effected by the matching appearances of either V-INIT-SEQ or of V-USE, C-USE or NO-USE.

Moreover, an implementation may now be relatively easily derived, even automatically as outlined above in Section 7--"implementation considerations". The overall impression is clearly one of clarity (through the simplicity and naturalness of the forms of expression of the semantic rules) and consequent accessible and effective precision.

9. DEVELOPING FACET GRAMMARS AS LANGUAGE DESIGN TOOLS

Facet grammars introduce the ability to separate concerns in the language design process. The ability to structure language definitions leads to a number of advantages. These advantages come at the cost of doing the actual structuring, and of providing the mechanisms by which structuring can take place. But, as found in the programming domain, the costs of structuring are outweighed by the benefits. After a brief discussion of the outcomes of providing structure in the language definition domain, the process by which separate facet grammars are derived will be outlined.

9.1. Separation of concerns in language definition

Separating concerns yields modularity and orthogonality--these can allow language to be expressed more concisely, allow easier understanding of the language, and allow for the re-use of language concepts. In the case of facet grammars, providing for separation of concerns proceeds in a formal way, and so the degree of formality in the definition of language is also increased.


9.1.1. Expressiveness and conciseness. By concentrating on one aspect of the language at a time the language designer need only consider those objects and tokens relevant to that aspect. Irrelevant issues and relationships can be ignored. This simplifies and clarifies the definition of that aspect, making it a concise expression of that concept. The combination of the facets is a "multiplicative" rather than "additive" information process, so the combination of the language concepts can be very expressive. The abstraction of a concept where there are a multitude of concrete instances also contributes to conciseness and clarity.

The degree and type of structuring is a design decision taken by the language designer in accord with the design goals. Concepts can be presented separately or together at the discretion of the designer.

9.1.2. Cognition. Structuring decisions can be taken by the designer with consideration to human understanding of the language. A language consists of a number of language concepts--the language designer needs to consider the most appropriate way in which to present these concepts to the language user. A separate exposition of concepts is more orthogonal and may be easier to learn and understand as only one concept need be grasped at a time. It can also contribute to a better understanding of the common language elements by language users.

The power of ignoring irrelevancies should not be underestimated when it comes to learning a language--especially by novice language users. The technique of learning a new language used by experienced language users is firstly to identify the common and "standard" language concepts and to accept them as "given" (albeit with some syntactic variation), then the novel (to that language user) language concepts can be identified and concentrated on. If there are no new language concepts (to that language user) then it is a simple matter of accommodating the syntactic variation within the logical patterns of the language concepts.

This process is familiar to us all. Using facet grammars allows this process to be formalized (i.e. made explicit) so that it can be more widely accessible and less tedious to discern the known from the unknown. By making the process explicit, the common and "standard" language concepts can be more clearly identified; this should aid in the teaching of the general language concepts and should encourage the "standard" concepts to actually become standard. The syntactic variation within the logical patterns of language concepts can also be made explicit, and would thus speed language familiarity. The power of ignoring irrelevancies should not be underestimated when it comes to locating and understanding errors either. Separating concepts allows errors to be more usefully identified within just those language concepts being violated. It also allows for reporting of multiple locations for errors which may be conceptually local but textually dispersed.

9.1.3. Re-use. Orthogonality facilitates combination of language concepts. This in turn facilitates language definition extension and re-use of language definition components. This would make language definition libraries feasible and help to reduce the cost and duration of language definition. This reduction in cost and duration would allow more responsive and widespread use of language. Language would be expected to be built more out of definition components, with effort concentrated on defining those new concepts the language introduces. Increased standardization of common language concepts would improve reliability and reduce costs, as well as facilitating learning as mentioned above.

9.2. Separating language into facets--process

The monolithic conception of the language, whether it be within the language designer's mind (or language design group's collective mind) or as a monolithic written definition, needs to be structured by separating its concerns. Within the FG model this process is achieved by deriving a number of facet grammars--one for each separate concern.

There are two requirements in the structuring of language. One requirement is to provide for conceptual separation of concerns. In this case the requirement is fulfilled by having a separate context-free grammar for each logical concern. The other requirement is to provide a concrete mechanism whereby the input sentence can be made amenable to processing by a number of conceptually separate concerns. Here, this requirement is satisfied by a more general view of extraction (both grammar- and thread-) than yet made apparent. Hitherto, thread extraction has been employed to yield multiple putative sentences of the one facet grammar, which in turn is presented as the single abstract model of the language for the purposes of semantic processing.


We now, however, envisage the separate components of a language's (semantic) definition separately, each of which can be abstracted into its own FG, in turn requiring separate thread extraction mechanisms for each. Thus, generalized thread extraction provides a number of "versions" of the input sentence where tokens irrelevant to particular concerns are ignored and only those tokens appearing as terminal symbols of the particular FG are passed on for processing by that FG. As before, the analysis succeeds iff all putative sentences belong to their supposed languages, only now the number of facet languages is possibly many. It follows that a language may be characterised by the intersection of its constituent facet languages. However, in order to have a grammar to apply to each of the extracted sequences work must be done by the language design agent (individual or group) to impose structure on the monolithic definition of the language. The language is usually defined as a context-free grammar defining the concrete syntactic issues in combination with an informal text outlining the semantic issues. Language specialists may also have a (very large and detailed) formal definition. Facet grammars can be derived in a relatively systematic way from the base syntax and the informal textual rules. There are in all four steps in deriving facet grammars for a language (as foreshadowed in the Technical Overview above):

(1) Separation of logically distinct concepts (e.g. concrete syntax (sentence form), typing, access, visibility, expression precedence, local standards, textual layout, path analysis, value syntax, inter-module syntax).

(2) Discernment--if there are multiple instances of the same concept then a single, abstract instance is discerned. This abstraction can then be applied to all concrete instances. This is exemplified by the focus of this paper on multiple instances of the concept of identifier.

(3) Tuning--incorporation of the meaning of the informal static semantic rules into the grammars. Only the allowable patterns of the concept are expressed by the grammar.

(4) Optimization--suppression of extraneous symbols and subsequent condensation of the grammar.

It is expected that a number of the more standard concerns/concepts will already have been defined. These will provide a vocabulary of language design issues which can be included by the language designer at a high level. This would circumvent steps (2), (3) and (4) and considerably ease the language definition process.

Note that this framework exposes in detail the distinction between the concrete syntactic language/grammar and the (single) semantic facet grammar that existed in the naive view of extraction before this section. In principle, concrete context-free syntax is just another facet (although it has a distinguished role in providing the basis for subsequent extraction of the other facets and their threads).

9.3. Scope for automation within the grammar derivation process

Step (1)--Separation--is where all the real work resides. The complex structuring process can be done entirely manually (following the accepted guidelines on structuring) or with some automated assistance where a formal language definition already exists. The relationships between objects (or attributes) of a formal definition can be analysed and used to help to identify clusters or groups of related objects. This process can help in making the implicit structure a little more visible before the language designer decides how to separate the language definition. In the case of multiple instances (step (2)--Discernment), once the abstract instance has been identified the matter of discernment of that instance and generating all discerned combinations is straightforward and can be easily automated.

Tuning can never be automated as it involves interpreting the language designer's wishes as recorded by the informal static semantic rules. The Tuning step removes all of those discerned productions which violate the static semantic rules. This will leave only those productions which will lead to valid (in terms of the static semantic rules) sentences--thus forming a true and complete grammar for that particular facet.

Optimization is perhaps not so straightforward but does use well-defined rules so is capable of being automated. However, as this step involves readability of the final grammar, there is a case to say that it should only be semi-automated.


The thread extraction transformation is derived totally from the definition of the grammars, and is done once when the parser is generated. This extraction process involves a mechanism to discern symbols (a symbol table)--this is the conceptually essential part of the process. Apart from the symbol table activity it is just a matter of sending the symbol to the relevant parser process. So complexity depends on the type of implementation for the symbol table--typical complexity per access would be O(log n) for tree-type access to n distinct threads, or O(1) for hashing schemes.
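The dispatch step itself is small. The Python sketch below (an illustration under our own encoding, with plain lists standing in for the input queues of per-name parser processes) uses a hash-based symbol table, i.e. the O(1)-per-access scheme mentioned above:

```python
# Thread dispatch through a hash-based symbol table: each token of the
# input is routed, by its discerned name, to that name's thread.
# Python dict lookup supplies the expected O(1) access cost.

def dispatch(pairs):
    """pairs: (name, token) stream; returns the thread for each name."""
    threads = {}                       # symbol table: name -> thread
    for name, token in pairs:
        threads.setdefault(name, []).append(token)
    return threads

t = dispatch([("x", "dcl"), ("y", "dcl"), ("x", "use")])
assert t == {"x": ["dcl", "use"], "y": ["dcl"]}
```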

This process, however, for arriving at the final FG is based on the existing method of language design where the designer creates a single basic syntax and informal static semantic rules. It would be the process to use in converting existing language definitions to a FL definition. However, the most exciting application of this new language definition technology is that it provides the language designer with a new framework and new tools for conceiving language. Thus, it would be expected that the old method of language definition would not continue to be used, but rather that the language designer would think in terms of the final facet languages and write them down as the translation of his/her conception of a language. That is, since all of the language designer's conceptions of the syntax (including the static semantics) of a language can be expressed formally in a concise yet perspicuous fashion there is no need to resort to the imprecise mechanism of informal textual specification. Of course, there would still be an explanatory role for text as in any good literate specification.

10. EXTENDED APPLICATION IN LANGUAGE DESIGN--SEPARATING SEMANTIC FACETS

In the "extended application" (Section 8) above it was seen how the use of FG succinctly formalized a hitherto informal textual definition of the rules for Dijkstra's Guarded Commands Language, specifically pertaining to identifier behaviour. In this section we consider an alternative structuring of the behaviour rules which separates the three facets of nesting (how identifier visibility transmits across block boundaries), initialization (whether an identifier may be used in any way and what its initialization obligations are) and access (whether a value is writeable and/or readable within a block). This extends the previous extent of facets, in that in the examples so far we have considered only two facets: (context-free) syntactic; and (static) semantic facets. Now, as foreshadowed in the previous section, we consider separate facets within the static semantic sphere.

10.1. Nesting facet

The nesting aspect can be defined as below. In this case it has been obtained from the combined definition by concentrating on only those things concerned with nested block structure. However, using the FG approach for the creation of the language this would have been designed ab initio to define the syntax of the nesting rules.

BLOCK → begin privar ANY-NEST* end
      | begin pricon CON-NEST* end
      | begin BLOCK* end
ANY-NEST → begin glovar ANY-NEST* end
         | CON-NEST
CON-NEST → begin glocon CON-NEST* end
         | begin virvar ANY-NEST* end
         | begin vircon CON-NEST* end
         | BLOCK

ANY-NEST signifies that its constituents can appear as nested blocks for any declared identifier; CON-NEST signifies that its constituents are only the ones that are admissible (but not exclusively) as nested blocks for identifiers declared as constants in the outer scope.

This nesting FG is derivable from the earlier integrated FG by ignoring the distinctions between constructs based on anything but nesting rules (Appendix B.1, Nomenclatures, excluding Notes).


In the absence of these distinctions, distinctions evaporate between most syntactic terms in the integrated FG.

10.2. Initialization facet

The initialization aspect can be defined as below.

BLOCK → begin NEW-INIT INIT-SEQ end
      | begin NO-USE* end
INIT-SEQ → NO-USE* INIT USE*
INIT → if INIT-SEQ (□ INIT-SEQ)* fi
     | begin DO-INIT INIT-SEQ end
     | init
USE → use
    | do USE* (□ USE*)* od
    | if USE* (□ USE*)* fi
    | begin NO-INIT USE* end
    | NO-USE
NO-USE → do NO-USE* (□ NO-USE*)* od
       | if NO-USE* (□ NO-USE*)* fi
       | BLOCK
NEW-INIT → privar | pricon
DO-INIT → virvar | vircon
NO-INIT → glovar | glocon

The change from the previous integrated FG is that, in addition to the factoring-out of the nesting facet, whereas before one of the distinctions between V-USE and C-USE was the recognition of separate read and write operations, now there is a common use operation under USE subsuming them both (the distinction between them is captured in the access facet below).

10.3. Access facet

The access facet can be defined as below. Once again it has been obtained from the combined definition by concentrating on only those things concerned with access. Likewise it could have been designed or specified ab initio.

BLOCK → begin V-DECL V-USE* end
      | begin C-DECL C-USE* end
      | begin BLOCK* end
V-USE → write
      | C-USE
C-USE → read
      | BLOCK
V-DECL → privar | virvar | glovar
C-DECL → pricon | vircon | glocon

Observe also that loops and branches do not appear as, being irrelevant to the access facet, extraction passes over them. This contrasts with their retention in the other facets.

10.4. Discussion

Note also that the separability facilitated by the FG approach would allow a language designer creating a new language to identify the notions relevant to the particular task at hand (e.g. type, visibility, access), define an appropriate context-free grammar for each notion, and allow the orthogonal combination of them as desired. This will allow the language designer to provide a formal language definition with a reduced effort, and with reduced maintenance due to the declarative and orthogonal nature of FG. They also open up the promise of a programming language definition library with all its attendant benefits.

A further benefit of separation of facets is that it raises the possibility of reusable conceptual components in language design and specification. With separated FG, a "good idea" in language


design can be given a formal definition independent of its context. It becomes possible for the language designer to "mix and match" concepts according to circumstances.

11. COMPARISONS

The FG model could simplistically be viewed as a refined and well-behaved type of attribute grammar where effective methods for modularity and information transfer have been introduced. The modularity using context-free grammars has advantages in the understanding of the total language, as well as eliminating problems of attribute grammars such as circularity, and movement of small and large (e.g. the "environment") data structures over large distances of the parse tree.

The FG model could also be viewed as taking the notion of Ordered Attribute Grammars (OAG) [7] in checking orderly attribute flow to its logical conclusion by separating the derived partial orders explicitly in the original definition. The FG could be considered a more declarative attribute grammar where the undistinguished mass of attributes are logically separated into coherent subsets of attributes concerned with a particular related set of concepts (or productions). The relationship between attributes within each subset of concepts is defined by a context-free grammar.

Multiple Attribute Grammars (MTAG) [8] were a development of OAG which had explicit separation, but still the motivation for separation was entirely implementation driven. The MTAG have no way to handle multiple instances of the same concept as in FG, so they must resort to the old attribute grammar standard of having internal, explicit symbol tables with large "environment" attributes. In addition, MTAG still suffer from the general attribute grammar approach of representing information dynamically, where the sequences of rules and conditions must effectively be simulated in the head of the reader in order to determine the relationships between the different syntactic entities. In FG the relationships between the syntactic entities are defined statically, where the relationships are represented explicitly (the FG appearing in this paper are clear examples of this).

Two-level grammars [2] can be seen as an extreme example of representing information dynamically. Although two-level grammars were a step along the path of progressive introduction of structuring methods to language definition, their most important contribution was in pointing out the need for a system that people can understand: theoretical power must be matched by usability. This was a prime motivation in the development of the FG model, in which a conscious attempt has been made to make the language-definition tool accessible to human understanding.

In a more pragmatic sense, the handling of multiple instances of the same concept in the FG model can be seen as factoring out the notion of a symbol table, which is a constant in all language definitions (whether represented as an "environment" in an attribute-grammar definition or as a "list of symbols" in a two-level-grammar definition). Factoring out this major and complex concept makes a definition easier to understand, as well as easier and more reliable to implement.

The extraction/interleaving inherent in the FG model is its information-transfer mechanism. It is implicit and need not be consciously considered by the language designer, implementor or user. It provides a natural model for non-local transfer of information: information from textually separated parts of a sentence can be used in the understanding of that sentence. This implicit, non-local information transfer contrasts with the methods used in two-level or attribute grammars, where information must be transferred explicitly, and must be passed through chains of parse-tree nodes if it is to be used non-locally. Typically, such non-local use of information is not controlled and may be exploited for matters unrelated to its intention. In the words of Sebesta [9, p. 99]:

"A significant problem with attribute grammars is the cost of moving attribute information, sometimes significant distances, around a parse tree. For example, to match the identifiers on the procedure and end statements of an Ada procedure, a pointer to the identifier must be passed through the parse tree from the procedure statement node to the end statement node. Some efforts have been made to short-circuit this process [10], but they are not part of the attribute grammar formalism."
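The extraction/interleaving mechanism can be sketched in miniature. This is a hypothetical illustration, not the paper's formal definition: the function names and the representation of tokens and threads are our own. A token stream is partitioned into one thread per identifier value, with each token's original position recorded so that interleaving is the exact inverse of extraction.

```python
# Hypothetical sketch of thread extraction and interleaving.
# Token and thread representations are illustrative only.

def extract_threads(tokens, is_identifier):
    """Partition a token stream into one thread per identifier value,
    recording each token's original position so interleaving is lossless."""
    threads = {}
    for pos, tok in enumerate(tokens):
        if is_identifier(tok):
            threads.setdefault(tok, []).append((pos, tok))
    return threads

def interleave(threads):
    """Inverse of extraction: merge threads back into one stream,
    ordered by original position."""
    merged = [item for thread in threads.values() for item in thread]
    return [tok for _, tok in sorted(merged)]

tokens = ["begin", "x", ":=", "y", ";", "x", ":=", "x", "end"]
threads = extract_threads(tokens, lambda t: t in {"x", "y"})
# threads["x"] is the thread for identifier "x" (positions 1, 5 and 7);
# interleaving restores the identifiers in their original textual order.
assert interleave(threads) == ["x", "y", "x", "x"]
```

Each thread is then a candidate sentence in the corresponding FL, parsed on its own against the appropriate FG; no "environment" attribute ever travels across the parse tree.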


12. FURTHER WORK

The overall usefulness of our proposal depends upon how many practical static-semantic issues can be dealt with by extraction of context-free FL: it is certain that multiple parallel extractions, reflecting multiple independent context-sensitive seeds, will be necessary (e.g. adding type-checking to the simple visibility analysis exemplified above).

As mentioned in the introduction, this paper has concentrated on multiple instances of the same concept, specifically multiple instances of the concept identifier. It is not envisaged that this technique be applied to type-checking, which seems more natural to treat as a logically separate, but single, concept. This issue will be taken up in detail in future work.

Moreover, tools to manipulate grammars would need to be developed, e.g. to perform grammar extraction (including optimisation) and to allow advantage to be taken of existing (LALR) parsing technology.

13. CONCLUSIONS

This paper has introduced FG and demonstrated their use as a powerful conceptual tool in the design and definition of the static-semantic aspects of language. Their promise to make language definition more comprehensible to language users is an important and worthwhile goal to pursue. The conceptual utility obtained through using Facet Grammars, as seen in the above work, justifies the further effort (as outlined in Section 7, "Implementation Considerations") needed to achieve a practical system. The property of the FG model that allows relationships between syntactic entities to be clearly seen in the definition of a language sets it apart from attribute-grammar approaches to language definition.

The importance of the FG model is its approach to language definition from the perspective of human comprehensibility, rather than a focus on the issues of implementation as exemplified in many previous approaches to language definition.

The specific contribution of this paper has been to define the sophisticated identifier scope and accessibility static-semantic rules of Dijkstra's Guarded Commands Language in a concise, comprehensible and formal way, solving the problem posed previously by Bailes [6]. The ability to separate concerns in language definition, and to handle multiple instances of one concept cleanly, are important innovations of the Facet Grammar model.

REFERENCES

1. Knuth, D. E. Semantics of context-free languages. Math. Syst. Theory 2: 127-145; 1968.
2. van Wijngaarden, A. et al. Revised Report on the Algorithmic Language ALGOL 68. Berlin: Springer; 1976.
3. Denning, P. J., Dennis, J. B. and Qualitz, J. E. Machines, Languages and Computation. Englewood Cliffs, NJ: Prentice-Hall; pp. 364-365, 1978.
4. Hopcroft, J. E. and Ullman, J. D. Introduction to Automata Theory, Languages and Computation. Reading, MA: Addison-Wesley; p. 281, 1979.
5. Dijkstra, E. W. A Discipline of Programming. Englewood Cliffs, NJ: Prentice-Hall; 1976.
6. Bailes, P. A. Static checking of variable handling in Dijkstra's Guarded Commands Language. Comput. Lang. 11: 123-142; 1986.
7. Kastens, U. Ordered attribute grammars. Acta Inform. 13: 229-256; 1980.
8. Röhrich, J. Graph attribution with multiple attribute grammars. SIGPLAN Not. 22: 55-70; 1987.
9. Sebesta, R. W. On context-free programmed grammars. Comput. Lang. 14: 99-108; 1989.
10. Johnson, G. F. and Fischer, C. N. Non-syntactic attribute flow in language based editors. Proceedings 9th ACM Symposium on Principles of Programming Languages, pp. 185-195; 1982.

APPENDIX A

Glossary

CFG, Context-Free Grammar;
CL, Concrete Language: the concrete "syntactic" facet of a programming language;
Discernment, the second stage of grammar extraction, in which a grammar for a (separate) facet is transformed to recognise a single thread of that facet;
EDS, Executable Declarative Specification (e.g. a CFG);
Facet, an orthogonal, cohesive aspect of a language definition/description;
FG, Facet Grammar: a CFG defining a facet of a programming language;
FL, Facet Language: generated by an FG;
Grammar Extraction, the process of deriving FG (for use in a PSA) from a definition/description of a programming language;
Instance, an FL focussed on occurrences of a particular token value within a token class, e.g. identifier "qaz" within the class of identifiers;
Interleaving, the inverse of Thread Extraction: the interspersing of tokens from a number of token streams (i.e. threads) to form one token stream;
Optimization, the fourth (final) stage of grammar extraction, in which symbols extraneous to the facet are deleted;
PSA, an FG-based language recogniser, which first extracts each thread of each facet, then parses each according to the appropriate FG;
Separation, the first stage of grammar extraction, in which grammars for separate facets are derived;
Thread, a putative sentence in an FL, i.e. a stream of tokens pertaining to a Facet Instance;
Thread Extraction, the process of separating out a token stream into the threads for the prevailing Facets, in which the abstract processes of Grammar Extraction are mirrored;
Tuning, the third stage of grammar extraction (after separation and discernment, and before optimization), in which static-semantic rules are bred into the FG.

APPENDIX B

Static Semantics for Guarded Commands Language

Taken from: Bailes, P. A. Static checking of variable handling in Dijkstra's Guarded Commands Language. Comput. Lang. 11: 123-142; 1986.

B.1. Visibility

A variable may not be initialized, updated or used unless a nomenclature for it is given in the current "innermost" block.

B.2. Nomenclatures

The nomenclature given for a variable in a block must correspond with the nomenclature given for it in the immediately enclosing block as follows:

Inner Block          Enclosing Block

privar or pricon     no correspondence
glovar               privar (A), virvar (A) or glovar
glocon               privar (A), pricon (A), virvar (A), vircon (A), glovar or glocon
virvar or vircon     privar (B), pricon (B), virvar (B), vircon (B)

Notes: (A) Initialization of the relevant variable must have occurred by entry of the inner block. (B) Initialization of the relevant variable must not have occurred by entry of the inner block.
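The rule-B.2 table is mechanically checkable as a lookup. The following is a hypothetical sketch (the names CORRESPONDENCE and check_nomenclature are ours, not from the paper): each inner nomenclature maps to its legal enclosing nomenclatures, paired with the initialization requirement from notes (A) and (B).

```python
# Hypothetical encoding of the rule-B.2 correspondence table.
# True: the variable must already be initialized on entry of the inner
# block (note A); False: it must not be (note B); None: no constraint.

CORRESPONDENCE = {
    "privar": {},          # no correspondence permitted
    "pricon": {},          # no correspondence permitted
    "glovar": {"privar": True, "virvar": True, "glovar": None},
    "glocon": {"privar": True, "pricon": True, "virvar": True,
               "vircon": True, "glovar": None, "glocon": None},
    "virvar": {"privar": False, "pricon": False,
               "virvar": False, "vircon": False},
    "vircon": {"privar": False, "pricon": False,
               "virvar": False, "vircon": False},
}

def check_nomenclature(inner, enclosing, initialized):
    """Return True iff the inner/enclosing nomenclature pair is legal,
    given whether the variable is initialized on entry of the inner block."""
    allowed = CORRESPONDENCE[inner]
    if enclosing not in allowed:
        return False
    need = allowed[enclosing]
    return need is None or need == initialized

assert check_nomenclature("glovar", "privar", initialized=True)       # note (A) satisfied
assert not check_nomenclature("virvar", "privar", initialized=True)   # note (B) violated
assert not check_nomenclature("privar", "glovar", initialized=True)   # no correspondence
```

Encoding the table as data rather than as conditional logic keeps the checker in one-to-one correspondence with the rule as stated, which is easy to audit against the table above.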

B.3. Initialization

A variable may be initialized only once.

B.4. Nomenclatures and initialization

A variable is to be initialized inside a block if and only if its nomenclature for the block is one of privar, pricon, virvar, vircon, and the other conditions regarding initialization are satisfied.

B.5. Initialization and use

A variable may be used only after it is initialized.

B.6. Initialization and update

A variable may only be updated after it is initialized.

B.7. Nomenclature and update

A variable may be updated if and only if its nomenclature for the current block is one of privar, virvar or glovar, and all other conditions regarding updating are satisfied.

B.8. Extra considerations for static checking

The halting problem manifests itself in that we cannot generally tell statically what the value of an expression will be. This means that we cannot tell which "arm" of an if or do will be selected (aside from any considerations of non-deterministic selection), nor can we tell how many iterations a do will cause.

The following additional rules, which we shall see are easy to check statically, will be of importance in showing that the above conditions may be checked statically.

DO: A variable which exists external to a do statement may not be initialized inside it.
IF: A variable which exists external to an if statement, if initialized in one of its "arms", must be initialized in all of them.
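The IF rule illustrates why these additional conditions are easy to check statically: it compares the variables initialized in some arm with those initialized in every arm, using no information about which arm will actually be selected. A hypothetical sketch (the name check_if_arms is ours):

```python
# Hypothetical static check for the IF rule: an externally-declared
# variable initialized in one arm must be initialized in every arm.

def check_if_arms(arms, external_vars):
    """arms: one set per arm, of the variables that arm initializes.
    Returns the external variables that violate the rule."""
    initialized_somewhere = set().union(*arms)
    initialized_everywhere = set.intersection(*arms)
    partial = initialized_somewhere - initialized_everywhere
    return partial & set(external_vars)

# "x" is initialized in the first arm only: a violation.
violations = check_if_arms([{"x", "y"}, {"y"}], external_vars={"x", "y"})
assert violations == {"x"}
```

The check needs only the per-arm initialization sets, which are themselves computable without evaluating any guard, sidestepping the halting-problem concern raised in B.8.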


About the Author--PAUL BAILES holds a PhD in Computer Science from the University of Queensland, where he is currently a Senior Lecturer and leader of the Key Centre for Software Technology's Language Design Laboratory. Dr. Bailes' research interests cover the fields of programming language design and implementation, with special attention to the use of very-high-level languages for rapid prototyping, and the integration of new software development methods with existing software development technologies. His research is supported by the Australian Research Council and by the Defence Science and Technology Organization, of whose Australian Software Engineering Environments Consortium he is a foundation member.

About the Author--TREVOR CHORVAT received his BMath in Computer Science with Honours from the University of Wollongong, Australia in 1984, and is currently completing a PhD in Computer Science at The University of Queensland. After working for Olivetti technical support, Mr. Chorvat taught at Griffith University and is currently a Lecturer at the Queensland University of Technology. His research interests are in: the application of formal language theory to programming language design, implementation and use; program derivation; and editors.
