crafting an interpreter part 3 - parse trees and syntax trees - vrosnet - codepr


Upload: antony-ingram

Post on 19-Jan-2016


DESCRIPTION

With special thanks to Martin Holzherr of CodeProject for making the information available.

TRANSCRIPT


Crafting an interpreter Part 3 - Parse Trees and Syntax Trees - VROSNET - CodeProject

https://workspaces.codeproject.com/vrosnet/crafting-an-interpreter-part-3-parse-trees-and-syn[19/06/2014 4:46:28 AM]


Crafting an interpreter Part 3 - Parse Trees and Syntax Trees by: VROSNET forked from: martinholzherr / crafting-an-interpreter-part-3-parse-trees-and-syn

Third article on building a language interpreter describing the generation of parse trees and syntax trees.


Download demo - 66 Kb
Download PegWizard source (V0.01) - 19.2 Kb

Introduction

In this third article on building an interpreter, I will show how to generate a physical tree from the source text being parsed. Building a physical tree from the source text has many advantages: it allows detailed analysis, transformation and optimization of the original source. A tree, together with symbol tables compiled during tree visits, can even serve as the basis for a simple interpreter.

Tree generation is particularly easy for a recursive descent parser, since such a parser consists of a set of functions - one for each grammar rule. To generate a tree, each grammar-rule function must additionally add a new child to the tree built so far, where the new child represents the part of the source matched by the corresponding grammar rule.

With a tree, language analysis and code generation become much easier. Despite this, and although it is easy to generate such a physical tree with most parsers, early compilers (e.g. the first Pascal or C compilers) did not generate one. The reason is that the memory overhead can be overwhelming, and memory was a much more precious resource in the early days of computing.

In a PEG parser [1][2] as used in this article series, tree generation is complicated by backtracking. In case of a parsing failure, the part of the tree belonging to the failed rule must be removed again. Exceptions during parsing must be handled exactly like backtracking.
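The rollback described above can be pictured with a minimal sketch. It is not the peg_tree implementation from the demo: the names Node, MatchDigit, MatchChar and MatchSum are hypothetical, and the rule `sum : digit '+' digit` was chosen only to keep the example short. The idea is to remember how many children the parent had before the rule started and to cut the list back to that size when the rule fails.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for this sketch (not the article's TNode).
struct Node {
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Matches one decimal digit and, on success, appends a leaf node for it.
bool MatchDigit(const std::string& src, std::size_t& pos, Node& parent) {
    if (pos >= src.size() || src[pos] < '0' || src[pos] > '9') return false;
    auto leaf = std::make_unique<Node>();
    leaf->text = std::string(1, src[pos]);
    parent.children.push_back(std::move(leaf));
    ++pos;
    return true;
}

// Matches a single character without creating a tree node.
bool MatchChar(const std::string& src, std::size_t& pos, char c) {
    if (pos >= src.size() || src[pos] != c) return false;
    ++pos;
    return true;
}

// Rule: sum : digit '+' digit
// On failure the input position is restored and every node added
// during the attempt is removed again.
bool MatchSum(const std::string& src, std::size_t& pos, Node& parent) {
    const std::size_t savedPos = pos;
    const std::size_t savedCount = parent.children.size();
    if (MatchDigit(src, pos, parent) && MatchChar(src, pos, '+') &&
        MatchDigit(src, pos, parent)) {
        return true;
    }
    parent.children.erase(
        parent.children.begin() + static_cast<std::ptrdiff_t>(savedCount),
        parent.children.end());
    pos = savedPos;
    return false;
}
```

Calling MatchSum on "3+5" leaves two leaf nodes below the parent, while "3+x" leaves the parent and the position untouched, although the first digit had already produced a node.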

As a useful application of tree generation, I will present the core of a parser wizard, which takes a set of grammar rules in source text form and generates the skeleton of a parser for this grammar. Without a parse tree, such a parser wizard (a simple form of parser generator) would be difficult to build.

The Tree Advantage

The following table shows some problems which can be easily solved when having a physical tree representing the source text, but which are difficult otherwise.

Task: Optimize code
Source: arithmetic expression
    x = x + 1
Tree solution: arithmetic expression optimized
    x += 1
Advantage of tree representation: Easy to recognize when looking from the parent node.

Task: Recognize out of order elements
Source: PEG grammar rule
    Int( ('+' / '-') Int)*
Tree solution: template implementation of PEG rule
    And< Int, OptRepeat< And< OneOfChars<'+','-'>, Int > > >
Advantage of tree representation: To automatically translate Int(('+'/'-') Int)* into the template shown under Tree solution, a tree representation can help very much. The problem here: * appears at the end, but must be placed as OptRepeat at the front. Easy to handle when looking from the parent node.

Task: Interpret loop
Source: while loop
    while(i>0){ j+= i; i--; }
Tree solution: using tree evaluating function calls
    while( ConditionTrue(t) ){ EvalStat(t->child); }
Advantage of tree representation: A tree can be easily evaluated by passing sibling and child nodes to evaluation functions.

Task: Multiple passes
Source: any language source
Tree solution: using tree walks starting at pContext.tree
    if(G::Matches(pIn,&pContext)){
        pContext.Analyze();
        pContext.OptimizeTree();
        return pContext.GenCode();
    }
Advantage of tree representation: Multiple passes over the same source without the need for reparsing.
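To make the "Optimize code" row a little more concrete, here is a small sketch of such a rewrite seen from the parent node. All names (Kind, Expr, Leaf, Binary, FoldAddAssign) are hypothetical and chosen for this illustration; they are not part of the demo code. The function inspects an assignment node and, if its right side is an addition whose left operand is the assigned variable, turns the node into an add-assign node.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression tree for this sketch.
enum class Kind { Var, Num, Add, Assign, AddAssign };

struct Expr {
    Kind kind;
    std::string text;                         // variable name or literal
    std::vector<std::unique_ptr<Expr>> kids;  // operands of an operator node
};

std::unique_ptr<Expr> Leaf(Kind k, std::string t) {
    auto e = std::make_unique<Expr>();
    e->kind = k;
    e->text = std::move(t);
    return e;
}

std::unique_ptr<Expr> Binary(Kind k, std::unique_ptr<Expr> l,
                             std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->kind = k;
    e->kids.push_back(std::move(l));
    e->kids.push_back(std::move(r));
    return e;
}

// Rewrites Assign(x, Add(x, rhs)) into AddAssign(x, rhs).
// Returns true if the node was rewritten.
bool FoldAddAssign(Expr& node) {
    if (node.kind != Kind::Assign || node.kids.size() != 2) return false;
    Expr& target = *node.kids[0];
    Expr& value = *node.kids[1];
    if (target.kind != Kind::Var || value.kind != Kind::Add ||
        value.kids.size() != 2) {
        return false;
    }
    Expr& left = *value.kids[0];
    if (left.kind != Kind::Var || left.text != target.text) return false;
    std::unique_ptr<Expr> rhs = std::move(value.kids[1]);
    node.kind = Kind::AddAssign;
    node.kids[1] = std::move(rhs);  // the old Add node is released here
    return true;
}
```

The pattern is visible only because the parent node has both the assignment target and the whole right side at hand; a pure recognizer sees these pieces one after the other.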

Parse Trees and Syntax Trees

A parse tree uses one physical tree node per nonterminal, which usually results in huge trees. A syntax tree, often called abstract syntax tree or AST for short, is a parse tree from which most nonterminals have been removed. The difference is memory usage, as the comparison of the parse tree and the syntax tree for the following PEG grammar shows:

    //PEG grammar with start rule expr.
    //Note: this grammar does not accept any white space
    expr : mexpr (('+'/'-') mexpr )*;
    mexpr: atom ('*' atom )*;
    atom : [0-9]+ / '(' expr ')';

Sample expr:
    2*(3+5)
    Storage usage: 8 bytes

Parse tree:
    expr< mexpr< atom<'2'> '*' atom< expr< mexpr<atom<'3'>> '+' mexpr<atom<'5'>> > > >>
    Storage usage: >= 14*12 = 168 bytes

Abstract syntax tree:
    mexpr<'2' '*' expr<'3' '+' '5'>>
    Storage usage: >= 7*12 = 84 bytes

Compared to the source text, any tree representation will use much more memory, but parse trees can be especially huge. Tree building also takes much more time than just recognizing the input as shown by the following output taken from the demo program coming with this article:


    -p
    >> parse a 100000 char buffer filled with 2*(3+5)
    >> CPU usage ( Syntax tree with 87504 nodes built) = 0.17 seconds
    >> CPU usage ( Parse tree with 175003 nodes built) = 0.211 seconds
    >> CPU usage ( Expression recognizer) = 0.03 seconds

A tenfold increase in parsing time compared to a simple recognizer is not uncommon.

A tree node typically has at least the following members:

- two pointer members - one pointing to the first child, the other pointing to the sibling.
- a member used for node identification.
- a virtual destructor in order to support derived nodes.

My current PEG implementation has the following node interface:

    template<typename It>
    struct TNode{
        typedef TNode<It> Self;
        TNode (int id=0, It beg=0, It end=0, Self* firstChild=0);
        virtual ~TNode (){} //support derived nodes

        virtual Self* AddChild (TNode* pChild);
        int id;
        Self *next,*child;
        BegEnd<It> match; //references the source text matched by this node
    };
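How a first-child/next-sibling layout like this can be used is sketched below. SimpleNode is a hypothetical, simplified stand-in for TNode: it drops the iterator template parameter and the match member, and it assumes that a node owns its children and following siblings, which the interface above does not state. AddChild appends to the end of the sibling chain and CountNodes walks the whole structure.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the article's TNode.
struct SimpleNode {
    int id;
    SimpleNode* next;   // next sibling
    SimpleNode* child;  // first child

    explicit SimpleNode(int nodeId) : id(nodeId), next(nullptr), child(nullptr) {}
    SimpleNode(const SimpleNode&) = delete;
    SimpleNode& operator=(const SimpleNode&) = delete;

    // Assumption of this sketch: a node owns its children and later siblings.
    virtual ~SimpleNode() {
        delete child;
        delete next;
    }

    // Appends pChild as the last child and returns it.
    SimpleNode* AddChild(SimpleNode* pChild) {
        assert(pChild != nullptr && pChild->next == nullptr);
        if (child == nullptr) {
            child = pChild;
        } else {
            SimpleNode* last = child;
            while (last->next != nullptr) last = last->next;
            last->next = pChild;
        }
        return pChild;
    }
};

// Counts the given node, its following siblings and all their descendants.
std::size_t CountNodes(const SimpleNode* n) {
    std::size_t count = 0;
    for (; n != nullptr; n = n->next) count += 1 + CountNodes(n->child);
    return count;
}
```

Building the syntax tree mexpr<'2' '*' expr<'3' '+' '5'>> from the comparison above with this node type gives seven nodes, which is the node count behind the 7*12 bytes estimate.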

Integrating Tree Generation into a PEG Parser

Tree generation can be added to the template-based implementation of a PEG parser just by using a handful of tree generating templates. To use these templates, the Context structure passed to the Matches function must be derived from the predefined TreeContext. Each of these tree generating templates has a type parameter responsible for recognizing a PEG fragment and an optional integer value parameter used as the identifying key for the generated tree node. The following table lists the available tree generating templates.

Tree generating template Meaning, Effect

template<int id,typename TRule> struct TreeNTWithId

Builds a tree node representing the rule TRule. This node gets the ID id. All tree nodes created inside TRule will be descendants of this tree node.

template<int id,typename TRule> struct AstNTWithId

Same effect as TreeNTWithId, but with automatic replacement by the child node if the child node has no sibling.

template<typename TRule> struct TreeNT

Same effect as TreeNTWithId, but the id set to eTreeNT.

template<typename TRule> struct AstNT

Same effect as AstNTWithId, with the id set to eTreeNT.

template<int id,typename TTerminalRule> struct TreeCharsWithId

Builds a tree node having the source text recognized by TTerminalRule as match string. This node gets the ID id.

template<typename TTerminalRule> struct TreeChars

Same effect as TreeCharsWithId, with the id set to eTreeSpecChar.

template<typename T0, typename T1, ...> struct TreeSafeAnd

Replacement for And in case of using tree templates. The use of TreeSafeAnd guarantees correct behaviour in case of backtracking.
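The replacement rule behind AstNT and AstNTWithId (a node is replaced by its child if that child has no sibling) can be tried out on a finished tree. The sketch below is only an illustration under that reading: the article's templates apply the rule while parsing, whereas Collapse here runs afterwards, and PNode, N, Add, Collapse, Count, ToString and BuildSampleParseTree are hypothetical names. Applied to the parse tree of 2*(3+5) shown earlier, it yields the abstract syntax tree from the same comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for this sketch (not the article's TNode).
struct PNode {
    std::string label;
    std::vector<std::unique_ptr<PNode>> kids;
};

std::unique_ptr<PNode> N(const std::string& label) {
    auto n = std::make_unique<PNode>();
    n->label = label;
    return n;
}

// Appends kid to parent and returns a raw pointer for further building.
PNode* Add(PNode& parent, std::unique_ptr<PNode> kid) {
    parent.kids.push_back(std::move(kid));
    return parent.kids.back().get();
}

// Replaces every node that has exactly one child by that child, bottom-up.
std::unique_ptr<PNode> Collapse(std::unique_ptr<PNode> node) {
    for (auto& kid : node->kids) kid = Collapse(std::move(kid));
    if (node->kids.size() == 1) return std::move(node->kids[0]);
    return node;
}

std::size_t Count(const PNode& n) {
    std::size_t c = 1;
    for (const auto& kid : n.kids) c += Count(*kid);
    return c;
}

// Prints a tree as label<child child ...>, leaves as their label only.
std::string ToString(const PNode& n) {
    if (n.kids.empty()) return n.label;
    std::string out = n.label + "<";
    for (std::size_t i = 0; i < n.kids.size(); ++i) {
        if (i > 0) out += " ";
        out += ToString(*n.kids[i]);
    }
    return out + ">";
}

// Parse tree of 2*(3+5) for the grammar expr/mexpr/atom shown earlier.
std::unique_ptr<PNode> BuildSampleParseTree() {
    auto expr = N("expr");
    PNode* mexpr = Add(*expr, N("mexpr"));
    Add(*Add(*mexpr, N("atom")), N("2"));
    Add(*mexpr, N("*"));
    PNode* inner = Add(*Add(*mexpr, N("atom")), N("expr"));
    Add(*Add(*Add(*inner, N("mexpr")), N("atom")), N("3"));
    Add(*inner, N("+"));
    Add(*Add(*Add(*inner, N("mexpr")), N("atom")), N("5"));
    return expr;
}
```

Collapse(BuildSampleParseTree()) reduces the 14 nodes of the parse tree to 7, and ToString on the result gives mexpr<2 * expr<3 + 5>>.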

The practical usage of these tree generating templates will be shown with a grammar for expressions supporting the operations '+', '-', '*', where parentheses can be used around subexpressions.


Topic: Grammar specification

Recognizer grammar:

    expr : mexpr (('+'/'-') mexpr )*;
    mexpr: atom ('*' atom )*;
    atom : [0-9]+ / '(' expr ')';

Parse tree grammar:

    // ^ means: generate tree
    // node for next terminal
    expr : mexpr (^('+'/'-') mexpr )*;
    mexpr: atom (^'*' atom )*;
    atom : ^([0-9]+) / '(' expr ')';

Topic: Rules as types

Recognizer grammar:

    using namespace peg_templ;

    struct expr{
        typedef //atom : [0-9]+ / '(' expr ')';
            Or<
                PlusRepeat<In<'0', '9' > >,
                And<Char<'('>, expr, Char<')'> >
            > atom;
        typedef //mexpr: atom ('*' atom )*;
            And<
                atom,
                OptRepeat< And<Char<'*'>, atom > >
            > mexpr;
        typedef //expr : mexpr (('+'/'-') mexpr )*;
            And<
                mexpr,
                OptRepeat<
                    And< Or<Char<'+'>, Char<'-'> >, mexpr >
                >
            > expr_rule;
        template< typename It, typename Context >
        static bool Matches(It& p,Context* pC)
        {
            return expr_rule::Matches(p,pC);
        }
    };

Parse tree grammar:

    using namespace peg_tree;
    enum{eAtom,eMExpr,eInt,eExpr};
    struct exprT{
        typedef
            TreeNTWithId<eAtom,
                Or<
                    TreeCharsWithId<eInt, PlusRepeat<In<'0', '9' > > >,
                    And<Char<'('>, exprT, Char<')'> >
                >
            > atom;
        typedef
            TreeNTWithId<eMExpr,
                And<
                    atom,
                    OptRepeat<
                        TreeSafeAnd<TreeChars<Char<'*'> >, atom >
                    >
                >
            > mexpr;
        typedef
            TreeNTWithId<eExpr,
                And<
                    mexpr,
                    OptRepeat<
                        TreeSafeAnd<
                            TreeChars<Or<Char<'+'>, Char<'-'> > >,
                            mexpr
                        >
                    >
                >
            > exprT_rule;
        template< typename It, typename Context >
        static bool Matches(It& p,Context* pC)
        {
            return exprT_rule::Matches(p,pC);
        }
    };

Topic: Calling the parser

Recognizer grammar:

    struct EmptyContext{};
    bool CallRecognizingParser()
    {
        EmptyContext c;
        typedef const unsigned char* Iterator;
        const char* sTest= "2*(3+5)";
        Iterator it= (Iterator)sTest;
        return expr::Matches(it,&c);
    }

Parse tree grammar:

    bool CallTreeBuildingParser()
    {
        typedef const unsigned char* Iterator;
        //either use TreeContext
        //or derived class
        peg_tree::TreeContext<Iterator> tc;
        const char* sTest= "2*(3+5)";
        Iterator it= (Iterator)sTest;
        return exprT::Matches(it,&tc);
    }

The following chapters will introduce parser generators and parser wizards as examples of complex tree building parsers. A parser generator reads a grammar specification and outputs a parser. It typically builds a tree representation of the grammar specification provided by the user, because it must analyze and transform the user input before it can generate the appropriate parser.

Parser Generators

A well-known product category in the parser business is the so-called parser generator, such as YACC and its successors or, to name a more recent development, ANTLR. Parser generators take an annotated grammar in text form and generate a parser for it. These annotated grammars contain not only the grammar rules (using a notation close to what I introduced to explain PEG parsers), but also the code (the so-called semantic actions) which the generated parser executes when it passes the annotation point. This is shown in the following excerpt from the ANTLR documentation [3] for a grammar which corresponds to the following PEG grammar:

    //PEG grammar for expressions supporting
    //the operations '+','-','*' on integers.
    expr : S mexpr S(('+'/'-') S mexpr S)*;
    mexpr: atom S('*' S atom S)*;
    atom : [0-9]+ / '(' expr ')';
    S : [ \t\v\r\n]*;

ANTLR parser specification for expression recognizer:

    //Note, that ANTLR uses a separate scanner, PLUS below therefore corresponds
    //to '+' which has been defined in the scanner specification.
    class ExprParser extends Parser;

    expr: mexpr ((PLUS|MINUS) mexpr)* ;

    mexpr : atom (STAR atom)* ;

    atom: INT | LPAREN expr RPAREN ;

ANTLR parser specification for expression parser with semantic actions (semantic actions bold):

    class ExprParser extends Parser;

    expr returns [int value=0]
    {int x;}
        : value=mexpr
          ( PLUS x=mexpr {value += x;}
          | MINUS x=mexpr {value -= x;}
          )*
        ;

    mexpr returns [int value=0]
    {int x;}
        : value=atom ( STAR x=atom {value *= x;} )*
        ;

    atom returns [int value=0]
        : i:INT {value= Integer.parseInt(i.getText());}
        | LPAREN value=expr RPAREN
        ;
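What the semantic actions in the second specification amount to can be sketched as a hand-written recursive descent evaluator in C++. This is not the code ANTLR generates; ExprEvaluator and its member names are hypothetical, white space is not accepted, and integer overflow is ignored. Each rule function corresponds to one grammar rule and returns its value through a reference parameter, in the same places where the specification assigns to value.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical hand-written counterpart of the annotated grammar.
class ExprEvaluator {
public:
    explicit ExprEvaluator(const std::string& src) : src_(src), pos_(0) {}

    // Returns true if the whole input is an expression; result holds its value.
    bool Evaluate(int& result) {
        pos_ = 0;
        return Expr(result) && pos_ == src_.size();
    }

private:
    // expr : mexpr (('+'/'-') mexpr)*
    bool Expr(int& value) {
        if (!MExpr(value)) return false;
        while (pos_ < src_.size() && (src_[pos_] == '+' || src_[pos_] == '-')) {
            const std::size_t saved = pos_;
            const char op = src_[pos_++];
            int x = 0;
            if (!MExpr(x)) { pos_ = saved; break; }  // undo the operator
            if (op == '+') value += x; else value -= x;
        }
        return true;
    }

    // mexpr : atom ('*' atom)*
    bool MExpr(int& value) {
        if (!Atom(value)) return false;
        while (pos_ < src_.size() && src_[pos_] == '*') {
            const std::size_t saved = pos_;
            ++pos_;
            int x = 0;
            if (!Atom(x)) { pos_ = saved; break; }  // undo the operator
            value *= x;
        }
        return true;
    }

    // atom : [0-9]+ / '(' expr ')'
    bool Atom(int& value) {
        const std::size_t saved = pos_;
        if (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_]))) {
            value = 0;
            while (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_]))) {
                value = value * 10 + (src_[pos_++] - '0');
            }
            return true;
        }
        if (pos_ < src_.size() && src_[pos_] == '(') {
            ++pos_;
            if (Expr(value) && pos_ < src_.size() && src_[pos_] == ')') {
                ++pos_;
                return true;
            }
        }
        pos_ = saved;
        return false;
    }

    std::string src_;
    std::size_t pos_;
};
```

For the input 2*(3+5), Evaluate reports success and a value of 16; an incomplete input such as 2* is rejected because not all of it is consumed.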

Typically, a parser generator copies the code annotations (the bold parts in the ANTLR spec. above) to the appropriate places in the generated parser. The next step is to compile the generated code (using e.g. your Java compiler). In case of a compile error or unexpected behaviour when testing the parser, you either go back and edit the original grammar specification or you edit the generated parser code directly. In the second case, you break the relation between the specification and the generated parser. But sooner or later you will do so, and when using ANTLR you are lucky that the generated code is quite readable. Unfortunately, the same is not true for YACC-type parsers, which generate really impenetrable code. Stroustrup, the inventor of C++, experienced this when using YACC for his Cfront C++ front end. "In retrospect, for me and C++ it was a bad mistake." [to use YACC].. "Cfront has a YACC parser supplemented by much lexical trickery relying on recursive descent techniques." [4]

But I believe that even with integrated parsers like boost::spirit or Perl 6, something similar to a parser generator can still be useful. Such a tool, which generates a starter parsing application, has a purpose similar to that of e.g. the MFC application wizard, namely to generate the initial objects and functions, which will then be edited by the programmer.

The PegWizard Parser Wizard

The PegWizard parser wizard has a job profile similar to that of the MFC Application Wizard:

The MFC Application Wizard generates an application having built-in functionality, that, when compiled, implements the basic features of a Windows executable application. The MFC starter application includes C++ source (.cpp), resource files, header (.h) files, and a project (.vcproj) file. The code generated in these starter files is based on MFC.[5]

Translated to the case of the PegWizard Parser Wizard:

The PegWizard Parser Wizard generates an application having built-in functionality, that, when compiled, implements the basic features of a parser application. The PegWizard starter application includes C++ source (.cpp) and header (.h) files. The code generated in these starter files is based on the PEG parsing methodology.

The PegWizard parser wizard should do the following:

Meaningful parser wizard functionality: implementation status

- Check completeness of grammar (for each used nonterminal, there should be a corresponding rule): implemented in PegWizard 0.01
- Generate recognizers for arbitrary PEG grammar: implemented in PegWizard 0.01
- Check for left recursive grammar rules (serious error): planned for version 0.02
- Generate tree builder for arbitrary PEG grammar: planned for version 0.02
- Support error annotations in PEG grammar: planned for version 0.02
- Support file iterators, Unicode iterators: planned for version 1.00
- Support for automatic error annotations: planned for version 1.00
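The completeness check in the first row is simple once the grammar is available as a tree or as a table extracted from it. The following sketch is not taken from the PegWizard source; it assumes that a tree walk has already produced, for each rule name, the list of nonterminals used on its right side. RuleRefs and FindUndefinedNonterminals are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Rule name -> nonterminals referenced on the right side of that rule.
// Assumed to be collected by a walk over the grammar's parse tree.
using RuleRefs = std::map<std::string, std::vector<std::string>>;

// Returns every nonterminal that is used somewhere but has no rule.
std::set<std::string> FindUndefinedNonterminals(const RuleRefs& grammar) {
    std::set<std::string> missing;
    for (const auto& rule : grammar) {
        for (const auto& used : rule.second) {
            if (grammar.find(used) == grammar.end()) missing.insert(used);
        }
    }
    return missing;
}
```

For the expression grammar with the white space rule S left out, the function reports exactly S as missing; after adding an entry for S, the result is empty.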

The following table shows a sample usage of the PegWizard parser wizard V0.01.00:

Command line activation:

    PegWizard -g TestGrammar\expr.gram -o TestGrammar

Program output on command line:

    Peg0Wizard 0.01.00 (alpha) [20050508]
    INFO: file content of 'TestGrammar\expr.gram' read in
    INFO: saved the generated parser in 'TestGrammar\expr.cpp'
    program ended successfully

Content of 'expr.gram' file:

    //PEG grammar for expressions supporting
    //the operations '+','-','*' on integers.
    expr : S mexpr S(('+'/'-') S mexpr S)*;
    mexpr: atom S('*' S atom S)*;
    atom : [0-9]+ / '(' expr ')';
    S : [ \t\v\r\n]*;

Code excerpt of generated parser:

    #include "peg_templ.h"
    #include <iostream>
    using namespace peg_templ;
    struct expr; //forward declarations
    struct mexpr;
    struct atom;
    struct S;
    struct expr{ //grammar rule implementation
        typedef And< S, mexpr, S, ...

Test run output:

    '2*(3+5)' is matched by grammar

The input grammar file must adhere to the following PEG grammar:

    Grammar: S GrammarRule (S GrammarRule)* S ;
    GrammarRule: Nonterminal S ':' S RSideOfRule ;
    RSideOfRule: Alternative (S '/' S Alternative)* S ';' ;
    Alternative: Term (S Term)* ;
    Term: ('&' / '!')? S Factor S ( '*' / '+' / '?')? ;
    Factor: (Nonterminal / Terminal / '(' S RSideOfRule S ')' ) ;
    Nonterminal: [a-zA-Z0-9_]+ ;
    Terminal: LiteralString | Charset | '.';
    Charset: '[' (CharsetRange / EscapeChar / CharsetChar )+ ']' ;
    LiteralString: '\'' ( EscapeChar / !'\'' PrintChar )+ '\'' ;
    CharsetChar: !']' PrintChar ;
    EscapeChar: '\\' ( HexChar / PrintChar ) ;
    CharsetRange: ( EscapeChar / CharsetChar ) '-' ( EscapeChar / CharsetChar ) ;
    HexChar: '(x|X)(' Hexdigit+ ')' ;
    S: ([ \v\t\r\n] | '//' [ \v\r\t]* '\n')* ;
    PrintChar: //determined by 'isprint(c)' ;

Why the PegWizard Internally Builds a Parse Tree

The PegWizard reads a PEG grammar file and generates the corresponding C++ source file. This would be difficult without a parse tree, because a PEG grammar uses postfix notation (e.g. the quantifiers '*', '?', '+' appear at the end of the construct to which they apply), whereas the generated code uses prefix operators. Furthermore, code generation is only one of many tasks of the PegWizard. Other tasks are checking for left recursive rules, checking the completeness of the grammar, automatic insertion of error handling code and so on. Without a tree, each of these tasks would require reparsing the grammar file.
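The postfix-to-prefix problem disappears once the rule is held as a tree, because the quantifier is then simply the parent of the construct it applies to. The sketch below is not the PegWizard code generator: GKind, GNode, Wrap, Emit and SampleRule are hypothetical names, only the template names And, Or, OptRepeat, PlusRepeat and Char are taken from this article series, and only a few node kinds are handled. It assumes C++17, where std::vector may be instantiated with an incomplete element type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tree for the right side of a PEG rule.
enum class GKind { Seq, Alt, Star, Plus, Name, CharLit };

struct GNode {
    GKind kind;
    std::string text;         // nonterminal name or literal character
    std::vector<GNode> kids;  // operands; relies on C++17 vector rules
};

std::string Emit(const GNode& n);

// Emits name<kid, kid, ... > with all children in prefix form.
std::string Wrap(const std::string& name, const GNode& n) {
    std::string out = name + "<";
    for (std::size_t i = 0; i < n.kids.size(); ++i) {
        if (i > 0) out += ", ";
        out += Emit(n.kids[i]);
    }
    return out + " >";
}

// Translates a grammar tree into the corresponding template expression.
std::string Emit(const GNode& n) {
    switch (n.kind) {
        case GKind::Name:    return n.text;
        case GKind::CharLit: return "Char<'" + n.text + "'>";
        case GKind::Seq:     return Wrap("And", n);
        case GKind::Alt:     return Wrap("Or", n);
        case GKind::Star:    return Wrap("OptRepeat", n);
        case GKind::Plus:    return Wrap("PlusRepeat", n);
    }
    return std::string();
}

// Tree for the rule body Int(('+'/'-') Int)*
GNode SampleRule() {
    GNode plus{GKind::CharLit, "+", {}};
    GNode minus{GKind::CharLit, "-", {}};
    GNode intRef{GKind::Name, "Int", {}};
    GNode alt{GKind::Alt, "", {plus, minus}};
    GNode tail{GKind::Seq, "", {alt, intRef}};
    GNode star{GKind::Star, "", {tail}};
    return GNode{GKind::Seq, "", {intRef, star}};
}
```

Emit(SampleRule()) produces And<Int, OptRepeat<And<Or<Char<'+'>, Char<'-'> >, Int > > >, with OptRepeat in front of the construct although '*' stood behind it in the grammar text.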

Generally speaking, a parse or syntax tree makes almost anything easier. The only disadvantages of a physical tree are the high runtime and memory costs. But in the case of the PegWizard this does not matter, because grammar files are in general quite small. These days a language compiler almost certainly builds a physical tree out of the source text. Cfront [4], Stroustrup's C++ front end, for example, builds tree structures for all global elements and for the current function.

Future Directions

The PegWizard, presented in this article as an example of a tree building parser, will be improved and should soon be a useful tool. Currently, it only generates recognizing parsers.

Besides teaching the subject - namely parsing - a goal of this ongoing article series is to build highly efficient parsers. This will be investigated by comparing hand crafted parsers with schematically constructed ones.

Take-Home Points

It is easy to build syntax trees or parse trees during parsing. A syntax tree makes life (parsing and the later steps) much easier, but has its costs. When parsing speed matters and it seems feasible to go without a syntax tree, doing without one should be preferred.


History

2005 May: initial version.

References

1. Parsing Expression Grammars, Bryan Ford, MIT, January 2004
2. Parsing Expression Grammars, Wikipedia, the free encyclopedia
3. ANTLR Parser Generator
4. Bjarne Stroustrup, The Design and Evolution of C++, Addison Wesley 1995.
5. MFC Application Wizard