Scanners := a program that reads a string and brakes them into symbols Parsers := a program that reads symbols and checks if they are in the correct order. that is, if the syntax is correct In this section you will find tutorials about compiler design. That is, I will teach you how to design your own programming languages using C/C++, Java and compiler construction tools such as Flex, Bison and SableCC. If you are into games programming, then the ones you want to look into are Flex and Bison because they generate scanners (a program that reads a string and brakes them into symbols) and parsers (a program that reads symbols and checks if they are in the correct order. that is, if the syntax is correct) for C/C++. SableCC is a compiler framework for Java. It generates scanners and parsers for the Java language, as well as a tree representation of the programs along with the programs to process them. To begin with, you could start reading regular expressions because that is the language we use to define scanners. You then follow the tutorials on context-free grammars, which are used to describe the correctness of the language. That is, what forms valid sentences and what does not. After you have introduced yourself to both regular expressions and context-free grammars, You may follow the tutorials on SableCC, Flex and Bison. At the moment, you will only find a tutorial on how to write a compiler with SableCC. But I will add a Flex and Bison one soon. I hope you find this section useful.

Regular expressions

Regular expressions are a convenient means of specifying certain simple (though possibly infinite) sets of strings. They are of practical interest because regular expressions can be used to specify the structure of the tokens used in a programming language. In particular, regular expressions can be used to program a scanner generator. The sets of strings defined by regular expressions are termed regular sets. We start with a finite character set, or vocabulary (denoted V). This vocabulary is normally the character set used to form tokens. An empty or null string is allowed (denoted λ, "lambda"). Strings are built from characters in V via concatenation (for example, :=, begin, 123).The null string, when concatenated with any string S, yields S. That is, Sλ ≡ λS ≡ S.

Concatenation can be extended to sets of strings as follows: Let P and Q be sets of strings. Then string S ∈ (P Q) if and only if S can be broken into two pieces: S = S1 S2, such that S1 ∈ P and S2 ∈ Q. Small finite sets are veniently represented by listing their elements, which can be individual characters or strings of charaters. Parentheses are used to delimit expressions, and |, the alternation operator, is used to separate alternatives.

The characters (, ), ‘, and | are meta-characters (punctuation and regular expression operators). Meta-characters can be quoted when used as ordinal characters to avoid ambiguity. (Any character or string may be quoted, but unnecessary quotation is avoided to enhance readability.) For example:

Delim = (‘(’ | ‘)’ | := | ; | , | ‘+’ | - | ‘*’ | / | = / $$$$)

Alternation can be extended to sets of strings. Let P and Q be sets of strings. Then string S ∈ (P | Q) if and only if S ∈ P or S ∈ Q. Large (or infinite) sets are conveniently represented by operations on finite sets of characters and strings. Concatenation and alternation may be used. A third operation, Kleene closure, is also allowed. The operator * will represent the postfix Kleene closure operator. Let P be a set of strings. String S ∈ P* if and only if S can be broken into zero or more pieces: S = S1 S2 S3 … Sn such that each Si ∈ P. A string in P* is the concatenation of zero or more selections (possibly repeated) from P. (We explicitly allow n = 0, so that λ is always in P*.)

Regular expressions are defined as follows. Each regular expression denotes a set of strings ( a regular set).

  • Ø is a regular expression denoting the empty set (the set containing no strings).
  • λ is a regular expression denoting the set that contains only the empty string. Note that this set is not the same as the empty set, because it contains one element.
  • A string S is a regular expression denoting a set containing only S. If S contains meta-characters, S can be quoted to avoid ambiguity.
  • If A and B are regular expressions, denoting the alternation, concatenation, and Kleene closure of the corresponding regular sets.

Any finite set of strings can be represented by a regular expression of the form (S1 | S2 | … | Sn).

We often utilize the following operations as notational convenience. They are not strictly necessary, because their effect can be obtained (albeit somewhat clumsily) using the three standard regular operators (alternation, concatenation, kleene closure):

  • P+ denotes the strings consisting of one or more strings in P concatenated together: P+ = (P+ | λ) and P+ = P P*.
  • If A is a set of characters, Not(A) denotes (V – A); that is, all characters in V not included in A. Since Not(A) is finite, it is trivially regular. It is possible to extend Not to strings, rather that just V. That is, if S is a set or strings, we can define Note(S) to be (V* – S). Although it may be infinite, this set is also regular.
  • If k is a constant, the set Ak represents all strings formed by concatenating k (not necessarily different) strings from A. That is, Ak = (A A A …) (k copies).

We now illustrate how regular expression can be used to specify tokens. Let

D = (0 | … | 9) L = (A | … | Z)

Then
  • A comment that begins with –– and ends with Eol (end of line) can be defined as:

    Comment = –– Not(Eol)* Eol

  • A fixed decimal literal can be defined as:

    Lit = D+ . D+

  • An identifier, composed of letters, digits, and underscores, that begins with a letter, ends with a letter or digit, and contains no consecutive underscores, can be defined as:

    ID = L (L | D)* (_ (L | D)+ )*

  • A more complicated example is a comment delimited by ## markers which allows single #’s within the comment body:

    Comment2 = ## ((# | λ) Not(#))* ##

All finite sets and many infinite sets, such as those just listed, are regular. But not all infinite sets are regular. For example, consider { [k]k | I ≥ 1}, which is the set of balanced brackets of the form [[[[[ … ]]]]]. This is a well-known set that is not regular. The problem is that any regular set either does not get all balanced nesting or it includes extra, unwanted strings.

It is easy to write a context-free grammar (described in the next section) that defines balanced brackets precisely. All regular sets can be defined by context-free grammars; thus, the bracket example shows that context-free grammars are more powerful descriptive mechanism that regular expressions. Regular expressions are, however, quite adequate for specifying token-level syntax.



If you have any queries, please do not hesitate to contact me.


Traditionally, context–free grammars have been used as a basis of the syntax analysis phase of compilation. A context–free grammar is sometimes called a BNF (Backus–Naur form) grammar.

Informally, a context–free grammar is simply a set of rewriting rules or productions. A production is of the form

A → B C D … Z


A is the left–hand–side (LHS) of the production. B C D … Z constitute the right–hand–side (RHS) of the production. Every production has exactly one symbol in its LHS; it can have any number of symbols (zero or more) on its RHS. A production represents the rule that any occurrence of its LHS symbol can be represented by the symbols on its RHS. The production

<program> → begin <statement list> end

states that a program is required to be a statement list delimited by a begin and end.

Two kinds of symbols may appear in a context–free grammar: nonterminal and terminals. In this tutorial, nonterminals are often delimited by < and > for ease of recognition. However, nonterminals can also be recognized by the fact that they appear on the left–hand sides of productions. A nonterminal is, in effect, a placeholder. All nonterminals must be replaced, or rewritten, by a production having the appropriate nonterminal on its LHS. In constrast, terminals are never changed or rewritten. Rather, they represent the tokens of a language. Thus the overall purpose of a set of productions (a context–free grammar) is to specify what sequences of terminals (tokens) are legal. A context–free grammar does this in a remarkably elegant way: We start with a single nonterminal symbol called the start or goal symbol. We then apply productions, rewriting nonterminals until only terminals remain. Any sequence of terminals that can be produced by doing this is considered legal. To see how this works, let us look at a context–free grammar for a small subset of Pascal that we call Small Pascal. λ will represent the empty or null string. Thus a production

A → λ


states that A can be replaced by the empty string, effectively erasing it.

Programming language constructs often involve optional items, or lists of items. To cleanly represent such features, an extended BNF notation is often utilised. An optional item sequence is enclosed in square brackets, [ and ]. For example, in

<program> → begin <statement sequence> end


a program can be optionally labelled. Optional sequences are enclosed by braces, { and }. Thus in

<statement sequence> → <statement> {<statement>}


a statement sequence is defined to be a single a statement, optionally followed by zero or more additional statements.

An extended BNF has the same definitional capability as ordinary BNFs. In particular, the following transforms can be used to map extended BNFs into standard form. An optional sequence is replaced by a new nonterminal that generates λ or the items in the sequence. Similarly, an optional sequence is replaced by a new nonterminal that genereates λ or the sequence of followed by the nonterminal. Thus our statement sequence can be transformed into

<statement sequence> → <statement> <statement tail>
<statement tail> → λ
<statement tail> → <statement> <statement tail>

The advantage of extended BNFs is that they are more compact and readable. We can envision a preprocessor that takes extended BNFs and produces standard BNFs.

Figure 3.1 shows a context–free grammar for a Small Pascal. An augmenting production

<system goal> → <program> EOF


appears in the grammar to make sure that the string matched by <system goal> includeds all the input. It specifies that the end–of–file marker, EOF, must follow after the last valid token of a program.

                                        1. <system goal> → <program> EOF
                                        2. <program> → begin <statement sequence> end .
                                        3. <statement sequence> → <statement> {<statement>}
                                        4. <statement> → ID := <expression> ;
                                        5. <statement> → readln ( <identifier list> ) ;
                                        6. <statement> → writeln ( <expression list> ) ;
                                        7. <identifier list> → ID {, ID}
                                        8. <expression list> → <expression> {, <expression>}
                                        9. <expression> → <factor> <addition operator> <factor>
                                        10. <factor> → ( <expression> )
                                        11. <factor> → ID
                                        12. <factor> → INTLITERAL
                                        13. <addition operator> → PLUS
                                        14. <addition operator> → MINUS

Figure 3.1 – Extended BNF defining Small Pascal

To see how this grammar defines legal Small Pascal programs, let’s follow the derivation of one such program, begin ID := ID + ( INTLITERAL – ID ) ; end EOF, starting from the nonterminal <system goal>.

<system goal>
<program> EOF                                                                 (rule 1)
begin <statement sequence> end EOF                                 (rule 2)
begin <statement> {, <statement} end EOF                         (rule 3)
begin <statement> end EOF                                                (Choose 0 repetitions)
begin ID := <expression> ; end EOF                                    (rule 4)
begin ID := <factor> <add op > <factor> ; end EOF             (rule 9)
begin ID := <factor> + <factor> ; end EOF                          (rule 13)
begin ID := ID + <factor> ; end EOF                                    (rule 11)
begin ID := ID + ( <expression> ) ; end EOF                         (rule 10)
begin ID := ID + ( <factor> {<add op> <factor>} ) end EOF  (rule 9)
begin ID := ID + ( <factor> <add op> <factor> ) end EOF     (Choose 1 repetition)
begin ID := ID + ( <factor> – <factor> ) end EOF                  (Apply rule 14)
begin ID := ID + ( INTLITERAL – <factor> ) end EOF              (Apply rule 12)
begin ID := ID + ( INTLITERAL – ID ) end EOF                        (Apply rule 11)

At this point no nonterminals remain, so our derivation of a Small Pascal program is completed.

A context–free grammar defines a language, which is a set of sequences of tokens. Any sequence of tokens that can be derived using the grammar is valid; any sequence that cannot be derived is not valid. Actually, to be precise, any token sequence derivable from a context–free grammar is considered syntactically valid. When static semantics are checked by the semantic routines, semantic errors in a syntactically valid program may be discovered. For example, in Pascal the statement

A := ‘X’ + True;


has no syntax errors but it does have a semantic error: the operator + is not defined for adding a character to a boolean value.

Structure as well as syntax can be defined in a context–free grammar. For expressions, this includes associctivity and operator precedence. Associativity is concerned with the order in which consecutive instances of an operator are applied (as in A–B–C). Operator precedence refers to the relative priority of operators. For example, we expect A+B*C to mean A+(B*C), since * is usually considered to be a higher precedence operator than +. Small Pascal has only one level of precedence, because it does not include multiplication nor addition. If multiplication were included, it would be at a higher level of precedence than addition. The following grammar fragment defines such a precedence relationship.

                                        <expression> → <term> {<add op> <term>}                                         <term> → <factor> {<mult op> <factor>}
                                        <factor> → ( <expression> )
                                        <factor> → ID
                                        <factor> → INTLITERAL

Examining a derivation tree for the expression A+B*C, which is derived from this grammar fragment, illustrates how this definition works. A derivation tree is formed by showing nonterminals expansions as subtrees. Figure 3.2 shows the tree for this expression.


Figure 3.2 – Derivation Tree for A+B*C

The tree shows that * has a higher precedence than + because the second and third Ids are grouped together (with *) in a subtree, and then this subtree is grouped together with the first ID the first ID (using +) in the main derivation tree.

Now we may ask: can a derivation tree with wrong precedence ever be formed in this grammar? No – the production rules don’t allow it. Try to build ID+ID*ID with + applied first. Since a * can only be generated by a <term>, ID+ID must appear unless the subexpression is enclosed in parentheses. With parentheses, the desired grouping can be forced.

We have so far introduced regular expressions and context–free grammars, which are respectively used for defining the lexer and parser of a grammar. There is one more thing the reader must know, which is the type of grammar used in SableCC. In this paper we will give just a brief idea of a particular grammar, LALR(1), a subset of LR(1), which is commonly used in the description of almost any programming language. If you would like to know more about other types of grammars, such as LL(k) or LR(k), then you should refer to [1] or [2].

Instead of describing the LALR(1) type of grammar in full, we will just outline the conflicts that may occur in this type of grammar. There are two types of conflicts that may occur in this type of grammar: Reduce⁄Reduce and Shift⁄Reduce conflicts. A reduce⁄reduce conflict occurs when the parser does not know which production to reduce. For example, the grammar

A → Bc | d B →


is not LALR(1) because if the input to the parser is d, the parser will not know if it reduces to B or if it reduces to A, because both A and B can be replaced by d. To solve this conflicts, we eliminate production B, which is a subset of A, and change the grammar to

A → Ac | d


Now if the input to the parser is a d, it will reduce to A.

As for shift⁄reduce conflicts, they occur when the parser does not know if it reduces or if it shifts (pushes the token or nonterminal to the stack) the current input. For instance, let’s analyse the grammar

                                        A → BaDe | BaDeCe
                                        B → c
                                        D → b
                                        C → s

Suppose that the input is cabe. When the parser receives this input, it first reduces c to B, then it shifts it (pushes it to the stack). It then reads a and shifts it, next it reads b, reduces it to D and shifts it, finally it reads e and know it does not know if it shifts e or if it reduces the string cabe to A.

Now that you know the theory behind regular expressions and grammars you can dive into the Sablecc tutorial, where I show you how to write a small compiler using a compiler generator tool called SableCC.

If you have any queries regarding this tutorial, just drop me an email.

References:

  1. Andrew W. Appel, “Modern compiler implementation in Java”, Cambridge University Press, 1998.
  2. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers Principles, Techniques, and Tools”, Addison-Wesley, 1986.

Writing a compiler with SableCC

To create a compiler with SableCC we follow the following steps:

  1. We create a SableCC specification file containing the lexical definitions and the grammar for the language being designed.
  2. After creating the SableCC specification file, we generate the framework by launching SableCC on the specification file. .
  3. At this stage, we generate the working classes. This is the only code we have to write in JavaTM. It is in this step that we write the semantic analyzer, the code generator, and possibly the code optimizer. In the case of an interpreter, we write a single class. These working classes may be subclasses of one of the classes from the analysis subfolder generated by SableCC in the previous step.
  4. It is during this step that we create the driver class of the compiler. It is used to activate the lexer, parser and working classes.
  5. Finally, in this step, we compile the compiler with a JavaTM compiler.

To illustrate this, we are going to implement a subset of Pascal by following the steps outlined above. We are going to implement a subset instead of the full language, because we want to give the reader an idea of how SableCC works.

We are now going to start the implementation

SableCC specification file

Before we start implementing the SableCC specification file for the subset of Pascal, we are going to describe the syntax of SableCC using BNF. I am assuming that you have read the introduction to regular expressions and the context–free grammars.

The syntax of SableCC is as follows:

<grammar> → [<package declaration>] [<helper declarations>]
[<states declarations>] [<token declarations>]
[<ignored tokes>] [<productions>]

As we can see from the grammar, a SableCC specification file may by an empty file. Remember the meaning of anything between [ and ] ? This means that anything between [ and ] can only be the empty string or a single occurrence of whatever is between the brackets.

From the allowed sections in a SableCC specification file, we will not take into account the <state declarations> because we will not use it. But if the reader is interested to know how it works, he should refer to [1]. We will now describe the syntax of the other productions. The <package declaration> is used to name the destination root, that is, the directory where all the generated files will be placed. We declare the name of the package as follows:

Package uk.co.brainycreatures.smallpascal;

In this example, the root directory will be uk\co\brainycreatures\smallpascal. Meaning that all the genarated subfolders will be placed under this directory. A package name may be any sequence of identifiers, starting with a letter followed by zero or more letters or digits, separated by a dot. The <helper declarations> works like constants in Pascal. And, as its name imply, it is used as a helper of something, in this case the <token declarations> section (more about this next). To see how a <helper declarations> is helpful let’s examine the following regular expression:

id = [‘a’ .. ‘z’] ([‘a’ .. ‘z’] | [‘0’ .. ‘9’])*

In this example we could simplify this regular expression by defining digit = [‘0’ .. ‘9’] and letter = [‘a’ .. ‘z’], then our regular expression ID becomes

id = letter (letter | digit)*

Letter and digit are our helpers. Here’s how we declare helpers in SableCC:

Helpers
letter = [‘a’ .. ‘z’];
digit = [‘0’ .. ‘9’];

After the <helper declarations> follows the <token declarations>, which is where we define our terminals or tokens. Lets use the example above to show how to declare tokens in SableCC:

Tokens
id = letter (letter | digit)*;

This is very similar to regular expressions as described in the Regular Expressions tutorial. A token defined by a string of characters is declared between quotes, e.g. program = ‘program’, and every declaration ends with a semicolon. Following the <token declarations> is the <ignored tokens>, in other words, the section where we declare the tokens to be ignored by the parser. For example, comments, blanks, carriage return, etc.

Here’s an example:

Helpers
any_charater = [0x0 .. 0xfffff];
nl = ‘\n’;
Tokens
comment = ‘⁄⁄’ any_character nl
blank = 10 | 10 13 | 9;
Ignored Tokens
comment,
blank;

In this case, comment and blank will be ignored by the parser.

Finally, in the <productions> section, we declare the productions grammar for the language. The productions are defined in BNF or EBNF (Extended Backus–Naur Form).

Here’s an example of how to declare productions:

Tokens
identifier = letter (letter | digit)*;
number = digit+;
plus = ‘+’;
Productions
expression = identifier plus identifier minus number ;

As we can see it is similar to the context-free grammars presented in the grammars tutorial. The only difference is that in SableCC we don‘t declare a nonterminal surrounded by < and >, and we replace the → by =. In the productions section, it is sometimes required to precede a token by T. and a production by a P. This happens when we have a token and a production with the same name. For example if we have a production program and a token program, then in the production below we must precede program by T., because will not know if it is the token program or the production program.

Tokens
program = ‘program’;
semicolon = ‘;’ ;
Productions
program = T.program identifier semicolon ;

Now that we are familiar with the syntax of SableCC, let’s implement our subset.

The root of our package will be uk.co.brainycreatures.smallpascal. Hence in the package declaration we will have

Package uk.co.brainycreatures.smallpascal;

After defining the package, we are going to declare our helpers, which are shown below:

Helpers
          a = ‘a’ | ‘A’ ;
          b = ‘b’ | ‘B’ ;
          e = ‘e’ | ‘E’ ;
          g = ‘g’ | ‘G’ ;
          i = ‘i’ | ‘I’ ;
          m = ‘m’ | ‘M’;
          n = ‘n’ | ‘N’ ;
          o = ‘o’ | ‘O’ ;
          p = ‘p’ | ‘P’ ;
          r = ‘r’ | ‘R’ ;
          t = ‘t’ | ‘T’;
          v = ‘v’ | ‘V’ ;
          w = ‘w’ | ‘W’;
          cr = 13 ; ⁄⁄ carriage return
          lf = 10 ; ⁄⁄ line feed
          tab = 9 ; ⁄⁄ tab char
          ascii_char = [32 .. 127] ;
          blank = ‘’ ;
          digit = [‘0’ .. ‘9’] ;
          letter = [[‘a’ .. ‘z’] + [‘A’ .. ‘Z’]] ;
          l_brace = ‘{’ ;
          r_brace = ‘}’ ;
          l_paren_star = ‘(*’ ;
          r_paren_star = ‘*)’ ;

Note the space between the letters in the definition of the keywords. In this case we need them, because each letter is a helper token. So, in defining our tokens, we have a sequence of helpers separated by a space.

Finally, we will describe the grammar for the language. Here it is:

Productions
          program =
                    program_heading
                    global_declaration_part
                    begin
                    main_statement_part
                    end dot;
          program_heading =
                    {non_empty} program identifier semicolon |
                    {empty} ;
          global_declaration_part =
                    var_declaration;
          var_declaration =
                    var var_decl+;
          var_decl =
                    identifier_list colon type;
          identifier_list =
                    {single} identifier |
                    {sequence} identifier_list comma identifier;
          type =
                    integer;
          main_statement_part =
                    statement_sequence;
          statement_sequence =
                    {single} statement |
                    {sequence} statement_sequence semicolon statement ;
          statement =
                    {assign} identifier assignop expression |
                    {writeln} writeln_stmt ;
          writeln_stmt =
                    {simple}writeln |
                    {arguments} writeln l_paren expression_list r_paren ;
          expression_list =
                    {single} expression |
                    {sequence} expression_list comma expression ;
          expression =
                    {term} term |
                    {plus} expression plus term |
                    {minus} expression minus term ;
          term =
                    {factor} factor |
                    {mult} term mult factor |
                    {div} term div factor;
          factor =
                    {identifier} identifier |
                    {integer} integer;

Generating the working classes

In this section we will implement the semantic analyser and the code generators for the language. We start by implementing the semantic analyser below.

The semantic analyser

The semantic analysis phase completes the static analysis (parsing) phase of the source program, by ensuring that the grammatically correct statements passed to it by the parser ‘make sense’. For instance, the syntax of assignment statements in Pascal does not require that the identifiers have been declared, that they are variables, that they are of similar type, and that the operations can be performed on those types. All these restrictions are defined by the static semantics of Pascal, and the semantic analyser performs the corresponding checks. To perform this checking, the semantic analyser completely processes the declarations and creates equivalent information known as property information. This information is stored in a data structure known as symbol table. The symbol table is used so that given any identifier the associated information may be found.

For this small subset, the only checking that needs to be performed is that the identifiers are declared before use and that they are declared only once. There is no need for type checking because all the identifiers are of the same type. The semantic analysis is processed by a depth first traversal on the abstract syntax tree, storing information about identifiers in the symbol table after having visited the subtrees of the production containing the variables declaration. These information will then be fetched as we encounter identifiers in productions factor and variable, respectively.

Now that we have the required information to implement the semantic analyser, we will now describe how to implement it. Returning to the grammar above, we are now going to explain the effect of the names between { and } in the alternatives of each production. The name of each alternative in the productions of the grammar above are prefixed by A and concatenated by the name of the production to produce a type name for the alternative. For example, for the production below

Factor =           {identifier} identifier | {integer} integer_literal | … ;

a type named AIdentifierFactor and a type name AIntegerFactor will be produced.

Given the information above, we design our semantic analyser by defining the methods for the alternatives including identifiers. That is, alternatives ASingleIdentifierList, ASequenceIdentifierList, AAssignStatement and AIdentifierFactor. Also, we need to process alternative AVarDecl. Note that for the alternative for var_decl doe not have a name. This is allowed in SableCC if the production is defined by one alternative only, and the resulting type is the name of the production prefixed by A.

Here is the implementation of our semantic analyser:

package uk.co.brainycreatures.smallpascal;

import uk.co.brainycreatures.smallpascal.node.*;
import uk.co.brainycreatures.smallpascal.analysis.*;
import java.util.*; ⁄⁄ for the Hashtable

public class SemanticAnalyzer extends DepthFirstAdapter {
     ⁄⁄ stores the identifiers being defined
     Hashtable symbol_table = new Hashtable(); ⁄⁄ create a new table

     ⁄⁄ check if the identifier is already in the table and report an error
     ⁄⁄ if it is
     public void outASingleIdentifierList(AidentifierList node) {
          ⁄⁄ identifier to be stored in the symbol table
          TIdentifier ident = node.getIdentifier();
          ⁄⁄ name of the identifier to be stored in the table
          String key = ident.getText().toUpperCase();

          ⁄⁄ is the identifier in the table?
          if (symbol_table.containsKey(key)) { ⁄⁄ report an error
               System.out.println(“Identifier already defined.”);
               System.exit(0);
          }
          else {
               symbol_table.put(key, key);
          }
     }

     ⁄⁄ same as above
     public void outASingleIdentifierList(AidentifierList node) {
          ⁄⁄ identifier to be stored in the symbol table
          TIdentifier ident = node.getIdentifier();
          ⁄⁄ name of the identifier to be stored in the table
          String key = ident.getText().toUpperCase();

          ⁄⁄ is the identifier in the table?
          if (symbol_table.containsKey(key)) { ⁄ report an error
               System.out.println(“Error: [“ ident.getLine() + “,” + ident.getPos() + “] Identifier already defined.”);
               System.exit(0);
          }
          else {
               symbol_table.put(key, key);
          }
     }

     ⁄⁄ checks if the identifier in the assignment statement was previously
     ⁄⁄ declared, and report an error if it wasn’t
     public void outAAssignStatement(AAssignStatement node) {
          Tidentifier ident = node.getIdentifier();
          String key = ident.getText().toUpperCase();

          ⁄⁄ Is the identifier in the table?
          ⁄⁄ if not report error
          if (!symbol_table.containsKey(key)) {
               System.out.println(“Error: [“ + ident.getLine() + “,”
               ident.getPos() + “] Unknown identifier.”);
               System.exit(0);
          }
     }

     ⁄⁄ same as above
     public void outAIdentifierFactor(AIdentifierFactor node) {
          Tidentifier ident = node.getIdentifier();
          String key = ident.getText().toUpperCase();

          ⁄⁄ Is the identifier in the table?
          ⁄⁄ if not report error
          if (!symbol_table.containsKey(key)) {
               System.out.println(“Error: [“ + ident.getLine() + “,” + ident.getPos() + “] Unknown identifier.”);
               System.exit(0);
          }
     }
}

We define the class SemanticAnalyser as extending class DepthFirstAdapter.

This automatically provides a depth–first traversal of the AST. To check the declarations of identifiers we just need methods outASingleIdentifierList and outASequenceIdentifierList. To check if they were declared before use we just need methods outAAssignStatement and outAIdentifierFactor. For more about how SableCC generates this methods refer to chapters 5 and 6 of [1]. We are now done with the semantic analyser. For our code generation we do the same thing, but we generate code as well as we process the identifiers in the symbol table.

The Class Generator

The ClassGenerator is also a subclass of class DepthFirstAdapter. To implement the code generator we need to do a depth first traversal of the tree, just as we did with the semantic analyser. But for the code generator we will use alternatives from other productions as well. For example, we will use the program_heading to define the name of the class to be generated. That is, we will use the name of the identifier that follows the program keyword, if the program heading is present in the program. If the program heading is not present in the program, the name of the class will be a.out, just as the executables generated by a C compiler in a Unix based system. We will now start the description of the design of our code generator. For example, from the production

program_heading =
          {non_empty} T.program identifier semicolon |
          {empty} ;

we will define methods outANonEmptyProgramHeading and outAEmptyProgramHeading. To generate a class file using the Jas API we need to create an instance of class ClassEnv first. Then we set the class attributes: its name, its super class, the source from which it is being compiled and its access modifiers. This is all done in the methods outlined above. We also need to create a default constructor. Although we don’t realise, the Java compiler generates a default constructor for us. This is how it looks in Java assembly:

.method public <init>()V
     aload_0
     invokevirtual java⁄lang⁄Object⁄<init>()V
     return
.end method

To define this method, we can either do it in production program_heading, or even before production program traverses its subtrees. More precisely, on the definition of its inAProgram method. It is up to the programmer’s preference. In our case, we are going to define the constructor in method inAProgram. To define a method we need to define an instance of class CodeAttr, then we add the instructions with its addInsn method. We also need to create a new instance of class CodeAttr for object main_code. Here is how we implement it:

public void inAProgram(AProgram node) {
     ⁄⁄ create an new instance of CodeAttr
     init = new CodeAttr();
     main_code = new CodeAttr(); ⁄⁄ create a new object for main program body
     try {
          ⁄⁄ define the method
          init.addInsn(new Insn(opc_aload_0));
          init.addInsn(new Insn(opc_invokevirtual, new MethodCP(“java/lang/Object”, “<init>”, “()V”)));
          init.addInsn(new Insn(opc_return));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

After defining the method, we must add it to the class. This will be done in either method outAEmptyProgramHeading or outANonEmptyProgramHeading. Here is its implementation:

public void outANonEmptyProgramHeading(ANonEmptyProgramHeading node) {
     ⁄⁄ name of the class to be generated
     class_name = node.getIdentifier().getText();

     try {
          ⁄⁄ set class attributes
          main_class.setClass((new ClassCP(class_name));
          main_class.setSuper(new ClassCP(“java/lang/Object”));
          ⁄⁄ source from which it is compiled
          main_class.setSource(new SourceAttr(source_file_name));
          ⁄⁄ make class public
          main_class.setClassAccess((short) ACC_PUBLIC);
          ⁄⁄ add the constructor method to the class
          main_class.addMethod((short) ACC_PUBLIC, , “<init>”, “()V”, init, null);
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

The code is self explanatory, so we won’t go into more details.

The code for method outAEmptyProgramHeading is similar to the code above, except for the name of the class. It sets class_name to “a.out”. Proceeding with the grammar, we are now going to define the code that processes identifiers; that is, the code that defines the identifiers. Pascal variables will be defined as Java fields. To define fields in the Jas API, we use method addField of class ClassEnv. The name of the generated variables will be in lowercase. We don’t need a symbol table in order to fetch the signature of a field. Rather, we use the name of the class to which it belongs. Here is the implementation:

public void outASingleIdentifierList(AsingleIdentifierList node) {
     ⁄⁄ get the name of the name of the variable in lower case
     String var_name = node.getIdentifier().getText().toLowerCase();

     try {
⁄⁄ add the field to the class
          main_class.addField(new Var((short) (ACC_STATIC | ACC_PUBLIC)), new AsciiCP(var_name), new AsciiCP(“I”), new ConstAttr(new IntegerCP(0)))));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

The code above is equivalent to the following Java declaration:

public static int <var_name> = 0;

where <var_name> is the same as the variable being declared.

The code for method outASequenceIdentifierList is the same as the code for outASingleIdentifierList above. So, there is no need to describe it again.

Lets now describe how we implement the statements. Before going any further lets analyse how we translate Pascal statements into Java assembly. Because the Java Virtual Machine is stack based, we need to think in terms of stack operations only. So to translate an assignment statement we push all the factors on the right hand side to the stack and then we pop the result into the variable on the left hand side. For example, the assignment statement

a := b + c

is translated to the following Java assembly code:

getstatic <class name>⁄b I
getstatic <class name>⁄c I
iadd
putstatic <class name>⁄a I

where <class name> is the name of the class to which each of the fields a, b and c belong. In Java we use the ‘.’, but in Java assembly we use the ‘/’ to access either methods or fields. The I following the name of the variable means that this variable is of type int. The instruction getstatic is used to get the value of static fields of a class, and the instruction putstatic is used to store values into static fields of a class. The other arithmetic instructions that will be used in the code generation of the compiler for this subset are:

isub – used to subtract integer numbers. This instruction expects to find two numbers on
     the top of the stack, and if they are there it pops them, subtracts them and pushes
     the result onto the top of the stack.

imul – the same as isub, but it multiplies instead

idiv – used to divide integers

In order to generate the appropriate code, we need to traverse the subtrees of the corresponding operator, and after visiting them we generate the opcode corresponding to this same operator. We need to push numbers and identifiers as we encounter them in production factor, and then generate the opcode for the arithmetic instruction in the alternatives of productions expression and term. Here are the implementations:

public void caseAIdentifierFactor(AidentifierFactor node) {
     String var_name = node.getIdentifier().getText().toLowerCase();
     try {
          ⁄⁄ getstatic <class_name>⁄<var_name> I
          code.addInsn(new Insn(opc_getstatic, new FieldCP(class_name, var_name, “I”)));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

Code for integer factors:

public void caseAIntegerFactor(AIntegerFactor node) {
     ⁄⁄ get the string value
     String num_image = node.getIdentifier().getText();
     int value = Integer.parseInt(num_image);

     try {
          switch(value) {
               case –1 :
                    code.addInsn(new Insn(opc_iconst_m1)); break;
               case 0:
                    code.addInsn(new Insn(opc_iconst_0)); break;
               case 1 :
                    code.addInsn(new Insn(opc_iconst_1)); break;
               case 2:
                    code.addInsn(new Insn(opc_iconst_2)); break;
               case 3:
                    code.addInsn(new Insn(opc_iconst_3)); break;
               case 4:
                    code.addInsn(new Insn(opc_iconst_4)); break;
               case 5:
                    code.addInsn(new Insn(opc_iconst_5)); break;
               default:
                    if (value => –128 && value <= 127) {
                         code.addInsn(new Insn(opc_bipush, value));
                    }
                    else if (value >= –65536 && value <= 65535) {
                    code.addInsn(new Insn(opc_sipush, value));
                    }
                    else {
                         code.addInsn(new Insn(opc_ldc, value));
                    }
                    break;
          }
     }
}

Code for addition operator:

public void outAPlusExpression(APlusExpression node) {
     try {
          code.addInsn(new Insn(opc_iadd));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

Code for subtraction operator:

public void outAMinusExpression(AMinusExpression node) {
     try {
          code.addInsn(new Insn(opc_isub));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

Code for multiplication operator:

public void outAMultTerm(AMultTerm node) {
     try {
          code.addInsn(new Insn(opc_imul));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

Code for division operator:

public void outADivTerm(ADivTerm node) {
     try {
          code.addInsn(new Insn(opc_idiv));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

Code for assignment statement:

public void outAAssignStatement(AassignStatement node) {
     String var_name = node.getIdentifier().getText().toLowerCase();

     try {
          code.addInsn(new Insn(opc_putstatic, new FieldCP(class_name, var_name, “I”)));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

Code for writeln statement:

public void inAWritelnStatement(AwritelnStatement node) {
     try {
          code.addInsn(new Insn(opc_getstatic, new FieldCP(“java/lang/Object”, “out”, “Ljava/io/PrintStream;”)));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}


public void outAWritelnStatement(AwritelnStatement node) {
     try {
          code.addInsn(new Insn(opc_invokevirtual, new MethodCP(“java/io/PrintStream”, “println”, “(I)V”)));
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

After generating all the code, we add an extra return instruction, we add the method to the class, and we dump the class. This is done in method outAProgram, that is at the exit of a program. The code is a follows:

public void outAProgram(AProgram node) {
     try {
          ⁄⁄ add return instruction to the end of the method
          code.addInsn(new Insn(opc_return);
          ⁄⁄ add main method to main_class
          main_class.addMethod((short)( ACC_PUBLIC | ACC_STATIC), “main”, “([Ljava/lang/String;)V”, code, null));

          ⁄⁄ generate class file
          main_class.write(new DataOutputStream(new FileOutputStream(class_name + “.class”)));

          ⁄⁄ output status
          System.out.println(“Wrote “ + class_name + “.class”]”);
     }
     catch (Exception e) {
          System.out.println(e);
     }
}

The Main class

Now that we have implemented the working classes, it time to implement the Main class. This class simply activates the lexer, parser, semantic analyser and the class generator. This is always the same for every compiler or interpreter implemented using SableCC. The code for the main class is shown below:

package uk.co.brainycreatures.smallpascal;

import uk.co.brainycreatures.smallpascal.semantic.SemanticAnalyser;
import uk.co.brainycreatures.smallpascal.code.ClassGenerator;
import uk.co.brainycreatures.smallpascal.parser.*;
import uk.co.brainycreatures.smallpascal.lexer.*;
import uk.co.brainycreatures.smallpascal.node.*;

import java.io.*;

public class Main {
     public static void main(String[] args) {
          long start_time, stop_time; ⁄⁄ times compilation

          if (args.length < 1) {
               System.out.println(“Usage:”);
               System.out.println(“ java uk.co.brainycreatures.smallpascal.Main <filename>”);
          }

          try {
               start_time = System.currentTimeMillis();

               ⁄⁄ create lexer
               Lexer lexer = new Lexer (new PushbackReader(new BufferedReader(new FileReader(args[0])), 1024));

               ⁄⁄ parser program
               Parser parser = new Parser(lexer);

               Start ast = parser.parse();

               ⁄⁄ check program semantics
               ast.apply(new SemanticAnalyser());

               ⁄⁄ generate class file
               ast.apply(new ClassGenerator());
          }
          catch (Exception e) {
               System.out.println(e);
          }
     }
}

Now that we have implemented everything, it is time to compile all the source code. Assuming that the current working directory is in our CLASSPATH, we compile the compiler as follows:

C:\mycompilers\javac uk\co\brainycreatures\Main.java

If no errors were found in the source, we will have all the sources compiled. We then just have to debug the compiler. That is, test it with small pascal programs. The best way to test a compiler is by inputig erroneous data to it. That is, invalid programs.

If you have any queries, please do not hesitate to contact me.

Best Regards
Fidel.

References:

  1. Etienne Gagnon,"SableCC, An Object–Oriented Compiler Framework", Master’s thesis, McGill University, Montreal, Quebec, March 1998.




Source : http://www.brainycreatures.co.uk/compiler/intro.asp
Hosted by www.Geocities.ws

1