Project Documentation JCOMP Subset of Java Compiler

IMPLEMENTATION OF JCOMP

   The Jcomp – Subset of java compiler is implemented in six different phases. They are:
            •   Grammar Generation Phase
            •   Lexical Analysis Phase
            •   Syntax Analysis Phase
            •   Semantic Analysis Phase
            •   Symbol Table Management
            •   Byte Code Generation
            Apart from these phases, error management is integrated with almost all these phases. The actual relationship of these phases is as shown in the figure 4.

We know that the standard java compiler produces a class file that is able to run on a standard JVM. Our compiler also generates a class file in the same manner. The process is illustrated in the Figure 3.

FIGURE 3 – COMPILATION PROCESS

All these phases (except the first) are implemented as separate java files in a package called compiler. Apart from these files there are 4 more files which help in the compilation process. They are:

            •   Token.java
            •   TokenConstants.java
            •   ParserUtil.java
            •   ASTVisitor

FIGURE 4 – PHASES OF THE COMPILER

The first two files form the basis for the tokenizing of the input file. The first two files are created by applying the logic specified by Dr.Beverly Sanders, University of Florida. The ParserUtil.java is used for loading the classes in the context checking phase. The last ASTVisitor is an interface which is also used in the context checking phase. In this way all these phases are separated clearly and implemented separately and they are tested individually. Note that the symbol table management enters the scene only at the context checking phase and not at the parsing phase as expected. Now let us discuss the implementation of each phase individually.

3.1 GRAMMAR GENERATION

            The syntax of programming language constructs can be described by context-free grammars or BNF (Backus-Naur Form) notation. Grammars offer significant advantage to both language designers and compiler writers.
    • A grammar gives precise, yet easy-to-understand, syntactic specification of a programming language.
    •From certain classes of grammars we can automatically construct an efficient parser that determines if a source program is syntactically well formed. As an additional benefit, the parser construction process can reveal syntactic ambiguities and other difficult-to-parse constructs that might otherwise go undetected in the initial design phase of a language and its compiler.
    • A properly designed grammar imparts a structure to a programming language that is useful for the translation of source programs into correct object code and for the detection of errors. Tools are available for converting grammar-based descriptions of translations into working programs.
    • Languages evolve over a period of time, acquiring new constructs and performing additional tasks. These new constructs can be added to a language more easily when there is an existing implementation based on a grammatical description of the language.

TABLE 1 – GRAMMAR OF THE JCOMP

+ - 1 or more times
* - 0 or more times

<compilationunit>    ::=    <classdeclaration> | e
<classdeclaration>    ::=    <public> class identifier <classbody>
<public> := public | e
<classbody>              ::=    { <classbodydeclaration>* }
<classbodydeclaration>    ::= <modifier> <static> <final> <type> identifier ;
                       | <modifier> <static> <type> identifier ( <formalparamaterlist> ) <block>
                       | <modifier> identifier ( <formalparamaterlist> ) <block>
<modifier>                       ::=    public | private | protected
<static> := static | e
<final> := final | e
<formalparameterlist>       ::=    <vardecs> | e
<vardecs>        ::=    <vardec> (, <vardec> )*
<vardec>         ::=     <type> identifier
<block>                            ::=    {   (<vardec> ; )* <statement> * }
<statement>                     ::=   if ( <expression> ) { <statement> * }
                                            | if ( <expression> ) { <statement>* } else { <statement>* }
                                            | while ( <expression> ) { <statement> * }
                                            | <statementwithouttrailingsubstatement>
<statementwithouttrailingsubstatement>    ::=        <expressionstatement>
                                                                           |      <returnstatement>
<returnstatement>        ::=    return <returntail> ;
<returntail> ::= <_expression> | e
<expressionstatement>    ::=    <statementexpression> ;
<statementexpression>    ::= <statementexpr-prefix> <statementexpr-suffix>
<statementexpr-prefix> ::= <this-and-name>

<statementexpr-suffix>    ::= <assignment-suffix>
                                           |    <methodinvocation-suffix>
<methodinvocation-suffix>    ::=    ( <argumentList> )
<assignment-suffix>        ::=        = <expression>
<argumentlist>        ::=     <expression> ( , <expression> ) * | e
<name>                ::=    identifier ( . identifier ) *
<this-and-name>      ::= <name> | this.<name>
<expression> : = <cond-or-expr> <cond-or-expr-tail>
<cond-or-expr-tail> ::=    || <cond-or-expr> <cond-or-expr-tail> | e
<cond-or-expr>   ::= <cond-and-expr> <cond-and-expr-tail>
<cond-and-expr-tail> ::= && <cond-and-expr> <cond-and-expr-tail> | e
<cond-and-expr> ::=    <equal-op> <equal-op-tail>
<equal-op-tail> ::=     <equalityoperator> <equal-op> <equal-op-tail> | e
<equalityoperator> ::=   == | !=
<equal-op> ::= <rel-op> <rel-op-tail>
<rel-op-tail> ::= <relationaloperator> <rel-op> <rel-op-tail> | e
<relationaloperaor> : = < | > | >= | <=
<rel-op>   ::= <add-op> <add-op-tail>
<add-op-tail> ::= <additionoperator> <add-op> <add-op-tail> | e
<additionoperator> := + | -
<add-op>    ::= <mult-op> <mult-op-tail>
<mult-op-tail> ::= <multiplicationoperator> <mult-op> <mult-op-tail> | e
<multiplicationoperator> ::= * | / | %
<mult-op>              ::=     + <mult-op>
                                       | - <mult-op>
                                       | ! <mult-op>
                                       | <primary>
<primary>        ::=         <primarywithliteral>
                            |        <primarywithoutliteral>

<primarywithliteral>    :=    intliteral
                                |        boolliteral
                                |        nullliteral
<primarywithoutliteral>    :=       ( <_expression> )
                                        |        <classinstancecreationexpression>
                                        |       <this-and-name>
                                        |       <this-and-name> ( <argumentList> )
                                        |        this
<classInstancecreationexpression>    ::=    new <name> ( <argumentlist> )
<type>        ::=    <primitivetype>
                       |    ( <referencetype> )

<primitivetype>    ::=    int
                                |    boolean
                                | void

<referencetype>    ::= <name>

In the specified context – free grammar there are terminals, nonterminals, a start symbol and productions.
        1. Terminals are the basic symbols from which strings are formed. The word “token” is a synonym for “terminal” when we are talking about grammars for programming languages. For eg: if, else, + etc are terminals.
        2. Nonterminals are syntactic variables that denote sets of strings. The nonterminals define sets of strings that help define the language generated by the grammar. They also impose a hierarchical structure on the language that is useful for both syntax analysis and translation.
       3. In a grammar one nonterminal is distinguished as the start symbol, and the set of strings it denotes is the language defined by the grammar. In out grammar compilationunit is the start symbol.
        4. The productions of a grammar specify the manner in which the terminals and nonterminals can be combined to form strings. Each production consists of a nonterminal, followed by an arrow (or ::=), followed by strings of nonterminals and terminals.

            We have developed a non left-recursive unambiguous grammar. The non left recursive makes the task of parsing very easier by the implementation of RDP (Recursive Descent Parsers). Those details are given in Syntax Analysis phase. Also our grammar is unambiguous because it does not want to give rise for 2 parse trees for a single expression. The notational semantics of the grammar are all the terminals are indicated by bold faced letters, while the non - terminals are indicated by normal fonts. ::= indicates the production rule. The symbol * after a nonterminal indicates that the particular nonterminal can occur zero or more times. The symbol € indicates the null value.

            To avoid left recursion, we have included a tail value in several productions For example, in the case of conditional expressions there exists a conditional tail expressions. This tail expression is added only to remove the left recursion. Also it could be noted that the reference types are enclosed with the braces. This is to facilitate non-left recursion. Or else that matching parameter will cause left recursion in the rule 12. So by applying all these methods we remove the left recursion completely from the grammar.

            Also as this is a subset of the actual JLS, only a limited set of features are added in the grammar. The inheritance properties and float, string, cha data types are avoided. These features are planned as the future developments. A detailed discussion of these features is included in the Future Enhancements section. In this way a consolidated grammar has been generated for our subset of Java compiler.

3.2 LEXICAL ANALYSIS

The lexical analyzer is the first phase of the compiler. Its main task is to read the input characters and produce as output a sequence of token that the parser uses for syntax analysis.

FIGURE 5 – INTERACTION OF LEXICAL ANALYZER WITH PARSER

This is implemented by making the lexical analyzer a separate class, whose object is used by the parser. (Refer Source- Lexical Analyzer) Upon receiving the getnextToken command from the parser, the lexical analyzer reads input characters until it can identify the next token.

Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program white space in the form of blank, tab, and newline characters. Another is correleating error messages from the compiler with the source program. The lexical analyzer keeps track of the number of newline characters seen and the error messages can be related to that of the line numbers. Also the number of characters is also tracked down and be related to that of the error messages.

In our implementation of Lexical analyzer, 3 files are used. The first file is token.java (Refer Source- Token java). The second file is tokenconstants.java (Refer Source- TokenConstants java). The last file is the actual lexical analyzer. The token.java file defines the actual structure of token. This token is returned by the lexical analyzer to the parser. Each token has attribute values like kind, word, val and position. The kind is an integer value. It indicates the type of the token. The types of the token are defined in the tokenconsants.java. The types such as error, EOF are also indicated. The parser could check for the kind of the token and take appropriate action based on their value. This foms the primary basis of the error management. The word is a string value which indicates the spelling of the token word. The val is an integer value. This is valid only for integer literal tokens and Boolean literal tokens, true has the value 1 and false has the value 0. In the case of Integer literals the val has the numeric value of the literal token. In this way a token is formed by the lexical analyzer and returned to the parser.

The tokenconstants file is an interface which defines all possible tokens in the specified grammar.

TABLE 2 – VALID TOKENS OF JCOMP

List of Modifiers
            public, private, protected

List of Keywords
            class, static, new, this, return, final

List of Types
            void, int, char, boolean

List of Control Statement Keywords
            if, else, while

List of Relational Operators
            = =, !=, <=, >=, <, >, !, &&, ||

List of Arithmetic Operators
            +, -, *, / , %

List of Miscellaneous Operators
            { } ; , = ( ) .

List of Literals
            IntegerLiteral(numbers), BooleanLiteral(true or false), NullLiteral(null).

List of Other Tokens
            identifiers, end of file, error

The possible tokens range from scope modifiers, type modifiers, control statement keywords, normal keywords, relational operators, arithmetic operators, miscellaneous operators and literal values. Apart from these usual tokens it also provides tokens to indicate the identifiers, end of file and error. Any set of alphanumeric character, which is not a keyword, can form a valid identifier. ‘_’ and ‘$’ are the only two special characters allowed in the identifiers. The eof indicates the end of the output. The parser is able to understand, on receiving this token, that the input file is completely tokenized. Any token string which does not fit into the above set of token is a valid candidate for error. All these tokens are given an integer values in the tokenconstants interface. These integer values are used to denote the respective tokens in the following phases.

The Jlex file is the important file that actually performs the tokenizing operation. This class gets the filename of the input program from the parser phase. It uses FileReader class to open the file and PushbackReader class to read the file. Both these classes are the part of java.io package. If the file is not found then a FileNotFoundException is thrown. Also it initializes the location of the current character to 1000. This value gets incremented by 1 for each new character. When a new line is encountered, it is incremented to next nearest 1000 value. This is used to indicate the error. This is accomplished by just dividing the location value by 1000, to get the line number and get the modulo value by 1000 to indicate the character location. In this way error management is started in this phase.

This class implements the TokenConstants interface and so the tokens could be indicated by their respective integer values. Also it uses the Token class to form the token and return it to the parser. It has a function getToken, which is responsible for returning the next token. As a part of tokenizing, the whitespaces must be removed. While each character is read from the file, it is checked for whitespace. It is possible by the class Character of java.lang package. It has the function isWhitespace( ). If it is a newline or tabstop, appropriate action is taken as mentioned previously. Now the character is appended to a temporary string, if it has a valid identifier start. For this purpose, the Character class has a function called isJavaIdentifierStart( ). The important point is that the first letter of the identifier must be an alphabet. The special characters and numerical characters could be included only in the center of the identifier. So after checking the first character, a function called isJavaIdentifierPart( ) is used to check whether the character is valid to be in the middle of the identifier. This appending process continues until a breaking character is encountered. Now the temporary string must be checked for keyword. We use a function isKeyWord( ), to check whether the string is a keyword. If it is a keyword then it returns the integer value respective to the keyword, as defined in the TokenConstants interface. If not, then it returns the integer value for the identifier. If any mischar is reached inbetween then error value is returned.

The character must be checked for integer literals. The ascii values for numbers is used for comparison. It is appended to the temporary string. Finally the parseInt( ) function of java.lang.Integer class is used to extract the numerical value of the sting and filled in the val field of the resulting token. It is also checked for each special character separately. If the value of the character read from the file is -1, then it indicates the end of file. If the parser calls this function even after sending this last token an erroneous condition results. If the character does not fit into any of the above tokens, it could be concluded that the particular character is not part of the character set allowed by the specified grammar. So an error token is returned automatically.

The most significant used in this phase is unread( ). This is used to return the character, which is read, back to the file stream. This is required because the breaking character must be returned back while reading the identifier string. This helps to keep track of the characters properly and separate the tokens correctly. This is conceptually equivalent to the ungetc( ) function in C. When the parser calls the function each time, the lexical analyzer continues from the place where it read previously. The Pushback reader takes care of those activities automatically. In this way a lexical analyzer is implemented in this project.

3.3 SYNTAX ANALYSIS PHASE

Parsing is the process of determining if a string of tokens can be generated by a grammar. A parser can be constructed for any grammar. Parsers, in general, are of two types: top-down parsers and bottom-up parsers. In our compiler model, the parser obtains a string of tokens from the lexical analyzer, and verifies that the string can be generated by the grammar for the source language.

FIGURE 6 – POSITION OF PARSER IN COMPILER MODEL

        The parser is expected to report the errors in an intelligibile fashion. In our compiler the parser reports the error by giving string of messages. The approximate location of the error is also indicated by the logic discussed in the lexical analyzer phase.

   Our parser belongs to the group of LL (1) parsers. Our parser is a topdown parser implemented for non-left recursive unambiguous CFG. The top-down construction of a parse tree is done by starting with the root, labeled with the starting nonterminal, and repeatedly performing the following two steps:
            1.   At node n, labeled with nonterminal A, select one of the productions for A and construct children at n for the symbols on the right side of the production.
            2.   Find the next node at which a subtree is to be constructed.
    We follow a technique called RDP. It is a technique of top-down parsing method of syntax analysis in which we execute a set of recursive procedures to process the input. A procedure is associated with each non-terminal of a grammar. Here we use a special type of parsing called predictive parsing.

        Our compiler is implemented in two different phases. (Refer – Source Parser)The first phase is simple. In the first phase of the parser it simply checks for the syntax errors, if there are no syntax errors then it exits the program without giving any output. If there are any errors then those errors are indicated along with their location and type and the program exits.

        In the second phase of the parser the abstract syntax tree is developed and this abstract syntax tree is integrated with the parser. Now the return type of the parser is the abstract syntax tree is itself. The abstract syntax tree has so many classes, each corresponding to a single non terminal. Our parser also has so many functions. Each function corresponds to the non terminal of the grammar. So it is easy to map each class of the abstract syntax tree to each function of the parser. The return type of the each function of the parser is corresponding mapped class of the parser. Thus all these classes are appended and thus an abstract syntax tree is formed and finally returned. This abstract syntax tree is used by the following phases for context checking, symbol table entry, and code generation phases.

        As we are using the non left recursive unambiguous grammar, the parsing becomes as straight forward activity.For any rule P=ABC, there are four functions P, A, B, C. In the first function P, all the three functions A, B, C are invoked in sequence.The return type of all three functions are the classes A, B, C respectively. Finally a new object P is created with the objects A, B, C as its constructor parameters. All the above specified classes are defined in the abstract syntax tree class. As the abstract syntax tree is defined in tne same package and all the classes are public it is possible to use them directly, without importing each of the classes explicitly.

        In our example grammar, Let A=ab. Then in the function ‘A’ it initially checks whether the current token word is ‘a‘. If it is not so, then an error message specifying”Expected a” is specified and the system exits. If the token is equal to ‘a’, then getToken () function of the lexical analyzer is called to receive the next token. Again the received token word is checked for ‘b’ and the same case for ‘a’ is repeated. If the grammar is correct then a new object of class A is created with the terminals a, b as its constructor parameters and that new object is returned. The error management feature of our parser is built in such a way that it catches only the first error occurring in the program. Advanced error recovery schemes are planned as a future enhancement of the project.

        Consider another sample grammar rule B=d|€. This rule means that the rule B may have d or else it may be empty. So in the function B the current token is checked to be equal to d. If it is equal, then a new class B is created with the terminal d as its constructor parameter and this new object is returned. If the token is not equal to d then the error condition is not raised as in the case of the previous rule. Instead a null valued object is returned. In this way the null values in the grammar rules are handled appropriately.

      The most significant part of the construction of the parser is the one dealing with sequencedeclarations, sequenceexpresions and sequencecommands. The common logic used in all these three cases is to commands are initialized and when more than one command is encountered than the are passed as constructor argument and the new sequencecommand object is created. If only one command is encountered than only an appropriate command object is created. As the sequencecommand class is the subclass of abstract class command the return type of the object may be a sequencecommand object or an individual appropriate command object. This piece of logic uses the inheritance property of the java languages. The above logic is applicable for the other to cases as well.

        Another notable point in the construction of the parser regards with the abstract classes. The expressions, commands and declarations are all abstract classes. So while parsing the AST, returned by the parser, in the following phases may be difficult. The identification of type of commands, for example, may be difficult to avoid this difficulty each abstract has a string value attribute which clearly indicates the type of the object.

        In general, the symbol table entries enter the scene at the parsing stage itself. But in our compiler the symbol table entries begin at the context checking phase only. This technique has both pros and cons. The favorable side is that the parser has already a complex after the integration of the AST. We do not want to over burden the parser logic by introducing the symbol table entries in this place. The cons are that the functions must be defined before they are called. This resembles the C syntax but anyway this could be accepted as one of the constraints imposed by our grammar. In this way a LL (1) RDP is implemented by interacting with the lexical analyzer and it returns the Abstract syntax tree to the context checking phase.

3.4 ABSTRACT SYNTAX TREE

        The Abstract Syntax tree is the sub part of the construction of the parser. This abstract syntax tree is created and integerated with the first phase of the compiler and so the parser is modified to return the Abstract Syntax tree. This AST is the most significant part in the construction of the compiler. This is because this AST is used during the contextchecking and for symbol table entries. In later stages, this AST is used to generate the final code.

        An AST is a pictorial (conceptually) representation of how the start symbol of the grammar are represented as a parent – child relationship in the program. A useful starting point for thinking about the translation of an input string is an abstract syntax tree in which each node represents an operator and the children of the node represent the operands. By contrast a parse tree is called a Concrete Syntax Tree and the underlying grammar is called a concrete syntax for the language. Abstract Syntax trees or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.

        It is desirable for a translation scheme to be based on a grammar whose parse trees are as close to syntax trees as possible. The grouping of subexpressions by the grammar should be similar to their grouping in syntax trees. The use of syntax trees as an intermediate representation allows translation o be decoupled from parsing. Translation routines that are invoked during parsing must live with two kinds of restrictions. First, a grammar that is suitable for parsing may not reflect the natural hierarchical structure of the constructs in the language.

        However analysis may be easier if we use a tree representation that reflects the natural hierarchies. Second, the parsing method constrains the order in which node in a parse are considered. This order may not match the order in which information about a construct becomes available.

        An AST is a condensed form of the parse tree useful for representing language constructs. In a syntax tree, operators and keywords do not appear as leaves, but rather are associated with the interior node that would be the parent of those leaves in the parse tree. Another simplification found in the syntax trees is that chains of single productions may be collapsed. Syntax-directed translation can be based on syntax trees as well as parse trees. This approach is the same in each case. We attach attributes to the nodes as in a parse tree.

        Those qualities of a BNF definition that make parsing possible also create a resulting derivation tree containing far more information than necessary for a semantic specification. For example, the categories of terms and elements are required for accurate parsing, but when ascribing meaning to an expression, only its basic structure is needed. Although concrete syntax is essential to implementing programming languages, it is the abstract syntax that lies at the heart of semantic definitions. The concrete syntax is incidental to language specification, but it is important to users since it in seuences the way they think about a language. This aspect of pragmatics is not of direct concern to us in constructing the AST.

        The derivation tree retains all the information used in parsing including details that only the parser needs. On the other hand, an abstract syntax tree captures the syntactic structure of the expression completely in a much simpler form. After all, the crucial property of the expression “5*a – (b+1)” is that it is a difference of a product and a sum of certain numbers and variables. Any other information is redundant. In transforming a derivation tree into an abstract syntax tree, we generally pull the terminal symbols representing operations and commands up to the root nodes of subtrees, leaving the operands as their children. Using this approach, this expression can be thought of as a binary operation and two subexpressions. The choice of the left subtree for the binary operation is arbitrary; it seems to suggest a prefix notation for binary operations, but we are not talking about concrete syntax here, only an abstract representation of certain language constructs. We may choose any representation that we want as long as we can describe the constructs of the language adequately and maintain consistency.

        The point of abstract syntax is simply to communicate the structure of phrases in terms of their semantics in a programming language as trees. Semantics can be defined in terms of derivation trees and actually is with attribute grammars, but most semantic methods are far more understandable when based on a cleaner representation of the phrases in a language. We want uniform templates for the various parts of a language. The blueprints for the abstract syntax trees of a programming language are specified by a collection of syntactic categories or domains and a set of rules telling how categories are decomposed into other categories or tokens.

        To design the abstract syntax for a programming language, we need to determine which notions (nonterminals) are fundamental to the language and which basic forms the constructs of the language may take. The definition of abstract syntax tolerates more ambiguity since the concrete syntax has already determined the correct interpretation of the symbols in the program text. As a consequence of adhering to the BNF syntax of a language, any parsing algorithm must at least implicitly create a derivation tree. But in fact we usually design a parser to generate an abstract syntax tree instead of a derivation tree. Therefore the syntax of “parse” is given by

Parse: Token* → Abstract Syntax Tree

It can be argued that when designing a new programming language, we need to formulate the abstract syntax along with the semantics so that the meaning of a program emanates from a mapping

meaning: Abstract Syntax Trees → Semantic Objects

where the semantic objects provide meaning for the various language constructs. Different approaches to semantics depend on the disparate kinds of objects used to define meaning. Given the abstract syntax of a programming language, the concrete syntax can be defined by an operation

unparse: Abstract Syntax Trees → Concrete Syntax

where Concrete Syntax refers to derivation trees, to lists of tokens, or to lists of characters representing program texts. Since two different phrases following the concrete syntax may produce the same abstract syntax tree, unparse may not be a function at all. To ensure that unparse is a well-defined function, some canonical representation of concrete phrases must be specified for example, taking expressions having the fewest parentheses. The correctness of a parsing algorithm can be demonstrated by showing that it is the inverse, in some sense, of the unparse function.

In our project, the AST is implemented by considering all the above factors. The abstract syntax tree returned by your parse method should be an instance of (a subclass of) the following class (Refer – Source AST)

abstract class AST { }

We have defined an appropriate class hierarchy for the nodes of the tree as defined in the language specification. The class for each node should provide a toString method that returns, enclosed in the brackets, the strings corresponding to the subnodes, or if the symbol is a terminal symbol, the spelling. The empty string € is represented by null.

Note that we also have a null literal. So to differeniate between these two null values, the null literal values will be enclosed by the brackets (since they are terminals). Whereas the null values will not be enclosed by the brackets. If the production in abstract syntax contains terminal symbols, they also appear in the final string. Our abstract syntax tree makes use of the inheritance properties of the Java language. Each node in the syntax tree is implemented as a class with several fields (Refer: Source-Abstract Syntax tree). For example, in a BinaryOpExpression it has 3 fields. One is the operator, which is an object of the Terminal class. The other two may be the objects of the Terminal class or the objects of the Expression class. In this way several expressions can be combines. Also the expressions, command, declarations may be chained to form a sequenced ones. As these sequenced ones are derived from the same abstract class as of he individuals both can be returned without ambiguity. Also each abstract class has string valued attribute to indicate, precisely, the type of the individual nodes. Later each subclass is added a visit function. But those aspects are discussed in the Semantic checking phase. In this way the abstract syntax tree is constructed and returned by the parser to be used by the following phases of the compiler.

3.5 SYMBOL – TABLE GENERATION

        A compiler uses a symbol atble to keep track of scope and binding information about names. The symbol table is searched every time a name is encountered in the source text. Changes to the table occur if a new name or new information about an existing name is discovered.

            A symbol table mechanism must allow us to add new entries and fine existing entries efficiently. We use a symbol table mechanism called LeBlanc Cook Symbol table mechanism. This symbol table generation uses Hashtables and stacks for keeping track of the program. In general, Hashing schemes provide better performance for somewhat greater programming effort and space overhead. Even linear lists could be adapted readily to handle the most closely nested scope rule.

            It is useful for a compiler to be able to grow the symbol table dynamically, if necessary, at compile time. If the size of the symbol table is fixed when the compiler is written, then the size must be chosen large enough to handle any source program that might be presented. Such a fixed size is likely to be too large for most, and inadequate for some, programs. Our symbol table uses dynamic mechanisms for growth so it is possible to accept any size of programs.

            Variations of the searching technique known as hashing have been implemented in many compilers. Here we consider a rather simple variant known as open hashing, where “open” refers to the property that there need be no limit on the number of entries that can be made in the table. Even this scheme gives us the capability of performing ‘e’ inquiries on ‘n’ names in time proportional to n(n+e)/m, for any constant ‘m’. Since ‘m’ can be made as large as we like, up to n, this method is generally more efficient than linear lists and is the method of choice for symbol tables in most situations. As might be expected, the space taken by the data structure grown with ‘m’, so a time-space tradeoff is involved.

            Our symbol table uses the Hashtable class and Stack class of java.util package. When the control enters a function, it conceptually means that the control has entered a new scope. For this a new scope id is generated. This global scope id is unique and it is generated newly for each scope. Invalidated scopes ids will not be used again. These scopes must be tracked down. So we use the Stack. This stack is called as the scope stack. When a new scope is encountered then this scope id value is pushed into the stack. When the scope is left, the current scope id is reduced by 1 but the value stays in the table. This is because during the debugging activities all the values must be in the table. This makes the debugging activities easier. When a variable is encountered in the declaration part, that variable is entered in the symbol table with the current scope id. If it is encountered elsewhere then the scope stack is consulted and the scope which is the most nearest to the top of the stack is considered to be the appropriate scope of that particular variable. This would make sense after the complete discussion of the design of the symbol table.

            Our design of the symbol table uses two hashtables (Refer: Source- Symbol Table). The first hashtable consists of the FQN as the key values. The object of this hashtable is the second hash table. The second hash table has the key value as the scope id and the object of this hash table is the object of the varObject class, which is created as an abstract class in the same file. This abstract class has varSymObject class and methodSymObject class as its subclasses. These classes stand for the objects denoting the variables and the functions respectively. The only difference between both these derived classes is that the mthodSymObject has a field for holding the argumental parameters. Otherwise, both the varSymObject and methodSymObject are the same.

            The second hash table is embedded into each row of the first hash table. When more objects with same name is declared in various scopes, and then the problem arises. We have to determine which object to choose from the available set of objects. So the scope stack is used. All available scopes from the available object are collected. We determine then which scope among them is at or atleast nearer to top of the stack. Than that scope is assumed to be the most recently declared scope and the object value at that scope is used.

FIGURE 7 – LEBLANC COOK SYMBOL TABLE

            There is a function called enterScope ( ). This is called when a new scope is encountered. When this function is called, the global scope id is incremented by one and the current scope id is set to the value of the global scope id. This scope id is also pushed into the scope stack. There is a function called leaveScope ( ). This is called when an existing scope becomes invalidated i.e the control comes out of the scope. When this function is called the stack is popped. This means that the current scope id, which is at the top of the stack, is removed. So the invalid scope leaves the scope stack but stays in the table. As mentioned earlier, this is used for the debugging activities.

            The function insert is used to insert the object into the hash tables. Here the functional polymorphism property of the Java Programming language is used. There are two insert function but with varying number of arguments. The first insert function is used to insert the variable objects and the second insert function is used to insert the method objects. During the time of insertion, the scope ide is checked for 0 and the type is checked whether it is equivalent to class. At that instant the classname must be null. Or else error condition arises. Also the name of the variable is checked with the classname and the methodname. If they match then the error is notified. Else the process continues smoothly. Now an appropriate type of the object is created with the arguments. The arguments are scope modifiers, static modifiers, and types or return values, and parameters. Now the symTable (first) is checked for the key value.

            If the key value does not exist then a new entry is made in the table and then it is inserted. If the key value already exists, the hashtable embedded in the object column is obtained. It is checked whether the same scope value exists. If it existes, then it means that the variable with the same name is declaraed in the same scope exists. So it arises the erroneous condition and exits. If there is no current scope id then the new object is also added to the second hash table and it is again embedded into the first hashtable in its previous position itself. In this way the insert operation is performed.

            The function lookup is used to lookup for the object values in the hashtables. The return type of this function is the varObject since the returned object may be either varSymObject or methodSymObject. During the lookup, the object with that name as key is obtained. If it does not exist then the null valued object is returned. If the object exists, then it is returned. The problem is more than one object would exist as they could be defined in various scopes. So to choose the best object all their scope ids are collected and then the scope stack is checked. By checking with the scope stack the invalidated scopes could be eliminated. Now with the valid scopes available, the one which is at or atleast the most nearer to the top of the stack is considered as the best and its corresponding object is choosen for the use.

            Then there is another function getScopeID ( ). This is just a helping function. It uses the same logic as that of the lookup operation. But instead of returning the object value it returns the scope value. This may be useful in simple logics and also in the context checking phase. Generally, in the compiler design principles the symbol table will enter the scene in the parsing phase. But in our project we introduce the symbol table only in the contextual constraints checking phase. Actually this symbol table generation is a preliminary activity for the following phase. There is another function called contains( ) which returns a Boolean value. The array of objects and the integer values are its parameters. In this function actually the scope values are checked for existence in the scope stack. This piece of control logic is used to invalidate the invalid scope ids that exist in the symbol table.

            This symbol table is used in the context checking phase. During the context checking phase, the control walks through the AST that is it moves its control through each child node of the tree. As it moves through the tree it may enter the scope or leave the scopes. It may encounter the declaration of variables and functions. It may encounter many function call commands. It may come across many command that make use of the variables for which it has to insert and lookup the values in the symbol table.In this way the LeBlanc cook symbol table is used for the construction of our symbol table.

3.6 SEMANTIC ANALYSIS

In the semantic analysis phase certain checks are performed to ensure that the components of the program fit together meaningfully. The semantic analysis phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase. It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and operands of expressions and statements.

FIGURE 8 – POSITION OF CONTEXT CHECKER IN COMPILER MODEL

         An important component of semantic analysis is type checking. Here the compiler checks that each operator has operands that are permitted by the source language specification. For example, many programming language definitions require a compiler to report an error every time a real number is used to index an array. However, the language specification may permit some operand coercions, for example, when a binary arithmetic operator is applied to an integer and real. In this case, the compiler may need to convert the integer to a real.

            Semantic analysis uses the visitor pattern to traverse the AST, decorating the AST nodes and checking conditions specified below. It is necessary to add members to our AST classes to hold the attributes we have designed and will use our symbol table class. For example, since each expression has a type, we would add a member to hold the type in your abstract Expression class. Errors should be handled as before, by printing the position of the token involved in the error, and terminating. Thus if check returns, the AST has been successfully type checked.

            Our AST is parsed or the control just traverses through the node of the abstract syntax tree and at each node as it encounters the various declarations, expressions and commands the corresponding semantic rules are checked and appropriate actions are taken. When the declarations are encountered they are entered into the symbol table and when expressions or command are encountered then lookup operations take place and the statements are validated. You may notice that there are additional context constraints that it might make sense to check, but the ones given below are the only ones you should implement.
    •   Identifiers must be declared before they are used. CAN are not allowed.
• The program name cannot be redefined anywhere in the program. A name can only be declared once in the same scope. If a name is declared more than once, then the usage is bound to the closest declaration. A procedure cannot contain a variable with the same name as the procedure. A variable cannot have type void.
    •   Every expression has a type determined as follows:
            -   The type of a VarExpression is the declared type of the identifier.
            -   The type of a NumLitExpression is int
            -   The type of a BoolLitExpression is boolean.
            -   The type of a FunctionExpression is the return type of its procedure
    •   The expression in an AssignCommand must have the same type as the variable.
    •   The expression in a WhileCommand must have boolean type.
    •   The expression in an IfElse command must have boolean type.
    •   The types of the actual parameters to a procedure or function call must match the types of the formal parameters.
    •   The return type of the function must match with the type of the function.
   •   In the functions with the void return type the return command must not have any expression i.e the return type of the expression must be null.
    •   The functions must be declared and defined before a call is made.
    •   No variable should be of type void.
    •   If the functions of the other classes are accesed then they must be public functions.
    •   The external class to which a reference is made must be available in the classpath.
    •   The return type of the function must match with that of the type on the invoking side.
• The variable in the other class which we are trying to access must be a public variable. AIF & MIF fields must be handled appropriately.
    •   The variable whose value we are trying to modify must not be a final variable.

            To implement our context checking phase we use 4 files. The first is the change made to the already built AST. A new function called the visitor with the arguments ASTVisitor and Object is included in the abstract class AST. So this function must be implemented by all the classes that inherit this abstract AST class. This makes use of the inheritance features of the Java language. Now the second file is an interface. This interface is the ASTVisitor which is used as the argument for the visit function in the AST class. This ASTVisitor interface has several functions to denote the visit functions of each node. So these functions are defined by our actual context checker class and each node in the abstract syntax tree actually implements its corresponding functions and thus while traversing the node all the semantic analysis check could be made.

        The last file is the ParserUtil class. This class is used to load the any other class specified in the Java APIs. If the other user defined classes has to be referenced then those classes must be added to the classpath. This class uses the ClassLoader, Class classes and Method, Field, and Modifier classes of the java.lang.reflect and java.util packages. It uses the the functions like loadClass which loads the actual class into the memory space. There is another function called getDeclaredMethods which gives all the declared methods in the class. But this function does not give the constructor declarations. The constructor declarations are also necessary to check for scope modifiers and also the parameter types. So we use another function getConstructors to give all the declared constructors of the class. There is another function called getReturnType to determine the return type of the class. Also we use another function getParameterType which gives the type of the parameters of the functions. We also another functions getModifiers to determine the modifiers of the function or variable. It returns a integer value which is then applied individually to the functions like isPublic etc to determine its modifier value precisely. In this way by using all these functions and classes the control shifts through each node of the AST and performs the semantic checking in each node and uses the symbol table for entries and lookup and notifies the error, if any. Thus semantic checking is completed.

3.7 BYTE CODE GENERATION

In order to execute the programs in our language, we will generate code for the JVM (Java Virtual Machine). Recall that a java program is translated by the java compiler (command javac) into a set of class files containing Java Byte Codes. These are then interpreted by the JVM (command java) to execute the program. Our compiler will also translate the source programs into class files containing java byte code, which are then interpreted by the JVM.

Overview of the class file format

The format of a java class file is defined a stream of 8-bit bytes consisting of a single ClassFile structure (where the structure is as in the C programming language):

ClassFile {
u4 magic;
u2 minor_version;
u2 major_version;
u2 constant_pool_count;
cp_info constant_pool[constant_pool_count-1];
u2 access_flags;
u2 this_class;
u2 super_class;
u2 interfaces_count;
u2 interfaces[interfaces_count];
u2 fields_count;
field_info fields[fields_count];
u2 methods_count;
method_info methods[methods_count];
u2 attributes_count;
attribute_info attributes[attributes_count];
}

The magic, minor_version, and major_version are fixed values. The constant_pool_count indicates how many entries are in the contant_pool + 1. This constant pool is a table of structures representing various string constants, class names, field names, etc. Access flags indicate the access permissions and properties of the class (public, final, abstract, etc). The item this_class refers to the class described by this classfile, and super_class, its superclass. In our case, this_class will be set to the program name, and super_class to "java/lang/Object". Interfaces and interface_count will be set to 0 and null respectively. fields_count and fields contain information about the fields of the class. In our case, these will hold information about the global variables of our program. methods_count and methods_info will contain information about procedures, including the name, superclass, bytecode, max number of local variables, etc. The attributes_count and attribute-info items will be 0 and null, respectively.

Instruction Summary

The Java virtual machine is stack based. A Java virtual machine instruction consists of a one-byte opcode specifying the operation to be performed, followed by zero or more operands supplying arguments or data that are used by the operation. Many instructions have no operands and consist only of an opcode.

Ignoring exceptions, the inner loop of a Java virtual machine interpreter is effectively

do {
     fetch an opcode;
     if (operands) fetch operands;
     execute the action for the opcode;
} while (there is more to do);

The number and size of the operands are determined by the opcode. Each method invocation causes the creation of a new frame. The frame contains an array of local variables for the method, and an operand stack where most of the work is done. The first items in the local variable array are the parameters of the method. The remaining elements of the array are the variables declared locally in the method. These variables are accessed by their index in the array. The load and store instructions transfer values between the local variables and the operand stack of a Java virtual machine frame:

•   Load a local int variable onto the operand stack: iload.    (Example, iload 2 loads the local int variable at position 2 onto the stack)
•   Store an int value from the operand stack into a local variable: istore (Example, istore 2 stores the int value on top of the stack to the local variable at location 2.)
•   Load a reference in a local variable onto the operand stack: aload
•   Store a reference form the operand stack into a local variable: astore

The instructions to load a constant onto the operand stack are

• Load a constant (literal) onto the operand stack: ldc (Example ldc 0 loads the value 0 onto the stack)

These instructions take the name of the static variable as an operand.

•   Load the contents of a static variable onto the stack: getstatic
•   Store a value from the operand stack into a static variable: putstatic
The JVM instruction set contains separate instructions for the operations on each primitive type. For example, there are multiple instructions that add the two values on top of the stack: iadd, ladd, fadd, dadd. These operate on ints, longs, floats, and doubles, respectively. Boolean types are represented within the JVM as int values with true = 1 and false = 0.
•   Add: iadd
•   Subtract: isub
•   Multiply: imul
•   Divide: idiv
•   Remainder: irem
•   Negate: ineg
•   Bitwise or: ior
•   Bitwise and: iand
•   Bitwise exclusive or: ixor
•   The control transfer instructions conditionally or unconditionally cause the Java virtual machine to continue execution with an instruction other than the one following the control transfer instruction. Some of theses are:
•   Conditional branch: ifeq, iflt, ifle, ifne, ifgt, ifge, ifnull, ifnonnull, if_icmpeq, if_icmpne, if_icmplt, if_icmpgt, if_icmple, if_icmpge, if_acmpeq, if_acmpne.
•   Unconditional branch: goto
The following instructions invoke methods:
•   invokevirtual invokes an instance method of an object, dispatching on the (virtual) type of the object. This is the normal method dispatch in the Java programming language.
•   invokestatic invokes a class (static) method in a named class. This will be the instruction for the procedure and function calls in our language.
•   Method return instructions, are ireturn, (used to return values of type int) areturn, (used to return references) and return, to return from void methods.

Before a method is invoked, its arguments are placed on the operand stack. The invoke instruction then places these in the local variable array of the frame for the invoked method where they can be accessed as local variables. The return instruction leaves the result, if any on top of the stack of the caller.

Implementation

In order to implement our compiler, we need to map the various kinds of constructs in our language into the JVM constructs. So far, we have mapped a program to a class, with the global variables of the program mapped into static variables of the class, and the body of the program, to the class's main method. Procedures will be mapped to static methods of the class, and the local variables of the procedure to local variables of the method.

In the JVM, method calls cause a new frame to be created. The frame is destroyed when the method call terminates. Each frame has an array of local variables, an operand stack where computations are carried out, and a reference to the constant pool of the class of the method. The local variable array contains parameters that are passed to the method, plus the local variables declared in the method.

Example:

int P(int x, int y){
   int z;
   int w;
   z := x + y;
return z;
   }

Here, the local variable array has length 4, with parameters x and y mapped to slots 0 and 1, and local variables z and w, mapped to slots 2 and 3. Store the slot number of local variables in the symbol table.

The instructions for this method are the

iload(0)    //load x, which is in slot 0
iload(1)    //load y, whichis in slot 1
iadd          //add, leave result on top of stack
istore(2)   //store result in z, which is in slot 2
iload(2)    //get value in z and leave it on top of the stack
ireturn      //return from method, returning int value on top of the stack.

At the end of the method, a return_ (for void) or ireturn (for procedures that return an int) or areturn (for procedures that return an object). Although for semantic correctness, every procedure needs a return statement, our compiler doesn’t check this. It won't detect missing or type incorrect return statements. These errors will be detected by the JVM verifier, however, and will give an error when the generated class file is executed.

Calling a method.
A method is called by pushing the parameters on the stack, and then using the invokestatic instruction.

For example: suppose a is a global int variable, and invoke procedure P (above) with parameters a and constant 2.

getstatic("thisclassname","a","I") //pushes the value stored in global variable a on the stack
ldc(2) //pushes the value 2 on the stack
invokestatic("thisclassname","P", "(II)I"); //after return, the returned int is on top of the stack
The value returned by a method is found on top of the caller’s stack after the method returns.

If and while loops

To compile if statements and while loops, we use the
ifeq (branch if top of stack is equal to 0) and ifneq (branch if not equal to zero) instructions along with labels.

The method label ("x") inserts a label "x" at that point in the code sequence.
The branch instructions take a label as an argument. For example ifeq("x") jumps to the instruction following label x if the top of the stack is equal to zero (and consumes the top of the stack).

"if b then c else d " would be compiled to
code to compute b and leave on top of stack

ifeq("else_part")
code for c
goto_("end_if")
label("else_part")
code for d
label("end_if")

and while b do s

goto_("test")
label("body")
code for s
label("test")
code to evaluate b
ifneq("body)

Note that every label must be unique as they are eventually mapped to addresses. Thus, every time you generate code for a while or if loop, you must generate unique labels.

Booleans and relational expressions

Our compiler gives booleans their own type Z, but they are still manipulated with int instructions. To evaluate relational expressions, the following instructions are used:

if_icmpeq       (equal)
if_icmpne       (not equal)
if_icmpge        (greater than or equal)
if_icmpgt        (greater than)
if_icmple         (less than or equal)
if_icmplt          (less than)
goto_              (unconditional jump)

These methods take a string, representing a label as an argument. The corresponding JVM instruction compares the top two elements of the stack and if successful, jumps to the labeled instruction. If not successful, then the next instruction in the sequence is executed. The goto_ method always jumps to the labeled instruction.
To evaluate, for example, the expression a < b and leave the result on top of the stack, one could generate the following code

code to push the value of a on the stack
code to push the value of b on the stack
if_icmplt("true0")     /*This does the less-than comparison /
ldc(0)                     /The computation continues here if result of the comparison is false. Push 0, representing false, on the stack. /
goto_("done0")       /*Jump over code that handles the true case. */
label("true0")          /*Mark the following instruction with the label "true0" /
ldc(1)                      /*Computation goes here after if_icmplt if result of comparision is true. Push 1, representing true, on the stack /
label("done0")        /*Marks the following instruction with label "done0"
......                              */Now the value of a<b is on the stack.    Continue with whatever
comes next

Use iand and ior instructions for && and ||.
In this way while parsing the AST the byte code could be generated by using the software. The tools for constructing class files were written by Joshua Engel and are provided with his book Programming for the Java Virtual Machine, Addison Wesley, 1999. This software is available under the GNU Library General Public License. The file jvm.jar must be added to the classpath and our program imports the following packages to use the tool:

import COM.sootNsmoke.jvm.*;
import COM.sootNsmoke.instructions.*;

The class JavaFile provides methods to support the construction of classfiles. In particular, it provides convenient methods for adding items such as fields and methods to the classfile and takes care of updating the constant pool.

Thus the byte code is generated in the file.class and this class file is able to run on any standard JVM. This is because of the Java programming architecture that the bytecode is portable to any machine. Thus the compiler construction is completed with this phase.

<< Previous

Next >>

Hosted by www.Geocities.ws