IMPLEMENTATION OF JCOMP
The Jcomp – Subset of java compiler is implemented
in six different phases. They are:
•
Grammar Generation Phase
•
Lexical Analysis Phase
•
Syntax Analysis Phase
•
Semantic Analysis Phase
•
Symbol Table Management
•
Byte Code Generation
Apart from
these phases, error management is integrated with almost all these phases.
The actual relationship of these phases is as shown in the figure 4.
We know that the standard java compiler produces a class file that is able
to run on a standard JVM. Our compiler also generates a class file in
the same manner. The process is illustrated in the Figure 3.
FIGURE 3 – COMPILATION PROCESS
All these phases (except the first) are implemented as separate java files
in a package called compiler. Apart from these files there are 4 more files
which help in the compilation process. They are:
•
Token.java
•
TokenConstants.java
•
ParserUtil.java
•
ASTVisitor
FIGURE 4 – PHASES OF THE COMPILER
The first two files form the basis for the tokenizing of the input file.
The first two files are created by applying the logic specified by Dr.Beverly
Sanders, University of Florida. The ParserUtil.java is used for loading
the classes in the context checking phase. The last ASTVisitor is an interface
which is also used in the context checking phase. In this way all these
phases are separated clearly and implemented separately and they are tested
individually. Note that the symbol table management enters the scene only
at the context checking phase and not at the parsing phase as expected.
Now let us discuss the implementation of each phase individually.
3.1 GRAMMAR GENERATION
The syntax of programming language constructs can be described by context-free
grammars or BNF (Backus-Naur Form) notation. Grammars offer significant
advantage to both language designers and compiler writers.
• A grammar gives precise, yet easy-to-understand,
syntactic specification of a programming language.
•From certain classes of grammars we can
automatically construct an efficient parser that determines if a source
program is syntactically well formed. As an additional benefit, the parser
construction process can reveal syntactic ambiguities and other difficult-to-parse
constructs that might otherwise go undetected in the initial design phase
of a language and its compiler.
• A properly designed grammar imparts a structure
to a programming language that is useful for the translation of source
programs into correct object code and for the detection of errors. Tools
are available for converting grammar-based descriptions of translations
into working programs.
• Languages evolve over a period of
time, acquiring new constructs and performing additional tasks. These
new constructs can be added to a language more easily when there is an
existing implementation based on a grammatical description of the language.
TABLE 1 – GRAMMAR OF THE JCOMP
+ - 1 or more times
* - 0 or more times
<compilationunit> ::= <classdeclaration>
| e
<classdeclaration> ::=
<public> class identifier <classbody>
<public> := public | e
<classbody>
::= { <classbodydeclaration>* }
<classbodydeclaration> ::= <modifier>
<static> <final> <type> identifier ;
| <modifier> <static> <type> identifier ( <formalparamaterlist>
) <block>
| <modifier> identifier ( <formalparamaterlist> ) <block>
<modifier>
::= public | private | protected
<static> := static | e
<final> := final | e
<formalparameterlist> ::=
<vardecs> | e
<vardecs> ::=
<vardec> (, <vardec> )*
<vardec> ::=
<type> identifier
<block>
::= { (<vardec> ; )* <statement>
* }
<statement>
::= if ( <expression> ) { <statement> * }
| if ( <expression> ) { <statement>* } else { <statement>*
}
| while ( <expression> ) { <statement> * }
| <statementwithouttrailingsubstatement>
<statementwithouttrailingsubstatement> ::=
<expressionstatement>
| <returnstatement>
<returnstatement>
::= return <returntail> ;
<returntail> ::= <_expression> | e
<expressionstatement> ::=
<statementexpression> ;
<statementexpression> ::= <statementexpr-prefix>
<statementexpr-suffix>
<statementexpr-prefix> ::= <this-and-name>
<statementexpr-suffix> ::= <assignment-suffix>
| <methodinvocation-suffix>
<methodinvocation-suffix> ::=
( <argumentList> )
<assignment-suffix>
::= = <expression>
<argumentlist> ::=
<expression> ( , <expression> ) * | e
<name>
::= identifier ( . identifier ) *
<this-and-name> ::= <name>
| this.<name>
<expression> : = <cond-or-expr> <cond-or-expr-tail>
<cond-or-expr-tail> ::= || <cond-or-expr>
<cond-or-expr-tail> | e
<cond-or-expr> ::= <cond-and-expr>
<cond-and-expr-tail>
<cond-and-expr-tail> ::= && <cond-and-expr>
<cond-and-expr-tail> | e
<cond-and-expr> ::= <equal-op> <equal-op-tail>
<equal-op-tail> ::= <equalityoperator>
<equal-op> <equal-op-tail> | e
<equalityoperator> ::= == | !=
<equal-op> ::= <rel-op> <rel-op-tail>
<rel-op-tail> ::= <relationaloperator> <rel-op>
<rel-op-tail> | e
<relationaloperaor> : = < | > | >= | <=
<rel-op> ::= <add-op> <add-op-tail>
<add-op-tail> ::= <additionoperator> <add-op>
<add-op-tail> | e
<additionoperator> := + | -
<add-op> ::= <mult-op> <mult-op-tail>
<mult-op-tail> ::= <multiplicationoperator> <mult-op>
<mult-op-tail> | e
<multiplicationoperator> ::= * | / | %
<mult-op>
::= + <mult-op>
| - <mult-op>
| ! <mult-op>
| <primary>
<primary> ::=
<primarywithliteral>
| <primarywithoutliteral>
<primarywithliteral> :=
intliteral
| boolliteral
| nullliteral
<primarywithoutliteral> :=
( <_expression> )
| <classinstancecreationexpression>
| <this-and-name>
| <this-and-name> ( <argumentList>
)
| this
<classInstancecreationexpression> ::=
new <name> ( <argumentlist> )
<type> ::=
<primitivetype>
| ( <referencetype> )
<primitivetype> ::= int
| boolean
| void
<referencetype> ::= <name>
|
In the specified context – free grammar there are terminals,
nonterminals, a start symbol and productions.
1. Terminals are the basic symbols
from which strings are formed. The word “token” is a synonym for “terminal”
when we are talking about grammars for programming languages. For eg: if,
else, + etc are terminals.
2. Nonterminals are syntactic
variables that denote sets of strings. The nonterminals define sets of
strings that help define the language generated by the grammar. They also
impose a hierarchical structure on the language that is useful for both
syntax analysis and translation.
3. In a grammar one nonterminal is
distinguished as the start symbol, and the set of strings it denotes is
the language defined by the grammar. In out grammar compilationunit is the
start symbol.
4. The productions of a grammar
specify the manner in which the terminals and nonterminals can be combined
to form strings. Each production consists of a nonterminal, followed by
an arrow (or ::=), followed by strings of nonterminals and terminals.
We have developed
a non left-recursive unambiguous grammar. The non left recursive makes
the task of parsing very easier by the implementation of RDP (Recursive
Descent Parsers). Those details are given in Syntax Analysis phase. Also
our grammar is unambiguous because it does not want to give rise for 2 parse
trees for a single expression. The notational semantics of the grammar are
all the terminals are indicated by bold faced letters, while the non - terminals
are indicated by normal fonts. ::= indicates the production rule. The symbol
* after a nonterminal indicates that the particular nonterminal can occur
zero or more times. The symbol € indicates the null value.
To avoid
left recursion, we have included a tail value in several productions For
example, in the case of conditional expressions there exists a conditional
tail expressions. This tail expression is added only to remove the left
recursion. Also it could be noted that the reference types are enclosed
with the braces. This is to facilitate non-left recursion. Or else that
matching parameter will cause left recursion in the rule 12. So by applying
all these methods we remove the left recursion completely from the grammar.
Also as this
is a subset of the actual JLS, only a limited set of features are added
in the grammar. The inheritance properties and float, string, cha data
types are avoided. These features are planned as the future developments.
A detailed discussion of these features is included in the Future Enhancements
section. In this way a consolidated grammar has been generated for our
subset of Java compiler.
3.2 LEXICAL ANALYSIS
The lexical analyzer is the first phase of the compiler. Its main task
is to read the input characters and produce as output a sequence of token
that the parser uses for syntax analysis.
FIGURE 5 – INTERACTION OF LEXICAL ANALYZER
WITH PARSER
This is implemented by making the lexical analyzer a separate class, whose
object is used by the parser. (Refer Source- Lexical Analyzer) Upon receiving
the getnextToken command from the parser, the lexical analyzer reads input
characters until it can identify the next token.
Since the lexical analyzer is the part of the compiler that reads the source
text, it may also perform certain secondary tasks at the user interface.
One such task is stripping out from the source program white space in the
form of blank, tab, and newline characters. Another is correleating error
messages from the compiler with the source program. The lexical analyzer
keeps track of the number of newline characters seen and the error messages
can be related to that of the line numbers. Also the number of characters
is also tracked down and be related to that of the error messages.
In our implementation of Lexical analyzer, 3 files are used. The first
file is token.java (Refer Source- Token java). The second file is tokenconstants.java
(Refer Source- TokenConstants java). The last file is the actual lexical
analyzer. The token.java file defines the actual structure of token. This
token is returned by the lexical analyzer to the parser. Each token has
attribute values like kind, word, val and position. The kind is an integer
value. It indicates the type of the token. The types of the token are defined
in the tokenconsants.java. The types such as error, EOF are also indicated.
The parser could check for the kind of the token and take appropriate action
based on their value. This foms the primary basis of the error management.
The word is a string value which indicates the spelling of the token word.
The val is an integer value. This is valid only for integer literal tokens
and Boolean literal tokens, true has the value 1 and false has the
value 0. In the case of Integer literals the val has the numeric value of
the literal token. In this way a token is formed by the lexical analyzer and
returned to the parser.
The tokenconstants file is an interface which defines all possible tokens
in the specified grammar.
TABLE 2 – VALID TOKENS OF JCOMP
List of Modifiers
public, private, protected
List of Keywords
class, static, new, this, return, final
List of Types
void, int, char, boolean
List of Control Statement Keywords
if, else, while
List of Relational Operators
= =, !=, <=, >=, <, >, !, &&, ||
List of Arithmetic Operators
+, -, *, / , %
List of Miscellaneous Operators
{ } ; , = ( ) .
List of Literals
IntegerLiteral(numbers), BooleanLiteral(true or false), NullLiteral(null).
List of Other Tokens
identifiers, end of file, error
|
The possible tokens range from scope modifiers, type modifiers, control
statement keywords, normal keywords, relational operators, arithmetic operators,
miscellaneous operators and literal values. Apart from these usual tokens
it also provides tokens to indicate the identifiers, end of file and error.
Any set of alphanumeric character, which is not a keyword, can form a valid
identifier. ‘_’ and ‘$’ are the only two special characters allowed in the
identifiers. The eof indicates the end of the output. The parser is able
to understand, on receiving this token, that the input file is completely
tokenized. Any token string which does not fit into the above set of token
is a valid candidate for error. All these tokens are given an integer values
in the tokenconstants interface. These integer values are used to denote
the respective tokens in the following phases.
The Jlex file is the important file that actually performs the tokenizing
operation. This class gets the filename of the input program from the parser
phase. It uses FileReader class to open the file and PushbackReader class
to read the file. Both these classes are the part of java.io package. If
the file is not found then a FileNotFoundException is thrown. Also it initializes
the location of the current character to 1000. This value gets incremented
by 1 for each new character. When a new line is encountered, it is incremented
to next nearest 1000 value. This is used to indicate the error. This is
accomplished by just dividing the location value by 1000, to get the line
number and get the modulo value by 1000 to indicate the character location.
In this way error management is started in this phase.
This class implements the TokenConstants interface and so the tokens could
be indicated by their respective integer values. Also it uses the Token
class to form the token and return it to the parser. It has a function getToken,
which is responsible for returning the next token. As a part of tokenizing,
the whitespaces must be removed. While each character is read from the
file, it is checked for whitespace. It is possible by the class Character
of java.lang package. It has the function isWhitespace( ). If it is a newline
or tabstop, appropriate action is taken as mentioned previously. Now the
character is appended to a temporary string, if it has a valid identifier
start. For this purpose, the Character class has a function called isJavaIdentifierStart(
). The important point is that the first letter of the identifier must
be an alphabet. The special characters and numerical characters could be
included only in the center of the identifier. So after checking the first
character, a function called isJavaIdentifierPart( ) is used to check whether
the character is valid to be in the middle of the identifier. This appending
process continues until a breaking character is encountered. Now the temporary
string must be checked for keyword. We use a function isKeyWord( ), to
check whether the string is a keyword. If it is a keyword then it returns
the integer value respective to the keyword, as defined in the TokenConstants
interface. If not, then it returns the integer value for the identifier.
If any mischar is reached inbetween then error value is returned.
The character must be checked for integer literals. The ascii values for
numbers is used for comparison. It is appended to the temporary string.
Finally the parseInt( ) function of java.lang.Integer class is used to extract
the numerical value of the sting and filled in the val field of the resulting
token. It is also checked for each special character separately. If the
value of the character read from the file is -1, then it indicates the end
of file. If the parser calls this function even after sending this last token
an erroneous condition results. If the character does not fit into any
of the above tokens, it could be concluded that the particular character
is not part of the character set allowed by the specified grammar. So an
error token is returned automatically.
The most significant used in this phase is unread( ). This is used to return
the character, which is read, back to the file stream. This is required
because the breaking character must be returned back while reading the
identifier string. This helps to keep track of the characters properly
and separate the tokens correctly. This is conceptually equivalent to the
ungetc( ) function in C. When the parser calls the function each time,
the lexical analyzer continues from the place where it read previously.
The Pushback reader takes care of those activities automatically. In this
way a lexical analyzer is implemented in this project.
3.3 SYNTAX ANALYSIS PHASE
Parsing is the process of determining if a string of tokens can be generated
by a grammar. A parser can be constructed for any grammar. Parsers, in
general, are of two types: top-down parsers and bottom-up parsers.
In our compiler model, the parser obtains a string of tokens from the lexical
analyzer, and verifies that the string can be generated by the grammar for
the source language.
FIGURE 6 – POSITION OF PARSER IN COMPILER
MODEL
The parser is
expected to report the errors in an intelligibile fashion. In our compiler
the parser reports the error by giving string of messages. The approximate
location of the error is also indicated by the logic discussed in the lexical
analyzer phase.
Our parser belongs to the group of LL (1) parsers.
Our parser is a topdown parser implemented for non-left recursive unambiguous
CFG. The top-down construction of a parse tree is done by starting with
the root, labeled with the starting nonterminal, and repeatedly performing
the following two steps:
1.
At node n, labeled with nonterminal A, select one of the productions
for A and construct children at n for the symbols on the right side of
the production.
2.
Find the next node at which a subtree is to be constructed.
We follow a technique called RDP. It is a technique
of top-down parsing method of syntax analysis in which we execute a set
of recursive procedures to process the input. A procedure is associated
with each non-terminal of a grammar. Here we use a special type of parsing
called predictive parsing.
Our compiler is implemented
in two different phases. (Refer – Source Parser)The first phase is simple.
In the first phase of the parser it simply checks for the syntax errors,
if there are no syntax errors then it exits the program without giving
any output. If there are any errors then those errors are indicated along
with their location and type and the program exits.
In the second phase of the parser
the abstract syntax tree is developed and this abstract syntax tree is
integrated with the parser. Now the return type of the parser is the abstract
syntax tree is itself. The abstract syntax tree has so many classes, each
corresponding to a single non terminal. Our parser also has so many functions.
Each function corresponds to the non terminal of the grammar. So it is
easy to map each class of the abstract syntax tree to each function of
the parser. The return type of the each function of the parser is corresponding
mapped class of the parser. Thus all these classes are appended and thus
an abstract syntax tree is formed and finally returned. This abstract syntax
tree is used by the following phases for context checking, symbol table
entry, and code generation phases.
As we are using the non left
recursive unambiguous grammar, the parsing becomes as straight forward
activity.For any rule P=ABC, there are four functions P, A, B, C. In the
first function P, all the three functions A, B, C are invoked in sequence.The
return type of all three functions are the classes A, B, C respectively.
Finally a new object P is created with the objects A, B, C as its constructor
parameters. All the above specified classes are defined in the abstract
syntax tree class. As the abstract syntax tree is defined in tne same package
and all the classes are public it is possible to use them directly, without
importing each of the classes explicitly.
In our example grammar, Let
A=ab. Then in the function ‘A’ it initially checks whether the current
token word is ‘a‘. If it is not so, then an error message specifying”Expected
a” is specified and the system exits. If the token is equal to ‘a’, then
getToken () function of the lexical analyzer is called to receive the next
token. Again the received token word is checked for ‘b’ and the same case
for ‘a’ is repeated. If the grammar is correct then a new object of class
A is created with the terminals a, b as its constructor parameters and that
new object is returned. The error management feature of our parser is built
in such a way that it catches only the first error occurring in the program.
Advanced error recovery schemes are planned as a future enhancement of the
project.
Consider another sample grammar
rule B=d|€. This rule means that the rule B may have d or else it may
be empty. So in the function B the current token is checked to be equal
to d. If it is equal, then a new class B is created with the terminal d
as its constructor parameter and this new object is returned. If the token
is not equal to d then the error condition is not raised as in the case
of the previous rule. Instead a null valued object is returned. In this
way the null values in the grammar rules are handled appropriately.
The most significant part of the construction
of the parser is the one dealing with sequencedeclarations, sequenceexpresions
and sequencecommands. The common logic used in all these three cases is
to commands are initialized and when more than one command is encountered
than the are passed as constructor argument and the new sequencecommand
object is created. If only one command is encountered than only an appropriate
command object is created. As the sequencecommand class is the subclass of
abstract class command the return type of the object may be a sequencecommand
object or an individual appropriate command object. This piece of logic uses
the inheritance property of the java languages. The above logic is applicable
for the other to cases as well.
Another notable point in the
construction of the parser regards with the abstract classes. The expressions,
commands and declarations are all abstract classes. So while parsing the
AST, returned by the parser, in the following phases may be difficult.
The identification of type of commands, for example, may be difficult to
avoid this difficulty each abstract has a string value attribute which
clearly indicates the type of the object.
In general, the symbol table
entries enter the scene at the parsing stage itself. But in our compiler
the symbol table entries begin at the context checking phase only. This
technique has both pros and cons. The favorable side is that the parser
has already a complex after the integration of the AST. We do not want to
over burden the parser logic by introducing the symbol table entries in this
place. The cons are that the functions must be defined before they are called.
This resembles the C syntax but anyway this could be accepted as one of
the constraints imposed by our grammar. In this way a LL (1) RDP is implemented
by interacting with the lexical analyzer and it returns the Abstract syntax
tree to the context checking phase.
3.4 ABSTRACT SYNTAX TREE
The Abstract Syntax
tree is the sub part of the construction of the parser. This abstract syntax
tree is created and integerated with the first phase of the compiler and
so the parser is modified to return the Abstract Syntax tree. This AST
is the most significant part in the construction of the compiler. This is
because this AST is used during the contextchecking and for symbol table
entries. In later stages, this AST is used to generate the final code.
An AST is a pictorial (conceptually)
representation of how the start symbol of the grammar are represented as
a parent – child relationship in the program. A useful starting point for
thinking about the translation of an input string is an abstract syntax
tree in which each node represents an operator and the children of the node
represent the operands. By contrast a parse tree is called a Concrete Syntax
Tree and the underlying grammar is called a concrete syntax for the language.
Abstract Syntax trees or simply syntax trees, differ from parse trees because
superficial distinctions of form, unimportant for translation, do not
appear in syntax trees.
It is desirable for a translation
scheme to be based on a grammar whose parse trees are as close to syntax
trees as possible. The grouping of subexpressions by the grammar should
be similar to their grouping in syntax trees. The use of syntax trees as
an intermediate representation allows translation o be decoupled from parsing.
Translation routines that are invoked during parsing must live with two
kinds of restrictions. First, a grammar that is suitable for parsing may
not reflect the natural hierarchical structure of the constructs in the
language.
However analysis may be easier
if we use a tree representation that reflects the natural hierarchies. Second,
the parsing method constrains the order in which node in a parse are considered.
This order may not match the order in which information about a construct
becomes available.
An AST is a condensed form of
the parse tree useful for representing language constructs. In a syntax
tree, operators and keywords do not appear as leaves, but rather are associated
with the interior node that would be the parent of those leaves in the
parse tree. Another simplification found in the syntax trees is that chains
of single productions may be collapsed. Syntax-directed translation can
be based on syntax trees as well as parse trees. This approach is the same
in each case. We attach attributes to the nodes as in a parse tree.
Those qualities of a BNF definition
that make parsing possible also create a resulting derivation tree containing
far more information than necessary for a semantic specification. For example,
the categories of terms and elements are required for accurate parsing,
but when ascribing meaning to an expression, only its basic structure is
needed. Although concrete syntax is essential to implementing programming
languages, it is the abstract syntax that lies at the heart of semantic
definitions. The concrete syntax is incidental to language specification,
but it is important to users since it in seuences the way they think about
a language. This aspect of pragmatics is not of direct concern to us in constructing
the AST.
The derivation tree retains
all the information used in parsing including details that only the parser
needs. On the other hand, an abstract syntax tree captures the syntactic
structure of the expression completely in a much simpler form. After all,
the crucial property of the expression “5*a – (b+1)” is that it is a difference
of a product and a sum of certain numbers and variables. Any other information
is redundant. In transforming a derivation tree into an abstract syntax
tree, we generally pull the terminal symbols representing operations and
commands up to the root nodes of subtrees, leaving the operands as their
children. Using this approach, this expression can be thought of as a binary
operation and two subexpressions. The choice of the left subtree for the
binary operation is arbitrary; it seems to suggest a prefix notation for
binary operations, but we are not talking about concrete syntax here, only
an abstract representation of certain language constructs. We may choose
any representation that we want as long as we can describe the constructs
of the language adequately and maintain consistency.
The point of abstract syntax
is simply to communicate the structure of phrases in terms of their semantics
in a programming language as trees. Semantics can be defined in terms of
derivation trees and actually is with attribute grammars, but most semantic
methods are far more understandable when based on a cleaner representation
of the phrases in a language. We want uniform templates for the various
parts of a language. The blueprints for the abstract syntax trees of a programming
language are specified by a collection of syntactic categories or domains
and a set of rules telling how categories are decomposed into other categories
or tokens.
To design the abstract syntax
for a programming language, we need to determine which notions (nonterminals)
are fundamental to the language and which basic forms the constructs of
the language may take. The definition of abstract syntax tolerates more
ambiguity since the concrete syntax has already determined the correct interpretation
of the symbols in the program text. As a consequence of adhering to the
BNF syntax of a language, any parsing algorithm must at least implicitly
create a derivation tree. But in fact we usually design a parser to generate
an abstract syntax tree instead of a derivation tree. Therefore the syntax
of “parse” is given by
Parse: Token* → Abstract Syntax Tree
It can be argued that when designing
a new programming language, we need to formulate the abstract syntax along
with the semantics so that the meaning of a program emanates from a mapping
meaning: Abstract Syntax Trees → Semantic Objects
where the semantic objects provide meaning for the various language
constructs. Different approaches to semantics depend on the disparate
kinds of objects used to define meaning. Given the abstract syntax of
a programming language, the concrete syntax can be defined by an operation
unparse: Abstract Syntax Trees → Concrete Syntax
where Concrete Syntax refers to derivation trees, to lists of tokens,
or to lists of characters representing program texts. Since two different
phrases following the concrete syntax may produce the same abstract syntax
tree, unparse may not be a function at all. To ensure that unparse is a
well-defined function, some canonical representation of concrete phrases
must be specified for example, taking expressions having the fewest parentheses.
The correctness of a parsing algorithm can be demonstrated by showing that
it is the inverse, in some sense, of the unparse function.
In our project, the AST is implemented
by considering all the above factors. The abstract syntax tree returned
by your parse method should be an instance of (a subclass of) the following
class (Refer – Source AST)
abstract class AST
{ }
We have defined an appropriate
class hierarchy for the nodes of the tree as defined in the language specification.
The class for each node should provide a toString method that returns,
enclosed in the brackets, the strings corresponding to the subnodes, or
if the symbol is a terminal symbol, the spelling. The empty string
€ is represented by null.
Note that we also have a null
literal. So to differeniate between these two null values, the null literal
values will be enclosed by the brackets (since they are terminals). Whereas
the null values will not be enclosed by the brackets. If the production
in abstract syntax contains terminal symbols, they also appear in the final
string. Our abstract syntax tree makes use of the inheritance properties
of the Java language. Each node in the syntax tree is implemented as a class
with several fields (Refer: Source-Abstract Syntax tree). For example, in
a BinaryOpExpression it has 3 fields. One is the operator, which is an
object of the Terminal class. The other two may be the objects of the Terminal
class or the objects of the Expression class. In this way several expressions
can be combines. Also the expressions, command, declarations may be chained
to form a sequenced ones. As these sequenced ones are derived from the
same abstract class as of he individuals both can be returned without ambiguity.
Also each abstract class has string valued attribute to indicate, precisely,
the type of the individual nodes. Later each subclass is added a visit
function. But those aspects are discussed in the Semantic checking phase.
In this way the abstract syntax tree is constructed and returned by the
parser to be used by the following phases of the compiler.
3.5 SYMBOL – TABLE GENERATION
A compiler
uses a symbol atble to keep track of scope and binding information about
names. The symbol table is searched every time a name is encountered in
the source text. Changes to the table occur if a new name or new information
about an existing name is discovered.
A symbol table
mechanism must allow us to add new entries and fine existing entries efficiently.
We use a symbol table mechanism called LeBlanc Cook Symbol table mechanism.
This symbol table generation uses Hashtables and stacks for keeping track
of the program. In general, Hashing schemes provide better performance
for somewhat greater programming effort and space overhead. Even linear
lists could be adapted readily to handle the most closely nested scope
rule.
It is useful
for a compiler to be able to grow the symbol table dynamically, if necessary,
at compile time. If the size of the symbol table is fixed when the compiler
is written, then the size must be chosen large enough to handle any source
program that might be presented. Such a fixed size is likely to be too
large for most, and inadequate for some, programs. Our symbol table uses
dynamic mechanisms for growth so it is possible to accept any size of programs.
Variations
of the searching technique known as hashing have been implemented in many
compilers. Here we consider a rather simple variant known as open hashing,
where “open” refers to the property that there need be no limit on the
number of entries that can be made in the table. Even this scheme gives
us the capability of performing ‘e’ inquiries on ‘n’ names in time proportional
to n(n+e)/m, for any constant ‘m’. Since ‘m’ can be made as large as we
like, up to n, this method is generally more efficient than linear lists
and is the method of choice for symbol tables in most situations. As might
be expected, the space taken by the data structure grown with ‘m’, so a
time-space tradeoff is involved.
Our symbol
table uses the Hashtable class and Stack class of java.util package. When
the control enters a function, it conceptually means that the control has
entered a new scope. For this a new scope id is generated. This global scope
id is unique and it is generated newly for each scope. Invalidated scopes
ids will not be used again. These scopes must be tracked down. So we use
the Stack. This stack is called as the scope stack. When a new scope is encountered
then this scope id value is pushed into the stack. When the scope is left,
the current scope id is reduced by 1 but the value stays in the table. This
is because during the debugging activities all the values must be in the
table. This makes the debugging activities easier. When a variable is encountered
in the declaration part, that variable is entered in the symbol table with
the current scope id. If it is encountered elsewhere then the scope stack
is consulted and the scope which is the most nearest to the top of the stack
is considered to be the appropriate scope of that particular variable.
This would make sense after the complete discussion of the design of the
symbol table.
Our design
of the symbol table uses two hashtables (Refer: Source- Symbol Table). The
first hashtable consists of the FQN as the key values. The object of this
hashtable is the second hash table. The second hash table has the key value
as the scope id and the object of this hash table is the object of the varObject
class, which is created as an abstract class in the same file. This abstract
class has varSymObject class and methodSymObject class as its subclasses.
These classes stand for the objects denoting the variables and the functions
respectively. The only difference between both these derived classes is
that the mthodSymObject has a field for holding the argumental parameters.
Otherwise, both the varSymObject and methodSymObject are the same.
The second
hash table is embedded into each row of the first hash table. When more
objects with same name is declared in various scopes, and then the problem
arises. We have to determine which object to choose from the available set
of objects. So the scope stack is used. All available scopes from the available
object are collected. We determine then which scope among them is at or
atleast nearer to top of the stack. Than that scope is assumed to be the
most recently declared scope and the object value at that scope is used.
FIGURE 7 – LEBLANC COOK SYMBOL TABLE
There is a
function called enterScope ( ). This is called when a new scope is encountered.
When this function is called, the global scope id is incremented by one
and the current scope id is set to the value of the global scope id. This
scope id is also pushed into the scope stack. There is a function called
leaveScope ( ). This is called when an existing scope becomes invalidated
i.e the control comes out of the scope. When this function is called the
stack is popped. This means that the current scope id, which is at the
top of the stack, is removed. So the invalid scope leaves the scope stack
but stays in the table. As mentioned earlier, this is used for the debugging
activities.
The function
insert is used to insert the object into the hash tables. Here the functional
polymorphism property of the Java Programming language is used. There are
two insert function but with varying number of arguments. The first insert
function is used to insert the variable objects and the second insert function
is used to insert the method objects. During the time of insertion, the
scope ide is checked for 0 and the type is checked whether it is equivalent
to class. At that instant the classname must be null. Or else error condition
arises. Also the name of the variable is checked with the classname and
the methodname. If they match then the error is notified. Else the process
continues smoothly. Now an appropriate type of the object is created with
the arguments. The arguments are scope modifiers, static modifiers, and
types or return values, and parameters. Now the symTable (first) is checked
for the key value.
If the key
value does not exist then a new entry is made in the table and then it is
inserted. If the key value already exists, the hashtable embedded in the
object column is obtained. It is checked whether the same scope value exists.
If it existes, then it means that the variable with the same name is declaraed
in the same scope exists. So it arises the erroneous condition and exits.
If there is no current scope id then the new object is also added to the
second hash table and it is again embedded into the first hashtable in its
previous position itself. In this way the insert operation is performed.
The function
lookup is used to lookup for the object values in the hashtables. The return
type of this function is the varObject since the returned object may be
either varSymObject or methodSymObject. During the lookup, the object with
that name as key is obtained. If it does not exist then the null valued object
is returned. If the object exists, then it is returned. The problem is
more than one object would exist as they could be defined in various scopes.
So to choose the best object all their scope ids are collected and then
the scope stack is checked. By checking with the scope stack the invalidated
scopes could be eliminated. Now with the valid scopes available, the one
which is at or atleast the most nearer to the top of the stack is considered
as the best and its corresponding object is choosen for the use.
Then there
is another function getScopeID ( ). This is just a helping function. It
uses the same logic as that of the lookup operation. But instead of returning
the object value it returns the scope value. This may be useful in simple
logics and also in the context checking phase. Generally, in the compiler
design principles the symbol table will enter the scene in the parsing
phase. But in our project we introduce the symbol table only in the contextual
constraints checking phase. Actually this symbol table generation is a
preliminary activity for the following phase. There is another function
called contains( ) which returns a Boolean value. The array of objects
and the integer values are its parameters. In this function actually the
scope values are checked for existence in the scope stack. This piece of
control logic is used to invalidate the invalid scope ids that exist in
the symbol table.
This symbol
table is used in the context checking phase. During the context checking
phase, the control walks through the AST that is it moves its control
through each child node of the tree. As it moves through the tree it may
enter the scope or leave the scopes. It may encounter the declaration of
variables and functions. It may encounter many function call commands.
It may come across many command that make use of the variables for which
it has to insert and lookup the values in the symbol table.In this way the
LeBlanc cook symbol table is used for the construction of our symbol table.
3.6 SEMANTIC ANALYSIS
In the semantic analysis phase certain checks are performed to ensure
that the components of the program fit together meaningfully. The semantic
analysis phase checks the source program for semantic errors and gathers
type information for the subsequent code-generation phase. It uses the
hierarchical structure determined by the syntax-analysis phase to identify
the operators and operands of expressions and statements.
FIGURE 8 – POSITION OF CONTEXT CHECKER IN COMPILER
MODEL
An important component
of semantic analysis is type checking. Here the compiler checks that each
operator has operands that are permitted by the source language specification.
For example, many programming language definitions require a compiler
to report an error every time a real number is used to index an array.
However, the language specification may permit some operand coercions,
for example, when a binary arithmetic operator is applied to an integer
and real. In this case, the compiler may need to convert the integer to
a real.
Semantic analysis
uses the visitor pattern to traverse the AST, decorating the AST nodes
and checking conditions specified below. It is necessary to add members
to our AST classes to hold the attributes we have designed and will use
our symbol table class. For example, since each expression has a
type, we would add a member to hold the type in your abstract Expression
class. Errors should be handled as before, by printing the position
of the token involved in the error, and terminating. Thus if check
returns, the AST has been successfully type checked.
Our AST is
parsed or the control just traverses through the node of the abstract syntax
tree and at each node as it encounters the various declarations, expressions
and commands the corresponding semantic rules are checked and appropriate
actions are taken. When the declarations are encountered they are entered
into the symbol table and when expressions or command are encountered then
lookup operations take place and the statements are validated. You may notice
that there are additional context constraints that it might make sense to
check, but the ones given below are the only ones you should implement.
• Identifiers must be declared
before they are used. CAN are not allowed.
• The program name cannot be redefined anywhere
in the program. A name can only be declared once in the same scope.
If a name is declared more than once, then the usage is bound to the closest
declaration. A procedure cannot contain a variable with the same
name as the procedure. A variable cannot have type void.
• Every expression has a type
determined as follows:
-
The type of a VarExpression is the declared type of the identifier.
-
The type of a NumLitExpression is int
-
The type of a BoolLitExpression is boolean.
-
The type of a FunctionExpression is the return type of its procedure
• The expression in an AssignCommand
must have the same type as the variable.
• The expression in a WhileCommand
must have boolean type.
• The expression in an IfElse
command must have boolean type.
• The types of the actual parameters
to a procedure or function call must match the types of the formal parameters.
• The return type of the function
must match with the type of the function.
• In the functions with the void return
type the return command must not have any expression i.e the return type
of the expression must be null.
• The functions must be declared
and defined before a call is made.
• No variable should be of type
void.
• If the functions of the other
classes are accesed then they must be public functions.
• The external class to which
a reference is made must be available in the classpath.
• The return type of the function
must match with that of the type on the invoking side.
• The variable in the other class which we are
trying to access must be a public variable. AIF & MIF fields must
be handled appropriately.
• The variable whose value we
are trying to modify must not be a final variable.
To implement
our context checking phase we use 4 files. The first is the change made
to the already built AST. A new function called the visitor with the arguments
ASTVisitor and Object is included in the abstract class AST. So this function
must be implemented by all the classes that inherit this abstract AST
class. This makes use of the inheritance features of the Java language.
Now the second file is an interface. This interface is the ASTVisitor
which is used as the argument for the visit function in the AST class.
This ASTVisitor interface has several functions to denote the visit functions
of each node. So these functions are defined by our actual context checker
class and each node in the abstract syntax tree actually implements its
corresponding functions and thus while traversing the node all the semantic
analysis check could be made.
The last file is the ParserUtil
class. This class is used to load the any other class specified in the Java
APIs. If the other user defined classes has to be referenced then those
classes must be added to the classpath. This class uses the ClassLoader,
Class classes and Method, Field, and Modifier classes of the java.lang.reflect
and java.util packages. It uses the the functions like loadClass which loads
the actual class into the memory space. There is another function called
getDeclaredMethods which gives all the declared methods in the class. But
this function does not give the constructor declarations. The constructor
declarations are also necessary to check for scope modifiers and also the
parameter types. So we use another function getConstructors to give all the
declared constructors of the class. There is another function called getReturnType
to determine the return type of the class. Also we use another function
getParameterType which gives the type of the parameters of the functions.
We also another functions getModifiers to determine the modifiers of the
function or variable. It returns a integer value which is then applied individually
to the functions like isPublic etc to determine its modifier value precisely.
In this way by using all these functions and classes the control shifts through
each node of the AST and performs the semantic checking in each node and
uses the symbol table for entries and lookup and notifies the error, if any.
Thus semantic checking is completed.
3.7 BYTE CODE GENERATION
In order to execute the programs in our language, we will generate code
for the JVM (Java Virtual Machine). Recall that a java program is translated
by the java compiler (command javac) into a set of class files containing
Java Byte Codes. These are then interpreted by the JVM (command java) to
execute the program. Our compiler will also translate the source programs
into class files containing java byte code, which are then interpreted by
the JVM.
Overview of the class file format
The format of a java class file is defined a stream of 8-bit bytes
consisting of a single ClassFile structure (where the structure is as in
the C programming language):
ClassFile {
u4 magic;
u2 minor_version;
u2 major_version;
u2 constant_pool_count;
cp_info constant_pool[constant_pool_count-1];
u2 access_flags;
u2 this_class;
u2 super_class;
u2 interfaces_count;
u2 interfaces[interfaces_count];
u2 fields_count;
field_info fields[fields_count];
u2 methods_count;
method_info methods[methods_count];
u2 attributes_count;
attribute_info attributes[attributes_count];
}
The magic, minor_version, and major_version are fixed values. The constant_pool_count
indicates how many entries are in the contant_pool + 1. This constant pool
is a table of structures representing various string constants, class names,
field names, etc. Access flags indicate the access permissions
and properties of the class (public, final, abstract, etc). The item this_class
refers to the class described by this classfile, and super_class, its
superclass. In our case, this_class will be set to the program name, and
super_class to "java/lang/Object". Interfaces and interface_count will
be set to 0 and null respectively. fields_count and fields contain information
about the fields of the class. In our case, these will hold information
about the global variables of our program. methods_count and methods_info
will contain information about procedures, including the name, superclass,
bytecode, max number of local variables, etc. The attributes_count and attribute-info
items will be 0 and null, respectively.
Instruction Summary
The Java virtual machine is stack based. A Java virtual machine
instruction consists of a one-byte opcode specifying the operation to
be performed, followed by zero or more operands supplying arguments or
data that are used by the operation. Many instructions have no operands
and consist only of an opcode.
Ignoring exceptions, the inner loop of a Java virtual machine interpreter
is effectively
do {
fetch an opcode;
if (operands) fetch operands;
execute the action for the opcode;
} while (there is more to do);
The number and
size of the operands are determined by the opcode. Each method invocation
causes the creation of a new frame. The frame contains an array of local
variables for the method, and an operand stack where most of the work is
done. The first items in the local variable array are the parameters of
the method. The remaining elements of the array are the variables
declared locally in the method. These variables are accessed by their
index in the array. The load and store instructions transfer
values between the local variables and the operand stack of a Java virtual
machine frame:
• Load a local int variable onto the operand
stack: iload. (Example, iload 2 loads the local int
variable at position 2 onto the stack)
• Store an int value from the operand stack into
a local variable: istore (Example, istore 2 stores the int value
on top of the stack to the local variable at location 2.)
• Load a reference in a local variable onto the
operand stack: aload
• Store a reference form the operand stack into
a local variable: astore
The instructions to load a constant onto the operand stack are
• Load a constant (literal) onto the operand
stack: ldc (Example ldc 0 loads the value 0 onto the stack)
These instructions take the name of the static variable as an operand.
• Load the contents of a static variable onto
the stack: getstatic
• Store a value from the operand stack into a static
variable: putstatic
The JVM instruction set contains separate instructions for the operations
on each primitive type. For example, there are multiple instructions that
add the two values on top of the stack: iadd, ladd, fadd, dadd. These operate
on ints, longs, floats, and doubles, respectively. Boolean types are represented
within the JVM as int values with true = 1 and false = 0.
• Add: iadd
• Subtract: isub
• Multiply: imul
• Divide: idiv
• Remainder: irem
• Negate: ineg
• Bitwise or: ior
• Bitwise and: iand
• Bitwise exclusive or: ixor
• The control transfer instructions conditionally
or unconditionally cause the Java virtual machine to continue execution
with an instruction other than the one following the control transfer instruction.
Some of theses are:
• Conditional branch: ifeq, iflt, ifle, ifne, ifgt,
ifge, ifnull, ifnonnull, if_icmpeq, if_icmpne, if_icmplt, if_icmpgt, if_icmple,
if_icmpge, if_acmpeq, if_acmpne.
• Unconditional branch: goto
The following instructions invoke methods:
• invokevirtual invokes an instance method of an
object, dispatching on the (virtual) type of the object. This is the normal
method dispatch in the Java programming language.
• invokestatic invokes a class (static) method in
a named class. This will be the instruction for the procedure and function
calls in our language.
• Method return instructions, are ireturn, (used
to return values of type int) areturn, (used to return references)
and return, to return from void methods.
Before a method
is invoked, its arguments are placed on the operand stack. The invoke
instruction then places these in the local variable array of the frame
for the invoked method where they can be accessed as local variables.
The return instruction leaves the result, if any on top of the stack of
the caller.
Implementation
In order to implement our compiler, we need to map the various kinds of
constructs in our language into the JVM constructs. So far,
we have mapped a program to a class, with the global variables of the program
mapped into static variables of the class, and the body of the program,
to the class's main method. Procedures will be mapped
to static methods of the class, and the local variables of the procedure
to local variables of the method.
In the JVM,
method calls cause a new frame to be created. The frame is destroyed
when the method call terminates. Each frame has an array of local
variables, an operand stack where computations are carried out, and a reference
to the constant pool of the class of the method. The local variable
array contains parameters that are passed to the method, plus the local
variables declared in the method.
Example:
int P(int x, int y){
int z;
int w;
z := x + y;
return z;
}
Here, the local variable array has length 4, with parameters
x and y mapped to slots 0 and 1, and local variables z and w, mapped to
slots 2 and 3. Store the slot number of local variables in the symbol
table.
The instructions for this method are the
iload(0) //load x, which is in slot 0
iload(1) //load y, whichis in slot 1
iadd //add,
leave result on top of stack
istore(2) //store result in z, which is in slot 2
iload(2) //get value in z and leave it on top of
the stack
ireturn //return from method, returning
int value on top of the stack.
At the end of the method, a return_ (for void) or ireturn (for procedures
that return an int) or areturn (for procedures that return an object).
Although for semantic correctness, every procedure needs a return statement,
our compiler doesn’t check this. It won't detect missing or type
incorrect return statements. These errors will be detected by
the JVM verifier, however, and will give an error when the generated class
file is executed.
Calling a method.
A method is called by pushing the parameters on the stack, and then
using the invokestatic instruction.
For example: suppose a is a global int variable, and invoke
procedure P (above) with parameters a and constant 2.
getstatic("thisclassname","a","I") //pushes the value stored
in global variable a on the stack
ldc(2)
//pushes the value 2 on the stack
invokestatic("thisclassname","P", "(II)I"); //after return,
the returned int is on top of the stack
The value returned by a method is found on top of the caller’s stack
after the method returns.
If and while loops
To compile if statements and while loops, we use the
ifeq (branch if top of stack is equal to 0) and
ifneq (branch if not equal to zero) instructions along with labels.
The method label ("x") inserts a label "x" at that point in the code
sequence.
The branch instructions take a label as an argument. For example
ifeq("x") jumps to the instruction following label x if the top of the
stack is equal to zero (and consumes the top of the stack).
"if b then c else d " would be compiled to
code to compute b and leave on top of stack
ifeq("else_part")
code for c
goto_("end_if")
label("else_part")
code for d
label("end_if")
and while b do s
goto_("test")
label("body")
code for s
label("test")
code to evaluate b
ifneq("body)
Note that every label must be unique as they are eventually
mapped to addresses. Thus, every time you generate code for a while
or if loop, you must generate unique labels.
Booleans and relational expressions
Our compiler gives booleans their own type Z, but they are still manipulated
with int instructions. To evaluate relational expressions, the following
instructions are used:
if_icmpeq (equal)
if_icmpne (not equal)
if_icmpge (greater than
or equal)
if_icmpgt (greater than)
if_icmple (less than
or equal)
if_icmplt (less
than)
goto_
(unconditional jump)
These methods take a string, representing a label as an argument.
The corresponding JVM instruction compares the top two elements of the
stack and if successful, jumps to the labeled instruction. If not
successful, then the next instruction in the sequence is executed.
The goto_ method always jumps to the labeled instruction.
To evaluate,
for example, the expression a < b and leave the result
on top of the stack, one could generate the following code
code to push the value of a on the stack
code to push the value of b on the stack
if_icmplt("true0") /*This does the less-than
comparison /
ldc(0)
/The computation continues here if result of the comparison is false. Push
0, representing false, on the stack. /
goto_("done0") /*Jump over code
that handles the true case. */
label("true0")
/*Mark the following instruction with the label "true0" /
ldc(1)
/*Computation goes here after if_icmplt if result of comparision
is true. Push 1, representing true, on the stack /
label("done0") /*Marks the
following instruction with label "done0"
......
*/Now the value of a<b is on the stack. Continue
with whatever
comes next
Use iand and ior instructions for && and ||.
In this way while parsing the AST the byte code
could be generated by using the software. The tools for constructing class
files were written by Joshua Engel and are provided with his book Programming
for the Java Virtual Machine, Addison Wesley, 1999. This software is available
under the GNU Library General Public License. The file jvm.jar must be
added to the classpath and our program imports the following packages to
use the tool:
import COM.sootNsmoke.jvm.*;
import COM.sootNsmoke.instructions.*;
The class JavaFile provides methods to support the construction of classfiles.
In particular, it provides convenient methods for adding items such as
fields and methods to the classfile and takes care of updating the constant
pool.
Thus the byte
code is generated in the file.class and this class file is able to run
on any standard JVM. This is because of the Java programming architecture
that the bytecode is portable to any machine. Thus the compiler construction
is completed with this phase.