| Scanners := a program that reads a string and brakes them into symbols Parsers := a program that reads symbols and checks if they are in the correct order. that is, if the syntax is correct In this section you will find tutorials about compiler design. That is, I will teach you how to design your own programming languages using C/C++, Java and compiler construction tools such as Flex, Bison and SableCC. If you are into games programming, then the ones you want to look into are Flex and Bison because they generate scanners (a program that reads a string and brakes them into symbols) and parsers (a program that reads symbols and checks if they are in the correct order. that is, if the syntax is correct) for C/C++. SableCC is a compiler framework for Java. It generates scanners and parsers for the Java language, as well as a tree representation of the programs along with the programs to process them. To begin with, you could start reading regular expressions because that is the language we use to define scanners. You then follow the tutorials on context-free grammars, which are used to describe the correctness of the language. That is, what forms valid sentences and what does not. After you have introduced yourself to both regular expressions and context-free grammars, You may follow the tutorials on SableCC, Flex and Bison. At the moment, you will only find a tutorial on how to write a compiler with SableCC. But I will add a Flex and Bison one soon. I hope you find this section useful. |
| Regular expressions | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Regular expressions are a convenient means of specifying certain simple (though possibly infinite) sets of strings. They are of practical interest because regular expressions can be used to specify the structure of the tokens used in a programming language. In particular, regular expressions can be used to program a scanner generator. The sets of strings defined by regular expressions are termed regular sets. We start with a finite character set, or vocabulary (denoted V). This vocabulary is normally the character set used to form tokens. An empty or null string is allowed (denoted λ, "lambda"). Strings are built from characters in V via concatenation (for example, :=, begin, 123).The null string, when concatenated with any string S, yields S. That is, Sλ ≡ λS ≡ S. Concatenation can be extended to sets of strings as follows: Let P and Q be sets of strings. Then string S ∈ (P Q) if and only if S can be broken into two pieces: S = S1 S2, such that S1 ∈ P and S2 ∈ Q. Small finite sets are veniently represented by listing their elements, which can be individual characters or strings of charaters. Parentheses are used to delimit expressions, and |, the alternation operator, is used to separate alternatives. The characters (, ), , and | are meta-characters (punctuation and regular expression operators). Meta-characters can be quoted when used as ordinal characters to avoid ambiguity. (Any character or string may be quoted, but unnecessary quotation is avoided to enhance readability.) For example: Alternation can be extended to sets of strings. Let P and Q be sets of strings. Then string S ∈ (P | Q) if and only if S ∈ P or S ∈ Q. Large (or infinite) sets are conveniently represented by operations on finite sets of characters and strings. Concatenation and alternation may be used. A third operation, Kleene closure, is also allowed. The operator * will represent the postfix Kleene closure operator. Let P be a set of strings. String S ∈ P* if and only if S can be broken into zero or more pieces: S = S1 S2 S3 Sn such that each Si ∈ P. A string in P* is the concatenation of zero or more selections (possibly repeated) from P. (We explicitly allow n = 0, so that λ is always in P*.) Regular expressions are defined as follows. Each regular expression denotes a set of strings ( a regular set).
Any finite set of strings can be represented by a regular expression of the form (S1 | S2 | | Sn). We often utilize the following operations as notational convenience. They are not strictly necessary, because their effect can be obtained (albeit somewhat clumsily) using the three standard regular operators (alternation, concatenation, kleene closure):
We now illustrate how regular expression can be used to specify tokens. Let Then
All finite sets and many infinite sets, such as those just listed, are regular. But not all infinite sets are regular. For example, consider { [k]k | I ≥ 1}, which is the set of balanced brackets of the form [[[[[ ]]]]]. This is a well-known set that is not regular. The problem is that any regular set either does not get all balanced nesting or it includes extra, unwanted strings. It is easy to write a context-free grammar (described in the next section) that defines balanced brackets precisely. All regular sets can be defined by context-free grammars; thus, the bracket example shows that context-free grammars are more powerful descriptive mechanism that regular expressions. Regular expressions are, however, quite adequate for specifying token-level syntax. If you have any queries, please do not hesitate to contact me. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Traditionally, context–free grammars have been used as a basis of the syntax analysis phase of compilation. A context–free grammar is sometimes called a BNF (Backus–Naur form) grammar. Informally, a context–free grammar is simply a set of rewriting rules or productions. A production is of the form
A is the left–hand–side (LHS) of the production. B C D … Z constitute the right–hand–side (RHS) of the production. Every production has exactly one symbol in its LHS; it can have any number of symbols (zero or more) on its RHS. A production represents the rule that any occurrence of its LHS symbol can be represented by the symbols on its RHS. The production states that a program is required to be a statement list delimited by a begin and end. Two kinds of symbols may appear in a context–free grammar: nonterminal and terminals.
In this tutorial, nonterminals are often delimited by < and > for ease of recognition. However,
nonterminals can also be recognized by the fact that they appear on the left–hand sides of
productions. A nonterminal is, in effect, a placeholder. All nonterminals must be replaced,
or rewritten, by a production having the appropriate nonterminal on its LHS. In constrast,
terminals are never changed or rewritten. Rather, they represent the tokens of a language.
Thus the overall purpose of a set of productions (a context–free grammar) is to specify
what sequences of terminals (tokens) are legal. A context–free grammar does this in a
remarkably elegant way: We start with a single nonterminal symbol called the start or goal
symbol. We then apply productions, rewriting nonterminals until only terminals remain.
Any sequence of terminals that can be produced by doing this is considered legal. To see
how this works, let us look at a context–free grammar for a small subset of Pascal that we
call Small Pascal. λ will represent the empty or null string. Thus a production
states that A can be replaced by the empty string, effectively erasing it. Programming language constructs often involve optional items, or lists of items. To cleanly represent such features, an extended BNF notation is often utilised. An optional item sequence is enclosed in square brackets, [ and ]. For example, in
a program can be optionally labelled. Optional sequences are enclosed by braces, { and }. Thus in a statement sequence is defined to be a single a statement, optionally followed by zero or more additional statements. An extended BNF has the same definitional capability as ordinary BNFs. In particular, the following transforms can be used to map extended BNFs into standard form. An optional sequence is replaced by a new nonterminal that generates λ or the items in the sequence. Similarly, an optional sequence is replaced by a new nonterminal that genereates λ or the sequence of followed by the nonterminal. Thus our statement sequence can be transformed into <statement sequence> → <statement> <statement tail> The advantage of extended BNFs is that they are more compact and readable. We can envision a preprocessor that takes extended BNFs and produces standard BNFs. Figure 3.1 shows a context–free grammar for a Small Pascal. An augmenting production
appears in the grammar to make sure that the string matched by <system goal> includeds all the input. It specifies that the end–of–file marker, EOF, must follow after the last valid token of a program.
1. <system goal> → <program> EOF To see how this grammar defines legal Small Pascal programs, let’s follow the derivation of one such program, begin ID := ID + ( INTLITERAL – ID ) ; end EOF, starting from the nonterminal <system goal>.
<system goal> At this point no nonterminals remain, so our derivation of a Small Pascal program is completed. A context–free grammar defines a language, which is a set of sequences of tokens. Any sequence of tokens that can be derived using the grammar is valid; any sequence that cannot be derived is not valid. Actually, to be precise, any token sequence derivable from a context–free grammar is considered syntactically valid. When static semantics are checked by the semantic routines, semantic errors in a syntactically valid program may be discovered. For example, in Pascal the statement
has no syntax errors but it does have a semantic error: the operator + is not defined for adding a character to a boolean value. Structure as well as syntax can be defined in a context–free grammar. For expressions, this includes associctivity and operator precedence. Associativity is concerned with the order in which consecutive instances of an operator are applied (as in A–B–C). Operator precedence refers to the relative priority of operators. For example, we expect A+B*C to mean A+(B*C), since * is usually considered to be a higher precedence operator than +. Small Pascal has only one level of precedence, because it does not include multiplication nor addition. If multiplication were included, it would be at a higher level of precedence than addition. The following grammar fragment defines such a precedence relationship. <expression> → <term> {<add op> <term>}
<term> → <factor> {<mult op> <factor>} Examining a derivation tree for the expression A+B*C, which is derived from this grammar fragment, illustrates how this definition works. A derivation tree is formed by showing nonterminals expansions as subtrees. Figure 3.2 shows the tree for this expression. ![]() The tree shows that * has a higher precedence than + because the second and third Ids are grouped together (with *) in a subtree, and then this subtree is grouped together with the first ID the first ID (using +) in the main derivation tree. Now we may ask: can a derivation tree with wrong precedence ever be formed in this grammar? No – the production rules don’t allow it. Try to build ID+ID*ID with + applied first. Since a * can only be generated by a <term>, ID+ID must appear unless the subexpression is enclosed in parentheses. With parentheses, the desired grouping can be forced. We have so far introduced regular expressions and context–free grammars, which are respectively used for defining the lexer and parser of a grammar. There is one more thing the reader must know, which is the type of grammar used in SableCC. In this paper we will give just a brief idea of a particular grammar, LALR(1), a subset of LR(1), which is commonly used in the description of almost any programming language. If you would like to know more about other types of grammars, such as LL(k) or LR(k), then you should refer to [1] or [2]. Instead of describing the LALR(1) type of grammar in full, we will just outline the
conflicts that may occur in this type of grammar. There are two types of conflicts that may occur in this type of grammar: Reduce⁄Reduce and Shift⁄Reduce conflicts. A reduce⁄reduce conflict occurs when the parser does not know which production to reduce. For example, the grammar
is not LALR(1) because if the input to the parser is d, the parser will not know if it reduces to B or if it reduces to A, because both A and B can be replaced by d. To solve this conflicts, we eliminate production B, which is a subset of A, and change the grammar to Now if the input to the parser is a d, it will reduce to A. As for shift⁄reduce conflicts, they occur when the parser does not know if it reduces or if it shifts (pushes the token or nonterminal to the stack) the current input. For instance, let’s analyse the grammar A → BaDe | BaDeCe Suppose that the input is cabe. When the parser receives this input, it first reduces c to B, then it shifts it (pushes it to the stack). It then reads a and shifts it, next it reads b, reduces it to D and shifts it, finally it reads e and know it does not know if it shifts e or if it reduces the string cabe to A. Now that you know the theory behind regular expressions and grammars you can dive into the Sablecc tutorial, where I show you how to write a small compiler using a compiler generator tool called SableCC. If you have any queries regarding this tutorial, just drop me an email. References:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|