RuleList

It is possible to mimic most languages with Grok32` by customizing one or more CharacterSets with appropriately designed Named Functions and procedures. Each CharacterSet has an associated RuleList which assigns meanings to individual Characters and thereby specifies how an Expression is parsed from a String. Each RuleList divides its Characters among the following subsets:

{Noops, Digits, Letters, ParaPuncs}

The NamedCharacter Set is the union of these four subsets. (This Set is not the same as the set of all RecognizedCharacters, which includes all Characters from all recognized CharacterSets.) With rare exceptions, any String containing a CharacterCode signifying an unNamedCharacter will not be successfully interpreted.
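The partition described above can be sketched in Python. This is only an illustration: the `RULE_LIST` table, its members, and the helper names `classify` and `is_interpretable` are assumptions made for the example, not part of Grok32` or of the Standard ASCII RuleList.

```python
# Hypothetical partition of a tiny character set into the four subsets.
# The members chosen here are illustrative only.
RULE_LIST = {
    "Noops": set(" \t\n"),
    "Digits": set("0123456789"),
    "Letters": set("abcdefghijklmnopqrstuvwxyz"),
    "ParaPuncs": set("[],()+*"),
}


def classify(ch):
    """Return the subset name for ch, or None for an unNamedCharacter."""
    for subset, members in RULE_LIST.items():
        if ch in members:
            return subset
    return None


def is_interpretable(text):
    """A String can only be interpreted if every Character in it is named."""
    return all(classify(ch) is not None for ch in text)
```

Under these assumptions, `is_interpretable("f[a, 1]")` holds, while a String containing ";" is rejected because ";" belongs to none of the four subsets.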

A RuleList is compiled as a String. This String's form conforms to a data-representation scheme whose Character assignments are read in ordered sequence by a RuleListReader. The order of assignment is critical since some Characters may use other Characters in their definition. This is the reason, for example, that the first Character in the first sublist (Noops) is the EscapeSequence for the EndCharacter.

The RuleListReader reads a CharacterSet's RuleList into memory (if it is not already there) when a String using the CharacterSet is interpreted as an Expression.

Noops

The Noops list is an ordered collection of CharacterCodes or Strings which may include "escape-sequence" tokens, non-printing characters, control codes, comments, terminations, empty-space characters, and Character errors. The Noops are distinguished by the fact that they either escape the character scheme, or have no literal significance to the Expression reckoned (from an input String containing them). The first and most important of the Noops is the EscapeSequence.

EscapeSequence

The EscapeSequencePrefix is a token invoking the EscapeSequenceFunction which defines how the Character sequence following the EscapeSequencePrefix is interpreted.

The first EscapeSequence character (sequence) must produce a CharacterSet's EndCharacter.

The Standard ASCII RuleList uses the backslash, "\", as the EscapeSequencePrefix, and "\0" to signify a String's EndCharacter. A CharacterSet can be defined without an EscapeSequence, in which case the EndCharacter code is placed where the EscapeSequence specification would otherwise be. The EndCharacter is a String's final escape sequence.

If the EscapeSequence is not just the CharacterCode for the EndCharacter, then it is specified as follows:

EscapeSequence = {EscapeSequencePrefix, EscapeSequenceFunction}

The EscapeSequence identifies special (non-printing) Characters (like EndCharacter, backspace, tab, EOF etc.), and provides a way to specify Characters by CharacterCode. A CharacterSet defined with an EscapeSequence must include the EndCharacter as one outcome. An example of a noop-character which might be an EscapeSequenceFunction result is EOF (the name normally given to the Character signaling the end of a CharacterStream). The EscapeSequencePrefix is a token that invokes the EscapeSequenceFunction, which reads a sequence of Characters following the prefix and returns a single CharacterCode as the result.

If the sequence following the EscapeSequencePrefix is not recognized, the EscapeSequenceFunction terminates, and the EscapeSequencePrefix is ignored. In all cases, the previous String Character parsing resumes when the "escape sequence" ends, as determined by the EscapeSequenceFunction. See NonPrintingCharacterToken.

For example, the EscapeSequence used by the Standard ASCII RuleList embodies the conventions for C-string escape sequences, which specify CharacterCodes using a three-digit octal number. (This technique does not generalize to all CharacterSets since an octal number presumes at least 8 elements in the Digits list.) The EscapeSequenceFunction decides how many Characters will be read as part of the EscapeSequence before control is returned to normal Character processing. (An EscapeSequence could lead to a QABS…)
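The prefix-and-function arrangement can be sketched as follows. This is a minimal Python illustration, assuming a backslash prefix and an octal-only EscapeSequenceFunction; the names `octal_escape` and `read_codes` are hypothetical, and the actual Standard ASCII RuleList may recognize more escape forms than this.

```python
def octal_escape(text, pos):
    """Hypothetical EscapeSequenceFunction.

    `pos` is the index just past the prefix. Reads up to three octal digits
    and returns (character_code, next_pos), or None if nothing is recognized.
    """
    end = pos
    while end < len(text) and end - pos < 3 and text[end] in "01234567":
        end += 1
    if end == pos:
        return None
    return int(text[pos:end], 8), end


def read_codes(text, prefix="\\", escape_function=octal_escape):
    """Turn a String into CharacterCodes, stopping at the EndCharacter (code 0)."""
    codes = []
    i = 0
    while i < len(text):
        if text.startswith(prefix, i):
            result = escape_function(text, i + len(prefix))
            i += len(prefix)      # the prefix is consumed either way
            if result is None:
                continue          # unrecognized: prefix ignored, parsing resumes
            code, i = result
            if code == 0:         # the EndCharacter terminates the String
                break
            codes.append(code)
        else:
            codes.append(ord(text[i]))
            i += 1
    return codes
```

Here the function, not the reader, decides how many Characters belong to the escape, which is the division of labor the text describes.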

(1) Noops = {EscapeSequence, Unassigned, Unrecognized,
   {StartComment, EndComment}, EmptySpace…}

The above can be described as the following pattern…

Noops = {(CharOrString | {EscapeSequencePrefix, EscapeSequenceFunction}),   (* EscapeSequence *)
   (CharOrString | Function),   (* Unassigned *)
   (CharOrString | Function),   (* Unrecognized *)
   {CharOrString, CharOrString},   (* {StartComment, EndComment} *)
   CharOrString…   (* EmptySpace… *)
}

…where CharOrString is the name given to the pattern matching a CharacterCode or a String.

Unassigned is the name given to any NamedCharacter that is elicited but unassigned. The Grok32Kernal only returns Character Strings that use NamedCharacters. If a NamedCharacter is output and the CharacterSet in use has no assigned CharacterCode, then the Unassigned CharacterCode is used. The Unassigned Name is assigned to this CharacterCode. If Unassigned is assigned to a Function, then this Function takes the Unassigned Character and returns a String that is spliced into the evaluating CharacterStream. The appearance of the Unassigned Character is usually evidence that the CharacterSet is incompletely defined, or is being used inappropriately.

A CharacterCode which is not a NamedCharacter is an UnrecognizedCharacter. The Name "UnrecognizedCharacter" is assigned to the CharacterCode returned whenever an UnrecognizedCharacter is encountered. Such characters will not be generated by the Grok32Kernal without the artificial construct…

   Cardinal[String][«charSet»[n]]

…where n is the unrecognized CharacterCode. If no glyph is assigned to unrecognized characters, the user may never know when an unrecognized CharacterCode is used.

The Unrecognized entry can be a CharacterCode, a String, or a Function. If it is just a CharacterCode or a String, then every unrecognized character is replaced by this code or String. When this 3^rd Noops list element is a Function, it is understood to take the unrecognized character as input and to produce one or more CharacterCodes as output. The output is then spliced into the Reckoned String's CharacterStream and subjected again to the CharacterSet's RuleList. If the UnrecognizedCharacter Function outputs an UnrecognizedCharacter, it will produce an infinite recursion.
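The splice-and-recheck behavior, and the recursion hazard, can be sketched in Python. The `NAMED` set, the function name `resolve_unrecognized`, and the `max_depth` guard are all assumptions added for this example; the text itself describes no such guard.

```python
NAMED = set("abc0123456789?")   # hypothetical NamedCharacter Set


def resolve_unrecognized(text, unrecognized, max_depth=8):
    """Replace each unNamed character in `text` using `unrecognized`.

    `unrecognized` is either a replacement string or a function mapping the
    offending character to a replacement string. The replacement is checked
    against NAMED again, so a handler that keeps emitting unrecognized
    characters would recurse forever; `max_depth` is an added safety guard.
    """
    def expand(ch, depth):
        if ch in NAMED:
            return ch
        if depth >= max_depth:
            raise RecursionError(f"Unrecognized handler never settles on {ch!r}")
        replacement = unrecognized(ch) if callable(unrecognized) else unrecognized
        return "".join(expand(c, depth + 1) for c in replacement)

    return "".join(expand(ch, 0) for ch in text)
```

With a fixed replacement, `resolve_unrecognized("a#b", "?")` substitutes the same glyph everywhere; with a function such as `lambda ch: str(ord(ch))`, the character is replaced by the digits of its code. A handler that returns its own input trips the guard.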

StartComment & EndComment

"Comment blocks" delimit character strings which are ignored by Reckon, Function, Compute, or Compile. Different Computer Languages have signature mechanisms to designate "comment blocks". For example, C uses "/*" to open a comment span which ends with "*/". Similarly, Mathematica comments begin with "(*" and end with "*)".
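As a rough Python sketch of how a {StartComment, EndComment} pair could be applied, the function below drops every delimited span. The name `strip_comments` is hypothetical, comments are treated as non-nesting, and an unterminated comment is assumed to swallow the rest of the String; none of these choices is specified by the text.

```python
def strip_comments(text, start="(*", end="*)"):
    """Remove every span delimited by `start` and `end` (non-nesting)."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith(start, i):
            close = text.find(end, i + len(start))
            if close == -1:
                break             # assumption: unterminated comment runs to the end
            i = close + len(end)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Passing `"/*"` and `"*/"` as the pair gives the C convention instead.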

An EmptySpace is a Character that displays empty space between Characters. Most CharacterSets include many different empty space Characters, and for this reason, the Noops list allows an unlimited sequence of EmptySpace Characters. Typically, the "space" Character, tab and carriage-return are EmptySpace Characters.

(A CharacterSet could define an "empty space" Character as a Letter, in which case it would not be an EmptySpace and it would make it possible to create Names with space between alphanumeric characters. But such a practice would make source code difficult for humans to decipher.)

Digits

Digits is the list of CharacterCodes representing digits.

(2) Digits = {CC0, CC1, CC2, …}

A digit is assigned its Cardinal value by its position in the Digits list.

The 1^st DigitChar is zero, the 2^nd is one, etc.

The number of DigitChars in Digits determines the CharacterSet's number base.
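A short Python sketch of positional digit values follows. The function name `cardinal` is hypothetical, and the sketch assumes numerals are written most-significant digit first, which the text does not state.

```python
def cardinal(numeral, digits):
    """Value of `numeral`, where each character's value is its index in `digits`.

    The number base is simply len(digits). Assumes the most significant
    digit comes first.
    """
    base = len(digits)
    value = 0
    for ch in numeral:
        value = value * base + digits.index(ch)   # ValueError if ch is not a digit
    return value
```

For instance, with `digits="01"` the base is 2, and with `digits="AB"` the Character "A" plays the role of zero and "B" the role of one.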

Letters

Letters is the ordered list of CharacterCodes representing letters.

(3) Letters = {CCa, CCb, …}

A letter glyph is the kind of Character one expects to find in a Name, which must start with a Letter Character.

The order of the elements in Letters determines which is first in sequencing (sorting) operations; that is, it determines the alphabetic order for the CharacterSet. (See SequenceList.)
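The effect of Letters order on sorting can be illustrated in Python. The helper `sort_names` is hypothetical and handles Names made of Letters only; how Digits inside a Name are ranked is not covered here.

```python
def sort_names(names, letters):
    """Sort Names using the position of each Character in `letters` as its rank."""
    rank = {ch: i for i, ch in enumerate(letters)}
    return sorted(names, key=lambda name: [rank[ch] for ch in name])
```

With `letters="bac"`, the Character "b" sorts before "a", so the result differs from ordinary ASCII ordering.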

ParaPuncs

ParaPuncs={ContextMark, «{ExprStart, ExprEnd}»,
   {StandardStart, StandardSeperator, StandardEnd},
   «ParaPuncSpec»… }

The Standard Expression Form looks something like the following…

head[arg1, …]

Humans and machines can parse the above Expression because the square brackets and the comma serve to separate and distinguish an Expression's anatomy. Similarly, quotation marks serve to separate a quote from the surrounding text. The convention of separating paragraphs with either an empty line or an indented first line is another example of a lexical grouping device, which is called a ParaPunc. There are many types of ParaPuncs used in both speech and written language. For example, tone or accent is normally used to stylize spoken words for distinctive effect. These behaviors are sophisticated auditory string ParaPuncs.

Expression-Strings have a wide variety of distinctive conjugate-Expressions that can be reduced to the Standard Expression Form (e.g. "head[arg1, arg2,…]"). Suppose the following character sequence is taken from an ASCII String conforming to the Standard ASCII RuleList. Then when…

   4*(a + b)

…is reckoned, it puts everything into the Standard Expression Form…

   Times[4, Plus[a, b]]

This is possible because the characters "*", "+", and the parentheses are special tokens in the Standard ASCII RuleList. Regular Expressions like the above use:

"[" as the StandardStart of the argument sequence,

"," as the StandardSeperator, and

"]" as the StandardEnd.

More generally, a CharacterSet can define its own distinctive {StandardStart, StandardSeperator, StandardEnd} in the 3^rd element of the ParaPunc Specification.

The StandardASCIIRuleList defines "(" (left parenthesis) and ")" (right parenthesis) as the only Expression-grouping pair. This means…

   («expr»)

…interprets "«expr»" as a discrete Expression. The value of this is realized with precedence-ordered operators.

An Expression is presumed between the left parenthesis (aka ExprStart) and the right parenthesis (aka ExprEnd). See ParaPunc Specification List below.

In principle, a RuleList can have more than one Expression-grouping token pair. For this reason, the RuleListReader will accept more than one duple (after the ContextMark) in the ParaPunc Specification List. Each "{ExprStart, ExprEnd}" pair has the same meaning, however. If both "(" and "{" are defined as ExprStart characters, for example, then they are indistinguishable as far as Expression-Reckoning is concerned. (Not recommended.) The following box is the form of a ParaPunc Specification List.

(*4*)

(*   ParaPunc Specification List Form   *)

ParaPuncs={ContextMark,
   «{ExprStart, ExprEnd}»,
   {StandardStart, StandardSeperator, StandardEnd},
   «ParaPuncSpec» }

(* "«ParaPuncSpec»" may be delimited by tokens {ExprStart, ExprEnd} to group Operators with the same precedence.
Otherwise Operators are in order of decreasing precedence.   *)

A "ParaPuncSpec" (see above) is a List of 2 to 4 CharOrString elements, interpreted according to the following scheme:

If there are two elements, the 1^st element is the operator (or token CharOrString), and the 2^nd element is the invoked Name. Operators have precedence; see below.
If there are three elements, then the 1^st element is the start, the 2^nd element is the end, and the 3^rd element is the Name invoked. The separator is assumed to be the StandardSeperator.
If there are four elements, then the 1^st element is the start, the 2^nd element is the end, the 3^rd element is the separator (character or String), and the Name invoked is the last element.
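The three cases above can be sketched as a small Python function. The name `read_para_punc_spec`, the dictionary layout it returns, and the example Names passed to it are assumptions for illustration only.

```python
def read_para_punc_spec(spec, standard_separator=","):
    """Interpret one ParaPuncSpec of 2, 3, or 4 elements as a dict."""
    if len(spec) == 2:
        token, name = spec                  # operator token and invoked Name
        return {"operator": token, "name": name}
    if len(spec) == 3:
        start, end, name = spec             # separator defaults to the standard one
        return {"start": start, "end": end,
                "separator": standard_separator, "name": name}
    if len(spec) == 4:
        start, end, separator, name = spec  # explicit separator
        return {"start": start, "end": end,
                "separator": separator, "name": name}
    raise ValueError("a ParaPuncSpec has 2 to 4 elements")
```

For example, `read_para_punc_spec(["{", "}", "List"])` yields a bracketed form that inherits the StandardSeperator, while a four-element spec supplies its own.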

ContextMark

The first element in the ParaPuncs is the ContextMark.

The ContextMark conjoins a Name with its containing Context(s).

All Names are created and recognized in Context.

The ContextMark's associated Function drops the actual ContextMarks, reduces the proto-Name to an ordered sequence of ContextNames, «cntxt1», «cntxt2», …, terminating with the LocalName, «localName», and generates a NameString with the following form:

Name[String][«cntxt1», «cntxt2»…, «localName»]

The ContextMark is the operator with the highest precedence.
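A minimal Python sketch of reducing a proto-Name to its ContextNames and LocalName follows. The backtick is assumed as the ContextMark purely for illustration (the text does not say which Character the Standard ASCII RuleList uses), and `split_name` is a hypothetical helper.

```python
def split_name(proto_name, context_mark="`"):
    """Drop the ContextMarks and return (context_names, local_name)."""
    *contexts, local_name = proto_name.split(context_mark)
    return contexts, local_name
```

Under that assumption, "Grok32`Std`x" reduces to the ordered contexts "Grok32" and "Std" followed by the LocalName "x"; a bare Name has an empty context sequence.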

Operators

Normally an Expression is interpreted left to right. The Expression Head or start token references a Named ElicitationForm (which may or may not match the interpreted arguments). Real-world String-Expressions are not so regular. Often an argument is introduced before its relation to the larger containing Expression is revealed. A simple example would be an English text parser that invokes Sentence[word1, word2,…] when a period (".") is encountered in some text's CharacterStream. In this case, the period is the Operator token for the Sentence algorithm.

Operators are not merely retroactive tokens, they are also used as conjunctions between Expressions.

As a conjunction the Operator invokes a specific Head, taking arguments from either side of the copula.

A copula is a token that invokes a Name parameterized by the string-expressions conjoined by the copula. A copula has the following form…

(* Operator Specification Form *)

   {«conjuctionString», function}

(* …where «conjuctionString» is the String matched by the copula, and function is the invoked Head. *)

Since conjunctive operators can be written adjacent to one another, there are precedence issues. See Arithmetic Copulas for an example.

Operators are ordered by precedence in the ParaPuncs list.

By contrast, ParaPuncs with "start" and "end" tokens unambiguously define their bounds, and consequently have no relative precedence.

If two or more copulas have the same precedence, then they are grouped together with the ExprStart-ExprEnd pair. (This pair is the parentheses in the StandardASCIIRuleList.) ExprStart & ExprEnd are the 2nd tokens listed so they can be used to interpret subsequent copulas of equal precedence. If the ExprStart & ExprEnd pair is, respectively, the left and right parenthesis, and the String ParaPunc tokens are the apostrophe pair, then the copula precedence sequence for the standard arithmetic operators would look like the following:

("*", "/", "%"), ("+", "-"), ("<", ">", "<=", ">="), ("==", "!=")

In the above example, the multiplication (*), divide (/), and modulus (%) operators are all in the same parentheses and, therefore, have equal precedence.

When Reckon[String] interprets a copula in the CharacterStream, whose associated Function is named func, it precipitates the following routine:

(a) The previous Expression is presumed to be one (probably the first) of the Function Slots associated with func.

(b) In any case, the previous Expression is presumed to be Slot[1] to func, and the next Expression (after the copula) is Slot[2].

(c) If the previous Expression already has an argument Slot in some already-named-procedure, which will be called initFunc, then the procedure branches according to the following contingencies:

(d) If initFunc is func, then it just joins func's argument sequence, regardless of whether this joinder is allowed by func's ElicitationForm. initFunc has already established a SlotN, which is the number of arguments already known to be ordered arguments for the initFunc Function. initFuncSlotN is incremented as the current number of known arguments to initFunc; the previous Expression from (b) above is then Slot[initFuncSlotN], and the next Expression (after the copula) will be Slot[initFuncSlotN + 1].

(e) If initFunc is not func, then…

i. If func has higher precedence than initFunc, then the previous Expression is Slot[1] to func. HeadStack stores a reference to initFunc "on top", and func becomes initFunc.

ii. If func has lower precedence than initFunc, then the previous Expression is completed with initFunc, whose reference is removed from HeadStack, and the completed Expression becomes Slot[1] in func.

In other words, when a copula of lower precedence is encountered in ReckonString, it completes the previous copula-rendered-Expression, and uses that result as the first argument to the new copula.

When a copula of higher precedence is encountered, it makes the previous Expression Slot[1] and the next Expression Slot[2].

See ReckonString Source for more details.
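The precedence behavior described above can be sketched in Python. This is not the ReckonString source: it swaps in precedence climbing, a recursive technique in which the call stack plays the role the text gives to HeadStack. The Head names in `GROUPS`, the left-to-right treatment of different copulas with equal precedence, and the function names are all assumptions for illustration.

```python
import re

# Precedence groups, highest first, following the arithmetic example above.
# The Head names attached to each token are assumed.
GROUPS = [
    {"*": "Times", "/": "Divide", "%": "Mod"},
    {"+": "Plus", "-": "Subtract"},
]
HEAD = {tok: head for group in GROUPS for tok, head in group.items()}
LEVEL = {tok: len(GROUPS) - i for i, group in enumerate(GROUPS) for tok in group}
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[-+*/%()])")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unNamed character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def reckon(text):
    """Parse an infix String into nested (head, arg, ...) tuples."""
    tokens = tokenize(text)
    pos = 0

    def primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of String")
        tok = tokens[pos]
        pos += 1
        if tok == "(":                       # ExprStart: a discrete Expression
            expr = expression(0)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ExprEnd")
            pos += 1
            return expr
        if tok in HEAD or tok == ")":
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    def expression(min_level):
        nonlocal pos
        left = primary()
        joined = None                        # Head that `left` was built with here
        while pos < len(tokens) and LEVEL.get(tokens[pos], -1) >= min_level:
            op = tokens[pos]
            pos += 1
            right = expression(LEVEL[op] + 1)   # higher-precedence copulas bind first
            head = HEAD[op]
            if joined == head:               # step (d): same Head, join its arguments
                left = left + (right,)
            else:                            # lower or equal precedence: completed
                left = (head, left, right)   # Expression becomes Slot[1]
                joined = head
        return left

    result = expression(0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


def show(expr):
    """Render a parsed tuple in the Standard Expression Form."""
    if isinstance(expr, str):
        return expr
    head, *args = expr
    return f"{head}[{', '.join(show(a) for a in args)}]"
```

Under these assumptions, `show(reckon("4*(a + b)"))` gives the Standard Expression Form from the earlier example, and a repeated copula such as "a + b + c" joins a single Plus rather than nesting.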

Arithmetic Copulas

The Standard ASCII RuleList will parse the following algebraic string-expression…

   "2*(4^3/7+1)"

…into…

   Times[2, Plus[Divide[Power[4, 3], 7], 1]]

…based on both the precedence of the operators, and the ability to distinguish the appropriate copula in the ExpressionCharacterMachine.

Linguistic Design and Operators

A copula could be an arithmetic operator between algebraic expressions, or a copula could be the relation of subject to predicate in a sentence. The design of the copulas in a CharacterSet determines how Strings are parsed into Expressions. In principle, any language string can be mimicked or interpreted, whether the language is English, algebra, or whatever. The ordered copula list must be designed according to the intended linguistic application. (The default ASCII CharacterSet EscapeSequence and arithmetic operators implement linguistics in conformance with C's string-expressions, and the forms used in this Language Specification.)

Without the Operators list, a String Reckoned as a Character sequence must conform to the standard "head[arg1, arg2,…]" format. Otherwise, the String will fail to Reckon as an Expression. By contrast, the copula evokes the Expression Heads with arguments from before and after the copula. This greatly increases the linguistic flexibility and power of String-Expressions.

The RuleList for a CharacterSet customizes the rules for Reckoning Strings into Expressions. By combining a customized CharacterSet with appropriately designed Named Functions and procedures, it is possible to mimic most languages with Grok32`.

RuleListReader

Since a RuleList is compiled as a String which is interpreted as an Expression, its form comports to a data-structure whose Character assignments are used to interpret its literal String. A RuleList can be specified in the CharacterCode it applies to, provided it assigns meaning prior to using a Character. This requirement guides the order of the Characters and Strings assigned in RuleList.

When a RuleList is read into memory, (by RuleListReader), the following Names are assigned:

{Noops, (* = {EscapeSequence, Unassigned, Unrecognized, StartComment, EndComment, «EmptySpace»} *)
Digits, (* = {CC0, CC1, CC2, …} *)
Letters, (* = {CCa, CCb, …} *)
ParaPuncs (* = {ContextMark, «{ExprStart, ExprEnd}»,
   {StandardStart, StandardSeperator, StandardEnd},
   ParaPuncSpec… } *)
}

Grok32`

© 2004, 2005

by John Van Wie Bergamini