String Implementation

Grok32` was invented to translate any language into a Grok32` Expression. This is accomplished by compiling the rules for parsing any language from Character Strings. The Character String RuleLists are compiled and stored as CharacterSet assignments in "`Construct`String`Name`".

The RuleLists implement an ergonomic system that translates other languages into Grok32` Expressions. This is realized with a String object that distinguishes parsing rules by CharacterSet. Each CharacterSet has its own RuleList which assigns Character semantics. The String abstraction enables unlimited linguistic and character flexibility together with the benefits of ASCII standardization.

Grok32` is a true meta-language. In principle, any language string can be mimicked or interpreted, whether the language is English, algebra, or anything else. A copula (see ParaPuncs) can be an arithmetic operator between algebraic expressions, or it can be the relation of subject to predicate in a sentence. The design of the copulas in a CharacterSet determines how Strings are parsed into Expressions. Whether Grok32` is applied to English semantics or to algebraic simplification, the CharacterSet's EscapeSequence and ordered copula list must be designed according to the intended linguistic application. The default ASCII CharacterSet EscapeSequence and arithmetic operators implement linguistics in conformance with C's strings and with the forms used in this Language Specification. Errors in this Language Specification or the Kernel are corrected upon discovery.

`Construct`String`Expression`

`Construct`String`Expression` holds the algorithms that attempt to transform a String into an Expression. This is the basic implementing algorithm in Reckon[String[…]].

Each Character in a String is Reckoned by applying a RuleList assigned to the CharacterSet.

`Construct`String`Expression`Character` contains the `RuleList` and `Set` subcontexts, which house this subject.

The ability of the RuleList to define the way a CharacterSet is parsed into an Expression is bounded by the relatively simple Expression Character Machine definition. The Expression Character Machine is really just a modern, well-adorned version of the Turing Machine! `Construct`String`Expression`Character`Machine` contains this subject.

`Construct`String`Name`

`Construct`String`Name` contains the Standard ASCII Rule List, "ASCII", and others. This Context houses the RuleLists for named CharacterSets. Any language modeled by Grok32` must have its compiled RuleList in this Context to make it generally available for language translation. The name assigned to this compiled RuleList should match the StringName assigned to the modeled language.

String CharacterSet Encoding

The String object implements a CharacterSet's RuleList, which determines the style of Expression elicitation. The RuleList defines the subsets {Noops, Digits, Letters, ParaPuncs}, which together implement character sequence tokens that allow any linguistic String to be parsed as an Expression. For this translation to work, an other-lingo String must obey its own semantic grammar (defined by its Language Specification). Furthermore, the other-lingo String RuleList must be consistent.

By default, a String's CharacterSet is ASCII, but any N-byte-sized CharacterSet will do.

Each byte either represents one of the 128 ASCII Characters, or the byte specifies one of 127 CharacterSets. If the byte specifies a new CharacterSet, then subsequent String bytes are interpreted as CharacterCodes from that Set. There are two different String parsing states: default ASCII and the more general (non-ASCII) String parsing.

Default ASCII String Parsing

7-bit ASCII (128 characters) is the default CharacterCode. The first byte of a String is parsed as an ASCII character unless its ASCIIbit is False (that is, its most significant bit is 1), in which case its remaining 7 bits identify one of 127 possible non-ASCII CharacterSets. Over the subsequent span of characters, the CharacterStream is interpreted according to the CharacterCode of that non-ASCII String.

In default ASCII mode, the most significant bit of each byte is 0 if the byte represents a 7-bit ASCII character. This bit is called the ASCIIbit. An ASCII String lasts as long as every byte in the sequence has 0 in its ASCIIbit. The moment the ASCIIbit is 1, the ASCII String ends, and the remaining 7 bits identify another CharacterSet. In this case, the ASCII String is followed immediately by a non-ASCII String.
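The byte-level rule above can be sketched as follows. This is a minimal illustration in Python, not Grok32`'s own implementation; the function name `split_ascii_run` and its return shape are hypothetical.

```python
def split_ascii_run(data: bytes):
    """Return (ascii_text, charset_id, rest) for the leading ASCII run of data.

    charset_id is None when the whole input is ASCII.
    """
    for i, byte in enumerate(data):
        if byte & 0x80:                 # ASCIIbit is 1: the ASCII run ends here
            charset_id = byte & 0x7F    # remaining 7 bits identify the CharacterSet
            return data[:i].decode("ascii"), charset_id, data[i + 1:]
    return data.decode("ascii"), None, b""


text, charset_id, rest = split_ascii_run(b"abc\x85\x01\x02")
print(text, charset_id, rest)
# prints: abc 5 b'\x01\x02'
```

The bytes after the switch byte are returned untouched, since their meaning depends on the CharacterCode of the CharacterSet that was selected.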

General String Parsing

A String's Characters are presumed to be ASCII, and are parsed using the Default ASCII String Parsing described above. A feature of this default parsing is that any byte may have a False ASCIIbit, in which case either the String begins a new CharacterSet (identified by the 7 other bits in the byte), or the String ends. This system allows user-defined Character semantics that preserve ASCII standardization together with unlimited CharacterSet flexibility.

A non-ASCII String continues until its EndCharacter or StringEscapeCode is reached.

• The EndCharacter terminates the containing String.

• The StringEscapeCode initiates the Default ASCII String Parsing.
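The two parsing states and the two terminators can be sketched as a small state machine. This is only an illustration under stated assumptions: the byte values chosen for `END_CHARACTER` and `STRING_ESCAPE_CODE` are hypothetical (the text does not fix them), each non-ASCII CharacterCode is assumed to occupy one byte, and `parse_segments` is an invented name.

```python
END_CHARACTER = 0x00        # hypothetical value; the specification does not fix it
STRING_ESCAPE_CODE = 0x1B   # hypothetical value; the specification does not fix it


def parse_segments(data: bytes):
    """Split data into (charset_id, payload) segments; charset_id None means ASCII."""
    segments = []
    charset_id = None            # parsing starts in default ASCII mode
    current = bytearray()

    def flush():
        # Close the segment collected so far under the active CharacterSet
        if current:
            segments.append((charset_id, bytes(current)))
            current.clear()

    for byte in data:
        if charset_id is None:                # default ASCII parsing
            if byte & 0x80:                   # ASCIIbit is 1: switch CharacterSet
                flush()
                charset_id = byte & 0x7F
            else:
                current.append(byte)
        elif byte == END_CHARACTER:           # terminates the containing String
            break
        elif byte == STRING_ESCAPE_CODE:      # back to default ASCII parsing
            flush()
            charset_id = None
        else:
            current.append(byte)
    flush()
    return segments


print(parse_segments(b"ab\x83\x10\x11\x1bcd"))
# prints: [(None, b'ab'), (3, b'\x10\x11'), (None, b'cd')]
```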

The Grok32` Kernel keeps track of each CharacterSet it encounters. Each CharacterSet employs its own RuleList. The presumption is that a relatively small number of CharacterSets are used in a session.

The identity of a CharacterSet may be specified with 7 bits because the String object keeps track of the CharacterSets which have been read into memory. The name of a CharacterSet is assigned a 7-bit Cardinal which codes its identity within the String's structured ByteString sequence. When a complex String (one using many CharacterSets) is Compiled, the functional equivalent of a With[…] construct is written. This is the subject of TextFile.
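One way to picture the 7-bit Cardinal bookkeeping is a small registry that hands out identifiers as CharacterSet names are first encountered. The class below is a hypothetical sketch: the numbering scheme (1 to 127 in order of first use, with 0 left unused) is an assumption, not something the text specifies.

```python
class CharacterSetRegistry:
    """Assign each CharacterSet name a 7-bit Cardinal on first use (hypothetical helper)."""

    MAX_SETS = 127  # the text allows 127 non-ASCII CharacterSets

    def __init__(self):
        self._ids = {}

    def cardinal(self, name: str) -> int:
        if name not in self._ids:
            if len(self._ids) >= self.MAX_SETS:
                raise ValueError("no 7-bit Cardinals left")
            self._ids[name] = len(self._ids) + 1   # assumed numbering: 1..127
        return self._ids[name]

    def switch_byte(self, name: str) -> int:
        """Byte that switches a String to this CharacterSet: top bit set plus the Cardinal."""
        return 0x80 | self.cardinal(name)


registry = CharacterSetRegistry()
print(registry.cardinal("isoLatin1"), hex(registry.switch_byte("isoLatin1")))
# prints: 1 0x81
```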

String as "enum" datatype

Proposal 1.

The String may be implemented with the C enum data-type.

Why is the C enum data-type a good candidate?

"When a variable can have only one of a set of values, we may use the enumeration type facility to specify the possible values of the variable. We define an enumeration type by giving the keyword enum followed by an optional type designator and a brace-enclosed list of values.

An enumeration type that allows only the days of the week is defined by

enum Days
{
    Sunday, Monday, Tuesday, Wednesday, …
}"

The formal proposal is to somehow specify the various RuleList elements using the C enum construct…

To do this, the C compiler will need to be built into Grok32` so that it can compile a template file which will yield the CharacterCode CompiledObject. This is what happens when…

String[Name["charSetNam"]][RuleList]

…is invoked. See CharacterSet.

etc.

• Each String Character has a CharacterSet.

• The actual value assigned to the displayed "characters" may be any type of object. See (3) below and see ReckonString.

What does "alphanumeric" mean in non-ASCII Character Sets?

As noted above, Grok32` source code text is interpreted as ASCII by default. Furthermore, a Name's literal String is a sequence of alphanumeric characters. This has an unambiguous meaning for ASCII Strings.

For non-ASCII Strings, the alphanumeric characters comprise the Set formed by the union of Digits and Letters, which is assigned together with the rest of a CharacterSet's RuleList. The CharacterSet's alphabetic order, digit precedence, and operator precedence are established by the order of the Digits, Letters, and Operators lists made there.
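A toy model makes the two claims concrete: alphanumeric membership is the union of Digits and Letters, and alphabetic order follows list position. The `RuleList` class here is a hypothetical, pared-down stand-in that holds only those two lists; it is not the structure Grok32` compiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleList:
    """Hypothetical, pared-down RuleList holding only Digits and Letters."""
    digits: tuple
    letters: tuple

    def is_alphanumeric(self, code) -> bool:
        # Alphanumeric = union of Digits and Letters
        return code in self.digits or code in self.letters

    def alphabetic_key(self, word):
        # Alphabetic order follows each Letter's position in the Letters list
        return [self.letters.index(code) for code in word]


# A toy CharacterSet whose alphabet is ordered c < a < b
toy = RuleList(digits=("0", "1"), letters=("c", "a", "b"))
print(sorted(["ab", "ca", "ba"], key=toy.alphabetic_key))
# prints: ['ca', 'ab', 'ba']
```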

The "ctype" library contains various functions which make all the essential distinctions found between C characters.

Similarly, Grok32` should be able to distinguish between the following types of characters: "alphabetic", "digits", "empty-space", "control characters", "printing-not-space", "octal digit", "punctuation". Many of these distinctions are part of a CharacterSet's RuleList. Here is the RuleList lexicon.

Noops, = {EscapeSequence, Unassigned, Unrecognized, StartComment, EndComment, "EmptySpace"},
Digits, (* = {CC0, CC1, CC2, …} *)
Letters, (* = {CCa, CCb, …} *)
ParaPuncs, = {ContextMark, {ExprStart, ExprEnd},
    {StandardStart, StandardSeperator, StandardEnd},
    ParaPuncSpec }
}

Presumably, the above keywords will be defined in the Construct`String`Expression`RuleList` Context. This will be a protected part of Grok32`'s metabolism.
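In the spirit of the ctype functions, a classifier over the four RuleList subsets might look like the sketch below. It uses Python's `enum` module as a stand-in for the C enum proposed earlier. The ASCII assignments in `ASCII_RULES`, including which characters count as ParaPuncs, are assumptions chosen for illustration and are not taken from the Standard ASCII Rule List.

```python
import string
from enum import Enum


class CharClass(Enum):
    NOOP = "Noops"
    DIGIT = "Digits"
    LETTER = "Letters"
    PARAPUNC = "ParaPuncs"


# Hypothetical ASCII assignments, chosen only for illustration
ASCII_RULES = {
    CharClass.DIGIT: set(string.digits),
    CharClass.LETTER: set(string.ascii_letters),
    CharClass.PARAPUNC: set("`[]{},"),
}


def classify(ch, rules=ASCII_RULES):
    """Return the RuleList subset that claims ch."""
    for char_class, members in rules.items():
        if ch in members:
            return char_class
    # The lexicon files EmptySpace, Unassigned and Unrecognized under Noops
    return CharClass.NOOP


print([classify(c).value for c in "a1 ["])
# prints: ['Letters', 'Digits', 'Noops', 'ParaPuncs']
```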

ISO-Latin1 character set

A graphical list of all the characters which may be used in an HTML document.

Rich text format

"Rich Text Format" (RTF) is a standard formalized by Microsoft Corporation for specifying document and character format in ASCII. RTF files have special commands to indicate formatting information, such as fonts and margins. Other document formatting languages include the Hypertext Markup Language (HTML), which is used to define documents on the World Wide Web, and the Standard Generalized Markup Language (SGML), the more general standard from which HTML is derived.

Grok32` will recognize ASCII and isoLatin1 CharacterSets. A String defaults to ASCII however…

In Unicode, the first 128 code points coincide with the ASCII values, and the first 256 coincide with ISO Latin-1.
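The compatibility between Unicode, ASCII, and ISO Latin-1 can be checked directly with Python's built-in codecs. The variable names are illustrative only.

```python
# Every 7-bit value decodes under ASCII to the Unicode code point of the same number.
ascii_bytes = bytes(range(128))
assert [ord(c) for c in ascii_bytes.decode("ascii")] == list(range(128))

# Every 8-bit value decodes under ISO Latin-1 to the code point of the same number.
latin1_bytes = bytes(range(256))
assert [ord(c) for c in latin1_bytes.decode("latin-1")] == list(range(256))

# Consequently UTF-8 leaves pure ASCII text byte-for-byte unchanged.
assert "Grok32`".encode("utf-8") == b"Grok32`"
```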

.rtf documents show the relationship between fonts and character strings…

In the Standard C++ Library, a string is actually a template class, named basic_string. The template argument represents the type of character that will be held by the string container. By defining strings in this fashion, the Standard C++ Library not only provides facilities for manipulating sequences of 8-bit characters, but also for manipulating other types of character-like sequences, such as 16-bit wide characters. The datatypes string and wstring (for wide string) are simply typedefs of basic_string, defined as follows:

typedef std::basic_string<char, std::char_traits<char>,
    std::allocator<char> > std::string;

typedef std::basic_string<wchar_t, std::char_traits<wchar_t>,
    std::allocator<wchar_t> > std::wstring;

TrueType descriptive page:

http://www.truetype.demon.co.uk/

The TrueType Font File is described at http://developer.apple.com/fonts/TTRefMan/RM06/Chap6.html#Overview.

NOTES:

A TrueType font file consists of a sequence of concatenated tables.

A table is a sequence of words.

Each table must be long-aligned (start on a 4-byte boundary) and padded with zeroes if necessary.
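The padding rule amounts to rounding each table's length up to a multiple of four bytes. A minimal sketch, with the hypothetical helper name `pad_table`:

```python
def pad_table(table: bytes, alignment: int = 4) -> bytes:
    """Pad table with zero bytes up to the next multiple of alignment (4 = one long)."""
    padding = -len(table) % alignment
    return table + b"\x00" * padding


print(len(pad_table(b"12345")))
# prints: 8
```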

glyph: An image used in the visual representation of characters; roughly speaking, how a character looks. A font is a set of glyphs.

concatenate: To arrange (strings of characters) into a chained list.

The DataTypes used in TrueType fontspecs

Table 1 : The 'sfnt' data types

Macintosh Data Type	OS/2 Data Type	Description
uint8	BYTE	8-bit unsigned integer
int8	CHAR	8-bit signed integer
uint16	USHORT	16-bit unsigned integer
int16	SHORT	16-bit signed integer
uint32	ULONG	32-bit unsigned integer
int32	LONG	32-bit signed integer
Fixed	-	16.16-bit signed fixed-point number
FWord	-	16-bit signed integer that describes a quantity in FUnits, the smallest measurable distance in em space.
uFWord	-	16-bit unsigned integer that describes a quantity in FUnits, the smallest measurable distance in em space.
F2Dot14	-	16-bit signed fixed number with the low 14 bits representing fraction.
longDateTime	-	The long internal format of a date in seconds since 12:00 midnight, January 1, 1904. It is represented as a signed 64-bit integer.
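The less familiar entries in the table can be decoded with Python's `struct` module. The sketch below relies on the fact that TrueType data is stored big-endian; the function names are hypothetical.

```python
import struct
from datetime import datetime, timedelta

MAC_EPOCH = datetime(1904, 1, 1)   # longDateTime counts seconds from this instant


def read_fixed(buf: bytes) -> float:
    """16.16 signed fixed-point: a 32-bit integer scaled by 2**16."""
    (raw,) = struct.unpack(">i", buf)
    return raw / 65536


def read_f2dot14(buf: bytes) -> float:
    """2.14 signed fixed-point: a 16-bit integer scaled by 2**14."""
    (raw,) = struct.unpack(">h", buf)
    return raw / 16384


def read_long_datetime(buf: bytes) -> datetime:
    """Signed 64-bit count of seconds since midnight, January 1, 1904."""
    (seconds,) = struct.unpack(">q", buf)
    return MAC_EPOCH + timedelta(seconds=seconds)


print(read_fixed(b"\x00\x01\x80\x00"), read_f2dot14(b"\x70\x00"))
# prints: 1.5 1.75
```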

Grok32`

© 2004, 2005

by John Van Wie Bergamini.