Characters

Why is ASCII the default character set in Grok32`?

We usually think of characters as the letters or glyphs that make up an alphabet, but the term encompasses more than that.
On a computer, a character is any member of the machine’s recognized CharacterSet, which depends on language and culture. At the level of the machine processor, where character strings are parsed, characters are manipulated with simple integer operations. The machine does not care whether they represent Spanish, Chinese, English or anything else.
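The integer view of characters can be made concrete with a short sketch. Python is used here purely for illustration; it is not Grok32` syntax, and the helper name to_upper_ascii is hypothetical.

```python
# Characters are stored as integers; ord() and chr() convert between the two views.
def to_upper_ascii(text):
    """Upper-case ASCII letters using only integer arithmetic."""
    out = []
    for ch in text:
        code = ord(ch)
        if ord("a") <= code <= ord("z"):
            code -= 32  # 'a' (97) and 'A' (65) differ by exactly 32
        out.append(chr(code))
    return "".join(out)

print(ord("A"))                # 65
print(chr(65 + 1))             # B
print(to_upper_ascii("grok"))  # GROK
```

Nothing in the arithmetic knows that the integers stand for English letters, which is the sense in which the machine "does not care" about the language.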

Grok32` handles CharacterSets of arbitrary complexity from any linguistic system. The objective is to model any language without prejudice, to the extent this is possible, while maintaining an ASCII default.

ASCII is the acronym for "American Standard Code for Information Interchange". The standard defines 128 characters. Virtually all modern operating systems recognize ASCII, which originated with teletype machines. As people required computers to handle additional printing and non-printing characters, the 128-character set became restrictive, and newer extended standards with 256 characters found some application. Since the String construct is designed to use many different CharacterSets, defining Grok32` in 128-character ASCII is not a restrictive burden.
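The 128-character limit is easy to see in code. The sketch below is illustrative Python rather than Grok32`; the function name fits_in_ascii is hypothetical, and ISO Latin-1 is used only as one example of a 256-character extension.

```python
def fits_in_ascii(text):
    """True when every character has a code below 128 (7 bits)."""
    return all(ord(ch) < 128 for ch in text)

print(fits_in_ascii("Grok32`"))  # True
print(fits_in_ascii("se\u00f1or"))  # False: U+00F1 (n with tilde) is outside ASCII

# The character that ASCII rejects fits in one byte of the
# 256-character ISO Latin-1 set.
print("\u00f1".encode("latin-1"))  # b'\xf1'
```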

Here is the 128-character set listed in order of each character’s numeric code:

   0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1  DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2  SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
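In this table the row label is the high hexadecimal digit of a character's code and the column label is the low digit. A minimal Python sketch of that lookup follows; the function name ascii_code is hypothetical.

```python
def ascii_code(row, col):
    """Numeric code of the character at a given table row and column."""
    return row * 16 + col

print(hex(ascii_code(4, 1)), chr(ascii_code(4, 1)))  # 0x41 A
print(chr(ascii_code(7, 0xA)))                       # z
print(ascii_code(7, 0xF))                            # 127, the code of DEL
```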

The first 32 characters in the ASCII table above are somewhat esoteric. They are often referred to as Control Codes and are interpreted as follows:

NUL (null)

SOH (start of heading)

STX (start of text)

ETX (end of text)

EOT (end of transmission) - Not the same as ETB

ENQ (enquiry)

ACK (acknowledge)

BEL (bell) - Caused teletype machines to ring a bell. Causes a beep in many common terminals and terminal emulation programs.

BS (backspace) - Moves the cursor (or print head) backwards (left) 1 space.

HT (horizontal tab) - Moves the cursor (or print head) right to the next tab stop. The spacing of tab stops is dependent on the output device, but is often either 8 or 10.

LF (NL line feed, new line) - Moves the cursor (or print head) to a new line. On Unix systems, moves to a new line AND all the way to the left.

VT (vertical tab)

FF (form feed) - Advances paper to the top of the next page (if the output device is a printer).

CR (carriage return) - Moves the cursor all the way to the left, but does not advance to the next line.

SO (shift out) - Switches output device to alternate character set.

SI (shift in) - Switches output device back to default character set.

DLE (data link escape)

DC1 (device control 1)

DC2 (device control 2)

DC3 (device control 3)

DC4 (device control 4)

NAK (negative acknowledge)

SYN (synchronous idle)

ETB (end of transmission block) - Not the same as EOT

CAN (cancel)

EM (end of medium)

SUB (substitute)

ESC (escape)

FS (file separator)

GS (group separator)

RS (record separator)

US (unit separator)
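Several of these control codes are still met daily as escape sequences in program text. The following Python sketch maps a character back to its abbreviation from the list above; the names CONTROL_NAMES and describe are hypothetical.

```python
# Abbreviations from the first two rows of the ASCII table, in code order.
CONTROL_NAMES = (
    "NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI "
    "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US"
).split()

def describe(ch):
    """Return the control-code abbreviation for ch, or ch itself if printable."""
    code = ord(ch)
    if code < 32:
        return CONTROL_NAMES[code]
    if code == 127:
        return "DEL"
    return ch

print([describe(ch) for ch in "a\tb\r\n"])  # ['a', 'HT', 'b', 'CR', 'LF']
print(ord("\a"), ord("\b"), ord("\x1b"))    # 7 8 27 (BEL, BS, ESC)
```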

The String is adapted to parse any character-expression sequence, from any form of communication, human or otherwise...

The operative belief is that, in varying degrees of complexity, all language can be decomposed into character Strings. Frequently, communication is multi-channel, and a correct "parsing" requires parallel String processing of different kinds of "character strings". A "character string" could be a sequence of gestures. For example, what complex of Character-String channels is sufficient to characterize a whale's language, or a bee's? The development of sensory parallel String processing systems is an essential prerequisite to this study. Also necessary is an understanding of how a species regards itself as a unitary being. To make this vivid, consider the incredible, audible "songs" produced by whales. For all we know, the songs they sing are necessary to cohere their massive bodies as a unitary whole. Which comes first, the song or the whale?

The available characters and their internal representations are machine dependent. The most common character sets are ASCII and EBCDIC. While Grok32` seems to treat ASCII as a special CharacterSet, that appearance is an artifact of this Language Specification's American English. In principle, this document could be translated into ANY language using the appropriate local "standard" CharacterSet. Mathematicians believe their reckonings transcend the character strings used to write them. A rich CharacterSet is not just a collection of coded letters, but also tokens and semantic operations that generate meaning. A character string is transformed into an interpreted Expression by assigning characters, character sequences and tokens to "invoked" Functions, which are parameterized constructively from the character sequences. This is accomplished with a RuleList associated with the CharacterSet.
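The machine dependence of internal representations can be illustrated by encoding the same text in ASCII and in EBCDIC. This Python sketch assumes the cp037 code page as a representative EBCDIC variant; other EBCDIC code pages differ in detail.

```python
text = "Grok32"
ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # cp037: one widely used EBCDIC code page

print(ascii_bytes.hex())   # 47726f6b3332
print(ebcdic_bytes.hex())  # different integers for the same characters

# Decoding with the matching CharacterSet recovers the same text either way.
print(ebcdic_bytes.decode("cp037") == ascii_bytes.decode("ascii"))  # True
```

The characters are the same in both cases; only the integers assigned to them change, which is why a String has to be paired with the CharacterSet that interprets it.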

Most word-processing applications have a "Save" option that allows writing to be saved as "plain" text with no formatting such as tabs, bold or underscoring. This raw format is the most common character standard. It is well understood by both humans and computers. This document, for example, was composed on a Mozilla derivative which has an "Export Text..." choice under the File menu that reduces an HTML document to a link-free text version. Each former hyperlink is followed by <disabledLinkAddress>. Because of this convention of identifying links as "<"...">" delimited Expressions, "<" and ">" comprise a ParaPunc in the Standard ASCII RuleList.
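The "<"...">" convention can be sketched as a single delimiter rule. The Python below is only an illustration of the idea and does not reproduce the actual Standard ASCII RuleList or ParaPunc mechanism; the names ANGLE_EXPRESSION and split_links and the example address are hypothetical.

```python
import re

# Hypothetical rule: text between "<" and ">" is one delimited Expression.
# Leading whitespace is consumed so that removing a link leaves clean prose.
ANGLE_EXPRESSION = re.compile(r"\s*<([^<>]*)>")

def split_links(text):
    """Return the text with <...> Expressions removed, plus their contents."""
    links = ANGLE_EXPRESSION.findall(text)
    plain = ANGLE_EXPRESSION.sub("", text)
    return plain, links

plain, links = split_links(
    "See the ASCII table <http://example.org/ascii> for details."
)
print(links)  # ['http://example.org/ascii']
print(plain)  # See the ASCII table for details.
```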

This capacity to render documents as text facilitates importing text from diverse applications without compatibility issues.

"Plain" text usually means ASCII text, although there are minor variations such as "MS-DOS Format".

("Text Document" (with or without "MS-DOS Format") | "Unicode Document" | "Rich Text Document")

Unicode has recently become an important standard. It copes with character-alternatives by sheer numbers...

Unicode was originally designed around 16-bit codes, allowing up to 65,536 different characters; the standard has since been extended to a code space of 1,114,112 code points. Because Unicode is more complex than ASCII, operating systems were slow to adopt it, although most modern systems now support it. By design, Unicode is a standard that serves many languages, though it was spawned mostly by the English-speaking computer engineering community. For the purposes of discriminating between linguistic systems, Unicode characters are subdivided by structure and function since they serve many diverse, independent or interdependent purposes.
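One concrete form of this subdivision is the "general category" that the Unicode standard assigns to every character. The sketch below uses Python's standard unicodedata module to show it; it is an illustration, not part of Grok32`.

```python
import unicodedata

# General categories group characters by function:
# Lu = uppercase letter, Nd = decimal digit, Zs = space separator,
# Cc = control code, Lo = other letter, Sm = math symbol.
for ch in ["A", "7", " ", "\n", "\u4e2d", "+"]:
    print(repr(ch), hex(ord(ch)), unicodedata.category(ch))

# A character beyond the original 16-bit range needs two 16-bit units in UTF-16.
face = "\U0001F600"
print(ord(face) > 65535)               # True
print(len(face.encode("utf-16-le")))   # 4 (bytes, i.e. two 16-bit units)
```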

ISO-Latin1 character set

For a graphical list of all the characters that may be used in an HTML document, see ISO Latin-1 Characters and Control Characters Table.
