We usually think of characters as the letters or glyphs that make up an alphabet,
but they encompass more than that.
On a computer, a character is any member of the machine’s recognized CharacterSet. This is language and culture dependent. At the level of the machine processor,
where character strings are parsed, characters are manipulated with simple integer operations. The machine does not care if they represent Spanish or Chinese or English or whatever.
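To make the integer view concrete, here is a minimal sketch in Python (chosen only for illustration; it is not Grok32` code). The helper names `to_upper_ascii` and `shift_letter` are made up for this example; `ord` and `chr` are Python built-ins that convert between a character and its numeric code.

```python
def to_upper_ascii(ch: str) -> str:
    """Uppercase one lowercase ASCII letter using plain integer arithmetic."""
    if "a" <= ch <= "z":
        # In ASCII, each lowercase letter sits exactly 32 above its uppercase form.
        return chr(ord(ch) - 32)
    return ch


def shift_letter(ch: str, offset: int) -> str:
    """Rotate an uppercase ASCII letter within A-Z by `offset` positions."""
    base = ord("A")
    return chr(base + (ord(ch) - base + offset) % 26)


codes = [ord(c) for c in "Grok"]  # the characters seen as integers
upper = "".join(to_upper_ascii(c) for c in "grok32")
print(codes, upper, shift_letter("Z", 1))
```

Nothing in these functions knows what the characters mean; they only add, subtract, and compare integers.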
Grok32` handles CharacterSets of arbitrary complexity from any linguistic system. The objective is to model any language without prejudice, to the extent this is possible, while maintaining an ASCII default.
ASCII is the acronym for "American Standard Code for Information Interchange". The standard defines 128 different
characters. Virtually all modern operating systems recognize ASCII, which originated with teletype machines. As people required computers to understand additional printing
and non-printing characters, the ASCII set became restrictive, and newer extended sets with 256
characters found some application. Since the String construct is designed to use many different CharacterSets, defining Grok32` in 128-character ASCII is not a restrictive burden.
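The difference between the 128-character and 256-character ranges can be checked with simple comparisons. The sketch below is in Python and is only an illustration. The function names `is_ascii` and `fits_in_one_byte` are hypothetical, and the text above does not say which 256-character set is meant, so the code tests numeric ranges only. (In Python's numbering, codes 0 to 255 happen to coincide with ISO Latin-1.)

```python
def is_ascii(text: str) -> bool:
    """True when every character has a code in the 128-character ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def fits_in_one_byte(text: str) -> bool:
    """True when every character has a code in the 256-character range."""
    return all(ord(ch) < 256 for ch in text)


for sample in ["Grok32`", "ni\u00f1o", "\u4e2d"]:
    print(repr(sample), is_ascii(sample), fits_in_one_byte(sample))
```

The second sample contains an n with a tilde (code 241), which fits the larger range but not ASCII; the third is a Chinese character that fits neither.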
Here is the 128-character set listed in order of each
character’s numeric code:
    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0   NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
1   DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
2   SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
3   0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
4   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
5   P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
6   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
7   p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL
Here is an interpretation of the first 32 characters, the somewhat esoteric members
of the ASCII character table above. These are often referred to as Control Codes.
NUL (null)
SOH (start of heading)
STX (start of text)
ETX (end of text)
EOT (end of transmission) - Not the same as ETB
ENQ (enquiry)
ACK (acknowledge)
BEL (bell) - Caused teletype machines to ring a bell. Causes a beep in many common terminals and terminal emulation programs.
BS (backspace) - Moves the cursor (or print head) backwards (left) 1 space.
HT (horizontal tab, TAB) - Moves the cursor (or print head) right to the next tab stop. The spacing of tab stops is dependent on the output device, but is often either 8 or 10.
LF (line feed; also NL, new line) - Moves the cursor (or print head) to a new line. On Unix systems, it moves to a new line AND all the way to the left.
VT (vertical tab)
FF (form feed) - Advances paper to the top of the next page (if the output device is a printer).
CR (carriage return) - Moves the cursor all the way to the left, but does not advance to the next line.
SO (shift out) - Switches output device to alternate character set.
SI (shift in) - Switches output device back to default character set.
DLE (data link escape)
DC1 (device control 1)
DC2 (device control 2)
DC3 (device control 3)
DC4 (device control 4)
NAK (negative acknowledge)
SYN (synchronous idle)
ETB (end of transmission block) - Not the same as EOT
CAN (cancel)
EM (end of medium)
SUB (substitute)
ESC (escape)
FS (file separator)
GS (group separator)
RS (record separator)
US (unit separator)
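Several of these control codes are still in everyday use and have their own escape sequences in many programming languages. The Python sketch below lists the ones Python can write directly and shows one practical consequence of the CR and LF conventions described above. The function name `normalize_newlines` is hypothetical, and the code is only an illustration, not part of Grok32`.

```python
# Control codes that Python can write as escape sequences.
ESCAPES = {
    "BEL": "\a",
    "BS": "\b",
    "HT": "\t",
    "LF": "\n",
    "VT": "\v",
    "FF": "\f",
    "CR": "\r",
}

# Map each name to its numeric ASCII code.
codes = {name: ord(ch) for name, ch in ESCAPES.items()}
print(codes)


def normalize_newlines(text: str) -> str:
    """Convert CR LF pairs and lone CRs into a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_newlines("one\r\ntwo\rthree\n")))
```

Replacing the CR LF pairs first matters: doing the lone CR replacement first would turn each pair into two line feeds.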
The operative belief is that in varying degrees of complexity, all language can be decomposed into character Strings. Frequently, communication is multi-channel,
and a correct "parsing" requires parallel String processing of different kinds of "character strings". A "character string" could be a sequence of gestures. For example, what complex
of Character-String channels is sufficient to characterize a whale's language, or a bee's? The development
of sensory parallel String processing systems is an essential prerequisite to this study. Also necessary
to this study is an understanding of how a species regards itself as a unitary
being. To make this vivid, consider the incredible, audible "songs" produced
by whales. For all we know, the songs they sing are necessary to cohere their
massive bodies as a unitary whole. Which comes first, the song or the whale?
The available characters
and their internal representations are machine dependent. The most common character sets are ASCII and EBCDIC. While Grok32` seems to treat ASCII as a special CharacterSet, that appearance is an artifact of this Language Specification's American English. In principle, this document could be translated into ANY language using the appropriate local "standard" CharacterSet.
Mathematicians believe their reckonings transcend the character strings used. A rich CharacterSet is not just a collection of coded letters, but also tokens
and semantic operations that generate meaning. The process of transforming
a character string into an interpreted Expression is accomplished by assigning characters, character sequences
and tokens to "invoked" Functions parameterized constructively from the character sequences. This is accomplished with a RuleList associated with its CharacterSet.
Most word-processing applications have a "Save" option
which allows writing to be saved as "plain"
text with no formatting such as tabs, bold or underscoring. This raw format
is the most common character standard. It is well understood by both humans
and computers. This document, for example, was composed on a Mozilla derivative which has an "Export Text..." choice under the File menu that reduces an HTML
document to a link-free text version. Each former hyperlink is
followed by <disabledLinkAddress>. Because of this convention of identifying
links as "<"...">" delimited Expressions, "<" & ">" comprise a ParaPunc in the Standard ASCII RuleList.
This capacity to render documents as text facilitates importing text from diverse applications without compatibility issues.
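As an illustration of the "<"...">" convention, the Python sketch below pulls the delimited addresses out of exported text. It is not the Standard ASCII RuleList and makes no claim about how Grok32` processes a ParaPunc. The function name `find_link_addresses`, the sample text, and the assumption that an address contains no spaces or nested angle brackets are all invented for this example.

```python
import re

# One "<", then a run of characters that are not "<", ">" or whitespace, then ">".
LINK_PATTERN = re.compile(r"<([^<>\s]+)>")


def find_link_addresses(text: str) -> list:
    """Return the contents of every "<"...">" delimited span, in order."""
    return LINK_PATTERN.findall(text)


exported = "See the spec <http://example.org/spec> and the notes <notes.html>."
print(find_link_addresses(exported))
```

A comparison such as "a < b" is left alone, because no closing ">" follows the "<" without intervening whitespace.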
"Plain" text usually means ASCII text, although there are minor variations such as "MS-DOS Format".
("Text Document" (with or without "MS-DOS Format") | "Unicode Document" | "Rich Text Document")
Unicode has become an important standard. It copes with character alternatives by sheer numbers...
The original 16-bit design of Unicode allowed up to 65,536 different characters; the standard has since been extended to a code space of 1,114,112 code points. Since Unicode is more
complex than ASCII, support for it has varied between operating systems. By design, Unicode is a standard that serves many languages, although it was spawned mostly from the English-speaking computer engineering community. For the purposes of discriminating between linguistic systems,
Unicode characters are subdivided by structure and function since they serve many diverse, independent or interdependent purposes.
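One concrete form of this subdivision is the "general category" that the Unicode standard assigns to every character (letter, number, punctuation, symbol, and so on). Whether that is exactly the subdivision meant above is an assumption. The Python sketch below uses the standard `unicodedata` module to show the code point, general category, and UTF-8 byte length of a few characters; the helper name `describe` is invented for this example.

```python
import sys
import unicodedata

# A Latin capital, an accented Latin letter, a CJK ideograph, and a whale symbol.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F40B"]


def describe(ch: str) -> tuple:
    """Return (code point, general category, UTF-8 length in bytes)."""
    return (ord(ch), unicodedata.category(ch), len(ch.encode("utf-8")))


for ch in samples:
    code, category, size = describe(ch)
    print(f"U+{code:04X} category={category} utf8_bytes={size}")

# The largest code point Python can represent.
print(hex(sys.maxunicode))
```

The last sample has a code point above 65,535, so it does not fit in 16 bits, and UTF-8 needs four bytes to store it.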
For a graphical list of all the characters that may be used in an HTML document, see ISO Latin-1 Characters and Control Characters Table.
(c) 2004-2007 by
John Van Wie Bergamini.
All rights reserved.