The compression algorithm presented in this article is effective and is based a simple concept, even if its expression in algorithmic form is complex. Each character or byte in the input character or byte stream is directly encoded without change or modification in the compressed output stream. Then, if the given character or byte reappears or reoccurs in the next 2,040 characters or bytes, skipping characters or bytes that have already been encoded as reappearances or recurrences of previous bytes or characters, these reappearances or recurrences are encoded if the choice to encode the reappearances or recurrences saves space in the compressed output stream. Otherwise, no encoding is performed, and the next character or byte in the input character or byte stream is encoded without change or modification, and its reappearances or recurrences are possibly encoded if the choice to encode saves space, etc.
Consider the input character or byte stream 'ABAB'. The first character or
byte in this input character or byte stream is 'A'. Another 'A' reappears
or reoccurs within the next 2,040 bytes or characters. In fact, it appears
two characters later. This second occurrences of 'A' is not encoded because
doing so will not save space. If we chose to encode the second appearance
or occurrence of 'A', first we'd encode the first 'A' without change or
modification, followed by the escape sequences,
. The
escape sequence indicates to the decompression algorithm that an encoding
follows. If a natural
occurs in the input character or
byte stream,
is encoded twice in the output stream to
indicate that the
should not be interpreted
as an encoding,
but as a natural occurrence of
. Next, following the escape
sequence, a byte or character is used to specify the length of the encoding
in bytes,
which can be any number between 1 and 255. The encoding itself then follows
the length of the encoding, which was preceded by the escape sequence.
In our example where our input character or byte stream is 'ABAB', and
assuming we chose to encode the second appearance or occurrence of 'A' even
though it doesn't save space, the encoding length would be 1 byte or
character, and the encoding itself would be
. But bits are represented
in groups of 8 as bytes. So, appending 0's, the encoding would be
. The left-most
in this encoding indicates that the
character immediately following the first 'A' is not a reappearance or
recurrences of 'A'. In fact, it's a 'B', not an 'A'. The
that then
follows indicates that the character following 'B' is the first
and only reappearance or recurrence of 'A' in the input character or byte
stream.
Let's continue with this example, where our input character or byte stream
is 'ABAB'. Assume we encoded the first and second appearance or occurrence
of 'A', and we are now considering the first appearance or occurrence of 'B'.
Assume also that even though it will not save space, we choose to encode
all reappearances or recurrences of 'B' after the first 'B'. So, first 'B'
would be encoded without modification or change in the output stream,
followed by the escape sequence,
, followed by the length of
the stream, again
, followed by the encoding. The encoding itself skips
the 'A' that follows the first 'B' since this 'A' was encoded as a
reappearance or recurrence after the first 'A'. So, our encoded sequence
is simply
. But bits are represented in groups of eight (8) as bytes, so
appending 0's we get
. The left-most one (1) encodes the second
'B' in our input byte or character stream.
A more complicated compression example is given in Appendix A.
When deciding whether or not to encode reappearances or recurrences of a
character or byte in our input stream, we look at the following 2,040
characters or bytes in the input stream that have not already been encoded
as reappearances or recurrences of a previous character or byte in the
input stream. If we choose to encode, after the escape sequence, we specify
the length of the encoding measured in bytes, a number between 1 and 255.
The number zero (0) is reserved for the escape sequence, and if the number
zero appears twice in-a-row, it specifies a natural 0 (zero) in our input
stream. Notice that
. We look no father than 2,040
bytes or characters ahead, because an encoding of length 255, measures in
bytes, represents 2,040 bits.
Figures 1 and 2 specify the compression algorithm. Figure 2 is the algorithm for COMPRESS, which uses the helper function ENCODE. Figure 1 is the algorithm for ENCODE. Parameters passed to either function, either COMPRESS or ENCODE, which are preceded by var are passed by reference, so their values can change in the algorithms.