next up previous
Next: Decompression algorithm Up: A new compression and Previous: A new compression and

Compression algorithm

The compression algorithm presented in this article is effective and is based a simple concept, even if its expression in algorithmic form is complex. Each character or byte in the input character or byte stream is directly encoded without change or modification in the compressed output stream. Then, if the given character or byte reappears or reoccurs in the next 2,040 characters or bytes, skipping characters or bytes that have already been encoded as reappearances or recurrences of previous bytes or characters, these reappearances or recurrences are encoded if the choice to encode the reappearances or recurrences saves space in the compressed output stream. Otherwise, no encoding is performed, and the next character or byte in the input character or byte stream is encoded without change or modification, and its reappearances or recurrences are possibly encoded if the choice to encode saves space, etc.

Consider the input character or byte stream 'ABAB'. The first character or byte in this input character or byte stream is 'A'. Another 'A' reappears or reoccurs within the next 2,040 bytes or characters. In fact, it appears two characters later. This second occurrences of 'A' is not encoded because doing so will not save space. If we chose to encode the second appearance or occurrence of 'A', first we'd encode the first 'A' without change or modification, followed by the escape sequences, $00000000_{2}$. The escape sequence indicates to the decompression algorithm that an encoding follows. If a natural $00000000_{2}$ occurs in the input character or byte stream, $00000000_{2}$ is encoded twice in the output stream to indicate that the $00000000_{2}$ $00000000_{2}$ should not be interpreted as an encoding, but as a natural occurrence of $00000000_{2}$. Next, following the escape sequence, a byte or character is used to specify the length of the encoding in bytes, which can be any number between 1 and 255. The encoding itself then follows the length of the encoding, which was preceded by the escape sequence.

In our example where our input character or byte stream is 'ABAB', and assuming we chose to encode the second appearance or occurrence of 'A' even though it doesn't save space, the encoding length would be 1 byte or character, and the encoding itself would be $010$. But bits are represented in groups of 8 as bytes. So, appending 0's, the encoding would be $01000000$. The left-most $0$ in this encoding indicates that the character immediately following the first 'A' is not a reappearance or recurrences of 'A'. In fact, it's a 'B', not an 'A'. The $1$ that then follows indicates that the character following 'B' is the first and only reappearance or recurrence of 'A' in the input character or byte stream.

Let's continue with this example, where our input character or byte stream is 'ABAB'. Assume we encoded the first and second appearance or occurrence of 'A', and we are now considering the first appearance or occurrence of 'B'. Assume also that even though it will not save space, we choose to encode all reappearances or recurrences of 'B' after the first 'B'. So, first 'B' would be encoded without modification or change in the output stream, followed by the escape sequence, $00000000_{2}$, followed by the length of the stream, again $1$, followed by the encoding. The encoding itself skips the 'A' that follows the first 'B' since this 'A' was encoded as a reappearance or recurrence after the first 'A'. So, our encoded sequence is simply $1$. But bits are represented in groups of eight (8) as bytes, so appending 0's we get $10000000$. The left-most one (1) encodes the second 'B' in our input byte or character stream.

A more complicated compression example is given in Appendix A.

When deciding whether or not to encode reappearances or recurrences of a character or byte in our input stream, we look at the following 2,040 characters or bytes in the input stream that have not already been encoded as reappearances or recurrences of a previous character or byte in the input stream. If we choose to encode, after the escape sequence, we specify the length of the encoding measured in bytes, a number between 1 and 255. The number zero (0) is reserved for the escape sequence, and if the number zero appears twice in-a-row, it specifies a natural 0 (zero) in our input stream. Notice that $255 * 8 = 2,040$. We look no father than 2,040 bytes or characters ahead, because an encoding of length 255, measures in bytes, represents 2,040 bits.

Figures 1 and 2 specify the compression algorithm. Figure 2 is the algorithm for COMPRESS, which uses the helper function ENCODE. Figure 1 is the algorithm for ENCODE. Parameters passed to either function, either COMPRESS or ENCODE, which are preceded by var are passed by reference, so their values can change in the algorithms.

Figure 1: ENCODE

Figure 2: COMPRESS


next up previous
Next: Decompression algorithm Up: A new compression and Previous: A new compression and
James Riechel 2007-12-12
Hosted by www.Geocities.ws

1