Consider the following character or byte stream, which we wish to compress.
Alan Alan Alan Alan Alan Alan Alan Alan Alan
There are 44 characters or bytes in this character or byte stream, and we wish to compress it, so it uses less memory. The first character in the character or byte stream is 'A'. This is immediately encoded into the compressed output stream. It is obvious, but this is our compressed output stream so far:
A
After this first 'A', additional A's occur 5 characters or bytes later, 10 characters or bytes later, 15 characters or bytes later, 20 characters or bytes later, 25 characters or bytes later, 30 characters or bytes later, 35 characters or bytes later, and 40 characters or bytes later. That's the other eight A's occurring in the character or byte stream. Each 'Alan' begins with a capital 'A', and there are eight Alan's following the first 'Alan'.
For each of the forty-three (43) characters following the first 'A',
we can assign a binary digit. This binary digit is equal to zero
(
) if the character is not an 'A', and is equal to one
(
) if the character is equal to 'A'. Let's look at this stream
of binary digits:
00001000 01000010 00010000 10000100 00100001 000
Next, there are three (3) zeros at the tail of this binary stream. But in any implementation, binary digits are represented as bytes, which are eight bits a piece. So, it our representation, this bit steam is actually:
00001000 01000010 00010000 10000100 00100001 00000000
So, this is the binary encoding of the forty-three (43) characters that follow the first 'A', represented as six (6) bytes. Next, we strip away trailing 00000000's from the bit or byte stream. These trailing 00000000's are unnecessary, and waste space should we choose to encode them later. So, our binary or byte stream is now:
00001000 01000010 00010000 10000100 00100001
It will take 5 characters or bytes of
space to encode or represent these eight additional occurrences of
'A' following the first 'A'. But we also need to encode the escape sequence,
, following the 'A' which we have already encoded, and after the
escape sequence we need to specify the number of bytes or characters in
the encoded byte stream, namely five (5). The number 5 in binary is
. So, it takes
characters to represent
these eight additional A's. Since
, the algorithm chooses to
encode in this format, to save space. So, here's our compressed output
so far:
A 0000000 00000101 00001000 01000010 00010000 10000100 0010000100
The next unique character in our input is 'l'. It is immediately encoded:
A 0000000 00000101 00001000 01000010 00010000 10000100 00100001 l
Once again, there are eight more occurrences of 'l', but this time we skip all the additional A's previously encoded. Ignoring these already encoded A's, after the first 'l', l's occur 4 characters or bytes later, 8 characters or bytes later, 12 characters or bytes later, 16 characters or bytes later, 20 characters or bytes later, 24 characters or bytes later, 28 characters or bytes later, and 32 characters or bytes later. As before, here is how these recurrences are represented as a binary stream of digits:
00010001 00010001 00010001 00010001 00
There are forty-two (42) binary digits in this bit stream, but, again, in any implementation, bits are represented as bytes, so we append additional 0's to create five (5) bytes:
00010001 00010001 00010001 00010001 00000000
But again, the last byte,
, encodes nothing, and would only
waste space if we chose to encode it. Therefore, it is removed from the
encoding. This leaves us with the following bit or byte stream:
00010001 00010001 00010001 00010001
Each of these binary numbers take one character or byte to represent in
compressed format. But once again, after the 'l' we need to include the
escape sequence,
, and the number of bytes in the bit or byte
stream, namely four (4) or
. So a total of 6 bytes is
required to represent these additional eight occurrences of 'l'.
Since
, the encoded representation is chosen. So, here's our
compressed output:
A 00000000 00000101 00001000 01000010 00010000 10000100 00100001 l 00000000 00000100 00010001 00010001 00010001 00010001
The next unique character in our input is 'a'. It is immediately encoded:
A 00000000 00000101 00001000 01000010 00010000 10000100 00100001 l 00000000 00000100 00010001 00010001 00010001 00010001 a
Now eight additional a's occur in the input. We ignore already encoded characters, namely, the last eight A's and the last eight l's. Ignoring these already encoded letters, after the first 'a', additional a's occur 3 characters or bytes later, 6 characters or bytes later, 9 characters or bytes later, 12 characters or bytes later, 15 characters or bytes later, 18 characters or bytes later, 21 characters or bytes later, and 24 characters or bytes later. Again, these recurrences are represented as a binary stream of digits:
00100100 10010010 01001001 0
But once again, in any implementation, bits are represented as bytes, so the tailing zero (0) is really represented by 8-bits, with zero's (0's) appended, or:
00100100 10010010 01001001 00000000
But once again, the tailing byte,
, encodes nothing, and would
waste space if we chose to encode it, so it is trimmed from the binary bit
or byte stream, leaving:
00100100 10010010 01001001
Once again, these three bytes or characters must be precede by the escape
sequence,
, and the number of bytes in the stream, namely
three (3) or
. Therefore,
characters are
necessary to represent the eight additional l's. Since
, we choose to
encode the eight additional a's in this manner, so now the compress output
looks like:
A 00000000 00000101 00001000 01000010 00010000 10000100 00100001 l 00000000 00000100 00010001 00010001 00010001 00010001 a 00000000 00000011 00100100 10010010 01001001
The next unique character in our input is 'n'. It is immediately represented in our output:
A 00000000 00000101 00001000 01000010 00010000 10000100 00100001 l 00000000 00000100 00010001 00010001 00010001 00010001 a 00000000 00000011 00100100 10010010 01001001 n
After this first occurrence of 'n', there are eight additional occurrences of 'n'. Ignoring the characters that have already been encoded, 'A', 'l', and 'a', after the first 'n', n's occur after 2 characters or bytes, 4 characters or bytes, 6 characters or bytes, 8 characters or bytes, 10 characters or bytes, 12 characters or bytes, 14 characters or bytes, and 16 characters or bytes. These recurrences are represented as:
01010101 01010101
These two binary numbers are each represented by a single character or byte,
but we must also include the escape sequence,
, and the
number of characters or bytes in the stream, namely two (2), or
.
Therefore,
characters or bytes are needed to represent
these additional eight occurrences of 'n'. Since,
we choose to
encode. Our output is now:
A 00000000 00000101 00001000 01000010 00010000 10000100 00100001 l 00000000 00000100 00010001 00010001 00010001 00010001 a 00000000 00000011 00100100 10010010 01001001 n 00000000 00000010 01010101 01010101
The next unique character in our input is the space character, ' '. It is immediately represented in our output:
A 00000000 00000101 00001000 01000010 00010000 10000100 00100001 l 00000000 00000100 00010001 00010001 00010001 00010001 a 00000000 00000011 00100100 10010010 01001001 n 00000000 00000010 01010101 01010101 ' '
After this first space, ' ', seven additional spaces occur. Ignoring characters that have already been encoded, 'A', 'l', 'a', and 'n', after the first space, ' ', additional spaces occur after 1 character or byte, 2 characters or bytes, 3 characters or bytes, 4 characters or bytes, 5 characters or bytes, 6 characters or bytes, and 7 characters or bytes. This can be represented as:
1111111
But in any implementation, bits are represented in groups of 8 as bytes, so we need to add a single zero (0) to the end of this stream to make it a full byte:
11111110
This binary number will be represented as a single character or byte in
the output. As usual, we'll also have to include the escape
sequence,
, after the first space, and the number of bytes
in the binary bit or byte stream, namely one (1), or
. So,
it'll take
characters or bytes to represent these
additional seven spaces, ' '. Since
, we encode in this fashion.
Here's our output:
A 00000000 00000101 00001000 01000010 00010000 10000100 00100001 l 00000000 00000100 00010001 00010001 00010001 00010001 a 00000000 00000011 00100100 10010010 01001001 n 00000000 00000010 01010101 01010101 ' ' 00000000 00000001 11111110
Now the entire input character or byte stream has been compressed or encoded. Remember, the input character or byte stream was 44 characters or bytes in length. In comparison, the compressed output is only 30 characters of bytes in length. Compression has been successful.