next up previous
Next: Decompression example Up: A new compression and Previous: Time complexity of the

Compression example

Consider the following character or byte stream, which we wish to compress.

Alan Alan Alan Alan Alan Alan Alan Alan Alan

There are 44 characters or bytes in this character or byte stream, and we wish to compress it, so it uses less memory. The first character in the character or byte stream is 'A'. This is immediately encoded into the compressed output stream. It is obvious, but this is our compressed output stream so far:

A

After this first 'A', additional A's occur 5 characters or bytes later, 10 characters or bytes later, 15 characters or bytes later, 20 characters or bytes later, 25 characters or bytes later, 30 characters or bytes later, 35 characters or bytes later, and 40 characters or bytes later. That's the other eight A's occurring in the character or byte stream. Each 'Alan' begins with a capital 'A', and there are eight Alan's following the first 'Alan'.

For each of the forty-three (43) characters following the first 'A', we can assign a binary digit. This binary digit is equal to zero ($0_{2}$) if the character is not an 'A', and is equal to one ($1_{2}$) if the character is equal to 'A'. Let's look at this stream of binary digits:

00001000 01000010 00010000 10000100 00100001 000

Next, there are three (3) zeros at the tail of this binary stream. But in any implementation, binary digits are represented as bytes, which are eight bits a piece. So, it our representation, this bit steam is actually:

00001000 01000010 00010000 10000100 00100001 00000000

So, this is the binary encoding of the forty-three (43) characters that follow the first 'A', represented as six (6) bytes. Next, we strip away trailing 00000000's from the bit or byte stream. These trailing 00000000's are unnecessary, and waste space should we choose to encode them later. So, our binary or byte stream is now:

00001000 01000010 00010000 10000100 00100001

It will take 5 characters or bytes of space to encode or represent these eight additional occurrences of 'A' following the first 'A'. But we also need to encode the escape sequence, $00000000_{2}$, following the 'A' which we have already encoded, and after the escape sequence we need to specify the number of bytes or characters in the encoded byte stream, namely five (5). The number 5 in binary is $00000101_{2}$. So, it takes $1 + 1 + 5 = 7$ characters to represent these eight additional A's. Since $7 < 8$, the algorithm chooses to encode in this format, to save space. So, here's our compressed output so far:

A 0000000 00000101 00001000 01000010 00010000 10000100
0010000100

The next unique character in our input is 'l'. It is immediately encoded:

A 0000000 00000101 00001000 01000010 00010000 10000100
00100001 l

Once again, there are eight more occurrences of 'l', but this time we skip all the additional A's previously encoded. Ignoring these already encoded A's, after the first 'l', l's occur 4 characters or bytes later, 8 characters or bytes later, 12 characters or bytes later, 16 characters or bytes later, 20 characters or bytes later, 24 characters or bytes later, 28 characters or bytes later, and 32 characters or bytes later. As before, here is how these recurrences are represented as a binary stream of digits:

00010001 00010001 00010001 00010001 00

There are forty-two (42) binary digits in this bit stream, but, again, in any implementation, bits are represented as bytes, so we append additional 0's to create five (5) bytes:

00010001 00010001 00010001 00010001 00000000

But again, the last byte, $00000000_{2}$, encodes nothing, and would only waste space if we chose to encode it. Therefore, it is removed from the encoding. This leaves us with the following bit or byte stream:

00010001 00010001 00010001 00010001

Each of these binary numbers take one character or byte to represent in compressed format. But once again, after the 'l' we need to include the escape sequence, $00000000_{2}$, and the number of bytes in the bit or byte stream, namely four (4) or $00000100_{2}$. So a total of 6 bytes is required to represent these additional eight occurrences of 'l'. Since $6 < 8$, the encoded representation is chosen. So, here's our compressed output:

A 00000000 00000101 00001000 01000010 00010000 10000100
00100001 l 00000000 00000100 00010001 00010001 00010001
00010001

The next unique character in our input is 'a'. It is immediately encoded:

A 00000000 00000101 00001000 01000010 00010000 10000100
00100001 l 00000000 00000100 00010001 00010001 00010001
00010001 a

Now eight additional a's occur in the input. We ignore already encoded characters, namely, the last eight A's and the last eight l's. Ignoring these already encoded letters, after the first 'a', additional a's occur 3 characters or bytes later, 6 characters or bytes later, 9 characters or bytes later, 12 characters or bytes later, 15 characters or bytes later, 18 characters or bytes later, 21 characters or bytes later, and 24 characters or bytes later. Again, these recurrences are represented as a binary stream of digits:

00100100 10010010 01001001 0

But once again, in any implementation, bits are represented as bytes, so the tailing zero (0) is really represented by 8-bits, with zero's (0's) appended, or:

00100100 10010010 01001001 00000000

But once again, the tailing byte, $00000000_{2}$, encodes nothing, and would waste space if we chose to encode it, so it is trimmed from the binary bit or byte stream, leaving:

00100100 10010010 01001001

Once again, these three bytes or characters must be precede by the escape sequence, $00000000_{2}$, and the number of bytes in the stream, namely three (3) or $00000011_{2}$. Therefore, $1 + 1 + 3 = 5$ characters are necessary to represent the eight additional l's. Since $5 < 8$, we choose to encode the eight additional a's in this manner, so now the compress output looks like:

A 00000000 00000101 00001000 01000010 00010000 10000100
00100001 l 00000000 00000100 00010001 00010001 00010001
00010001 a 00000000 00000011 00100100 10010010 01001001

The next unique character in our input is 'n'. It is immediately represented in our output:

A 00000000 00000101 00001000 01000010 00010000 10000100
00100001 l 00000000 00000100 00010001 00010001 00010001
00010001 a 00000000 00000011 00100100 10010010 01001001 
n

After this first occurrence of 'n', there are eight additional occurrences of 'n'. Ignoring the characters that have already been encoded, 'A', 'l', and 'a', after the first 'n', n's occur after 2 characters or bytes, 4 characters or bytes, 6 characters or bytes, 8 characters or bytes, 10 characters or bytes, 12 characters or bytes, 14 characters or bytes, and 16 characters or bytes. These recurrences are represented as:

01010101 01010101

These two binary numbers are each represented by a single character or byte, but we must also include the escape sequence, $00000000_{2}$, and the number of characters or bytes in the stream, namely two (2), or $00000010_{2}$. Therefore, $1 + 1 + 2 = 4$ characters or bytes are needed to represent these additional eight occurrences of 'n'. Since, $4 < 8$ we choose to encode. Our output is now:

A 00000000 00000101 00001000 01000010 00010000 10000100
00100001 l 00000000 00000100 00010001 00010001 00010001
00010001 a 00000000 00000011 00100100 10010010 01001001
n 00000000 00000010 01010101 01010101

The next unique character in our input is the space character, ' '. It is immediately represented in our output:

A 00000000 00000101 00001000 01000010 00010000 10000100
00100001 l 00000000 00000100 00010001 00010001 00010001
00010001 a 00000000 00000011 00100100 10010010 01001001
n 00000000 00000010 01010101 01010101 ' '

After this first space, ' ', seven additional spaces occur. Ignoring characters that have already been encoded, 'A', 'l', 'a', and 'n', after the first space, ' ', additional spaces occur after 1 character or byte, 2 characters or bytes, 3 characters or bytes, 4 characters or bytes, 5 characters or bytes, 6 characters or bytes, and 7 characters or bytes. This can be represented as:

1111111

But in any implementation, bits are represented in groups of 8 as bytes, so we need to add a single zero (0) to the end of this stream to make it a full byte:

11111110

This binary number will be represented as a single character or byte in the output. As usual, we'll also have to include the escape sequence, $00000000_{2}$, after the first space, and the number of bytes in the binary bit or byte stream, namely one (1), or $00000001_{2}$. So, it'll take $1 + 1 + 1 = 3$ characters or bytes to represent these additional seven spaces, ' '. Since $3 < 7$, we encode in this fashion. Here's our output:

A 00000000 00000101 00001000 01000010 00010000 10000100
00100001 l 00000000 00000100 00010001 00010001 00010001
00010001 a 00000000 00000011 00100100 10010010 01001001
n 00000000 00000010 01010101 01010101 ' ' 00000000 00000001
11111110

Now the entire input character or byte stream has been compressed or encoded. Remember, the input character or byte stream was 44 characters or bytes in length. In comparison, the compressed output is only 30 characters of bytes in length. Compression has been successful.


next up previous
Next: Decompression example Up: A new compression and Previous: Time complexity of the
James Riechel 2007-12-12
Hosted by www.Geocities.ws

1