A Second Modified Run Length Encoding
Scheme for Blocksort Transformed Data
by Michael A. Maniscalco
This paper details a run length encoding (RLE) scheme which is designed to work with data streams which
have undergone blocksorting. This RLE method is an extension of the simpler RLE method which I had
described previously in the paper
A Modified Run Length Encoding Scheme For Blocksort Transformed Data.
This paper does not detail the basics of Run Length Encoding. For those who are not familiar
with Run Length Encoding, there are many papers available at Mark Nelson's "Data Compression Library"
locatated at
http://www.dogma.net/DataCompression.
The basic concept behind the Modified Run Length Encoding method (mod-RLE) is to modify the
lengths of the symbol runs in the source data stream such that the resulting run length can be
used to determine the length of the original run. Additional information is required to
reverse the encoded run and, this information is stored elsewhere in the encoded stream. For
the original mod-RLE (mod1-RLE), this additional information was a fixed length code of 1 bit in size.
For the second mod-RLE (mod2-RLE), this additional information is a variable length code.
The original Mod1-RLE method employed fixed length codes of
1 bit in size to reverse the actual Run Length Encoding. The transform was employed only on
runs of length 3 or greater where the encoded run is of length 2+((N-1)/2) and N is the
length of the original run. To reverse this, the additional fixed length code of 1 bit is needed
and is either 0 if the original run was odd in length or, 1 if the original run was even in
length. Thus, the decoder could reverse the mod-RLE with the formula ((N-2)*2)+1 where N is
the length of the encoded run. And this value is incremented by 1 if the fixed length 1 bit
code has a value of 1.
Thus, in C the code might appear as:
// for encoding ...
tempRunLength = originalRunLength + 3;
fixedLengthCodeBit = tempRunLength & 1;
encodedRunLength = tempRunLength >> 1;
// for decoding ...
originalRunLength = ((encodedRunLength << 1) | fixedLengthCodeBit)-3;
|
Compression ratios can be improved here by using an Arithmetic encoder to encode the
single bit fixed length code because, after blocksorting, data streams will usually
contain far more runs of length 3 than length 4. Thus, this bit is more often 0 than
1. But, for the sake of simplicity, the original paper on Mod-RLE1 did not use this
improvement.
The remainder of the paper will focus on the new Mod2-RLE. This method has several improvements
over the original Mod1-RLE method. These improvments include the introduction of variable
length codes rather than the fixed length 1 bit codes used in Mod1-RLE as well as the employment
of arithmetic encoding for these variable length codes. The resulting encoded streams
produced with Mod2-RLE are typically much smaller than those encoded with Mod1-RLE. However,
while the Mod2-RLE streams are more compressed than the Mod1-RLE streams, the overall compression
results where RLE is only the first step in the compression process, (ie. blocksorting) are
typically the same as the original Mod1-RLE. This came as quite a suprise but these results
might be due to the post-RLE compression scheme used in this test
*.
There are some cases where Mod1-RLE is slightly better than the more complex Mod2-RLE.
But, in general, Mod2-RLE does typically leaded to moderate compression improvements.
*�
NOTE: The post-RLE compression scheme used here is M99. This coder is noted to typically produce more compressed
streams than a general purpose RLE scheme. I believe this is why the overall compression rates
for Mod1-RLE and Mod2-RLE are so similar. Further tests to prove this will involve using a
standard Arithmetic coder in stead of an M99 coder.
The general strategy with Mod2-RLE is somewhat similar to its predecesor. As with Mod1-RLE,
only runs of length 3 or greater will be encoded. Also, as with Mod1-RLE, the length of the
resulting encoded run will be used to determine the length of the original run. However,
with Mod2-RLE the length of the encoded run will also be used to determine the size of the
variable length code (stored else where in the compressed stream) which is needed to realize
the original decoded run length.
The following table illustrates how the size and value of the variable length codes for
Mod2-RLE are derived.
Original Run Length (ORL) | ORL - 1 | ORL - 1 (Binary) | Bit Length
| Bit Length -1 Variable Length Code Size
| From ORL -1 (Binary) Variable Length Code Value |
3 | 2 | 10 | 2 | 1 | 0 = 0 |
4 | 3 | 11 | 2 | 1 | 1 = 1 |
5 | 4 | 100 | 3 | 2 | 00 = 0 |
6 | 5 | 101 | 3 | 2 | 01 = 1 |
7 | 6 | 110 | 3 | 2 | 10 = 2 |
8 | 7 | 111 | 3 | 2 | 11 = 3 |
9 | 8 | 1000 | 4 | 3 | 000 = 0 |
10 | 9 | 1001 | 4 | 3 | 001 = 1 |
11 | 10 | 1010 | 4 | 3 | 010 = 2 |
12 | 11 | 1011 | 4 | 3 | 011 = 3 |
13 | 12 | 1100 | 4 | 3 | 100 = 4 |
14 | 13 | 1101 | 4 | 3 | 101 = 5 |
15 | 14 | 1110 | 4 | 3 | 110 = 6 |
16 | 15 | 1111 | 4 | 3 | 111 = 7 |
17 | 16 | 10000 | 5 | 4 | 0000 = 0 |
18 | 17 | 10001 | 5 | 4 | 0001 = 1 |
Thus, in C, the code for calculating the size of the variable length code might be:
if (originalRunLength >= 3){
���orlMinusOne = originalRunLength - 1;
���codeSize = -1;
���bit = 1;
���while (bit <= orlMinusOne){
������codeSize++;
������bit <<= 1;
������}
���}
���else
���codeSize = 0;
|
And the code for calculating the value of the variable length code might be:
codeValue = 0;
bit = 1;
while (codeSize > 0){
���codeValue |= ( orlMinusOne & bit);
���bit <<= 1;
���codeSize--;
���}
|
Ofcorse, it would be wiser to build a look up table to improve speed rather than to
re-calculate these values with every encoding.
At this point, all that remains in order to encode a run is to calculate the new run length for the
encoded stream. This new run length is calculated as N+2 where N is the number of bits in the
variable length code as calculated above. Thus, an original run length of 3 or 4 generates an
encoded run length of 3. An original run length of 5,6,7 or 8 generates an encoded run length
of 4. An original run length of 9 through 16 generates a run length of 5. Etc.
The decoding process determines the length of the original run as follows:
The size of the variable length code is N-2 bits where N is the length of the encoded run.
The length of the original run is then calculated as 2^(N-2)+V+1 where N is the length of the encoded run and
V is the value of the variable length code.
The following table illustrates the decoding process.
Encode Run Length (N) | Variable Length Code Size (S) = N-2 |
Variable Length Code Value (V) | Decoded Run Length (2^S)+V+1
|
3 | 1 | 0 = 0 | (2^1)+0+1 = 3 |
3 | 1 | 1 = 1 | (2^1)+1+1 = 4 |
4 | 2 | 00 = 0 | (2^2)+0+1 = 5 |
4 | 2 | 01 = 1 | (2^2)+1+1 = 6 |
4 | 2 | 10 = 2 | (2^2)+2+1 = 7 |
4 | 2 | 11 = 3 | (2^2)+3+1 = 8 |
5 | 3 | 000 = 0 | (2^3)+0+1 = 9 |
5 | 3 | 001 = 1 | (2^3)+1+1 = 10 |
5 | 3 | 010 = 2 | (2^3)+2+1 = 11 |
5 | 3 | 011 = 3 | (2^3)+3+1 = 12 |
5 | 3 | 100 = 4 | (2^3)+4+1 = 13 |
5 | 3 | 101 = 5 | (2^3)+5+1 = 14 |
5 | 3 | 110 = 6 | (2^3)+6+1 = 15 |
5 | 3 | 111 = 7 | (2^3)+7+1 = 16 |
6 | 4 | 0000 = 0 | (2^4)+0+1 = 17 |
6 | 4 | 0000 = 1 | (2^4)+1+1 = 18 |
Thus, the code for the decoder might be:
int calc2Exp(int N){
���// calculate 2^N
���int v = 1;
���while (N>0){
������N--;
������v <<= 1;
������}
���return v;
���}
if (encodedRunLength >= 3){
���variableLengthCodeSize = encodedRunLength - 2;
���variableLengthCodeValue = getCode(variableLengthCodeSize);
���decodedRunLength = calc2Exp(variableLengthCodeSize) + variableLengthCodeValue + 1;
���}
���else
���decodedRunLength = encodedRunLength;
|
As mentioned above, a strong improvement to Mod2-RLE would be to encode the variable length
code values using an arithmetic encoder. Since most of the variable length codes will be less
than 8 bits in size (often far less), the stragedy that I implemented was to have two
variable length code buffers. The first is the output stream from an arithmetic coder which
is used to encode all variable length codes of size 8 bits and less. The second buffer is
used to store the literal binary values of codes which are greater than 8 bits in length.
The following table is the results of both the original Mod1-RLE and the new Mod2-RLE presented
here on the calgary corpus as apply to the files post-blocksort. Also encluded is the compression
ratio when the M99 encoding scheme is applied to the Mod-RLE encoded streams.
File | Size | Mod1-RLE | Mod2-RLE | Mod1-RLE+M99 | Mod2-RLE+M99
|
bib | 111,261 | 83,090 | 64,737 | 28,610 | 28,329
|
book1 | 768,771 | 650,338 | 592,170 | 225,833 | 226,963
|
book2 | 610,856 | 479,106 | 403,311 | 157,242 | 156,932
|
geo | 102,400 | 88,276 | 76,550 | 58,247 | 58,196
|
news | 377,109 | 304,769 | 264,416 | 121,117 | 120,559
|
obj1 | 21,504 | 17,638 | 15,006 | 10,867 | 10,913
|
obj2 | 246,814 | 183,239 | 142,395 | 80,699 | 79,878
|
paper1 | 53,161 | 43,265 | 38,654 | 17,305 | 17,342
|
paper2 | 82,199 | 67,517 | 60,254 | 25,561 | 25,545
|
paper3 | 46,526 | 39,429 | 36,463 | 16,240 | 16,328
|
paper4 | 13,286 | 11,660 | 11,042 | 5,343 | 5,431
|
paper5 | 11,954 | 10,386 | 9,836 | 5,019 | 5,098
|
paper6 | 38,105 | 31,354 | 28,408 | 12,912 | 13,007
|
pic | 513,216 | 301,372 | 115,138 | 46,752 | 47,771
|
progc | 39,611 | 31,907 | 28,265 | 13,122 | 13,150
|
progl | 71,646 | 52,265 | 41,197 | 16,720 | 16,648
|
progp | 49,379 | 35,921 | 28,244 | 11,903 | 11,794
|
trans | 93,695 | 64,589 | 47,007 | 20,548 | 20,282
|
Totals | 3,251,493 | 2,496,121 | 2,003,093 | 874,040 | 874,166
|
The results of the two Mod-RLE are very similar. It is also noteworthy that the Mod1-RLE method
does not call for the use of Arithmetic coding where are the Mod2-RLE method does. Tests suggest
that the Mod1-RLE method would be improved by approx 2,000 bytes by employing an arithmetic coder.
While this would make the original Mod1-RLE method a better method overall, this does not suggest
that it is a better choice in all cases. Note that the overall size after just Mod-RLE is
better for Mod2-RLE. This suggests that it is a better choice for pre-blocksort RLE where
Mod1-RLE is a better choice for post-blocksort RLE.
(C) 2001 Michael A Maniscalco
To contact the author with comments or questions (or your results) email to
Click Here.
To visit the M99 Data Coder site:
Click here