A Run Length Encoding Scheme
For Block Sort Transformed Data
by Michael A. Maniscalco
This paper details a modified Run Length Encoding (RLE) method for use on data sources
which have undergone Block Sorting. While similar to the standard form of RLE, the
method described here offers a potentially more optimal encoding scheme for data where
the length of symbol runs is highly variable. The compression results posted later in
this paper show an overall compression improvement for the Calgary corpus of 11k when employed
with the M99 data coder. Further use of this technique (specific to M99 encoding scheme) has
produced a reduction 20k to the same corpus over M99 without this modified RLE technique.
This paper does not detail the basics of Run Length Encoding. For those who are not familiar
with Run Length Encoding, there are many papers available at Mark Nelson's Data Compression
Library (http://dogma.net/DataCompression/).
The basic idea behind the method described here is to encode character runs with an RLE scheme
while not introducing data into the transformed stream which will interupt the character runs left
behind. Furthermore, this scheme is designed to be optimal for small runs while being capable of
providing good encoding for larger runs.
The implementation details are as follows:
Only runs of length 3 or greater are encoded. All runs of 1 and two are left unchanged.
For any run of symbols in an input stream greater than 2, a run of the same symbol
will be placed into the stream with a length equal to (N+3)/2 where N=the length of the
original run. The following demonstrates this encoding.
Original Run -> Coded run
AAA -> AAA
AAAA -> AAA
AAAAA -> AAAA
AAAAAA -> AAAA
AAAAAAA -> AAAAA
AAAAAAAA -> AAAAA
The only overhead needed to reverse this encoding is one bit which indicates whether the original
run of symbols was even or odd. These bits are packed together and stored at the end of
the file. Given this, the above example is completely encoded as follows ...
Original Run -> Coded run + 1 bit
AAA -> AAA + 0
AAAA -> AAA + 1
AAAAA -> AAAA + 0
AAAAAA -> AAAA + 1
AAAAAAA -> AAAAA +0
AAAAAAAA -> AAAAA +1
This method leaves large runs of symbols smaller but uninterupted for encoding with
stardard Block Sorting methods but reduces common symbol runs nearly in half.
The decoding for this scheme is very straight forward. For every run of symbols in the
encoded stream, the original run is 2N-2 where N is the length of the encoded symbol run.
This value is decremented by one if the extra encoded bit is a 0.
Sample code for coder:
Sample code for decoder:
Results:
The following are the compression results of the Calgary Corpus with the M99 data compression
method. The basic method is Blocksorted with M99 coding only. The second method is
Blocksorted + modified RLE + M99 coding only.
File | Size | M99 | RLE+M99 | Reduction | % increase
|
bib | 111,261 | 29,278 | 28,610 | 668 | +0.600%
|
book1 | 768,771 | 225,673 | 225,833 | -160 | -0.020%
|
book2 | 610,856 | 159,214 | 157,242 | 1,972 | +0.322%
|
geo | 102,400 | 58,393 | 58,247 | 146 | +0.142%
|
news | 377,109 | 123,237 | 121,117 | 2,120 | +0.562%
|
obj1 | 21,504 | 11,031 | 10,867 | 164 | +0.762%
|
obj2 | 246,814 | 83,296 | 80,699 | 2,597 | +1.052%
|
paper1 | 53,161 | 17,633 | 17,305 | 328 | +0.617%
|
paper2 | 82,199 | 25,895 | 25,561 | 334 | +0.406%
|
paper3 | 46,526 | 16,429 | 16,240 | 189 | +0.406%
|
paper4 | 13,286 | 5,392 | 5,343 | 49 | +0.368%
|
paper5 | 11,954 | 5,082 | 5,019 | 63 | +0.527%
|
paper6 | 38,105 | 13,183 | 12,912 | 271 | +0.711%
|
pic | 513,216 | 46,839 | 46,752 | 87 | +0.017%
|
progc | 39,611 | 13,450 | 13,122 | 328 | +0.828%
|
progl | 71,646 | 17,343 | 16,720 | 623 | +0.869%
|
progp | 49,379 | 12,335 | 11,903 | 432 | +0.874%
|
trans | 93,695 | 21,852 | 20,548 | 1,304 | +1.391%
|
Totals | 3,251,493 | 885,555 | 874,040 | 11,515 | +0.579% (average)
|
This basic modified RLE coding scheme can be further modified such that for each symbol
in the encoded run, the value symbols it represents is doubled. For this, the number of bits
needed to reverse the encoding increases by one for every successive character in the encoded
stream above 2. This improves compression ratios for Blocksorted files which have longer runs
of symbols on average. This modification improves the compression on book2 somewhat however,
on other files, it can be less effective than the basic method described above. Thus, to make
the scheme general purpose, the results posted here are confined to the originally described scheme.
While the test coder used here is not typical, I hope that this simple technique can improve
other blocksorters aswell.
(C) 2000 Michael A Maniscalco
To contact the author with comments or questions (or your results) email to
Click Here.
To visit the M99 Data Coder site: Click here
Visitor Number:�