A Run Length Encoding Scheme
For Block Sort Transformed Data
by Michael A. Maniscalco



This paper details a modified Run Length Encoding (RLE) method for use on data sources which have undergone Block Sorting. While similar to the standard form of RLE, the method described here offers a potentially more optimal encoding scheme for data where the length of symbol runs is highly variable. The compression results posted later in this paper show an overall compression improvement for the Calgary corpus of 11k when employed with the M99 data coder. Further use of this technique (specific to M99 encoding scheme) has produced a reduction 20k to the same corpus over M99 without this modified RLE technique.

This paper does not detail the basics of Run Length Encoding. For those who are not familiar with Run Length Encoding, there are many papers available at Mark Nelson's Data Compression Library (http://dogma.net/DataCompression/).

The basic idea behind the method described here is to encode character runs with an RLE scheme while not introducing data into the transformed stream which will interupt the character runs left behind. Furthermore, this scheme is designed to be optimal for small runs while being capable of providing good encoding for larger runs.

The implementation details are as follows: Only runs of length 3 or greater are encoded. All runs of 1 and two are left unchanged. For any run of symbols in an input stream greater than 2, a run of the same symbol will be placed into the stream with a length equal to (N+3)/2 where N=the length of the original run. The following demonstrates this encoding.


Original Run -> Coded run
AAA -> AAA
AAAA -> AAA
AAAAA -> AAAA
AAAAAA -> AAAA
AAAAAAA -> AAAAA
AAAAAAAA -> AAAAA

The only overhead needed to reverse this encoding is one bit which indicates whether the original run of symbols was even or odd. These bits are packed together and stored at the end of the file. Given this, the above example is completely encoded as follows ...


Original Run -> Coded run + 1 bit
AAA -> AAA + 0
AAAA -> AAA + 1
AAAAA -> AAAA + 0
AAAAAA -> AAAA + 1
AAAAAAA -> AAAAA +0
AAAAAAAA -> AAAAA +1

This method leaves large runs of symbols smaller but uninterupted for encoding with stardard Block Sorting methods but reduces common symbol runs nearly in half.

The decoding for this scheme is very straight forward. For every run of symbols in the encoded stream, the original run is 2N-2 where N is the length of the encoded symbol run. This value is decremented by one if the extra encoded bit is a 0.

Sample code for coder:



Sample code for decoder:



Results:
The following are the compression results of the Calgary Corpus with the M99 data compression method. The basic method is Blocksorted with M99 coding only. The second method is Blocksorted + modified RLE + M99 coding only.

FileSizeM99RLE+M99Reduction% increase
bib111,26129,27828,610668+0.600%
book1768,771225,673225,833-160-0.020%
book2610,856159,214157,2421,972+0.322%
geo102,40058,39358,247146+0.142%
news377,109123,237121,1172,120+0.562%
obj121,50411,03110,867164+0.762%
obj2246,81483,29680,6992,597+1.052%
paper153,16117,63317,305328+0.617%
paper282,19925,89525,561334+0.406%
paper346,52616,42916,240189+0.406%
paper413,2865,3925,34349+0.368%
paper511,9545,0825,01963+0.527%
paper638,10513,18312,912271+0.711%
pic513,21646,83946,75287+0.017%
progc39,61113,45013,122328+0.828%
progl71,64617,34316,720623+0.869%
progp49,37912,33511,903432+0.874%
trans93,69521,85220,5481,304+1.391%
Totals3,251,493885,555874,04011,515+0.579% (average)


This basic modified RLE coding scheme can be further modified such that for each symbol in the encoded run, the value symbols it represents is doubled. For this, the number of bits needed to reverse the encoding increases by one for every successive character in the encoded stream above 2. This improves compression ratios for Blocksorted files which have longer runs of symbols on average. This modification improves the compression on book2 somewhat however, on other files, it can be less effective than the basic method described above. Thus, to make the scheme general purpose, the results posted here are confined to the originally described scheme. While the test coder used here is not typical, I hope that this simple technique can improve other blocksorters aswell.

(C) 2000 Michael A Maniscalco
To contact the author with comments or questions (or your results) email to Click Here.
To visit the M99 Data Coder site: Click here

Visitor Number:�
Hosted by www.Geocities.ws

1