A Run Length Encoding Scheme
For Block Sort Transformed Data by Michael A. Maniscalco

This paper details a modified Run Length Encoding (RLE) method for use on data sources which have undergone Block Sorting. While similar to the standard form of RLE, the method described here offers a potentially more optimal encoding scheme for data where the length of symbol runs is highly variable. The compression results posted later in this paper show an overall compression improvement for the Calgary corpus of 11k when employed with the M99 data coder. Further use of this technique (specific to M99 encoding scheme) has produced a reduction 20k to the same corpus over M99 without this modified RLE technique.

This paper does not detail the basics of Run Length Encoding. For those who are not familiar with Run Length Encoding, there are many papers available at Mark Nelson's Data Compression Library (http://dogma.net/DataCompression/).

The basic idea behind the method described here is to encode character runs with an RLE scheme while not introducing data into the transformed stream which will interupt the character runs left behind. Furthermore, this scheme is designed to be optimal for small runs while being capable of providing good encoding for larger runs.

The implementation details are as follows: Only runs of length 3 or greater are encoded. All runs of 1 and two are left unchanged. For any run of symbols in an input stream greater than 2, a run of the same symbol will be placed into the stream with a length equal to (N+3)/2 where N=the length of the original run. The following demonstrates this encoding.

Original Run -> Coded run
AAA -> AAA
AAAA -> AAA
AAAAA -> AAAA
AAAAAA -> AAAA
AAAAAAA -> AAAAA
AAAAAAAA -> AAAAA

The only overhead needed to reverse this encoding is one bit which indicates whether the original run of symbols was even or odd. These bits are packed together and stored at the end of the file. Given this, the above example is completely encoded as follows ...

Original Run -> Coded run + 1 bit
AAA -> AAA + 0
AAAA -> AAA + 1
AAAAA -> AAAA + 0
AAAAAA -> AAAA + 1
AAAAAAA -> AAAAA +0
AAAAAAAA -> AAAAA +1

This method leaves large runs of symbols smaller but uninterupted for encoding with stardard Block Sorting methods but reduces common symbol runs nearly in half.

The decoding for this scheme is very straight forward. For every run of symbols in the encoded stream, the original run is 2N-2 where N is the length of the encoded symbol run. This value is decremented by one if the extra encoded bit is a 0.

Sample code for coder:

Sample code for decoder:

Results:
The following are the compression results of the Calgary Corpus with the M99 data compression method. The basic method is Blocksorted with M99 coding only. The second method is Blocksorted + modified RLE + M99 coding only.

File	Size	M99	RLE+M99	Reduction	% increase
bib	111,261	29,278	28,610	668	+0.600%
book1	768,771	225,673	225,833	-160	-0.020%
book2	610,856	159,214	157,242	1,972	+0.322%
geo	102,400	58,393	58,247	146	+0.142%
news	377,109	123,237	121,117	2,120	+0.562%
obj1	21,504	11,031	10,867	164	+0.762%
obj2	246,814	83,296	80,699	2,597	+1.052%
paper1	53,161	17,633	17,305	328	+0.617%
paper2	82,199	25,895	25,561	334	+0.406%
paper3	46,526	16,429	16,240	189	+0.406%
paper4	13,286	5,392	5,343	49	+0.368%
paper5	11,954	5,082	5,019	63	+0.527%
paper6	38,105	13,183	12,912	271	+0.711%
pic	513,216	46,839	46,752	87	+0.017%
progc	39,611	13,450	13,122	328	+0.828%
progl	71,646	17,343	16,720	623	+0.869%
progp	49,379	12,335	11,903	432	+0.874%
trans	93,695	21,852	20,548	1,304	+1.391%
Totals	3,251,493	885,555	874,040	11,515	+0.579% (average)

This basic modified RLE coding scheme can be further modified such that for each symbol in the encoded run, the value symbols it represents is doubled. For this, the number of bits needed to reverse the encoding increases by one for every successive character in the encoded stream above 2. This improves compression ratios for Blocksorted files which have longer runs of symbols on average. This modification improves the compression on book2 somewhat however, on other files, it can be less effective than the basic method described above. Thus, to make the scheme general purpose, the results posted here are confined to the originally described scheme. While the test coder used here is not typical, I hope that this simple technique can improve other blocksorters aswell.

(C) 2000 Michael A Maniscalco
To contact the author with comments or questions (or your results) email to Click Here.
To visit the M99 Data Coder site: Click here

Visitor Number:�

Hosted by www.Geocities.ws