Arithmetic Coding Compression Algorithm Method

Arithmetic Coding

This compression is quite similar to Huffman, because it search for the quite same kind of content to compress, the different are the way it process the source and instead of giving bit value for each symbol it uses probability value for each symbol. It is based on probability between 0 (zero) to 1 (one). This method use a simple math calculation, and this simple calculation resulted in the best compression ratio.

It requires 5 variables for the encoding, they are RANGE, LOW, HIGH, LR (Low Range) and HR (High Range).
Eg: source: ABCBAA (6 bytes).
A = 3, B = 2, C = 1.
p(A) = 3 / 6 = 0,50 -> RANGE: [0,00 - less than 0,50] -> LR = 0,00, HR = 0,50
p(B) = 2 / 6 = 0,33 -> RANGE: [0,50 - less than 0,83] -> LR = 0,50, HR = 0,83
p(C) = 1 / 6 = 0,16 -> RANGE: [0,83 - less than 1,00] -> LR = 0,83, HR = 1,00
Encoding:
Define starting variable value: LOW = 0, HIGH = 1
RANGE = HIGH (previous) - LOW (previous)
LOW = LOW (previous) + (RANGE * LR of current symbol)
HIGH = LOW (previous) + (RANGE * HR of current symbol)

Symbol	RANGE	LOW	HIGH
(start - none)	1	0	1
A	1	0	0,5
B	0,5	0,25	0,415
C	0,165	0,38695	0,415
B	0,02805	0,400975	0,4102315
A	0,0092565	0,400975	0,40560325
A	0,00462825	0,400975	0.403289125

The output value is 0,400975.

Decoding:
Requires 5 variables, they are RANGE, LR (Low Range), HR (High Range), VALUE and RD (Range Different).
1. Output the symbol by determine in which range the VALUE is.
2. Get a new VALUE using this calculation:
RD = HR - LR
VALUE = (VALUE - LR) / RD

VALUE	RANGE	Symbol	RD
0,400975	0,00 - less than 0,50	A	0,50
0.80195	0,50 - less than 0,83	B	0,33
0.915	0,83 - less than 1,00	C	0,16
0.53125	0,50 - less than 0,83	B	0,33
0.09469696	0,00 - less than 0,50	A	0,50
0.18939392	0,00 - less than 0,50	A	0,50

Note: as characters to encode increasing then the calculation is more complicated
Pros: may have better (smaller) result than Huffman.
Cons: slow (requires big FPU calculation).
Conclusion: at the time this document written this compression is not popular as Huffman because of its gain is not worth the slowness, may be in the future as the computer technology became faster and faster this will not be a problem.

Tips: because we have to store the last value of LOW, it is better if we sort the symbol and put the highest frequent symbol on the top, a zero value of LOW will resulted in a smaller value of the final value of LOW

This compression method is the best compression available as this writing, if we use(combine) arbitrary precision than we can compress the whole content within one loop, but how do we store the last floating point for decoding later? because we may have 3000 digits decimal value, also this will add the time needed for encoding and decoding. If we can use small enough space for this last value then this method could be the only method needed to compress data.

If we use arbitrary precision, this method can be done recursively, because we can compress random data. This mean we can compress any (big) file size to a small (probably fixed) size.

But what is the use, if we need, perhaps, one day or 6 hours to encode and also to decode the data, because we can just straigh download it for the same time amount needed, right? that is right, but we can have an ultimate recursive lossless compression, which can compress any data to small size, we save space! maybe one floppy (1.44MB) for two or more hours of decompress audio-video entertainment, goverments or any company do not need bulky media storage for historical record.

Maybe when we have a super computer with very fast CPU/FPU or maybe if we have a special chip (hardware) for encoding for decoding then we can use (apply) this compression method. Maybe later in the future we can find another simpler math calculation that we can exploit to compress data, without the need of huge computing resources like this method.

RLE

Huffman

Hosted by www.Geocities.ws