Ocamyd

a file (de-)compressor based on the DMC algorithm

Ocamyd is a statistical file compressor/decompressor which resulted from my experiments with Gordon V. Cormack's and R. Nigel Horspool's Dynamic Markov Compression (DMC) algorithm. Like DMC - Ocamyd is a bit prediction algorithm, but it combines DMC's finite state machine with PPM-like low-order bit-context-models and a suffix match routine. On a statistical level Ocamyd tries to learn from the data it has processed by adjusting and weighting its predictions according to the prediction errors it made in the past. Beside the encoding parameters Ocamyd only transmits the arithmetic coded prediction errors to the decoder which makes the same assumptions as the encoder and therefore is capable of reconstructing the original message with the help of the transmitted and decoded error information. Ocamyd is a plain file to file (de-)compressor and has no archiving functions nor does it store any meta file information. Ocamyd is written in C and is open source and like DMC the implementation is free so long as you don't resell it.

A word of caution

Ocamyd is experimental software and in its current state i do not recommend it for end users or to backup/archive your data with it. Ocamyds usefulness is currently limited: beside the fact that it en-/decodes very slowly and requires relatively much memory, the different versions are not compatible with each other and currently Ocamyd is not compiler independent, so one should always use exactly the same version and compilation for the en- and decoding process. Furthermore Ocamyd makes heavy use of floating point arithmetic and correct functioning may depend on the floating point implementation of the source and target machine. If you are looking for an integer arithmetic DMC instead, i recommend Matt Mahoneys DMC implementation in PAQ8L.

Downloads

To my knowledge Ocamyd was the first approach to explore the possibilities of the DMC algorithm in combination with other prediction techniques (Edit: Not true. Suzanne Bunton has done something similar before for her Ph.D. thesis, she used a 256-ary alphabet DMC and merged it with PPM). A win32 console executable with the C source code is published below as a demonstration of these possibilities and with the hope that other programmers will pick up the concept and improve it. If you want to contribute something to Ocamyds source code, and you want it to be published on this site, you can send me your source code. Any serious contributions are welcome.
(The zip files of the final versions contain a win32 console executable build with MinGW GCC/g++ 3.4.2 and the C source code. Minimum requirement: 64 MB free memory)

Linux/Unix

Ocamyd is also compilable under Linux/Unix. For compilation with GCC 4 you should link the math library (-lm compiler option).  I have done no serious testing under Linux/Unix yet. But see also here.

Optimized builds

To speed up Ocamyds en-/decoding process with GCC you can turn on the optimization flag (-O) and you can generate architecture specific instruction set for the CPU type (-march) and for the floating point co-processor (-mfpmath) of the target machine (see the GCC Manual). Be aware that those builds will produce slightly different encoding results. As i mentioned above: always use identical builds for the encoding and decoding process.

Compression Results

See the following Compression Benchmark sites:
Maximum Compression, Ultimate Command Line Compressors, Large Text Compression Benchmark, Squeeze Chart, Netstore 101, EmilCont Ultracompression, Black FOX Compression Benchmark, Squxe Archivers Chart, Monster of Compression.

Other compression programs

I also did some experiments with symbol ranking. The resulting programs, which  may be best described as variants of a hybrid LZ/Symbolranking algorithm, are listed below. Please read the source codes for more details. All these programs are experimental software, free available and released under the GPL. The zip files with source codes and Windows executables for Symbra and lcssr are kindly provided by Matt Mahoney, the zip file with source code and Windows executable for contra is kindly provided by LovePimple.

Symbra
 is a symbol ranking file (de-)compressor similar to P. Fenwicks srank program.  Compared to srank it has an improved symbol ranking and code generation and a different way to handle the output, but at the cost of some speed. It optionally allows to merge the context queues with a suffix match search and has a 2-pass mode which analyzes the input file in the 1st pass to reorder the alphabet by descendending symbol frequencies. Symbras default mode is quite fast, with compression rates similar to LZOP/QuickLZ/slug. The suffix match search mode is slow, but it can achieve compression rates similar to those of the LZ77 deflate algorithms, and in some cases even surpass them.
Downloads:

lcssr
is a derivative of Symbra, which has dropped Symbras constant order context queues and the 2-pass mode to concentrate on symbol ranking by variable length contexts for which it uses the same binary search as Symbra. It is a slow, but compact algorithm.
Downloads:

contra
started as a variant of lcssr, which also ranks symbols by variable length contexts. Version 0.1 was an experiment in improving the speed of the suffix match search by using a ternary search instead of a binary search. In version 0.2 a state map + an arithmetic en-/decoder (taken from sr2 by Matt Mahoney) replaces the en-/decoder from version 0.1. Version 0.3 allows to configure the secondary context length by commandline argument. Version 0.4 expands the rank table context to 12 bits. Version 0.5 adds a binary search tree and a secondary context construction mode.
Downloads:

Any comments?

View the guestbook and/or add an entry to the guestbook.


This page is maintained by Frank Schwellinger. Last Update: 26-Mar-2008.
Hosted by www.Geocities.ws

1