Ocamyd

a file (de-)compressor based on the DMC algorithm

Ocamyd is a statistical file compressor/decompressor which resulted from my experiments with Gordon V. Cormack's and R. Nigel Horspool's Dynamic Markov Compression (DMC) algorithm. Like DMC - Ocamyd is a bit prediction algorithm, but it combines DMC's finite state machine with PPM-like low-order bit-context-models and a suffix match routine. On a statistical level Ocamyd tries to learn from the data it has processed by adjusting and weighting its predictions according to the prediction errors it made in the past. Beside the encoding parameters Ocamyd only transmits the arithmetic coded prediction errors to the decoder which makes the same assumptions as the encoder and therefore is capable of reconstructing the original message with the help of the transmitted and decoded error information. Ocamyd is a plain file to file (de-)compressor and has no archiving functions nor does it store any meta file information. Ocamyd is written in C and is open source and like DMC the implementation is free so long as you don't resell it.

A word of caution

Ocamyd is experimental software and in its current state i do not recommend it for end users or to backup/archive your data with it. Ocamyds usefulness is currently limited: beside the fact that it en-/decodes very slowly and requires relatively much memory, the different versions are not compatible with each other and currently Ocamyd is not compiler independent, so one should always use exactly the same version and compilation for the en- and decoding process. Furthermore Ocamyd makes heavy use of floating point arithmetic and correct functioning may depend on the floating point implementation of the source and target machine. If you are looking for an integer arithmetic DMC instead, i recommend Matt Mahoneys DMC implementation in PAQ8L.

Downloads

To my knowledge Ocamyd was the first approach to explore the possibilities of the DMC algorithm in combination with other prediction techniques (Edit: Not true. Suzanne Bunton has done something similar before for her Ph.D. thesis, she used a 256-ary alphabet DMC and merged it with PPM). A win32 console executable with the C source code is published below as a demonstration of these possibilities and with the hope that other programmers will pick up the concept and improve it. If you want to contribute something to Ocamyds source code, and you want it to be published on this site, you can send me your source code. Any serious contributions are welcome.

latest version: Ocamyd-1.66.final (Caution: this version has a bug in pupdate(), i guess it's an underflow error. As a workaround you can try to replace the pupdate() function with the one from v1.65. Update: LovePimple kindly provides a speed optimized ocamyd166test1 win32 executable [Pentium III or above required] with a replaced pupdate() function [as suggested above]. Please let me know if this version works on those files where the final 1.66 version fails.)
latest stable version: Ocamyd-1.65.final
older versions: Ocamyd-1.64 (external link, no source code)
other versions: Ocamyd-1.65.final-LTCB-1.0 (Modified 1.65 version by Mauro Vezzosi for Matt Mahoney's Large Text Compression Benchmark, resulting from Mauros experiments on large text files, his improvements are also included in the 1.66 version). Ocamyd-1.66.mod4 (Source code of a modified 1.66 version by Nania Francesco Antonio from Dec-2007, introducing Francescos ADMC routines. This is a preliminary version - i still haven't found time to work on it. Update: Here's a zip file of this version with source code and a speed optimized win32 executable [Pentium III or above required] provided by LovePimple.)

_{(The zip files of the final versions contain a win32 console executable build with
MinGW GCC/g++ 3.4.2 and the C source code. Minimum requirement: 64 MB
free memory)}

Linux/Unix

Ocamyd is also compilable under Linux/Unix. For compilation with GCC 4 you should link the math library (-lm compiler option). I have done no serious testing under Linux/Unix yet. But see also here.

Optimized builds

To speed up Ocamyds en-/decoding process with GCC you can turn on the optimization flag (-O) and you can generate architecture specific instruction set for the CPU type (-march) and for the floating point co-processor (-mfpmath) of the target machine (see the GCC Manual). Be aware that those builds will produce slightly different encoding results. As i mentioned above: always use identical builds for the encoding and decoding process.

Compression Results

See the following Compression Benchmark sites:
Maximum Compression, Ultimate Command Line Compressors, Large Text Compression Benchmark, Squeeze Chart, Netstore 101, EmilCont Ultracompression, Black FOX Compression Benchmark, Squxe Archivers Chart, Monster of Compression.

Other compression programs

I also did some experiments with symbol ranking. The resulting programs, which may be best described as variants of a hybrid LZ/Symbolranking algorithm, are listed below. Please read the source codes for more details. All these programs are experimental software, free available and released under the GPL. The zip files with source codes and Windows executables for Symbra and lcssr are kindly provided by Matt Mahoney, the zip file with source code and Windows executable for contra is kindly provided by LovePimple.

Symbra is a symbol ranking file (de-)compressor similar to P. Fenwicks srank program. Compared to srank it has an improved symbol ranking and code generation and a different way to handle the output, but at the cost of some speed. It optionally allows to merge the context queues with a suffix match search and has a 2-pass mode which analyzes the input file in the 1st pass to reorder the alphabet by descendending symbol frequencies. Symbras default mode is quite fast, with compression rates similar to LZOP/QuickLZ/slug. The suffix match search mode is slow, but it can achieve compression rates similar to those of the LZ77 deflate algorithms, and in some cases even surpass them.

Downloads:

Symbra v0.1 (C++ source code released 26-Nov-2007): symbra-0.1.cpp.gz
Symbra v0.2 (C++ source code released 29-Nov-2007): symbra-0.2.cpp.gz
Symbra v0.2 (C++ source code + Windows executable): symbra02.zip

lcssr is a derivative of Symbra, which has dropped Symbras constant order context queues and the 2-pass mode to concentrate on symbol ranking by variable length contexts for which it uses the same binary search as Symbra. It is a slow, but compact algorithm.

Downloads:

lcssr v0.1 (C++ source code released 01-Dec-2007): lcssr-0.1.cpp.gz
lcssr v0.2 (C++ source code released 03-Dec-2007): lcssr-0.2.cpp.gz
lcssr v0.2 (C++ source code + Windows executable): lcssr02.zip

contra started as a variant of lcssr, which also ranks symbols by variable length contexts. Version 0.1 was an experiment in improving the speed of the suffix match search by using a ternary search instead of a binary search. In version 0.2 a state map + an arithmetic en-/decoder (taken from sr2 by Matt Mahoney) replaces the en-/decoder from version 0.1. Version 0.3 allows to configure the secondary context length by commandline argument. Version 0.4 expands the rank table context to 12 bits. Version 0.5 adds a binary search tree and a secondary context construction mode.

Downloads:

contra v0.1 (C++ source code released 14-Dec-2007): contra-0.1.cpp.gz
contra v0.2 (C++ source code released 21-Dec-2007): contra-0.2.cpp.gz
contra v0.3 (C++ source code released 22-Dec-2007): contra-0.3.cpp.gz
contra v0.4 (C++ source code released 23-Dec-2007): contra-0.4.cpp.gz
contra v0.5 (C++ source code released 26-Dec-2007): contra-0.5.cpp.gz
contra v0.5 (C++ source code + Windows executable): contra05.zip

Any comments?

View the guestbook and/or add an entry to the guestbook.

This page is maintained by Frank Schwellinger. Last Update: 26-Mar-2008.

Hosted by www.Geocities.ws