Ocamyd
a file (de-)compressor
based on the
DMC algorithm
Ocamyd is a statistical file compressor/decompressor which resulted
from my experiments with Gordon V. Cormack's and R. Nigel Horspool's
Dynamic
Markov Compression
(DMC) algorithm. Like DMC - Ocamyd is a bit prediction algorithm, but it
combines
DMC's finite state machine with PPM-like
low-order bit-context-models and a suffix match routine. On a
statistical level Ocamyd tries
to learn from the data it has processed by adjusting and weighting its
predictions according to the prediction errors it made in the past.
Beside the encoding parameters Ocamyd only transmits the arithmetic
coded prediction
errors to the decoder which makes the same assumptions as the encoder
and therefore is capable of reconstructing the original message with
the help of the transmitted and decoded error information.
Ocamyd is a
plain
file to file (de-)compressor and has no archiving functions nor does it
store any meta file information. Ocamyd is
written in C and is open source and like DMC the implementation is free
so long as you don't resell it.
A word of caution
Ocamyd is experimental software and in its current
state i do not
recommend it for end users or to backup/archive your
data with it. Ocamyds usefulness is currently limited: beside
the fact
that it en-/decodes very slowly and requires relatively much
memory, the different versions are not compatible with each other and
currently Ocamyd is not compiler independent, so one should
always use
exactly the same version and compilation for the en- and decoding
process. Furthermore Ocamyd makes heavy use of floating point
arithmetic and correct functioning may depend on the floating
point implementation of the source and target machine. If you are
looking for an integer arithmetic DMC instead, i recommend Matt Mahoneys DMC
implementation in PAQ8L.
Downloads
To my knowledge Ocamyd was the first approach to explore the
possibilities of the DMC algorithm in combination with other prediction
techniques (Edit: Not true. Suzanne
Bunton has done something similar before for her Ph.D. thesis, she used a
256-ary alphabet DMC and merged it with PPM). A win32 console executable with the C source
code is
published below as a demonstration of these possibilities and with the
hope that other programmers will pick up the concept and
improve
it. If you want to contribute something to Ocamyds source code, and you want it to be published on this site, you can
send me your source code. Any serious contributions are welcome.
- latest version: Ocamyd-1.66.final (Caution:
this version has a bug in pupdate(), i guess it's an underflow error.
As a workaround you can try to replace the pupdate() function with
the one from v1.65. Update: LovePimple kindly provides a speed optimized ocamyd166test1
win32 executable [Pentium III or above required] with a replaced
pupdate() function [as suggested above]. Please let me know if this
version works on those files where the final 1.66 version fails.)
- latest stable version: Ocamyd-1.65.final
- older versions: Ocamyd-1.64
(external link, no source code)
- other versions: Ocamyd-1.65.final-LTCB-1.0 (Modified
1.65 version by Mauro Vezzosi for Matt Mahoney's Large Text
Compression Benchmark, resulting from Mauros experiments on large text
files, his improvements are also included in the 1.66 version). Ocamyd-1.66.mod4
(Source code of a modified 1.66 version by Nania Francesco Antonio from
Dec-2007, introducing Francescos ADMC routines. This is a preliminary
version - i still haven't found time to work on it. Update: Here's
a zip file of this version with source code and a speed
optimized win32 executable [Pentium III or above required] provided by
LovePimple.)
(The zip files of the final versions contain a win32 console executable build with
MinGW GCC/g++ 3.4.2 and the C source code. Minimum requirement: 64 MB
free memory)
Linux/Unix
Ocamyd is also compilable under Linux/Unix. For
compilation with GCC 4 you should link the math library (-lm compiler
option). I have done no serious testing under Linux/Unix yet.
But
see also here.
Optimized builds
To speed up Ocamyds en-/decoding process with GCC you can turn on the
optimization flag (-O) and you can generate architecture specific
instruction set for the CPU type (-march) and for
the floating
point co-processor (-mfpmath) of the target machine (see the GCC
Manual). Be aware that those
builds will produce slightly different encoding results. As i mentioned
above: always use identical builds for the encoding and decoding
process.
Compression Results
See the following Compression Benchmark sites:
Maximum
Compression, Ultimate
Command Line Compressors, Large
Text Compression Benchmark, Squeeze Chart,
Netstore
101, EmilCont
Ultracompression, Black
FOX
Compression Benchmark, Squxe Archivers Chart, Monster of Compression.
Other compression programs
I also did some experiments with symbol ranking. The resulting
programs, which may be best
described as variants of a hybrid LZ/Symbolranking algorithm, are
listed below. Please read the source codes for more details. All
these programs are experimental software, free available and released
under the
GPL. The zip files with source codes and Windows executables for Symbra and lcssr are kindly
provided by Matt Mahoney, the zip file with source code and Windows executable for contra is kindly provided by LovePimple.
Symbra is a symbol ranking file (de-)compressor similar to P. Fenwicks srank program. Compared to srank it
has an improved symbol ranking and code generation and a different
way to handle the output, but at the cost of some speed. It
optionally allows to merge the context queues with a suffix match
search and
has a 2-pass mode which analyzes the input file in the 1st pass to reorder the alphabet by descendending
symbol frequencies. Symbras default mode is quite fast, with
compression rates similar to LZOP/QuickLZ/slug. The suffix match search mode is slow, but it can achieve compression rates similar to those of the LZ77 deflate algorithms, and in some cases even surpass them.
- Downloads:
lcssr is a derivative of Symbra, which has dropped Symbras
constant order context queues and the 2-pass mode to
concentrate on symbol ranking by variable length contexts for
which it uses the same binary search as Symbra. It is a slow, but compact
algorithm.
- Downloads:
contra started as a variant of lcssr,
which also ranks symbols by variable length contexts. Version 0.1 was an experiment in
improving the speed of the suffix match search by using a ternary
search instead of a binary search. In version 0.2 a state map + an arithmetic en-/decoder (taken from sr2 by
Matt Mahoney) replaces the en-/decoder from version 0.1. Version 0.3
allows to configure the secondary context length by commandline
argument. Version 0.4 expands the rank table context to 12 bits.
Version 0.5 adds a binary search tree and a secondary context
construction mode.
- Downloads:
Any comments?
View
the guestbook and/or add
an entry to the guestbook.
This page is
maintained by Frank Schwellinger. Last Update: 26-Mar-2008.