These are notes I took from a reading of Li and Vitanyi's An Introduction to Kolmogorov Complexity and Its Applications.

Arriving at -∑_i p_i log p_i

We show an encoding for strings that leads to the expression -∑_i p_i log p_i, used as the information content, or entropy, in Shannon's Information Theory.

Given a fixed, arbitrary alphabet of s letters {a_1,...,a_s}, there exists a method to encode/transmit any string of length k over this alphabet in at most s log k + log (k!/(k_1!k_2!...k_s!)) bits, where k_i is the number of occurrences of a_i within the string. (Note: we will furthermore show that log (k!/(k_1!k_2!...k_s!)) ∼ -k ∑_i (k_i/k) log (k_i/k).)

This is the encoding/decoding scheme:

First the numbers k_1,...,k_s are communicated. This takes s log k bits. Now there are only

(k choose k_1)(k-k_1 choose k_2)...(k_s choose k_s) = k!/(k_1!k_2!...k_s!)

strings with exactly these letter counts. This allows us to uniquely specify any one of these strings using a number from 1 to k!/(k_1!k_2!...k_s!). Communicating this number takes at most log (k!/(k_1!k_2!...k_s!)) bits. (An interesting observation here is that the number of bits communicated using such a scheme is smaller when the numbers k_1,...,k_s differ more.)
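
As a concrete illustration of the indexing step, here is a minimal Python sketch (the helpers multinomial, rank and unrank are my names, not the book's) that computes the index of a string among all strings with the same letter counts, and inverts it:

```python
from math import factorial
from collections import Counter

def multinomial(counts):
    """k!/(k_1! k_2! ... k_s!): the number of strings with these counts."""
    k = sum(counts.values())
    result = factorial(k)
    for c in counts.values():
        result //= factorial(c)
    return result

def rank(s):
    """Index (0-based here) of s among all strings with the same letter
    counts, listed in lexicographic order."""
    counts, r = Counter(s), 0
    for ch in s:
        # Count the strings that continue with a strictly smaller letter here.
        for smaller in sorted(counts):
            if smaller >= ch:
                break
            if counts[smaller] > 0:
                counts[smaller] -= 1
                r += multinomial(counts)
                counts[smaller] += 1
        counts[ch] -= 1
    return r

def unrank(r, counts):
    """Inverse of rank, given the letter counts k_1,...,k_s."""
    counts, out = Counter(counts), []
    for _ in range(sum(counts.values())):
        for ch in sorted(counts):
            if counts[ch] == 0:
                continue
            counts[ch] -= 1
            block = multinomial(counts)
            if r < block:
                out.append(ch)
                break
            r -= block
            counts[ch] += 1
    return "".join(out)

s = "abracadabra"
assert unrank(rank(s), Counter(s)) == s
print(rank(s), multinomial(Counter(s)))
```

The index is always below multinomial(counts), so transmitting it costs at most log (k!/(k_1!...k_s!)) bits, as claimed.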

Now using Stirling's approximation (i.e. k! ∼ (2πk)^(1/2) (k/e)^k, so that log k! ∼ k log k), we have

-log (k_1!k_2!...k_s!/k!) = -∑_i k_i log k_i + k log k = -∑_i k_i log (k_i/k) (since k = ∑_i k_i),

and hence log (k!/(k_1!k_2!...k_s!)) ∼ -k ∑_i (k_i/k) log (k_i/k).
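
A quick numeric check of this approximation (a sketch; lgamma(n+1) computes ln n! exactly enough for our purposes):

```python
import math

def log2_multinomial(counts):
    """log2 of k!/(k_1!...k_s!), computed via lgamma(n + 1) = ln(n!)."""
    k = sum(counts)
    ln = math.lgamma(k + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln / math.log(2)

def k_times_entropy(counts):
    """-k * sum_i (k_i/k) * log2(k_i/k)."""
    k = sum(counts)
    return -sum(c * math.log2(c / k) for c in counts if c > 0)

for counts in [(500, 500), (900, 100), (250, 250, 500)]:
    print(counts, round(log2_multinomial(counts), 1),
          round(k_times_entropy(counts), 1))
```

The two columns agree up to an O(log k) term, which is exactly what Stirling's approximation discards.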

-∑_i p_i log p_i and prefix-free codes

The formula -∑_i p_i log p_i, where p_i = k_i/k, also has meaning for encoding/decoding schemes that encode the string letter by letter (instead of using the indexing method described above).

These schemes encode each letter a_i as a binary codeword and then transmit the string as the concatenation of these codewords. Since we want the codewords to be self-delimiting (so that we do not waste a letter as a delimiter), we require the set of codewords to be prefix-free.

We can show that such prefix-free codes correspond to sets of leaf nodes in a binary tree, each codeword being the path to a leaf. Furthermore, the path to an ancestor of any used leaf node cannot itself be a codeword, since it would be a prefix of the path to that leaf. In the case that every leaf node is used, we call the resultant code complete. It can be verified that for any complete code {s_1,s_2,...,s_n}, ∑_i 2^(-|s_i|) = 1. This gives us the intuition to prove that there exists a prefix-free code {s_1,s_2,...,s_n} with codeword lengths l_1,l_2,...,l_n if and only if ∑_i 2^(-l_i) ≤ 1 (Kraft's inequality).
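
The "if" direction of Kraft's inequality is constructive. A sketch (function names are mine): given lengths with ∑_i 2^(-l_i) ≤ 1, assign codewords greedily in order of increasing length, taking the next free node of the binary tree at each depth:

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact value of sum_i 2^(-l_i)."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def code_from_lengths(lengths):
    """Build a prefix-free code with the given codeword lengths,
    assuming Kraft's inequality holds."""
    assert kraft_sum(lengths) <= 1
    codes, next_node, depth = [], 0, 0
    for l in sorted(lengths):
        next_node <<= (l - depth)  # descend to depth l in the tree
        codes.append(format(next_node, "0{}b".format(l)))
        next_node += 1             # move to the next free node at this depth
        depth = l
    return codes

print(kraft_sum([1, 2, 3, 3]))          # 1, i.e. a complete code
print(code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```

Kraft's inequality is exactly what guarantees that the greedy assignment never runs out of free nodes at any depth.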


We now define the notion of an optimal prefix-free code. Let A = {a_1,...,a_n} be an alphabet where the probability for each letter a_i to appear is p_i (i.e. a_i occurs p_i·k times in a string of length k). A prefix-free code {s_1,...,s_n} is called optimal with respect to A just in case it has the least average length ∑_i p_i |s_i| among all prefix-free codes.

Let H(A) = -∑_i p_i log p_i. The Noiseless Coding Theorem states that the average length L = ∑_i p_i |s_i| of the optimal code {s_1,...,s_n} satisfies H(A) ≤ L ≤ H(A)+1.

Note that completeness does not guarantee that a prefix-free code achieves this optimality. However, the Huffman code, which is prefix-free, does.

Also, while Huffman coding achieves the shortest average length among all prefix-free codes, other methods, devised under other assumptions, may result in shorter lengths.
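
A compact Huffman construction in Python, to check the Noiseless Coding Theorem bound numerically (a sketch; the tuple-in-heap representation is one of many ways to do this):

```python
import heapq
from math import log2

def huffman(probs):
    """Build a Huffman code for a dict {letter: probability}."""
    # Heap entries: (probability, unique tie-breaker, {letter: codeword so far}).
    heap = [(p, i, {a: ""}) for i, (a, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # merge the two least likely subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {a: "0" + w for a, w in c1.items()}
        merged.update({a: "1" + w for a, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman(probs)
H = -sum(p * log2(p) for p in probs.values())
L = sum(p * len(code[a]) for a, p in probs.items())
print(code)
print(H, L)  # H <= L <= H + 1; with dyadic probabilities, L == H exactly
```

With dyadic probabilities the optimal code meets the entropy exactly; in general the +1 slack comes from rounding -log p_i up to integer codeword lengths.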

Kolmogorov Complexity

The plain Kolmogorov complexity of a string x computed by a function φ is

C_φ( x | y ) = min{ l(p) | φ(<p,y>) = x }, where

p is considered the input,
x is the output, and
y is an auxiliary input, which is not taken into account in the complexity. That is, y costs nothing.

l(x) stands for the length of x. x can be a number or a string. In the case that x is a number, l(x) = log x + O(1).

We fix a universal Turing machine Φ which, on input p, deconstructs p into two parts <i,j> and then runs T_i(j). Let

C( x | y ) = C_Φ( x | y )
C( x ) = C( x | ε )
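
The pairing <p,y> has to be such that p and y can be recovered unambiguously. A common trick (a sketch, not necessarily the book's exact convention) is to prefix p with its length, written with every bit doubled and terminated by '01':

```python
def self_delim(p):
    """p made self-delimiting: each bit of l(p) doubled, then '01', then p."""
    n = format(len(p), "b")
    return "".join(b + b for b in n) + "01" + p

def pair(p, y):
    """<p, y>: a concatenation from which p and y are uniquely recoverable."""
    return self_delim(p) + y

def unpair(s):
    i, n_bits = 0, ""
    while s[i] == s[i + 1]:   # the doubled bits of the length field
        n_bits, i = n_bits + s[i], i + 2
    n = int(n_bits, 2)        # the '01' terminator sits at s[i:i+2]
    return s[i + 2:i + 2 + n], s[i + 2 + n:]

p, y = "10110", "0011"
assert unpair(pair(p, y)) == (p, y)
print(pair(p, y))
```

The overhead of self_delim is 2l(l(p)) + 2 bits; these doubled length fields are exactly where the 2l(·) terms in the substring argument below come from.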

Invariance Theorem

The complexities assigned by any two universal Turing machines Φ_1 and Φ_2 can differ by at most an additive constant. This is due to the fact that Φ_2 can produce whatever Φ_1 produces by running an emulation of Φ_1 (via some machine T_i, say):

C_Φ2( x | y ) ≤ C_Φ1( x | y ) + c_i,

where c_i depends only on the index i of the emulating machine.

Non-subadditivity

The complexity of a pair of strings is not necessarily bounded by the sum of their complexities. What we can show is this: two strings u, v can be obtained if we have two programs p, q that generate them, plus a delimiter between p and q (which costs on the order of log(min(C(u),C(v))) bits):

C(<u,v>) ≤ C(u) + C(v) + O(log(min(C(u),C(v))))

Incompressibility Theorem

For each constant c we say a string x is c-incompressible if C(x) ≥ l(x) - c. Intuitively, c bounds the number of bits of "compression" achievable.

For each n there are 2^n strings of length n, but there are only ∑_{0≤i≤n-c-1} 2^i = 2^(n-c) - 1 descriptions of length at most n-c-1. (Note that the strings whose shortest descriptions have length ≥ n-c are exactly the c-incompressible ones.)

Hence for any n and c ≤ n, there are at least 2^n - 2^(n-c) + 1 c-incompressible strings of length n.

We generalize this observation to:
  For each fixed y, every finite set A of cardinality m has at least m(1 - 2^(-c)) + 1 elements x with C(x|y) > log m - c.
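
The counting argument is just pigeonhole arithmetic; a few lines of Python make the fractions visible:

```python
# Programs of length at most n-c-1 can describe at most 2^(n-c) - 1 of
# the 2^n strings of length n, so at least a 1 - 2^(-c) fraction of all
# strings of length n is c-incompressible, independently of n.
n = 20
for c in range(1, 6):
    descriptions = 2 ** (n - c) - 1
    incompressible = 2 ** n - descriptions
    print(c, incompressible, incompressible / 2 ** n)  # fraction >= 1 - 2^(-c)
```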

Substring Incompressibility

Substrings of incompressible strings have to be incompressible --- to a certain extent.

Let x = uvw. Suppose p is a short program for v.

Now uvw can be reconstructed from the specification q = l(p) p l(u) u w (with the two length fields given in self-delimiting form), by say some machine T.
Hence,

C(x) ≤ C_T(x) + O(1) ≤ l(q) + O(1).

On the other hand,

l(q) ≤ 2l(C(v)) + C(v) + 2l(n) + (n - l(v)),

where the four terms account for l(p), p, l(u), and uw respectively (the factors of 2 come from the self-delimiting form of the length fields).

Noting that l(C(v)) and l(n) are of the order of log n,

C(x) ≤ l(q) + O(1) ≤ C(v) + (n-l(v)) + 4 log n + O(1)

Suppose x is c-incompressible. That is, C(x) ≥ n - c. We have

n - c ≤ C(x) ≤ C(v) + (n - l(v)) + O(log n),

and hence C(v) ≥ l(v) - O(log n).

Hence v is incompressible up to an O(log n) term. This cannot be improved: an incompressible v is necessarily very irregular, but if x had only irregular substrings then x could only be of certain forms, which would itself make x easier to specify (and hence compressible). Hence it is inevitable that some substrings of x are compressible.
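
To make the specification q concrete, here is a sketch of a machine T's input format (using the doubled-bit self-delimiting encoding from earlier; run is a stand-in for executing the program p for v):

```python
def sd(x):
    """Self-delimiting bitstring: the bits of l(x) doubled, then '01', then x."""
    n = format(len(x), "b")
    return "".join(b + b for b in n) + "01" + x

def rd(s):
    """Read back one self-delimited field; return (field, rest)."""
    i, n_bits = 0, ""
    while s[i] == s[i + 1]:
        n_bits, i = n_bits + s[i], i + 2
    n = int(n_bits, 2)          # the '01' terminator sits at s[i:i+2]
    return s[i + 2:i + 2 + n], s[i + 2 + n:]

def q(p, u, w):
    """The specification q = l(p) p l(u) u w."""
    return sd(p) + sd(u) + w

def T(spec, run):
    """Reconstruct x = uvw from q, given an interpreter for p."""
    p, rest = rd(spec)
    u, w = rd(rest)
    return u + run(p) + w

# Toy check, with a 'program' for v that is v itself, run by the identity:
u, v, w = "01", "111111", "10"
spec = q(v, u, w)
assert T(spec, lambda prog: prog) == u + v + w
print(len(spec))  # about 2 l(l(p)) + l(p) + 2 l(l(u)) + l(u) + l(w) bits
```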

Shortest programs are incompressible

If C(x) = l(p_x), i.e. p_x is a shortest program for x, then p_x is incompressible. That is, there exists c such that for all x, C(p_x) ≥ l(p_x) - c.

This c is determined by the index of the machine which does its computation twice: it first computes an output from its input, and then runs that output as a program. Given any program q for p_x, this machine computes x directly from q, so C(x) ≤ C(p_x) + c; if p_x were compressible by more than c bits, this would contradict C(x) = l(p_x).

Complexity is non-monotonic on prefixes

For m < n it is possible that C(1^m) > C(1^n), notwithstanding that 1^m is a prefix of 1^n.

To see that, let n = 2^k for some k. Then C(1^n) ≤ log log n + O(1), since 1^n can be reconstructed from k = log n, which takes only l(k) ≈ log log n bits to specify. However, there are n strings 1^m with m ≤ n, so by counting there exists one with C(1^m) ≥ log n - O(1) (see the section on incompressibility).

One may try to overcome this problem by giving, for free, the length of the input string --- that is, consider instead the length-conditional complexity C( x | l(x) ). However, even this measure is not monotonic on prefixes. To see this, we use strings whose lengths give away all the information there is to know about them. Let s_n = n0^(n-l(n)), i.e. the binary representation of n followed by zeroes up to total length n. It is clear that for any n, s_n can be reconstructed completely from n, and hence C(s_n | n) ≤ c, where c is decided by the index of the machine that does the reconstruction of s_n from n. Now note that for any string x there exists a superstring of it that is s_n for some n (take n to be the number whose binary representation is x), while C(x | l(x)) can be arbitrarily large, and we are done.
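
A sketch of the reconstruction, i.e. the machine witnessing C(s_n | n) ≤ c:

```python
def s(n):
    """s_n: the binary representation of n, padded with zeroes to length n."""
    b = format(n, "b")
    return b + "0" * (n - len(b))

for n in [5, 12, 34]:
    print(n, s(n))   # e.g. 5 -> '10100'

# Any x beginning with a 1 is the binary representation of some n
# (namely n = int(x, 2)), and is therefore a prefix of s_n:
x = "110101"
n = int(x, 2)
assert s(n).startswith(x)
```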

Meager sets

For a set A and natural number n, let A^(≤n) = { x ∈ A | l(x) ≤ n }. We say A is meager just in case

lim_{n→∞} d(A^(≤n))/2^n = 0,

where d(·) denotes cardinality.

If A is meager and recursive, then for every constant c there are only finitely many x in A that are c-incompressible!

We can similarly show that elements of meager r.e. sets are highly compressible, as follows.

Let A be an r.e. set of pairs (x,y) (intuitively, think of y as the length of x; we keep this result as general as possible in order to introduce the randomness deficiency in the next section). If for each natural number y, A_y = { x | (x,y) ∈ A } is finite, then for some constant c depending only on A, for all x ∈ A_y we have (proof omitted)

C(x|y) ≤ l(d(A_y)) + c.

Now let y be the length n of the string x, and suppose d(A^(≤n)) ≤ p(n) for some polynomial p (such an A is certainly meager). By the result above, C(x|n) ≤ l(p(n)) + O(1) = O(log n). Moreover, for any x of length at most n it is clear that C(x) ≤ C(x|n) + 2l(n) + O(1). This gives us that any member x of length n of such a meager r.e. set has C(x) = O(log n).
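
A sketch of the index description behind this bound: if membership in A is decidable, sender and receiver can both enumerate A_n = A ∩ {0,1}^n in a canonical order, so x is fully described by its position in that enumeration:

```python
from itertools import product

def enumerate_A(is_member, n):
    """All strings of length n in the set, in lexicographic order."""
    return ["".join(b) for b in product("01", repeat=n) if is_member("".join(b))]

def is_member(x):
    """A meager recursive set: strings containing at most one '1'."""
    return x.count("1") <= 1

n = 10
A_n = enumerate_A(is_member, n)
x = "0001000000"
i = A_n.index(x)                          # the sender transmits only i
assert enumerate_A(is_member, n)[i] == x  # the receiver recovers x from (i, n)
print(len(A_n), i)  # d(A_n) = n + 1, so C(x|n) <= log(n+1) + O(1)
```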

Randomness deficiency

We continue from the fact that an element x of a set A has C(x|A) ≤ l(d(A)) + c, where c is independent of x but possibly dependent on A. The randomness deficiency of x relative to A is defined as δ(x|A) = l(d(A)) - C(x|A). That is, randomness deficiency measures the difference between the maximal complexity of a string in A and the complexity of x in A. For the case where A is the set of all binary strings of length n,

δ(x|n) = n - C(x|n) + O(1)

If δ(x|A) is large, this means that there is a description of x with the help of A that is considerably shorter than just giving x's "serial number" in A. Intuitively, a large δ(x|A) implies that x is not so random with respect to A. We may consider x to be random in the set A iff δ(x|A) = O(1).

Properties of the function C

If we consider C as a function mapping integers to integers, we can deduce the following properties for it.

Martin-Löf tests

To evaluate the degree of randomness of an element of a universal set V, it is intuitive to devise the following test: Let P be a recursive probability distribution. A Martin-Löf P-test is a total function δ, which determines the sets V_m = { x ∈ V | δ(x) ≥ m }. (It is clear that V_{m+1} is a subset of V_m.) The conditions that δ has to fulfill are:
  1. Each V_m is recursively enumerable.
  2. ∑{ P(x) | l(x) = n, δ(x) ≥ m } ≤ 2^(-m), for each n.
Note that δ(x) may be negative (necessitated by the universal Martin-Löf test to be discussed next). This is allowed, even though V_m for negative m is not involved in the test; it simply means that the randomness of any x with δ(x) < 0 cannot be assessed by δ.

An important case is when P corresponds to the uniform distribution, assigning probability P(x) = 2^(-n) to each string x of length n. In this case, the second condition for the Martin-Löf test becomes a bound on cardinalities:

  d({ x | l(x) = n, δ(x) ≥ m }) ≤ 2^(n-m), for each n.

A Martin-Löf test which fulfills this condition is called an L-test.
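
A concrete example of an L-test (a standard one; the particular example is mine, not from these notes): δ(x) = the number of leading zeros of x. Exactly 2^(n-m) strings of length n start with m zeros, so the cardinality condition holds with equality:

```python
from itertools import product

def delta(x):
    """Test statistic: length of the initial run of zeros in x."""
    return len(x) - len(x.lstrip("0"))

n = 10
for m in range(n + 1):
    count = sum(1 for bits in product("01", repeat=n)
                if delta("".join(bits)) >= m)
    assert count <= 2 ** (n - m)   # the L-test condition
print("delta is an L-test for n =", n)
```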

Universal Martin-Löf test

Fix a probability distribution P. A universal Martin-Löf test with respect to P (or a universal P-test) is a test δ_0(·|P) such that for each P-test δ, there is a constant c such that for all x, δ_0(x|P) ≥ δ(x) - c.

It turns out that for any P, all the P-tests can be effectively enumerated, and this allows the definition of a universal P-test:
Let δ_1, δ_2, ... be an enumeration of all P-tests; then δ_0(x|P) = max{ δ_y(x) - y | y ≥ 1 } is a universal P-test.

Interestingly, we can also show f(x) = l(x) - C(x|l(x)) - 1 to be a universal L-test.