These are notes I took from a reading of Li and Vitanyi's An Introduction to Kolmogorov Complexity and Its Applications.

Arriving at -∑_i p_i log p_i

We show an encoding for strings that leads to the expression -∑_i p_i log p_i, used as the information content, or entropy, in Shannon's Information Theory.

Given a fixed, arbitrary alphabet of s letters {a_1,...,a_s}, there exists a method to encode/transmit any string of length k over this alphabet in at most s log k + log (k!/(k_1!k_2!...k_s!)) bits, where k_i is the number of occurrences of a_i within the string. (Note: we will furthermore show that log (k!/(k_1!k_2!...k_s!)) ∼ -k ∑_i (k_i/k) log (k_i/k).)

This is the encoding/decoding scheme:

First the numbers k_1,...,k_s are communicated. This takes s log k bits. Now there are only

(k choose k_1)(k-k_1 choose k_2)...(k_s choose k_s) = k!/(k_1!k_2!...k_s!)

strings with exactly these letter counts. This allows us to uniquely specify any one of these strings using a number from 1 to k!/(k_1!k_2!...k_s!). Communicating this number takes at most log (k!/(k_1!k_2!...k_s!)) bits. (An interesting observation here is that the number of bits communicated using such a scheme is smaller when the numbers k_1,...,k_s differ more.)
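
As a concrete illustration of the indexing step, here is a minimal Python sketch (the helpers multinomial, rank and unrank are my names, not the book's) that computes the index of a string among all strings with the same letter counts, and inverts it:

```python
from math import factorial
from collections import Counter

def multinomial(counts):
    """k!/(k_1! k_2! ... k_s!): the number of strings with these counts."""
    k = sum(counts.values())
    result = factorial(k)
    for c in counts.values():
        result //= factorial(c)
    return result

def rank(s):
    """Index (0-based here) of s among all strings with the same letter
    counts, listed in lexicographic order."""
    counts, r = Counter(s), 0
    for ch in s:
        # Count the strings that continue with a strictly smaller letter here.
        for smaller in sorted(counts):
            if smaller >= ch:
                break
            if counts[smaller] > 0:
                counts[smaller] -= 1
                r += multinomial(counts)
                counts[smaller] += 1
        counts[ch] -= 1
    return r

def unrank(r, counts):
    """Inverse of rank, given the letter counts k_1,...,k_s."""
    counts, out = Counter(counts), []
    for _ in range(sum(counts.values())):
        for ch in sorted(counts):
            if counts[ch] == 0:
                continue
            counts[ch] -= 1
            block = multinomial(counts)
            if r < block:
                out.append(ch)
                break
            r -= block
            counts[ch] += 1
    return "".join(out)

s = "abracadabra"
assert unrank(rank(s), Counter(s)) == s
print(rank(s), multinomial(Counter(s)))
```

The index is always below multinomial(counts), so transmitting it costs at most log (k!/(k_1!...k_s!)) bits, as claimed.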

Now using Stirling's approximation (i.e. k! ∼ (2πk)^(1/2) (k/e)^k, so that log k! ∼ k log k), we have

-log (k_1!k_2!...k_s!/k!) = -∑_i k_i log k_i + k log k = -∑_i k_i log (k_i/k) (since k = ∑_i k_i),

and hence log (k!/(k_1!k_2!...k_s!)) ∼ -k ∑_i (k_i/k) log (k_i/k).
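
A quick numeric check of this approximation (a sketch; lgamma(n+1) computes ln n! exactly enough for our purposes):

```python
import math

def log2_multinomial(counts):
    """log2 of k!/(k_1!...k_s!), computed via lgamma(n + 1) = ln(n!)."""
    k = sum(counts)
    ln = math.lgamma(k + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln / math.log(2)

def k_times_entropy(counts):
    """-k * sum_i (k_i/k) * log2(k_i/k)."""
    k = sum(counts)
    return -sum(c * math.log2(c / k) for c in counts if c > 0)

for counts in [(500, 500), (900, 100), (250, 250, 500)]:
    print(counts, round(log2_multinomial(counts), 1),
          round(k_times_entropy(counts), 1))
```

The two columns agree up to an O(log k) term, which is exactly what Stirling's approximation discards.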

-∑_i p_i log p_i and prefix-free codes

The formula -∑_i p_i log p_i, where p_i = k_i/k, also has meaning for encoding/decoding schemes that encode the string letter by letter (instead of using the indexing method described above).

These schemes encode each letter a_i as a binary codeword and then transmit the string as the concatenation of these codewords. Since we want the codewords to be self-delimiting (so that we do not waste a letter as a delimiter), we require the set of codewords to be prefix-free.

We can show that such prefix-free codes correspond to sets of leaf nodes in a binary tree, each codeword being the path to a leaf. Furthermore, the path to an ancestor of any used leaf node cannot itself be a codeword, since it would be a prefix of the path to that leaf. In the case that every leaf node is used, we call the resultant code complete. It can be verified that for any complete code {s_1,s_2,...,s_n}, ∑_i 2^(-|s_i|) = 1. This gives us the intuition to prove that there exists a prefix-free code {s_1,s_2,...,s_n} with codeword lengths l_1,l_2,...,l_n if and only if ∑_i 2^(-l_i) ≤ 1 (Kraft's inequality).
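
The "if" direction of Kraft's inequality is constructive. A sketch (function names are mine): given lengths with ∑_i 2^(-l_i) ≤ 1, assign codewords greedily in order of increasing length, taking the next free node of the binary tree at each depth:

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact value of sum_i 2^(-l_i)."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def code_from_lengths(lengths):
    """Build a prefix-free code with the given codeword lengths,
    assuming Kraft's inequality holds."""
    assert kraft_sum(lengths) <= 1
    codes, next_node, depth = [], 0, 0
    for l in sorted(lengths):
        next_node <<= (l - depth)  # descend to depth l in the tree
        codes.append(format(next_node, "0{}b".format(l)))
        next_node += 1             # move to the next free node at this depth
        depth = l
    return codes

print(kraft_sum([1, 2, 3, 3]))          # 1, i.e. a complete code
print(code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```

Kraft's inequality is exactly what guarantees that the greedy assignment never runs out of free nodes at any depth.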


We now define the notion of an optimal prefix-free code. Let A = {a_1,...,a_n} be an alphabet where the probability for each letter a_i to appear is p_i (i.e. a_i occurs p_i·k times in a string of length k). A prefix-free code {s_1,...,s_n} is called optimal with respect to A just in case it has the least average length ∑_i p_i |s_i| among all prefix-free codes.

Let H(A) = -∑_i p_i log p_i. The Noiseless Coding Theorem states that the average length L = ∑_i p_i |s_i| of the optimal code {s_1,...,s_n} satisfies H(A) ≤ L ≤ H(A)+1.

Note that completeness does not guarantee that a prefix-free code achieves this optimality. However, the Huffman code, which is prefix-free, does.

Also, while Huffman coding achieves the shortest average length among all prefix-free codes, other methods, devised under other assumptions, may result in shorter lengths.
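
A compact Huffman construction in Python, to check the Noiseless Coding Theorem bound numerically (a sketch; the tuple-in-heap representation is one of many ways to do this):

```python
import heapq
from math import log2

def huffman(probs):
    """Build a Huffman code for a dict {letter: probability}."""
    # Heap entries: (probability, unique tie-breaker, {letter: codeword so far}).
    heap = [(p, i, {a: ""}) for i, (a, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # merge the two least likely subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {a: "0" + w for a, w in c1.items()}
        merged.update({a: "1" + w for a, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman(probs)
H = -sum(p * log2(p) for p in probs.values())
L = sum(p * len(code[a]) for a, p in probs.items())
print(code)
print(H, L)  # H <= L <= H + 1; with dyadic probabilities, L == H exactly
```

With dyadic probabilities the optimal code meets the entropy exactly; in general the +1 slack comes from rounding -log p_i up to integer codeword lengths.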

Kolmogorov Complexity

The plain Kolmogorov complexity of a string x computed by a function φ is

C_φ( x | y ) = min{ l(p) | φ(<p,y>) = x }, where

p is considered the input,
x is the output, and
y is an auxiliary input, which is not taken into account in the complexity. That is, y costs nothing.

l(x) stands for the length of x. x can be a number or a string. In the case that x is a number, l(x) = log x + O(1).

We fix a universal Turing machine Φ which, on input p, deconstructs p into two parts <i,j> and then runs T_i(j). Let

C( x | y ) = C_Φ( x | y )
C( x ) = C( x | ε )
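
The pairing <p,y> has to be such that p and y can be recovered unambiguously. A common trick (a sketch, not necessarily the book's exact convention) is to prefix p with its length, written with every bit doubled and terminated by '01':

```python
def self_delim(p):
    """p made self-delimiting: each bit of l(p) doubled, then '01', then p."""
    n = format(len(p), "b")
    return "".join(b + b for b in n) + "01" + p

def pair(p, y):
    """<p, y>: a concatenation from which p and y are uniquely recoverable."""
    return self_delim(p) + y

def unpair(s):
    i, n_bits = 0, ""
    while s[i] == s[i + 1]:   # the doubled bits of the length field
        n_bits, i = n_bits + s[i], i + 2
    n = int(n_bits, 2)        # the '01' terminator sits at s[i:i+2]
    return s[i + 2:i + 2 + n], s[i + 2 + n:]

p, y = "10110", "0011"
assert unpair(pair(p, y)) == (p, y)
print(pair(p, y))
```

The overhead of self_delim is 2l(l(p)) + 2 bits; these doubled length fields are exactly where the 2l(·) terms in the substring argument below come from.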

Invariance Theorem

The complexities assigned by any two universal Turing machines Φ_1 and Φ_2 can differ by at most an additive constant. This is due to the fact that Φ_2 can produce whatever Φ_1 produces by running an emulation of Φ_1 (via some machine T_i, say):

C_Φ2( x | y ) ≤ C_Φ1( x | y ) + c_i,

where c_i depends only on the index i of the emulating machine.

Non-subadditivity

The complexity of a pair of strings is not necessarily bounded by the sum of their complexities. What we can show is this: two strings u, v can be obtained if we have two programs p, q that generate them, plus a delimiter between p and q (which costs on the order of log(min(C(u),C(v))) bits):

C(<u,v>) ≤ C(u) + C(v) + O(log(min(C(u),C(v))))

Incompressibility Theorem

For each constant c we say a string x is c-incompressible if C(x) ≥ l(x) - c. Intuitively, c bounds the number of bits of "compression" achievable.

For each n there are 2^n strings of length n, but there are only ∑_{0≤i≤n-c-1} 2^i = 2^(n-c) - 1 descriptions of length at most n-c-1. (Note that the strings whose shortest descriptions have length ≥ n-c are exactly the c-incompressible ones.)

Hence for any n and c ≤ n, there are at least 2^n - 2^(n-c) + 1 c-incompressible strings of length n.

We generalize this observation to:
  For each fixed y, every finite set A of cardinality m has at least m(1 - 2^(-c)) + 1 elements x with C(x|y) > log m - c.
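
The counting argument is just pigeonhole arithmetic; a few lines of Python make the fractions visible:

```python
# Programs of length at most n-c-1 can describe at most 2^(n-c) - 1 of
# the 2^n strings of length n, so at least a 1 - 2^(-c) fraction of all
# strings of length n is c-incompressible, independently of n.
n = 20
for c in range(1, 6):
    descriptions = 2 ** (n - c) - 1
    incompressible = 2 ** n - descriptions
    print(c, incompressible, incompressible / 2 ** n)  # fraction >= 1 - 2^(-c)
```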

Substring Incompressibility

Substrings of incompressible strings have to be incompressible --- to a certain extent.

Let x = uvw. Suppose p is a short program for v.

Now uvw can be reconstructed from the specification q = l(p) p l(u) u w (with the two length fields given in self-delimiting form), by say some machine T.
Hence,

C(x) ≤ C_T(x) + O(1) ≤ l(q) + O(1).

On the other hand,

l(q) ≤ 2l(C(v)) + C(v) + 2l(n) + (n - l(v)),

where the four terms account for l(p), p, l(u), and uw respectively (the factors of 2 come from the self-delimiting form of the length fields).

Noting that l(C(v)) and l(n) are of the order of log n,

C(x) ≤ l(q) + O(1) ≤ C(v) + (n-l(v)) + 4 log n + O(1)

Suppose x is c-incompressible. That is, C(x) ≥ n - c. We have

n - c ≤ C(x) ≤ C(v) + (n - l(v)) + O(log n),

and hence C(v) ≥ l(v) - O(log n).

Hence v is incompressible up to an O(log n) term. This cannot be improved: an incompressible v is necessarily very irregular, but if x had only irregular substrings then x could only be of certain forms, which would itself make x easier to specify (and hence compressible). Hence it is inevitable that some substrings of x are compressible.
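
To make the specification q concrete, here is a sketch of a machine T's input format (using the doubled-bit self-delimiting encoding from earlier; run is a stand-in for executing the program p for v):

```python
def sd(x):
    """Self-delimiting bitstring: the bits of l(x) doubled, then '01', then x."""
    n = format(len(x), "b")
    return "".join(b + b for b in n) + "01" + x

def rd(s):
    """Read back one self-delimited field; return (field, rest)."""
    i, n_bits = 0, ""
    while s[i] == s[i + 1]:
        n_bits, i = n_bits + s[i], i + 2
    n = int(n_bits, 2)          # the '01' terminator sits at s[i:i+2]
    return s[i + 2:i + 2 + n], s[i + 2 + n:]

def q(p, u, w):
    """The specification q = l(p) p l(u) u w."""
    return sd(p) + sd(u) + w

def T(spec, run):
    """Reconstruct x = uvw from q, given an interpreter for p."""
    p, rest = rd(spec)
    u, w = rd(rest)
    return u + run(p) + w

# Toy check, with a 'program' for v that is v itself, run by the identity:
u, v, w = "01", "111111", "10"
spec = q(v, u, w)
assert T(spec, lambda prog: prog) == u + v + w
print(len(spec))  # about 2 l(l(p)) + l(p) + 2 l(l(u)) + l(u) + l(w) bits
```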

Shortest programs are incompressible

If C(x) = l(p_x), i.e. p_x is a shortest program for x, then p_x is incompressible. That is, there exists c such that for all x, C(p_x) ≥ l(p_x) - c.

This c is determined by the index of the machine which does its computation twice: it first computes an output from its input, and then runs that output as a program. Given any program q for p_x, this machine computes x directly from q, so C(x) ≤ C(p_x) + c; if p_x were compressible by more than c bits, this would contradict C(x) = l(p_x).

Complexity is non-monotonic on prefixes

For m < n it is possible that C(1^m) > C(1^n), notwithstanding that 1^m is a prefix of 1^n.

To see that, let n = 2^k for some k. Then C(1^n) ≤ log log n + O(1), since 1^n can be reconstructed from k = log n, which takes only l(k) ≈ log log n bits to specify. However, there are n strings 1^m with m ≤ n, so by counting there exists one with C(1^m) ≥ log n - O(1) (see the section on incompressibility).

One may try to overcome this problem by giving, for free, the length of the input string --- that is, consider instead the length-conditional complexity C( x | l(x) ). However, even this measure is not monotonic on prefixes. To see this, we use strings whose lengths give away all the information there is to know about them. Let s_n = n0^(n-l(n)), i.e. the binary representation of n followed by zeroes up to total length n. It is clear that for any n, s_n can be reconstructed completely from n, and hence C(s_n | n) ≤ c, where c is decided by the index of the machine that does the reconstruction of s_n from n. Now note that for any string x there exists a superstring of it that is s_n for some n (take n to be the number whose binary representation is x), while C(x | l(x)) can be arbitrarily large, and we are done.
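
A sketch of the reconstruction, i.e. the machine witnessing C(s_n | n) ≤ c:

```python
def s(n):
    """s_n: the binary representation of n, padded with zeroes to length n."""
    b = format(n, "b")
    return b + "0" * (n - len(b))

for n in [5, 12, 34]:
    print(n, s(n))   # e.g. 5 -> '10100'

# Any x beginning with a 1 is the binary representation of some n
# (namely n = int(x, 2)), and is therefore a prefix of s_n:
x = "110101"
n = int(x, 2)
assert s(n).startswith(x)
```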

Meager sets

For a set A and natural number n, let A^(≤n) = { x ∈ A | l(x) ≤ n }. We say A is meager just in case

lim_{n→∞} d(A^(≤n))/2^n = 0,

where d(·) denotes cardinality.

If A is meager and recursive, then for every constant c there are only finitely many x in A that are c-incompressible!

We can similarly show that elements of meager r.e. sets are highly compressible, as follows.

Let A be an r.e. set of pairs (x,y) (intuitively, think of y as the length of x; we keep this result as general as possible in order to introduce the randomness deficiency in the next section). If for each natural number y, A_y = { x | (x,y) ∈ A } is finite, then for some constant c depending only on A, for all x ∈ A_y we have (proof omitted)

C(x|y) ≤ l(d(A_y)) + c.

Now let y be the length n of the string x, and suppose d(A^(≤n)) ≤ p(n) for some polynomial p (such an A is certainly meager). By the result above, C(x|n) ≤ l(p(n)) + O(1) = O(log n). Moreover, for any x of length at most n it is clear that C(x) ≤ C(x|n) + 2l(n) + O(1). This gives us that any member x of length n of such a meager r.e. set has C(x) = O(log n).
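
A sketch of the index description behind this bound: if membership in A is decidable, sender and receiver can both enumerate A_n = A ∩ {0,1}^n in a canonical order, so x is fully described by its position in that enumeration:

```python
from itertools import product

def enumerate_A(is_member, n):
    """All strings of length n in the set, in lexicographic order."""
    return ["".join(b) for b in product("01", repeat=n) if is_member("".join(b))]

def is_member(x):
    """A meager recursive set: strings containing at most one '1'."""
    return x.count("1") <= 1

n = 10
A_n = enumerate_A(is_member, n)
x = "0001000000"
i = A_n.index(x)                          # the sender transmits only i
assert enumerate_A(is_member, n)[i] == x  # the receiver recovers x from (i, n)
print(len(A_n), i)  # d(A_n) = n + 1, so C(x|n) <= log(n+1) + O(1)
```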

Randomness deficiency

We continue from the fact that an element x of a set A has C(x|A) ≤ l(d(A)) + c, where c is independent of x but possibly dependent on A. The randomness deficiency of x relative to A is defined as δ(x|A) = l(d(A)) - C(x|A). That is, randomness deficiency measures the difference between the maximal complexity of a string in A and the complexity of x in A. For the case where A is the set of all binary strings of length n,

δ(x|n) = n - C(x|n) + O(1)

If δ(x|A) is large, this means that there is a description of x with the help of A that is considerably shorter than just giving x's "serial number" in A. Intuitively, a large δ(x|A) implies that x is not so random with respect to A. We may consider x to be random in the set A iff δ(x|A) = O(1).

Properties of the function C

If we consider C as a function mapping integers to integers, we can deduce the following properties for it.

Martin-Löf tests

To evaluate the degree of randomness of an element of a universal set V, it is intuitive to devise the following test: Let P be a recursive probability distribution. A Martin-Löf P-test is a total function δ, which determines the sets V_m = { x ∈ V | δ(x) ≥ m }. (It is clear that V_{m+1} is a subset of V_m.) The conditions that δ has to fulfill are:
  1. Each V_m is recursively enumerable.
  2. ∑{ P(x) | l(x) = n, δ(x) ≥ m } ≤ 2^(-m), for each n.
Note that δ(x) may be negative (necessitated by the universal Martin-Löf test to be discussed next). This is allowed, even though V_m for negative m is not involved in the test; it simply means that the randomness of any x with δ(x) < 0 cannot be assessed by δ.

An important case is when P corresponds to the uniform distribution, assigning probability P(x) = 2^(-n) to each string x of length n. In this case, the second condition for the Martin-Löf test becomes a bound on cardinalities:

  d({ x | l(x) = n, δ(x) ≥ m }) ≤ 2^(n-m), for each n.

A Martin-Löf test which fulfills this condition is called an L-test.
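
A concrete example of an L-test (a standard one; the particular example is mine, not from these notes): δ(x) = the number of leading zeros of x. Exactly 2^(n-m) strings of length n start with m zeros, so the cardinality condition holds with equality:

```python
from itertools import product

def delta(x):
    """Test statistic: length of the initial run of zeros in x."""
    return len(x) - len(x.lstrip("0"))

n = 10
for m in range(n + 1):
    count = sum(1 for bits in product("01", repeat=n)
                if delta("".join(bits)) >= m)
    assert count <= 2 ** (n - m)   # the L-test condition
print("delta is an L-test for n =", n)
```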

Universal Martin-Löf test

Fix a probability distribution P. A universal Martin-Löf test with respect to P (or a universal P-test) is a test δ_0(·|P) such that for each P-test δ, there is a constant c such that for all x, δ_0(x|P) ≥ δ(x) - c.

It turns out that for any P, all the P-tests can be effectively enumerated, and this allows the definition of a universal P-test:
Let δ_1, δ_2, ... be an enumeration of all P-tests; then δ_0(x|P) = max{ δ_y(x) - y | y ≥ 1 } is a universal P-test.

Interestingly, we can also show f(x) = l(x) - C(x|l(x)) - 1 to be a universal L-test.