Kraft–McMillan inequality

In coding theory, the Kraft–McMillan inequality gives a necessary and sufficient condition for the existence of a prefix code^[1] (in Kraft's version) or a uniquely decodable code (in McMillan's version) for a given set of codeword lengths. Its applications to prefix codes and trees often find use in computer science and information theory.

Kraft's inequality was published in Kraft (1949). However, Kraft's paper discusses only prefix codes, and attributes the analysis leading to the inequality to Raymond Redheffer. The result was independently discovered in McMillan (1956). McMillan proves the result for the general case of uniquely decodable codes, and attributes the version for prefix codes to a spoken observation in 1955 by Joseph Leo Doob.

Applications and intuitions

Kraft's inequality limits the lengths of codewords in a prefix code: if one takes the exponential $r^{-\ell}$ of each codeword length $\ell$, where $r$ is the size of the code alphabet, the resulting set of values must look like a probability mass function, that is, it must have total measure less than or equal to one. Kraft's inequality can be thought of in terms of a constrained budget to be spent on codewords, with shorter codewords being more expensive. Among the useful properties following from the inequality are the following statements:

If Kraft's inequality holds with strict inequality, the code has some redundancy.
If Kraft's inequality holds with equality, the code in question is a complete code.
If Kraft's inequality does not hold, the code is not uniquely decodable.
For every uniquely decodable code, there exists a prefix code with the same length distribution.

Formal statement

Let each source symbol from the alphabet

S=\{\,s_1,s_2,\ldots,s_n\,\}\,

be encoded into a uniquely decodable code over an alphabet of size $r$ with codeword lengths

\ell_1,\ell_2,\ldots,\ell_n.\,

Then

\sum _{i=1}^{n}r^{-\ell _{i}}\leqslant 1.

Conversely, for a given set of natural numbers $\ell_1,\ell_2,\ldots,\ell_n\,$ satisfying the above inequality, there exists a uniquely decodable code over an alphabet of size $r$ with those codeword lengths.
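As a minimal sketch of the statement above, the left-hand side of the inequality can be evaluated exactly with rational arithmetic. The function names `kraft_sum` and `satisfies_kraft` are illustrative choices, not standard library functions.

```python
from fractions import Fraction

def kraft_sum(lengths, r=2):
    """Return the sum of r**(-l) over the codeword lengths as an exact Fraction."""
    return sum((Fraction(1, r ** l) for l in lengths), Fraction(0))

def satisfies_kraft(lengths, r=2):
    """True if the lengths satisfy the Kraft-McMillan inequality for alphabet size r."""
    return kraft_sum(lengths, r) <= 1

print(kraft_sum([1, 2, 3, 3]))     # 1   (equality: a complete code)
print(kraft_sum([2, 3, 3, 3, 3]))  # 3/4 (strict inequality)
print(satisfies_kraft([1, 1, 2]))  # False (no uniquely decodable binary code)
```

Exact fractions are used instead of floating point so that the equality case is detected reliably.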

Example: binary trees

9, 14, 19, 67 and 76 are leaf nodes at depths of 3, 3, 3, 3 and 2, respectively.

Any binary tree can be viewed as defining a prefix code for the leaves of the tree. Kraft's inequality states that

\sum _{\ell \in {\text{leaves}}}2^{-{\text{depth}}(\ell )}\leqslant 1.

Here the sum is taken over the leaves of the tree, i.e. the nodes without any children. The depth is the distance to the root node. For the tree whose leaf depths are listed above, this sum is

{\frac {1}{4}}+4\left({\frac {1}{8}}\right)={\frac {3}{4}}\leqslant 1.
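The same computation can be sketched in code. The nested-tuple shape below is a hypothetical tree chosen only to reproduce the stated leaf depths (the figure itself is not available here). Internal nodes are tuples of one or two children, and anything else is a leaf.

```python
from fractions import Fraction

def leaf_depths(tree, depth=0):
    """Yield (leaf, depth) pairs; internal nodes are tuples of children."""
    if isinstance(tree, tuple):
        for child in tree:
            yield from leaf_depths(child, depth + 1)
    else:
        yield tree, depth

# Hypothetical shape: leaves 9, 14, 19, 67 at depth 3 and leaf 76 at depth 2.
tree = (((9, 14), (19,)), ((67,), 76))

depths = dict(leaf_depths(tree))
total = sum(Fraction(1, 2 ** d) for d in depths.values())

print(depths)  # {9: 3, 14: 3, 19: 3, 67: 3, 76: 2}
print(total)   # 3/4
```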

Proof

Proof for prefix codes

Example for binary tree. Red nodes represent a prefix tree. The method for calculating the number of descendant leaf nodes in the full tree is shown.

Suppose that $\ell _{1}\leqslant \ell _{2}\leqslant \cdots \leqslant \ell _{n}$ . Let $A$ be the full $r$ -ary tree of depth $\ell_n$ . Every word of length $\ell \leqslant \ell _{n}$ over an $r$ -ary alphabet corresponds to a node in this tree at depth $\ell$ . The $i$ th word in the prefix code corresponds to a node $v_{i}$ ; let $A_{i}$ be the set of all leaf nodes (i.e. of nodes at depth $\ell_n$ ) in the subtree of $A$ rooted at $v_{i}$ . That subtree being of height $\ell _{n}-\ell _{i}$ , we have

|A_i| = r^{\ell_n-\ell_i}.

Since the code is a prefix code, those subtrees cannot share any leaves, which means that

A_{i}\cap A_{j}=\varnothing ,\quad i\neq j.

Thus, given that the total number of nodes at depth $\ell_n$ is $r^{\ell_n}$ , we have

\left|\bigcup _{i=1}^{n}A_{i}\right|=\sum _{i=1}^{n}|A_{i}|=\sum _{i=1}^{n}r^{\ell _{n}-\ell _{i}}\leqslant r^{\ell _{n}}

from which the result follows.

Conversely, given any ordered sequence of $n$ natural numbers,

\ell _{1}\leqslant \ell _{2}\leqslant \cdots \leqslant \ell _{n}

satisfying the Kraft inequality, one can construct a prefix code with codeword lengths equal to each $\ell_i$ by choosing a word of length $\ell_i$ arbitrarily, then ruling out all words of greater length that have it as a prefix. There again, we shall interpret this in terms of leaf nodes of an $r$ -ary tree of depth $\ell_n$ . First choose any node from the full tree at depth $\ell _{1}$ ; it corresponds to the first word of our new code. Since we are building a prefix code, all the descendants of this node (i.e., all words that have this first word as a prefix) become unsuitable for inclusion in the code. We consider the descendants at depth $\ell_n$ ; there are $r^{\ell _{n}-\ell _{1}}$ such descendant nodes to be removed from consideration. The next iteration removes $r^{\ell _{n}-\ell _{2}}$ other nodes, and so on. After $n$ iterations, we have removed a total of

\sum _{i=1}^{n}r^{\ell _{n}-\ell _{i}}

nodes. The question is whether we need to remove more leaf nodes than we actually have available — $r^{\ell _{n}}$ in all — in the process of building the code. Since the Kraft inequality holds, we have indeed

\sum _{i=1}^{n}r^{\ell _{n}-\ell _{i}}\leqslant r^{\ell _{n}}

and thus a prefix code can be built. Note that as the choice of nodes at each step is largely arbitrary, many different suitable prefix codes can be built, in general.
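The construction can be sketched as follows. This is a brute-force illustration rather than an efficient algorithm, and it resolves the arbitrary choice at each step by taking the lexicographically first word that is still available. The name `build_prefix_code` and its `alphabet` parameter are assumptions made for this example.

```python
from itertools import product

def build_prefix_code(lengths, alphabet="01"):
    """Build a prefix code with the given codeword lengths, shortest first."""
    r, longest = len(alphabet), max(lengths)
    # Kraft inequality scaled by r**longest, as in the leaf-counting argument.
    if sum(r ** (longest - l) for l in lengths) > r ** longest:
        raise ValueError("lengths violate the Kraft inequality")
    code = []
    for l in sorted(lengths):
        for letters in product(alphabet, repeat=l):
            word = "".join(letters)
            # Earlier codewords are no longer than `word`, so it suffices to
            # check that none of them is a prefix of it.
            if not any(word.startswith(c) for c in code):
                code.append(word)
                break
    return code

print(build_prefix_code([2, 1, 3, 3]))    # ['0', '10', '110', '111']
print(build_prefix_code([1, 1, 2], "abc"))  # ['a', 'b', 'ca']
```

The counting argument above is what guarantees that the inner loop always finds an available word when the inequality holds.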

Proof of the general case

Consider the generating function in powers of $x^{-1}$ for the code $S$

F(x) = \sum_{i=1}^n x^{-|s_i|} = \sum_{\ell=\min}^\max p_\ell \, x^{-\ell}

in which $p_\ell$ —the coefficient in front of $x^{-\ell}$ —is the number of distinct codewords of length $\ell$ . Here min is the length of the shortest codeword in S, and max is the length of the longest codeword in S.

Consider all $m$-th powers $S^m$, in the form of words $s_{i_1}s_{i_2}\dots s_{i_m}$ , where $i_1, i_2, \dots, i_m$ are indices between 1 and $n$. Note that, since $S$ was assumed to be uniquely decodable, $s_{i_1}s_{i_2}\dots s_{i_m}=s_{j_1}s_{j_2}\dots s_{j_m}$ implies $i_1=j_1, i_2=j_2, \dots, i_m=j_m$ . Because of this property, one can compute the generating function $G(x)$ for $S^{m}$ from the generating function $F(x)$ as

{\begin{aligned}G(x)&=\left(F(x)\right)^{m}=\left(\sum _{i=1}^{n}x^{-|s_{i}|}\right)^{m}\\&=\sum _{i_{1}=1}^{n}\sum _{i_{2}=1}^{n}\cdots \sum _{i_{m}=1}^{n}x^{-\left(|s_{i_{1}}|+|s_{i_{2}}|+\cdots +|s_{i_{m}}|\right)}\\&=\sum _{i_{1}=1}^{n}\sum _{i_{2}=1}^{n}\cdots \sum _{i_{m}=1}^{n}x^{-|s_{i_{1}}s_{i_{2}}\cdots s_{i_{m}}|}=\sum _{\ell =m\cdot \min }^{m\cdot \max }q_{\ell }\,x^{-\ell }\;.\end{aligned}}

Here, as before, $q_\ell$, the coefficient in front of $x^{-\ell}$ in $G(x)$, is the number of words of length $\ell$ in $S^{m}$ . Since there are only $r^\ell$ distinct words of length $\ell$ over an alphabet of size $r$, $q_\ell$ cannot exceed $r^\ell$ . Hence for any positive $x$,

(F(x))^{m}\leq \sum _{\ell =m\cdot \min }^{m\cdot \max }r^{\ell }\,x^{-\ell }\;.

Substituting the value x = r we have

(F(r))^{m}\leq m\cdot (\max -\min )+1

for any positive integer $m$ . If $F(r)$ were greater than 1, the left side of the inequality would grow exponentially in $m$ while the right side grows only linearly, so the inequality would fail for sufficiently large $m$ . The only possibility for the inequality to be valid for all $m$ is that $F(r) \le 1$ . Looking back on the definition of $F(x)$ we finally get the inequality.

\sum_{i=1}^n r^{-\ell_i} = \sum_{i=1}^n r^{-|s_i|} = F(r) \le 1 \; .
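The two ingredients of this argument can be checked numerically on small cases. The sketch below uses hypothetical helper names: `power_length_counts` computes the coefficients $q_\ell$ for a concrete code, and `first_violation` searches for the smallest $m$ at which the bound $(F(r))^m \le m\cdot(\max-\min)+1$ fails for a given list of lengths.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def power_length_counts(code, m):
    """Count the distinct words of S^m by length (the q_l of the proof)."""
    words = {"".join(ws) for ws in product(code, repeat=m)}
    return Counter(len(w) for w in words)

def first_violation(lengths, r=2, m_max=1000):
    """Smallest m with F(r)**m > m*(max - min) + 1, or None if none up to m_max."""
    f = sum(Fraction(1, r ** l) for l in lengths)
    spread = max(lengths) - min(lengths)
    for m in range(1, m_max + 1):
        if f ** m > m * spread + 1:
            return m
    return None

# A uniquely decodable binary code: every q_l stays below 2**l.
q = power_length_counts(["0", "10", "11"], 3)
print(sorted(q.items()))                      # [(3, 1), (4, 6), (5, 12), (6, 8)]
print(all(q[l] <= 2 ** l for l in q))         # True

# Lengths with F(2) = 5/4 > 1: the bound eventually fails.
print(first_violation([1, 1, 2]))             # 12
# Lengths with F(2) = 1: the bound holds for every m tested.
print(first_violation([1, 2, 2]))             # None
```

The search limit `m_max` is an arbitrary cutoff for the illustration; the proof itself covers all $m$.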

Alternative construction for the converse

Given a sequence of $n$ natural numbers,

\ell _{1}\leqslant \ell _{2}\leqslant \cdots \leqslant \ell _{n}

satisfying the Kraft inequality, we can construct a prefix code as follows. Define the $i$th codeword, $C_i$, to be the first $\ell_i$ digits after the radix point (e.g. decimal point) in the base $r$ representation of

\sum_{j = 1}^{i - 1} r^{-\ell_j}.

Note that by Kraft's inequality, this sum is strictly less than 1, and since $\ell_j \leqslant \ell_i$ for $j < i$, it is an integer multiple of $r^{-\ell_i}$. Hence the codewords capture the entire value of the sum. Therefore, for $j > i$, the first $\ell_i$ digits of $C_j$ form a larger number than $C_i$, so the code is prefix free.
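A minimal sketch of this construction, using exact fractions for the partial sums, is given below. The function name `radix_prefix_code` and the `digits` parameter are assumptions for illustration. With the default digit string the sketch supports alphabet sizes up to 10.

```python
from fractions import Fraction

def radix_prefix_code(lengths, r=2, digits="0123456789"):
    """Codeword i is the first l_i base-r digits of the sum of r**(-l_j), j < i."""
    lengths = sorted(lengths)
    if sum(Fraction(1, r ** l) for l in lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    code, partial = [], Fraction(0)
    for l in lengths:
        # partial is a multiple of r**(-l) and less than 1, so partial * r**l
        # is an integer below r**l; its l base-r digits form the codeword.
        value = int(partial * r ** l)
        word = ""
        for _ in range(l):
            value, d = divmod(value, r)
            word = digits[d] + word
        code.append(word)
        partial += Fraction(1, r ** l)
    return code

print(radix_prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
print(radix_prefix_code([1, 1, 2], r=3)) # ['0', '1', '20']
```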

Notes

[1] Cover, Thomas M.; Thomas, Joy A. (2006), Elements of Information Theory (2nd ed.), John Wiley & Sons, Inc, pp. 108–109, doi:10.1002/047174882X.ch5, ISBN 0-471-24195-4

References

Kraft, Leon G. (1949), A device for quantizing, grouping, and coding amplitude modulated pulses, Cambridge, MA: MS Thesis, Electrical Engineering Department, Massachusetts Institute of Technology.

McMillan, Brockway (1956), "Two inequalities implied by unique decipherability", IEEE Trans. Information Theory, 2 (4): 115–116, doi:10.1109/TIT.1956.1056818.