Shannon's Source Coding Theorem: The Fundamental Limit of Compression
Understanding Shannon's source coding theorem, entropy as the absolute limit of lossless compression, optimal code lengths, and Huffman coding as the practical realization.
Terminology
| Term | Definition |
|---|---|
| Shannon entropy $H(X)$ | The average information content per symbol from a source: $H(X) = -\sum_{i} p_i \log_2 p_i$ bits |
| Source coding | The process of encoding data from a source into a compressed representation; also called data compression |
| Lossless compression | Compression where the original data can be perfectly reconstructed from the compressed form; no information is lost |
| Prefix-free code | A code where no codeword is a prefix of another, enabling unambiguous decoding without delimiters |
| Huffman coding | A greedy algorithm that constructs an optimal prefix-free code by building a binary tree from the least frequent symbols up |
| Kraft's inequality | For a prefix-free code with codeword lengths $l_1, \ldots, l_n$: $\sum 2^{-l_i} \leq 1$; constrains which code lengths are achievable |
| Optimal code | A prefix-free code that minimizes the expected codeword length $\sum p_i l_i$; Huffman coding achieves this |
| Arithmetic coding | A compression method that encodes an entire message as a single number in $[0, 1)$, achieving near-entropy compression |
| Asymptotic equipartition property (AEP) | For long sequences from a stationary source, almost all sequences have probability close to $2^{-nH}$, where $H$ is the entropy |
What & Why
In 1948, Claude Shannon published "A Mathematical Theory of Communication," founding information theory. His source coding theorem establishes the fundamental limit of lossless data compression: you cannot compress data below its entropy, and you can get arbitrarily close to it.
Entropy $H(X)$ measures the average "surprise" per symbol. A source that always outputs the same symbol has zero entropy (no surprise; its output compresses to essentially nothing). A source that outputs each of $n$ symbols with equal probability has entropy $\log_2 n$ bits per symbol (maximum surprise; no compression below $\log_2 n$ bits per symbol is possible).
Shannon's theorem says: for a source with entropy $H$ bits per symbol, any lossless compression scheme must use at least $H$ bits per symbol on average. Conversely, there exist codes that use at most $H + \epsilon$ bits per symbol for any $\epsilon > 0$ (by coding long blocks). Huffman coding achieves within 1 bit of entropy per symbol, and arithmetic coding gets arbitrarily close.
This theorem is the reason gzip, zstd, and every other compressor has a limit. It is the reason truly random data cannot be compressed. It is the mathematical foundation of every compression algorithm ever built.
How It Works
Entropy: The Measure of Information
For a discrete random variable $X$ with possible values $x_1, \ldots, x_n$ and probabilities $p_1, \ldots, p_n$, the entropy is

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
Properties of entropy:
- $H(X) \geq 0$, with equality iff $X$ is deterministic.
- $H(X) \leq \log_2 n$, with equality iff all outcomes are equally likely.
- $H(X)$ is measured in bits (when using $\log_2$).
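These properties are easy to check numerically. A minimal sketch in Python (the function name `shannon_entropy` is mine), using the equivalent form $H = \sum_i p_i \log_2(1/p_i)$ so zero-probability outcomes can simply be skipped:

```python
import math

def shannon_entropy(probs):
    """H(X) = sum_i p_i * log2(1/p_i), skipping zero-probability outcomes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([1.0]))        # deterministic source: 0.0 bits
print(shannon_entropy([0.25] * 4))   # uniform over 4 symbols: log2(4) = 2.0 bits
```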
The Source Coding Theorem
Let $X_1, X_2, \ldots$ be i.i.d. random variables from a source with entropy $H(X)$. For any lossless code, the expected length $L$ in bits per symbol satisfies

$$L \geq H(X)$$

And there exists a code achieving

$$L < H(X) + \frac{1}{n}$$

when encoding blocks of $n$ symbols at a time. As $n \to \infty$, the rate approaches $H(X)$.
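The block-coding effect can be observed directly. A sketch in Python (the helper `huffman_lengths` is my own heap-based construction): Huffman-code blocks of $k$ symbols from a skewed binary source and watch the per-symbol rate fall from 1 bit toward the entropy $H \approx 0.469$:

```python
import heapq
import math
from itertools import product

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code for the given distribution."""
    # Heap entries: (probability, tiebreak id, symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1          # symbols under the merge sink one level deeper
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

p = [0.9, 0.1]
H = sum(q * math.log2(1 / q) for q in p)
for k in (1, 2, 4):
    # Product distribution over blocks of k i.i.d. symbols.
    block = [math.prod(c) for c in product(p, repeat=k)]
    L = sum(q, l := 0) if False else sum(q * l for q, l in zip(block, huffman_lengths(block)))
    print(f"block size {k}: {L / k:.4f} bits/symbol (entropy {H:.4f})")
```

Each rate obeys the theorem's guarantee $H \leq L/k < H + 1/k$, so larger blocks squeeze out the rounding overhead.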
Huffman Coding: The Optimal Symbol Code
Huffman coding assigns shorter codewords to more frequent symbols. It builds a binary tree bottom-up:
- Create a leaf node for each symbol with its probability.
- Repeatedly merge the two nodes with the smallest probabilities.
- Assign 0 to the left branch and 1 to the right branch.
The resulting code is optimal among all prefix-free codes (minimum expected length). The expected length $L = \sum_i p_i l_i$ satisfies

$$H(X) \leq L < H(X) + 1$$
Why You Cannot Beat Entropy
The proof uses Kraft's inequality. For any prefix-free code with lengths $l_1, \ldots, l_n$:

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1$$
The expected code length is $L = \sum_i p_i l_i$. Using the method of Lagrange multipliers (or the log-sum inequality), the minimum of $L$ subject to Kraft's inequality is achieved when $l_i = -\log_2 p_i$, giving $L = H(X)$. Since code lengths must be integers, rounding each ideal length up to $l_i = \lceil -\log_2 p_i \rceil$ still satisfies Kraft's inequality and gives $L < H(X) + 1$.
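The rounding argument can be checked on a concrete distribution. A small Python sketch (the distribution is an arbitrary example of mine) computes the Shannon code lengths $\lceil -\log_2 p_i \rceil$ and verifies both Kraft's inequality and the $H \leq L < H + 1$ sandwich:

```python
import math

p = [0.4, 0.3, 0.2, 0.1]
# Shannon code: round each ideal length -log2(p_i) up to an integer.
lengths = [math.ceil(-math.log2(q)) for q in p]

kraft = sum(2.0 ** -l for l in lengths)     # must be <= 1 for a prefix-free code
H = sum(q * math.log2(1 / q) for q in p)
L = sum(q * l for q, l in zip(p, lengths))

print(lengths)  # [2, 2, 3, 4]
print(kraft)    # 0.6875, so these lengths are realizable as a prefix-free code
print(H, L)     # H <= L < H + 1
```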
Complexity Analysis
| Algorithm | Compression Ratio | Time Complexity |
|---|---|---|
| Huffman coding | $H \leq L < H + 1$ bits/symbol | $O(n \log n)$ to build tree |
| Arithmetic coding | $L \to H$ as message length $\to \infty$ | $O(1)$ amortized per symbol (with finite precision) |
| Shannon-Fano coding | $H \leq L < H + 1$ bits/symbol | $O(n \log n)$ |
| Block Huffman (blocks of $k$) | $H \leq L < H + 1/k$ bits/symbol | $O(|\Sigma|^k \log |\Sigma|^k)$ to build tree |
Example: a source with $P(A) = 0.5$, $P(B) = 0.25$, $P(C) = 0.125$, $P(D) = 0.125$:

Huffman code: A = 0, B = 10, C = 110, D = 111. Expected length: $0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75$ bits. This matches the entropy exactly because the probabilities are all powers of 2.
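This worked example can be verified in a few lines of Python: with dyadic probabilities, the entropy and the expected Huffman code length coincide exactly at 1.75 bits.

```python
import math

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

H = sum(q * math.log2(1 / q) for q in p.values())   # entropy of the source
L = sum(p[s] * len(code[s]) for s in p)             # expected codeword length
print(H, L)  # both exactly 1.75: each l_i equals -log2(p_i) with no rounding
```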
Implementation
ALGORITHM ShannonEntropy(probabilities)
INPUT: probabilities: list of symbol probabilities (must sum to 1)
OUTPUT: entropy in bits
BEGIN
H <- 0
FOR EACH p IN probabilities DO
IF p > 0 THEN
H <- H - p * LOG2(p)
END IF
END FOR
RETURN H
END
ALGORITHM HuffmanBuildTree(symbols, probabilities)
INPUT: symbols: list of symbols, probabilities: corresponding probabilities
OUTPUT: root of Huffman tree
BEGIN
// Create leaf nodes
heap <- MIN_PRIORITY_QUEUE
FOR i FROM 0 TO LENGTH(symbols) - 1 DO
node <- CREATE_LEAF(symbols[i], probabilities[i])
INSERT node INTO heap WITH PRIORITY probabilities[i]
END FOR
// Build tree bottom-up
WHILE SIZE(heap) > 1 DO
left <- EXTRACT_MIN(heap)
right <- EXTRACT_MIN(heap)
parent <- CREATE_INTERNAL_NODE(left, right)
parent.probability <- left.probability + right.probability
INSERT parent INTO heap WITH PRIORITY parent.probability
END WHILE
RETURN EXTRACT_MIN(heap) // root of the tree
END
ALGORITHM HuffmanEncode(message, codeTable)
INPUT: message: sequence of symbols
codeTable: map from symbol to binary codeword
OUTPUT: compressed binary string
BEGIN
encoded <- empty bit string
FOR EACH symbol IN message DO
APPEND codeTable[symbol] TO encoded
END FOR
RETURN encoded
END
ALGORITHM HuffmanDecode(bitString, tree)
INPUT: bitString: compressed binary string
tree: Huffman tree root
OUTPUT: decoded message
BEGIN
decoded <- empty list
node <- tree
FOR EACH bit IN bitString DO
IF bit = 0 THEN
node <- node.left
ELSE
node <- node.right
END IF
IF node IS LEAF THEN
APPEND node.symbol TO decoded
node <- tree // restart from root
END IF
END FOR
RETURN decoded
END
ALGORITHM VerifySourceCodingTheorem(source, codeTable)
INPUT: source: probability distribution over symbols
codeTable: map from symbol to codeword
OUTPUT: comparison of expected length vs entropy
BEGIN
H <- ShannonEntropy(source.probabilities)
expectedLength <- 0
FOR EACH (symbol, prob) IN source DO
codeLength <- LENGTH(codeTable[symbol])
expectedLength <- expectedLength + prob * codeLength
END FOR
ASSERT expectedLength >= H // source coding theorem lower bound
ASSERT expectedLength < H + 1 // Huffman guarantee
RETURN (entropy: H, expectedLength: expectedLength, gap: expectedLength - H)
END
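The pseudocode above translates almost line-for-line into Python. A sketch (class and function names are mine) that builds the tree with a min-heap, derives the code table, and round-trips a message; the tiebreak counter in the heap tuples only prevents Python from comparing `Node` objects when probabilities tie:

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, prob, symbol=None, left=None, right=None):
        self.prob, self.symbol, self.left, self.right = prob, symbol, left, right

def build_tree(freqs):
    """HuffmanBuildTree: repeatedly merge the two least probable nodes."""
    heap = [(p, i, Node(p, symbol=s)) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, count, Node(p1 + p2, left=left, right=right)))
        count += 1
    return heap[0][2]

def code_table(node, prefix="", table=None):
    """Walk the tree: 0 on the left branch, 1 on the right."""
    if table is None:
        table = {}
    if node.symbol is not None:          # leaf: record the accumulated path
        table[node.symbol] = prefix
        return table
    code_table(node.left, prefix + "0", table)
    code_table(node.right, prefix + "1", table)
    return table

def encode(message, table):
    return "".join(table[s] for s in message)

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.symbol is not None:      # reached a leaf: emit and restart
            out.append(node.symbol)
            node = root
    return "".join(out)

message = "ABRACADABRA"
freqs = {s: c / len(message) for s, c in Counter(message).items()}
root = build_tree(freqs)
table = code_table(root)
bits = encode(message, table)
assert decode(bits, root) == message     # lossless roundtrip
print(table, len(bits), "bits vs", 8 * len(message), "raw")
```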
Real-World Applications
- Every lossless compression algorithm (gzip, zstd, brotli, lz4) is bounded by Shannon's theorem: they can approach entropy but never beat it, and their effectiveness depends on how far the source's entropy is below the raw bit rate
- Huffman coding is used in JPEG image compression (for encoding quantized DCT coefficients), DEFLATE (the algorithm behind gzip and PNG), and MP3 audio compression
- Arithmetic coding, which gets closer to entropy than Huffman, is used in H.264/H.265 video codecs (CABAC) and JPEG 2000; modern compressors such as zstd use asymmetric numeral systems (ANS), a related technique with similar near-entropy performance
- Shannon's theorem explains why encrypted data and truly random data cannot be compressed: their entropy is maximal (1 bit per bit), so no compression is possible
- In communication systems, the source coding theorem determines the minimum bandwidth needed to transmit data from a source, informing the design of modems, satellite links, and streaming protocols
- The theorem connects to Kolmogorov complexity: Shannon entropy is the expected Kolmogorov complexity for strings from a probabilistic source, bridging the probabilistic and individual views of information
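The incompressibility of random data is easy to check empirically. A quick sketch using Python's standard `zlib` (a DEFLATE implementation, so Huffman-based): compressing random bytes gains nothing, while an equally long but highly structured input collapses to a few hundred bytes.

```python
import os
import zlib

random_bytes = os.urandom(100_000)     # maximal entropy: ~8 bits per byte
structured = b"ABABABAB" * 12_500      # same size, but very low entropy

print(len(zlib.compress(random_bytes)))  # roughly 100_000: essentially no gain
print(len(zlib.compress(structured)))    # a few hundred bytes at most
```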
Key Takeaways
- Shannon's source coding theorem establishes that entropy $H(X)$ is the fundamental lower bound on lossless compression: no code can achieve fewer than $H$ bits per symbol on average
- Entropy $H(X) = -\sum p_i \log_2 p_i$ measures the average information content per symbol; it is zero for deterministic sources and maximal ($\log_2 n$) for uniform distributions
- Huffman coding constructs an optimal prefix-free code with expected length between $H$ and $H + 1$ bits per symbol; it matches entropy exactly when probabilities are powers of 2
- Arithmetic coding and block coding can get arbitrarily close to entropy by encoding longer sequences, approaching the theoretical limit as block length increases
- The theorem explains why random and encrypted data are incompressible (entropy equals the raw bit rate) and why structured data compresses well (entropy is much lower)
- Shannon's 1948 paper founded information theory and established the mathematical framework that underlies all modern data compression, communication, and coding theory