
Shannon's Source Coding Theorem: The Fundamental Limit of Compression

Understanding Shannon's source coding theorem, entropy as the absolute limit of lossless compression, optimal code lengths, and Huffman coding as the practical realization.

2025-07-28

Terminology

Term Definition
Shannon entropy $H(X)$ The average information content per symbol from a source: $H(X) = -\sum_{i} p_i \log_2 p_i$ bits
Source coding The process of encoding data from a source into a compressed representation; also called data compression
Lossless compression Compression where the original data can be perfectly reconstructed from the compressed form; no information is lost
Prefix-free code A code where no codeword is a prefix of another, enabling unambiguous decoding without delimiters
Huffman coding A greedy algorithm that constructs an optimal prefix-free code by building a binary tree from the least frequent symbols up
Kraft's inequality For a prefix-free code with codeword lengths $l_1, \ldots, l_n$: $\sum 2^{-l_i} \leq 1$; constrains which code lengths are achievable
Optimal code A prefix-free code that minimizes the expected codeword length $\sum p_i l_i$; Huffman coding achieves this
Arithmetic coding A compression method that encodes an entire message as a single number in $[0, 1)$, achieving near-entropy compression
Asymptotic equipartition property (AEP) For long sequences from a stationary source, almost all sequences have probability close to $2^{-nH}$, where $H$ is the entropy

What & Why

In 1948, Claude Shannon published "A Mathematical Theory of Communication," founding information theory. His source coding theorem establishes the fundamental limit of lossless data compression: you cannot compress data below its entropy, and you can get arbitrarily close to it.

Entropy $H(X)$ measures the average "surprise" per symbol. A source that always outputs the same symbol has zero entropy (no surprise; its output can be compressed to zero bits). A source that outputs each of $n$ symbols with equal probability has entropy $\log_2 n$ bits per symbol (maximum surprise; no compression below $\log_2 n$ bits per symbol is possible).

Shannon's theorem says: for a source with entropy $H$ bits per symbol, any lossless compression scheme must use at least $H$ bits per symbol on average. Conversely, there exist codes that use at most $H + \epsilon$ bits per symbol for any $\epsilon > 0$ (by coding long blocks). Huffman coding achieves within 1 bit of entropy per symbol, and arithmetic coding gets arbitrarily close.

This theorem is the reason gzip, zstd, and every other compressor has a limit. It is the reason truly random data cannot be compressed. It is the mathematical foundation of every compression algorithm ever built.

How It Works

Entropy: The Measure of Information

For a discrete random variable $X$ with possible values $x_1, \ldots, x_n$ and probabilities $p_1, \ldots, p_n$:

$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$

Properties of entropy:

  • $H(X) \geq 0$, with equality iff $X$ is deterministic.
  • $H(X) \leq \log_2 n$, with equality iff all outcomes are equally likely.
  • $H(X)$ is measured in bits (when using $\log_2$).
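The definition and both extremes can be checked directly. A minimal sketch in Python (the biased-coin distribution is an illustrative choice, not from the text above):

```python
import math

def shannon_entropy(probs):
    """Entropy in bits: H = -sum(p * log2(p)), skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Deterministic source: zero entropy.
assert shannon_entropy([1.0]) == 0.0
# Uniform over 4 symbols: log2(4) = 2 bits, the maximum for n = 4.
assert abs(shannon_entropy([0.25] * 4) - 2.0) < 1e-9
# Biased coin P = (0.9, 0.1): roughly 0.469 bits, between the two extremes.
assert abs(shannon_entropy([0.9, 0.1]) - 0.469) < 1e-3
```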

The Source Coding Theorem

Let $X_1, X_2, \ldots$ be i.i.d. random variables from a source with entropy $H$. For any lossless code:

$\text{Expected bits per symbol} \geq H(X)$

And there exists a code achieving:

$\text{Expected bits per symbol} \leq H(X) + \frac{1}{n}$

when encoding blocks of $n$ symbols at a time. As $n \to \infty$, the rate approaches $H(X)$.
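Block coding helps because symbol-by-symbol codes must spend at least 1 bit per symbol, while blocks can share that overhead. A sketch, assuming an illustrative Bernoulli(0.9) source; the pair-code lengths below were derived by running Huffman's merges by hand on the pair probabilities:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h = entropy([0.9, 0.1])            # ~0.469 bits/symbol; the target rate

# One symbol at a time: any code needs a whole bit per symbol.
# Pairs of symbols (i.i.d., so probabilities multiply): 11, 10, 01, 00.
pair_probs = [0.81, 0.09, 0.09, 0.01]
pair_lengths = [1, 2, 3, 3]        # Huffman lengths for these pair probabilities

bits_per_symbol = sum(p * l for p, l in zip(pair_probs, pair_lengths)) / 2
# 1.0 (single symbols) -> 0.645 (pairs) -> approaches 0.469 as blocks grow.
assert h < bits_per_symbol < 1.0
```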

Figure: bits per symbol vs. block length $n$. Symbol-by-symbol Huffman stays within 1 bit of the entropy line $H(X)$; arithmetic coding on blocks approaches it; the region below entropy is impossible.

Huffman Coding: The Optimal Symbol Code

Huffman coding assigns shorter codewords to more frequent symbols. It builds a binary tree bottom-up:

  1. Create a leaf node for each symbol with its probability.
  2. Repeatedly merge the two nodes with the smallest probabilities.
  3. Assign 0 to the left branch and 1 to the right branch.

The resulting code is optimal among all prefix-free codes (minimum expected length). The expected length $L$ satisfies:

$H(X) \leq L < H(X) + 1$
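The bound is tight on both sides. A small check in Python, assuming an illustrative non-dyadic distribution; the codeword lengths below come from tracing Huffman's merges by hand ($0.1+0.2 \to 0.3$, then $0.3+0.3 \to 0.6$, then $0.4+0.6 \to 1.0$):

```python
import math

# Non-dyadic probabilities: Huffman cannot hit entropy exactly.
probs = [0.4, 0.3, 0.2, 0.1]
lengths = [1, 2, 3, 3]   # Huffman codeword lengths for these probabilities

H = -sum(p * math.log2(p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))  # expected codeword length

# L = 1.9 vs H ~ 1.846: within 1 bit of entropy, but strictly above it.
assert H <= L < H + 1
```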

Why You Cannot Beat Entropy

The proof uses Kraft's inequality. For any prefix-free code with lengths $l_1, \ldots, l_n$:

$\sum_{i=1}^{n} 2^{-l_i} \leq 1$

The expected code length is $L = \sum p_i l_i$. Using the method of Lagrange multipliers (or the log-sum inequality), the minimum of $L$ subject to Kraft's inequality is achieved when $l_i = -\log_2 p_i$, giving $L = H(X)$. Since code lengths must be integers, the actual minimum expected length is at most $H(X) + 1$.
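Kraft's inequality is easy to test numerically. A minimal sketch; the length multisets are illustrative choices:

```python
def kraft_sum(lengths):
    """Sum of 2^(-l_i); at most 1 for any prefix-free code (Kraft's inequality)."""
    return sum(2 ** -l for l in lengths)

# Lengths 1, 2, 3, 3 are achievable (e.g. codewords 0, 10, 110, 111): sum is exactly 1.
assert kraft_sum([1, 2, 3, 3]) == 1.0
# Lengths 1, 1, 2 are not: 0.5 + 0.5 + 0.25 > 1, so no prefix-free code has them.
assert kraft_sum([1, 1, 2]) > 1
```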

Complexity Analysis

Algorithm Compression Ratio Time Complexity
Huffman coding $H \leq L < H + 1$ bits/symbol $O(n \log n)$ to build tree
Arithmetic coding $L \to H$ as message length $\to \infty$ $O(n)$ for an $n$-symbol message (with finite precision)
Shannon-Fano coding $H \leq L < H + 1$ bits/symbol $O(n \log n)$
Block Huffman (blocks of $k$) $H \leq L < H + 1/k$ bits/symbol $O(|\Sigma|^k \log |\Sigma|^k)$ to build tree

Example: a source with $P(A) = 0.5$, $P(B) = 0.25$, $P(C) = 0.125$, $P(D) = 0.125$:

$H = -(0.5 \log_2 0.5 + 0.25 \log_2 0.25 + 0.125 \log_2 0.125 + 0.125 \log_2 0.125) = 1.75 \text{ bits}$

Huffman code: A = 0, B = 10, C = 110, D = 111. Expected length: $0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75$ bits. This matches entropy exactly because every probability is a power of $1/2$ (a dyadic distribution), so the ideal lengths $-\log_2 p_i$ are already integers.
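The arithmetic above can be verified in a few lines of Python, using the probabilities and codewords from the example:

```python
import math

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}

entropy = -sum(p * math.log2(p) for p in probs.values())
expected_len = sum(probs[s] * len(code[s]) for s in probs)

# Dyadic probabilities: the expected code length equals the entropy exactly.
assert entropy == 1.75 and expected_len == 1.75
```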

Implementation

ALGORITHM ShannonEntropy(probabilities)
INPUT: probabilities: list of symbol probabilities (must sum to 1)
OUTPUT: entropy in bits
BEGIN
  H <- 0
  FOR EACH p IN probabilities DO
    IF p > 0 THEN
      H <- H - p * LOG2(p)
    END IF
  END FOR
  RETURN H
END

ALGORITHM HuffmanBuildTree(symbols, probabilities)
INPUT: symbols: list of symbols, probabilities: corresponding probabilities
OUTPUT: root of Huffman tree
BEGIN
  // Create leaf nodes
  heap <- MIN_PRIORITY_QUEUE
  FOR i FROM 0 TO LENGTH(symbols) - 1 DO
    node <- CREATE_LEAF(symbols[i], probabilities[i])
    INSERT node INTO heap WITH PRIORITY probabilities[i]
  END FOR

  // Build tree bottom-up
  WHILE SIZE(heap) > 1 DO
    left <- EXTRACT_MIN(heap)
    right <- EXTRACT_MIN(heap)
    parent <- CREATE_INTERNAL_NODE(left, right)
    parent.probability <- left.probability + right.probability
    INSERT parent INTO heap WITH PRIORITY parent.probability
  END WHILE

  RETURN EXTRACT_MIN(heap)  // root of the tree
END

ALGORITHM HuffmanEncode(message, codeTable)
INPUT: message: sequence of symbols
       codeTable: map from symbol to binary codeword
OUTPUT: compressed binary string
BEGIN
  encoded <- empty bit string
  FOR EACH symbol IN message DO
    APPEND codeTable[symbol] TO encoded
  END FOR
  RETURN encoded
END

ALGORITHM HuffmanDecode(bitString, tree)
INPUT: bitString: compressed binary string
       tree: Huffman tree root
OUTPUT: decoded message
BEGIN
  decoded <- empty list
  node <- tree

  FOR EACH bit IN bitString DO
    IF bit = 0 THEN
      node <- node.left
    ELSE
      node <- node.right
    END IF

    IF node IS LEAF THEN
      APPEND node.symbol TO decoded
      node <- tree  // restart from root
    END IF
  END FOR

  RETURN decoded
END

ALGORITHM VerifySourceCodingTheorem(source, codeTable)
INPUT: source: probability distribution over symbols
       codeTable: map from symbol to codeword
OUTPUT: comparison of expected length vs entropy
BEGIN
  H <- ShannonEntropy(source.probabilities)

  expectedLength <- 0
  FOR EACH (symbol, prob) IN source DO
    codeLength <- LENGTH(codeTable[symbol])
    expectedLength <- expectedLength + prob * codeLength
  END FOR

  ASSERT expectedLength >= H  // source coding theorem lower bound
  ASSERT expectedLength < H + 1  // Huffman guarantee

  RETURN (entropy: H, expectedLength: expectedLength, gap: expectedLength - H)
END
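The pseudocode above translates almost line for line into runnable Python. This sketch uses `heapq` as the min-priority queue, with an integer tie-breaker in each heap entry so that equal probabilities never force a comparison between subtree objects:

```python
import heapq

def huffman_code(probs):
    """Build a prefix-free code table {symbol: bitstring} from {symbol: probability}."""
    # Heap entries are (probability, tie_breaker, subtree).
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))
        counter += 1
    table = {}
    def walk(subtree, prefix):
        if isinstance(subtree, tuple):       # internal node: (left, right)
            walk(subtree[0], prefix + "0")   # 0 on the left branch
            walk(subtree[1], prefix + "1")   # 1 on the right branch
        else:
            table[subtree] = prefix or "0"   # single-symbol alphabet edge case
    walk(heap[0][2], "")
    return table

def encode(message, table):
    return "".join(table[s] for s in message)

def decode(bits, table):
    inverse = {code: sym for sym, code in table.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:                # prefix-freeness makes this unambiguous
            out.append(inverse[buffer])
            buffer = ""
    return "".join(out)

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
table = huffman_code(probs)
message = "ABACADAB"
assert decode(encode(message, table), table) == message
```

On this distribution the encoded `"ABACADAB"` takes 14 bits instead of the 16 a fixed 2-bit code would need, matching the 1.75 bits/symbol expected length from the worked example.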

Real-World Applications

  • Every lossless compression algorithm (gzip, zstd, brotli, lz4) is bounded by Shannon's theorem: they can approach entropy but never beat it, and their effectiveness depends on how far the source's entropy is below the raw bit rate
  • Huffman coding is used in JPEG image compression (for encoding quantized DCT coefficients), DEFLATE (the algorithm behind gzip and PNG), and MP3 audio compression
  • Arithmetic coding, which gets closer to entropy than Huffman, is used in H.264/H.265 video codecs (CABAC), JPEG 2000, and modern compression libraries like zstd and ANS
  • Shannon's theorem explains why encrypted data and truly random data cannot be compressed: their entropy is maximal (1 bit per bit), so no compression is possible
  • In communication systems, the source coding theorem determines the minimum bandwidth needed to transmit data from a source, informing the design of modems, satellite links, and streaming protocols
  • The theorem connects to Kolmogorov complexity: Shannon entropy is the expected Kolmogorov complexity for strings from a probabilistic source, bridging the probabilistic and individual views of information

Key Takeaways

  • Shannon's source coding theorem establishes that entropy $H(X)$ is the fundamental lower bound on lossless compression: no code can achieve fewer than $H$ bits per symbol on average
  • Entropy $H(X) = -\sum p_i \log_2 p_i$ measures the average information content per symbol; it is zero for deterministic sources and maximal ($\log_2 n$) for uniform distributions
  • Huffman coding constructs an optimal prefix-free code with expected length between $H$ and $H + 1$ bits per symbol; it matches entropy exactly when probabilities are powers of 2
  • Arithmetic coding and block coding can get arbitrarily close to entropy by encoding longer sequences, approaching the theoretical limit as block length increases
  • The theorem explains why random and encrypted data are incompressible (entropy equals the raw bit rate) and why structured data compresses well (entropy is much lower)
  • Shannon's 1948 paper founded information theory and established the mathematical framework that underlies all modern data compression, communication, and coding theory