Shannon's Source Coding Theorem: The Fundamental Limit of Compression
Understanding Shannon's source coding theorem, entropy as the absolute limit of lossless compression, optimal code lengths, and Huffman coding as the practical realization.
Terminology
| Term | Definition |
|---|---|
| Shannon entropy $H(X)$ | The average information content per symbol from a source: $H(X) = -\sum_{i} p_i \log_2 p_i$ bits |
| Source coding | The process of encoding data from a source into a compressed representation; also called data compression |
| Lossless compression | Compression where the original data can be perfectly reconstructed from the compressed form; no information is lost |
| Prefix-free code | A code where no codeword is a prefix of another, enabling unambiguous decoding without delimiters |
| Huffman coding | A greedy algorithm that constructs an optimal prefix-free code by building a binary tree from the least frequent symbols up |
| Kraft's inequality | For a prefix-free code with codeword lengths $l_1, \ldots, l_n$: $\sum 2^{-l_i} \leq 1$; constrains which code lengths are achievable |
| Optimal code | A prefix-free code that minimizes the expected codeword length $\sum p_i l_i$; Huffman coding achieves this |
| Arithmetic coding | A compression method that encodes an entire message as a single number in $[0, 1)$, achieving near-entropy compression |
| Asymptotic equipartition property (AEP) | For long sequences from a stationary source, almost all sequences have probability close to $2^{-nH}$, where $H$ is the entropy |
What & Why
In 1948, Claude Shannon published "A Mathematical Theory of Communication," founding information theory. His source coding theorem establishes the fundamental limit of lossless data compression: you cannot compress data below its entropy, and you can get arbitrarily close to it.
Entropy $H(X)$ measures the average "surprise" per symbol. A source that always outputs the same symbol has zero entropy (no surprise; it compresses to nothing). A source that outputs each of $n$ symbols with equal probability has entropy $\log_2 n$ bits per symbol (maximum surprise; no compression below $\log_2 n$ bits per symbol is possible).
Shannon's theorem says: for a source with entropy $H$ bits per symbol, any lossless compression scheme must use at least $H$ bits per symbol on average. Conversely, there exist codes that use at most $H + \epsilon$ bits per symbol for any $\epsilon > 0$ (by coding long blocks). Huffman coding achieves within 1 bit of entropy per symbol, and arithmetic coding gets arbitrarily close.
This theorem is the reason gzip, zstd, and every other compressor have a limit. It is the reason truly random data cannot be compressed. It is the mathematical foundation of every compression algorithm ever built.
How It Works
Entropy: The Measure of Information
For a discrete random variable $X$ with possible values $x_1, \ldots, x_n$ and probabilities $p_1, \ldots, p_n$:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
Properties of entropy:
- $H(X) \geq 0$, with equality iff $X$ is deterministic.
- $H(X) \leq \log_2 n$, with equality iff all outcomes are equally likely.
- $H(X)$ is measured in bits (when using $\log_2$).
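These properties are easy to verify numerically. A minimal Python sketch (the name `shannon_entropy` is my own):

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum p_i * log2(p_i) in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Deterministic source: H = 0 (no surprise).
assert shannon_entropy([1.0]) == 0.0
# Uniform over 8 symbols: H = log2(8) = 3 bits, the maximum for n = 8.
assert abs(shannon_entropy([1/8] * 8) - 3.0) < 1e-12
# Any non-uniform distribution over 8 symbols has strictly lower entropy.
assert shannon_entropy([0.5] + [0.5/7] * 7) < 3.0
```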
The Source Coding Theorem
Let $X_1, X_2, \ldots$ be i.i.d. random variables from a source with entropy $H(X)$. Any lossless code must satisfy

$$L \geq H(X)$$

where $L$ is the expected number of code bits per source symbol. Conversely, there exists a code achieving

$$L < H(X) + \frac{1}{n}$$

when encoding blocks of $n$ symbols at a time. As $n \to \infty$, the rate approaches $H(X)$.
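The achievability half can be watched happening numerically. The sketch below is my own: it builds an optimal prefix code (the Huffman construction, described in the next section) over blocks of $k$ flips of a biased coin and checks that the per-symbol rate falls toward $H$:

```python
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Codeword lengths of an optimal prefix-free code, built Huffman-style."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]  # (prob, tiebreak, leaves below)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:  # merging pushes every leaf in the subtree one level deeper
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

p = 0.9  # biased coin
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # about 0.469 bits/flip
rates = []
for k in (1, 2, 4, 8):
    # One super-symbol per length-k outcome, with product probabilities.
    block = [p ** sum(s) * (1 - p) ** (k - sum(s))
             for s in itertools.product((0, 1), repeat=k)]
    L = sum(q * l for q, l in zip(block, huffman_lengths(block)))
    rates.append(L / k)
    assert H <= L / k < H + 1 / k  # the theorem's per-symbol bounds
assert rates == sorted(rates, reverse=True)  # the rate improves as blocks grow
```

For $k = 1$ the best a symbol code can do is 1 bit per flip; by $k = 2$ the rate is already well under 0.7, and it keeps approaching $H$ as $k$ grows.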
Huffman Coding: The Optimal Symbol Code
Huffman coding assigns shorter codewords to more frequent symbols. It builds a binary tree bottom-up:
- Create a leaf node for each symbol with its probability.
- Repeatedly merge the two nodes with the smallest probabilities.
- Assign 0 to the left branch and 1 to the right branch.
The resulting code is optimal among all prefix-free codes (minimum expected length). The expected length $L$ satisfies

$$H(X) \leq L < H(X) + 1$$
Why You Cannot Beat Entropy
The proof uses Kraft's inequality. For any prefix-free code with lengths $l_1, \ldots, l_n$:

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1$$
The expected code length is $L = \sum p_i l_i$. Relaxing the integer constraint and using the method of Lagrange multipliers (or the log-sum inequality), the minimum of $L$ subject to Kraft's inequality is achieved at $l_i = -\log_2 p_i$, giving $L = H(X)$. Since real code lengths must be integers, rounding up to $l_i = \lceil -\log_2 p_i \rceil$ still satisfies Kraft's inequality and gives $L < H(X) + 1$.
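Both facts, Kraft feasibility and the entropy lower bound, can be checked directly for a small code (a minimal Python sketch; `kraft_sum` is my own helper):

```python
import math

def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality for a binary code."""
    return sum(2 ** -l for l in lengths)

# Lengths of the prefix-free code A=0, B=10, C=110, D=111:
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]
assert kraft_sum(lengths) <= 1  # achievable as a prefix-free code
H = -sum(p * math.log2(p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))
assert L >= H  # source coding theorem lower bound
# Shaving a bit off any codeword violates Kraft's inequality:
assert kraft_sum([1, 2, 3, 2]) > 1
```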
Complexity Analysis
| Algorithm | Compression Ratio | Time Complexity |
|---|---|---|
| Huffman coding | $H \leq L < H + 1$ bits/symbol | $O(n \log n)$ to build tree |
| Arithmetic coding | $L \to H$ as message length $\to \infty$ | $O(n)$ total for $n$ symbols (with finite precision) |
| Shannon-Fano coding | $H \leq L < H + 1$ bits/symbol | $O(n \log n)$ |
| Block Huffman (blocks of $k$) | $H \leq L < H + 1/k$ bits/symbol | $O(|\Sigma|^k \log |\Sigma|^k)$ to build tree |
Example: a source with $P(A) = 0.5, P(B) = 0.25, P(C) = 0.125, P(D) = 0.125$:
Huffman code: A = 0, B = 10, C = 110, D = 111. Expected length: $0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75$ bits. This matches the entropy exactly because each probability has the form $2^{-l_i}$, so the ideal lengths $-\log_2 p_i$ are already integers.
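A quick way to confirm the example is to run the Huffman construction on it. The sketch below (assuming Python; it merges codeword dictionaries instead of building explicit tree nodes) recovers the same codeword lengths:

```python
import heapq

def huffman_code(freqs):
    """Map symbol -> codeword for an optimal prefix-free code."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # left branch gets 0
        merged.update({s: "1" + w for s, w in c2.items()})  # right branch gets 1
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = huffman_code(probs)
assert {s: len(w) for s, w in code.items()} == {"A": 1, "B": 2, "C": 3, "D": 3}
# Expected length equals the entropy, 1.75 bits, as the example states.
L = sum(p * len(code[s]) for s, p in probs.items())
assert abs(L - 1.75) < 1e-12
```

The exact codewords can differ from A = 0, B = 10, C = 110, D = 111 depending on tie-breaking, but the lengths, and therefore the expected length, are the same.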
Implementation
ALGORITHM ShannonEntropy(probabilities)
INPUT: probabilities: list of symbol probabilities (must sum to 1)
OUTPUT: entropy in bits
BEGIN
    H <- 0
    FOR EACH p IN probabilities DO
        IF p > 0 THEN
            H <- H - p * LOG2(p)
        END IF
    END FOR
    RETURN H
END
ALGORITHM HuffmanBuildTree(symbols, probabilities)
INPUT: symbols: list of symbols, probabilities: corresponding probabilities
OUTPUT: root of Huffman tree
BEGIN
    // Create leaf nodes
    heap <- MIN_PRIORITY_QUEUE
    FOR i FROM 0 TO LENGTH(symbols) - 1 DO
        node <- CREATE_LEAF(symbols[i], probabilities[i])
        INSERT node INTO heap WITH PRIORITY probabilities[i]
    END FOR
    // Build tree bottom-up
    WHILE SIZE(heap) > 1 DO
        left <- EXTRACT_MIN(heap)
        right <- EXTRACT_MIN(heap)
        parent <- CREATE_INTERNAL_NODE(left, right)
        parent.probability <- left.probability + right.probability
        INSERT parent INTO heap WITH PRIORITY parent.probability
    END WHILE
    RETURN EXTRACT_MIN(heap) // root of the tree
END
ALGORITHM HuffmanEncode(message, codeTable)
INPUT: message: sequence of symbols
       codeTable: map from symbol to binary codeword
OUTPUT: compressed binary string
BEGIN
    encoded <- empty bit string
    FOR EACH symbol IN message DO
        APPEND codeTable[symbol] TO encoded
    END FOR
    RETURN encoded
END
ALGORITHM HuffmanDecode(bitString, tree)
INPUT: bitString: compressed binary string
       tree: Huffman tree root
OUTPUT: decoded message
BEGIN
    decoded <- empty list
    node <- tree
    FOR EACH bit IN bitString DO
        IF bit = 0 THEN
            node <- node.left
        ELSE
            node <- node.right
        END IF
        IF node IS LEAF THEN
            APPEND node.symbol TO decoded
            node <- tree // restart from root
        END IF
    END FOR
    RETURN decoded
END
ALGORITHM VerifySourceCodingTheorem(source, codeTable)
INPUT: source: probability distribution over symbols
       codeTable: map from symbol to codeword
OUTPUT: comparison of expected length vs entropy
BEGIN
    H <- ShannonEntropy(source.probabilities)
    expectedLength <- 0
    FOR EACH (symbol, prob) IN source DO
        codeLength <- LENGTH(codeTable[symbol])
        expectedLength <- expectedLength + prob * codeLength
    END FOR
    ASSERT expectedLength >= H // source coding theorem lower bound
    ASSERT expectedLength < H + 1 // Huffman guarantee
    RETURN (entropy: H, expectedLength: expectedLength, gap: expectedLength - H)
END
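Assuming Python, the pseudocode above can be collected into one runnable round-trip (the `Node` class and helper names are my own):

```python
import heapq

class Node:
    def __init__(self, prob, symbol=None, left=None, right=None):
        self.prob, self.symbol = prob, symbol
        self.left, self.right = left, right
    def __lt__(self, other):  # heapq needs an ordering between nodes
        return self.prob < other.prob

def build_tree(symbols, probs):
    """HuffmanBuildTree: repeatedly merge the two least probable nodes."""
    heap = [Node(p, s) for s, p in zip(symbols, probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        left, right = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(left.prob + right.prob, left=left, right=right))
    return heap[0]

def code_table(node, prefix=""):
    """Walk the tree: 0 on the left branch, 1 on the right branch."""
    if node.symbol is not None:  # leaf
        return {node.symbol: prefix or "0"}  # lone-symbol edge case
    table = code_table(node.left, prefix + "0")
    table.update(code_table(node.right, prefix + "1"))
    return table

def encode(message, table):
    return "".join(table[s] for s in message)

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.symbol is not None:  # reached a leaf: emit and restart from root
            out.append(node.symbol)
            node = root
    return "".join(out)

symbols, probs = ["A", "B", "C", "D"], [0.5, 0.25, 0.125, 0.125]
root = build_tree(symbols, probs)
table = code_table(root)
message = "ABACABAD"
bits = encode(message, table)
assert decode(bits, root) == message  # lossless round-trip
```

For this dyadic source the codeword lengths are always 1, 2, 3, 3 regardless of tie-breaking, so `"ABACABAD"` (four As, two Bs, one C, one D) always encodes to 14 bits: 1.75 bits per symbol, exactly the entropy.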
Real-World Applications
- Every lossless compression algorithm (gzip, zstd, brotli, lz4) is bounded by Shannon's theorem: they can approach entropy but never beat it, and their effectiveness depends on how far the source's entropy is below the raw bit rate
- Huffman coding is used in JPEG image compression (for encoding quantized DCT coefficients), DEFLATE (the algorithm behind gzip and PNG), and MP3 audio compression
- Arithmetic coding, which gets closer to entropy than Huffman, is used in H.264/H.265 video codecs (CABAC) and JPEG 2000; the closely related ANS (asymmetric numeral systems) family powers modern compressors such as zstd's FSE entropy stage
- Shannon's theorem explains why encrypted data and truly random data cannot be compressed: their entropy is maximal (1 bit per bit), so no compression is possible
- In communication systems, the source coding theorem determines the minimum bandwidth needed to transmit data from a source, informing the design of modems, satellite links, and streaming protocols
- The theorem connects to Kolmogorov complexity: Shannon entropy is the expected Kolmogorov complexity for strings from a probabilistic source, bridging the probabilistic and individual views of information
Key Takeaways
- Shannon's source coding theorem establishes that entropy $H(X)$ is the fundamental lower bound on lossless compression: no code can achieve fewer than $H$ bits per symbol on average
- Entropy $H(X) = -\sum p_i \log_2 p_i$ measures the average information content per symbol; it is zero for deterministic sources and maximal ($\log_2 n$) for uniform distributions
- Huffman coding constructs an optimal prefix-free code with expected length between $H$ and $H + 1$ bits per symbol; it matches entropy exactly when probabilities are powers of 2
- Arithmetic coding and block coding can get arbitrarily close to entropy by encoding longer sequences, approaching the theoretical limit as block length increases
- The theorem explains why random and encrypted data are incompressible (entropy equals the raw bit rate) and why structured data compresses well (entropy is much lower)
- Shannon's 1948 paper founded information theory and established the mathematical framework that underlies all modern data compression, communication, and coding theory