Shannon's Source Coding Theorem: The Fundamental Limit of Compression
Understanding Shannon's source coding theorem, entropy as the absolute limit of lossless compression, optimal code lengths, and Huffman coding as the practical realization.
Terminology
| Term | Definition |
|---|---|
| Shannon entropy $H(X)$ | The average information content per symbol from a source: $H(X) = -\sum_{i} p_i \log_2 p_i$ bits |
| Source coding | The process of encoding data from a source into a compressed representation; also called data compression |
| Lossless compression | Compression where the original data can be perfectly reconstructed from the compressed form; no information is lost |
| Prefix-free code | A code where no codeword is a prefix of another, enabling unambiguous decoding without delimiters |
| Huffman coding | A greedy algorithm that constructs an optimal prefix-free code by building a binary tree from the least frequent symbols up |
| Kraft's inequality | For a prefix-free code with codeword lengths $l_1, \ldots, l_n$: $\sum 2^{-l_i} \leq 1$; constrains which code lengths are achievable |
| Optimal code | A prefix-free code that minimizes the expected codeword length $\sum p_i l_i$; Huffman coding achieves this |
| Arithmetic coding | A compression method that encodes an entire message as a single number in $[0, 1)$, achieving near-entropy compression |
| Asymptotic equipartition property (AEP) | For long sequences from a stationary source, almost all sequences have probability close to $2^{-nH}$, where $H$ is the entropy |
What & Why
In 1948, Claude Shannon published "A Mathematical Theory of Communication," founding information theory. His source coding theorem establishes the fundamental limit of lossless data compression: you cannot compress data below its entropy, and you can get arbitrarily close to it.
Entropy $H(X)$ measures the average "surprise" per symbol. A source that always outputs the same symbol has zero entropy (no surprise; its output compresses to essentially nothing). A source that outputs each of $n$ symbols with equal probability has entropy $\log_2 n$ bits per symbol (maximum surprise; no compression below $\log_2 n$ bits per symbol is possible).
Shannon's theorem says: for a source with entropy $H$ bits per symbol, any lossless compression scheme must use at least $H$ bits per symbol on average. Conversely, there exist codes that use at most $H + \epsilon$ bits per symbol for any $\epsilon > 0$ (by coding long blocks). Huffman coding achieves within 1 bit of entropy per symbol, and arithmetic coding gets arbitrarily close.
This theorem is the reason gzip, zstd, and every other compressor has a limit. It is the reason truly random data cannot be compressed. It is the mathematical foundation of every compression algorithm ever built.
How It Works
Entropy: The Measure of Information
For a discrete random variable $X$ with possible values $x_1, \ldots, x_n$ and probabilities $p_1, \ldots, p_n$, the entropy is

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
Properties of entropy:
- $H(X) \geq 0$, with equality iff $X$ is deterministic.
- $H(X) \leq \log_2 n$, with equality iff all outcomes are equally likely.
- $H(X)$ is measured in bits (when using $\log_2$).
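These properties are easy to check numerically. A minimal sketch in Python (the function name `shannon_entropy` is mine), using the equivalent form $H = \sum_i p_i \log_2(1/p_i)$ so zero-probability outcomes can simply be skipped:

```python
import math

def shannon_entropy(probs):
    """H(X) = sum_i p_i * log2(1/p_i), skipping zero-probability outcomes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([1.0]))        # deterministic source: 0.0 bits
print(shannon_entropy([0.25] * 4))   # uniform over 4 symbols: log2(4) = 2.0 bits
```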
The Source Coding Theorem
Let $X_1, X_2, \ldots$ be i.i.d. random variables from a source with entropy $H(X)$. For any lossless code, the expected length $L$ in bits per symbol satisfies

$$L \geq H(X)$$

And there exists a code achieving

$$L < H(X) + \frac{1}{n}$$

when encoding blocks of $n$ symbols at a time. As $n \to \infty$, the rate approaches $H(X)$.
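The block-coding effect can be observed directly. A sketch in Python (the helper `huffman_lengths` is my own heap-based construction): Huffman-code blocks of $k$ symbols from a skewed binary source and watch the per-symbol rate fall from 1 bit toward the entropy $H \approx 0.469$:

```python
import heapq
import math
from itertools import product

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code for the given distribution."""
    # Heap entries: (probability, tiebreak id, symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1          # symbols under the merge sink one level deeper
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

p = [0.9, 0.1]
H = sum(q * math.log2(1 / q) for q in p)
for k in (1, 2, 4):
    # Product distribution over blocks of k i.i.d. symbols.
    block = [math.prod(c) for c in product(p, repeat=k)]
    L = sum(q, l := 0) if False else sum(q * l for q, l in zip(block, huffman_lengths(block)))
    print(f"block size {k}: {L / k:.4f} bits/symbol (entropy {H:.4f})")
```

Each rate obeys the theorem's guarantee $H \leq L/k < H + 1/k$, so larger blocks squeeze out the rounding overhead.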
Huffman Coding: The Optimal Symbol Code
Huffman coding assigns shorter codewords to more frequent symbols. It builds a binary tree bottom-up:
- Create a leaf node for each symbol with its probability.
- Repeatedly merge the two nodes with the smallest probabilities.
- Assign 0 to the left branch and 1 to the right branch.
The resulting code is optimal among all prefix-free codes (minimum expected length). The expected length $L = \sum_i p_i l_i$ satisfies

$$H(X) \leq L < H(X) + 1$$
Why You Cannot Beat Entropy
The proof uses Kraft's inequality. For any prefix-free code with lengths $l_1, \ldots, l_n$:

$$\sum_{i=1}^{n} 2^{-l_i} \leq 1$$
The expected code length is $L = \sum_i p_i l_i$. Using the method of Lagrange multipliers (or the log-sum inequality), the minimum of $L$ subject to Kraft's inequality is achieved when $l_i = -\log_2 p_i$, giving $L = H(X)$. Since code lengths must be integers, rounding each ideal length up to $l_i = \lceil -\log_2 p_i \rceil$ still satisfies Kraft's inequality and gives $L < H(X) + 1$.
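The rounding argument can be checked on a concrete distribution. A small Python sketch (the distribution is an arbitrary example of mine) computes the Shannon code lengths $\lceil -\log_2 p_i \rceil$ and verifies both Kraft's inequality and the $H \leq L < H + 1$ sandwich:

```python
import math

p = [0.4, 0.3, 0.2, 0.1]
# Shannon code: round each ideal length -log2(p_i) up to an integer.
lengths = [math.ceil(-math.log2(q)) for q in p]

kraft = sum(2.0 ** -l for l in lengths)     # must be <= 1 for a prefix-free code
H = sum(q * math.log2(1 / q) for q in p)
L = sum(q * l for q, l in zip(p, lengths))

print(lengths)  # [2, 2, 3, 4]
print(kraft)    # 0.6875, so these lengths are realizable as a prefix-free code
print(H, L)     # H <= L < H + 1
```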
Complexity Analysis
| Algorithm | Compression Ratio | Time Complexity |
|---|---|---|
| Huffman coding | $H \leq L < H + 1$ bits/symbol | $O(n \log n)$ to build tree |
| Arithmetic coding | $L \to H$ as message length $\to \infty$ | $O(1)$ amortized per symbol (with finite precision) |
| Shannon-Fano coding | $H \leq L < H + 1$ bits/symbol | $O(n \log n)$ |
| Block Huffman (blocks of $k$) | $H \leq L < H + 1/k$ bits/symbol | $O(|\Sigma|^k \log |\Sigma|^k)$ to build tree |
Example: a source with $P(A) = 0.5$, $P(B) = 0.25$, $P(C) = 0.125$, $P(D) = 0.125$:

Huffman code: A = 0, B = 10, C = 110, D = 111. Expected length: $0.5(1) + 0.25(2) + 0.125(3) + 0.125(3) = 1.75$ bits. This matches the entropy exactly because the probabilities are all powers of 2.
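This worked example can be verified in a few lines of Python: with dyadic probabilities, the entropy and the expected Huffman code length coincide exactly at 1.75 bits.

```python
import math

p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

H = sum(q * math.log2(1 / q) for q in p.values())   # entropy of the source
L = sum(p[s] * len(code[s]) for s in p)             # expected codeword length
print(H, L)  # both exactly 1.75: each l_i equals -log2(p_i) with no rounding
```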
Implementation
ALGORITHM ShannonEntropy(probabilities)
INPUT: probabilities: list of symbol probabilities (must sum to 1)
OUTPUT: entropy in bits
BEGIN
H <- 0
FOR EACH p IN probabilities DO
IF p > 0 THEN
H <- H - p * LOG2(p)
END IF
END FOR
RETURN H
END
ALGORITHM HuffmanBuildTree(symbols, probabilities)
INPUT: symbols: list of symbols, probabilities: corresponding probabilities
OUTPUT: root of Huffman tree
BEGIN
// Create leaf nodes
heap <- MIN_PRIORITY_QUEUE
FOR i FROM 0 TO LENGTH(symbols) - 1 DO
node <- CREATE_LEAF(symbols[i], probabilities[i])
INSERT node INTO heap WITH PRIORITY probabilities[i]
END FOR
// Build tree bottom-up
WHILE SIZE(heap) > 1 DO
left <- EXTRACT_MIN(heap)
right <- EXTRACT_MIN(heap)
parent <- CREATE_INTERNAL_NODE(left, right)
parent.probability <- left.probability + right.probability
INSERT parent INTO heap WITH PRIORITY parent.probability
END WHILE
RETURN EXTRACT_MIN(heap) // root of the tree
END
ALGORITHM HuffmanEncode(message, codeTable)
INPUT: message: sequence of symbols
codeTable: map from symbol to binary codeword
OUTPUT: compressed binary string
BEGIN
encoded <- empty bit string
FOR EACH symbol IN message DO
APPEND codeTable[symbol] TO encoded
END FOR
RETURN encoded
END
ALGORITHM HuffmanDecode(bitString, tree)
INPUT: bitString: compressed binary string
tree: Huffman tree root
OUTPUT: decoded message
BEGIN
decoded <- empty list
node <- tree
FOR EACH bit IN bitString DO
IF bit = 0 THEN
node <- node.left
ELSE
node <- node.right
END IF
IF node IS LEAF THEN
APPEND node.symbol TO decoded
node <- tree // restart from root
END IF
END FOR
RETURN decoded
END
ALGORITHM VerifySourceCodingTheorem(source, codeTable)
INPUT: source: probability distribution over symbols
codeTable: map from symbol to codeword
OUTPUT: comparison of expected length vs entropy
BEGIN
H <- ShannonEntropy(source.probabilities)
expectedLength <- 0
FOR EACH (symbol, prob) IN source DO
codeLength <- LENGTH(codeTable[symbol])
expectedLength <- expectedLength + prob * codeLength
END FOR
ASSERT expectedLength >= H // source coding theorem lower bound
ASSERT expectedLength < H + 1 // Huffman guarantee
RETURN (entropy: H, expectedLength: expectedLength, gap: expectedLength - H)
END
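The pseudocode above translates almost line-for-line into Python. A sketch (class and function names are mine) that builds the tree with a min-heap, derives the code table, and round-trips a message; the tiebreak counter in the heap tuples only prevents Python from comparing `Node` objects when probabilities tie:

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, prob, symbol=None, left=None, right=None):
        self.prob, self.symbol, self.left, self.right = prob, symbol, left, right

def build_tree(freqs):
    """HuffmanBuildTree: repeatedly merge the two least probable nodes."""
    heap = [(p, i, Node(p, symbol=s)) for i, (s, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, count, Node(p1 + p2, left=left, right=right)))
        count += 1
    return heap[0][2]

def code_table(node, prefix="", table=None):
    """Walk the tree: 0 on the left branch, 1 on the right."""
    if table is None:
        table = {}
    if node.symbol is not None:          # leaf: record the accumulated path
        table[node.symbol] = prefix
        return table
    code_table(node.left, prefix + "0", table)
    code_table(node.right, prefix + "1", table)
    return table

def encode(message, table):
    return "".join(table[s] for s in message)

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.symbol is not None:      # reached a leaf: emit and restart
            out.append(node.symbol)
            node = root
    return "".join(out)

message = "ABRACADABRA"
freqs = {s: c / len(message) for s, c in Counter(message).items()}
root = build_tree(freqs)
table = code_table(root)
bits = encode(message, table)
assert decode(bits, root) == message     # lossless roundtrip
print(table, len(bits), "bits vs", 8 * len(message), "raw")
```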
Real-World Applications
- Every lossless compression algorithm (gzip, zstd, brotli, lz4) is bounded by Shannon's theorem: they can approach entropy but never beat it, and their effectiveness depends on how far the source's entropy is below the raw bit rate
- Huffman coding is used in JPEG image compression (for encoding quantized DCT coefficients), DEFLATE (the algorithm behind gzip and PNG), and MP3 audio compression
- Arithmetic coding, which gets closer to entropy than Huffman, is used in H.264/H.265 video codecs (CABAC) and JPEG 2000; modern compressors such as zstd use asymmetric numeral systems (ANS), a related technique with similar near-entropy performance
- Shannon's theorem explains why encrypted data and truly random data cannot be compressed: their entropy is maximal (1 bit per bit), so no compression is possible
- In communication systems, the source coding theorem determines the minimum bandwidth needed to transmit data from a source, informing the design of modems, satellite links, and streaming protocols
- The theorem connects to Kolmogorov complexity: Shannon entropy is the expected Kolmogorov complexity for strings from a probabilistic source, bridging the probabilistic and individual views of information
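The incompressibility of random data is easy to check empirically. A quick sketch using Python's standard `zlib` (a DEFLATE implementation, so Huffman-based): compressing random bytes gains nothing, while an equally long but highly structured input collapses to a few hundred bytes.

```python
import os
import zlib

random_bytes = os.urandom(100_000)     # maximal entropy: ~8 bits per byte
structured = b"ABABABAB" * 12_500      # same size, but very low entropy

print(len(zlib.compress(random_bytes)))  # roughly 100_000: essentially no gain
print(len(zlib.compress(structured)))    # a few hundred bytes at most
```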
Key Takeaways
- Shannon's source coding theorem establishes that entropy $H(X)$ is the fundamental lower bound on lossless compression: no code can achieve fewer than $H$ bits per symbol on average
- Entropy $H(X) = -\sum p_i \log_2 p_i$ measures the average information content per symbol; it is zero for deterministic sources and maximal ($\log_2 n$) for uniform distributions
- Huffman coding constructs an optimal prefix-free code with expected length between $H$ and $H + 1$ bits per symbol; it matches entropy exactly when probabilities are powers of 2
- Arithmetic coding and block coding can get arbitrarily close to entropy by encoding longer sequences, approaching the theoretical limit as block length increases
- The theorem explains why random and encrypted data are incompressible (entropy equals the raw bit rate) and why structured data compresses well (entropy is much lower)
- Shannon's 1948 paper founded information theory and established the mathematical framework that underlies all modern data compression, communication, and coding theory