Information Theory Basics: Entropy, Compression, and Shannon's Theorem
Shannon entropy, the source coding theorem, compression limits, Huffman coding, and why you can never compress truly random data.
Terminology
| Term | Definition |
|---|---|
| Information (Shannon) | The surprise of an event: $I(x) = -\log_2 P(x)$ bits; rare events carry more information than common ones |
| Entropy ($H$) | The expected information content of a random variable: $H(X) = -\sum p(x) \log_2 p(x)$; measures average uncertainty |
| Source coding theorem | Shannon's theorem stating that the average code length for a source cannot be less than its entropy $H$; entropy is the compression limit |
| Prefix-free code | A code where no codeword is a prefix of another, enabling unambiguous decoding; Huffman codes are prefix-free |
| Huffman coding | A greedy algorithm that builds an optimal prefix-free code by assigning shorter codewords to more frequent symbols |
| Mutual information | $I(X;Y) = H(X) - H(X|Y)$: how much knowing $Y$ reduces uncertainty about $X$ |
| Channel capacity | The maximum rate at which information can be reliably transmitted over a noisy channel, measured in bits per use |
| Kolmogorov complexity | The length of the shortest program that produces a given string; incompressible strings have Kolmogorov complexity close to their length |
| Lossless compression | Compression where the original data can be perfectly reconstructed; bounded below by entropy |
What & Why
Claude Shannon's 1948 paper "A Mathematical Theory of Communication" founded information theory and changed the world. It answers a fundamental question: what is the absolute limit of data compression?
The answer is entropy. If a source produces symbols with known probabilities, the entropy $H$ tells you the minimum average number of bits per symbol needed to encode the source losslessly. You cannot beat entropy, no matter how clever your compression algorithm is.
This has immediate practical consequences. When someone claims a compression algorithm that "compresses any file," they are wrong. A counting argument proves it: there are $2^n$ possible $n$-bit strings but only $2^n - 1$ strings shorter than $n$ bits, so no lossless scheme can shrink every input. Truly random data is incompressible because it has maximum entropy.
How It Works
Shannon Entropy
For a discrete random variable $X$ with possible values $\{x_1, x_2, \ldots, x_n\}$ and probabilities $\{p_1, p_2, \ldots, p_n\}$:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
Example: A fair coin has $H = -(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}) = 1$ bit. Maximum uncertainty.
Example: A biased coin with $P(\text{heads}) = 0.9$ has $H = -(0.9\log_2 0.9 + 0.1\log_2 0.1) \approx 0.47$ bits. Less uncertainty, more compressible.
Entropy is maximized when all outcomes are equally likely (maximum uncertainty) and minimized (zero) when one outcome is certain.
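The two coin examples can be verified in a few lines of Python (a minimal sketch; `entropy` is an illustrative name, not a library function):

```python
import math

def entropy(probabilities):
    """H(X) = -sum(p * log2(p)) in bits; zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits
```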
Shannon's Source Coding Theorem
For a source with entropy $H$:
- No lossless code can achieve an average code length of less than $H$ bits per symbol
- There exists a prefix-free code whose average length lies between $H$ and $H + 1$ bits per symbol
This means entropy is the fundamental compression limit. Huffman coding gets close to this limit, and arithmetic coding can get arbitrarily close.
Why Random Data Cannot Be Compressed
A string of $n$ truly random bits has entropy $H = n$ bits. The source coding theorem says you need at least $n$ bits to encode it. No compression is possible.
Counting argument: There are $2^n$ possible $n$-bit strings. If compression mapped every string to a shorter string, you would need $2^n$ distinct shorter strings, but there are only $\sum_{k=0}^{n-1} 2^k = 2^n - 1$ strings shorter than $n$ bits. By the pigeonhole principle, at least one string cannot be compressed.
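The limit is easy to observe empirically: a general-purpose compressor such as zlib cannot shrink random bytes, while highly redundant input collapses. A quick sketch (the input sizes are illustrative):

```python
import os
import zlib

random_data = os.urandom(100_000)   # maximum-entropy input
redundant_data = b"A" * 100_000     # near-zero-entropy input

random_compressed = zlib.compress(random_data, 9)
redundant_compressed = zlib.compress(redundant_data, 9)

print(len(random_compressed))      # slightly LARGER than 100000 (headers + block overhead)
print(len(redundant_compressed))   # a few hundred bytes at most
```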
Huffman Coding
Huffman coding builds an optimal prefix-free code by greedily merging the two least-frequent symbols:
Example: for four symbols with probabilities $0.4, 0.3, 0.2, 0.1$, the greedy merges yield codeword lengths $1, 2, 3, 3$. Average code length: $0.4 \times 1 + 0.3 \times 2 + 0.2 \times 3 + 0.1 \times 3 = 1.9$ bits/symbol.
Entropy: $H \approx 1.85$ bits/symbol, so Huffman comes within $0.05$ bits of the theoretical limit.
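This example can be reproduced by tracking leaf depths during the greedy merges, without building an explicit tree (a self-contained sketch):

```python
import heapq
import math

probs = [0.4, 0.3, 0.2, 0.1]

# Each heap entry is (subtree probability, indices of the leaves in that subtree).
heap = [(p, [i]) for i, p in enumerate(probs)]
heapq.heapify(heap)
depth = [0] * len(probs)             # codeword length of each leaf
while len(heap) > 1:
    p1, leaves1 = heapq.heappop(heap)
    p2, leaves2 = heapq.heappop(heap)
    for i in leaves1 + leaves2:      # every leaf in a merged subtree gains one bit
        depth[i] += 1
    heapq.heappush(heap, (p1 + p2, leaves1 + leaves2))

avg_length = sum(p * d for p, d in zip(probs, depth))
source_entropy = -sum(p * math.log2(p) for p in probs)
print(depth)                         # [1, 2, 3, 3]
print(round(avg_length, 2))          # 1.9 bits/symbol
print(round(source_entropy, 2))      # 1.85
```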
Complexity Analysis
| Algorithm | Compression Ratio | Time | Optimality |
|---|---|---|---|
| Huffman coding | $H \le L < H + 1$ | $O(n \log n)$ build, $O(n)$ encode | Optimal among prefix-free codes |
| Arithmetic coding | $\approx H + \epsilon$ | $O(n)$ | Approaches entropy for long sequences |
| LZ77/LZ78 (gzip family) | Approaches $H$ on long, stationary inputs | $O(n)$ | Asymptotically optimal; adapts to source statistics without knowing them in advance |
| No compression (identity) | $n$ bits for $n$-bit input | $O(n)$ | Baseline |
Key formula for the binary symmetric channel with crossover probability $p$: the capacity is $C = 1 - H_b(p)$, where $H_b(p) = -p \log_2 p - (1-p) \log_2(1-p)$ is the binary entropy function.
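A quick numeric sketch of the binary symmetric channel capacity $C = 1 - H_b(p)$ (`bsc_capacity` is an illustrative name):

```python
import math

def bsc_capacity(p):
    """C = 1 - Hb(p) for a binary symmetric channel with crossover probability p."""
    if p in (0.0, 1.0):              # Hb(0) = Hb(1) = 0: the channel is deterministic
        return 1.0
    hb = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return 1.0 - hb

print(bsc_capacity(0.0))             # 1.0: a noiseless channel carries 1 bit per use
print(bsc_capacity(0.5))             # 0.0: the output is independent of the input
print(round(bsc_capacity(0.11), 2))  # ~0.5: roughly half a bit per use survives
```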
Implementation
Runnable Python versions of the entropy and Huffman routines:

```python
import heapq
import math

def shannon_entropy(probabilities):
    """Entropy in bits: H = -sum(p * log2(p)); zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

class Node:
    def __init__(self, freq, symbol=None, left=None, right=None):
        self.freq = freq
        self.symbol = symbol        # set only on leaves
        self.left = left
        self.right = right

def build_huffman_tree(symbols, frequencies):
    """Greedily merge the two least-frequent subtrees until one tree remains."""
    # The tie-breaking counter keeps heapq from ever comparing Node objects.
    heap = [(f, i, Node(f, s)) for i, (s, f) in enumerate(zip(symbols, frequencies))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, Node(f1 + f2, left=left, right=right)))
        counter += 1
    return heap[0][2]

def build_code_table(node, prefix="", table=None):
    """Walk the tree; left edges append '0', right edges append '1'."""
    if table is None:
        table = {}
    if node.symbol is not None:               # leaf
        table[node.symbol] = prefix or "0"    # single-symbol alphabet edge case
    else:
        build_code_table(node.left, prefix + "0", table)
        build_code_table(node.right, prefix + "1", table)
    return table

def huffman_encode(root, text):
    """Concatenate the codeword for each symbol of the input."""
    table = build_code_table(root)
    return "".join(table[symbol] for symbol in text)
```
Real-World Applications
- File compression: gzip (LZ77 + Huffman), zstd (LZ77 + Huffman + FSE), and brotli all use entropy coding as their final stage; the entropy of the data determines the compression limit
- Image and video codecs: JPEG uses Huffman coding, H.264/H.265 use context-adaptive binary arithmetic coding (CABAC), and PNG uses DEFLATE (LZ77 + Huffman)
- Communication systems: Shannon's channel coding theorem tells engineers the maximum data rate for a given signal-to-noise ratio, guiding the design of 5G, Wi-Fi, and satellite links
- Machine learning: cross-entropy loss is directly derived from information theory; minimizing cross-entropy is equivalent to maximizing the likelihood of the model's predictions
- Cryptography: a good encryption scheme produces ciphertext with maximum entropy (indistinguishable from random), making it incompressible and unpredictable
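The cross-entropy connection can be made concrete: encoding data drawn from $p$ with a code built for a wrong model $q$ costs $H(p, q) = -\sum_x p(x) \log_2 q(x) \ge H(p)$ bits per symbol. A small sketch (the function name is illustrative):

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) = -sum(p * log2(q)): average bits when data ~ p but the code assumes q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
print(cross_entropy_bits(p, p))           # 1.0: equals H(p) when the model is right
print(cross_entropy_bits(p, [0.9, 0.1]))  # ~1.74: a wrong model wastes bits
```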
Key Takeaways
- Shannon entropy $H(X) = -\sum p(x) \log_2 p(x)$ measures the average information content and sets the fundamental limit for lossless compression
- The source coding theorem proves you cannot compress below entropy; Huffman and arithmetic coding approach this limit
- Truly random data has maximum entropy and cannot be compressed; the pigeonhole principle makes this a mathematical certainty
- Huffman coding is optimal among prefix-free codes; arithmetic coding can get arbitrarily close to entropy for long sequences
- Information theory underpins compression, communication, machine learning (cross-entropy loss), and cryptography